November 1st, 2011

Use PowerShell to Remove Duplicate Lines from a CSV File

Doctor Scripto
Scripter

Summary: Learn how to use Windows PowerShell to easily remove duplicates from a CSV file.

 

Hey, Scripting Guy! QuestionHey, Scripting Guy! I have a comma-separated value (CSV) file that comes from a Microsoft Excel spreadsheet. We use this CSV file to create new users automatically in Active Directory. The problem is that every once in a while, duplicates end up in the CSV file. As you can well imagine, this creates havoc and the import fails. I began to think about how I could ensure that the data in the CSV file is unique. I was thinking of the old-fashioned dictionary object trick I used back in the VBScript days, and I tried some things using the Windows PowerShell hashtable. The problem is it quickly got rather complicated. Next, I decided to try a similar thing using an array, and once again, things went pear-shaped. I looked on the Scripting Guys Script Repository and found a pretty cool VBScript that uses a recordset and ADO. I am guessing I could modify it, but I am hoping for something easier.

—JB

 

Hey, Scripting Guy! AnswerHello JB,

Microsoft Scripting Guy Ed Wilson here. Some things are simple, such as today’s date: 11-1-11. (That’s how we present it in the United States anyway.) In fact, this year has been one of simple dates. It began with 1-1-11, and in 10 days, it will be 11-11-11. Other things appear complicated, and then a sudden revelation shows simplicity and elegance of design.

One of the hardest things for people who formerly used VBScript as they move to Windows PowerShell is to quit thinking in terms of what they would have done in VBScript. The overriding principle in Windows PowerShell is that everything is an object. “Oh yeah,” you might say, “I know that.” But until you really know that, and until you use that as fundamental in your approach, you will not actually get it. After you really know that everything is an object, you begin to look for solutions in a new and different way—one that was not possible in VBScript or some other scripting language.

A perfect case in point, JB, is your problem with needing to remove duplicates from a CSV file. First, if I am going to work with a CSV file, I need to import it. I then need to see which properties are available. To do this, I use the Import-CSV cmdlet and the Get-Member cmdlet. In the output that follows, I see four noteproperties that correspond to the column headers from the CSV file.

PS C:\> Import-Csv C:\fso\UsersConsolidated.csv | Get-Member

 

 

   TypeName: System.Management.Automation.PSCustomObject

 

Name                           MemberType                                         Definition

Equals                           Method                                                 bool Equals(System.Object obj)

GetHashCode                Method                                                 int GetHashCode()

GetType                        Method                                                 type GetType()

ToString                        Method                                                 string ToString()

Fname                           NoteProperty                                        System.String Fname=Ed

Group                           NoteProperty                                        System.String Group=Engineering

Lname                           NoteProperty                                        System.String Lname=Banti

OU                               NoteProperty                                        System.String OU=Atlanta

 

I now need to decide which duplicates I want to remove. Obviously, I do not want to remove duplicate group names or OU names. I only want to remove duplicate first name and last name combinations. It is okay to have a duplicate first name, and it is okay to have a duplicate last name—just not a duplicate first name/last name combination. I therefore need to sort the output by lname (last name) and fname(first name) columns. To do this, I use the Sort-Object cmdlet. This command is shown here (sort is actually an alias for Sort-Object):

Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname

The command to sort an imported CSV file and the associated output is shown in the following figure.

Image of command and associated output

After I have the contents of the CSV file sorted, I use the unique switch to return only unique rows from the file:

Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname –Unique

The command and associated output are shown in the following figure.

Image of command and associated output

JB, that is all there is to using the Sort-Object cmdlet and the Import-CSV cmdlet to remove duplicates from a CSV file. Join me tomorrow when we continue to explore the cool things that Windows PowerShell offers.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace

Ed Wilson, Microsoft Scripting Guy

 

 

Author

The "Scripting Guys" is a historical title passed from scripter to scripter. The current revision has morphed into our good friend Doctor Scripto who has been with us since the very beginning.

1 comment

Discussion is closed. Login to edit/delete existing comments.

Newest
Newest
Popular
Oldest
  • Gutha sairam

    Hey.scripting guy..! In the above blog you removed a duplicate name adam..but those 2 adam's belongs to two different groups.Now in my case i need to check a duplicate row which has same values in all the coloums that means in the above output if the group names are diffferent it should not show as a duplicate.and also i dont want to delete the duplicate's..i want to export those duplicate rows to a new excell...

    Read more

Feedback