Use PowerShell to Remove Duplicate Lines from a CSV File
Summary: Learn how to use Windows PowerShell to easily remove duplicates from a CSV file.
Hey, Scripting Guy! I have a comma-separated value (CSV) file that comes from a Microsoft Excel spreadsheet. We use this CSV file to create new users automatically in Active Directory. The problem is that every once in a while, duplicates end up in the CSV file. As you can well imagine, this creates havoc and the import fails. I began to think about how I could ensure that the data in the CSV file is unique. I was thinking of the old-fashioned dictionary object trick I used back in the VBScript days, and I tried some things using the Windows PowerShell hashtable. The problem is it quickly got rather complicated. Next, I decided to try a similar thing using an array, and once again, things went pear-shaped. I looked on the Scripting Guys Script Repository and found a pretty cool VBScript that uses a recordset and ADO. I am guessing I could modify it, but I am hoping for something easier.
Microsoft Scripting Guy Ed Wilson here. Some things are simple, such as today’s date: 11-1-11. (That’s how we present it in the United States anyway.) In fact, this year has been one of simple dates. It began with 1-1-11, and in 10 days, it will be 11-11-11. Other things appear complicated, and then a sudden revelation shows simplicity and elegance of design.
One of the hardest things for people who formerly used VBScript as they move to Windows PowerShell is to quit thinking in terms of what they would have done in VBScript. The overriding principle in Windows PowerShell is that everything is an object. “Oh yeah,” you might say, “I know that.” But until you really know that, and until you use that as fundamental in your approach, you will not actually get it. After you really know that everything is an object, you begin to look for solutions in a new and different way—one that was not possible in VBScript or some other scripting language.
A perfect case in point, JB, is your problem with needing to remove duplicates from a CSV file. First, if I am going to work with a CSV file, I need to import it. I then need to see which properties are available. To do this, I use the Import-CSV cmdlet and the Get-Member cmdlet. In the output that follows, I see four noteproperties that correspond to the column headers from the CSV file.
PS C:\> Import-Csv C:\fso\UsersConsolidated.csv | Get-Member
Name MemberType Definition
Equals Method bool Equals(System.Object obj)
GetHashCode Method int GetHashCode()
GetType Method type GetType()
ToString Method string ToString()
Fname NoteProperty System.String Fname=Ed
Group NoteProperty System.String Group=Engineering
Lname NoteProperty System.String Lname=Banti
OU NoteProperty System.String OU=Atlanta
I now need to decide which duplicates I want to remove. Obviously, I do not want to remove duplicate group names or OU names. I only want to remove duplicate first name and last name combinations. It is okay to have a duplicate first name, and it is okay to have a duplicate last name—just not a duplicate first name/last name combination. I therefore need to sort the output by lname (last name) and fname(first name) columns. To do this, I use the Sort-Object cmdlet. This command is shown here (sort is actually an alias for Sort-Object):
Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname
The command to sort an imported CSV file and the associated output is shown in the following figure.
After I have the contents of the CSV file sorted, I use the unique switch to return only unique rows from the file:
Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname –Unique
The command and associated output are shown in the following figure.
JB, that is all there is to using the Sort-Object cmdlet and the Import-CSV cmdlet to remove duplicates from a CSV file. Join me tomorrow when we continue to explore the cool things that Windows PowerShell offers.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at firstname.lastname@example.org, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace
Ed Wilson, Microsoft Scripting Guy
Hey.scripting guy..! In the above blog you removed a duplicate name adam..but those 2 adam’s belongs to two different groups.Now in my case i need to check a duplicate row which has same values in all the coloums that means in the above output if the group names are diffferent it should not show as a duplicate.and also i dont want to delete the duplicate’s..i want to export those duplicate rows to a new excell sheet..can you help me to write a powershell script for this requirement..Thanks in advance