July 17th, 2010

Hey, Scripting Guy! Weekend Scripter: Uniquifying and Validating Link Lists with Windows PowerShell

 

Microsoft Scripting Guy Ed Wilson here. It is a beautiful Saturday morning here in Charlotte, North Carolina, in the United States. The sky is dark blue and there are few if any clouds visible. It almost makes one dizzy while standing in the front yard leaning back and looking upward because there is nothing to give any frame of reference for depth perception. I am reminded of a wall dive I did in Maui, Hawaii. I snapped the following photo on that dive.

Photo Ed took while diving off Maui, Hawaii

I have a CSV file that contains a number of hyperlinks in it. Unfortunately, there are many duplicates in the file. To make matters worse, some of the links are not really valid hyperlinks. I thought about opening the CSV file in Microsoft Excel, using the sort option, and then manually cleaning up the file, but Microsoft Excel has developed a bad habit of adding extra quotation marks to my CSV files. In addition, it seems that CSV files opened with Microsoft Excel pick up extra commas along the way. Both of these annoyances have caused me to quit using Microsoft Excel to work with CSV files. The CSV file is shown in Notepad in the following image.

Image of CSV file in Notepad

Therefore, as I was mulling my dilemma over a bowl of shredded cardboard (I believe the box would have tasted better than the alleged cereal), I initially hit upon the idea of creating a hash table. After the hash table was created, I would then add the hyperlinks as keys in the hash table, handle the duplicate key errors, errors, and pass the newly created hash table to the Export-CSV Windows PowerShell cmdlet. Though that approach would accomplish what I need to do, it just seems like too much work for a Saturday morning.

When I went upstairs to my office, I opened Windows PowerShell and said to myself, “I wish I could use the Sort-Object cmdlet with the –unique parameter to provide me with a unique list of hyperlinks. While I am at it, it would be really cool if I could pass the sorted contents to Export-CSV.” As a matter of fact I can, and here is the command I used (I typed this on a single line in my Windows PowerShell console; the line is broken up here for readability):

 

Import-Csv .\HSGlinks_RAW.csv | Sort-Object -Property text -Unique | 

Export-Csv .\sortedlinks.csv

 

After I had a sorted unique list of hyperlinks, I next decided to filter out invalid hyperlinks. I had noticed my file contained a number of Universal Naming Convention (UNC) paths, and I am looking for only Internet addresses in this list. This time, I imported the CSV file, piped the results to the Where-Object cmdlet and used a very simple regular expression to look for the presence of “http” at the beginning of the string. If that match is found, the results are piped to the Export-CSV cmdlet. Here is the code that I used:

 

Import-csv .\sortedlinks.csv | where-object { $_.url -match “^http”} | 

Export-Csv .\sortedlinksCleaned.csv

 

Well, that is it—short and sweet. Now, I am going to head out to my woodworking shop. The Scripting Wife has been making fun of me because I have five projects in progress at the same time. It is time to finish one or two of them. I wonder if this would work:

 

Get-WoodworkingProject | where-object { $_.status -match “incomplete”} |

Complete-WoodworkingProject

 

You never know. The CSV project was easy to complete. I just need to find a woodworking provider.

If you want to know exactly what we will be looking at tomorrow, follow us on Twitter or Facebook. If you have any questions, send e-mail to us at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum.. See you tomorrow. Until then, peace.

 

Ed Wilson and Craig Liebendorfer, Scripting Guys

Author

0 comments

Discussion are closed.