Summary: Use Windows PowerShell to fix address issues prior to exporting the data to another database.
Microsoft Scripting Guy, Ed Wilson, is here. This morning I am listening to The Doors on my Zune HD, and sipping a very nice cup of tea. I used three spoons of Darjeeling tea, one spoon of lemon grass, one spoon of marshmallow root, one spoon of licorice root, and one cinnamon stick. I steeped it in my pot for four minutes. The water is 208 degrees (and of course, I preheated the pot by filling the pot half full of hot water and letting it sit for a minute).
The tea is naturally sweet, with the cinnamon, licorice root, and lemon grass adding just enough sweetness that the tea does not really need any additional sugar or honey. The marshmallow root complements the Darjeeling tea, and adds a bit of earthiness that rounds out the flavor. It makes for a really mellow tea that seems to go well with sixties hippy music.
Of course, I am also reading the email sent to scripter@microsoft.com. It is cool outside this morning in Charlotte—67 degrees Fahrenheit (19 degrees Celsius), and it is the kind of day that offers unlimited potential. It is even the kind of day where I can use Windows PowerShell to fix data before I import it to SQL Server.
If a database entry form is not designed properly, and if it does not have data input validation, it is very likely that the data will be inconsistent. This may be OK if you never intend to do anything with the data. But the moment you try to do something, such as read the database or print mailing labels (which can be hard enough with properly formatted data), things can really fall off the rails.
Suppose the data looks like that in the following spreadsheet:
At first glance, I see that street is either spelled out completely or it is abbreviated as st, ST, or St. This is not too bad, but it can certainly mess up sorting. I also see that the state is either spelled out or abbreviated as all caps, all lower case, mixed case, and with a period or not. Hmm, this can cause problems also.
I also see that in some cases there is a comma between the state and the postal (zip) code, and in other places, not. Splitting the field with a comma will really cause problems here because in some places, the postal code will be its own field, and in other cases, it will be part of the state field. MAJOR problems. In addition, on at least one occasion, the city is in its own field, and the state and postal code are combined.
The first thing I need to do is to import the CSV file, and store the results in a variable. I do this by using the Import-CSV cmdlet. I use the same variable name I have used in other scripts this week. Here is the command:
$datain = import-csv C:\DataIn\DataDump.csv
Now I need to walk through the collection of records, and work on each field that is available. To do this, I use the Split method from the string class. This is because my address field contains a single string in the property. This is shown here:
Foreach ($d in $datain)
{
$a = $d.address.split(',')
Now that I have split the string contained in the address property of the object, I need to look at the different elements in the array that is created by Split. The first element is the street element. I need to fix it so if the words Street or street appear, I replace it with a capital ST. This is shown here:
$str = $a[0] -replace '(Street|street)','ST'
The second element is the city field. I noticed, when looking through the data, that the city, Dewpoint, was at times abbreviated as dwpt. So I simply replace that with the correct spelling of the name. This command is shown here:
$city = If($a[1] -match 'dwpt') {"Dewpoint"} ELSE {$a[1]}
It seems from looking at my custom data, that only one person is associated with a spelled-out state name, and the state spelled out is South Carolina. So I replace it with the proper abbreviation, as shown here:
$state = if($a[2] -match 'South Carolina') {"SC"} ELSE {$a[2]}
The next field is where some problems seem to appear. Therefore, I need to look at how long the postal code field is. If it is too long, I simply generate an error message. If I can identify a common type of problem in this field, I can also write code to detect and repair this field. The first step is to call out the problem and look at it:
$zip = If($a[3].trim().Length -gt 5) {"Error for $a"} ELSE {$a[3]}
Now I need to output a custom object. Here is the command I use:
[PSCustomObject]@{
Street = $str.ToUpper()
City = $city.ToUpper()
State = $state.Trim().Substring(0,2).ToUpper()
Zip = $zip
}
That is it. The complete FixAddress.PS1 script is shown here:
$datain = import-csv C:\DataIn\DataDump.csv
Foreach ($d in $datain)
{
$a = $d.address.split(',')
$str = $a[0] -replace '(Street|street)','ST'
$city = If($a[1] -match 'dwpt') {"Dewpoint"} ELSE {$a[1]}
$state = if($a[2] -match 'South Carolina') {"SC"} ELSE {$a[2]}
$zip = If($a[3].trim().Length -gt 5) {"Error for $a"} ELSE {$a[3]}
[PSCustomObject]@{
Street = $str.ToUpper()
City = $city.ToUpper()
State = $state.Trim().Substring(0,2).ToUpper()
Zip = $zip
}
}
When I run the completed script, the following output appears. Notice that my script picked up two records that need special attention. I was not aware of the second one.
That is all there is to using Windows PowerShell to fix the address field from a data import. Data Manipulation Week will continue tomorrow when I will talk about outputting custom objects.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy
0 comments