April 14th, 2009

Hey, Scripting Guy! How Can I Convert a Tab-Delimited File to a Comma-Separated Value File?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have this tab-delimited file that was produced by a report from an old database application. All it knows how to do is make a tab-delimited file. I hate this format, and our other applications do too. I have been opening the thing in Notepad and trying to find and replace tabs with commas, but Notepad does not seem to do a good job cleaning up tab-delimited files either. I am wondering if you can do something with Windows PowerShell?

– OG

SpacerHey, Scripting Guy! Answer

Hi OG,

It is cloudy, gray, and overcast outside. Where we grew up, when the the clock said it was noon, we had every right to expect sun. I miss dependable sun. Perhaps I could bring a few rays of figurative sunshine to your day, OG. You need to be able to convert a tab-delimited file to a comma-separated value file. To do that, we will need to use regular expressions.

This week we are focusing on regular expressions. There are some VBScript examples in the Script Center. Here is a good introduction from the 2008 Winter Scripting Games (by the way, in the 2009 Summer Scripting Games, I can pretty much guarantee you will need to be able to do something with regular expressions for one of the events). The Regex .NET Framework class from the System.Text.RegularExpressions namespace is documented on MSDN. This is one of the main classes we use in Windows PowerShell when working with regular expressions. You also will find some information about regular expressions in the Microsoft Press book, Windows PowerShell Scripting Guide. Here is a very good article about regular expression use in VBScript. In this week’s articles, we are using Windows PowerShell for our samples. Please refer to the Windows PowerShell Scripting Hub for more information about this exciting new technology.

You can also review a VBScript version of this script. Actually, OG, using Windows PowerShell I can solve your problem with a single line of code:

Get-Content –path C:\fso\tabDelimited.txt | ForEach-Object {$_ -replace “\s”,”,” } |  
Out-File -filepath C:\fso\tabDelimited.csv

As seen in the following image, each column is separated by a tab value, which results in an even amount of white space between each column of data:

Image of the columns separated by an even amount of white space

 

To read from a text file using Windows PowerShell, we use the Get-Content cmdlet and supply the path to the file. When we do this, the result is a stream of text that goes wherever we have directed it. If we simply read the file, the output will be the console as seen here:

PS C:\> Get-Content -Path C:\fso\TabDelimited.txt
head1   head2   head3   head4
data1   data21  data31  data41
data2   data22  data32  data42
data3   data23  data33  data43
data4   data24  data34  data44
data5   data25  data35  data45
data6   data26  data36  data46
data7   data27  data37  data47
data8   data28  data38  data48
data9   data29  data39  data49

Many times, displaying text output on the screen is all that interests us. At other times, however, we need to parse the information or use it in other ways. There are two ways we can further refine data we receive from Get-Content. The first way to do this is to read the content of the file, and then store the results in a variable. This is the way things were done in VBScript and is illustratedhere:

PS C:\> $tabFileContents = Get-Content -Path C:\fso\TabDelimited.txt
PS C:\> $tabFileContents
head1   head2   head3   head4
data1   data21  data31  data41
data2   data22  data32  data42
data3   data23  data33  data43
data4   data24  data34  data44
data5   data25  data35  data45
data6   data26  data36  data46
data7   data27  data37  data47
data8   data28  data38  data48
data9   data29  data39  data49

The variable $tabFileContents contains an array of text items. We can now use standard array techniques to work with the text. We can, for example, index directly into the array and retrieve a specific line of text. This is helpful if we are interested in obtaining information that is always stored on a specific line within a text file. This is seen here:

PS C:\> $tabFileContents[0]
head1   head2   head3   head4

We can also use the ForEach statement and walk through the items in the $tabFileContents variable, but I generally do not do things like that unless I am working in a script. I think the code gets kind of jumbled and is hard to read. Here is what it would look like:

PS C:\> $tabFileContents = Get-Content -Path C:\fso\TabDelimited.txt
PS C:\> ForEach ($line in $tabFileContents) { $line }
head1   head2   head3   head4
data1   data21  data31  data41
data2   data22  data32  data42
data3   data23  data33  data43
data4   data24  data34  data44
data5   data25  data35  data45
data6   data26  data36  data46
data7   data27  data37  data47
data8   data28  data38  data48
data9   data29  data39  data49

This could have been written on a single line by using the semicolon to separate the two logical lines of code, but it would hurt readability. One of the mistakes people make when moving from VBScript to Windows PowerShell is not taking advantage of the pipeline. With small files such as the one we are working on here, there is no performance impact. However, with larger files there can be a significant performance impact both in terms of memory utilization and speed of the operation. This is because reading a text file and storing the content into a variable is a linear process that must be completed before we can begin to work on the data. With a pipeline, the data from the file begins to stream over the pipeline nearly immediately. This causes the memory consumption to be less, but it also results in faster processing time, because the reading and processing are taking place at the same time.

After we have obtained the content of the tab-delimited file, we pipeline the resulting text to the ForEach-Object cmdlet. The ForEach-Object cmdlet is how we are able to work with individual lines of text as they come through the pipeline. We use the replace operator to replace items that match the regular expression pattern with other text. This can be as simple as substituting one literal string for another one as seen here:

PS C:\> “this is an string” -replace “an”,”a”
this is a string

For an introduction to using regular expression patterns, refer to yesterday’s “Hey, Scripting Guy!” article. The pattern we are using today is simple: an “\s” pattern to look for white space in the text file. All white space, regardless of the number of spaces, is identified by our pattern. When white space is found, it is changed to a comma. This line of code is seen here:

ForEach-Object {$_ -replace “\s”,”,” }

After the tab values have been replaced with commas, the file is saved as a .csv file. To do this, we pipeline the results to the Out-File cmdlet and specify the file path for the new .csv file. This is seen here:

Out-File -filepath C:\fso\tabDelimited.csv

After the new file is created, you can open it in Notepad or Microsoft Excel, or use it as an import source for your database program. The .csv file is seen here:

Image of the .csv file that is created

 

Well OG, we could have written a script. If we did, it would have looked like ConvertTabFileToCSV.ps1. As you will no doubt notice, we added a variable to hold the path, and used replace to change the file name from “.txt” to “.csv”. We then substituted variables in the body of the code. Other than that, the script and our one-liner command are the same. Here’s the script:

body of the code. Other than that, the script and our one-liner command are the same. Here’s the script:

ConvertTabFileToCSV.ps1

$path = “C:\fso\tabDelimited.txt”
$outPath = $path -replace “.txt”,”.csv”
Get-Content -path $path | 
ForEach-Object {$_ -replace “\s”,”,” } |  
Out-File -filepath $outPath

OG, I hope you have enjoyed our discussion about replacing text. In the end, we did write a script, didn’t we? But we did not have to write the script to accomplish our goals. This is one of the cool things about Windows PowerShell. I believe a famous person once said something like this: “To script or not to script? That is the question.” In the end we could not make up our mind, and we did both. Hope you will join us tomorrow as Regular Expressions Week continues. Until then, peace.

 

Ed Wilson and Craig Liebendorfer, Scripting Guys

Author

0 comments

Discussion are closed.

Feedback