June 17th, 2008

Hey, Scripting Guy! How Can I Remove Specified Commas from a Comma-Separated Values File?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a comma-separated values file in which several fields have a comma embedded in the field value; that’s causing problems when I try and import the information into another application. I was wondering if there is a way to remove all the commas that are enclosed inside double quote marks, leaving all the other commas in the file alone. Thanks for your help.
— RF

SpacerHey, Scripting Guy! AnswerScript Center

Hey, RF. Well, the Scripting Son is now officially a high school graduate; graduation for Juanita High School was held last Friday (Friday the 13th, for those of you who believe in signs and omens). Did the Scripting Guy who writes this column feel a sense of pride on Friday as he watched his son graduate? Did the Scripting Guy who writes this column feel a little wistful on Friday, knowing that his son was no longer a little boy? Maybe it was a little of both: maybe he felt both proud and wistful?

Actually, for the most part the Scripting Guy who writes this column felt bored on Friday. If you haven’t been to a high school graduation, the ceremonies last for about 2 hours, and consist primarily of people you don’t know taking turns monopolizing the microphone and going on and on and on about something no one else even cares about.

Heck, if that’s how the Scripting Guy who writes this column wanted to spend 2 hours he’d have just stayed at work and attended a typical Microsoft meeting.

At any rate, the Scripting Guy who writes this column is proud of his son; high school is hardly the most exciting thing he’ll ever get to do, but he not only persevered but he did quite well at it, too. In fact, the Scripting Guy who writes this column was so proud that he said, “Son, I’d like to reward you for all the hard work you’ve put in over the past 3 years. Name the one thing you’d like more than anything else in the world, and I’ll get it for you.”

“You know what, Dad?” said the Scripting Son. “The one thing I’d like more than anything else in the world is a script that can remove all the commas that are enclosed inside double quote marks, leaving all the other commas in the file alone.”

“Let me make sure I understand you completely, son,” said the Scripting Guy who writes this column. “You’re saying that you have a text file with lines similar to this, right?”

"Toshiba Laptop, AC Adapter, 50CT, 70CT","1,A","$1.82",0,"$0.00",0,"$0.00"
"Hewlett-Packard Laptop, AC Adapter, 50CT, 70CT","2,B","$1.82",0,"$0.00",0,"$0.00"

“That’s right.”

“And you’d like to remove any commas found in between double quote marks; for example, you’d like to remove the commas after Laptop, Adapter, and 50CT in the first line. Is that correct?”

“That’s correct.”

Happy graduation, son:

Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

Do Until objFile.AtEndOfStream
    strLine = objFile.ReadLine
    intLength = Len(strLine)

    blnStart = False

    For i = 1 to intLength
        strCharacter = Mid(strLine, i, 1)
        If strCharacter = Chr(34) Then
            If blnStart = True Then
                blnStart = False
            Else
                blnStart = True
            End If
        End If

        If strCharacter = "," Then
            If blnStart = "False" Then
                strNewCharacters = strNewCharacters & strCharacter
            End If
        Else
            strNewCharacters = strNewCharacters & strCharacter
        End If
    Next

    strNewContents = strNewContents & strNewCharacters & vbCrLf
    strNewCharacters = ""
Loop
       
objFile.Close

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting)
objFile.Write strNewContents
objFile.Close

As you can probably tell by looking at this code, removing certain commas from a text file (while leaving other commas in place) isn’t the easiest thing in the world; that’s because a seemingly-simple command like “Remove any commas that are found embedded between double quote marks” is nowhere near as simple as it might sound. Why not? Well, consider this simple example:

"A,B",C,"D,E"

So which commas do we need to remove from the string? That’s easy (for us): we just need to remove the comma between A and B and the comma between D and E. After all, those are the only two commas that are embedded within double quote marks, right? Right.

Well, except for these two, of course:

",C,"

Yes, we know that these two commas aren’t supposed to count as embedded commas. Sure, technically they do come between two double quote marks, but …. Like we said, this is something human beings can deal with quite easily; the problem we have is getting the computer to understand that these two commas don’t count as embedded commas. That’s much tougher.

So how did we tackle this issue? Well, our first thought was to use a regular expression of some kind. We soon changed our mind about that, however: although this problem can be solved using regular expressions, well, the resulting regular expression is a bit complicated, to say the least. The truth is, in cases like this, regular expressions also have difficulty determining which commas should be deleted and which ones shouldn’t be deleted.

That’s why we took a completely different approach: rather than searching and replacing, we’re methodically looking at each character in the text file and making decisions on a comma-by-comma basis. What does that mean, and is it really going to work? Hey, we can’t tell you that, at least not right now; after all, that’s what the rest of today’s column is for.

As you can see, our script starts off simply enough: all we do in the first two lines is define a pair of constants (ForReading and ForWriting) that we’ll need in order to open our text file. After defining the constants we create an instance of the Scripting.FileSystemObject, then use the following command to open the file C:\Scripts\Test.txt for reading:

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

So what do we do with this file once it’s open? Well, for starters, we set up a Do Until loop that continues reading through the file until there’s nothing left to read. (Or, if you’re a stickler for technical accuracy, until the file’s AtEndOfStream property is True.) Inside this loop we use the ReadLine method to read the first line in the file and store that value in a variable named strLine. That means that, the first time through the loop, strLine will be equal to this:

"Toshiba Laptop, AC Adapter, 50CT, 70CT","1,A","$1.82",0,"$0.00",0,"$0.00"

Our next step is to use the Len function to determine the number of characters in strLine:

intLength = Len(strLine)

Note. Why do we need to know the number of characters in strLine? Well, as you might recall, we said that we were going to look at each individual character in each line in the text file; in order to do that we need to know how many characters are in each line.

After executing the Len function we set the value of a variable named blnStart to False (we’ll have more on blnStart – and what it’s for – in just a minute). That brings us to this line of code:

For i = 1 to intLength

All we’re doing here is setting up a For Next loop that starts at 1 and continues looping until we’ve looked at each character in strLine. Inside this loop we use the Mid function to grab the very first character in strLine and store it in a variable named strCharacter:

strCharacter = Mid(strLine, i, 1)

Believe it or not we’re actually beginning to make some progress here. Once we’ve grabbed the first character we can check to see if this character happens to be a double quote mark (that is, does it have an ASCII value of 34). That’s what this line of code is for:

If strCharacter = Chr(34) Then

If it turns out that we don’t have a double quote mark then we’re simply going to ignore this character for now. If we do have a double quote mark then we’re going to execute this block of code:

If strCharacter = Chr(34) Then
    If blnStart = True Then
        blnStart = False
    Else
        blnStart = True
    End If
End If

What’s going on here? Oh, darn; we were going to ask you that! Well, as you might recall, a little while ago we created a variable named blnStart and assigned it the value False. We’re going to use this variable to tell us whether or not we’re inside a set of double quote marks. If blnStart is False that means we are not inside a set of double quote marks; that also means that we don’t want to delete any commas that we might find. If blnStart is True that means that we are inside a set of double quote marks; in turn, that means that any commas we find should be deleted.

Make sense? Oh. OK, then maybe we should explain it a little better. Here’s a little more detail on what is happening, and why.

If the first character in the first line of our text file is a double quote mark (which it is), we “flip” the value of blnStart from False to True. We then execute this block of block of code:

If strCharacter = "," Then
    If blnStart = "False" Then
        strNewCharacters = strNewCharacters & strCharacter
    End If
Else
    strNewCharacters = strNewCharacters & strCharacter
End If

What we’re doing here is checking to see if the current character is a comma. If it’s not a comma then we simply append this value to a variable named strNewCharacters, a variable we’ll use to build a “new” text file (that is, a text file in which the embedded commas have been deleted):

strNewCharacters = strNewCharacters & strCharacter

The first time through the loop strCharacter will not be a comma; therefore, the variable strNewCharacters will end up being assigned the following value:

"

And then it’s back to the top of the loop, where we repeat the process with the next character in strLine.

After we’ve run through the loop a few times strLine will be equal to this:

"Toshiba Laptop

And then – oh, dear: the next character in the line is a comma! Should we be panicking about now?

Yes, we should. But not because the next character in the line is a comma; our script can handle that just fine. We just figured that, the state of the world being what it is, any time seems like a good time to panic.

Our script, however, will remain perfectly calm and in control. If strCharacter happens to be a comma, we check to see if blnStart is equal to False. If it is, that means we’re not inside a set of double quote marks; that also means that we want to keep this comma. (Why? Because it’s a separator in our comma-separated values file.) If blnStart is False we append this comma to strNewCharacters; if blnStart is True then we discard this comma altogether. Why do we throw it out? You got it: because we’re getting rid of any commas that are enclosed in double quote marks.

Eventually, strNewCharacters will be equal to this (note that all the embedded commas have been deleted):

"Toshiba Laptop AC Adapter 50CT 70CT

At this point, the next character in the text file is a double quote mark. Because this is the closing quote mark in our set of double quotes we end up flipping blnStart back to False; that’s because we’re no longer in a set of double quotes. (The first quote mark turns the set on; the second quote mark turns the set off.) That also means we will not discard the comma that immediately follows this second double quote mark. Why not? That’s right: because this comma is not embedded in a set of double quotes. As it turns out, distinguishing between commas that should be deleted and commas that shouldn’t be deleted is no more difficult than that.

After we’ve run through all the characters in line 1 we then add the modified value of strNewCharacters to a variable named strNewContents. We reset the value of strNewCharacters to an empty string (“”), then pop back to the top of our Do Until loop and repeat the entire process with the next line in the text file.

And then once we’ve read all the lines in the text file we close the file, reopen it for writing, then use the Write method to replace the existing contents with the value of strNewContents:

objFile.Write strNewContents

All that’s left now is to close the file a final time, and watch the script glide gently to a stop.

And what will Test.txt look like now? It should look like this, with nary an embedded comma to be found:

"Toshiba Laptop AC Adapter 50CT 70CT","1A","$1.82",0,"$0.00",0,"$0.00"
"Hewlett-Packard Laptop AC Adapter 50CT 70CT","2B","$1.82",0,"$0.00",0,"$0.00"

Success!

That should do it, RF; let us know if you run into any problems with this. As for those of you still curious about the high school graduation ceremonies, we might point out that, over the course of the evening, literally hundreds of people were not only thanked for their efforts, but received awards recognizing their achievements. Needless to say, the Scripting Guy who writes this column was neither thanked nor rewarded.

Hmmm; maybe he did stay at work after all!

Author

0 comments

Discussion are closed.