Hey, Scripting Guy! How Can I Remove Delimiters, and Fix Apostrophes, in a CSV File?
Hey, Scripting Guy! I have a comma-separated values file (CSV) that uses double quote marks as the field delimiter; unfortunately, this file also uses a double quote mark any time it should use an apostrophe (like in the name O’Malleys). Somehow I need to remove all the delimiters and convert the appropriate double quote marks to an apostrophe. How can I do that?
Hey, MP. Well, the Scripting Guys can finally breathe a sigh of relief: things are back to normal today. Last Thursday afternoon something very strange and totally unexpected occurred: the clouds parted, and the sun actually came out. Even more bizarre, the temperature actually went up rather than down. (Which was good; after all, the Scripting Guy who writes this column had pretty much decided he’d been ripped off and been sold a one-way thermometer instead of one in which the temperature could go up as well as down.)
Thursday afternoon was nice: the temperature actually crept above 70 degrees Fahrenheit. Friday and Saturday were even better: temperatures were in the 80s and, depending on your exact location, even topped 90 degrees. Sunday was a little bit cooler, but still nice, at least for the better part of the day. By Sunday night, however, the rain had returned and, as of Monday afternoon (when this column was written) the sky was once again gloomy and overcast, and the temperature a rousing 58 degrees. And not a moment too soon. Temperatures in the 80s? No human being could possibly withstand prolonged exposure to warm, sunny weather.
Actually the Scripting Guys were perfectly content with the warm sunshine. That wasn’t necessarily true of our fellow Seattleites, however. When the Scripting Guy who writes this column drove home Friday afternoon he didn’t pass a single car with the windows open; instead, everyone had the windows tightly rolled-up and the air conditioner blasting away. (And yes, people in Seattle do buy air conditioners. And we turn them on any time the temperature gets above 50 degrees.)
In other words, two or three times a year.
In addition to that, a few minutes ago this same Scripting Guy heard two of his colleagues standing in the hallway discussing the weather. “It was just way too hot on Saturday,” said the one Microsoft employee. “You’re telling me,” said the other. “I hope that’s it for hot weather this year.”
So do we. Because it’s always best to get summer over with by May 20th, if at all possible.
At any rate, today the weather is back to Seattle gray, and everyone is breathing a sigh of relief. Well, everyone except the Scripting Guys, that is. The Scripting Guys both spent most of the past weekend outdoors; in fact, the Scripting Guy who writes this column even managed to get a little color into his skin. (OK, true: he already had a little color in his skin, assuming you count Pasty Pale White as a color. The Scripting Editor never gets ay color into her skin, because she uses sun block. Which does seem a bit redundant when you live in Seattle, doesn’t it?)
Needless to say, though, neither of the Scripting Guys will be spending much time outside today. OK, granted, the Scripting Guy who writes this column might, if Microsoft finally reaches the breaking point and tosses him out on his rear end. But, barring that, he’ll be inside all day today.
Which means that he might as well try to write a script that can remove delimiters from a text file and, just for good measure, substitute apostrophes for incorrectly-place double quote marks as well. Before we do that, however, let’s take a peek at the CSV file that MP sent along:
"1","HOUSE","JOE J","SMITH","01/01/2005" "2","BUSINESS","BAKERY","SMITH2","02/01/2005" "3","BUSINESS","O"MALLEYS TIRE AND AUTO","SMITH3","03/01/2005" "4","HOUSE","JASON","SMITH4","04/01/2005"
As you can see, this is actually a fairly standard little CSV file, with individual fields enclosed in double quote marks and separated by commas. Line 1, for example, contains the following field values:
There is one bit of goofiness, however: this particular file (or, more correctly, the application that generates it) uses a double quote mark instead of an apostrophe. Take a peek at the third field in line 3:
"O"MALLEYS TIRE AND AUTO"
Needless to say, that’s supposed to look like this:
"O'MALLEYS TIRE AND AUTO"
In the end, that means there are two things we need to do. First, we need to get rid of the double quote marks; MP has no use for these delimiters. Second, we need to substitute an apostrophe (a single quote mark) as needed. That, by the way, is the sticking point. If all we had to do was get rid of the double quote marks that would be easy: we could simply replace each double quote mark with nothing. But we can’t do that; if we did, then O’Malleys Tire and Auto would look like this:
OMALLEYS TIRE AND AUTO
That’s not what we want, not what we want at all.
So how are we going to tackle these two problems? To begin with, we might note that MP sent along a sample script that almost did the trick; that script gave him output that looked like this:
"1,HOUSE,JOE J,SMITH,01/01/2005" "2,BUSINESS,BAKERY,SMITH2,02/01/2005" "3,BUSINESS,O"MALLEYS TIRE AND AUTO,SMITH3,03/01/2005" "4,HOUSE,JASON,SMITH4,04/01/2005"
What MP did was search for each instance of a comma enclosed in double quote marks (“,”) and replace each instance with just a comma. Like we said, that almost worked; unfortunately, though, it left a double quote mark at the beginning and at the end of each line, and it did nothing to address the apostrophe issue. With any luck, however, the following script should take care of all of MP’s problems:
Const ForReading = 1 Const ForWriting = 2 Set objFSO = CreateObject("Scripting.FileSystemObject") Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading) Do Until objFile.AtEndOfStream strLine = objFile.ReadLine intLength = Len(strLine) strLine = Mid(strLine, 2, intLength - 2) strText = Chr(34) & "," & Chr(34) strLine = Replace(strLine, strText, ",") strLine = Replace(strLine, Chr(34), "'") strNewText = strNewText & strLine & vbCrLf Loop objFile.Close Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting) objFile.Write strNewText objFile.Close
OK, now we’re ready to launch into a discussion of the code and how it works. To begin with, we create a pair of constants, ForReading and ForWriting; we’ll need these constants in order to open our CSV file in the correct mode (that is, either for reading or for writing). After defining the constants we create an instance of the Scripting.FileSystemObject, then use the OpenTextFile method to open the file C:\Scripts\Test.txt for reading:
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)
After the file is open we set up a Do Until loop that will enable us to read the file, line-by-line, until there aren’t any lines left to read. (Or, to put a more technical spin on it, until the file’s AtEndOfStream property is True.) Inside that loop the first thing we do is use the ReadLine method to read the first line in the file and store it in a variable named strLine:
strLine = objFile.ReadLine
And yes, now the fun begins. We kick off the festivities by using the Len function to determine the number of characters in the variable strLine. Why do we do that? Well, as you recall, one of the problems MP ran into was this: his search-and-replace operation left a double quote mark at the beginning and at the end of each line. One easy way to get around that problem is to simply grab all the characters in each line except for the beginning and ending character. In the case of line 1, that gives us this following value:
What good does that do us? Actually, it does us quite a bit of good; with the two endpoints gone we can replace each instance of “,” with a comma. In turn, that makes line 1 equal to this:
Which, of course, is just exactly what we want it to be equal to.
But more on that in just a moment. Instead, let’s try doing first things first, and explain how we get to that point. In order to extract all the characters except the first and last characters from strLine we need to know the total number of characters in that string. Once we know that value we can use the Mid function to grab all the characters in strLine except the first and last characters:
strLine = Mid(strLine, 2, intLength - 2)
There’s really nothing fancy going on here: we’re simply asking Mid to start at character position 2 and then grab x number of characters (with x being equal to the length of the string minus 2). By starting at character position 2 we skip over the first character (the beginning double quote mark) altogether. And by grabbing the next x characters (with x, again, being equal to the length of the string minus 2) that lets us grab everything else except for the very last character.
For example, suppose strLine happened to be a string six characters long:
If we start at character 2 we’ll start on the T in Test. And suppose we grab the total number of characters minus 2? Well, the total number of characters is 6, and 6 minus 2 is 4. If we start on the T and take a total of 4 characters that results in us grabbing the following value:
Pretty clever, if we do say so ourselves.
Note. And we do say so ourselves, because we assume that no one else will.
Our next chore is to replace each instance of a comma enclosed in double quote marks with just a comma. In order to do that, we first run this line of code:
strText = Chr(34) & "," & Chr(34)
All we’re doing here is defining the text that we want to search for: a double quote mark (Chr(34)) followed by a comma followed by another double quote mark. After we’ve defined the search text we can replace each instance of that text with a plain old comma by using this line of code:
strLine = Replace(strLine, strText, ",")
That’s great, but it still leaves us with a problem. As-is, our script will correctly remove all the delimiters from lines 1 and 2; it will even remove all the delimiters from lines 3 and 4 as well. When we reach line 3, however, we run smack-dab into the apostrophe issue:
O"MALLEYS TIRE AND AUTO
Great. What do we do about that?
Well, the one thing that we shouldn’t do is panic. At this point in time we’ve eliminated the double quote marks that appear at the beginning and at the end of each line; we’ve also removed all the double quote marks that were used as field delimiters. If there’s a double quote mark left in the string (and there is) then this is undoubtedly a double quote mark that’s supposed to be an apostrophe. And if that’s the case (and it is) we can substitute an apostrophe for the double quote mark by using this line of code:
strLine = Replace(strLine, Chr(34), "'")
Once we’ve done that we tack the new value of strLine plus a carriage return-linefeed character (vbCrLf) onto the end of a string variable named strNewText:
strNewText = strNewText & strLine & vbCrLf
And then it’s back to the top of the loop, where we repeat the process with the next line in the text file. After we’ve processed each and every line in the file we close the CSV file, then immediately reopen it, this time for writing:
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForWriting)
From here we use the Write method to replace the existing contents of Test.txt with the value of strNewText. And once we’ve done that we simply call the Close method to close the file for good.
That should do it, MP; let us know if it doesn’t. In the meantime, here’s the current weather forecast for the Seattle area:
Springtime in Seattle … it just doesn’t get any better than this, does it?