May 1st, 2008

Hey, Scripting Guy! How Can I Sort a Text File Based on a Specific Column in That File?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a text file composed of many different lines; each line has a number of fields, with the fields separated from one another by a tab character. I need to sort this text file based on the value of one of these fields. How can I do that?
— IR

SpacerHey, Scripting Guy! AnswerScript Center

Hey, IR. You know, that’s an interesting question, if only because – oops, hold on a second, we just received an email with the Subject line Message From the Treasury Department. We better check this out; after all, that could mean that Scripting Guy Peter Costantini has started to write counterfeit VBScript scripts again.

Note. Actually, counterfeiting falls under the province of the United States Secret Service (which obviously isn’t all-that-secret), which is merely a division of the Treasury Department. The Treasury Department itself exists primarily to act in the role of “the steward of U.S. economic and financial systems, and as an influential participant in the global economy.” In addition, the Treasury Department works to “predict and prevent economic and financial crises.” You know that subprime mortgage thing that’s currently wreaking havoc on the US economy? Well, don’t feel bad; apparently the Treasury Department didn’t know about it, either.

Anyway, please excuse us for a second while we deal with this important email. According to the message we received we need to log onto our account with the Treasury Department and verify that our personal information is correct; as it turns out, that’s the only way that the Treasury Department can protect us against fraud. We better do that right away.

And yes, we know: many of you are thinking, “Don’t do it, Scripting Guy who writes that column; this sounds like a scam to us.” And we have to admit that, at first, we had the same thought ourselves. However, take a look at just one portion of the email:

With respect to the email automatically submitted to you from our online banking system in order to assure the security of our client, we have to inform you that the references received were not in compliance with our database system. Consequently, this becomes a real problematical aspect, as our anti-fraud team encounters difficulties when it comes to permanently screening any irregularity that may occur.

See? Two sentences, neither of which makes a lick of sense. This has to be a message from a government agency!

Or technical documentation from Microsoft.

At any rate, we’re trying to log onto the Treasury Department Web site right now. (We tried to call the toll-free number included in the email, but it turns out that they didn’t provide us with a valid phone number. Can’t even get their own phone number right? Now try telling us that this email didn’t come from a government agency!)

While we wait (the servers seem to be a bit slow his morning, probably because the Treasury Department is busy advising the President on economic and financial issues, encouraging sustainable economic growth, and fostering improved governance in financial institutions) we thought we’d take a moment to answer IR’s question. Might as well, eh?

To begin with, IR has a tab-delimited text file; that is, a text file in which the individual fields are separated using tabs. Here’s a version of the text file, as provided by IR himself:

Wer     red     ring    cic     Mswf
win     der     jhg     ppp     Iwol
swe     ffo     ccc     aqs     Ihte
dde     uyt     jhg     der     Cabc

What we’re interested in is column 5, the next-to-last column in the file. (IR is actually interested in column 33 in his real text file, but we’re simplifying things a bit for the purposes of this column.) What IR needs to do is sort the text file on the value found in column 5; in other words, he needs output that looks like this:

dde     uyt     jhg     der     Cabc
swe     ffo     ccc     aqs     Ihte
win     der     jhg     ppp     Iwol
wer     red     ring    cic     Mswf

How’s he going to do that, and, by extension, how are we going to do that? Why, by running the following script, of course:

Const adVarChar = 200
Const MaxCharacters = 255
Const adFldIsNullable = 32
Const ForReading = 1

Set DataList = CreateObject(“ADOR.Recordset”) DataList.Fields.Append “LineText”, adVarChar, MaxCharacters, adFldIsNullable DataList.Fields.Append “SortCharacter”, adVarChar, MaxCharacters, adFldIsNullable DataList.Open

Set objFSO = CreateObject(“Scripting.FileSystemObject”) Set objFile = objFSO.OpenTextFile(“C:\Scripts\Test.txt”, ForReading)

Do Until objFile.AtEndOfStream strLine = objFile.ReadLine arrFields = Split(strLine, vbTab) strCharacter = arrFields(4)

DataList.AddNew DataList(“LineText”) = strLine DataList(“SortCharacter”) = strCharacter DataList.Update Loop

objFile.Close

DataList.Sort = “SortCharacter, LineText”

Do Until DataList.EOF Wscript.Echo DataList.Fields.Item(“LineText”) DataList.MoveNext Loop

As most of you know by now, sorting a text file isn’t exactly VBScript’s strong point; that’s especially true when it comes to sorting a file by the value found in the fifth column in a set of tab-separated values. Therefore, we decided to cheat a little: we aren’t going to sort the text file, at least not directly. Instead, we’re going to grab the contents of the file, stash that information in a disconnected recordset, and then sort the disconnected recordset.

And yes, you would think that this is the sort of thing that the Treasury Department would take care of, wouldn’t you? Somehow or another, though, this task seems to have fallen through the governmental cracks. Which means that, as usual, it’s up to the Scripting Guys.

Note. Incidentally, if you aren’t familiar with the term, a “disconnected recordset” is simply a database table that exists only in memory; that is, the data is tied to an actual table in an actual database. We aren’t going to discuss the whys and wherefores of disconnected recordsets in any detail today; if you need a little background information on the subject you might check out this section of the Microsoft Windows 2000 Scripting Guide.

As for our script, we begin by defining a bunch of constants. The first three constants are needed to add fields to our disconnected recordset; in particular:

adVarChar tells the script that we want these new fields to have the variant data type.

MaxCharacters tells the script that the field can contain a maximum of 255 characters.

adFldIsNullable tells the script that it’s OK to have Null values in the field. We won’t actually have Null values in either of our fields, but this covers us in case we run into a line of text where column 5 is blank.

And then we have one of our old favorites, the constant ForReading, which we’ll use when we open the text file C:\Scripts\Test.txt. Good to see you, ForReading; glad you could make it.

After we define all the constants we create an instance of the ADOR.Recordset object, the object that enables us to construct a disconnected recordset. We then use these two lines of code to add two fields (LineText and SortCharacter) to the recordset:

DataList.Fields.Append “LineText”, adVarChar, MaxCharacters, adFldIsNullable
DataList.Fields.Append “SortCharacter”, adVarChar, MaxCharacters, adFldIsNullable

Once we’ve done that we use the Open method to open the recordset and prepare for data entry:

DataList.Open

Next we create an instance of the Scripting.FileSystemObject object, then use the OpenTextFile method to open the file Test.txt for reading:

Set objFile = objFSO.OpenTextFile(“C:\Scripts\Test.txt”, ForReading)

And then we’re done. See you all tomorrow!

Oh, wait; apparently we aren’t done, at least not yet. With that in mind, we next set up a Do Until loop designed to run until we’ve read all the lines in the text file (that is, until the file’s AtEndOfStream property is True):

Do Until objFile.AtEndOfStream

What do we do inside this loop? Well, for starters, we use the ReadLine method to read the first line in the text file and store that information in a variable named strLine:

strLine = objFile.ReadLine

At that point we execute these two lines of code:

arrFields = Split(strLine, vbTab)
strCharacter = arrFields(4)

What are we doing here? Well, as you might recall, in addition to the production of coin and currency, the disbursement of payments to the American public, revenue collection, and the borrowing of funds necessary to run the federal government, our script also needs to sort the lines in the text file based on the value of column 5. That’s fine, except for this: how in the world are we supposed to know what the value of column 5 is?

Well, one easy way to get at that information is to use VBScript’s Split method to split the line of text into an array (splitting, of course, on the tab character, which is represented by the VBScript constant vbTab). When we do that we’ll get an array that looks like this:

wer     
red     
ring     
cic     
Mswf

As you can see, the last item in the array, item 4 (remember, the first item in an array is always item 0), just happens to be column 5. That’s how we can determine the value of column 5, a value we stash in a variable named strCharacter.

We now have the complete text of the line stored in the variable strLine, and the value of column 5 stored in the variable strCharacter. And because we do have both those pieces of information we can use this block of code to add the text and the value of column 5 to our database:

DataList.AddNew
DataList(“LineText”) = strLine
DataList(“SortCharacter”) = strCharacter
DataList.Update

As you can see, we store the complete text of the line in the LineText field and the value of column 5 in the SortCharacter field. By storing these two values in separate fields it’s going to be very easy to sort the lines in the text file by the value in column 5.

As you’ll find out in just a second.

After we add the information from line 1 to the disconnected recordset we zip back to the top of the loop and repeat the process with the next line in the text file. Once we’ve processed each line in the file we close Test.txt, then use the Sort method to sort these lines of text, first by column 5 (SortCharacter) and then by the actual line of text itself:

DataList.Sort = “SortCharacter, LineText”

After that we set up a Do Until loop designed to loop through all the records in the recordset (that is, until the recordset’s EOF – end-of-file – property is True):

Do Until DataList.EOF

Inside that loop we simply echo back the value of the LineText property, then use the MoveNext method to move on to the next record in the recordset:

Wscript.Echo DataList.Fields.Item(“LineText”)
DataList.MoveNext

And what will that give us? That will give us output that looks like this:

dde     uyt     jhg     der     Cabc
swe     ffo     ccc     aqs     Ihte
win     der     jhg     ppp     Iwol
wer     red     ring    cic     Mswf

In case you’re wondering, that’s the text from Test.txt, sorted by the value in column 5. Mission accomplished. And now we are done.

Well, we’re done with IR’s problem, that is. As for the problem with the Scripting Guy’s account at the Treasury Department, well, we’re still working on that (must be a lot of people trying to log onto the server and make sure that they are in compliance with the database). In the meantime, the Scripting Guy who writes this column did stumble upon an interesting Web site that lists the current national debt for the USA. In case you’re wondering, as of this writing the US owes a total of $ 9,334,965,822,890.37.

But don’t worry; turns out that you can write a check (made out to the Bureau of Public Debt) and help pay down that balance of $9 trillion. Just send your check here:

Attn Dept G

Bureau Of the Public Debt

P. O. Box 2188

Parkersburg, WV 26106-2188

And that’s a good question: wouldn’t the Scripting Guys be doing everyone a public service by just going ahead and paying off the national debt? Yes, we would, and that’s something we’ll seriously consider. Granted, what with the Scripting Son starting college this fall it’s not going to be easy to come up with an additional $9 trillion. But we’ll see what we can do.

Author

0 comments

Discussion are closed.

Feedback