October 23rd, 2007

Hey, Scripting Guy! How Can I Extract All the Text Between Two Tags in a Text File?

Hey, Scripting Guy! I have a whole bunch of text files in a folder. I need to open each of these files, extract everything between the <filecount> and </filecount> tags, and then write that information to a separate file. How can I do that?

— RP

Hey, RP. You know, the other day the Scripting Guy who writes this column and the Scripting Son were discussing the upcoming Scripting Guys’ trip to Barcelona for TechEd IT Forum 2007. “I still don’t get it,” said the Scripting Son. “Why would anyone want to talk to you?”

“Well,” said the Scripting Dad. “I guess I’m kind of famous.”

“No you’re not,” said the Scripting Son. “If you were famous, you’d be in Wikipedia.”

And you know what? The Scripting Son was right: if the Scripting Guys were famous they would be in Wikipedia. And yet they aren’t. So does that mean that maybe the Scripting Guys aren’t anywhere near as famous and important as they think they are?

No, of course not; that just means the Scripting Guys, being the modest and unassuming types, haven’t revealed enough about their lives to enable someone to write a Wikipedia entry. Admittedly, we could go write such an entry ourselves; however, that doesn’t seem very sporting. Therefore, if anyone out there is interested in creating a Wikipedia entry for the Scripting Guys, well, here are some choice anecdotes to get you started:

In 1949, Scripting Guy Jean Ross was third runner-up in the Miss Iowa beauty pageant. However, she was forced to abdicate her title several months later due to her role in the great Dairy Farm Scandal of 1950.

In February of 1606, Galileo Galilei mentioned to his young apprentice, Scripting Guy Peter Costantini, that he could “really use a script that would help him monitor the processes running on a computer.” That very day Peter wrote his first column on how to monitor the processes running on a computer. Today, the Script Center boasts not only that first column, but also the 347,286 follow-up columns that Peter has composed on the same subject.

In August, 2005 Scripting Guy Dean Tsaltas made himself a sandwich. “It was pretty good,” Dean recalled. “Not as good as you get at Subway or Quiznos. But still pretty good.”

In October, 2007 Scripting Guy Greg Stemp was told, “If you were famous, you’d be in Wikipedia.”

We assume that should be more than enough information to get a Scripting Guys entry accepted by Wikipedia’s editorial board. If it’s not, however, well, just wait until the folks at Wikipedia get a whiff of this: a script that can extract the information found between two tags in a text file. Let’s show you the code for performing this feat on a single file, then we’ll show you a fancier script, one that can extract this information for each file in a folder.

Here’s the basic script:

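Const ForReading = 1

' Open the file, read everything into memory, and close it again.
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

strContents = objFile.ReadAll
objFile.Close

' The opening and closing tags to search for.
strStartText = "<filecount>"
strEndText = "</filecount>"

' Position of the first character after the opening tag.
intStart = InStr(strContents, strStartText)
intStart = intStart + Len(strStartText)

' Position where the closing tag begins.
intEnd = InStr(strContents, strEndText)

' Number of characters between the two tags.
intCharacters = intEnd - intStart

strText = Mid(strContents, intStart, intCharacters)

Wscript.Echo strText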

Actually we agree: this does sort of look like gibberish, doesn’t it? But don’t worry: we promise to de-gibberish this before we call it a day.

Note. Yes, “de-gibberish” is a technical term, a technical term that only someone truly famous – and deserving of recognition in an online encyclopedia – would be able to use.

And use correctly.

To begin with, we define a constant named ForReading and set the value to 1; we’ll use this constant when we go to open our text file. We next create an instance of the Scripting.FileSystemObject, then use the OpenTextFile method to open the file C:\Scripts\Test.txt for reading:

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

So what are we going to do with this file now that it’s open? Well, to tell you the truth, not much: we’re simply going to call the ReadAll method to read the contents into a variable named strContents, then call the Close method to immediately close the file.

But don’t worry; once the contents of the file are in memory we won’t need the file anymore. Listen, you can trust us on that; have the Scripting Guys ever let you down before?

Note. It would probably be best if all those times we let people down were left out of our Wikipedia entry. After all, we assume that even Wikipedia imposes some sort of maximum length on articles.

As RP noted, he wants to extract all the text found between the <filecount> and </filecount> tags. We don’t know for sure what RP’s text files look like, so we’re going to use the following as our sample file:

<filename>Test.txt</filename>
<filedate>10/23/2007</filedate>
<filecount>786</filecount>
<filelocation>C:\Scripts</filelocation>

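We should point out that we’re assuming that the two target tags appear only once in each file. Will this script work if those tags appear more than once? Not really: InStr finds only the first occurrence of each tag, so the script would hand back just the first value and ignore the rest. But maybe in the near future we’ll write a follow-up column of our own and show you how to do that.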
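In the meantime, here’s a rough sketch of one way repeated tags might be handled. This is not part of the original script: the ExtractAll function and the two-value sample string are hypothetical names and data we made up for illustration. The idea is to use the optional start argument of InStr so that each search picks up where the previous match left off:

```vbscript
' Hypothetical helper (not in the original script): returns every value found
' between strOpen and strClose, one value per line.
Function ExtractAll(strSource, strOpen, strClose)
    Dim intPos, intStart, intEnd, strResult
    strResult = ""
    intPos = 1
    Do
        ' Look for the next opening tag, starting at the current position.
        intStart = InStr(intPos, strSource, strOpen)
        If intStart = 0 Then Exit Do
        intStart = intStart + Len(strOpen)

        ' Look for the closing tag that follows this opening tag.
        intEnd = InStr(intStart, strSource, strClose)
        If intEnd = 0 Then Exit Do

        strResult = strResult & Mid(strSource, intStart, intEnd - intStart) & vbCrLf

        ' Continue searching just past the closing tag.
        intPos = intEnd + Len(strClose)
    Loop
    ExtractAll = strResult
End Function

' Made-up sample text in which the tag pair appears twice.
strSample = "<filecount>786</filecount>" & vbCrLf & "<filecount>42</filecount>"

WScript.Echo ExtractAll(strSample, "<filecount>", "</filecount>")
```

Run against the made-up sample, this should echo 786 and 42 on separate lines. The loop stops quietly as soon as either tag can’t be found, which also covers a file that contains no tags at all.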

We should also add that one nice thing about this script is that it will work regardless of what the file looks like. Suppose the file looked like this:

<filename>Test.txt</filename><filedate>10/23/2007
</filedate><filecount>786</filecount><filelocation>
C:\Scripts</filelocation>

No problem; the script will still be able to pick out the text between the target tags. That’s true even if the text looked like this:

<filename>Test.txt</filename><filedate>10/23/2007
</filedate><filecount>786
</filecount><filelocation>C:\Scripts</filelocation>

Or if it looked like – well, you get the idea.
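One wrinkle worth knowing about: when a line break falls between the tags, as in that last layout, the line break is part of the text that gets extracted. Here’s a small sketch showing the effect and one possible way to clean it up. The in-memory string and the strClean variable are our own stand-ins for illustration, and we’re assuming the file uses Windows-style (CR LF) line endings:

```vbscript
' Stand-in for the last sample layout, built in memory (CR LF line endings assumed).
strContents = "<filename>Test.txt</filename><filedate>10/23/2007" & vbCrLf & _
    "</filedate><filecount>786" & vbCrLf & _
    "</filecount><filelocation>C:\Scripts</filelocation>"

strStartText = "<filecount>"
strEndText = "</filecount>"

intStart = InStr(strContents, strStartText) + Len(strStartText)
intEnd = InStr(strContents, strEndText)
strText = Mid(strContents, intStart, intEnd - intStart)

' The extracted text is "786" followed by a carriage return and a linefeed.
WScript.Echo Len(strText)     ' 5

' One way to tidy up: remove CR and LF characters, then trim stray spaces.
strClean = Trim(Replace(Replace(strText, vbCr, ""), vbLf, ""))
WScript.Echo Len(strClean)    ' 3
```

Whether you want to strip line breaks depends on the data. For a single number like a file count it’s probably harmless; for multi-line text between tags you might prefer to leave them alone.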

At any rate, our next step is to assign our two tags to a pair of variables, one named strStartText, the other named strEndText:

strStartText = "<filecount>"
strEndText = "</filecount>"

That brings us to this little block of code:

intStart = InStr(strContents, strStartText)
intStart = intStart + Len(strStartText)

What’s going on here? Well, in the first line we’re using the InStr function to determine the character position where the <filecount> tag begins; in our sample text file, that’s going to be character position 65. In the second line, we’re adding the length of the <filecount> tag (that is, the number of characters in the tag) to this value. The tag length is 11 which, when added to 65, makes the final value of intStart equal to 76. This also means that our target text is going to begin at character position 76.

If you find that a bit confusing, take a peek at the following table. Here we’ve mapped out character positions 65 through 75. As you can see, if <filecount> begins in character position 65 that means that it ends in character position 75:

Position:   65  66  67  68  69  70  71  72  73  74  75
Character:  <   f   i   l   e   c   o   u   n   t   >

And that means that the text we want to extract has to begin in character position 76.

Next we use this line of code to determine where the second tag (</filecount>) begins:

intEnd = InStr(strContents, strEndText)

In our sample file, that’s character position 79.

We’re getting closer now. Our next step is to determine how many characters we need to extract. We can calculate that value by subtracting the starting position of our target text (76) from the starting position of our closing tag (</filecount>):

intCharacters = intEnd - intStart

What does that give us? That gives us 79 minus 76 which, in turn, gives us 3. And 3 just happens to be the number of characters between the two tags.
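If you’d like to check that arithmetic without creating a file, the following sketch rebuilds the four-line sample in memory and echoes each value. It assumes the file uses Windows-style line endings (a carriage return plus a linefeed, two characters per line break); the intTagStart variable is just a name we added so that the tag position and the text position can be shown separately:

```vbscript
' Rebuild the four-line sample file in memory, assuming CR LF line endings.
strContents = "<filename>Test.txt</filename>" & vbCrLf & _
    "<filedate>10/23/2007</filedate>" & vbCrLf & _
    "<filecount>786</filecount>" & vbCrLf & _
    "<filelocation>C:\Scripts</filelocation>"

strStartText = "<filecount>"
strEndText = "</filecount>"

intTagStart = InStr(strContents, strStartText)     ' where <filecount> begins
intStart = intTagStart + Len(strStartText)         ' where the target text begins
intEnd = InStr(strContents, strEndText)            ' where </filecount> begins
intCharacters = intEnd - intStart                  ' characters between the tags

WScript.Echo intTagStart, intStart, intEnd, intCharacters
```

With CR LF line endings this echoes 65 76 79 3, matching the numbers in the walkthrough. If a file used bare linefeeds instead, each position would be 2 lower (one character fewer for each of the two preceding line breaks), but the difference between them, and therefore the extracted text, would be the same.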

Once we know that, we can then call the Mid function and extract those characters:

strText = Mid(strContents, intStart, intCharacters)

All we’re doing here is telling the script to take the value of strContents, start at character position 76 (intStart) and then count over 3 characters (intCharacters), scooping up each character along the way. And what exactly will we scoop up? This:

786

Pretty cool, huh?
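One thing the basic script doesn’t do is check whether the tags are actually there. If InStr can’t find a tag it returns 0, the character count goes negative, and Mid raises an error. Here’s a hedged sketch of how the same steps might be wrapped in a function that returns an empty string instead. ExtractBetween is a hypothetical name of our own, and unlike the original script it searches for the closing tag starting from the end of the opening tag:

```vbscript
' Hypothetical wrapper around the InStr / Len / Mid steps from the basic script.
' Returns an empty string if either tag is missing.
Function ExtractBetween(strSource, strOpen, strClose)
    Dim intStart, intEnd
    ExtractBetween = ""

    intStart = InStr(strSource, strOpen)
    If intStart = 0 Then Exit Function          ' opening tag not found
    intStart = intStart + Len(strOpen)

    ' Search for the closing tag only after the opening tag.
    intEnd = InStr(intStart, strSource, strClose)
    If intEnd = 0 Then Exit Function            ' closing tag not found

    ExtractBetween = Mid(strSource, intStart, intEnd - intStart)
End Function

WScript.Echo ExtractBetween("<filecount>786</filecount>", "<filecount>", "</filecount>")
WScript.Echo "[" & ExtractBetween("no tags here", "<filecount>", "</filecount>") & "]"
```

The first call echoes 786; the second echoes [] because nothing was found. Returning an empty string is only one possible design; you might prefer to raise your own error or log the file name instead.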

Now, without any additional explanation, here’s a script that can perform this feat on all the text files in a folder, writing the retrieved file counts to a file named Totals.txt:

Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")

strComputer = "."

Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\cimv2")

' Use WMI to retrieve every file in the C:\Scripts folder.
Set colFiles = objWMIService.ExecQuery _
    ("ASSOCIATORS OF {Win32_Directory.Name='C:\Scripts'} Where " _
        & "ResultClass = CIM_DataFile")

For Each objFile In colFiles
    ' Read the whole file into memory, then close it.
    Set objTextFile = objFSO.OpenTextFile(objFile.Name, ForReading)

    strContents = objTextFile.ReadAll
    objTextFile.Close

    strStartText = "<filecount>"
    strEndText = "</filecount>"

    intStart = InStr(strContents, strStartText)
    intStart = intStart + Len(strStartText)

    intEnd = InStr(strContents, strEndText)

    intCharacters = intEnd - intStart
    strCount = Mid(strContents, intStart, intCharacters)

    ' Add this file's count to the running list, one count per line.
    strText = strText & strCount & vbCrLf
Next

' Write all the collected counts to Totals.txt.
Set objFile = objFSO.CreateTextFile("C:\Scripts\Totals.txt")

objFile.Write strText
objFile.Close

Is that really going to work? Give it a try and see for yourself.
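A word of caution before you try it: the WMI query returns every file in the folder, including any scripts stored there and (on a second run) Totals.txt itself, and a file without the tags will cause Mid to fail. Here’s a sketch of one possible variation that sidesteps those problems by using the FileSystemObject alone. This is our own hedged alternative rather than the approach shown above; CollectCounts and its parameters are hypothetical names, and it assumes the files you care about all have a .txt extension:

```vbscript
Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")

' Hypothetical helper: gathers the <filecount> value from each .txt file in
' strFolder, writes the list to strOutput in that folder, and returns it.
Function CollectCounts(strFolder, strOutput)
    Dim objItem, objTextFile, objOut
    Dim strContents, strText, intStart, intEnd
    strText = ""

    For Each objItem In objFSO.GetFolder(strFolder).Files
        ' Skip non-text files, the output file itself, and empty files
        ' (ReadAll raises an error on an empty file).
        If LCase(objFSO.GetExtensionName(objItem.Name)) = "txt" _
            And LCase(objItem.Name) <> LCase(strOutput) _
            And objItem.Size > 0 Then

            Set objTextFile = objFSO.OpenTextFile(objItem.Path, ForReading)
            strContents = objTextFile.ReadAll
            objTextFile.Close

            intStart = InStr(strContents, "<filecount>")
            intEnd = InStr(strContents, "</filecount>")

            ' Only extract when both tags exist and appear in the right order.
            If intStart > 0 And intEnd > intStart Then
                intStart = intStart + Len("<filecount>")
                strText = strText & Mid(strContents, intStart, intEnd - intStart) & vbCrLf
            End If
        End If
    Next

    ' True = overwrite an existing output file.
    Set objOut = objFSO.CreateTextFile(objFSO.BuildPath(strFolder, strOutput), True)
    objOut.Write strText
    objOut.Close

    CollectCounts = strText
End Function

If objFSO.FolderExists("C:\Scripts") Then
    WScript.Echo CollectCounts("C:\Scripts", "Totals.txt")
End If
```

The trade-off is that this version works only against the local file system (or a path you can reach from it), whereas the WMI version can be pointed at a remote computer by changing strComputer.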

If you’re planning to rush out and create a new Wikipedia entry for the Scripting Guys, we should mention that Wikipedia prohibits “patent nonsense.” According to Wikipedia’s deletion guidelines, patent nonsense includes:

1.

Total nonsense, i.e., text or random characters that have no assignable meaning at all. This includes sequences such as “i9da7gy98sdygida%£U%ETDFHc8vda097tt{%£^O&£^IEUyrhgietysbvd}TYu{og;d”, in which keys of the keyboard have been pressed with no regard for what is typed, (or typed with the eyes closed.)

2.

Content that, while apparently meaningful after a fashion, is so completely and irredeemably confused that no reasonable person can be expected to make any sense of it whatsoever.

What does that mean to you? It means that you can go ahead and create a Scripting Guys entry; just don’t try reprinting one of the Hey, Scripting Guy! columns.

Note. In case you’re wondering, the answer is no; the Scripting Guys have no plans to sue Wikipedia. That’s true even though the phrase “Content that, while apparently meaningful after a fashion, is so completely and irredeemably confused that no reasonable person can be expected to make any sense of it whatsoever” was obviously taken directly from the Script Center Style Guide. On top of that, the Scripting Guy who writes this column holds the patent on creating columns by pressing keys on the keyboard with no regard for what is being typed.

But that’s OK; we’ll let this one slide.

Editor’s Note. For the millions of faithful Hey, Scripting Guy! fans who follow this column regularly (okay, maybe “millions” is a bit of an exaggeration – but we know there are two or three of you out there), the Scripting Editor – who was nowhere near even being born yet in 1949 – thought we should tie up some loose ends.

Last week, the Scripting Guy who writes this column mentioned the Washington vs. Oregon football game. This week he didn’t. Enough said.

The Scripting Guy who writes this column predicted the winner of the World Series. He also mentioned public ridicule…Boston Red Sox fans, feel free to ridicule.
