Hey, Scripting Guy! How can I count the number of words in a text file?
— LA
Hey, LA. You know, this is one of those questions where the Scripting Guys outsmarted themselves. (Not that outsmarting the Scripting Guys is particularly hard to do, mind you.) For one thing, we’re writing this column on a Friday, and we always look for an easy way out on Fridays. For another, we were just involved in a discussion on word counts the other day, so the subject was already on our minds. This question sounded easy and we’d already been thinking about word counts: add the two together and you have the perfect column for a Friday.
Or so we thought.
The first hint of trouble occurred right off the bat, when we sat down to figure out the answer to your question. After all, there are several different ways we could approach this problem. For example, it’s easy to calculate word counts using Microsoft Word, so our first thought was, “Let’s just use Microsoft Word.” But that seemed like overkill, and we didn’t want to imply that you couldn’t count the number of words in a text file unless you went out and bought Microsoft Office. (Although if the Office team would give us a commission we’d reconsider that position.) We then thought, “You know, this is probably the perfect scenario for using regular expressions.” But then we got a headache just thinking about regular expressions and so we abandoned that idea, too.
We then came up with this simple and elegant solution:
Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\scripts\test.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

arrWords = Split(strText, " ")
Wscript.Echo Ubound(arrWords) + 1
Simple and elegant indeed: all we did here was open the text file C:\Scripts\Test.txt and read the entire file into a variable named strText. We then used the Split function to split that text on blank spaces (figuring that the only time you would have a blank space would be between words). Having used the Split function to create an array named arrWords (an array in which each element represents a single word), all we had to do then was echo back the Ubound (upper bound) value of the array, plus 1. (Why plus 1? Because VBScript arrays start at item 0, the Ubound value of an array is always the number of items in the array minus 1.)
That worked – sort of. As it turned out, though, the text file we used occasionally had extra blank spaces to align information:
Name              Date
Ken Myer          3/30/2006
Pilar Ackerman    3/31/2006
That created a problem: each of those extra blank spaces was counted as being a word. Thus our final word count was a little bit higher than it should have been.
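To see where the extra "words" come from, here is a small sketch. The sample string is our own, built with the Space function so the padding is explicit; it is not taken from the real test file. Split hands back an empty item for each extra blank space in a run:

```vbscript
' Sample line: one space between "Ken" and "Myer", then five spaces of padding before the date.
strLine = "Ken Myer" & Space(5) & "3/30/2006"

arrWords = Split(strLine, " ")

' Six blank spaces in all means seven array items; four of them are empty strings.
Wscript.Echo Ubound(arrWords) + 1    ' Echoes 7, although the line has only 3 words
```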
Back to the drawing board:
Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\scripts\test.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

arrWords = Split(strText, " ")

For Each strWord in arrWords
    If Len(strWord) > 0 Then
        i = i + 1
    End If
Next

Wscript.Echo i
As you can see, this time around we didn’t echo back the Ubound value. Instead, we set up a For Each loop to loop through all the items in the array. Inside that loop we used the Len function to determine the number of characters in each individual item. If the length of the item was 0, that meant we had encountered one of our excess blank spaces. In that case we simply skipped that item (because few words have 0 characters in them). If the length was greater than 0, then we incremented a counter variable by 1:
i = i + 1
After looping through the entire array we then echoed back the value of our counter variable:
Wscript.Echo i
This was much better, but the word count still seemed a little too high. After puzzling this over for a minute or two we realized why. Suppose our text file consisted of this sentence:
Two plus two = four
Most people would say that there are four words in this sentence; however, our script insisted that there were five words in the sentence:
Two, plus, two, =, four
Why five words? Because the script was counting the equals sign (=) as a word. Likewise, we had other “extraneous” characters in the document: for example, this construction counted as three words all by itself:
. . .
Yuck.
We didn’t like that, and so we modified the script one final time, using a series of Replace functions to replace characters such as the equals sign and the period with blank spaces:
Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\scripts\test.txt", ForReading)

strText = objFile.ReadAll

strText = Replace(strText, ",", " ")
strText = Replace(strText, ".", " ")
strText = Replace(strText, "!", " ")
strText = Replace(strText, "?", " ")
strText = Replace(strText, ">", " ")
strText = Replace(strText, "<", " ")
strText = Replace(strText, "&", " ")
strText = Replace(strText, "*", " ")
strText = Replace(strText, "=", " ")

strText = Replace(strText, vbCrLf, " ")

objFile.Close

arrWords = Split(strText, " ")

For Each strWord in arrWords
    If Len(strWord) > 0 Then
        i = i + 1
    End If
Next

Wscript.Echo i
This one we liked better. Like our previous scripts, we start this one off by defining a constant named ForReading; this constant tells the FileSystemObject that we want to read the text file (as opposed to writing or appending to it). Next we create an instance of the FileSystemObject and use the OpenTextFile method to open the file C:\Scripts\Test.txt. Once we get the FileSystemObject up and running we then use the ReadAll method to read the entire file into a variable named strText:
strText = objFile.ReadAll
Following that we execute a series of Replace functions to replace characters in the variable strText. (Note that we aren’t touching the actual file itself, just the copy of the file stored in memory.) For example, this line of code replaces all the commas in strText with a blank space:
strText = Replace(strText, ",", " ")
We’ll leave it up to you to decide which characters – if any – you want to replace. If you’re OK with the equals sign and the plus sign (+) being counted as individual words then you might not have to make any replacements at all.
Wait, check that: there’s one replacement that you will have to make. Suppose we have a text file that looks like this:
A
B
C
D
E
How many words in this text file? We would have said 5, too, but the script said there was just 1. Why? Well, we told the script to split the text on the blank space; however, this file doesn’t have any blank spaces, just carriage return-linefeeds at the end of each line. Therefore, our array has only one item in it. Ouch.
So how do we overcome that problem? That was actually pretty easy: we just replaced all the carriage return-linefeeds (vbCrLf) with blank spaces:
strText = Replace(strText, vbCrLf, " ")
Once we had blank spaces between each character (rather than carriage return-linefeeds between each character) the script correctly reported back 5 words for this sample text file.
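If you'd like to see the before and after for yourself, here is a minimal sketch. It builds the five-line sample in memory instead of reading it from a file, which is our shortcut for illustration only:

```vbscript
' Hypothetical sample text: five one-letter lines separated by carriage return-linefeeds.
strText = "A" & vbCrLf & "B" & vbCrLf & "C" & vbCrLf & "D" & vbCrLf & "E"

' No blank spaces yet, so Split returns the whole string as a single item.
Wscript.Echo Ubound(Split(strText, " ")) + 1    ' Echoes 1

' Swap each carriage return-linefeed for a blank space and try again.
strText = Replace(strText, vbCrLf, " ")
Wscript.Echo Ubound(Split(strText, " ")) + 1    ' Echoes 5
```

If the file ends with a carriage return-linefeed, the replacement leaves a trailing blank space and therefore one empty item at the end of the array; the Len check in the For Each loop skips it.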
Now, where were we? Oh, yeah. After we close the file we call the Split function to split strText into an array. We then use the For Each loop we already showed you to count the number of words in the array (and hence the number of words in the text file), skipping over excess blank spaces. We then echo back the value of our counter variable and we're done.
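If you find yourself counting words in more than one script, the same steps can be wrapped in a function. This is only a sketch: CountWords and its arrSeparators parameter are names we made up for illustration, and you would pass in whichever characters you decided to treat as separators.

```vbscript
' CountWords is a hypothetical helper name, not part of VBScript or the FileSystemObject.
' ByVal keeps the caller's copy of the text unchanged.
Function CountWords(ByVal strText, arrSeparators)
    Dim strSeparator, strWord, intCount

    ' Turn every separator into a blank space so that Split has one delimiter to look for.
    For Each strSeparator In arrSeparators
        strText = Replace(strText, strSeparator, " ")
    Next

    intCount = 0
    For Each strWord In Split(strText, " ")
        ' Zero-length items come from consecutive blank spaces; skip them.
        If Len(strWord) > 0 Then
            intCount = intCount + 1
        End If
    Next

    CountWords = intCount
End Function

Wscript.Echo CountWords("Two plus two = four", Array("=", vbCrLf))    ' Echoes 4
```

To count the words in a file, read it with ReadAll as before and pass strText as the first argument.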
At least to our satisfaction. Whether or not the word count is 100% accurate is somewhat subjective. For example, suppose you have this line in the text file:
2+2=4
Do you have 5 words in this line (2, +, 2, =, and 4)? Maybe you have just three words: 2, 2, and 4. Or maybe you just have one word: 2+2=4. (Microsoft Word sees this as being a single word.) You’ll have to make those decisions on your own. As for us, we’ve decided that the next time we find an “easy” question to answer we’ll just skip that one and try something else!
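For anyone braver than we are about regular expressions, each of those three definitions can be written as a pattern. What follows is a sketch using the VBScript.RegExp object; the patterns are our own guesses at what "word" might mean, not an official definition, and they are not what the scripts above do:

```vbscript
strText = "2+2=4"

Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True    ' Find every match, not just the first

' One word: any run of non-whitespace characters.
objRegEx.Pattern = "\S+"
Wscript.Echo objRegEx.Execute(strText).Count    ' Echoes 1

' Three words: runs of letters, digits, and underscores only.
objRegEx.Pattern = "\w+"
Wscript.Echo objRegEx.Execute(strText).Count    ' Echoes 3

' Five words: as above, plus each remaining symbol counted on its own.
objRegEx.Pattern = "\w+|[^\w\s]"
Wscript.Echo objRegEx.Execute(strText).Count    ' Echoes 5
```

Because \S+ treats any whitespace as a boundary, this approach also copes with tabs and line breaks without a separate Replace step.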