June 28th, 2006

How Can I Get a List of the Unique Words Used in a Microsoft Word Document?

Hey, Scripting Guy! Question

Hey, Scripting Guy! How can I get a list of the unique words used in a Microsoft Word document?

— RK

SpacerHey, Scripting Guy! AnswerTechNet Script Center

Hey, RK. Funny you should mention unique words. Last Saturday the Scripting Coach’s baseball team played in the city championship. Despite the importance of the game the team was missing two key players, and the Scripting Coach knew that meant that his infield would be a little weak defensively. What he didn’t realize was just how weak it would be. In the bottom of the second inning the opposition’s leadoff hitter singled. That would be the last ball that would leave the infield in that inning, yet the opposing team managed to score 6 runs. (How? By having seven consecutive batters reach base on an infield error). As you might expect, the Scripting Coach’s team never really recovered after that.

Needless to say, a number of words ran through the Scripting Coach’s head during that disastrous second inning, with many of those words being very … unique ….

Well, except to other baseball coaches, of course.

Ah, but you don’t want to hear about the city championship game, do you? (Which is good, because this Scripting Guy doesn’t want to talk about it, either.) You’d rather talk about a script that can produce a list of the unique words used in a Microsoft Word document. You know, a script similar to this:

Set objDictionary = CreateObject(“Scripting.Dictionary”)

Set objWord = CreateObject(“Word.Application”) objWord.Visible = True

Set objDoc = objWord.Documents.Open(“C:\Scripts\Sample.doc”)

Set colWords = objDoc.Words

For Each strWord in colWords strWord = LCase(strWord) If objDictionary.Exists(strWord) Then Else objDictionary.Add strWord, strWord End If Next

Set objDoc2 = objWord.Documents.Add() Set objSelection = objWord.Selection

For Each strItem in objDictionary.Items objSelection.TypeText strItem & vbCrLf Next

Set objRange = objDoc2.Range objRange.Sort

Let’s see if we can figure out how this script works. As you can see, we start out simple enough: we just create an instance of the Scripting.Dictionary object. (Of course, everything starts simple, just like the second inning of the city championship game did.) In just a moment we’ll use the Dictionary object as a place to store all the unique words in the document. We then use these three lines of code to create a visible instance of the Word.Application object and open the document C:\Scripts\Sample.doc:

Set objWord = CreateObject(“Word.Application”)
objWord.Visible = True

Set objDoc = objWord.Documents.Open(“C:\Scripts\Sample.doc”)

That was easy, wasn’t it? (Apparently much easier than catching a lazy little pop fly.) With our document open we’re now ready to grab a list of the unique words. To do that, we first need to get a list of all the words. That sounds like a complicated procedure, but, fortunately, it’s not; that’s because all the words in a Microsoft Word document are stored in the document’s Words collection. That’s a collection we can retrieve using just one line of code:

Set colWords = objDoc.Words

Our next step is to weed out all duplicate words in the collection; that will leave us (and RK) with a list of unique words. For example, suppose our Word document contains the following words:

these
words
are
the
words
in
the
document

When we weed out all the duplicates (like multiple instances of the words word and the) we’re left with this:

these
words
are
the
in
document

Which is the very thing RK is hoping to get.

To weed out the duplicate words we use this block of code:

For Each strWord in colWords
    strWord = LCase(strWord)
    If objDictionary.Exists(strWord) Then
    Else
        objDictionary.Add strWord, strWord
   End If
Next

As you can see, we start by setting up a For Each loop to loop through the collection of words in the document. Inside the loop we examine each word individually, using the LCase function to convert the word to all lowercase letters. (Why? Well, that helps us avoid any problems like having the words Cat, cat, and CAT being marked as different words.)

After the word has been converted to lowercase we then use the Exists method to determine whether or not the word is already in the Dictionary:

If objDictionary.Exists(strWord) Then

If the word is already in the Dictionary the script simply loops around and repeats this process with the next word in the collection. If the word is not in the Dictionary then we use this line of code to add the word (specifying the same value – the word itself – as both the Dictionary item and Dictionary key):

objDictionary.Add strWord, strWord

Note. Sorry; we thought that the Dictionary object was like the Heimlich Maneuver, something everyone already knows. If you aren’t familiar with the Dictionary object and how to use it, take a peek at the Microsoft Windows 2000 Scripting Guide.

What do you mean you don’t know the Heimlich Maneuver, either? Fine; we’ll see if we can locate an email address for Hey, Heimlich Maneuver Guy!

After we’re done with the loop the unique words in the document will be safely stashed in the Dictionary. If we wanted to, we could simply echo back those values; that requires no more code than this:

For Each strItem in objDictionary.Items
    Wscript.Echo strItem 
Next

We thought we’d go one better than that, however, and add these words – in alphabetical order – to a brand-new Word document. To do that we need to create a new document (notice the new object reference, objDoc2) and then create an instance of the Word Selection object, which simply positions the cursor at the beginning of the document:

Set objDoc2 = objWord.Documents.Add()
Set objSelection = objWord.Selection

Once we’ve done that we can then loop through the items in the Dictionary, using the TypeText method to add the word (plus a carriage return-linefeed) to the document:

For Each strItem in objDictionary.Items
    objSelection.TypeText strItem & vbCrLf
Next

That gives us a Word document that looks something like this:

these
words
are
the
in
document

What’s that? Alphabetical order? No one ever said anything about – oh, that’s right, we did say we’d sort these words in alphabetical order, didn’t we? OK, that’s easy enough:

Set objRange = objDoc2.Range
objRange.Sort

That’s all we have to do. We create a new instance of the Range object; because we provided no additional parameters the new range will, by default, encompass the entire document. And then we call the Sort method; when we call Sort without any parameters we get the items sorted in alphabetical order. Just like this:

are
document
in
the
these
words

Pretty slick, huh?

We should point out that you might get a few anomalies in your list of unique words: that’s because Microsoft Word considers some crazy things – like periods – to be words. If you don’t want punctuation marks to be tagged as words you can add code to, say, weed out anything that doesn’t start with a letter. We won’t discuss this revised script; we’ll just note that it uses the ASC function to filter out anything that doesn’t start with a letter:

Set objDictionary = CreateObject(“Scripting.Dictionary”)

Set objWord = CreateObject(“Word.Application”) objWord.Visible = True

Set objDoc = objWord.Documents.Open(“C:\Scripts\Sample.doc”)

Set colWords = objDoc.Words

For Each strWord in colWords strWord = LCase(strWord) strLetter = Left(strWord, 1) If ASC(strLetter) < 97 OR ASC(strLetter) > 122 Then Else If objDictionary.Exists(strWord) Then Else objDictionary.Add strWord, StrWord End If End If Next

Set objDoc2 = objWord.Documents.Add() Set objSelection = objWord.Selection

For Each strItem in objDictionary.Items objSelection.TypeText strItem & vbCrLf Next

Set objRange = objDoc2.Range objRange.Sort

As for baseball, after Saturday’s debacle both the Scripting Coach and his Scripting Son vowed that they were through with the sport forever. Of course, that night another team called the Scripting Son and asked if he’d be willing to join their squad for the remainder of the season. And, needless to say, the Scripting Coach said he’d be willing to help out if the team needed his help, too.

But other than that, both father and son are through with baseball. Forever.

Or at least until Fall baseball starts up in August.

Author

0 comments

Discussion are closed.

Feedback