October 12th, 2009

Hey, Scripting Guy! How Do I Count the Number of Words in a Group of Office Word Documents?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a folder that contains a number of Microsoft Word documents that are all related to a specific project. I need to know the total number of words in all of those Word documents, because the project is billable based upon the number of words. In the past, I used Explorer to open each file to get the word count of each Microsoft Word document, and then I added the counts up with Calculator. This was a time-consuming and tedious process. But when you are dealing with money, you do not mind taking a little extra effort. The problem is that this new project is very large, with hundreds of document files. To make matters worse, the project folder also has Microsoft Excel files, Microsoft Project files, bitmap images, and assorted Notepad text files. It is a mess, and I do not want to waste several hours on this if I can help it. Can you help me please?

– WS

Hey, Scripting Guy! Answer

Hello WS,

Microsoft Scripting Guy Ed Wilson here. You know, contrary to popular belief, Scripting Guys are not paid by the word. A long time ago when I wrote for a newspaper, I was paid by the word. I have also been paid by the word for magazine articles, so your question is not something that is completely off the wall. In addition to being paid by the word, when I write a proposal for a new book project, I must supply an estimated word count for the new book. This is used to determine the estimated page count, which the book publisher can use as a guide for setting the cost of the new book. When I am writing the book, if my chapters are tending to be either too long or too short, my developmental editor will start to get excited.

Currently, I am not too excited. I am sipping a cup of English breakfast tea with a cinnamon stick in it and a bit of lemon grass. I am listening to “Travelin’ Blues” by Dave Brubeck on my Zune.

Truth in Blogging notice to comply with recent USA FTC requirements: I purchased my Zune using my own money. It was not a gift, a bribe, or a prize for blogging about the device. I also purchased the Scripting Wife’s Zune. I bought both of these devices because they are cool!

The grass outside is damp from the evening’s rain, and patches of fog lounge on the lawn like deer near a stream in the deep woods. While listening to “Linus and Lucy,” also by Dave Brubeck, I wrote the CountWordsInWord.ps1 script seen here.

CountWordsInWord.ps1

Function Set-Variables
{
 $folderpath = "c:\fso\*"
 $fileTypes = "*.docx","*.doc"
 $confirmConversion = $false
 $readOnly = $true
 $addToRecent = $false
 $passwordDocument = "password"
 $wordCountFile = "C:\fso\wordCount.csv"
 $numberOfWords = 0
 Set-OutputFile
} #end Set-Variables

Function Set-OutputFile
{
 # Start with a fresh output file that contains only the CSV header
 if(Test-Path -path $wordCountFile)
   { Remove-Item -path $wordCountFile }
 "name,wordCount" >> $wordCountFile
 Get-WordDocuments
} #end Set-OutputFile

Function Get-WordDocuments
{
 "Counting Words in Word Docs in $folderPath"
 $word = New-Object -ComObject word.application
 $word.visible = $false
 Get-ChildItem -path $folderpath -include $fileTypes |
 foreach-object `
  {
   $path =  ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
   $doc = $word.documents.open($_.fullname, $confirmConversion, $readOnly,
   $addToRecent,   $passwordDocument)
   # Append one "name,wordCount" row per document
   "$($_.name), $($doc.words.count)"  >> $wordCountFile
   $doc.close()
  } #end Foreach-Object
 $word.Quit()
 Get-WordCount
} #end Get-WordDocuments

Function Get-WordCount
{
 # @() keeps the result an array even when the CSV holds a single row
 $wdcsv = @(import-csv -path $wordCountFile)
 for ($i = 0 ; $i -le $wdcsv.length -1 ; $i++)
 {
  $numberOfWords += [int32]$wdcsv[$i].wordCount
 }
 $numberOfWords
} #end Get-WordCount

# *** Entry Point to Script ***

Set-Variables

The CountWordsInWord.ps1 script contains a series of functions, each of which calls another function until the word count report is produced. This is not done to promote code reuse (which is one reason for writing functions), but to make the script easier to read and to understand (another reason for writing functions). You will notice the script is a bit complicated. If a similar script were written using VBScript, it would not be significantly different. The main advantage of using Windows PowerShell for this script is that the Get-ChildItem cmdlet is easier to use than the FileSystemObject. We have a few other advantages due to the more compact syntax of Windows PowerShell over VBScript. The complexity of the script is due to using the Microsoft Word automation objects to access the word count of the Word documents.

The first function in the CountWordsInWord.ps1 script is called Set-Variables. As the name implies, it sets the values of variables that will be used in the script. A variable table is a good way to keep track of a large number of variables in a script. The variable table for the CountWordsInWord.ps1 script is seen in Table 1.

Table 1  Variable Table

| Variable | Initial value | Use in script |
| --- | --- | --- |
| $folderpath | "c:\fso\*" | Path to search for Word documents. |
| $fileTypes | "*.docx","*.doc" | Two types of Word documents to look for. |
| $confirmConversion | $false | Used by Open method of Word. Do not prompt if conversion is needed. |
| $readOnly | $true | Used by Open method of Word. Open document read-only. |
| $addToRecent | $false | Used by Open method of Word. Do not add to recently used. |
| $passwordDocument | "password" | Used by Open method of Word. The password to use for any password-protected documents. |
| $wordCountFile | "C:\fso\wordCount.csv" | The path for the output file that holds file names and word count. |
| $numberOfWords | 0 | Used to hold the total number of words for all Word documents. |

After all of the variables have been initialized, the Set-Variables function calls the Set-OutputFile function. Because Set-OutputFile is called from within Set-Variables, all of the variables are available to it: the Set-Variables scope becomes the parent scope of the Set-OutputFile function. Variables that are created in the Set-Variables scope can be read there and in every child scope. If the variables were marked as private, they would be available only within the Set-Variables function. WS, you will need to modify the values of the $folderpath and $wordCountFile variables to match your computer. The Set-Variables function is seen here:

Function Set-Variables
{
 $folderpath = "c:\fso\*"
 $fileTypes = "*.docx","*.doc"
 $confirmConversion = $false
 $readOnly = $true
 $addToRecent = $false
 $passwordDocument = "password"
 $wordCountFile = "C:\fso\wordCount.csv"
 $numberOfWords = 0
 Set-OutputFile
} #end Set-Variables
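The scoping behavior described above can be tried in isolation. The following is a minimal sketch, not part of the CountWordsInWord.ps1 script: the function names Set-Parent and Get-Child and the variables $greeting and $secret are made up for the illustration. It shows that a function called from another function can read the caller's variables, that a private variable stays hidden from it, and that an assignment in the child does not change the parent's copy.

```powershell
# Hypothetical names, used only to illustrate parent and child scopes.
function Get-Child
{
    # Reading $greeting finds the caller's variable, because the caller's
    # scope is the parent of this function's scope.
    "child sees: $greeting"
    # $secret was created as private in the parent, so it is empty here.
    "child sees secret: '$secret'"
    # Assigning creates a new local variable; the parent's value is untouched.
    $greeting = "changed in child"
}

function Set-Parent
{
    $greeting = "hello from parent"
    $private:secret = "hidden"
    Get-Child
    "parent still has: $greeting"
}

Set-Parent
```

This is also why the Get-WordCount function in the full script can add to $numberOfWords: the first read finds the value 0 from Set-Variables, and the running total then lives in a local copy.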

In the Set-OutputFile function, the path to the $wordCountFile file is checked. If the file exists, it is deleted. A new file is then created by using the redirection arrows. The header for the wordCount.csv file is name and wordCount. The header for the file is written at the same time that the file is created. After creating the wordCount.csv file, the Set-OutputFile function calls the Get-WordDocuments function. The Set-OutputFile function is shown here:

Function Set-OutputFile
{
 if(Test-Path -path $wordCountFile)
   { Remove-Item -path $wordCountFile }
 "name,wordCount" >> $wordCountFile
 Get-WordDocuments
} #end Set-OutputFile
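If you want to see how the header and the appended rows fit together without opening Word, the pattern can be sketched against a throwaway file. Everything specific in this sketch is an assumption made for the illustration: the file name wordCountDemo.csv, its location in the temp folder, and the two made-up rows. The reading and totaling at the end mirrors what the Get-WordCount function in the full script does.

```powershell
# Assumption: a throwaway file in the temp folder stands in for C:\fso\wordCount.csv.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) "wordCountDemo.csv"

# Same pattern as Set-OutputFile: delete any old file, then write the header.
if (Test-Path -Path $csvPath) { Remove-Item -Path $csvPath }
"name,wordCount" >> $csvPath

# Made-up rows in the same "name,wordCount" shape the script appends per document.
"first.docx,120" >> $csvPath
"second.doc,80" >> $csvPath

# Import-Csv turns the header into property names on each row object.
# @() keeps the result an array even if the file holds only one row.
$rows = @(Import-Csv -Path $csvPath)
$total = 0
foreach ($row in $rows) { $total += [int32]$row.wordCount }
$total

# Clean up the demo file.
Remove-Item -Path $csvPath
```

Because the header is written first, Import-Csv can expose each count as the wordCount property, which is what makes the later addition straightforward.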

The Get-WordDocuments function is the function that interacts with the Word automation object model. The first thing the Get-WordDocuments function does after displaying a status message on the Windows PowerShell console is create an instance of the word.application object. This object is the main object you use when working with Microsoft Word. The word.application object has a number of methods and properties available to it, which are all documented on MSDN. The visible property of the application object is set to $false because there is no need to pop up a hundred or more Microsoft Word documents when all you want to do is to obtain the word count from the document. This part of the Get-WordDocuments function is seen here:

 "Counting Words in Word Docs in $folderPath"
 $word = New-Object -ComObject word.application
 $word.visible = $false

You can use the Get-ChildItem cmdlet from Windows PowerShell to obtain a listing of all of the Microsoft Word documents in the specified folder. Each of the found files is piped to the ForEach-Object cmdlet, where the full path to the file, minus its extension, is stored in the $path variable. This is seen here:

Get-ChildItem -path $folderpath -include $fileTypes |
 foreach-object `
  {
   $path =  ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
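
The file filtering and the extension trimming can be checked on their own with a scratch folder. This is a sketch under stated assumptions: the folder name fsoDemo, its location in the temp folder, and the four empty files are invented for the test and stand in for the real project folder. Note that the Include parameter takes effect only when the path points at the contents of a folder (it ends in a wildcard) or when Recurse is used, which is why the script's $folderpath ends in an asterisk.

```powershell
# Assumption: a scratch folder in the temp directory stands in for C:\fso.
$demo = Join-Path ([System.IO.Path]::GetTempPath()) "fsoDemo"
New-Item -ItemType Directory -Path $demo -Force | Out-Null
"a.docx","b.doc","c.xlsx","d.txt" | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $demo $_) -Force | Out-Null
}

# The trailing wildcard in the path lets -Include filter the folder contents.
$found = Get-ChildItem -Path (Join-Path $demo "*") -Include "*.docx","*.doc" |
  ForEach-Object {
    # Same expression as in the script: everything before the last period.
    ($_.FullName).Substring(0, ($_.FullName).LastIndexOf("."))
  }

# Show only the file names, without folder or extension.
$found | Split-Path -Leaf | Sort-Object

# Clean up the scratch folder.
Remove-Item -Path $demo -Recurse -Force
```

Only the two Word file names come back; the Excel and text files are skipped by the Include filter.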

After the path to the Microsoft Word document has been retrieved, you can use it with the open method from the documents collection. The documents collection object contains the open method and is retrieved by querying the documents property from the word.application object. After the document is open, the
