June 21st, 2010

Hey, Scripting Guy! How Can I Count the Words in a Bunch of Microsoft Word Documents?


 

Hey, Scripting Guy! Question

Hey, Scripting Guy! I need to be able to use Windows PowerShell 2.0 to count the number of words in a bunch of Microsoft Word documents. All of the documents are stored in a specific folder, but there is a mixture of Microsoft Word 2010 documents, Microsoft Word 2007 documents, and older document types in the same folder. Our company has recently opened an office in South America, and we need to have these documents translated into Spanish. To obtain a quotation from the translation service, I need to tell them how many words are contained in the documents.

If it were just a few documents, I would perform this action manually, but there are a lot of files. I am a Human Resources (HR) manager, and not a real IT pro. I have been reading the Hey, Scripting Guy! posts for years not only because you are funny, but also because scripting makes me a more efficient power user. I have searched the Internet, but have not found a script that will do this. Help me, Scripting Guy! Help!

— AS

 

Hey, Scripting Guy! Answer

Hello AS,

Microsoft Scripting Guy Ed Wilson here. I was out and about this morning, and stopped by to see my friend who owns a scuba shop. He loves to read, and we swap books from time to time. After discussing the merits of various translations of the Three Musketeers, we got down to talking about scuba diving. He is planning a trip to Cozumel, Mexico, in the next few weeks, and he is really looking forward to the trip. I went with his group on the last trip. Sadly, ear problems have me grounded for now and I am going to miss this trip. The diving down there is awesome with visibility that extends for more than 100 feet and water that is around 82 degrees or so this time of year. In addition to some excellent drift diving, there are also some really nice coral formations with white sand and friendly sea creatures such as the giant grouper seen here.

Photo of giant grouper

AS, per your request, I wrote the CountNumberOfWordsInDocuments.ps1 script that you described in your email message. The script will retrieve the number of words in documents that are in one or more folders. To use the CountNumberOfWordsInDocuments.ps1 script, you will need to modify the path to the folder. The complete script is shown here.

CountNumberOfWordsInDocuments.ps1

# Counters for the documents processed and the total words found
$intDocs = $intWords = $null
# Folder to search; change this to the folder that holds your documents
$path = "C:\data\ScriptingGuys"
# Start Microsoft Word in the background
$application = New-Object -ComObject word.application
$application.visible = $false
Get-ChildItem -Path $path -include *.doc,*.docx -Recurse |
ForEach-Object {
    "Processing $($_.fullname)"
    $document = $application.documents.open($_.fullname)
    $intWords += $document.words.count
    $intDocs ++
    $document.close() | out-null
}
# Shut down Microsoft Word and release the COM object
$application.quit()
$application = $null
[gc]::collect()
[gc]::WaitForPendingFinalizers()
"There are $intDocs documents in the $path folder.
They contain a total of $intWords words."

The first thing that needs to be done is to initialize a few variables. The first two variables, $intDocs and $intWords, are counter variables that keep track of the number of documents processed and the number of words that those documents contain. Because you may wish to run the script more than once, it is a best practice to initialize those variables. The easiest way to do this is to set them equal to $null. Each time the script runs, if the variable exists, it will be assigned the value $null. If the variable does not exist, it will be created and assigned the value of $null. I love the “equals equals” type of syntax because it is a quick and efficient way of initializing multiple variables. For example, if I had three variables $a, $b, and $c that I wanted to initialize with the value of 1, I could do it as shown here. Notice that if a variable does not exist, such as $d in this example, no value is displayed.

PS C:\> $a=$b=$c=1
PS C:\> $a
1
PS C:\> $b
1
PS C:\> $c
1
PS C:\> $d
PS C:\>
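Why $null works as a starting value for the two counters can be sketched as follows. This is only an illustration: the word counts 250 and 100 are made-up numbers, not values from real documents.

```powershell
# Start both counters at $null, as the script does.
$intDocs = $intWords = $null

# Adding a number to $null yields that number, so += works from the first document on.
$intWords += 250      # hypothetical word count for a first document
$intWords += 100      # hypothetical word count for a second document

# Incrementing $null yields 1, so ++ also works from the first document on.
$intDocs++
$intDocs++

"Documents: $intDocs, words: $intWords"
```

One thing to keep in mind: if no documents are found, both counters remain $null and the final message shows blanks instead of zeros. Initializing the counters to 0 instead of $null would avoid that.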

In addition to initializing the counter variables, I also assign a value for the path that contains all of the document files. This folder, which is shown in the following image, contains a large number of subfolders and sub-subfolders. This portion of the script is shown here:

$intDocs = $intWords = $null
$path = "C:\data\ScriptingGuys"

Image of folder containing document files

The next thing we must do is create the word.application object. The application object for Microsoft Word is created whenever a script needs to perform automation. This object is used when working with the Microsoft Word application, and therefore has a number of methods and properties that perform things such as opening documents, closing the application, sizing the window, or even creating a list of keyboard shortcuts for Microsoft Word (something we will look at on Wednesday this week).

$application = New-Object -ComObject word.application

After the word.application object has been created and stored in the $application variable, the visible property is set to $false. This will prevent the application from appearing while each document is opened and the number of words counted. The only thing to remember when setting the application object to not visible is that the quit method must be called to ensure that dozens of copies of Winword.exe are not left running and eating up resources. At 50–70 megabytes of memory each, it does not take too long before leftover processes begin to impact performance. By not requiring Microsoft Word to be visible and to display each document as it is processed, the script will run significantly quicker. This line of code is shown here:

$application.visible = $false
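Because a hidden copy of Microsoft Word is easy to forget about, one option is to wrap the work in try/finally so that the quit method is called even when a document fails to open. The following is a minimal sketch under stated assumptions, not part of the original script: the function name Invoke-WithWord and its -Application parameter are hypothetical names introduced for illustration.

```powershell
# Hypothetical helper (not part of the original script): runs a script block
# against a hidden Word instance and always calls Quit afterward.
function Invoke-WithWord {
    param(
        [scriptblock]$Action,
        # Assumption: callers may pass in their own application object;
        # by default a new Microsoft Word instance is started.
        $Application = (New-Object -ComObject word.application)
    )
    $Application.Visible = $false
    try {
        # Hand the application object to the caller's code.
        & $Action $Application
    }
    finally {
        # Runs whether or not the script block raised an error.
        $Application.Quit()
    }
}
```

With a helper like this, the document loop would go inside the script block, for example Invoke-WithWord -Action { param($word) $word.documents.open("C:\fso\HSG-6-1-10.Doc") }.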

The Get-ChildItem cmdlet is used to retrieve all the document files in one or more folders that are specified in the $path variable. The -Include parameter limits the results to files that match *.doc or *.docx, and the -Recurse parameter extends the search from the folder into all of its subfolders. Each system.io.fileinfo object that is retrieved by the Get-ChildItem cmdlet is piped to the Foreach-Object cmdlet. Because Windows PowerShell passes objects instead of text, each Microsoft Word document that is discovered by the Get-ChildItem cmdlet is represented by an instance of the system.io.fileinfo .NET Framework class. The members of this class are shown here:

PS C:\> Get-Item C:\fso\HSG-6-1-10.Doc | get-member

   TypeName: System.IO.FileInfo

Name                      MemberType     Definition
----                      ----------     ----------
Mode                      CodeProperty   System.String Mode{get=Mode;}
AppendText                Method         System.IO.StreamWriter AppendText()
CopyTo                    Method         System.IO.FileInfo CopyTo(string destFil…
Create                    Method         System.IO.FileStream Create()
CreateObjRef              Method         System.Runtime.Remoting.ObjRef CreateObj…
CreateText                Method         System.IO.StreamWriter CreateText()
Decrypt                   Method         System.Void Decrypt()
Delete                    Method         System.Void Delete()
Encrypt                   Method         System.Void Encrypt()
Equals                    Method         bool Equals(System.Object obj)
GetAccessControl          Method         System.Security.AccessControl.FileSecuri…
GetHashCode               Method         int GetHashCode()
GetLifetimeService        Method         System.Object GetLifetimeService()
GetObjectData             Method         System.Void GetObjectData(System.Runtime…
GetType                   Method         type GetType()
InitializeLifetimeService Method         System.Object InitializeLifetimeService()
MoveTo                    Method         System.Void MoveTo(string destFileName)
Open                      Method         System.IO.FileStream Open(System.IO.File…
OpenRead                  Method         System.IO.FileStream OpenRead()
OpenText                  Method         System.IO.StreamReader OpenText()
OpenWrite                 Method         System.IO.FileStream OpenWrite()
Refresh                   Method         System.Void Refresh()
Replace                   Method         System.IO.FileInfo Replace(string destin…
SetAccessControl          Method         System.Void SetAccessControl(System.Secu…
ToString                  Method         string ToString()
PSChildName               NoteProperty   System.String PSChildName=HSG-6-1-10.Doc
PSDrive                   NoteProperty   System.Management.Automation.PSDriveInfo…
PSIsContainer             NoteProperty   System.Boolean PSIsContainer=False
PSParentPath              NoteProperty   System.String PSParentPath=Microsoft.Pow…
PSPath                    NoteProperty   System.String PSPath=Microsoft.PowerShel…
PSProvider                NoteProperty   System.Management.Automation.ProviderInf…
Attributes                Property       System.IO.FileAttributes Attributes {get…
CreationTime              Property       System.DateTime CreationTime {get;set;}
CreationTimeUtc           Property       System.DateTime CreationTimeUtc {get;set;}
Directory                 Property       System.IO.DirectoryInfo Directory {get;}
DirectoryName             Property       System.String DirectoryName {get;}
Exists                    Property       System.Boolean Exists {get;}
Extension                 Property       System.String Extension {get;}
FullName                  Property       System.String FullName {get;}
IsReadOnly                Property       System.Boolean IsReadOnly {get;set;}
LastAccessTime            Property       System.DateTime LastAccessTime {get;set;}
LastAccessTimeUtc         Property       System.DateTime LastAccessTimeUtc {get;s…
LastWriteTime             Property       System.DateTime LastWriteTime {get;set;}
LastWriteTimeUtc          Property       System.DateTime LastWriteTimeUtc {get;set;}
Length                    Property       System.Int64 Length {get;}
Name                      Property       System.String Name {get;}
BaseName                  ScriptProperty System.Object BaseName {get=if ($this.Ex…
VersionInfo               ScriptProperty System.Object VersionInfo {get=[System.D…
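To see what the Get-ChildItem filter returns without touching real documents, it can be tried against a throwaway folder. This is a sketch under stated assumptions: the folder name WordCountDemo_..., the subfolder Sub, and the file names a.doc, b.docx, and notes.txt are all made up for the test, and the files contain placeholder text instead of real Microsoft Word content.

```powershell
# Build a temporary folder tree with hypothetical file names.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("WordCountDemo_" + [guid]::NewGuid().ToString('N'))
$sub  = Join-Path $root 'Sub'
New-Item -ItemType Directory -Path $sub -Force | Out-Null
'x' | Set-Content -Path (Join-Path $root 'a.doc')
'x' | Set-Content -Path (Join-Path $sub  'b.docx')
'x' | Set-Content -Path (Join-Path $sub  'notes.txt')

# Same filter as the script: only .doc and .docx files, in the folder and its subfolders.
$found = @(Get-ChildItem -Path $root -Include *.doc,*.docx -Recurse)
$found | ForEach-Object { $_.FullName }

# Remove the temporary folder again.
Remove-Item -Path $root -Recurse -Force
```

The text file is skipped, and the .docx file in the subfolder is picked up because of -Recurse.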

 

Inside the Foreach-Object cmdlet, the members (both methods and properties) are available to use in processing each Microsoft Word document that has been discovered. In this script, I only need to use the fullname property because it contains the path, the file name, and the file extension. There are four properties I routinely use when working with files: fullname, name, basename, and extension. Using the appropriate property from a system.io.fileinfo object avoids a number of the string concatenation and string splitting chores that used to arise when working in VBScript. Each property is illustrated here:

PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).fullname
C:\fso\HSG-6-1-10.Doc
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).name
HSG-6-1-10.Doc
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).basename
HSG-6-1-10
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).extension
.Doc
PS C:\>

To make it easier to keep track of these four properties, I decided to include them in Table 1.

Table 1: Useful FileInfo properties

Property name   Meaning                         Example
-------------   -------                         -------
fullname        Path, filename and extension    C:\fso\HSG-6-1-10.Doc
name            Filename and extension          HSG-6-1-10.Doc
basename        Filename                        HSG-6-1-10
extension       Extension                       .Doc

The section of the script that retrieves the Office 2007 and later Microsoft Word documents (.docx extension) and the legacy Microsoft Word documents (.doc extension) is shown here:

Get-ChildItem -Path $path -include *.doc,*.docx -Recurse |
ForEach-Object {
"Processing $($_.fullname)"

When opening a document, the complete path to the file is required; therefore, the fullname property is used with the open method from the documents object. The returned document object is stored in the $document variable. This is shown here:

$document = $application.documents.open($_.fullname)

The count property from the words collection object of the document object is used to determine how many words are in the document. The $intWords variable keeps a running tally of the words from all of the documents that have been processed so far. To do this, the += operator is used. I love using the += operator because it is much easier than the way that VBScript required, which was to use i = i + wordCount. Using the += operator is illustrated here:

PS C:\> $i = 5
PS C:\> $i
5
PS C:\> $i += 5
PS C:\> $i
10
PS C:\>

If you do not like using the += operator, you can still use the VBScript way of incrementing a counter. (I know this is not really just VBScript syntax, and that lots of other programming languages use it. I am calling it the VBScript way for convenience, and to distinguish it from the += operator.) This is shown here:

PS C:\> $i = 5
PS C:\> $i = $i + 5
PS C:\> $i
10
PS C:\>

The word count is stored in the $intWords variable, as shown here:

$intWords += $document.words.count
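One caveat for a translation quote: as far as I know, the words collection counts punctuation marks and paragraph marks as items, so words.count tends to come out higher than the word count that Microsoft Word itself reports. The following sketch, which is not part of the original script, uses the document object's ComputeStatistics method instead. It assumes that 0 is the numeric value of the wdStatisticWords constant, and the function name Get-DocumentWordCount is a hypothetical name introduced for illustration.

```powershell
# Hypothetical helper: returns the word count that Microsoft Word reports for an open document.
function Get-DocumentWordCount {
    param($Document)
    # Assumption: 0 is the numeric value of Word's wdStatisticWords constant.
    $wdStatisticWords = 0
    $Document.ComputeStatistics($wdStatisticWords)
}
```

Inside the Foreach-Object block, the tally line would then become $intWords += Get-DocumentWordCount -Document $document. It is worth comparing both numbers on a few documents before sending the total to the translation service.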

A running count of all the documents that have been processed in the script is stored in the $intDocs variable. If I love using the += operator, I really love using the ++ operator. This operator will take the current value stored in a variable and increment it by one, and then store the value back in the variable. This is illustrated here:

PS C:\> $i = 5
PS C:\> $i ++
PS C:\> $i
6
PS C:\>

The ++ operator is the same as stating that i = i + 1. This is shown here:

PS C:\> $i = 5
PS C:\> $i = $i + 1
PS C:\> $i
6
PS C:\>

The ++ operator is much more compact. Its use in the script is shown here:

$intDocs ++

The close method from the document object is used to close the current document. To prevent cluttering the screen, the results are piped to the Out-Null cmdlet. This is shown here:

$document.close() | out-null

}

The quit method of the application object is called. Next, the $application variable is set to $null, and garbage collection is called. This is shown here:

$application.quit()
$application = $null
[gc]::collect()
[gc]::WaitForPendingFinalizers()
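After the cleanup has run, a quick check can confirm that no copies of Winword.exe were left behind. This is a minimal sketch, and the variable name $leftovers is made up for illustration. Keep in mind that a copy of Microsoft Word you have open yourself will also show up in the count.

```powershell
# List any Microsoft Word processes that are still running (none is not an error).
$leftovers = @(Get-Process -Name winword -ErrorAction SilentlyContinue)
"Winword.exe processes still running: $($leftovers.Count)"
```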

The number of documents and the number of words is displayed on the screen. This is shown here:

"There are $intDocs documents in the $path folder.
They contain a total of $intWords words."

 

AS, that is all there is to using Microsoft Word to count the number of words contained in DOC files. Microsoft Office Week will continue tomorrow.

If you want to know exactly what we will be looking at tomorrow, follow us on Twitter or Facebook. If you have any questions, send e-mail to us at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson and Craig Liebendorfer, Scripting Guys
