Hey, Scripting Guy! I need to be able to use Windows PowerShell 2.0 to count the number of words in a bunch of Microsoft Word documents. All of the documents are stored in a specific folder, but there is a mixture of Microsoft Word 2010 and Microsoft Word 2007 documents and older document types in the same folder. Our company has recently opened an office in South America, and we need to have these documents translated into Spanish. To obtain a quotation from the translation service, I need to tell them how many words are contained in the documents.
If it were just a few documents, I would perform this action manually, but there are a lot of files. I am a Human Resources (HR) manager, and not a real IT pro. I have been reading the Hey, Scripting Guy! posts for years not only because you are funny, but also because scripting makes me a more efficient power user. I have searched the Internet, but have not found a script that will do this. Help me, Scripting Guy! Help!
— AS
Hello AS,
Microsoft Scripting Guy Ed Wilson here. I was out and about this morning, and stopped by to see my friend who owns a scuba shop. He loves to read, and we swap books from time to time. After discussing the merits of various translations of The Three Musketeers, we got down to talking about scuba diving. He is planning a trip to Cozumel, Mexico, in the next few weeks, and he is really looking forward to the trip. I went with his group on the last trip. Sadly, ear problems have me grounded for now and I am going to miss this trip. The diving down there is awesome with visibility that extends for more than 100 feet and water that is around 82 degrees or so this time of year. In addition to some excellent drift diving, there are also some really nice coral formations with white sand and friendly sea creatures such as the giant grouper.
AS, per your request, I wrote the CountNumberOfWordsInDocuments.ps1 script that you described in your email message. The script will retrieve the number of words in documents that are in one or more folders. To use the CountNumberOfWordsInDocuments.ps1 script, you will need to modify the path to the folder. The complete script is shown here.
CountNumberOfWordsInDocuments.ps1
# Counters for the documents processed and the words they contain.
$intDocs = $intWords = $null
$path = "C:\data\ScriptingGuys"
# Start Microsoft Word without showing its window.
$application = New-Object -ComObject word.application
$application.visible = $false
# Find every .doc and .docx file under $path, including subfolders.
Get-ChildItem -Path $path -include *.doc,*.docx -Recurse |
ForEach-Object {
    "Processing $($_.fullname)"
    $document = $application.documents.open($_.fullname)
    $intWords += $document.words.count
    $intDocs ++
    $document.close() | out-null
}
# Shut down Word and release the COM object.
$application.quit()
$application = $null
[gc]::collect()
[gc]::WaitForPendingFinalizers()
"There are $intDocs documents in the $path folder.
They contain a total of $intWords words."
The first thing that needs to be done is to initialize a few variables. The first two variables, $intDocs and $intWords, are counter variables that are used to keep track of the number of documents processed and the number of words that those documents contain. Because it is possible you may wish to run the script more than once, it is a best practice to initialize those variables. The easiest way to do this is to set them equal to $null—each time the script runs, if the variable exists it will be assigned the value $null. If the variable does not exist, it will be created and assigned the value of $null. I love the “equals equals” type of syntax because it is a quick and efficient way of initializing multiple variables. For example, if I had three variables $a, $b, and $c that I wanted to initialize with the value of 1, I could do it as shown here. Notice that if a variable does not exist, such as $d in this example, no value is displayed.
PS C:\> $a=$b=$c=1
PS C:\> $a
1
PS C:\> $b
1
PS C:\> $c
1
PS C:\> $d
PS C:\>
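It can help to see in isolation why counters that start at $null still work. The following is a minimal sketch, with variable names chosen only for illustration. It shows that += and ++ treat $null as zero, and that a counter that is never incremented expands to an empty string in the final message, whereas a counter initialized to 0 prints a number.

```powershell
# Counters initialized to $null, as in the script.
$nullDocs = $nullWords = $null
$nullWords += 250     # $null + 250 evaluates to 250
$nullDocs++           # incrementing $null yields 1

# If nothing is ever counted, a $null counter expands to an empty string...
$emptyCounter = $null
$reportFromNull = "There are $emptyCounter documents."

# ...whereas a counter initialized to 0 still prints a number.
$zeroCounter = 0
$reportFromZero = "There are $zeroCounter documents."

$nullWords
$nullDocs
$reportFromNull
$reportFromZero
```

If the folder might contain no Word documents at all, initializing the counters to 0 instead of $null is one way to keep the closing message readable.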
In addition to initializing the counter variables, I also assign a value for the path that contains all of the document files. This folder contains a large number of subfolders and sub-subfolders. This portion of the script is shown here:
$intDocs = $intWords = $null
$path = "C:\data\ScriptingGuys"
The next thing we must do is create the word.application object. The application object for Microsoft Word is created whenever a script needs to perform automation. This object is used when working with the Microsoft Word application, and therefore has a number of methods and properties that perform things such as opening documents, closing the application, sizing the window, or even creating a list of keyboard shortcuts for Microsoft Word (something we will look at on Wednesday this week).
$application = New-Object -ComObject word.application
After the word.application object has been created and stored in the $application variable, the visible property is set to $false. This will prevent the application from appearing while each document is opened and the number of words counted. The only thing to remember when setting the application object to not visible is that the quit method must be called to ensure that dozens of copies of Winword.exe are not left running and eating up resources. At 50–70 megabytes of memory each, it does not take too long before leftover processes begin to impact performance. By not requiring Microsoft Word to be visible and to display each document as it is processed, the script will run significantly quicker. This line of code is shown here:
$application.visible = $false
The Get-ChildItem cmdlet is used to retrieve all the document files in the folder that is specified in the $path variable. The -recurse parameter is used to include that folder and all of its subfolders, and the -include parameter limits the results to files that match *.doc or *.docx. Each system.io.fileinfo object that is retrieved by the Get-ChildItem cmdlet is piped to the Foreach-Object cmdlet. Because Windows PowerShell passes objects instead of text, each Microsoft Word document that is discovered by the Get-ChildItem cmdlet is represented by an instance of the system.io.fileinfo .NET Framework class. The members of this class are shown here:
PS C:\> Get-Item C:\fso\HSG-6-1-10.Doc | get-member
TypeName: System.IO.FileInfo
Name MemberType Definition
---- ---------- ----------
Mode CodeProperty System.String Mode{get=Mode;}
AppendText Method System.IO.StreamWriter AppendText()
CopyTo Method System.IO.FileInfo CopyTo(string destFil…
Create Method System.IO.FileStream Create()
CreateObjRef Method System.Runtime.Remoting.ObjRef CreateObj…
CreateText Method System.IO.StreamWriter CreateText()
Decrypt Method System.Void Decrypt()
Delete Method System.Void Delete()
Encrypt Method System.Void Encrypt()
Equals Method bool Equals(System.Object obj)
GetAccessControl Method System.Security.AccessControl.FileSecuri…
GetHashCode Method int GetHashCode()
GetLifetimeService Method System.Object GetLifetimeService()
GetObjectData Method System.Void GetObjectData(System.Runtime…
GetType Method type GetType()
InitializeLifetimeService Method System.Object InitializeLifetimeService()
MoveTo Method System.Void MoveTo(string destFileName)
Open Method System.IO.FileStream Open(System.IO.File…
OpenRead Method System.IO.FileStream OpenRead()
OpenText Method System.IO.StreamReader OpenText()
OpenWrite Method System.IO.FileStream OpenWrite()
Refresh Method System.Void Refresh()
Replace Method System.IO.FileInfo Replace(string destin…
SetAccessControl Method System.Void SetAccessControl(System.Secu…
ToString Method string ToString()
PSChildName NoteProperty System.String PSChildName=HSG-6-1-10.Doc
PSDrive NoteProperty System.Management.Automation.PSDriveInfo…
PSIsContainer NoteProperty System.Boolean PSIsContainer=False
PSParentPath NoteProperty System.String PSParentPath=Microsoft.Pow…
PSPath NoteProperty System.String PSPath=Microsoft.PowerShel…
PSProvider NoteProperty System.Management.Automation.ProviderInf…
Attributes Property System.IO.FileAttributes Attributes {get…
CreationTime Property System.DateTime CreationTime {get;set;}
CreationTimeUtc Property System.DateTime CreationTimeUtc {get;set;}
Directory Property System.IO.DirectoryInfo Directory {get;}
DirectoryName Property System.String DirectoryName {get;}
Exists Property System.Boolean Exists {get;}
Extension Property System.String Extension {get;}
FullName Property System.String FullName {get;}
IsReadOnly Property System.Boolean IsReadOnly {get;set;}
LastAccessTime Property System.DateTime LastAccessTime {get;set;}
LastAccessTimeUtc Property System.DateTime LastAccessTimeUtc {get;s…
LastWriteTime Property System.DateTime LastWriteTime {get;set;}
LastWriteTimeUtc Property System.DateTime LastWriteTimeUtc {get;set;}
Length Property System.Int64 Length {get;}
Name Property System.String Name {get;}
BaseName ScriptProperty System.Object BaseName {get=if ($this.Ex…
VersionInfo ScriptProperty System.Object VersionInfo {get=[System.D…
Inside the Foreach-Object cmdlet, the members (both methods and properties) are available to use in processing each Microsoft Word document that has been discovered. In this script, I only need to use the fullname property because it contains the path, the file name, and the file extension. There are four properties I routinely use when working with files: fullname, name, basename, and extension. Using the proper property from a system.io.fileinfo object avoids a number of string concatenation and string splitting issues that used to arise when working in VBScript. Each property is illustrated here:
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).fullname
C:\fso\HSG-6-1-10.Doc
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).name
HSG-6-1-10.Doc
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).basename
HSG-6-1-10
PS C:\> (Get-Item C:\fso\HSG-6-1-10.Doc).extension
.Doc
PS C:\>
To make it easier to keep track of these four properties, I decided to include them in Table 1.
Table 1: Useful FileInfo properties
Property name | Meaning | Example
fullname | Path, filename and extension | C:\fso\HSG-6-1-10.Doc
name | Filename and extension | HSG-6-1-10.Doc
basename | Filename | HSG-6-1-10
extension | Extension | .Doc
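To illustrate how these properties remove the need for string splitting, here is a small sketch that builds a name for a translated copy of a file. The temporary folder, the sample file, and the ".es" suffix are all assumptions made for the example. They are not part of the script.

```powershell
# Create a temporary stand-in for a Word document (hypothetical name).
$folder = Join-Path ([System.IO.Path]::GetTempPath()) ("FileInfoDemo_" + [guid]::NewGuid().ToString("N"))
New-Item -Path $folder -ItemType Directory | Out-Null
$file = New-Item -Path (Join-Path $folder "HSG-6-1-10.doc") -ItemType File

# Compose "<basename>.es<extension>" next to the original file.
# The ".es" suffix is an assumed naming convention for the Spanish copy.
$translatedName = $file.BaseName + ".es" + $file.Extension
$translatedPath = Join-Path $file.DirectoryName $translatedName

$translatedName
$translatedPath

# Remove the temporary folder again.
Remove-Item -Path $folder -Recurse -Force
```

No character counting or splitting on the period is needed, because the basename and extension properties already hold the two halves of the file name.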
The section of the script that retrieves the Office 2007 and later Microsoft Word documents (.docx extension) and the legacy Microsoft Word documents (.doc extension) is shown here:
Get-ChildItem -Path $path -include *.doc,*.docx -Recurse |
ForEach-Object {
"Processing $($_.fullname)"
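If you want to check what the filter selects before pointing the script at real documents, a sketch such as the following can be run safely. It builds a throwaway folder tree with hypothetical file names under the temporary folder, and uses no Word automation. In Windows PowerShell, the -include parameter is documented as taking effect only when it is combined with -recurse or with a wildcard path, so the two are kept together here as they are in the script.

```powershell
# Build a throwaway folder tree (hypothetical file names) to test the filter against.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("WordCountDemo_" + [guid]::NewGuid().ToString("N"))
$sub  = Join-Path $root "Archive"
New-Item -Path $sub -ItemType Directory -Force | Out-Null
"a.doc", "b.docx", "notes.txt" | ForEach-Object {
    New-Item -Path (Join-Path $root $_) -ItemType File | Out-Null
}
New-Item -Path (Join-Path $sub "c.docx") -ItemType File | Out-Null

# Same filter as the script; folders are skipped so that only files are counted.
$documents = @(Get-ChildItem -Path $root -Include *.doc,*.docx -Recurse |
    Where-Object { -not $_.PSIsContainer })

$documents | ForEach-Object { "Processing $($_.FullName)" }
$documentCount = $documents.Count
$documentCount

# Remove the throwaway folder again.
Remove-Item -Path $root -Recurse -Force
```

The text file is ignored, and the document in the Archive subfolder is picked up because of -recurse.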
When opening a document, the complete path to the file is required; therefore, the fullname property is used with the open method from the documents object. The returned document object is stored in the $document variable. This is shown here:
$document = $application.documents.open($_.fullname)
The count property from the words collection object of the document object is used to determine how many words are in the document. (Word also counts punctuation marks and paragraph marks as items in the words collection, so this figure runs somewhat higher than the one shown in the Word Count dialog box.) The $intWords variable keeps a running tally of all the words from all of the documents that have been processed so far. To do this, the += operator is used. I love using the += operator because it is much easier than the way that VBScript required, which was to use i = i + wordCount. Using the += operator is illustrated here:
PS C:\> $i = 5
PS C:\> $i
5
PS C:\> $i += 5
PS C:\> $i
10
PS C:\>
If you do not like using the += operator, you can still use the VBScript way of incrementing a counter. (I know this is not really just VBScript syntax, and that lots of other programming languages use it. I am calling it the VBScript way of doing things for convenience and to distinguish it from the += operator.) This is shown here:
PS C:\> $i = 5
PS C:\> $i = $i + 5
PS C:\> $i
10
PS C:\>
The word count is stored in the $intWords variable, as shown here:
$intWords += $document.words.count
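When the quotation has to match the figure in Word's Word Count dialog box, the ComputeStatistics method of the document object is one alternative to words.count, because the words collection also counts punctuation marks and paragraph marks. The following is a sketch only. Get-DocumentWordCount is a hypothetical helper name, and the function assumes it is handed an open Word document object.

```powershell
# wdStatisticWords from Word's WdStatistic enumeration has the value 0.
$wdStatisticWords = 0

function Get-DocumentWordCount {
    # Hypothetical helper: $Document is assumed to be an open Word document object,
    # such as the value returned by $application.documents.open(...).
    param($Document)
    $Document.ComputeStatistics($wdStatisticWords)
}
```

Inside the loop, the tally line would then become $intWords += Get-DocumentWordCount $document. It is worth comparing the result against the Word Count dialog box for one or two documents before relying on it for a quotation.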
A running count of all the documents that have been processed in the script is stored in the $intDocs variable. If I love using the += operator, I really love using the ++ operator. This operator will take the current value stored in a variable and increment it by one, and then store the value back in the variable. This is illustrated here:
PS C:\> $i = 5
PS C:\> $i ++
PS C:\> $i
6
PS C:\>
The ++ operator is the same as stating that i = i + 1. This is shown here:
PS C:\> $i = 5
PS C:\> $i = $i + 1
PS C:\> $i
6
PS C:\>
The ++ operator is much more compact. Its use in the script is shown here:
$intDocs ++
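The two operators can be tried together without opening Word. In this sketch, the per-document word counts are made-up numbers that stand in for $document.words.count, and Measure-Object is shown as a cross-check on the running totals.

```powershell
# Hypothetical per-document word counts standing in for $document.words.count.
$wordCounts = 1200, 850, 430

$intDocs = $intWords = 0
foreach ($count in $wordCounts) {
    $intWords += $count   # running total of words
    $intDocs++            # one more document processed
}

"There are $intDocs documents containing a total of $intWords words."

# Cross-check: Measure-Object reports the same count and sum.
$measured = $wordCounts | Measure-Object -Sum
$measured.Count
$measured.Sum
```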
The close method from the document object is used to close the current document. To prevent cluttering the screen, the results are piped to the Out-Null cmdlet. This is shown here:
$document.close() | out-null
}
The quit method of the application object is called. Next, the $application variable is set to $null, and garbage collection is called. This is shown here:
$application.quit()
$application = $null
[gc]::collect()
[gc]::WaitForPendingFinalizers()
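If a document fails to open partway through a run, the quit method is never reached and Winword.exe is left running. One way to guard against that is a try/finally block. The following is a sketch under assumptions: Invoke-WithWordCleanup is a hypothetical helper name, and the explicit ReleaseComObject call is an optional extra that the original script does not use.

```powershell
function Invoke-WithWordCleanup {
    # Hypothetical helper: runs $Action with the application object and
    # always calls quit afterwards, even if $Action throws.
    param($Application, [scriptblock]$Action)
    try {
        & $Action $Application
    }
    finally {
        $Application.quit()
        # Release the COM reference only when the object really is a COM object.
        if ([System.Runtime.InteropServices.Marshal]::IsComObject($Application)) {
            [System.Runtime.InteropServices.Marshal]::ReleaseComObject($Application) | Out-Null
        }
        [gc]::Collect()
        [gc]::WaitForPendingFinalizers()
    }
}
```

The document loop would go inside the script block that is passed as -Action, with the word.application object passed as -Application.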
The number of documents and the number of words are displayed on the screen. This is shown here:
"There are $intDocs documents in the $path folder.
They contain a total of $intWords words."
AS, that is all there is to using Microsoft Word to count the number of words contained in DOC files. Microsoft Office Week will continue tomorrow.
If you want to know exactly what we will be looking at tomorrow, follow us on Twitter or Facebook. If you have any questions, send e-mail to us at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson and Craig Liebendorfer, Scripting Guys