Summary: Microsoft Scripting Guy, Ed Wilson, talks about using Windows PowerShell to display a progress bar while counting the number of words in Word documents.
Microsoft Scripting Guy, Ed Wilson, is here. One of the things that a writer does is keep track of words. This is not so much because one gets paid by the word (although in the past that was actually true, and it is still true for some magazines). It is more because publishers need to know how many words will be in a book for planning purposes. It goes to the production budget, it goes to planning for shelf space in bookstores, and it influences the cost of editorial services. So it is important to keep track of words.
In Microsoft Word, there is a property on a document object that tells me how many words that document contains. But at the file level, there is not a property. So I need to open each Word document, and maintain a running count. I also, of course, need to find all of the Word documents, and the path to those documents, so I can open them and count the words.
I decided that because I could find documents faster by using the -Filter parameter (see yesterday's post, Using PowerShell to Look for Documents), I would rewrite my standard script for counting the words in Word documents. While I was at it, I also decided to release the variables more carefully, add a progress bar by using Write-Progress, and return objects instead of just plopping stuff out to the Windows PowerShell output pane.
I can tell you that the script is still slow. This is because it involves opening and closing thousands of documents. That is the reason I added a nice progress bar: so I could get an idea of how much longer it would take.
Initialize some variables
The first thing I do is initialize some variables. I specify the path, set a number of variables to $null, and assign a couple of values to other variables. I use the code I wrote yesterday to count the number of documents I need to process. Because this code runs pretty fast, I don’t mind running it to gather the number of documents. This code is shown here:
$path = "E:\Data\ScriptingGuys"
$year = $NumberOfDocs = $NumberOfWords = $null
# Start at 0 because the counter is incremented before each progress update.
$i = 0
$totalDocs = (Get-ChildItem $path -filter "*doc*" -Recurse -file |
Where {$_.BaseName -match '^(HSG|WES|QHF)'}).count
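To see what the Where clause keeps, the same pattern can be tried against a few base names. This is only a sketch: the sample names below are hypothetical, made up for illustration, and are not files from my folders.

```powershell
# Hypothetical base names, for illustration only.
$sampleNames = 'HSG-1-5-15', 'WES-1-10-15', 'QHF-1-7-15', 'Notes-HSG', 'hsg-draft'

# -match is case-insensitive by default, and ^ anchors the pattern
# to the start of the name, so a prefix in the middle does not count.
$matched = $sampleNames | Where-Object { $_ -match '^(HSG|WES|QHF)' }
$matched
```

In this sketch, Notes-HSG is dropped because the prefix is not at the start of the name, and hsg-draft is kept because the match ignores case. If case matters, -cmatch is the case-sensitive operator.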
Standard Word stuff
The next thing I do is standard Word stuff. I need to create an instance of the Word.Application object, and I set Word to be invisible. This will make the script run a lot faster because I am not popping each Word document to the screen:
$word = New-Object -ComObject word.application
$word.visible = $false
Go through folders, then process documents
I only want to process folders that have four characters in the name. This is because my folders are named things like 2014 and 2015 for each year. So I create a filter to find only my annual Word folders. Then I go through each folder and find my Word documents whose names begin with HSG, WES, or QHF. (I talked about this naming convention yesterday.) Here is that code:
Get-ChildItem $path -filter "????" -Directory |
ForEach-Object {
$year = $_.name
Get-ChildItem $_.FullName -filter "*doc*" -Recurse -file |
Where-Object {$_.BaseName -match '^(HSG|WES|QHF)'} |
For each Word document I find that meets my naming convention, I want to increment my progress bar, open the Word document, and get the number of words contained therein. Here is that code:
ForEach-Object {
$i++
Write-Progress -Activity "Processing $($_.BaseName)" `
-PercentComplete (($i / $totalDocs)*100) -Status "Working on $year"
$document = $word.documents.open($_.fullname)
$NumberOfWords += $document.words.count
Because I am tracking my progress through the documents, I need to keep track of which document I am on. This is the $i variable. I increment it, and I use it to calculate my progress through the total number of documents. The $totalDocs variable contains the number of documents I need to process. I also add each document's word count to the $NumberOfWords variable, which keeps track of the number of words per year.
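The progress arithmetic can be checked without opening Word. The following is a minimal sketch, assuming a counter that starts at 0 and a made-up total of four documents. The clamp to 100 is my own addition for safety, because Write-Progress does not accept a PercentComplete value greater than 100.

```powershell
# Hypothetical total, for illustration only.
$totalDocs = 4
$i = 0

$percents = foreach ($doc in 1..$totalDocs) {
    $i++
    # Convert the fraction to a whole percentage and clamp it at 100.
    $percent = [math]::Min(100, [int](($i / $totalDocs) * 100))
    Write-Progress -Activity "Processing document $i" -Status "Working" `
        -PercentComplete $percent
    $percent
}

# Remove the progress bar when the loop is done.
Write-Progress -Activity "Processing" -Completed
$percents
```

With four documents, the percentages step through 25, 50, 75, and 100. If the counter started at 1 and was incremented before the first update, the last document in a small set would push the value past 100.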
Clean up a bit, and output objects
After I add to the running word count, I increment the per-year document count, close my document, release the document object, and remove the document variable. When all of the documents for a year are done, I create an object that contains the number of documents, words, and year. This code is shown here:
$NumberOfDocs ++
$document.close() | out-null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($document) |
Out-Null
Remove-Variable Document }
[PSCustomObject]@{
"NumberOfDocuments" = $NumberOfDocs
"NumberOfWords" = $NumberOfWords
"Year" = $year}
I then set my variables back to null:
$NumberOfDocs = $NumberOfWords = $year = $null
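Because the script returns objects instead of text, the results can be piped to other cmdlets. Here is a small sketch of that idea. The two objects below are hypothetical stand-ins with made-up numbers, shaped like the ones the script emits.

```powershell
# Hypothetical per-year results, for illustration only.
$results = @(
    [PSCustomObject]@{ NumberOfDocuments = 10; NumberOfWords = 12000; Year = '2014' }
    [PSCustomObject]@{ NumberOfDocuments = 12; NumberOfWords = 15000; Year = '2015' }
)

# Add up the word counts across all years.
$totalWords = ($results | Measure-Object -Property NumberOfWords -Sum).Sum

# Show the busiest year first.
$results | Sort-Object NumberOfWords -Descending |
    Format-Table Year, NumberOfDocuments, NumberOfWords

"Total words: $totalWords"
```

In practice, the output of the script itself could be captured in a variable and used in place of $results.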
When I have finished going through all of the year folders, it is time to quit the Word application, release the Word.Application object, and initiate garbage collection. This is shown here:
$word.quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
Remove-Variable Word
[gc]::collect()
[gc]::WaitForPendingFinalizers()
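While the script runs, it displays a progress bar, and it returns one object per year folder with the NumberOfDocuments, NumberOfWords, and Year properties.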
Here is the complete script:
$path = "E:\Data\ScriptingGuys"
$year = $NumberOfDocs = $NumberOfWords = $null
# Start at 0 because the counter is incremented before each progress update.
$i = 0
# Count the matching documents up front so the progress bar has a total.
$totalDocs = (Get-ChildItem $path -filter "*doc*" -Recurse -file |
    Where {$_.BaseName -match '^(HSG|WES|QHF)'}).count
$word = New-Object -ComObject word.application
$word.visible = $false
# Process one four-character (year) folder at a time.
Get-ChildItem $path -filter "????" -Directory |
    ForEach-Object {
        $year = $_.name
        Get-ChildItem $_.FullName -filter "*doc*" -Recurse -file |
            Where-Object {$_.BaseName -match '^(HSG|WES|QHF)'} |
            ForEach-Object {
                $i++
                Write-Progress -Activity "Processing $($_.BaseName)" `
                    -PercentComplete (($i / $totalDocs)*100) -Status "Working on $year"
                $document = $word.documents.open($_.fullname)
                $NumberOfWords += $document.words.count
                $NumberOfDocs++
                $document.close() | out-null
                [System.Runtime.Interopservices.Marshal]::ReleaseComObject($document) |
                    Out-Null
                Remove-Variable Document
            }
        # Emit one object per year, and then reset the per-year counters.
        [PSCustomObject]@{
            "NumberOfDocuments" = $NumberOfDocs
            "NumberOfWords" = $NumberOfWords
            "Year" = $year
        }
        $NumberOfDocs = $NumberOfWords = $year = $null
    }
# Shut down Word and release the COM object.
$word.quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
Remove-Variable Word
[gc]::collect()
[gc]::WaitForPendingFinalizers()
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy