November 12th, 2008

Hey, Scripting Guy! How Can I Convert Word Files to PDF Files?

Hey, Scripting Guy! Question

Hey, Scripting Guy! I have a folder full of Word documents. The pointy-headed boss (PHB) is being particularly obtuse this week, and has decided that he would like to have all the Word documents turned into PDF files so everyone can view them. This is despite the fact that a couple of months ago he had us deploy Word viewer, so everyone can view the documents. Geez, one of these days, one of these days. Luckily, we have deployed Word 2007, and have also deployed the PDF add-in. But still, there are thousands of files in there, and it will take at least two weeks to open and save and close all those files. Any ideas?

– MS

Hey, Scripting Guy! Answer

Hi MS,

I have a great idea. Tell the PHB that it will take you at least two weeks to do this project. Bring him into your office, demonstrate creating and saving one file, and then tell him that due to the importance of the project you should not be disturbed. Go spend two weeks on the golf course. Return to work tanned and relaxed, and then run this script:

$wdFormatPDF = 17
$word = New-Object -ComObject word.application
$word.visible = $false
$folderpath = "c:\fso\*"
$fileTypes = "*.docx","*.doc"
Get-ChildItem -path $folderpath -include $fileTypes |
foreach-object `
{
 $path =  ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
 "Converting $path to pdf ..."
 $doc = $word.documents.open($_.fullname)
 $doc.saveas([ref] $path, [ref]$wdFormatPDF)
 $doc.close()
}
$word.Quit()

To begin with, we have a folder such as the one seen in this image. The folder contains .doc files, .docx files, .txt files, and other files:

Image of a folder with a variety of files

 

The first thing we do in the script is create a variable that we will use to tell the Word Document object how to save our file. The value is 17 for a .pdf file. This value comes from the Word enumeration called WdSaveFormat. The enumeration name for a .pdf file is wdFormatPDF, so we decided to use that as our variable name:

$wdFormatPDF = 17
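
If you expect to save in other formats later, one option is to keep a few of the enumeration values in a lookup table. This is only a sketch: the $wdSaveFormat hashtable is a hypothetical name that does not appear in the script, and it lists just a handful of the WdSaveFormat members.

```powershell
# Hypothetical lookup table holding a few WdSaveFormat values
# (illustration only; the script itself needs just the number 17).
$wdSaveFormat = @{
    wdFormatDocument        = 0   # Word 97-2003 document (.doc)
    wdFormatRTF             = 6   # Rich Text Format
    wdFormatDocumentDefault = 16  # Word default format (.docx in Word 2007)
    wdFormatPDF             = 17  # PDF
    wdFormatXPS             = 18  # XPS
}

# Pull out the value the script uses.
$wdFormatPDF = $wdSaveFormat['wdFormatPDF']
$wdFormatPDF
```

The named entries make it easier to see later which format a number stands for.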

When we have created our variable to hold the enumeration value, we need to create an instance of the Word application object. This is the main object and is one that we always need to create if we are working with Word. To create the Word application object, we use the New-Object cmdlet to specify that it is a COM object and save the resulting object into the $word variable. This is illustrated here:

$word = New-Object -ComObject word.application

We do not want to make the Word application visible because we will be iterating through a collection of many documents. To keep the Word application hidden, we set the visible property to $false. This is seen here:

$word.visible = $false

The folder path that contains the .doc and .docx files we want to convert to .pdf files is assigned here as a string. The thing to keep in mind is that we need to include the trailing backslash and star: \*. The -Include parameter of the Get-ChildItem cmdlet is effective only when the path points to the contents of a folder, as the wildcard does here, or when the -Recurse switch is used. The path is seen here:

$folderpath = "c:\fso\*"

We now create an array of file extensions that we will pass to the Get-ChildItem cmdlet. We are interested in .docx and .doc files, and we include each of them in a set of quotation marks. We need to include the * wildcard character to tell Get-ChildItem we are interested in any .doc or .docx files. This is shown here:

$fileTypes = "*.docx","*.doc"

Next we call the Get-ChildItem cmdlet and tell it to find all the file types we stored in the $fileTypes variable, and to search the path we stored in the $folderPath variable. The Get-ChildItem cmdlet returns a collection of items. We take this collection of items (Word documents in this example) and pass them into the pipeline. The bar (|) character is the pipeline character. This code can be seen here:

Get-ChildItem -path $folderpath -include $fileTypes |
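
To see the filtering on its own, without involving Word, you can try it against a throwaway folder. The sketch below is an illustration under assumptions: the temporary folder and the four empty files are made up for the demo and are not part of the script.

```powershell
# Build a throwaway folder with a mix of file types (names are made up for this demo).
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("fso-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demo | Out-Null
'a.doc', 'b.docx', 'c.txt', 'd.pdf' | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $demo $_) | Out-Null
}

# The trailing wildcard lets -Include filter the folder's contents without -Recurse.
$folderpath = Join-Path $demo '*'
$fileTypes  = '*.docx', '*.doc'
$found = @(Get-ChildItem -Path $folderpath -Include $fileTypes | Sort-Object Name)

# List the names that passed the filter.
$names = $found | ForEach-Object { $_.Name }
$names

# Clean up the demo folder.
Remove-Item -Path $demo -Recurse -Force
```

Only the .doc and .docx files should come back; the .txt and .pdf files are skipped.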

The ForEach-Object cmdlet is used to examine items as they come streaming through the pipeline. It hands us one item at a time, much like the Foreach statement does. Because the opening curly bracket of the script block sits on the next line, the backtick character is used for line continuation. This is shown here:

foreach-object `
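
As a small side-by-side illustration (the numbers and variable names here are made up and have nothing to do with Word), the following sketch shows the pipeline form, the statement form, and a layout that avoids the backtick:

```powershell
# Pipeline form: $_ is the current item coming through the pipeline.
$fromPipeline = 1..3 | ForEach-Object { $_ * 2 }

# Statement form: a named loop variable walks the collection.
$fromStatement = foreach ($n in 1..3) { $n * 2 }

# With the opening brace on the same line as ForEach-Object,
# no backtick is needed for line continuation.
$noBacktick = 1..3 | ForEach-Object {
    $_ * 2
}

"$fromPipeline"
"$fromStatement"
"$noBacktick"
```

All three produce the same values; which layout you use is a matter of taste.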

Now we want to retrieve the path to the file and the file name, but not the file extension. The reason we do not want the file extension is that we are going to save the file as a .pdf file. This code is a little hairy. Let's take it kind of slow.

The $_ is an automatic variable that refers to the current item on the pipeline. This is like your enumerator when you use the Foreach statement in either VBScript or Windows PowerShell. We are interested in the fullname property because it includes both the path and the entire file name.

Substring is a string method, and it takes two parameters. The first one is the position at which to start. We use 0, which means to start at the beginning. The second one is the number of characters to return, and we use parentheses to group the operation that calculates it. We again use $_, which is the current item on the pipeline, and the fullname property. Okay, that part is just like the previous part. We now use the lastIndexOf string method, which returns a number that represents where it found the last occurrence of a character. We are looking for the last occurrence of the period, which is used to set off the file extension. When we have the location where that period occurs, we use that number to tell the substring method how many characters to return from our string. This code is seen here:

{ $path =  ($_.fullname).substring(0,($_.FullName).lastindexOf("."))
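
To watch the string handling by itself, you can run it against a plain string. In this sketch the sample path is hypothetical, and the ChangeExtension call is shown only as an alternative from the .NET Framework; it is not what the script uses.

```powershell
# Hypothetical sample path; note that it contains more than one period.
$fullName = 'c:\fso\report.v2.docx'

# The approach from the script: keep everything before the last period.
$path = $fullName.Substring(0, $fullName.LastIndexOf('.'))

# An alternative using .NET: swap the extension in a single call.
$pdfPath = [System.IO.Path]::ChangeExtension($fullName, 'pdf')

$path      # c:\fso\report.v2
$pdfPath   # c:\fso\report.v2.pdf
```

One thing to watch: if a name has no period at all, lastIndexOf returns -1 and substring raises an error. That does not happen in the script, because every file matched by $fileTypes has an extension.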

We print out a friendly message to let us know that the script is actually doing something. The results of the friendly message are seen in the image just below. And the code for creating the friendly message is seen here:

"Converting $path to pdf ..."

Image of the friendly message

 

Next we need to open the Word document. To do this, we use the documents property to return a documents object, and use the open method from the documents object. It takes the path to the file; therefore, we give it the $_ variable, which is the current item on the pipeline, and the fullname property, which includes both the file name and the path. We store the resulting document object in the $doc variable. This is shown here:

$doc = $word.documents.open($_.fullname)

Next we call the saveas method from the document object. The saveas method takes a large number of parameters. Luckily, we only need to use the first two: the path and the document format. The saveas method requires the parameters to be passed by reference, so we use the [ref] type on both the $path string and the $wdFormatPDF number. This is seen here:

$doc.saveas([ref] $path, [ref]$wdFormatPDF)

Because we have saved the document as a .pdf file, we now want to close the document. To do this, we use the close method from the document object that is stored in the $doc variable. After the document is closed and we are done working through the collection of files that were found by the Get-ChildItem cmdlet, it is time to close the Word application as well. To do this, we use the quit method from the Word application object that we stored in the $word variable:

$doc.close()
}
$word.Quit()
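
If a document fails to open or save partway through a long run, the script as written stops before it reaches the quit method, and the hidden Word process is left running. One possible way to guard against that is to wrap the work in try/finally blocks. This is a sketch rather than a tested replacement: it uses the same COM calls and the same c:\fso folder as the script above, and it assumes Word and the PDF add-in are installed.

```powershell
# Sketch only: same Word calls as the script, with cleanup added.
$wdFormatPDF = 17
$word = New-Object -ComObject word.application
$word.visible = $false
try {
    Get-ChildItem -Path "c:\fso\*" -Include "*.docx", "*.doc" | ForEach-Object {
        $path = ($_.FullName).Substring(0, ($_.FullName).LastIndexOf("."))
        "Converting $path to pdf ..."
        $doc = $word.documents.open($_.FullName)
        try {
            $doc.saveas([ref] $path, [ref] $wdFormatPDF)
        }
        finally {
            $doc.close()   # close the document even if the save failed
        }
    }
}
finally {
    $word.Quit()           # always shut down the hidden Word process
}
```

The outer finally block runs whether or not an error occurs, so Word is closed either way.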

You can now see the folder in the image just below, which contains our newly created .pdf files. One .pdf file is created for each of the .doc or .docx files that resides in the folder:

Image of the folder with newly created .pdf files

 

MS, that is all there is to creating .pdf files from both .doc and .docx files. I recommend you put your two weeks of free time to good use, such as with scuba diving or woodworking. Or perhaps you could learn how to make the perfect croque monsieur, travel to Paris, and meet the person of your dreams. Or not. See you tomorrow. Same bat time, same bat channel.

Ed Wilson and Craig Liebendorfer, Scripting Guys
