Hey Scripting Guy! You are not going to believe this. I have just been given the task of counting how many images are embedded in Word documents. The document folder, of course, has hundreds of documents. I am not sure why they want to know how many documents have embedded images, but I think it has something to do with updating a catalog we produce with new images. I know this sounds crazy, but is there anything you can do to help?
– JZ
Hi JZ,
As they say, image is everything. This is true whether you are playing tennis, selling cameras, or looking for images. The main challenge with a task like this is figuring out which object you need to use. When you have found the appropriate object, the script basically writes itself. Here it is:
$folder = "c:\fso\*"
$include = "*.doc","*.docx"
$word = new-object -comobject word.application
$word.visible = $false
Get-ChildItem -path $folder -include $include |
ForEach-Object `
{
  $doc = $word.documents.open($_.fullname)
  $_.name + " has " + $doc.inlineshapes.count + " images in the file"
}
$word.quit()
When we talk about images in Word documents, we mean those images that are embedded in the document itself. In this image, we see a typical Word document with an embedded image:
This particular Word document happens to have an image of an elasmobranch embedded in it. To identify such Word documents, we query the InlineShapes property, which returns an InlineShapes collection object. If we look at the members of the InlineShapes collection object, we find there is a property named Count. This is the jaguar shark we have been so diligently searching for. Let’s dive in. The water is fine. (Cue the creepy music.)
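If you want to look at the members of the InlineShapes collection yourself, something along these lines should work. It is only a sketch: it assumes Word is installed on the computer, and it uses a new blank document (our choice for illustration) so that nothing on disk is touched.

```powershell
# Sketch: list the members of the InlineShapes collection of a blank document.
# Assumes Microsoft Word is installed; the blank document is only for illustration.
$word = New-Object -ComObject Word.Application
$word.Visible = $false

$doc = $word.Documents.Add()

# -InputObject inspects the collection itself rather than enumerating its items,
# which matters here because a blank document has no inline shapes.
Get-Member -InputObject $doc.InlineShapes

# Mark the document as saved so that closing it does not prompt, then clean up.
$doc.Saved = $true
$doc.Close()
$word.Quit()
```

The Count property should appear in the list that Get-Member returns.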
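We begin the script by assigning a string value to the variable $folder. The $folder variable will be fed to the Get-ChildItem cmdlet, and it specifies the folder that will be searched for Word documents. Note that the value of the string has a trailing wildcard character (*). The wildcard matters because the -include parameter of Get-ChildItem takes effect only when the path points to the contents of a folder, or when the -recurse parameter is used. This is seen here: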
$folder = "c:\fso\*"
Next we define the extensions we will look for. We do this by assigning an array of string values to the $include variable. We use *.doc to represent legacy Word documents and *.docx to represent the Word 2007 format. We will use this variable with the Get-ChildItem cmdlet. This line of code is seen here:
$include = "*.doc","*.docx"
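To see how the path and the -include filter work together without touching your real documents, you could try a small experiment like the one below. The temporary folder and the three empty files are made up for the demonstration; only the Get-ChildItem call mirrors the script.

```powershell
# Sketch: show which files -Include selects. The folder and file names are hypothetical.
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("fso_demo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $demo | Out-Null

# Two Word-style file names and one file that should be ignored.
"a.doc", "b.docx", "c.txt" | ForEach-Object {
    New-Item -ItemType File -Path (Join-Path $demo $_) | Out-Null
}

$include = "*.doc", "*.docx"

# The trailing * makes the path refer to the contents of the folder.
$found = Get-ChildItem -Path (Join-Path $demo "*") -Include $include
$found | ForEach-Object { $_.Name }

# Remove the demonstration folder again.
Remove-Item -Path $demo -Recurse -Force
```

Only the .doc and .docx files should be listed; the .txt file is filtered out.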
To work with Word we need to create an instance of the Word.Application object. This is the main object in the Word automation model, and it is always created when working with Word. The application object is very rich and is worth perusing because there is no end to the scripts you can write using its methods and properties. This line of code is seen here:
$word = new-object -comobject word.application
We do not need to see the documents as we work our way through the collection, so we set the visible property to $false. This is seen here:
$word.visible = $false
When you set the visible property to false, it is important to remember to use quit to exit Word. If you do not, you could end up with multiple copies of Word running on your system.
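If you suspect that an earlier run left hidden copies of Word behind, a quick check such as the following can help. This is a sketch that assumes the Word process is named WINWORD, which is the usual process name on Windows.

```powershell
# Sketch: look for Word processes that may have been left running.
# Assumes the process name is WINWORD; -ErrorAction suppresses the error when none exist.
$leftover = @(Get-Process -Name WINWORD -ErrorAction SilentlyContinue)

"Word processes currently running: " + $leftover.Count

# Show the process ID and start time of each one so you can decide what to do with it.
$leftover | Select-Object Id, StartTime
```

The sketch only reports what it finds. It does not stop anything, because one of those processes might be a document you have open.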
We pipeline the results of the Get-ChildItem cmdlet to the ForEach-Object cmdlet. We use Get-ChildItem to search the folder we specified in the $folder variable for the file types we specified in the $include variable. In this example, we are searching the C:\fso folder for all .doc and .docx files. It is possible such a query could return a large result set, so we choose to pipeline the results rather than store them in a variable. This keeps memory use down and allows the ForEach-Object cmdlet to begin processing as soon as the first document is found. This code is seen here:
Get-ChildItem -path $folder -include $include |
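The streaming behavior of the pipeline is easy to see with a toy example. In this sketch the numbers 1 through 3 stand in for files, and the $log list is our own addition to record the order in which things happen.

```powershell
# Sketch: show that each item is processed as soon as it is produced.
# The numbers stand in for files; $log is a hypothetical helper list.
$log = New-Object System.Collections.ArrayList

1..3 |
    ForEach-Object { [void]$log.Add("found $_"); $_ } |
    ForEach-Object { [void]$log.Add("processed $_") }

# The entries alternate: found 1, processed 1, found 2, processed 2, and so on.
$log
```

If the first stage had to finish before the second one started, all the "found" entries would come first.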
We use the ForEach-Object cmdlet to allow us to work with the items as they come across the pipeline. To allow us to match up the curly brackets of the code block, we use line continuation via the back tick. This is seen here:
ForEach-Object `
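The backtick has to be the very last character on the line, which makes it easy to break by accident. If you prefer, you can leave it out by putting the opening curly bracket on the same line as ForEach-Object. The sketch below uses numbers in place of files to show that both layouts behave the same way.

```powershell
# Layout used in the script: backtick continuation, opening bracket on the next line.
$withBacktick = 1..3 |
    ForEach-Object `
    {
        $_ * 2
    }

# Alternative layout: opening bracket on the same line, so no backtick is needed.
$withoutBacktick = 1..3 | ForEach-Object {
    $_ * 2
}

# Both variables now hold the same three values.
$withBacktick -join ","
$withoutBacktick -join ","
```

Which layout you choose is a matter of taste; the script in this article uses the first.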
We have curly brackets to denote our code block. The code within these curly brackets is executed once for each item that comes across the pipeline. The first thing we want to do is open the Word document. We use the documents property to return the documents collection object. The documents collection is not a very large object, but it does have the open method, which is what we need to open a Word document. The open method has a number of interesting parameters, but today we will content ourselves with passing the only required parameter, which is the file name to open. We need to give it the entire path to the document, so we use the fullname property from the System.IO.FileInfo object that was given to us by the Get-ChildItem cmdlet.
When we have the document open, we query the count property of the InlineShapes collection to find out how many images are in it. We print out the name of the file by using the name property. We then use the + symbol to concatenate the string " has ", concatenate again to print out the count of the images, and concatenate yet again the string " images in the file". This sounds worse than it is. Here is the code:
{
  $doc = $word.documents.open($_.fullname)
  $_.name + " has " + $doc.inlineshapes.count + " images in the file"
}
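The script leaves every document open until Word quits, which is fine for a small folder. For hundreds of documents, you may prefer a variation like the sketch below. It is not the script from this article: it assumes Word is installed, passes two of the optional Open parameters (ConfirmConversions and ReadOnly), closes each document after counting, and uses the -f format operator in place of repeated concatenation.

```powershell
# Sketch of a variation on the loop, assuming Word is installed and C:\fso exists.
$folder = "c:\fso\*"
$include = "*.doc", "*.docx"
$word = New-Object -ComObject Word.Application
$word.Visible = $false

Get-ChildItem -Path $folder -Include $include |
ForEach-Object {
    # Open(FileName, ConfirmConversions, ReadOnly): no conversion prompt, read-only.
    $doc = $word.Documents.Open($_.FullName, $false, $true)

    # The format operator fills {0} and {1} with the file name and the image count.
    "{0} has {1} images in the file" -f $_.Name, $doc.InlineShapes.Count

    # Mark the document as saved so Close does not prompt, then close it.
    $doc.Saved = $true
    $doc.Close()
}

$word.Quit()
```

Opening the files as read-only is a precaution, because the script only needs to count and never needs to change anything.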
When we run the script we receive this output:
We are done, so we need to exit the Word application. To do this, we use the quit method, seen here:
$word.quit()
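If a document fails to open partway through, the script stops before it reaches the quit method, and a hidden copy of Word stays running. One way to guard against this is a try/finally block, sketched below. This assumes Windows PowerShell 2.0 or later, where try/finally is available. The ReleaseComObject call is an extra cleanup step that is not part of the original script.

```powershell
# Sketch: make sure Word quits even if something goes wrong in the loop.
# Assumes Word is installed and Windows PowerShell 2.0 or later.
$folder = "c:\fso\*"
$include = "*.doc", "*.docx"
$word = New-Object -ComObject Word.Application
$word.Visible = $false

try {
    Get-ChildItem -Path $folder -Include $include |
    ForEach-Object {
        $doc = $word.Documents.Open($_.FullName)
        $_.Name + " has " + $doc.InlineShapes.Count + " images in the file"
    }
}
finally {
    # The finally block runs whether or not the loop raised an error.
    $word.Quit()

    # Optional extra step: release the COM reference held by PowerShell.
    [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
}
```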
There is good news, JZ. Because you can save so much time on your new assignment, you will have time to go scuba diving. If you are lucky, you will get to see a shark. If you are really lucky, you can get close enough to him to take a nice picture. It is a wonder the picture above turned out as well as it did, considering how much the camera was shaking. See you tomorrow.
Ed Wilson and Craig Liebendorfer, Scripting Guys