September 7th, 2010

Use PowerShell to Remove Personal Information from Excel

Summary: Learn how to use Windows PowerShell to remove personal and other information from Microsoft Excel workbooks in this Scripting Guys article.

 

Hey, Scripting Guy! Question Hey, Scripting Guy! At times, I really hate where I work. It seems everyone is always scared. They are scared to say what they really think in meetings, they are scared to make decisions, and they are scared to talk to one another. The term walking on eggshells comes to mind; it might be a neat trick, but after a while, it begins to make your arches ache. Management has absolutely no vision, no strategy, and no leadership. It might be a result of having had three layoffs in the last six months, but I believe the general malaise had crept into our culture long before that—the layoffs are, in that regard, merely symptomatic. When you combine scared with no strategy, you get reactivity. We are constantly being pulled in one direction, and then we are told to drop everything and reverse directions. It is as predictable as the tides, and as changeable as the Bay of Fundy. This would be bad enough in a normal company, but in our organization, it is nearly intolerable.

The latest scourge is over “hidden” information in Microsoft Excel workbooks. It seems that someone forwarded an Excel workbook to a customer, and it had “all kinds” of confidential information in the form of comments on various cells. To make matters worse, the person who forwarded the file to the customer claimed she created the Excel workbook, but the owner of the document was someone else. Now we have to review every single Microsoft Excel workbook on our shared files folder and delete everything “that is not part of the spreadsheet.” I am not a Microsoft Excel expert, and I cannot seem to get a complete list of all the files metadata, nor do I have time to personally review thousands of files.

To make matters worse, I feel certain that if I miss something, I will be fired. Is this something that can be scripted?

— EC

 

Hey, Scripting Guy! Answer

Hello EC,

Microsoft Scripting Guy Ed Wilson here. Dude (or dudette), I feel depressed for you. I am glad I do not have to work in such an environment (I absolutely love working for Microsoft because we are the complete opposite of everything you described). I can understand what it would be like because I have been there. Unfortunately, I cannot solve all your problems, but I can at least help you with this latest flail.

I wrote the Remove-ExcelDocumentInformation.ps1 for you. It does exactly what you require. It removes all of the “extra” information (or metadata) from all the Microsoft Excel workbooks in a specific folder. All you will need to do is modify the value of the $path variable. It requires Windows PowerShell 2.0. The RemoveDocumentInformation method is available for Microsoft Word, Microsoft Excel, Microsoft PowerPoint, and possibly other applications in Microsoft Office. The method was first available with 2007 Microsoft Office System. The complete Remove-ExcelDocumentInformation.ps1 script is shown here.

Remove-ExcelDocumentInformation.ps1

$path = “c:\fso”
Add-Type -AssemblyName Microsoft.Office.Interop.Excel
$xlRemoveDocType = “Microsoft.Office.Interop.Excel.XlRemoveDocInfoType” -as [type]
$excelFiles = Get-ChildItem -Path $path -include *.xls, *.xlsx -recurse
$objExcel = New-Object -ComObject excel.application
$objExcel.visible = $false
foreach($wb in $excelFiles)
{
$workbook = $objExcel.workbooks.open($wb.fullname)
“Removing document information from $wb”
$workbook.RemoveDocumentInformation($xlRemoveDocType::xlRDIAll)
$workbook.Save()
$objExcel.Workbooks.close()
}
$objExcel.Quit()

EC, in Microsoft Excel 2010, the Info tab in the File menu shows the metadata associated with a Microsoft Excel file. On the far right side of the following image, you can see the properties such as title, tags, and comments. If you were to scroll down, you could see additional values. Most of these properties can be directly edited from this location.

Image of Info tab in Excel file

The Remove-ExcelDocumentInformation.ps1 script begins by adding the Microsoft.Office.Interop.Excel assembly. The Add-Type cmdlet is used to do this. If this assembly is not loaded, the access to the XLRemoveDocInfoType will not work the first time the script runs in the Windows PowerShell ISE. However, the second time the script runs, it will work. This is because the Windows PowerShell ISE will dynamically load the Microsoft.Office.Interop.Excel assembly! Please keep this in mind because it can make troubleshooting a script horrible. Interestingly enough, yesterday’s script did not require loading the Microsoft.Office.Interop.Excel assembly—strange.

The error that is generated the first time the script runs without loading the Excel interop assembly is shown here:

PS C:\Users\ed.NWTRADERS> C:\data\ScriptingGuys\2010\HSG_9_6_10\Remove-ExcelDocumentInformation.ps1
Removing document information from c:\fso\books.pdf
Exception calling “RemoveDocumentInformation” with “1” argument(s): “The parameter is 
incorrect. (Exception from HRESULT: 0x80070057 (E_INVALIDARG))”
At C:\data\ScriptingGuys\2010\HSG_9_6_10\Remove-ExcelDocumentInformation.ps1:19 char:37
+ $workbook.RemoveDocumentInformation <<<< ($xlRemoveDocType::xlRDIAll)
   
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
   
+ FullyQualifiedErrorId : ComMethodTargetInvocation

As you can see, nothing about a missing assembly is indicated. The only clue is the part about an incorrect parameter. It does not understand our $xlRemoveDocType::xlRDIAll command. Once again, the strange thing is that the syntax we used to create the type is exactly the same as the syntax that was used yesterday.

When troubleshooting a script such as this, it is important to keep in mind the feature of dynamically loading assemblies, and—as you narrow down the problems that need to be fixed—to close and to re-open the Windows PowerShell ISE. I do not know of a way to unload an assembly after it has been loaded, assuming that I had figured that out in the first place. When you make some inconsequential change to a script, run it, and it works, you tend to think, “It is fixed.” Save it, close it out, or share it with someone else, and—it does not work for them.

The command to load the Microsoft.Office.Interop.Excel assembly is shown here:

Add-Type -AssemblyName Microsoft.Office.Interop.Excel

The RemoveDocumentInformation method accepts an enumeration value to tell it which document information to remove. The XlRemoveDocInfoType enumeration specifies 17 different things that can be removed. The enumeration names and values can be obtained by using a script such as the Get-xlConstantAndValue.ps1 script.

Get-xlContstantAndValue.ps1

Add-Type -AssemblyName Microsoft.Office.Interop.Excel
$xl = “Microsoft.Office.Interop.Excel.XlRemoveDocInfoType” -as [type]
[enum]::getNames($xl) | % { “$_” + “`t” + $xl::$_.value__ }

The output from the Get-xlConstantAndValue.ps1 script is shown here:

PS C:\Users\ed.NWTRADERS> C:\data\ScriptingGuys\2010\HSG_9_6_10\Get-xlContstantAndValue.ps1
xlRDIComments 1
xlRDIRemovePersonalInformation 4
xlRDIEmailHeader 5
xlRDIRoutingSlip 6
xlRDISendForReview 7
xlRDIDocumentProperties 8
xlRDIDocumentWorkspace 10
xlRDIInkAnnotations 11
xlRDIScenarioComments 12
xlRDIPublishInfo 13
xlRDIDocumentServerProperties 14
xlRDIDocumentManagementPolicy 15
xlRDIContentType 16
xlRDIDefinedNameComments 18
xlRDIInactiveDataConnections 19
xlRDIPrinterPath 20
xlRDIAll 99

I have written a number of articles about working with enumeration values. For more information about enumeration values including creating your own enums, refer to those articles.

The Get-ChildItem cmdlet is used to retrieve all the Microsoft Excel workbooks in the folder. Next, the excel.application object is created, and the visible property is set to false. This is shown here:

$excelFiles = Get-ChildItem -Path $path -include *.xls, *.xlsx -recurse

$objExcel = New-Object -ComObject excel.application

$objExcel.visible = $false

 

All of the files are collected in the $excelFiles variable, and each workbook is opened. A string is displayed that indicates the fullname of each file that is being processed. This is shown here:

foreach($wb in $excelFiles)

{

$workbook = $objExcel.workbooks.open($wb.fullname)

“Removing document information from $wb”

 

The part of the script, that is unique to this particular operation is the RemoveDocumentInformation method call. This method is present on a workbook object and it accepts an instance of the previously mentioned xlRemoveDocInfoType enumeration. After this method is called, the workbook needs to be saved and closed. This is shown here:

$workbook.RemoveDocumentInformation($xlRemoveDocType::xlRDIAll)

$workbook.Save()

$objExcel.Workbooks.close()

 

The script will then loop back around to the next Microsoft Excel workbook in the collection. After all the files have been processed, the quit method from the excel.application object is called:

$objExcel.Quit()

When the Remove-ExcelDocumentInformation.ps1 script runs, the following output is displayed.

Image of script output

You may want to audit a few of the Microsoft Excel workbooks that are in the path that was supplied to the script. A cleaned Microsoft Excel workbook is shown in the following image.

Image of workbook with cleaned metadata

 

EC, that is all there is to using Windows PowerShell to remove metadata from Microsoft Excel workbooks. Microsoft Excel Week will continue tomorrow when we will talk about copying data from one Microsoft Excel spreadsheet into another spreadsheet and saving the new spreadsheet as a CSV file. It’s a way cool script, and like today’s script, it solves a very real business problem.

We invite you to follow us on Twitter and Facebook. If you have any questions, send email to us at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, take it easy.

 

Ed Wilson and Craig Liebendorfer, Scripting Guys

Author

0 comments

Discussion are closed.

Feedback