{"id":52243,"date":"2009-10-15T00:01:00","date_gmt":"2009-10-15T00:01:00","guid":{"rendered":"https:\/\/blogs.technet.microsoft.com\/heyscriptingguy\/2009\/10\/15\/hey-scripting-guy-how-can-i-list-the-unique-words-from-a-microsoft-office-word-document\/"},"modified":"2009-10-15T00:01:00","modified_gmt":"2009-10-15T00:01:00","slug":"hey-scripting-guy-how-can-i-list-the-unique-words-from-a-microsoft-office-word-document","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/scripting\/hey-scripting-guy-how-can-i-list-the-unique-words-from-a-microsoft-office-word-document\/","title":{"rendered":"Hey, Scripting Guy! How Can I List the Unique Words from a Microsoft Office Word Document?"},"content":{"rendered":"<p><img decoding=\"async\" title=\"Hey, Scripting Guy! Question\" border=\"0\" alt=\"Hey, Scripting Guy! Question\" align=\"left\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/q-for-powertip.jpg\" width=\"34\" height=\"34\"><\/p>\n<p class=\"MsoNormal\">Hey, Scripting Guy! I need to obtain a listing of unique words from a Microsoft Word document. I know that there is the <b>Sort-Object<\/b> cmdlet that can be used to retrieve unique items, and there is the <b>Get-Content<\/b> cmdlet that can read the text of a text file. However, the <b>Get-Content<\/b> cmdlet is not able to read a Microsoft Word document, and I do not think I can use the <b>Sort-Object<\/b> cmdlet to produce a unique listing of words. <\/p>\n<p class=\"MsoNormal\">&#8212; EM<\/p>\n<p><img decoding=\"async\" title=\"Hey, Scripting Guy! Answer\" border=\"0\" alt=\"Hey, Scripting Guy! 
Answer\" align=\"left\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/29\/2019\/02\/a-for-powertip.jpg\" width=\"34\" height=\"34\"><\/p>\n<p class=\"MsoNormal\">Hello EM, <\/p>\n<p class=\"MsoNormal\">Microsoft Scripting Guy Ed Wilson here. I am listening to the Bourbon Street Rag on my Zune, and was daydreaming a bit about my last trip to New Orleans. The really good news is that Tech<span>&#8729;<\/span>Ed 2010 will be held in New Orleans, and (drum roll please) the Microsoft Scripting Guys already have set aside the budget to be there! &#8220;Do you know what it means to miss New Orleans?&#8221; the song continues to amble. Now the upbeat sound of Van Halen is coming from my Zune. Quite the segue! It&rsquo;s a shuffle kind of day. <\/p>\n<p class=\"MsoNormal\">I am having a great day today, and I have responded to several really cool questions sent to the <a href=\"mailto:scripter@microsoft.com\"><font face=\"Segoe\">scripter@microsoft.com<\/font><\/a> e-mail address. EM, your question was really interesting, and I decided to write the GetUniqueWordsFromWord.ps1 script that is shown here. 
<\/p>\n<p class=\"CodeBlockScreenedHead\"><strong>GetUniqueWordsFromWord.ps1<\/strong><font><\/font><\/p>\n<p class=\"CodeBlockScreened\"><span><font><font face=\"Lucida Sans Typewriter\">$document = &quot;C:\\fso\\WhyUsePs2.docx&quot;<br>$app = New-Object -ComObject word.application<br>$app.Visible = $false<br>$doc = $app.Documents.Open($document)<br>$words = $doc.words<br>$outputObject = @()<br>&quot;There are &quot; + $words.count + &quot; words in the document&quot;<br>For($i = 1 ; $i -le $words.count ; $i ++) <br>&nbsp;&nbsp;&nbsp;&nbsp; { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $object = New-Object -typeName PSObject<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $object | <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Add-Member -MemberType noteProperty -name word -value $words.item($i).text <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $outputObject += $object<br>&nbsp; &nbsp;&nbsp;&nbsp;}<br>$doc.close()<br>$app.quit()<br>$outputObject | sort-object -property word -unique<\/p>\n<p><\/font><\/font><\/span><\/p>\n<p class=\"CodeBlockScreenedHead\">Before jumping into the GetUniqueWordsFromWord.ps1 script, take a look at the Word document seen here:<\/p>\n<p class=\"Fig-Graphic\"><img decoding=\"async\" title=\"Image of Word document with 231 words\" alt=\"Image of Word document with 231 words\" src=\"http:\/\/img.microsoft.com\/library\/media\/1033\/technet\/images\/scriptcenter\/qanda\/hsg\/2009\/october\/hey1015\/hsg-10-15-09-01.jpg\" width=\"600\" height=\"429\"><br><a href=\"http:\/\/img.microsoft.com\/library\/media\/1033\/technet\/images\/scriptcenter\/qanda\/hsg\/2009\/october\/hey1015\/hsg-10-15-09-01.jpg\"><font face=\"Segoe\"><\/font><\/a><\/p>\n<p class=\"MsoNormal\">As you can see, there are 231 words in the document. Many of these words are unique, such as &#8220;after,&#8221; but some are not, such as the word &#8220;the.&#8221; The GetUniqueWordsFromWord.ps1 script will display a list of all the unique words in the Microsoft Word document. 
<\/p>\n<p class=\"MsoNormal\">To display the unique words in the Microsoft Word document, the GetUniqueWordsFromWord.ps1 script begins by using the <b>$document<\/b> variable to hold the path to the Microsoft Word document that is to be analyzed. Next, the <b>New-Object<\/b> cmdlet is used with the <b>word.application<\/b> program ID to create an instance of the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb148369.aspx\">application object<\/a>. The <b>application<\/b> object is the main object that is used when working with the Microsoft Word automation model. The <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb179600.aspx\"><font face=\"Segoe\">visible property<\/font><\/a> is set to <b>$false<\/b>, which means the Microsoft Word application window will not be visible while the Windows PowerShell script is running. This section of the script is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$document = &quot;C:\\fso\\WhyUsePs2.docx&quot;<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$app = New-Object -ComObject word.application<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$app.Visible = $false<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">After the application object has been created, the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb212729.aspx\"><font face=\"Segoe\">documents property<\/font><\/a> from the <b>application<\/b> object is used to obtain an instance of the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb211902.aspx\">documents collection object<\/a>. The <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb216319.aspx\">open method<\/a> from the <b>documents<\/b> collection object is used to open the document that is specified in the <b>$document<\/b> variable. 
The <b>open<\/b> method from the <b>documents<\/b> collection object returns a <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb211897.aspx\"><font face=\"Segoe\">document object<\/font><\/a> that is stored in the <b>$doc<\/b> variable. This line of the script is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$doc = $app.Documents.Open($document)<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">The <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb216304.aspx\"><font face=\"Segoe\">words property<\/font><\/a> of the <b>document<\/b> object is used to return a <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb212252.aspx\">words collection object<\/a> that represents all of the words in the document. The <b>words<\/b> collection object is stored in the <b>$words<\/b> variable as seen here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$words = $doc.words<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">After the <b>words<\/b> collection has been created, it is time to create an empty array that will be used to store the custom objects the script will create. It is also time to display a message on the Windows PowerShell console that indicates how many words are in the document. Please note that in most cases, the number of words displayed by the <b>count<\/b> property of the <b>words<\/b> collection object will not correspond with the number that is shown at the bottom of the Microsoft Word document. This is because the <b>count<\/b> property includes punctuation and paragraph marks in its total, whereas the word count shown in the document window does not. 
This section of the script is seen here: <\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$outputObject = @()<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&quot;There are &quot; + $words.count + &quot; words in the document&quot;<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">The <b>for<\/b> statement is used to set up a loop that will be used to walk through the collection of words stored in the <b>words<\/b> collection object. The loop begins at 1, because the <b>words<\/b> collection is indexed starting at 1, and continues as long as the value of the variable <b>$i<\/b> is less than or equal to the count of the number of words in the collection. On each pass through the loop, the value of the <b>$i<\/b> variable will be incremented by 1. This is seen here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">For($i = 1 ; $i -le $words.count ; $i ++) <\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp; {<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">On each pass through the loop, a custom Windows PowerShell PSObject is created by using the <b>New-Object<\/b> cmdlet and the returned PSObject is stored in the <b>$object<\/b> variable. This is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $object = New-Object -typeName PSObject<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">The <b>Add-Member<\/b> cmdlet is used to add a <b>noteProperty<\/b> to the PSObject stored in the <b>$object<\/b> variable. The name of the <b>noteProperty<\/b> is <b>word<\/b>, and the value is the next word in the collection of words. The <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb215697.aspx\"><font face=\"Segoe\">item method<\/font><\/a> is used to retrieve the word from the <b>words<\/b> collection by index number. 
This is not a direct retrieval, however, because the item method returns a <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb259519.aspx\"><font face=\"Segoe\">range object<\/font><\/a> and not a <b>word<\/b> object. The <b>range<\/b> object does have a <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb179332.aspx\"><font face=\"Segoe\">text property<\/font><\/a> that is used either to get or to set the value of the text in the selected range. Because this range is a single word, the <b>text<\/b> property from the <b>range<\/b> object retrieves the next word from the <b>words<\/b> collection object. This is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $object | <\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Add-Member -MemberType noteProperty -name word -value $words.item($i).text <\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">After the <b>word<\/b> property has been added to the PSObject, the PSObject that is stored in the <b>$object<\/b> variable is added to the <b>$outputObject<\/b> array, as shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $outputObject += $object<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">&nbsp;&nbsp;&nbsp;&nbsp; }<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">The <b>document<\/b> object is closed by using the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb214403.aspx\">close method<\/a> and the <b>application<\/b> object is destroyed by calling the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/bb215475.aspx\">quit method<\/a>. 
This is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$doc.close()<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$app.quit()<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">The array of objects stored in the <b>$outputObject<\/b> variable is piped to the <b>Sort-Object<\/b> cmdlet, where the object is sorted on the <b>word<\/b> property and only unique words are displayed on the Windows PowerShell console. This line of code is shown here:<\/p>\n<p class=\"CodeBlock\"><span><font face=\"Lucida Sans Typewriter\">$outputObject | sort-object -property word -unique<\/p>\n<p><\/font><\/span><\/p>\n<p class=\"MsoNormal\">When the script is run, the output shown in the following image is displayed:<b><\/p>\n<p><\/b><\/p>\n<p class=\"Fig-Graphic\"><img decoding=\"async\" title=\"Image of output of the script\" alt=\"Image of output of the script\" src=\"http:\/\/img.microsoft.com\/library\/media\/1033\/technet\/images\/scriptcenter\/qanda\/hsg\/2009\/october\/hey1015\/hsg-10-15-09-02.jpg\" width=\"600\" height=\"297\"><br><a href=\"http:\/\/img.microsoft.com\/library\/media\/1033\/technet\/images\/scriptcenter\/qanda\/hsg\/2009\/october\/hey1015\/hsg-10-15-09-02.jpg\"><font face=\"Segoe\"><\/font><\/a><\/p>\n<p class=\"MsoNormal\">Well, EM, that is about all there is to retrieving unique words from a Microsoft Word document. <\/p>\n<p class=\"MsoNormal\">If you want to know exactly what we will be looking at tomorrow, follow us on <a href=\"http:\/\/www.twitter.com\/scriptingguys\/\" target=\"_blank\"><font face=\"Segoe\">Twitter<\/font><\/a> or <a href=\"http:\/\/www.facebook.com\/group.php?gid=5901799452\" target=\"_blank\">Facebook<\/a>. 
If you have any questions, send e-mail to us at <a href=\"mailto:scripter@microsoft.com\" target=\"_blank\"><font face=\"Segoe\">scripter@microsoft.com<\/font><\/a> or post your questions on the <a href=\"http:\/\/social.technet.microsoft.com\/Forums\/en\/ITCG\/threads\/\" target=\"_blank\">Official Scripting Guys Forum<\/a>. See you tomorrow. Until then, peace.<\/p>\n<p><b><span>Ed Wilson and Craig Liebendorfer, Scripting Guys<\/p>\n<p><\/span><\/b><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hey, Scripting Guy! I need to obtain a listing of unique words from a Microsoft Word document. I know that there is the Sort-Object cmdlet that can be used to retrieve unique items, and there is the Get-Content cmdlet that can read the text of a text file. However, the Get-Content cmdlet is not able [&hellip;]<\/p>\n","protected":false},"author":595,"featured_media":87096,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[84,49,3,45],"class_list":["post-52243","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-scripting","tag-microsoft-word","tag-office","tag-scripting-guy","tag-windows-powershell"],"acf":[],"blog_post_summary":"<p>Hey, Scripting Guy! I need to obtain a listing of unique words from a Microsoft Word document. I know that there is the Sort-Object cmdlet that can be used to retrieve unique items, and there is the Get-Content cmdlet that can read the text of a text file. 
However, the Get-Content cmdlet is not able [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts\/52243","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/users\/595"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/comments?post=52243"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/posts\/52243\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/media\/87096"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/media?parent=52243"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/categories?post=52243"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/scripting\/wp-json\/wp\/v2\/tags?post=52243"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}