{"id":10931,"date":"2006-04-25T12:17:46","date_gmt":"2006-04-25T12:17:46","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/powershell\/2006\/04\/25\/duplicate-files-2\/"},"modified":"2019-02-18T13:25:07","modified_gmt":"2019-02-18T20:25:07","slug":"duplicate-files-2","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/powershell\/duplicate-files-2\/","title":{"rendered":"Duplicate Files 2"},"content":{"rendered":"<p class=\"MsoNormal\"><span>A long time ago I posted a filter (AddNote) for adding notes to objects.<span>&nbsp; <\/span>Some time later I posted a function (Get-MD5) for calculating the MD5 hash of a file, and somebody asked how that could be used in a script to list all the files in a given folder that are very likely the same. <span>&nbsp;<\/span>I like that question because the answer allows me to combine both of these functions in a way I find pretty neat. <span>&nbsp;<\/span>First of all, let\u2019s create another filter called AttachMD5. <\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">filter AttachMD5<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">{<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>$md5hash = Get-MD5 $_;<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>return ($_ | AddNote MD5 $md5Hash);<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">}<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>The filter expects to get a [System.IO.FileInfo] object via the pipeline. <span>&nbsp;<\/span>It will calculate the file\u2019s MD5 hash, use the AddNote filter to add the hash as a note called MD5, and finally return the object. 
<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;$foo = dir test.txt | AttachMD5<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;$foo.MD5<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">216 129 182 155 10 202 51 188 245 219 199 220 92 68 140 194<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>Now we have all the pieces we need to write a script that will tell us if there are any files that are very likely duplicates. <span>&nbsp;<\/span>The plan is to get a list of fileinfo objects, attach the MD5 to each one, then group by length and MD5 and finally print out all the groups that have more than one item. <span>&nbsp;<\/span>Here is one way to do that:<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">$input | <\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>where { $_ -is [System.IO.FileInfo] } | <\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>AttachMD5 |<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>group-object Length,MD5 |<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>where { $_.Count -gt 1 } |<span>&nbsp; <\/span><\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\"><span>&nbsp; <\/span>foreach { &#8220;$($_.Group |&nbsp;foreach { $_.FullName } )&#8221; }<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>Take that bit and copy it into a script along with the other functions and 
filters and let\u2019s try it out.<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;abc&#8221; &gt; a.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;xyz&#8221; &gt; b.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;abc&#8221; &gt; c.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;xyz&#8221; &gt; d.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;jkl&#8221; &gt; e.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;&#8221;abc&#8221; &gt; f.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;dir | c:\\monad\\getdups.msh<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">C:\\temp\\a.txt C:\\temp\\c.txt C:\\temp\\f.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">C:\\temp\\b.txt C:\\temp\\d.txt<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span>If we wanted to find all the very likely duplicate files in a directory structure we could just recurse through it and pipe it to the script:<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;dir . 
-recurse | c:\\monad\\getdups.msh<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">&nbsp;<\/font><\/span><\/p>\n<p class=\"MsoNormal\"><span>Now\u2026 you should know that this script isn\u2019t exactly the most performant thing in the world.&nbsp; After all, it\u2019s calculating the MD5 hash for all the files, which isn&#8217;t really necessary.&nbsp;<span>&nbsp;<\/span>I\u2019ll leave improving the performance as an exercise for&nbsp;you guys.<span>&nbsp; <\/span>One quick way to improve performance would be to group by Length first, discard every group that doesn\u2019t have more than one item, and only then calculate the MD5 for the files that remain.<span>&nbsp; <\/span>Want to measure whether you are really improving performance?<span>&nbsp;&nbsp;<\/span>Give the time-expression cmdlet a try.<\/span><\/p>\n<p class=\"MsoNormal\"><span>&nbsp;<\/span><\/p>\n<p class=\"MsoNormal\"><span><font face=\"Courier New\">MSH&gt;time-expression { dir | getdups.msh }<\/font><\/span><\/p>\n<p>[<i>Edit: Monad has now been renamed to Windows PowerShell.  
This script or discussion may require slight adjustments before it applies directly to newer builds.<\/i>]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A long time ago I posted a filter (AddNote) for adding notes to objects.&nbsp; Some time later I posted a function (Get-MD5) for calculating the MD5 hash of a file and somebody asked how that could be used in a script to list all the files in a given folder that are very likely the [&hellip;]<\/p>\n","protected":false},"author":600,"featured_media":13641,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-10931","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-powershell"],"acf":[],"blog_post_summary":"<p>A long time ago I posted a filter (AddNote) for adding notes to objects.&nbsp; Some time later I posted a function (Get-MD5) for calculating the MD5 hash of a file and somebody asked how that could be used in a script to list all the files in a given folder that are very likely the 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/10931","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/users\/600"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/comments?post=10931"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/posts\/10931\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media\/13641"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/media?parent=10931"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/categories?post=10931"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/powershell\/wp-json\/wp\/v2\/tags?post=10931"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}