March 19th, 2015

PSImaging Part 1: Test-Image

Doctor Scripto
Scripter

Summary: Guest blogger, Ben Vierck, talks about using Windows PowerShell to determine if a file is an image.

Microsoft Scripting Guy, Ed Wilson, is here. I am happy to introduce a new guest blogger here at the Hey, Scripting Guy! Blog: Ben Vierck. Ben has been around for a while, using and supporting Windows PowerShell, and he certainly is not a noob by any stretch of the imagination. I had not previously talked to him about writing a guest blog, and then Teresa mentioned it to me. What would I do without the Scripting Wife? Let’s hope I don’t have to find out!

Ben is presenting a three-part series about images. Now here’s Ben…

The process of ingesting paper documents into our systems and managing the lifecycle of that paper is fragile. Legacy software systems can do that work, but these are systems with core code that hasn't been touched in over ten years and with architectures that weren't designed to accommodate the modern world of the cloud and IoT. The Windows community, especially the corner of that community that is obsessed with automation, Windows PowerShell, and Azure, has a lot to offer this aging industry. Let's go!

It all starts with a piece of paper, typically delivered by mail, leafed in a document, which is one in a batch of documents. These batches are placed on scanners (sometimes as large as a room) and digitized. It's at this point that the capture process begins.  

Fundamental to the capture process is the ability to manipulate the digital artifacts created during the scanning process. Before we jump in and begin feeding our pipelines like this:

dir *.tiff, *.jpg, *.png, *.pdf

…let's ask the question, "What is an image and how do we know that a file is one?" We need a Test-Image cmdlet! You might suggest that we compare the file extension against a set of known image file extensions:

function Test-Image {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
        [ValidateNotNullOrEmpty()]
        [Alias('PSPath')]
        $Path
    )

    PROCESS {
        # file extensions that are treated as images
        $knownImageExtensions = @( ".jpg", ".bmp", ".gif", ".tif", ".pdf", ".png" )
        # $Path is expected to be a file object, such as one emitted by dir
        $extension = [System.IO.Path]::GetExtension($Path.FullName)
        return $knownImageExtensions -contains $extension.ToLower()
    }
}

Let's try that. Oh no…

Image of command output

The first time out, it's failed to identify a .tiff. Sure, I could go back and modify the $knownImageExtensions, but perhaps it would be better to come up with an algorithm that is more resilient to the whims of the users on my systems, so that they can arbitrarily name the image files with whatever extension they'd like. Let's begin cracking open these image files in a binary editor to see what they're made of. In this screenshot, I've opened one of my test TIFF files in the Visual Studio Binary Editor:

Image of file

After opening several TIFFs in a binary editor, I notice that all of those files share the same first 3 bytes: 49 49 2A. A quick look at the TIFF Specification confirms the discovery. Similarly, other formats have distinct signatures. Here is a table of some well-known image file signatures (byte values are in hexadecimal):

Type  Byte 1  Byte 2  Byte 3  Byte 4  Byte 5  Byte 6  Byte 7  Byte 8
jpg   FF      D8
bmp   42      4D
gif   47      49      46
tif   49      49      2A
pdf   25      50      44      46
png   89      50      4E      47      0D      0A      1A      0A
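If you don't have a binary editor handy, the same inspection can be done from the console. The following is a minimal sketch, not part of the PSImaging module: Get-FileHeaderHex is a hypothetical helper name, and it assumes the file is on the local file system and can be opened for reading.

```powershell
# Hypothetical helper (not from the original post): returns the first
# $Count bytes of a file as two-digit hex strings, such as '49','49','2A'.
function Get-FileHeaderHex {
    param(
        [Parameter(Mandatory = $true)]
        [string] $LiteralPath,

        [int] $Count = 8
    )

    # .NET resolves relative paths against the process directory rather than
    # the current PowerShell location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $LiteralPath).ProviderPath

    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
    }
    finally {
        $stream.Dispose()
    }

    # Emit only the bytes that were actually read (short files yield fewer)
    for ($i = 0; $i -lt $read; $i++) {
        $buffer[$i].ToString('X2')
    }
}
```

Joining the result with spaces, for example (Get-FileHeaderHex -LiteralPath $someFile) -join ' ', gives a string that can be compared by eye against the table above.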

The algorithm to search for these patterns almost writes itself. I'll make a reference table for the known image header byte signatures:

$knownHeaders = @{
    jpg = @( "FF", "D8" );
    bmp = @( "42", "4D" );
    gif = @( "47", "49", "46" );
    tif = @( "49", "49", "2A" );
    pdf = @( "25", "50", "44", "46" );
    png = @( "89", "50", "4E", "47", "0D", "0A", "1A", "0A" );
}

Now read the first 8 bytes of a file:

$bytes = Get-Content $path -Encoding Byte -ReadCount 1 -TotalCount 8
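A note on versions: -Encoding Byte is what Windows PowerShell offers, while PowerShell 6 and later replace it with the -AsByteStream switch. If the script needs to run on both, something along these lines might work. This is only a sketch, and Get-LeadingBytes is a hypothetical name that is not part of the original script.

```powershell
# Hypothetical wrapper (not from the original post): reads the first $Count
# bytes of a file on either Windows PowerShell or PowerShell 6 and later.
function Get-LeadingBytes {
    param(
        [Parameter(Mandatory = $true)]
        [string] $LiteralPath,

        [int] $Count = 8
    )

    # Parameters common to both editions
    $params = @{
        LiteralPath = $LiteralPath
        ReadCount   = 1
        TotalCount  = $Count
    }

    if ($PSVersionTable.PSVersion.Major -ge 6) {
        # PowerShell 6+ dropped '-Encoding Byte' in favor of a switch
        $params['AsByteStream'] = $true
    }
    else {
        $params['Encoding'] = 'Byte'
    }

    Get-Content @params
}
```

Splatting keeps the call to Get-Content in one place, so only the byte-reading parameter differs between editions.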

Convert the bytes that were read into the same format as our reference arrays:

$fileHeader = ($bytes | select -first $knownHeaders['tif'].Length | % { $_.ToString("X2") })

Compare the file byte array to the reference arrays:

Compare-Object -ReferenceObject $knownHeaders['tif'] -DifferenceObject $fileHeader
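One caveat worth keeping in mind: by default, Compare-Object does not take order into account, so two arrays that hold the same values in a different order produce no differences. For the handful of signatures used here that is unlikely to cause trouble, but if a strict, position-by-position check is wanted, it could be sketched like this. Test-HeaderPrefix is a hypothetical helper name and is not part of the original script.

```powershell
# Hypothetical helper (not from the original post): returns $true only when
# $Actual begins with exactly the values in $Expected, in the same order.
function Test-HeaderPrefix {
    param(
        [Parameter(Mandatory = $true)]
        [string[]] $Expected,

        [string[]] $Actual = @()
    )

    # A file shorter than the signature cannot match it
    if ($Actual.Count -lt $Expected.Count) {
        return $false
    }

    # Compare position by position so that order matters
    for ($i = 0; $i -lt $Expected.Count; $i++) {
        if ($Actual[$i] -ne $Expected[$i]) {
            return $false
        }
    }

    return $true
}

Test-HeaderPrefix -Expected '49','49','2A' -Actual '49','49','2A','00'   # → True
Test-HeaderPrefix -Expected '49','49','2A' -Actual '49','2A','49','00'   # → False
```

Because the check only looks at as many positions as the signature has, the header read from the file does not need to be trimmed to length first.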

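Compare-Object produces no output when the two arrays hold the same values, so an empty result means the file header matches a known signature and the file is an image, regardless of what its file extension says. If there are differences, it's not. Putting it all together, the script looks like this: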

function Test-Image {
    [CmdletBinding()]
    [OutputType([System.Boolean])]
    param(
        [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
        [ValidateNotNullOrEmpty()]
        [Alias('PSPath')]
        [string] $Path
    )

    PROCESS {
        $knownHeaders = @{
            jpg = @( "FF", "D8" );
            bmp = @( "42", "4D" );
            gif = @( "47", "49", "46" );
            tif = @( "49", "49", "2A" );
            png = @( "89", "50", "4E", "47", "0D", "0A", "1A", "0A" );
            pdf = @( "25", "50", "44", "46" );
        }

        # coerce file objects from the pipeline into full paths
        # (plain strings have no FullName and are used as they are)
        if ($null -ne $_ -and $_.FullName) {
            $Path = $_.FullName
        }

        # read in the first 8 bytes
        $bytes = Get-Content -LiteralPath $Path -Encoding Byte -ReadCount 1 -TotalCount 8 -ErrorAction Ignore

        # nothing could be read (for example, a directory or an empty file)
        if ($null -eq $bytes) {
            return $false
        }

        $retval = $false
        foreach ($key in $knownHeaders.Keys) {
            # make the file header data the same length and format as the known header
            $fileHeader = $bytes |
                Select-Object -First $knownHeaders[$key].Length |
                ForEach-Object { $_.ToString("X2") }
            if ($fileHeader.Length -eq 0) {
                continue
            }

            # compare the two headers
            $diff = Compare-Object -ReferenceObject $knownHeaders[$key] -DifferenceObject $fileHeader
            if (($diff | Measure-Object).Count -eq 0) {
                $retval = $true
                break   # one matching signature is enough
            }
        }
        return $retval
    }
}

That's functional, easy-to-use, and tolerant of variable file extensions. Here's the output:

Image of command output
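As a possible variation, the same signature table can report which format matched instead of only True or False. What follows is a sketch under a few assumptions rather than part of the PSImaging module: Get-ImageFormat is a hypothetical name, the 'tif-be' label is made up for the big-endian TIFF signature (4D 4D 00 2A) that the TIFF specification also defines, and the file is read through .NET so the code does not depend on -Encoding Byte.

```powershell
# Hypothetical companion to Test-Image (not from the original post): returns
# the name of the first matching signature, or nothing when none match.
function Get-ImageFormat {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, Position = 0,
                   ValueFromPipeline = $true,
                   ValueFromPipelineByPropertyName = $true)]
        [Alias('FullName')]
        [string] $Path
    )

    begin {
        # Signatures written the way BitConverter renders bytes ('XX-XX-...').
        # 'tif-be' is an assumed label for big-endian TIFF files.
        $signatures = [ordered]@{
            'jpg'    = 'FF-D8'
            'bmp'    = '42-4D'
            'gif'    = '47-49-46'
            'tif'    = '49-49-2A'
            'tif-be' = '4D-4D-00-2A'
            'pdf'    = '25-50-44-46'
            'png'    = '89-50-4E-47-0D-0A-1A-0A'
        }
    }

    process {
        # Resolve to a full path because .NET ignores the PowerShell location
        $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

        $stream = [System.IO.File]::OpenRead($fullPath)
        try {
            $buffer = New-Object byte[] 8
            $read = $stream.Read($buffer, 0, 8)
        }
        finally {
            $stream.Dispose()
        }

        # Render only the bytes actually read, for example '49-49-2A-00'
        $hex = [System.BitConverter]::ToString($buffer, 0, $read)

        foreach ($name in $signatures.Keys) {
            # A prefix test on the rendered string is order-sensitive
            if ($hex.StartsWith($signatures[$name])) {
                return $name
            }
        }
    }
}
```

With this in place, something like dir | Get-ImageFormat would be expected to emit one format name per recognized file and nothing for the rest.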

Next up in the series…

We've got an image of a document. How do we find out what kind of document it is?

Note  This script and the others included in this series are maintained on GitHub: Positronic-IO/PSImaging.

~Ben

Thanks, Ben. Be sure to come back tomorrow for Part 2 of this series.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy 

Author

The "Scripting Guys" is a historical title passed from scripter to scripter. The current revision has morphed into our good friend Doctor Scripto who has been with us since the very beginning.
