Summary: Guest blogger, Ben Vierck, talks about using Windows PowerShell to determine if a file is an image.
Microsoft Scripting Guy, Ed Wilson, is here. I am happy to introduce a new guest blogger here at the Hey, Scripting Guy! Blog: Ben Vierck. Ben has been around for a while, using and supporting Windows PowerShell, and he certainly is not a noob by any stretch of the imagination. I had not previously talked to him about writing a guest blog, and then Teresa mentioned it to me. What would I do without the Scripting Wife? Let’s hope I don’t have to find out!
Ben is presenting a three-part series about images. Now here’s Ben…
The process of ingesting paper documents into our systems and managing the lifecycle of that paper is fragile. Legacy software systems can do that work: systems with core code that hasn't been touched in over ten years and with architectures that weren't designed to accommodate the modern world of the cloud and IoT. The Windows community, especially the corner of that community that is obsessed with automation, Windows PowerShell, and Azure, has a lot to offer this aging industry. Let's go!
It all starts with a piece of paper, typically delivered by mail, leafed in a document, which is one in a batch of documents. These batches are placed on scanners (sometimes as large as a room) and digitized. It's at this point that the capture process begins.
Fundamental to the capture process is the ability to manipulate the digital artifacts created during the scanning process. Before we jump in and begin feeding our pipelines like this:
dir *.tiff, *.jpg, *.png, *.pdf
…let's ask the question, "What is an image and how do we know that a file is one?" We need a Test-Image cmdlet! You might suggest that we compare the file extension against a set of known image file extensions:
function Test-Image {
    [CmdletBinding()]
    param(
        [parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
        [ValidateNotNullOrEmpty()]
        [Alias('PSPath')]
        $Path
    )
    PROCESS {
        $knownImageExtensions = @( ".jpg", ".bmp", ".gif", ".tif", ".pdf", ".png" )
        $extension = [System.IO.Path]::GetExtension($Path.FullName)
        return $knownImageExtensions -contains $extension.ToLower()
    }
}
Let's try that. Oh no…
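The failure is easy to reproduce without touching the file system. The following is a minimal sketch that applies the same extension lookup to a few hypothetical file names (the names are made up for illustration):

```powershell
# Hypothetical file names; no files need to exist for this check.
$knownImageExtensions = @('.jpg', '.bmp', '.gif', '.tif', '.pdf', '.png')
$names = @('scan001.tif', 'scan002.TIF', 'scan003.tiff', 'scan004.jpeg')

$results = foreach ($name in $names) {
    # Same logic as the extension-based Test-Image: lower-case the extension and look it up.
    $extension = [System.IO.Path]::GetExtension($name).ToLower()
    [pscustomobject]@{
        Name    = $name
        IsImage = $knownImageExtensions -contains $extension
    }
}
$results
```

Only the two .tif names pass. The .tiff and .jpeg names are rejected, even though both are common image extensions.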
The first time out, it failed to identify a .tiff file. Sure, I could go back and modify $knownImageExtensions, but it would be better to come up with an algorithm that is resilient to the whims of the users on my systems, so that they can name the image files with whatever extension they'd like. Let's begin cracking open these image files in a binary editor to see what they're made of. In this screenshot, I've opened one of my test TIFF files in the Visual Studio Binary Editor:
After opening several TIFFs in a binary editor, I notice that all of those files share the same first 3 bytes: 49 49 2A. A quick look at the TIFF Specification confirms the discovery. Similarly, other formats have distinct signatures. Here is a table of some well-known image file signatures (values are hexadecimal):
| Type | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Byte 7 | Byte 8 |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
| jpg  | FF     | D8     |        |        |        |        |        |        |
| bmp  | 42     | 4D     |        |        |        |        |        |        |
| gif  | 47     | 49     | 46     |        |        |        |        |        |
| tif  | 49     | 49     | 2A     |        |        |        |        |        |
| pdf  | 25     | 50     | 44     | 46     |        |        |        |        |
| png  | 89     | 50     | 4E     | 47     | 0D     | 0A     | 1A     | 0A     |
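Several of these signatures are simply ASCII text, which makes them easy to spot in a binary editor. As a small sketch (the variable names are mine, not part of the script in this post), the hex values from the table can be decoded back into characters:

```powershell
# Signature bytes from the table, written as hex strings.
$signatures = [ordered]@{
    bmp = @('42', '4D')
    gif = @('47', '49', '46')
    tif = @('49', '49', '2A')
    pdf = @('25', '50', '44', '46')
}

$decoded = foreach ($type in $signatures.Keys) {
    # Parse each hex pair into a byte, then view that byte as an ASCII character.
    $chars = foreach ($hex in $signatures[$type]) {
        [char][Convert]::ToByte($hex, 16)
    }
    [pscustomobject]@{
        Type  = $type
        Ascii = -join $chars
    }
}
$decoded
```

The jpg signature and the first png byte are not printable characters, so they are left out here.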
The algorithm to search for these patterns almost writes itself. I'll make a reference table for the known image header byte signatures:
$knownHeaders = @{
    jpg = @( "FF", "D8" );
    bmp = @( "42", "4D" );
    gif = @( "47", "49", "46" );
    tif = @( "49", "49", "2A" );
    pdf = @( "25", "50", "44", "46" );
    png = @( "89", "50", "4E", "47", "0D", "0A", "1A", "0A" );
}
Now read the first 8 bytes of a file:
$bytes = Get-Content $path -Encoding Byte -ReadCount 1 -TotalCount 8
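One caveat: -Encoding Byte is a Windows PowerShell parameter value. PowerShell 6 and later replace it with the -AsByteStream switch. If the script needs to run on both, one option is to read the bytes through .NET instead. The following is a sketch only, and Get-FileHeaderBytes is a hypothetical helper name, not part of the PSImaging module:

```powershell
function Get-FileHeaderBytes {
    # Hypothetical helper: returns up to $Count bytes from the start of a file.
    param(
        [Parameter(Mandatory = $true)]
        [string] $LiteralPath,
        [int] $Count = 8
    )
    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so convert to a full provider path first.
    $fullPath = Convert-Path -LiteralPath $LiteralPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
        # Copy only the bytes actually read; short files yield fewer than $Count.
        $result = New-Object byte[] $read
        [Array]::Copy($buffer, $result, $read)
        return ,$result
    }
    finally {
        $stream.Dispose()
    }
}

# Demo: write a temporary file that starts with the PNG signature, then read it back.
$demo = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($demo, [byte[]](0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0x00))
$header = Get-FileHeaderBytes -LiteralPath $demo
$hex = ($header | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex
Remove-Item -LiteralPath $demo
```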
Convert the bytes that were read into the same format as our reference arrays:
$fileHeader = ($bytes | select -first $knownHeaders['tif'].Length | % { $_.ToString("X2") })
Compare the file byte array to the reference arrays:
Compare-Object -ReferenceObject $knownHeaders['tif'] -DifferenceObject $fileHeader
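One thing to be aware of: by default, Compare-Object matches items regardless of their position, so two headers that contain the same values in a different order produce no differences. For short signatures this rarely matters in practice, but a position-aware check is easy to sketch. Test-HeaderMatch below is a hypothetical helper, not part of the original script:

```powershell
function Test-HeaderMatch {
    # Hypothetical helper: true only if $Actual starts with $Expected, in order.
    param(
        [Parameter(Mandatory = $true)]
        [string[]] $Expected,   # known signature, for example 'FF', 'D8'

        [Parameter(Mandatory = $true)]
        [AllowEmptyCollection()]
        [string[]] $Actual      # hex strings read from the file
    )
    if ($Actual.Count -lt $Expected.Count) {
        return $false
    }
    # Joining keeps the position information that Compare-Object's default matching ignores.
    $prefix = $Actual[0..($Expected.Count - 1)] -join ' '
    return $prefix -eq ($Expected -join ' ')
}

$tif = @('49', '49', '2A')
$inOrder   = Test-HeaderMatch -Expected $tif -Actual '49', '49', '2A', '00'
$reordered = Test-HeaderMatch -Expected $tif -Actual '49', '2A', '49', '00'

# For comparison: Compare-Object sees no difference in the reordered bytes.
$diffCount = @(Compare-Object -ReferenceObject $tif -DifferenceObject '49', '2A', '49').Count

"In order: $inOrder; reordered: $reordered; Compare-Object differences: $diffCount"
```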
Compare-Object emits nothing when the two arrays match, so an empty result means the file is an image, regardless of what its file extension says. Any output means it is not. Putting it all together, the script looks like this:
function Test-Image {
    [CmdletBinding()]
    [OutputType([System.Boolean])]
    param(
        [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true)]
        [ValidateNotNullOrEmpty()]
        [Alias('PSPath')]
        [string] $Path
    )
    PROCESS {
        $knownHeaders = @{
            jpg = @( "FF", "D8" );
            bmp = @( "42", "4D" );
            gif = @( "47", "49", "46" );
            tif = @( "49", "49", "2A" );
            png = @( "89", "50", "4E", "47", "0D", "0A", "1A", "0A" );
            pdf = @( "25", "50", "44", "46" );
        }
        # coerce file objects from the pipeline (for example, from dir) into full paths;
        # plain string paths are left as they are
        if($_ -is [System.IO.FileSystemInfo]) {
            $Path = $_.FullName
        }
        # read in the first 8 bytes
        $bytes = Get-Content -LiteralPath $Path -Encoding Byte -ReadCount 1 -TotalCount 8 -ErrorAction Ignore
        $retval = $false
        foreach($key in $knownHeaders.Keys) {
            # make the file header data the same length and format as the known header
            $fileHeader = $bytes |
                Select-Object -First $knownHeaders[$key].Length |
                ForEach-Object { $_.ToString("X2") }
            if($fileHeader.Length -eq 0) {
                continue
            }
            # compare the two headers; no output from Compare-Object means they match
            $diff = Compare-Object -ReferenceObject $knownHeaders[$key] -DifferenceObject $fileHeader
            if(($diff | Measure-Object).Count -eq 0) {
                $retval = $true
                # one matching signature is enough
                break
            }
        }
        return $retval
    }
}
That's functional, easy-to-use, and tolerant of variable file extensions. Here's the output:
Next up in the series…
We've got an image of a document. How do we find out what kind of document it is?
Note This script and the others included in this series are maintained on GitHub: Positronic-IO/PSImaging.
~Ben
Thanks, Ben. Be sure to come back tomorrow for Part 2 of this series.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy