{"id":8487,"date":"2018-05-07T10:20:03","date_gmt":"2018-05-07T17:20:03","guid":{"rendered":"https:\/\/www.microsoft.com\/developerblog\/?p=8487"},"modified":"2020-03-19T13:38:52","modified_gmt":"2020-03-19T20:38:52","slug":"handwriting-detection-and-recognition-in-scanned-documents-using-azure-ml-package-computer-vision-azure-cognitive-services-ocr","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/handwriting-detection-and-recognition-in-scanned-documents-using-azure-ml-package-computer-vision-azure-cognitive-services-ocr\/","title":{"rendered":"Making sense of Handwritten Sections in Scanned Documents using the Azure ML Package for Computer Vision and Azure Cognitive Services"},"content":{"rendered":"<h2><strong>Business Problem<\/strong><\/h2>\n<p>For businesses of all sorts, one of the great advantages of the shift from physical to digital documents is the fast and effective search and knowledge extraction methods now available. Gone are the days of reviewing documents line-by-line to find particular information. However, things get more complicated when the researcher needs to extract general concepts, rather than specific phrases. And it\u2019s even more complicated when applied to mixed-quality scanned documents containing handwritten annotations.<\/p>\n<p>Microsoft recently teamed with EY (Ernst &amp; Young Global Limited) to improve its contract search and knowledge extraction results.\u00a0 EY\u2019s professional services personnel spend significant amounts of time reviewing clients\u2019 contracts in order to extract information about relevant concepts of interest. Automated entity and knowledge extraction from these contracts would significantly reduce the amount of time their staff need to spend on the more mundane elements of this review work.<\/p>\n<p>It is challenging to achieve acceptable extraction accuracy when applying traditional search and knowledge extraction methods to these documents. 
Chief among these challenges are poor document image quality and handwritten annotations. The poor image quality stems from the fact that these documents are frequently scanned copies of signed agreements, stored as PDFs, often one or two generations removed from the original. This causes many optical character recognition (OCR) errors that introduce nonsense words. Also, most of these contracts include handwritten annotations which amend or define critical terms of the agreement. The handwriting legibility, style, and orientation vary widely, and the handwriting can appear in any location on the machine-printed contract page. Handwritten pointers and underscoring often note where the handwriting should be incorporated into the rest of the printed text of the agreement.<\/p>\n<p>We collaborated with EY to tackle these challenges as part of their search and knowledge extraction pipeline.<\/p>\n<h2>Technical Problem Statement<\/h2>\n<p>Despite recent progress, standard OCR technology performs poorly at recognizing handwritten characters on a machine-printed page. The recognition accuracy varies widely for the reasons described above, and the software often misplaces the location of the handwritten information when melding it in line with the adjoining text. While pure handwriting recognizers have long had stand-alone applications, there are few solutions that work well with document OCR and search pipelines.<\/p>\n<p>In order to enable entity and knowledge extraction from documents with handwritten annotations, the aim of our solution was first to identify handwritten words on a printed page, then to recognize the characters to transcribe the text, and finally to reinsert these recognized characters back into the OCR result at the correct location. 
For a good user experience, all this would need to be seamlessly integrated into the document ingestion workflow.<\/p>\n<h2>Approach<\/h2>\n<p>In recent years, computer vision <a href=\"https:\/\/en.wikipedia.org\/wiki\/Object_detection\">object detection<\/a> models using deep neural networks have proven to be effective at a <a href=\"https:\/\/www.researchgate.net\/publication\/257484936_50_Years_of_object_recognition_Directions_forward\">wide variety<\/a> of object recognition tasks, but they require a vast amount of expertly labeled training data. Fortunately, models pre-trained on standard datasets such as <a href=\"http:\/\/cocodataset.org\/#home\">COCO<\/a>, containing millions of labeled images, can be used to create powerful custom detectors with limited data via <a href=\"https:\/\/en.wikipedia.org\/wiki\/Transfer_learning\">transfer learning<\/a> \u2013 a method of fine-tuning an existing model to accomplish a different but related task. Transfer learning has been demonstrated to dramatically reduce the amount of training data required to achieve state-of-the-art accuracy for a <a href=\"http:\/\/ruder.io\/transfer-learning\/\">wide range of applications<\/a>.<\/p>\n<p>For this particular case, transfer learning from a pre-trained model was an obvious choice, given our small sample of labeled handwritten annotations and the availability of relevant state-of-the-art pre-trained models.<\/p>\n<p>Our workflow, from object detection to handwriting recognition and replacement in the contract image OCR result, is summarized in Figure 1 below. To start, we applied a custom object detection model to an image of a printed contract page to detect handwriting and identify its bounding box.<\/p>\n<p>The sample Jupyter notebook for object detection, along with customizable utilities and functions for data preparation and transfer learning, in the new <a href=\"https:\/\/aka.ms\/aml-packages\/vision\">Azure ML Package for Computer Vision<\/a> (AML-PCV) made our work much easier.\u00a0The AML-PCV notebook and supporting utilities take advantage of the <a href=\"https:\/\/arxiv.org\/abs\/1506.01497\">Faster R-CNN<\/a> object detection model with a Tensorflow back-end, which has produced state-of-the-art results in object detection challenges in the field.<\/p>\n<p>Our <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\/tree\/master\/Notebooks\">project Jupyter Notebooks<\/a> using AML-PCV\u00a0are available on our <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\">project GitHub repo<\/a>.\u00a0 You can find more details on the implementation and customizable parameters of AML-PCV on the <a href=\"https:\/\/github.com\/tensorflow\/models\/tree\/master\/research\/object_detection\">Tensorflow object detection website<\/a>. 
AML-PCV comes with support for transfer learning using faster_rcnn_resnet50_coco_2018_01_28, a model trained on the <a href=\"http:\/\/cocodataset.org\/#home\">COCO (Common Objects in Context) dataset<\/a> containing more than 200k labeled images and 1.5 million object instances across 80 categories.<\/p>\n<p>For our custom application, we used the <a href=\"https:\/\/github.com\/Microsoft\/VoTT\">Visual Object Tagging Tool<\/a> (VOTT) to manually label a small set of <a href=\"https:\/\/www.gsa.gov\/real-estate\/real-estate-services\/leasing-policy-procedures\/lease-documents\">public government contract data<\/a>\u00a0containing both machine-printed text and handwriting, as we\u2019ll detail in the data section below. We labelled two classes of handwriting objects in the VOTT tool \u2013 signatures and non-signatures (general text such as dates) \u2013 recording the bounding box and label for each instance. This set of labeled data was passed into the AML-PCV notebook to train a custom handwriting detection model.<\/p>\n<p>Once we had detected the handwritten annotations, we used the <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/computer-vision\/\">Microsoft Cognitive Services Computer Vision API<\/a> to apply OCR to recognize the characters of the handwriting. 
You can find the Jupyter Notebooks for this project, and a sample of the data on the <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\">project GitHub repo<\/a>.<\/p>\n<p><figure id=\"attachment_8499\" aria-labelledby=\"figcaption_attachment_8499\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"aligncenter size-full wp-image-10735\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/05\/Workflow-1320x732-1.png\" alt=\"Image Workflow 1320 215 732\" width=\"1320\" height=\"732\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/Workflow-1320x732-1.png 1320w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/Workflow-1320x732-1-300x166.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/Workflow-1320x732-1-1024x568.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/Workflow-1320x732-1-768x426.png 768w\" sizes=\"(max-width: 1320px) 100vw, 1320px\" \/><figcaption id=\"figcaption_attachment_8499\" class=\"wp-caption-text\"><br \/>Figure 1. Model Workflow<\/figcaption><\/figure><\/p>\n<h2>The Data<\/h2>\n<p>Using VOTT allowed us to produce a training set of 182 labelled images from a sample of <a href=\"https:\/\/www.gsa.gov\/real-estate\/real-estate-services\/leasing-policy-procedures\/lease-documents\">Government contracts<\/a> in a matter of a few hours. We drew our test set from an additional 100 contract images, chosen from different states than the training set. As described in the approach, we labelled two classes: handwritten signatures and handwritten non-signatures. Our objective was primarily to correctly interpret the non-signature objects, as these were germane to the entities and concepts we were trying to extract. The signatures typically did not contain this payload. 
Classifying signature handwriting as a separate class allowed us to focus on the non-signature handwriting that was of interest.<\/p>\n<p>VOTT writes an XML file for each image in Pascal-VOC format, with bounding box location information for each labelled object. This format can be read into AML-PCV directly, with further processing done by utilities called from the notebook. You can access the full set of images and labeled data from this project on an Azure blob public data repository with URI https:\/\/handwriting.blob.core.windows.net\/leasedata. You can also find a smaller sample of the data in the <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\/\">project GitHub repo<\/a>. Figure 2 shows an example of a typical contract section with relevant handwritten parts \u2013 in this case the start date of a real estate lease.<\/p>\n<p><figure id=\"attachment_8498\" aria-labelledby=\"figcaption_attachment_8498\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"aligncenter size-full wp-image-10731\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/05\/2a_sample-1320x509-1.png\" alt=\"Image 2a sample 1320 215 509\" width=\"1320\" height=\"509\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/2a_sample-1320x509-1.png 1320w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/2a_sample-1320x509-1-300x116.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/2a_sample-1320x509-1-1024x395.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/2a_sample-1320x509-1-768x296.png 768w\" sizes=\"(max-width: 1320px) 100vw, 1320px\" \/><figcaption id=\"figcaption_attachment_8498\" class=\"wp-caption-text\"><br \/>Figure 2. 
Screenshot of a Contract with Handwriting<\/figcaption><\/figure><\/p>\n<h2>Method<\/h2>\n<p>Here we provide detail on using the vision toolkit to train the custom object detection model.<\/p>\n<p>One of the key timesavers provided by AML-PCV\u00a0is its set of utilities to recognize, format, and pre-process labeled training and test data. The code below imports the VOTT-labeled dataset and pre-processes the images to create a suitable training set for the Faster R-CNN model:<\/p>\n<pre title=\"Preprocess Labelled Training and Test Data with AML-PCV\" class=\"font-size:13 line-height:16 left-set:true right-set:true lang:python decode:true\">import os, time\r\nfrom cvtk.core import Context, ObjectDetectionDataset, TFFasterRCNN\r\nfrom cvtk.utils import detection_utils\r\n\r\nimage_folder = \"&lt;input image folder including subfolders of jpg and xml&gt;\" # training data from VOTT labeling tool\r\nmodel_dir = \"&lt;saved model directory&gt;\" # dir for saved training models\r\nimage_path = \"&lt;test image path&gt;\" # scoring image path\r\nresult_path = \"&lt;results path&gt;\" # dir for saving images with detection boxes and placeholder text\r\ndata_train = ObjectDetectionDataset.create_from_dir_pascal_voc(dataset_name='training_dataset', data_dir=image_folder)\r\n<\/pre>\n<p>Below are the default hyperparameter selections for our handwriting object detection model. Since the default minibatch size is 1, we set num_steps equal to num_epochs multiplied by the number of images in the training set. 
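As a quick sanity check on that arithmetic (an illustrative calculation, not code from the notebook): with a minibatch size of 1, each training step consumes a single image, so the number of steps per epoch equals the training-set size.

```python
# With minibatch size 1, one step processes one image,
# so num_steps = num_epochs * number of training images.
num_images = 182  # size of our labeled training set
num_epochs = 30
num_steps = num_epochs * num_images
print(num_steps)  # 5460
```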
The learning rate and step number are default parameters from AML-PCV and are discussed in detail in the <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\/tree\/master\/Notebooks\">notebook<\/a>.<\/p>\n<pre title=\"Set Hyperparameters \" class=\"font-size:13 line-height:16 left-set:true right-set:true lang:python decode:true\">score_threshold = 0.0       # Threshold on the detection score, used to discard lower-confidence detections.\r\nmax_total_detections = 300  # Maximum number of detections. A high value will slow down training but might increase accuracy.\r\nmy_detector = TFFasterRCNN(labels=data_train.labels, \r\n                           score_threshold=score_threshold, \r\n                           max_total_detections=max_total_detections)\r\n# to get good results, use a larger value for num_steps, e.g., 5000.\r\nnum_steps = len(data_train.images)*30 # 30 epochs over the dataset created above\r\nlearning_rate = 0.001 # learning rate\r\nstep1 = 200 \r\n\r\nstart_train = time.time()\r\nmy_detector.train(dataset=data_train, num_steps=num_steps, \r\n                  initial_learning_rate=learning_rate,\r\n                  step1=step1,\r\n                  learning_rate1=learning_rate)\r\nend_train = time.time()\r\nprint(\"the total training time is {} seconds\".format(end_train-start_train))\r\n<\/pre>\n<p>With these parameter settings and our training set of 182 images, training took 4080 seconds on a standard Azure NC6 DLVM (<a href=\"https:\/\/azuremarketplace.microsoft.com\/en-us\/marketplace\/apps\/microsoft-ads.dsvm-deep-learning\">Azure Deep Learning Virtual Machine<\/a>) with one GPU.<\/p>\n<p>Let\u2019s examine the pipeline on one government contract page below. Figure 3 shows the detected handwriting boxes in the contract \u2013 blue for non-signature and green for signature. 
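In our pipeline, we kept only confident detections of the class of interest for downstream processing. The sketch below assumes a simplified (label, score, box) tuple format for the detector output; the structure actually returned by the AML-PCV detector differs, but the filtering logic is the same.

```python
def select_boxes(detections, label, min_score=0.5):
    """Keep the boxes of one class whose confidence meets the threshold."""
    return [box for (lbl, score, box) in detections
            if lbl == label and score >= min_score]

# toy detections: (label, score, (xmin, ymin, xmax, ymax))
detections = [
    ("non_signature", 0.92, (100, 200, 400, 240)),
    ("signature",     0.88, (120, 700, 380, 760)),
    ("non_signature", 0.31, (500, 500, 520, 510)),  # low confidence, dropped
]
print(select_boxes(detections, "non_signature"))  # [(100, 200, 400, 240)]
```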
If we can recognize the handwritten words, the challenge then is to decide where to insert them in the output of the printed-text OCR process.<\/p>\n<p>Our approach was to generate unique tokens designed to be reliably transcribed by the OCR software. We then inserted those tokens in the areas where we detected handwritten text,\u00a0replacing the original handwritten section, and used these tokens as anchor points in the OCR output. After some experimentation, we proceeded with tokens comprising 5 digits, starting and ending with the number \u201c8\u201d, with 3 randomly generated digits in between. Figure 4 shows the result of replacing the original text with these numbers.<\/p>\n<p><figure id=\"attachment_8506\" aria-labelledby=\"figcaption_attachment_8506\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"aligncenter size-full wp-image-10737\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/05\/3_sample.png\" alt=\"Image 3 sample\" width=\"725\" height=\"897\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/3_sample.png 725w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/3_sample-242x300.png 242w\" sizes=\"(max-width: 725px) 100vw, 725px\" \/><figcaption id=\"figcaption_attachment_8506\" class=\"wp-caption-text\"><br \/>Figure 3. 
Annotated Handwriting in One Page of PDF Contract<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<p><figure id=\"attachment_8505\" aria-labelledby=\"figcaption_attachment_8505\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"aligncenter size-full wp-image-10732\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/05\/4_sample.png\" alt=\"Image 4 sample\" width=\"720\" height=\"929\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/4_sample.png 720w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/4_sample-233x300.png 233w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><figcaption id=\"figcaption_attachment_8505\" class=\"wp-caption-text\">Figure 4. Inserted Placeholder Texts in Each Detected Handwriting Box<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<p>We then used the <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/computer-vision\/\">Microsoft Cognitive Services Computer Vision API<\/a> OCR service to transcribe each detected handwriting box. 
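A minimal sketch of generating the unique anchor tokens described earlier (five digits, starting and ending with "8", with three random digits in between); the exact generation code in our notebook may differ in detail:

```python
import random

def make_token(existing):
    """Generate a unique anchor token: '8' + three random digits + '8'."""
    while True:
        token = "8" + "".join(random.choice("0123456789") for _ in range(3)) + "8"
        if token not in existing:
            existing.add(token)
            return token

used = set()
tokens = [make_token(used) for _ in range(3)]
print(tokens)  # e.g. ['83428', '81548', '87018']
```

Keeping track of the tokens already issued guarantees that every handwriting box on a page gets a distinct anchor.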
Below is a helper function from our notebook to call the Computer Vision API and return the recognized characters.<\/p>\n<pre title=\"Helper Function to Call Cognitive Services API\" class=\"font-size:13 line-height:16 left-set:true right-set:true lang:python decode:true\">import time\r\n\r\nimport cv2\r\nimport requests\r\n\r\ndef pass2CV(img_array):\r\n    print('passing image to the Computer Vision API ...')\r\n    image_data = cv2.imencode('.jpg', img_array)[1].tobytes() # encode the image array as JPEG bytes\r\n\r\n    vision_base_url = \"https:\/\/westus2.api.cognitive.microsoft.com\/vision\/v1.0\/\"\r\n    # International URLs: replace 'westus2' with your region, for example 'westeurope' or 'eastasia'.\r\n    # The full list of regional URLs is available at this link:\r\n    # https:\/\/westus.dev.cognitive.microsoft.com\/docs\/services\/56f91f2d778daf23d8ec6739\/operations\/56f91f2e778daf14a499e1fa\r\n\r\n    text_recognition_url = vision_base_url + \"RecognizeText\"\r\n\r\n    # subscription_key must be set to your Computer Vision API subscription key\r\n    headers = {'Ocp-Apim-Subscription-Key': subscription_key,\r\n               \"Content-Type\": \"application\/octet-stream\"}\r\n    params = {'handwriting': True}\r\n\r\n    response = requests.post(text_recognition_url, headers=headers, params=params, data=image_data)\r\n    response.raise_for_status()\r\n\r\n    # recognition is asynchronous: poll the returned operation URL until the result is ready\r\n    operation_url = response.headers[\"Operation-Location\"]\r\n\r\n    analysis = {}\r\n    while \"recognitionResult\" not in analysis:\r\n        time.sleep(1)\r\n        response_final = requests.get(operation_url, headers=headers)\r\n        analysis = response_final.json()\r\n\r\n    polygons = [(line[\"boundingBox\"], line[\"text\"]) for line in analysis[\"recognitionResult\"][\"lines\"]]\r\n\r\n    return polygons\r\n<\/pre>\n<p>Meanwhile, we built a dictionary matching the anchor digits with the handwriting OCR results. 
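Given such a dictionary, merging the handwriting OCR back into the page-level OCR output reduces to replacing each anchor token with its recognized text. The helper below is an illustrative sketch (the function name and token values are ours, not taken from the notebook):

```python
def merge_handwriting(page_text, token_to_text):
    """Replace each anchor token found in the page OCR output
    with the recognized handwriting for that token."""
    for token, text in token_to_text.items():
        page_text = page_text.replace(token, text)
    return page_text

# toy page OCR output containing two anchor tokens
page_ocr = "Lease begins on 83428 and runs for 88298 years."
anchors = {"83428": "March 1, 1998", "88298": "2"}
print(merge_handwriting(page_ocr, anchors))
# Lease begins on March 1, 1998 and runs for 2 years.
```

Because the tokens are unlikely digit sequences, a plain string replacement is sufficient; this also illustrates why the dictionary keys matter more than their order.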
This simplifies our final task of replacing the anchor strings in the printed OCR results. Below, we see a sample of the dictionary for the handwriting boxes in Figure 3.<\/p>\n<pre title=\"Dictionary for Handwriting Tokens\" class=\"font-size:13 line-height:16 left-set:true right-set:true lang:default decode:true\">{'81548': 'couken Chapman 1 ( signature )',\r\n '83428': 'Mhub',\r\n '83728': 'U a L ( Signature )',\r\n '87018': 'ton w 13901',\r\n '87598': 'EO',\r\n '88078': '12 2 State St , BIX ( Addr',\r\n '88298': '2',\r\n '88488': 'I ( Signature ) 2 th'}\r\n<\/pre>\n<p>In the last step, we matched the detected handwriting blocks against the dictionary and replaced the placeholder text with the OCR results from the Computer Vision API. Figure 5a shows OCR results for the contract page, where the placeholder text was recognized reliably. Then, based on the dictionary above, we replaced the digits with the handwriting OCR results from the Computer Vision API. The results are shown in Figure 5b.<\/p>\n<p><figure id=\"attachment_8504\" aria-labelledby=\"figcaption_attachment_8504\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-8504\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5a_sample.png\" alt=\"Figure 5a. CV API Results of One Page of Contract with Placeholder Text\" width=\"850\" height=\"608\" \/><figcaption id=\"figcaption_attachment_8504\" class=\"wp-caption-text\">Figure 5a. 
CV API Results of One Page of Contract with Placeholder Text<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<p><figure id=\"attachment_8503\" aria-labelledby=\"figcaption_attachment_8503\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"aligncenter size-full wp-image-10733\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2018\/05\/5b_sample.png\" alt=\"Image 5b sample\" width=\"975\" height=\"699\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/5b_sample.png 975w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/5b_sample-300x215.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2018\/05\/5b_sample-768x551.png 768w\" sizes=\"(max-width: 975px) 100vw, 975px\" \/><figcaption id=\"figcaption_attachment_8503\" class=\"wp-caption-text\">Figure 5b. Match the Detected Handwriting Back into the Contract Text<\/figcaption><\/figure><\/p>\n<h2>Results<\/h2>\n<p>There were two main components to this project: handwriting object detection and handwriting OCR. The results on detecting handwritten words were promising. Transcribing the handwritten text was less successful and only occasionally produced useful results. Performance metrics were calculated as described below:<\/p>\n<p>Our test set comprised 71 contract pages from the same government contract data source, but not yet seen by our model. All pages had both signature and non-signature handwriting present on the image.<\/p>\n<p>For each image, we defined two groups:<\/p>\n<ul>\n<li><strong>At<\/strong> = union of pixels inside true label bounding boxes (ground truth, green squares below).<\/li>\n<li><strong>Am<\/strong> = union of pixels inside bounding boxes found by the model (red squares below).<\/li>\n<\/ul>\n<p>Then we used the union of these two groups, instead of the sum, to account for overlapping boxes. 
We defined \u2018success\u2019 for our objective as:<\/p>\n<ul>\n<li><strong>Ai<\/strong> = intersection between At and Am.<\/li>\n<\/ul>\n<p><figure id=\"attachment_8502\" aria-labelledby=\"figcaption_attachment_8502\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-8502\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/6_sample.png\" alt=\"Figure 6. Screenshot of the Contract to Compare Predicted Handwriting Box and Ground Truth\" width=\"730\" height=\"342\" \/><figcaption id=\"figcaption_attachment_8502\" class=\"wp-caption-text\">Figure 6. Screenshot of the Contract to Compare Predicted Handwriting Box and Ground Truth<\/figcaption><\/figure><\/p>\n<p>We refined the traditional <a href=\"https:\/\/www.pyimagesearch.com\/2016\/11\/07\/intersection-over-union-iou-for-object-detection\/\">intersection over union<\/a> (IOU) object detection measure for our task: we were interested in retrieving just the areas of interest within a machine-printed page, and in detecting the non-signature handwriting as precisely as possible to enable accurate OCR later in the process. We defined our results on a per-image (or per-page) basis as follows:<\/p>\n<ul>\n<li><strong>per-image recall <\/strong>= Ai \/ At, i.e. the fraction of target pixels actually covered by the model.<\/li>\n<li><strong>per-image precision<\/strong> = Ai \/ Am, i.e. the fraction of detected pixels that fall inside an actual handwriting box.<\/li>\n<\/ul>\n<p>We calculated per-image recall and precision for each category on our test set. For pages where the model detects non-signature handwriting but there is no non-signature handwriting in the ground truth at that location, or vice versa, we defined precision = 0 and recall = 0, which gives us a conservative performance measure.<\/p>\n<p>Figure 7 shows the min, max, 25% quantile, 75% quantile and median of these metrics over all the test images. 
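The per-image metrics defined above can be computed directly on pixel sets. Below is a minimal pure-Python sketch with illustrative box coordinates; our evaluation code computes the same quantities over the labeled test set.

```python
def box_pixels(xmin, ymin, xmax, ymax):
    """Set of integer pixel coordinates covered by a bounding box."""
    return {(x, y) for x in range(xmin, xmax) for y in range(ymin, ymax)}

def per_image_metrics(true_boxes, model_boxes):
    """Per-image precision = Ai / Am and recall = Ai / At, where At and Am
    are the pixel unions of the ground-truth and predicted boxes."""
    At = set().union(*(box_pixels(*b) for b in true_boxes)) if true_boxes else set()
    Am = set().union(*(box_pixels(*b) for b in model_boxes)) if model_boxes else set()
    Ai = At & Am
    precision = len(Ai) / len(Am) if Am else 0.0
    recall = len(Ai) / len(At) if At else 0.0
    return precision, recall

# Toy example: a 10x10 true box and a prediction shifted right by 5 pixels.
print(per_image_metrics([(0, 0, 10, 10)], [(5, 0, 15, 10)]))  # (0.5, 0.5)
```

Using pixel unions rather than sums keeps overlapping boxes from being counted twice, matching the definition above.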
For more than 25% of the images, the non-signature precision and recall are zero. Manual inspection shows that some of these cases are due to incorrect labeling or to noisy scan artifacts being recognized incorrectly. Additional training data could potentially improve these results.<\/p>\n<p>Our handwriting object detection results were relatively good for both the signature and non-signature classes. The performance of non-signature handwriting detection is slightly worse and more variable than that of signature handwriting detection.<\/p>\n<p><figure id=\"attachment_8501\" aria-labelledby=\"figcaption_attachment_8501\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-8501 \" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/7_results.png\" alt=\"Figure 7. Boxplot of Precision and Recall for Non-Signature and Signature Labels\" width=\"801\" height=\"641\" \/><figcaption id=\"figcaption_attachment_8501\" class=\"wp-caption-text\">Figure 7. Boxplot of Precision and Recall for Non-Signature and Signature Labels<\/figcaption><\/figure><\/p>\n<p>Figure 8 gives an example where a label is missing from the ground truth but the model detects the handwriting. In the figure, the green box represents the ground truth and the red box is the model prediction. In this case, precision and recall were defined as 0.<\/p>\n<p><figure id=\"attachment_8500\" aria-labelledby=\"figcaption_attachment_8500\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-8500\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/8_results.png\" alt=\"Figure 8. Non-signature handwriting detection example\" width=\"671\" height=\"870\" \/><figcaption id=\"figcaption_attachment_8500\" class=\"wp-caption-text\">Figure 8. 
Non-signature handwriting detection example<\/figcaption><\/figure><\/p>\n<p>Our results with the Computer Vision API handwriting OCR had limited success and revealed an area for future work and improvement. This OCR step recognized text from the targeted handwriting sections cropped out of the full contract image. The table below shows an example comparing the Computer Vision API and human OCR for the page shown in Figure 5. Following standard <a href=\"https:\/\/abbyy.technology\/en:kb:tip:ocr-accuracy\">approaches<\/a>, we used word-level accuracy, meaning that the entire word must be correctly recognized. It shows that accuracy for pure digits and easily readable handwriting is much better than for other text. We plan to update our results with the new Cognitive Services Computer Vision API OCR capabilities, which include updates to the handwriting OCR coming in the near future.<\/p>\n<p><figure id=\"attachment_8534\" aria-labelledby=\"figcaption_attachment_8534\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-8534\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/5aeb6e2d6c6f3_Screen-Shot-2018-05-03-at-3.13.51-PM.png\" alt=\"\" width=\"641\" height=\"128\" \/><figcaption id=\"figcaption_attachment_8534\" class=\"wp-caption-text\">Table 1. OCR Accuracy<\/figcaption><\/figure><\/p>\n<h2>Conclusion<\/h2>\n<p>For EY, the handwriting recognition function and the integration of handwriting OCR back into page OCR unblocked their contract search scenario, sometimes saving hours of review time on each contract. While the handwriting character recognition portion of the solution performed less well, the overall solution still improved the performance of the existing system. Alternative handwriting OCR tools and models can easily be integrated into the pipeline if exposed as APIs. 
As we mentioned, we plan to update our solution with the improved <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/computer-vision\/\">Cognitive Services Computer Vision API<\/a> OCR capabilities.<\/p>\n<p>We leveraged the <a href=\"https:\/\/aka.ms\/aml-packages\/vision\">Azure ML Package for Computer Vision<\/a>, including the <a href=\"https:\/\/github.com\/Microsoft\/VoTT\">VOTT<\/a> labelling tool, available by following the provided links. Our code, in <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\/tree\/master\/Notebooks\">Jupyter notebooks<\/a>, and a sample of the training data are available on our <a href=\"https:\/\/github.com\/CatalystCode\/Handwriting\">GitHub repository<\/a>. We invite your comments and contributions to this solution.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Extracting general concepts, rather than specific phrases, from documents and contracts is challenging. It&#8217;s even more complicated when applied to scanned documents containing handwritten annotations. We describe using object detection and OCR with Azure ML Package for Computer Vision and Cognitive Services API. <\/p>\n","protected":false},"author":21411,"featured_media":10734,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14,19],"tags":[86,124,170,176,228,280,381],"class_list":["post-8487","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cognitive-services","category-machine-learning","tag-azure-ml-package-for-computer-vision","tag-cognitive-services","tag-entity-extraction","tag-faster-r-cnn","tag-knowledge-extraction","tag-ocr","tag-vott"],"acf":[],"blog_post_summary":"<p>Extracting general concepts, rather than specific phrases, from documents and contracts is challenging. 
It&#8217;s even more complicated when applied to scanned documents containing handwritten annotations. We describe using object detection and OCR with Azure ML Package for Computer Vision and Cognitive Services API. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/8487","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21411"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=8487"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/8487\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/10734"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=8487"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=8487"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=8487"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}