{"id":4466,"date":"2017-07-05T11:00:01","date_gmt":"2017-07-05T18:00:01","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/?p=4466"},"modified":"2021-03-25T09:55:01","modified_gmt":"2021-03-25T16:55:01","slug":"imageregognitionandclassificationmlforautomating-receipt-processing","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/imageregognitionandclassificationmlforautomating-receipt-processing\/","title":{"rendered":"Automating Receipt Processing"},"content":{"rendered":"<p>Claiming expenses is usually a manual process. This project aims to improve the efficiency of receipt processing by looking into ways to automate this process.\u00a0<\/p>\n<p>This code story describes how we created a skeletal framework to achieve the following:<\/p>\n<ol style=\"margin: 0px 0px 15px 30px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">\n<li>Classify the type of expense<\/li>\n<li>Extract the amount spent<\/li>\n<li>Extract the retailer (our example is limited to the most common retailers)<\/li>\n<\/ol>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">We found a few challenges in addressing these goals. For instance, the quality of an Optical Character Recognizer (OCR) is crucial to tasks like accurately extracting the information of interest and modeling text-based classifiers (e.g., the expense category classifier). 
In addition, we discovered some retailers use\u00a0logos instead of text for their names, which makes the extraction process more complex.<\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\"><!--more--><\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">The figure below shows the stages of addressing the goals of the framework, as well as the aforementioned challenges.<\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\"><img decoding=\"async\" class=\"alignnone wp-image-11254 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/process-1024x476.png\" alt=\"Image process\" width=\"640\" height=\"298\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/process-1024x476.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/process-300x140.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/process-768x357.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/process-1536x715.png 1536w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/process.png 1939w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; 
font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">First, a receipt is captured via a camera. Next, that image is passed to both the <strong>Logo Recognizer<\/strong> and the <strong>Text Line Localizer<\/strong>; the Localizer's outputs (smaller chunks of text) are then passed on to the <strong>Optical Character Recognizer (OCR)<\/strong> in the <strong>Text Extractor<\/strong>.<\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">The output of an OCR is a string of characters, which is then simultaneously passed to the\u00a0<strong>Text-based Retailer Recognizer<\/strong>,\u00a0<strong>Expense Category Recognizer<\/strong>, and\u00a0<strong>Total Amount Extractor<\/strong>. 
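The end-to-end flow just described can be sketched in Python, with every module stubbed out as a placeholder; all function names and hard-coded outputs below are illustrative, not the project's actual code:

```python
import re

# Skeletal sketch of the receipt-processing dataflow described above.
# Every function is an illustrative stub; the real modules are web services.

def recognize_logo(image_bytes):
    # Would call an image classifier such as Custom Vision.
    return {"retailer": "walmart", "probability": 0.80}

def localize_text_lines(image_bytes):
    # Would run the text line localizer to crop the image into line sub-images.
    return [b"line-crop-1", b"line-crop-2"]

def run_ocr(sub_images):
    # Would call Tesseract or Microsoft Computer Vision OCR on each sub-image.
    return "WALMART\nTOTAL 12.99"

def recognize_retailer_from_text(text):
    found = "WALMART" in text
    return {"retailer": "walmart" if found else "unknown",
            "probability": 0.90 if found else 0.10}

def recognize_expense_category(text):
    return "groceries"  # placeholder for the trained text classifier

def extract_total_amount(text):
    # Naive: take the largest number formatted like a monetary value.
    amounts = [float(m) for m in re.findall(r"\d+\.\d{2}", text)]
    return max(amounts) if amounts else None

def process_receipt(image_bytes):
    logo = recognize_logo(image_bytes)                # runs on the raw image
    text = run_ocr(localize_text_lines(image_bytes))  # localize, then OCR
    text_based = recognize_retailer_from_text(text)
    # The selector naively keeps the more confident retailer prediction.
    retailer = max([logo, text_based], key=lambda r: r["probability"])
    return {"retailer": retailer["retailer"],
            "category": recognize_expense_category(text),
            "total": extract_total_amount(text)}
```

In the post's architecture these calls are distributed across web services and Azure Functions rather than composed in a single process.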
The\u00a0<strong>Retailer Recognizer<\/strong>\u00a0consists of two components: the\u00a0<strong>Logo Recognizer <\/strong>and the\u00a0<strong>Text-based Retailer Recognizer<\/strong>.<\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">One of the predicted results from both the\u00a0<strong>Logo Recognizer <\/strong>and the\u00a0<strong>Text-based Retailer Recognizer <\/strong>is then selected.<\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">The functions, methodologies, and <a href=\"https:\/\/github.com\/ryubidragonfire\/automate-receipt-processing\">code<\/a> used for each module are as follows:<\/p>\n<h2>Text Extractor<\/h2>\n<p>The purpose of an <strong>OCR<\/strong> is to extract text out of an image. Here, we have experimented with\u00a0<a href=\"https:\/\/azure.microsoft.com\/en-gb\/services\/cognitive-services\/computer-vision\/\">Microsoft Computer Vision OCR<\/a>, and open-source\u00a0<a href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\/wiki\">Tesseract OCR<\/a>\u00a0(online\u00a0<a href=\"http:\/\/tesseract.projectnaptha.com\/\">demo<\/a>). Both support multiple languages. There are other\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Comparison_of_optical_character_recognition_software\">OCRs<\/a>\u00a0out there which are mostly licensed. 
The table below shows\u00a0example output from both OCRs.<\/p>\n<table>\n<thead>\n<tr>\n<th>Receipt<\/th>\n<th>Microsoft-OCR<\/th>\n<th>Tesseract-OCR<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><img decoding=\"async\" class=\"alignnone size-medium wp-image-11246\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-169x300.jpg\" alt=\"Image receipt coop\" width=\"169\" height=\"300\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-169x300.jpg 169w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop.jpg 540w\" sizes=\"(max-width: 169px) 100vw, 169px\" \/><\/td>\n<td><img decoding=\"async\" class=\"alignnone size-medium wp-image-11244\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-microsoft-ocr-126x300.jpg\" alt=\"Image receipt coop microsoft ocr\" width=\"126\" height=\"300\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-microsoft-ocr-126x300.jpg 126w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-microsoft-ocr.jpg 387w\" sizes=\"(max-width: 126px) 100vw, 126px\" \/><\/td>\n<td><img decoding=\"async\" class=\"alignnone size-medium wp-image-11245\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-tesseract-ocr-206x300.jpg\" alt=\"Image receipt coop tesseract ocr\" width=\"206\" height=\"300\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-tesseract-ocr-206x300.jpg 206w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-coop-tesseract-ocr.jpg 618w\" sizes=\"(max-width: 206px) 100vw, 206px\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><strong>Text Line Localizer<\/strong><\/h3>\n<p><a 
href=\"https:\/\/arxiv.org\/abs\/1609.03605\">Connectionist Text Proposal Network (CTPN)<\/a>, together with its\u00a0<a href=\"https:\/\/github.com\/tianzhi0549\/CTPN\">authors\u2019 Caffe implementation<\/a>, is used to break the whole image into smaller sub-images based on the existence of text. In other words, it locates lines of text in a natural image. It is a deep learning approach based on both recurrent and convolutional neural networks. Ideally, breaking an image up into smaller regions before passing them into an OCR helps to boost the OCR&#8217;s performance. In a small set of samples, we found performance gains when feeding sub-images, rather than the whole image, into an OCR. The figure below shows the comparison with and without the\u00a0<strong>Text Line Localizer<\/strong>, in combination with\u00a0<a href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\/wiki\">Tesseract-OCR<\/a>.<\/p>\n<table>\n<thead>\n<tr>\n<th>Receipt<\/th>\n<th>Tesseract-OCR<\/th>\n<th>CTPN + Tesseract-OCR Text Localization<\/th>\n<th>CTPN + Tesseract-OCR Extracted Text<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><img decoding=\"async\" class=\"alignnone wp-image-5171 size-medium\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-235x300.jpg\" alt=\"\" width=\"235\" height=\"300\" \/><\/td>\n<td>\n<pre><span lang=\"FI\"><span style=\"color: #000000; font-family: Calibri;\">tlttl vuxen 200.000r<\/span><\/span>\r\n<span lang=\"FI\"><span style=\"color: #000000; font-family: Calibri;\">ARLANDA m 9a-rtomrttrSlOg-<\/span><\/span>\r\n<span lang=\"FI\"><span style=\"color: #000000; font-family: Calibri;\">_ 20N1<\/span><\/span>\r\n<span lang=\"FI\"><span style=\"color: #000000; font-family: Calibri;\">'CT, I Zen 1 \" Arhnda 0<\/span><\/span>\r\n<span lang=\"FI\"><span style=\"color: #000000; font-family: Calibri;\">Kontant<\/span><\/span><\/pre>\n<\/td>\n<td><img decoding=\"async\" class=\"alignnone size-medium wp-image-5172\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-chuncked-235x300.jpg\" alt=\"\" width=\"235\" height=\"300\" \/><\/td>\n<td><img decoding=\"async\" class=\"alignnone size-medium wp-image-5173\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/receipt-chuncked-extracted-text-300x267.jpg\" alt=\"\" width=\"300\" height=\"267\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The authors claim that the algorithm works reliably on multi-scale and multi-language text without further post-processing, and that it is computationally efficient. 
We have made available\u00a0<a href=\"https:\/\/github.com\/ryubidragonfire\/automate-receipt-processing\/tree\/master\/CTPN\">the code for deploying this Caffe model in the Windows environment<\/a>, specifically in an\u00a0<a href=\"https:\/\/azuremarketplace.microsoft.com\/en-us\/marketplace\/apps\/microsoft-ads.standard-data-science-vm\">Azure Data Science Virtual Machine<\/a>.<\/p>\n<p>If you are interested in using both the\u00a0<strong>Text Line Localizer<\/strong>\u00a0and an\u00a0<strong>OCR<\/strong>\u00a0(in this example,\u00a0<a href=\"https:\/\/github.com\/tesseract-ocr\/tesseract\/wiki\">Tesseract-OCR<\/a>) in a sequential manner, wrapped in a web API, please refer to\u00a0<a href=\"https:\/\/ryubidragonfire.github.io\/blogs\/2017\/06\/06\/TODO\/eero\"><em>Node.js web server with ML model<\/em><\/a>.<\/p>\n<h2>Information of Interest Extractor<\/h2>\n<p>This consists of a Retailer Recognizer, an Expense Type Recognizer, and a Total Amount Extractor.<\/p>\n<h3>Retailer Recognizer (Logo Recognizer)<\/h3>\n<p>In this example, we use Microsoft Cognitive Services&#8217;\u00a0<a href=\"https:\/\/customvision.ai\">Custom Vision<\/a>\u00a0Service to build a custom model for recognising retailers based on the look of a receipt and\/or a logo. Custom Vision allows you to easily customize your own computer vision models to fit your unique use case, and requires only a few dozen labeled sample images for each class. In our example, we trained a model using entire receipt images. It is possible to provide only a specific region during training; for instance, if the logo usually appears at the top of the receipt, then we can use only that part for model-building. When it comes to prediction, either just the top region\u00a0or the whole receipt can be fed into the predictor. 
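As an illustration of consuming the predictor's output, the sketch below reduces a Custom Vision-style classification response to a single label. The JSON shape (a `predictions` list of `tagName`/`probability` pairs) is an assumption for illustration and may differ between API versions:

```python
# Reduce a Custom Vision-style classification response to a best-guess label.
# The payload shape below is an assumed example, not a guaranteed API contract.

def top_prediction(response_json, threshold=0.5):
    predictions = response_json.get("predictions", [])
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["probability"])
    # Refusing to guess below a threshold is one cheap guard against
    # confidently mislabelling receipts from unknown retailers.
    return best["tagName"] if best["probability"] >= threshold else None

sample = {"predictions": [{"tagName": "walmart", "probability": 0.97},
                          {"tagName": "asda", "probability": 0.02}]}
```

A rejection threshold like this complements, rather than replaces, training an explicit *others* class.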
The figure below shows the overall performance and some experimentation results with various classes and the respective number of samples uploaded to\u00a0Custom Vision \u2014 namely, rail (33),\u00a0bandq\u00a0(14),\u00a0pizzaexpress\u00a0(18),\u00a0walmart\u00a0(34) and\u00a0asda\u00a0(26).<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-11257 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/performance.jpg\" alt=\"Image performance\" width=\"910\" height=\"1002\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/performance.jpg 910w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/performance-272x300.jpg 272w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/performance-768x846.jpg 768w\" sizes=\"(max-width: 910px) 100vw, 910px\" \/><\/p>\n<table>\n<thead>\n<tr>\n<th><\/th>\n<th>Example results<\/th>\n<th><\/th>\n<th><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/rail1-test.jpg\" alt=\"jpg: rail1-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/rail2-test.jpg\" alt=\"jpg: rail2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/rail3-test.jpg\" alt=\"jpg: rail3-test\" \/><\/td>\n<\/tr>\n<tr>\n<td>2.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/bandq1-test.jpg\" alt=\"jpg: bandq1-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/bandq2-test.jpg\" alt=\"jpg: bandq2-test\" \/><\/td>\n<td><img decoding=\"async\" 
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/bandq3-test.jpg\" alt=\"jpg: bandq3-test.jpg\" \/><\/td>\n<\/tr>\n<tr>\n<td>3.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/pizzaexpress1-test.jpg\" alt=\"jpg: pizzaexpress1-test.jpg\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/pizzaexpress2-test.jpg\" alt=\"jpg: pizzaexpress2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/pizzaexpress3-test.jpg\" alt=\"jpg: pizzaexpress3-test.jpg\" \/><\/td>\n<\/tr>\n<tr>\n<td>4.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/walmart1-test.jpg\" alt=\"jpg: walmart1-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/walmart2-test.jpg\" alt=\"jpg: walmart2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/walmart3-test.jpg\" alt=\"jpg: walmart3-test\" \/><\/td>\n<\/tr>\n<tr>\n<td>5.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/asda1-test.jpg\" alt=\"jpg: asda1-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/asda2-test.jpg\" alt=\"jpg: asda2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/asda3-test.jpg\" alt=\"jpg: asda3-test\" \/><\/td>\n<\/tr>\n<tr>\n<td>6.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt1-test.jpg\" alt=\"jpg: 
nonreceipt1-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt2-test.jpg\" alt=\"jpg: nonreceipt2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt3-test.jpg\" alt=\"jpg: nonreceipt3-test\" \/><\/td>\n<\/tr>\n<tr>\n<td>7.<\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt4-test.jpg\" alt=\"jpg: nonreceipt4-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt5-test.jpg\" alt=\"jpg: nonreceipt5-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/nonreceipt6-test.jpg\" alt=\"jpg: nonreceipt6-test\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The table above shows some example results. The model classifies receipts from known retailers correctly most of the time, and it can distinguish a receipt from a non-receipt, such as a zebra (see row 6). Note the confident probability scores shown. In rows 2 and 4, there is a test image with multiple\u00a0<em>bandq<\/em>\u00a0receipts and a test image with multiple\u00a0<em>walmart<\/em>\u00a0receipts, respectively. The multiple Walmart receipts in one image are correctly classified as\u00a0<em>walmart<\/em>, whilst the group of B&amp;Q receipts in one image is not correctly classified. These samples simply show how the model behaves in corner cases like this; in practice, restrictions such as limiting input to a single receipt at a time can help. Row 7 shows receipts from retailers of which the model has no knowledge. 
Unfortunately, the classifier is confused when tested on a receipt that does not belong to any of the known classes.<\/p>\n<p>To address the issue above, try adding a class called\u00a0<em>others<\/em>, which can be a collection of receipts that have no logos, of receipts that are not from the intended 5 classes, or of both. Choosing between these definitions will depend on application-specific requirements.<\/p>\n<p>The figure below shows the performance of a different model which has the class\u00a0<em>others<\/em>\u00a0incorporated.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-11258 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/performance-others.jpg\" alt=\"Image performance others\" width=\"606\" height=\"691\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/performance-others.jpg 606w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/03\/performance-others-263x300.jpg 263w\" sizes=\"(max-width: 606px) 100vw, 606px\" \/><\/p>\n<p>In this example, 76 samples were uploaded to\u00a0<a href=\"https:\/\/www.customvision.ai\/\">Custom Vision<\/a>\u00a0for the model building.<\/p>\n<table>\n<thead>\n<tr>\n<th>Example results<\/th>\n<th><\/th>\n<th><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/others2-test.jpg\" alt=\"jpg: others2-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/others3-test.jpg\" alt=\"jpg: others3-test\" \/><\/td>\n<td><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/others1-test.jpg\" alt=\"jpg: others1-test\" \/><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note the model\u2019s confidence in ruling out receipts without logos as belonging to one of the 5 classes of receipts with logos. The first and third receipts are confidently classified as\u00a0<em>others<\/em>. However, consider the test image in the middle: while it is predicted as\u00a0<em>others<\/em>\u00a0with the highest probability, the score for\u00a0<em>pizzaexpress<\/em>\u00a0is rather high too. This confusion could perhaps be mitigated with an increased variety of training samples for the class\u00a0<em>others<\/em>.<\/p>\n<h3>Text-based Retailer Recognizer<\/h3>\n<p>In contrast to the\u00a0<strong>Logo Recognizer<\/strong>, this module recognises the retailer by the text extracted from the receipt. As with the\u00a0<strong>Logo Recognizer<\/strong>, instead of using the whole receipt, it is possible to use just a certain portion of the extracted text to train a text-based retailer recognizer. This is essentially a text classification problem, similar to the\u00a0<strong>Expense Recognizer<\/strong>.<\/p>\n<h3>Selector<\/h3>\n<p>In this example, we naively select a result from either the\u00a0<strong>Logo Recognizer<\/strong>\u00a0or the\u00a0<strong>Text-Based Retailer Recognizer<\/strong>\u00a0(whichever has the highest probability). There are better ways to select a more confident result, such as weighted or unweighted combinations, fuzzy methods, linear or non-linear combinations, intuition-based rules, and so on. While this topic is beyond the scope of this post, it becomes increasingly important when more classifiers are applied in an ensemble manner. <a href=\"https:\/\/www.toptal.com\/machine-learning\/ensemble-methods-machine-learning\">This lightweight, high-level\u00a0article<\/a>\u00a0touches on some basic methods.<\/p>\n<h3>Expense Recognizer<\/h3>\n<p>Its purpose is to recognize the type of expense, like accommodation, meal, transport, etc. This is based on text occurring within the receipt. 
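As a toy illustration of text-based expense classification: the keyword lists below are invented, and the post's actual recognizer is a trained model rather than hand-written rules.

```python
# Toy keyword-count classifier for expense categories. Purely illustrative:
# the Expense Recognizer in this post is a model trained in Azure ML Studio.

CATEGORY_KEYWORDS = {
    "meal": ["pizza", "restaurant", "menu"],
    "transport": ["rail", "ticket", "fare"],
    "accommodation": ["hotel", "room", "night"],
}

def recognize_expense_category(receipt_text):
    text = receipt_text.lower()
    # Score each category by how many of its keywords occur in the OCR text.
    scores = {category: sum(word in text for word in keywords)
              for category, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

The same pattern, applied to retailer names instead of category keywords, sketches the Text-based Retailer Recognizer as well.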
It is trained within\u00a0<a href=\"https:\/\/studio.azureml.net\/\">Azure ML Studio<\/a>\u00a0using the typical process of text classification. In this example, the\u00a0<strong>Expense Recognizer<\/strong>\u00a0is based on\u00a0<a href=\"https:\/\/ryubidragonfire.github.io\/blogs\/2017\/01\/02\/expense-recognition.html\"><em>Using Microsoft Cognitive Services within Azure ML Studio to Predict Expense Type from Receipts<\/em><\/a>. While this is a simplistic approach, many variations of text processing can be applied. Again, a discussion of the best approaches or algorithms is beyond the scope of this post. This approach is similar to that of the\u00a0<strong>Text-based Retailer Recognizer<\/strong>.<\/p>\n<h3>Total Amount Extractor<\/h3>\n<p>At the time of writing, this extractor naively extracts numbers that match the format of a monetary value and assumes the largest value is the total amount spent. Most of the time this works well with credit card purchases, but it can fail on cash purchases, where the amount tendered may exceed the total.<\/p>\n<p>These modules are tied together via Azure Functions, Azure Storage and Azure Queue. The figure below shows the architectural diagram. 
The article\u00a0<a href=\"http:\/\/blog.codemoggy.com\/index.php\/2017\/06\/20\/using-azure-functions-to-enable-ocr-processing-of-images\/\"><em>Using Azure Functions to enable processing of Receipt Images with OCR<\/em><\/a>\u00a0highlights the end-to-end\u00a0implementation\u00a0and the benefits of using such an architecture.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/595d084a97ac3_architecture.png\" alt=\"png: architecture\" \/><\/p>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">All the modules within the\u00a0Information of Interest Extractor\u00a0are implemented within\u00a0<a href=\"http:\/\/studio.azureml.net\">Azure ML Studio<\/a>\u00a0for easy deployment as a web service, except for the\u00a0Logo Recognizer, whose model is built with\u00a0Custom Vision\u00a0and exposed as its own web service. This API is then called from within\u00a0Azure ML Studio. 
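For reference, custom code in a classic Azure ML Studio experiment runs inside an Execute Python Script module, whose entry point receives up to two pandas DataFrames and returns a tuple of DataFrames. A minimal sketch, in which the column names and placeholder logic are invented:

```python
import pandas as pd

# Minimal Execute Python Script entry point for classic Azure ML Studio.
# Column names and the placeholder logic are invented for illustration.

def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    # Placeholder for real work, e.g. calling the Custom Vision web service
    # for each receipt and attaching the predicted retailer.
    df["retailer"] = "unknown"
    return (df,)
```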
Implementation of all the modules stated in the Information of Interest Extractor\u00a0can be used within the Execute Python Script\u00a0module as part of an Azure ML Studio experiment.<\/p>\n<h2 id=\"what-we-have-tried-but-need-further-work\" style=\"margin: 0px 0px 15px; padding: 0px; font-weight: 400; font-size: 32px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-style: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">Further Work<\/h2>\n<p>Here are some other things we tried that need further work.<\/p>\n<h3 id=\"fast-r-cnn-cntk-implementation\" style=\"margin: 0px 0px 15px; padding: 0px; font-weight: 400; font-size: 26px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-style: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">Fast-R-CNN (CNTK implementation)<\/h3>\n<p>We found that increasing the reliability of the OCR, and hence the quality of the extracted text, may help to improve the information extraction process.\u00a0<a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">CNTK\u2019s Fast-R-CNN<\/a>\u00a0was initially tested in an attempt to create both word- and character-level recognition. Note that there is a\u00a0<a href=\"https:\/\/github.com\/rbgirshick\/py-faster-rcnn\"><strong>Faster-R-CNN<\/strong>\u00a0<\/a>with a similar function.<\/p>\n<h4>A note on Selective Search<\/h4>\n<p style=\"margin: 0px 0px 15px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">At the time of testing (May 2017), we found that the first part of the algorithm, Selective Search (SS),\u00a0is suitable for images of natural scenes with rich colour and of a certain minimum size. 
However, it struggles to propose new Regions of Interest (ROI) when images mainly consist of black characters on a white-ish background (that is, images that lack colour richness and contain complex fine patterns in small regions). There are possible alternatives to Selective Search. The simplest is naively shifting the labelled bounding box to the top, bottom, left, or right by a small amount; this generates multiple regions of interest before feeding them into the network.<\/p>\n<h2>Conclusion<\/h2>\n<p>Here, we have shown an example of how receipt processing can be automated using a combination of text extraction and image recognition techniques, and discussed some of the challenges involved. If you have any thoughts, please share them with us in the comments below.<\/p>\n<h2 id=\"references\" style=\"margin: 0px 0px 15px; padding: 0px; font-weight: 400; font-size: 32px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-style: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">References<\/h2>\n<ul style=\"margin: 0px 0px 15px 30px; padding: 0px; color: #111111; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; background-color: #fdfdfd;\">\n<li><a href=\"https:\/\/arxiv.org\/abs\/1609.03605\">Detecting Text in Natural Image with Connectionist Text Proposal Network<\/a>\n<ul>\n<li><a href=\"https:\/\/ryubidragonfire.github.io\/blogs\/2017\/06\/06\/Automating-Receipt-Processing.html\">Implementation in Caffe<\/a><\/li>\n<li><a href=\"http:\/\/textdet.com\/\">Demo<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\">CNTK<\/a>\u00a0implementation of\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1504.08083\">Fast-R-CNN<\/a>\n<ul>\n<li><a 
href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">Wiki<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">Example usage<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/Microsoft\/CNTK\/wiki\/Object-Detection-using-Fast-R-CNN\">More detailed version<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Claiming expenses is usually a manual process. This project aims to improve the efficiency of receipt processing by looking into ways to automate this process.\u00a0 This code story describes how we created a skeletal framework to achieve the following: Classify the type of expense Extract the amount spent Extract the retailer (our example is limited [&hellip;]<\/p>\n","protected":false},"author":21356,"featured_media":11254,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14,16,19],"tags":[81,139,239,250],"class_list":["post-4466","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cognitive-services","category-devops","category-machine-learning","tag-azure-machine-learning-ml-studio","tag-custom-vision-service","tag-machine-learning-ml","tag-microsoft-cognitive-services"],"acf":[],"blog_post_summary":"<p>Claiming expenses is usually a manual process. 
This project aims to improve the efficiency of receipt processing by looking into ways to automate this process.\u00a0 This code story describes how we created a skeletal framework to achieve the following: Classify the type of expense Extract the amount spent Extract the retailer (our example is limited [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/4466","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21356"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=4466"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/4466\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/11254"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=4466"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=4466"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=4466"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}