{"id":4886,"date":"2017-11-20T13:17:25","date_gmt":"2017-11-20T21:17:25","guid":{"rendered":"https:\/\/www.microsoft.com\/developerblog\/?p=4886"},"modified":"2020-03-14T19:25:38","modified_gmt":"2020-03-15T02:25:38","slug":"opener-permissively-licensed-named-entity-recognition-on-the-jvm","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/opener-permissively-licensed-named-entity-recognition-on-the-jvm\/","title":{"rendered":"Permissively-Licensed Named Entity Recognition on the JVM"},"content":{"rendered":"<p><a href=\"http:\/\/aka.ms\/fortis\">Fortis<\/a>\u00a0is an open source\u00a0social data ingestion, analysis, and visualization platform built on Scala and Apache Spark. The tool is developed in collaboration with the\u00a0<a href=\"https:\/\/www.unocha.org\/\">United Nations Office for the Coordination of Humanitarian Affairs<\/a>\u00a0(UN OCHA) to provide insights into crisis events as they occur, through the lens of social media.<\/p>\n<p>A key part of the Fortis platform is the ability to search events (such as Tweets or news articles) for key figures or places of interest. To increase the accuracy and index quality of this search, Fortis uses <a href=\"https:\/\/en.wikipedia.org\/wiki\/Named-entity_recognition\">named entity recognition<\/a> to differentiate between normal content words and special entities like organizations, people, or locations. This code story explains how Fortis integrated named entity recognition using Spark Streaming and Scala, the challenges faced with this approach and with running named entity recognition on the Java Virtual Machine (JVM), and our solution based on Docker containers and Azure Web Apps for Linux.<!--more--><\/p>\n<h2>The state of open source named entity recognition on the JVM<\/h2>\n<p>Several well-known packages in the Java ecosystem offer natural language processing and named entity recognition capabilities; the table below lists some of them. 
However, many of these projects are either not licensed under terms acceptable for the MIT-licensed Fortis project, or target only a few languages. Some only offer generic named entity recognition such as \u201cthis is an entity\u201d as opposed to more granular details like \u201cthis is a place\u201d or \u201cthis is a person.\u201d<\/p>\n<table width=\"618\">\n<tbody>\n<tr>\n<td width=\"160\"><strong>Project<\/strong><\/td>\n<td width=\"175\"><strong>Languages<\/strong><\/td>\n<td width=\"90\"><strong>License<\/strong><\/td>\n<td width=\"150\"><strong>Disambiguation<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"http:\/\/services.gate.ac.uk\/annie\/\">Annie<\/a><\/td>\n<td width=\"252\">English<\/td>\n<td width=\"84\">GPL<\/td>\n<td width=\"114\">Yes<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/entity-linking-intelligence-service\/\">Azure Cognitive Services<\/a><\/td>\n<td width=\"252\">English<\/td>\n<td width=\"84\">Azure<\/td>\n<td width=\"114\">No<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"http:\/\/nlp.cs.berkeley.edu\/projects\/entity.shtml\">Berkeley<\/a><\/td>\n<td width=\"252\">English<\/td>\n<td width=\"84\">GPL<\/td>\n<td width=\"114\">Yes<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"https:\/\/github.com\/dlwh\/epic\">Epic<\/a><\/td>\n<td width=\"252\">English<\/td>\n<td width=\"84\">Apache v2<\/td>\n<td width=\"114\">Yes<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"http:\/\/factorie.cs.umass.edu\/usersguide\/UsersGuide200QuickStart.html\">Factorie<\/a><\/td>\n<td width=\"252\">English<\/td>\n<td width=\"84\">Apache v2<\/td>\n<td width=\"114\">Yes<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a href=\"https:\/\/opennlp.apache.org\/documentation\/1.7.2\/manual\/opennlp.html#tools.namefind.recognition\">OpenNLP<\/a><\/td>\n<td width=\"252\">English, Spanish, Dutch<\/td>\n<td width=\"84\">Apache v2<\/td>\n<td width=\"114\">No<\/td>\n<\/tr>\n<tr>\n<td width=\"168\"><a 
href=\"https:\/\/nlp.stanford.edu\/software\/CRF-NER.shtml\">Stanford<\/a><\/td>\n<td width=\"252\">English, German, Spanish, Chinese<\/td>\n<td width=\"84\">GPL<\/td>\n<td width=\"114\">Yes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <a href=\"http:\/\/www.opener-project.eu\/\">OpeNER<\/a> project, created by the European Union together with a consortium of research universities and industry partners, stands out as a key package for Fortis use: OpeNER offers named entity recognition in many languages (English, French, German, Spanish, Italian, Dutch) and is licensed under the Apache v2 license, which makes it easy to integrate into an existing open source project.<\/p>\n<h2>Named entity recognition via OpeNER<\/h2>\n<p>OpeNER is based on a simple pipeline model in which text is analyzed by a sequence of models, each step augmenting the source text with additional information that is used by subsequent steps. The pipeline model is illustrated in the figure below.<\/p>\n<p> <img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2017\/11\/nlp-codestory-opener-pipeline.png\" alt=\"Image nlp codestory opener pipeline\" width=\"851\" height=\"562\" class=\"aligncenter size-full wp-image-10857\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/11\/nlp-codestory-opener-pipeline.png 851w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/11\/nlp-codestory-opener-pipeline-300x198.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2017\/11\/nlp-codestory-opener-pipeline-768x507.png 768w\" sizes=\"(max-width: 851px) 100vw, 851px\" \/><\/p>\n<p>This pipeline can be added to a project via Maven and may be consumed from a JVM language such as Scala. 
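<\/p>\n<p>For example, the pipeline stages can be declared as dependencies in an sbt build along these lines (the group and artifact coordinates follow the IXA pipes project used by the imports in the sample below; the version numbers are placeholders to replace with the current releases from Maven Central):<\/p>\n<pre class=\"lang:scala decode:true\">\/\/ illustrative sbt dependencies for the OpeNER\/IXA pipeline stages\r\n\/\/ (replace \"x.y.z\" with the current releases from Maven Central)\r\nlibraryDependencies ++= Seq(\r\n  \"eus.ixa\" % \"ixa-pipe-tok\"  % \"x.y.z\",\r\n  \"eus.ixa\" % \"ixa-pipe-pos\"  % \"x.y.z\",\r\n  \"eus.ixa\" % \"ixa-pipe-nerc\" % \"x.y.z\"\r\n)\r\n<\/pre>\n<p>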
The sample code below illustrates the analysis of an input text using the pipeline (see also the <a href=\"https:\/\/github.com\/CatalystCode\/project-fortis-spark\/blob\/bd2b124d908ed7ede819739b53efdada2b2cf43b\/src\/main\/scala\/com\/microsoft\/partnercatalyst\/fortis\/spark\/transforms\/nlp\/OpeNER.scala\">production code<\/a> in the Fortis project). However, note that in the context of a Spark application the approach has four main limitations which will be discussed in the next section and addressed later in this code story.<\/p>\n<pre class=\"lang:scala decode:true\">\/\/ imports for the sample code below\r\nimport java.io.{BufferedReader, ByteArrayInputStream, File, InputStreamReader}\r\nimport java.util.Properties\r\nimport eus.ixa.ixa.pipe.nerc.{Annotate =&gt; NerAnnotate}\r\nimport eus.ixa.ixa.pipe.pos.{Annotate =&gt; PosAnnotate}\r\nimport eus.ixa.ixa.pipe.tok.{Annotate =&gt; TokAnnotate}\r\nimport ixa.kaflib.KAFDocument\r\n\r\n\/\/ insert here the text from which to extract entities\r\nval text = \"...\"\r\n\r\n\/\/ language of the text, if unknown, can be inferred\r\n\/\/ for example via Cognitive Services http:\/\/aka.ms\/detect-language\r\nval language = \"...\"\r\n\r\n\/\/ path where OpeNER models are stored on disk\r\nval resourcesDirectory = \"...\"\r\n\r\n\/\/ do language processing, incrementally building up an annotated\r\n\/\/ document in the standard \"NLP Annotation Format\" style https:\/\/aka.ms\/naf\r\nval kaf = new KAFDocument(language, \"v1.naf\")\r\ndoTokenization(resourcesDirectory, language, kaf)\r\ndoPartOfSpeechTagging(resourcesDirectory, language, kaf)\r\ndoNamedEntityRecognition(resourcesDirectory, language, kaf)\r\n\r\n\/\/ code for the helper functions used above\r\ndef doTokenization(resourcesDirectory: String, language: String, kaf: KAFDocument): Unit = {\r\n  val properties = new Properties\r\n  properties.setProperty(\"language\", language)\r\n  properties.setProperty(\"resourcesDirectory\", resourcesDirectory)\r\n  
properties.setProperty(\"normalize\", \"default\")\r\n  properties.setProperty(\"untokenizable\", \"no\")\r\n  properties.setProperty(\"hardParagraph\", \"no\")\r\n\r\n  val input = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(text.getBytes(\"UTF-8\"))))\r\n  new TokAnnotate(input, properties).tokenizeToKAF(kaf)\r\n}\r\n\r\ndef doPartOfSpeechTagging(resourcesDirectory: String, language: String, kaf: KAFDocument): Unit = {\r\n  val properties = new Properties\r\n  properties.setProperty(\"language\", language)\r\n  properties.setProperty(\"model\", new File(resourcesDirectory, s\"$language-pos.bin\").getAbsolutePath)\r\n  properties.setProperty(\"lemmatizerModel\", new File(resourcesDirectory, s\"$language-lemmatizer.bin\").getAbsolutePath)\r\n  properties.setProperty(\"resourcesDirectory\", resourcesDirectory)\r\n  properties.setProperty(\"multiwords\", \"false\")\r\n  properties.setProperty(\"dictag\", \"false\")\r\n  properties.setProperty(\"useModelCache\", \"true\")\r\n\r\n  new PosAnnotate(properties).annotatePOSToKAF(kaf)\r\n}\r\n\r\ndef doNamedEntityRecognition(resourcesDirectory: String, language: String, kaf: KAFDocument): Unit = {\r\n  val properties = new Properties\r\n  properties.setProperty(\"language\", language)\r\n  properties.setProperty(\"model\", new File(resourcesDirectory, s\"$language-nerc.bin\").getAbsolutePath)\r\n  properties.setProperty(\"ruleBasedOption\", \"off\")\r\n  properties.setProperty(\"dictTag\", \"off\")\r\n  properties.setProperty(\"dictPath\", \"off\")\r\n  properties.setProperty(\"clearFeatures\", \"no\")\r\n  properties.setProperty(\"useModelCache\", \"true\")\r\n\r\n  new NerAnnotate(properties).annotateNEs(kaf)\r\n}\r\n<\/pre>\n<p>After running text through the pipeline, it is now possible to extract entities from the annotated pipeline output. 
OpeNER supports eight\u00a0types of entities ranging from concrete real-world entities such as &#8220;person&#8221;, &#8220;geopolitical entity (GPE)&#8221; or &#8220;location&#8221; to more abstract concepts like &#8220;date&#8221;, &#8220;time&#8221; and &#8220;money.&#8221;<\/p>\n<pre class=\"lang:scala decode:true \">\/\/ imports for the sample code below\r\nimport scala.collection.JavaConversions._\r\nimport ixa.kaflib.Entity\r\n\r\n\/\/ find the entities in the text annotated by the OpeNER pipeline\r\nval entities = kaf.getEntities.toList\r\n\r\n\/\/ here is an example of how to access place and person entities\r\n\/\/ the full list of entities can be found at https:\/\/aka.ms\/opener-entities\r\nval places = entities.filter(entityIs(_, Set(\"location\", \"gpe\")))\r\nval people = entities.filter(entityIs(_, Set(\"person\")))\r\n\r\n\/\/ code for the helper functions used above\r\ndef entityIs(entity: Entity, types: Set[String]): Boolean = {\r\n  val entityType = Option(entity.getType).getOrElse(\"\").toLowerCase\r\n  types.contains(entityType)\r\n}\r\n<\/pre>\n<h2>Simplifying the integration with OpeNER via Docker and Azure Web Apps for Linux<\/h2>\n<p>In a Spark application context, several issues exist in the approach outlined above:<\/p>\n<ul>\n<li>The model binaries must be managed and deployed to every Spark worker node<\/li>\n<li>Loading the models from disk is time-consuming for short-lived Spark workers<\/li>\n<li>Spark workers are often run with low-spec resources for scaling horizontally instead of vertically<\/li>\n<li>Model files, however, are large binaries so Spark workers can run out of memory when loading more than one or two models<\/li>\n<\/ul>\n<p>To address these limitations, OpeNER has the option to host models behind HTTP services so that developers can separate their natural language processing infrastructure from their application infrastructure. 
These HTTP services are simple to consume but hard to set up since they have several complex dependencies. To simplify deployment, the Fortis team created Docker images for each of the services. It is possible to run the services locally as follows (using Bash, after <a href=\"https:\/\/docs.docker.com\/engine\/installation\/linux\/docker-ce\/ubuntu\/\">installing Docker<\/a>):<\/p>\n<pre class=\"lang:sh decode:true\"># start the OpeNER containers\r\ndocker run -d -p 8080:80 cwolff\/opener-docker-language-identifier\r\ndocker run -d -p 8081:80 cwolff\/opener-docker-tokenizer\r\ndocker run -d -p 8082:80 cwolff\/opener-docker-pos-tagger\r\ndocker run -d -p 8083:80 cwolff\/opener-docker-ner\r\n\r\n# verify that the four containers started above are running\r\ndocker ps\r\n\r\n# input some text to be processed by the OpeNER pipeline\r\ntext_raw=\"I went to Rome last year. It was fantastic.\"\r\n\r\n# run the text through the OpeNER pipeline\r\ntext_with_language=\"$(curl -d \"input=$text_raw\" http:\/\/localhost:8080)\"\r\ntext_tokenized=\"$(curl -d \"input=$text_with_language\" http:\/\/localhost:8081)\"\r\ntext_tagged=\"$(curl -d \"input=$text_tokenized\" http:\/\/localhost:8082)\"\r\ntext_entities=\"$(curl -d \"input=$text_tagged\" http:\/\/localhost:8083)\"\r\n\r\n# check the output XML, the token \"Rome\" is identified as a place entity\r\necho \"$text_entities\"\r\n<\/pre>\n<p>Each of the Docker images also comes with a one-click deployment template to set up and run the services in production on <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/app-service-web\/app-service-linux-intro\">Azure Web Apps for Linux<\/a>.\u00a0The one-click deployment templates can be found in the following repositories on GitHub:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-language-identifier\">opener-docker-language-identifier<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-tokenizer\">opener-docker-tokenizer<\/a><\/li>\n<li><a 
href=\"https:\/\/github.com\/c-w\/opener-docker-pos-tagger\">opener-docker-pos-tagger<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-ner\">opener-docker-ner<\/a><\/li>\n<\/ul>\n<p>The one-click deployments running the OpeNER containers on Azure Web Apps for Linux are convenient as they offer a simple deployment and management story. Deploying the services is as simple as clicking the &#8220;Deploy to Azure&#8221; button on the GitHub repositories and stepping through the wizard. Once the deployment is done, the Azure Portal can be used to easily scale the services horizontally (distributing the service over more instances) and vertically (hosting the service on more powerful virtual machines).<\/p>\n<p>When dealing with large workloads with low latency or high-throughput requirements, however, introducing four HTTP hops for natural language processing can be prohibitively expensive. In such scenarios, there is the option to run the natural language processing models in-process as described earlier in this post or to host multiple Docker images for the OpeNER services on the same host and expose them via a wrapper service such as <a href=\"https:\/\/github.com\/c-w\/opener-docker-wrapper\">opener-docker-wrapper<\/a>.\u00a0An end-to-end usage example of the wrapper service can be found in its <a href=\"https:\/\/github.com\/c-w\/opener-docker-wrapper\">GitHub repository<\/a>. Deploying the wrapper service on an Azure Standard F1 Virtual Machine leads to an average entity extraction latency of 280ms per request with a standard deviation of 165ms.<\/p>\n<p>The ability to identify entities, such as places, people, and organizations, adds a powerful level of natural language understanding to applications. 
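<\/p>\n<p>As an illustrative sketch, the deployed services can also be consumed directly from the JVM using only the JDK; the service URLs below are placeholders for wherever the containers are hosted, and the &#8220;input&#8221; form parameter mirrors the curl examples above:<\/p>\n<pre class=\"lang:scala decode:true\">\/\/ sketch: call an OpeNER HTTP service using only the JDK\r\nimport java.net.{HttpURLConnection, URL, URLEncoder}\r\nimport scala.io.Source\r\n\r\ndef annotate(serviceUrl: String, document: String): String = {\r\n  val connection = new URL(serviceUrl).openConnection().asInstanceOf[HttpURLConnection]\r\n  connection.setRequestMethod(\"POST\")\r\n  connection.setDoOutput(true)\r\n  connection.getOutputStream.write((\"input=\" + URLEncoder.encode(document, \"UTF-8\")).getBytes(\"UTF-8\"))\r\n  \/\/ each service responds with the document augmented by its annotations\r\n  Source.fromInputStream(connection.getInputStream, \"UTF-8\").mkString\r\n}\r\n\r\n\/\/ chaining the four services mirrors the curl pipeline shown earlier\r\nval services = Seq(\"http:\/\/localhost:8080\", \"http:\/\/localhost:8081\",\r\n                   \"http:\/\/localhost:8082\", \"http:\/\/localhost:8083\")\r\nval entitiesXml = services.foldLeft(\"I went to Rome last year.\")((doc, url) =&gt; annotate(url, doc))\r\n<\/pre>\n<p>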
However, the open source ecosystem for named entity recognition on the JVM is quite limited, with many projects either being licensed under non-permissive licenses, targeting only a few languages, or being hard to deploy.<\/p>\n<p>To solve this problem, the Fortis team created an MIT-licensed one-click deployment to Azure for web services that lets developers get started with a wide range of natural language tasks in 5 minutes or less, by consuming simple HTTP services for <a href=\"https:\/\/github.com\/c-w\/opener-docker-language-identifier\">language identification<\/a>, <a href=\"https:\/\/github.com\/c-w\/opener-docker-tokenizer\">tokenization<\/a>, <a href=\"https:\/\/github.com\/c-w\/opener-docker-pos-tagger\">part-of-speech tagging<\/a> and <a href=\"https:\/\/github.com\/c-w\/opener-docker-ner\">named entity recognition<\/a>. These services can be used in a wide variety of additional contexts, including identifying organizations in product reviews, automatically tagging places in social media posts, and so forth.<\/p>\n<h2>Resources<\/h2>\n<ul>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-language-identifier\">Language identification service Docker image<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-tokenizer\">Tokenization service Docker image<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-pos-tagger\">Part-of-speech tagging service Docker image<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-ner\">Named entity recognition service Docker image<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/c-w\/opener-docker-wrapper\">Cross-service batch request wrapper server<\/a><\/li>\n<li><a href=\"http:\/\/www.opener-project.eu\">OpeNER natural language processing tools project<\/a><\/li>\n<li><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/app-service\/containers\/app-service-linux-intro\">Deploying Docker images via Azure Web Apps for 
Linux<\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The ability to correctly identify entities, such as places, people, and organizations, adds a powerful level of natural language understanding to applications. This post introduces an MIT-licensed one-click deployment to Azure for web services that lets developers get started with a wide range of natural language tasks in 5 minutes or less, by consuming simple HTTP services for language identification, tokenization, part-of-speech tagging and named entity recognition.<\/p>\n","protected":false},"author":21408,"featured_media":10856,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[10,19],"tags":[98,156,267,268,320,333,334],"class_list":["post-4886","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-app-services","category-machine-learning","tag-azure-web-apps-for-linux","tag-docker","tag-named-entity-recognition","tag-natural-language-processing","tag-scala","tag-spark","tag-spark-streaming"],"acf":[],"blog_post_summary":"<p>The ability to correctly identify entities, such as places, people, and organizations, adds a powerful level of natural language understanding to applications. 
This post introduces an MIT-licensed one-click deployment to Azure for web services that lets developers get started with a wide range of natural language tasks in 5 minutes or less, by consuming simple HTTP services for language identification, tokenization, part-of-speech tagging and named entity recognition.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/4886","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21408"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=4886"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/4886\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/10856"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=4886"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=4886"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=4886"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}