{"id":2116,"date":"2017-01-11T04:00:00","date_gmt":"2017-01-11T12:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/index.php\/2017\/01\/11\/corpus-to-graph-building-a-pipeline-for-an-extracting-entity-relations-graph-from-a-corpus-of-documents\/"},"modified":"2020-03-19T10:21:26","modified_gmt":"2020-03-19T17:21:26","slug":"corpus-to-graph-building-a-pipeline-for-an-extracting-entity-relations-graph-from-a-corpus-of-documents","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/corpus-to-graph-building-a-pipeline-for-an-extracting-entity-relations-graph-from-a-corpus-of-documents\/","title":{"rendered":"Building a Pipeline for Extracting and Graphing Entities and Relations from a Corpus of Documents"},"content":{"rendered":"<p><a href=\"http:\/\/miroculus.com\/\">Miroculus<\/a> is a startup developing a simple, quick and affordable blood test to diagnose cancer and other diseases at an early stage. Their test device can detect the existence of <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC2895440\/\">micro-RNAs (miRNAs)<\/a> in the patient\u2019s blood, which may correlate with a particular disease.<\/p>\n<p>Miroculus developed the <a href=\"https:\/\/loom.miroculus.com\/\">loom.bio<\/a> tool that builds a visual graph according to the relations between <strong>Genes<\/strong>, <strong>miRNAs<\/strong> and <strong>Conditions<\/strong>, extracted from uncurated public articles (ie. <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pubmed\">Pubmed<\/a>, <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/\">PMC<\/a>, etc.)<\/p>\n<p>Microsoft and Miroculus collaborated to build a pipeline to process a corpus of medical documents. 
The pipeline is a generalized solution that extracts entities from the documents, determines whether a relation exists between them, and then stores the derived relations in a database representing a graph.<\/p>\n<h2 id=\"the-problem\">The Problem<\/h2>\n<p>We wanted to build a corpus-to-graph pipeline that is:<\/p>\n<ol>\n<li>Reusable across different corpora and domains<\/li>\n<li>Scalable across different pipeline tasks<\/li>\n<li>Easily integrated with GitHub for continuous deployment<\/li>\n<li>Easily manageable<\/li>\n<\/ol>\n<h2 id=\"design-architecture\">Design Architecture<\/h2>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/2016-04-18-Developing-a-Genomics-Pipeline-architecture.png\" alt=\"Diagram of architecture and workflow\" \/><\/p>\n<h2 id=\"design-choices\">Design Choices<\/h2>\n<p>For this project, we extract relationships between miRNAs and genes, based on the connections that we find in medical research documents available on PubMed and PMC.<\/p>\n<p>To extract entity relations from a single document, we perform the following sequence:<\/p>\n<ol>\n<li>Split a document retrieved from the corpus into sentences<\/li>\n<li>Extract miRNA and gene entities from each sentence<\/li>\n<li>Use a binary classifier to discover relations between each pair of entities within a sentence<\/li>\n<li>Create an entry in the Graph DB for each related entity pair. The entry includes entities, classification, score, and reference sentence.<\/li>\n<\/ol>\n<p>We\u2019ve developed a pipeline that uses a scheduled WebJob to periodically query for new documents in the corpus. 
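<\/p>\n<p>The four-step sequence above can be sketched as a single function. This is a hypothetical simplification: <code>entityExtractionApi<\/code>, <code>scoringApi<\/code>, <code>graphDb<\/code> and the <code>entityPairs<\/code> helper are illustrative stand-ins for the domain APIs and Graph DB described in this post, not actual names from the repository:<\/p>\n<div class=\"language-js highlighter-rouge\">\n<pre class=\"highlight\"><code>\/\/ Hypothetical sketch of processing a single document (callback style)\r\nfunction processDocument(doc, entityExtractionApi, scoringApi, graphDb, done) {\r\n  \/\/ 1 + 2. Split the document into sentences and extract entities per sentence\r\n  entityExtractionApi.extract(doc, function (err, sentences) {\r\n    if (err) return done(err);\r\n    sentences.forEach(function (sentence) {\r\n      \/\/ 3. Run the binary classifier on each pair of entities within the sentence\r\n      entityPairs(sentence.entities).forEach(function (pair) {\r\n        scoringApi.score(sentence.text, pair, function (err, result) {\r\n          if (err) return done(err);\r\n          \/\/ 4. Store related pairs in the Graph DB, with classification,\r\n          \/\/ score and reference sentence\r\n          if (result.relation) {\r\n            graphDb.insertRelation(pair, result, sentence.text, function (err) {\r\n              \/*...*\/\r\n            });\r\n          }\r\n        });\r\n      });\r\n    });\r\n  });\r\n}\r\n<\/code><\/pre>\n<\/div>\n<p>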
This sequence is repeated for each document in the corpus.<\/p>\n<p>Each job is implemented using <a href=\"https:\/\/azure.microsoft.com\/en-us\/documentation\/articles\/web-sites-create-web-jobs\/\">WebJobs<\/a>, a feature of <a href=\"https:\/\/azure.microsoft.com\/en-us\/documentation\/articles\/app-service-value-prop-what-is\/\">Azure App Service<\/a>. Each <strong>WebJob<\/strong> listens to its own queue of job messages.<\/p>\n<p><strong>Azure Web Apps<\/strong> are easily <a href=\"https:\/\/azure.microsoft.com\/en-us\/documentation\/articles\/web-sites-publish-source-control\/\">integrated with GitHub<\/a>, and using queues is a good practice for supporting scalability.\nAdditionally, each <strong>WebJob<\/strong>, as part of <strong>Azure App Service<\/strong>, can be scaled independently.<\/p>\n<p>Domain-specific tasks were abstracted and represented as API calls, to keep the pipeline generic and reusable across different domains and corpora.<\/p>\n<p>Next, we encountered two challenges:<\/p>\n<ol>\n<li>How would we associate each <strong>Web App<\/strong> with the right job, given that we don\u2019t want to create separate repositories for each pipeline task?<\/li>\n<li>How do we manage and monitor the pipeline?<\/li>\n<\/ol>\n<p>We used <a href=\"#pipeline-deployment\">Pipeline Deployment<\/a> to tackle the first challenge and <a href=\"#pipeline-management\">Pipeline Management<\/a> to answer the second.\nThe following sections detail how we addressed these challenges and integrated everything into a holistic solution.<\/p>\n<h2 id=\"pipeline-management\">Pipeline Management<\/h2>\n<p>The <a href=\"https:\/\/github.com\/amiturgman\/web-cli-sample-app\">web-cli<\/a> is a great tool for controlling and monitoring your backend services and applications.\nIn addition to the built-in plugins for common tasks, the web-cli is easily extended with custom plugins. 
One particularly useful plugin is the logging plugin, which enables you to query logs captured by a WebJob.<\/p>\n<p>In our pipeline implementation, we extended the <strong>web-cli<\/strong> to support administration actions like <strong>rescore<\/strong> or <strong>update model<\/strong>.<\/p>\n<p>This approach can be used to provide the user with a one-stop tool for managing the pipeline.<\/p>\n<h3 id=\"domain-abstraction\">Domain Abstraction<\/h3>\n<p>The pipeline supports two APIs that can be implemented according to a specific domain:<\/p>\n<ol>\n<li>Entity extraction API: splits documents into sentences and extracts relevant entities.<\/li>\n<li>Scoring API: detects relations between extracted entities.<\/li>\n<\/ol>\n<p>We decided to separate these APIs from our pipeline for the following reasons:<\/p>\n<ol>\n<li>Both APIs implement domain-specific logic.<\/li>\n<li><strong>Entity extraction<\/strong> and <strong>Scoring<\/strong> services use different technologies from those used in the pipeline.<\/li>\n<li>Defining an API for these services enables our solution to be configurable.<\/li>\n<li><strong>Entity extraction<\/strong> and <strong>Scoring<\/strong> are processes that run on Linux, and we wanted to enable deployment of the pipeline to <strong>Azure App Services<\/strong>.<\/li>\n<\/ol>\n<h3 id=\"pipeline-deployment\">Pipeline Deployment<\/h3>\n<p>The pipeline includes four Node.js <strong>WebJobs<\/strong> with common dependencies and logic, as well as two websites: one for the <strong>web-cli<\/strong> and one for the <strong>Graph API<\/strong>.<\/p>\n<p>We wanted to leverage the <strong>Continuous Deployment<\/strong> feature with <strong>WebJobs<\/strong> but didn\u2019t want a separate repository for each pipeline role. 
Therefore, we devised a way to deploy a single repository containing the implementation of all the pipeline\u2019s roles, with each <strong>WebJob<\/strong> determining its own role in the pipeline.<\/p>\n<p>Here is an example of how WebJob runners are created according to a received parameter:\n<a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-pipeline\/blob\/master\/lib\/runners\/continuous.js\">corpus-to-graph-pipeline\/lib\/runners\/continuous.js<\/a><\/p>\n<div class=\"language-js highlighter-rouge\">\n<pre class=\"highlight\"><code><span class=\"kd\">var<\/span> <span class=\"nx\">roles<\/span> <span class=\"o\">=<\/span> <span class=\"nx\">require<\/span><span class=\"p\">(<\/span><span class=\"s1\">'..\/roles'<\/span><span class=\"p\">);<\/span>\r\n<span class=\"cm\">\/*...*\/<\/span>\r\n<span class=\"kd\">function<\/span> <span class=\"nx\">Runner<\/span><span class=\"p\">(<\/span><span class=\"nx\">serviceName<\/span><span class=\"p\">,<\/span> <span class=\"nx\">config<\/span><span class=\"p\">,<\/span> <span class=\"nx\">options<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span>\r\n  \r\n  <span class=\"c1\">\/\/ Initializing the relevant role according to serviceName<\/span>\r\n  <span class=\"kd\">var<\/span> <span class=\"nx\">svc<\/span> <span class=\"o\">=<\/span> <span class=\"k\">new<\/span> <span class=\"nx\">roles<\/span><span class=\"p\">[<\/span><span class=\"nx\">serviceName<\/span><span class=\"p\">](<\/span><span class=\"nx\">config<\/span><span class=\"p\">,<\/span> <span class=\"nx\">options<\/span><span class=\"p\">);<\/span>\r\n\r\n  <span class=\"c1\">\/\/ This method is called periodically to check for messages in the queue<\/span>\r\n  <span class=\"kd\">function<\/span> <span class=\"nx\">checkInputQueue<\/span><span class=\"p\">()<\/span> <span class=\"p\">{<\/span>\r\n\r\n    <span class=\"c1\">\/\/ Request a single message from the queue<\/span>\r\n    <span class=\"nx\">queueIn<\/span><span 
class=\"p\">.<\/span><span class=\"nx\">getSingleMessage<\/span><span class=\"p\">(<\/span><span class=\"kd\">function<\/span> <span class=\"p\">(<\/span><span class=\"nx\">err<\/span><span class=\"p\">,<\/span> <span class=\"nx\">message<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span>\r\n\r\n      <span class=\"c1\">\/\/ Send message to be processed by the service<\/span>\r\n      <span class=\"k\">return<\/span> <span class=\"nx\">svc<\/span><span class=\"p\">.<\/span><span class=\"nx\">processMessage<\/span><span class=\"p\">(<\/span><span class=\"nx\">msgObject<\/span><span class=\"p\">,<\/span> <span class=\"kd\">function<\/span> <span class=\"p\">(<\/span><span class=\"nx\">err<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span>\r\n        <span class=\"cm\">\/*...*\/<\/span>\r\n      <span class=\"p\">});<\/span>\r\n    <span class=\"p\">});<\/span>\r\n  <span class=\"p\">}<\/span>\r\n<span class=\"p\">}<\/span>\r\n<\/code><\/pre>\n<\/div>\n<p>This is an example of how to start the runner with the relevant web job loaded from environment variables:\n<a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-genomics\/blob\/master\/webjob\/continuous\/app.js\">corpus-to-graph-genomics\/webjob\/continuous\/app.js<\/a><\/p>\n<div class=\"language-js highlighter-rouge\">\n<pre class=\"highlight\"><code><span class=\"kd\">var<\/span> <span class=\"nx\">continuousRunner<\/span> <span class=\"o\">=<\/span> <span class=\"nx\">require<\/span><span class=\"p\">(<\/span><span class=\"s1\">'corpus-to-graph-pipeline'<\/span><span class=\"p\">).<\/span><span class=\"nx\">runners<\/span><span class=\"p\">.<\/span><span class=\"nx\">continuous<\/span><span class=\"p\">;<\/span>\r\n<span class=\"kd\">var<\/span> <span class=\"nx\">webJobName<\/span> <span class=\"o\">=<\/span> <span class=\"nx\">process<\/span><span class=\"p\">.<\/span><span class=\"nx\">env<\/span><span class=\"p\">.<\/span><span class=\"nx\">PIPELINE_ROLE<\/span><span 
class=\"p\">;<\/span>\r\n<span class=\"cm\">\/*...*\/<\/span>\r\n<span class=\"kd\">function<\/span> <span class=\"nx\">startContinuousRunner<\/span><span class=\"p\">()<\/span> <span class=\"p\">{<\/span>\r\n  \r\n  <span class=\"kd\">var<\/span> <span class=\"nx\">runnerInstance<\/span> <span class=\"o\">=<\/span> <span class=\"k\">new<\/span> <span class=\"nx\">continuousRunner<\/span><span class=\"p\">(<\/span><span class=\"nx\">webJobName<\/span><span class=\"p\">,<\/span> <span class=\"cm\">\/*...*\/<\/span><span class=\"p\">);<\/span> \r\n  <span class=\"k\">return<\/span> <span class=\"nx\">runnerInstance<\/span><span class=\"p\">.<\/span><span class=\"nx\">start<\/span><span class=\"p\">(<\/span><span class=\"kd\">function<\/span> <span class=\"p\">(<\/span><span class=\"nx\">err<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span>\r\n    <span class=\"cm\">\/*...*\/<\/span>\r\n  <span class=\"p\">});<\/span>\r\n<span class=\"p\">}<\/span>\r\n<\/code><\/pre>\n<\/div>\n<p>Our solution uses a generic node module and a sample solution that leverages it.<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-pipeline\">corpus-to-graph-pipeline<\/a> &#8211; A node module that provides a common implementation of a pipeline.<\/li>\n<li><a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-genomics\">corpus-to-graph-genomics<\/a> &#8211; A sample project that leverages the pipeline module. 
It can be used as is, or as a reference for how to build a solution for a different problem space.<\/li>\n<\/ul>\n<p>Additionally, the solution comes with two <strong>ARM templates<\/strong> that bundle together all the deployment dependencies, such as <strong>SQL Server<\/strong>, <strong>Azure Storage<\/strong>, etc.<\/p>\n<p>The <a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-genomics\/blob\/master\/azure-deployment\/Templates\/scalable\/azuredeploy.json\">scalable ARM template<\/a> is designed to enable scalability in production. The <a href=\"https:\/\/github.com\/CatalystCode\/corpus-to-graph-genomics\/blob\/master\/azure-deployment\/Templates\/all-in-one\/azuredeploy.all-in-one.json\">all-in-one ARM template<\/a> is designed for development and testing and is less taxing on resources.<\/p>\n<h2 id=\"opportunities-for-reuse\">Opportunities for Reuse<\/h2>\n<p>This solution can be reused in projects that require \u201ccorpus to graph\u201d pipelines.<\/p>\n<p>Developers can leverage this repository in fields such as medicine, knowledge management and genomics (as in this project), or in any other project requiring the processing of a large-scale document repository into a graph.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how to build a pipeline for processing a corpus of documents to discover its entities and relations, then store the derived relations in a database representing a graph. 
<\/p>\n","protected":false},"author":21371,"featured_media":12527,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[10,11],"tags":[63,91,97,99,237,244,256,257,291],"class_list":["post-2116","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure-app-services","category-big-data","tag-azure-app-service","tag-azure-resource-manager-arm","tag-azure-web-apps","tag-azure-webjobs","tag-loom-bio","tag-medical-data","tag-mirnas","tag-miroculus","tag-pipelines"],"acf":[],"blog_post_summary":"<p>Learn how to build a pipeline for processing a corpus of documents to discover its entities and relations, then store the derived relations in a database representing a graph. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21371"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=2116"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/2116\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/12527"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=2116"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=2116"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=2116"}],"curies":[{"name":"wp","href":"https:\/\/
api.w.org\/{rel}","templated":true}]}}