{"id":3802,"date":"2017-08-07T16:02:50","date_gmt":"2017-08-07T23:02:50","guid":{"rendered":"https:\/\/www.microsoft.com\/reallifecode\/?p=3802"},"modified":"2020-03-18T10:57:02","modified_gmt":"2020-03-18T17:57:02","slug":"developing-a-custom-search-engine-for-an-expert-system","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/developing-a-custom-search-engine-for-an-expert-system\/","title":{"rendered":"Developing a Custom Search Engine for an Expert Chat System"},"content":{"rendered":"<h2>The Challenge<\/h2>\n<p>Querying specific content areas quickly and easily is a common enterprise need. Fast traversal of specialized publications, customer support knowledge bases or document repositories allows enterprises to deliver service efficiently and effectively. Simple FAQs don\u2019t cover enough ground, and a string search isn\u2019t effective or efficient for those not familiar with the domain or the document set. Instead, enterprises can deliver a custom search experience that saves their clients time and provides them better service through a question and answer format.<\/p>\n<p>We worked with Ernst &amp; Young,\u00a0a leading global professional services firm,\u00a0to help them develop and improve a custom search engine to power a self-service\u00a0expert system leveraging\u00a0their EY Tax Guide for the\u00a0US. The users of their expert system\u00a0require an efficient and reliable experience, with a high degree of accuracy in the set of answers provided. 
We share our learnings, process, and custom code in this code story.<!--more--><\/p>\n<p><figure id=\"attachment_4632\" aria-labelledby=\"figcaption_attachment_4632\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-4632 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/eycollab.png\" alt=\"\" width=\"414\" height=\"329\" \/><figcaption id=\"figcaption_attachment_4632\" class=\"wp-caption-text\">EY and Microsoft Teams at Work<\/figcaption><\/figure><\/p>\n<p>Consumer search engines combine many sophisticated techniques in each step of the process, from augmenting query and answer content, to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Search_engine_indexing\">indexing target content<\/a>, to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Learning_to_rank\">retrieval ranking<\/a> and performance measurement. Augmenting content requires <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> (NLP) techniques like keyword and <a href=\"http:\/\/bdewilde.github.io\/blog\/2014\/09\/23\/intro-to-automatic-keyphrase-extraction\/\">key phrase extraction<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/N-gram\">n-gram<\/a> analysis, and word treatments including <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stemming\">stemming<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stop_words\">stop-word<\/a> filtering. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Learning_to_rank\">Ranking and retrieval<\/a> of the right responses use machine learning algorithms to measure the similarity between the query and the target units of retrieval. 
Finally, measuring retrieval performance is key to optimizing quality, since managing a consumer search engine experience is an ongoing task.<\/p>\n<p>Despite the sophistication of consumer search engine development and the promises of AI and expert systems, designing an enterprise custom search experience that delivers against users\u2019 high expectations can be challenging. Few guidelines give developers a comprehensive view of the processes and best practices for designing, optimizing, and improving custom search. Moreover, few tools help developers measure how well their custom search engine retrieves what the user intended.\u00a0 From text pre-processing and enrichment to interactive querying and testing, each step could benefit from a process road map, how-to guidelines, and better tools. Enterprises have questions such as: Which techniques should be used at what time? What is the performance impact of different optimization choices on retrieval quality? Which set of optimizations performs the best?<\/p>\n<p>In this project, we addressed the challenge of creating a custom domain search experience. We leveraged <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/search-what-is-azure-search\">Azure Search<\/a> and <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/\">Cognitive Services<\/a>, and we share our <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\">custom code<\/a> for iterative testing, measurement, and indexer redeployment. In our solution, the customized search engine forms the foundation for delivering a question and answer experience in a specific domain area. 
Below, we provide\u00a0guidelines on designing your own custom search experience, followed by a step-by-step description of our work on this particular project with code and data that you can use to learn from and modify our approach for your projects. In future posts,\u00a0we\u2019ll discuss the presentation layer as well as the work of integrating a custom Azure Search experience and Cognitive Services into a bot presentation experience.<\/p>\n<h2>Designing\u00a0a Custom Search Experience<\/h2>\n<p>Before we describe the solution for our project, we outline search design considerations. These design considerations will help you create an enterprise search experience that rivals the best consumer search engines.<\/p>\n<p>The first step is to understand the custom search life cycle, which involves designing the search experience, collecting and processing content, preparing the content for serving, serving and monitoring, and finally collecting feedback.\u00a0 Designing in continuous measurement and\u00a0improvement is essential to developing and optimizing your search experience.<\/p>\n<p><figure id=\"attachment_4636\" aria-labelledby=\"figcaption_attachment_4636\" class=\"wp-caption aligncenter\" ><img decoding=\"async\" class=\"wp-image-4636 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/searchlifecycle.png\" alt=\"\" width=\"889\" height=\"586\" \/><figcaption id=\"figcaption_attachment_4636\" class=\"wp-caption-text\">Custom Search Life Cycle<\/figcaption><\/figure><\/p>\n<h3>Determine Your Target User and Intent<\/h3>\n<p>Defining your target user\u00a0allows you to characterize\u00a0the experience that\u00a0they need and the query language that they will use.\u00a0\u00a0If your target user is a domain expert, their query terminology reflects this expertise and if your target user is not familiar with the domain area covered, their queries won&#8217;t\u00a0include expert vocabulary.\u00a0\u00a0 For example, 
a domain expert may ask about a &#8220;Roth IRA&#8221; by name, while a non-expert may ask about a &#8220;retirement savings account&#8221; instead.<\/p>\n<p>Characterizing the intents of your target users guides your experience design and content strategy. In Web search engines, for instance, the user intent falls into one of three categories:<\/p>\n<ul>\n<li><strong>Navigational<\/strong>: Surfing directly to a specific website (e.g., MSN, Amazon or Wikipedia)<\/li>\n<li><strong>Transactional<\/strong>: Completing a specific task (e.g., find a restaurant, reserve a table, sign up for a service)<\/li>\n<li><strong>Informational<\/strong>: Browsing for general information about a topic using free-form queries (e.g., who is the director of Inception, artificial intelligence papers, upcoming events in Seattle)<\/li>\n<\/ul>\n<p>Beyond these three categories, user intent may be further categorized into more specific sub-intents, especially for transactional and informational queries. Clarifying your user intents is key to serving the most relevant content in the clearest form. If possible, obtain a set of potential queries and characterize them by user intent.<\/p>\n<h3>Consider the End-to-End Design<\/h3>\n<p>A good custom search design encompasses the end-to-end experience. 
Answering these ten key questions will give you a high-level set of requirements for your end-to-end custom search design.<\/p>\n<ol>\n<li>Which user intents will be supported?<\/li>\n<li>Is the content available to answer the user queries?\u00a0 Is there any data acquisition or collection that is required to assemble the necessary pieces of content?<\/li>\n<li>What type of content will be served: text, voice, multimedia or other?<\/li>\n<li>How will the content be served for each intent or sub-intent?\u00a0How will the user interface work?<\/li>\n<li>Which delivery interface(s) will be supported (e.g., web page, mobile web page, chatbot, text, speech or other)?<\/li>\n<li>Will the experience include content from more than one source?<\/li>\n<li>Which user signals will be automatically captured for analysis?\u00a0 How?<\/li>\n<li>What type of user feedback will be solicited?\u00a0 How will it be solicited: implicitly or explicitly or both?<\/li>\n<li>What success metrics are there? \u00a0Are they objective, subjective or both?\u00a0 How will they be computed?<\/li>\n<li>How do you compare alternative experiences?\u00a0 Will you run A\/B testing or other testing protocol? 
\u00a0How will you decide which experience is better when metrics conflict?<\/li>\n<\/ol>\n<h3>Characterize the Query and Consumption Interface and Experience<\/h3>\n<p>Once you have planned for the end-to-end experience, outline the content serving and consumption experience. As you design the query and results page layout, decide how many results you will need to deliver. The number of results you can serve is often a function of the screen size of the device where you are serving the experience, the character of the answers you are delivering, and the requirements of your target audience. It&#8217;s key to deliver results in a consistent layout that is visually and cognitively appealing.<\/p>\n<h3>Define Success Measures and Feedback<\/h3>\n<p>Define your desired objective success metrics. Is success displaying the best answer in the top five responses, the top three responses, or only in the first response? The success measures will be used to optimize the search experience, as well as for ongoing management. Consider the measures you will need for launch and for ongoing performance management. Also consider your approach to experimentation. Will you support A\/B testing or other controlled experiments with variants to test different ranking mechanisms or user experiences?<\/p>\n<p>In addition, define how users will provide feedback on the quality of the answers or the quality of the experience. 
For example, you might rely on implicit feedback from usage logs, or explicit feedback that the user provides based on the results served. Your UI affordances for explicit feedback might allow users to rate the usefulness of the specific result served, identify which result is the best, and rate the quality of the overall experience.<\/p>\n<p>To help you determine how you will measure success for your custom search project, here are a few resources to consider:<\/p>\n<ul>\n<li>Search relevance (how to define and measure search success, including objective and subjective success metrics)\n<ul>\n<li><a href=\"https:\/\/www.researchgate.net\/publication\/271447836_The_Anatomy_of_Relevance_Topical_Snippet_and_Perceived_Relevance_in_Search_Result_Evaluation\">The Anatomy of Relevance: Topical, Snippet and Perceived Relevance in Search Result Evaluation<\/a><\/li>\n<li><a href=\"https:\/\/blogs.bing.com\/search\/2012\/03\/05\/bing-search-quality-insights-whole-page-relevance\/\">Bing Search Quality Insights: Whole Page Relevance<\/a><\/li>\n<\/ul>\n<\/li>\n<li>Information retrieval ranking metrics\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Learning_to_rank\">Learning to rank (Wikipedia)<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Evaluation_measures_(information_retrieval)\">Evaluation measures (information retrieval) (Wikipedia)<\/a><\/li>\n<li><a href=\"https:\/\/observer.wunderwood.org\/2016\/09\/12\/measuring-search-relevance-with-mrr\/\">Measuring search relevance with MRR<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Our Project Solution<\/h2>\n<p>A variety of services, tools, and platforms are available to assist in content preparation and results serving, as well as in the online response to incoming user queries. 
We reviewed the following services\u00a0to identify which would serve the custom search experience requirements for our project.<\/p>\n<ul>\n<li>Detection of user query intent\/sub-intent\u00a0via <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/language-understanding-intelligent-service\/\">Language Understanding Intelligent Service (LUIS)<\/a><\/li>\n<li>Serving frequently asked questions\u00a0via <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/qna-maker\/\">QnA Maker API<\/a><\/li>\n<li>Indexing and serving general search content via <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/search\/\">Azure Search<\/a><\/li>\n<li>Text analytics supporting tools, such as language detection, key phrase extraction, topic detection, sentiment analysis\u00a0via <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/text-analytics\/\">Text Analytics API<\/a><\/li>\n<li>Other APIs supporting language, knowledge, speech, vision and more\u00a0via <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\">Microsoft Cognitive Services<\/a><\/li>\n<\/ul>\n<p>Based on our content and target user requirements we\u00a0identified <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/search\/\">Azure Search<\/a> and the <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/language-understanding-intelligent-service\/\">Language Understanding Intelligent Service<\/a> as services we would use in our design.<\/p>\n<h3>Custom Search Engine Development Process and Tools<\/h3>\n<p>The diagram below describes our development process and the tools used. The development and optimization process flow for our project is illustrated in\u00a0gray. 
The services we leverage are in blue. The steps with a solid outline are those we completed in this project.<\/p>\n<p>Azure Search is the foundation for our custom search experience. We leverage many of the Azure Search features, including custom analyzers, custom scoring, and custom synonyms. We complement these services with custom scripts to iteratively optimize and measure our search experience. We have shared links to this custom code within this post and in our <a href=\"https:\/\/github.com\/CatalystCode\/CustomDomainSearch\">GitHub repo<\/a>.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-large wp-image-4661\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/servicesused.png\" alt=\"\" width=\"780\" height=\"430\" \/><\/p>\n<p>To illustrate the development and optimization process we used in this project, in this code story we take a set of public domain data and walk through the development and refinement of a custom search experience step by step. 
In this case, we&#8217;ll reference a subset of the <a href=\"https:\/\/www.law.cornell.edu\/uscode\/text\/26\">US Tax Code<\/a>.<\/p>\n<h3>The Project Process<\/h3>\n<h4>Source Content Text Pre-Processing<\/h4>\n<p>Creating a custom search experience starts with clean and well-structured source text. Our objective in the first step was to structure the source text by defining a well-characterized unit of retrieval, with metadata for each of these &#8216;answers.&#8217; For this project, this involved some restructuring of content originally formatted for consumption on paper.<\/p>\n<p>To prepare the data for search, we parsed the source based on its formatting characteristics and transformed it into a data table. We also normalized the text within the content by applying consistent formatting. We organized the table with one row for each candidate response, sometimes referred to as a &#8216;unit of retrieval&#8217; or answer. In the screenshot below, you can see that the column &#8216;ParaText&#8217; is the &#8216;answer&#8217; unit.<\/p>\n<p>Metadata is used by Azure Search to better identify and disambiguate the target content the user is seeking. 
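The parsing step above can be sketched with the standard library alone. This is a minimal sketch, assuming a toy HTML structure and illustrative field names (Section, ParaText); the actual tax-code markup and pipeline live in the linked notebooks:

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Collect one row per paragraph, tagged with the enclosing section title."""
    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per unit of retrieval
        self.section = None  # most recent <h2> heading text
        self._tag = None

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h2":
            self.section = text          # remember the current section title
        elif self._tag == "p":
            # each paragraph becomes one candidate answer row
            self.rows.append({"Section": self.section, "ParaText": text})

# Toy source document standing in for the real corpus
source = ("<h2>Roth IRAs</h2>"
          "<p>Contributions are not deductible.</p>"
          "<p>Qualified distributions are tax free.</p>")
parser = SectionParser()
parser.feed(source)
```

Each entry in `parser.rows` then corresponds to one row of the data table, ready for metadata enrichment.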
We created metadata for each row, or answer, in our content table. For example, we identified which chapter, section, and subsection the content came from. Because our source corpus was designed for written publication, we had some unique transformations to perform on the text. In <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/1-content_extraction.ipynb\">this Python script<\/a>, you can follow some of the basic source content cleaning approaches we took, leveraging off-the-shelf text libraries like <a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/\">Beautiful Soup<\/a>.<\/p>\n<p><figure id=\"attachment_4605\" aria-labelledby=\"figcaption_attachment_4605\" class=\"wp-caption alignnone\" ><img decoding=\"async\" class=\"wp-image-4605 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/parsedtaxcode.png\" alt=\"\" width=\"780\" height=\"519\" \/><figcaption id=\"figcaption_attachment_4605\" class=\"wp-caption-text\">Cleaned and Restructured Source Text<\/figcaption><\/figure><\/p>\n<p>The condition of your source text will vary. You may need to interpret less deterministic and less consistent formatting of source content.<\/p>\n<h4>Content Enrichment, Upload, and Index Deployment<\/h4>\n<p>Augmenting the source text with additional descriptive metadata, or &#8216;content enrichment&#8217;, helps Azure Search understand the subject and meaning of the target text. In this case, we enriched each row with key phrases extracted from the unit-of-retrieval text. We used open source key phrase extraction libraries with some customizations, including custom stop word exclusion. 
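To give a flavor of this enrichment, here is a minimal frequency-based sketch; the stop list and scoring are toy stand-ins for the open-source key phrase extractors we actually used:

```python
from collections import Counter
import re

# Toy stop list; real extractors use larger, domain-tuned lists
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "are", "on"}

def extract_keywords(text, top_n=5):
    """Score candidate keywords by frequency after stop-word filtering."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

keywords = extract_keywords(
    "A Roth IRA is a retirement account. "
    "Contributions to a Roth IRA are not deductible."
)
```

The resulting keyword list is stored as an extra metadata field on the row, alongside the chapter and section fields.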
\u00a0You can find this in the end-to-end <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/1-content_extraction.ipynb\">content extraction Jupyter Notebook<\/a>.<\/p>\n<p>As an additional content enrichment step, we added the title of the content section of the publication to the keyword list. In our case, this title was a very descriptive categorization of the answer, and thus very useful to our search engine in disambiguating the answer.<\/p>\n<p>Once this was complete, we uploaded the content and created the initial index in Azure Search. We used the <a href=\"https:\/\/docs.microsoft.com\/en-us\/rest\/api\/searchservice\/\">Azure Search REST API<\/a> to <a href=\"https:\/\/docs.microsoft.com\/en-us\/rest\/api\/searchservice\/create-index\">create the index<\/a> and upload the new content via a <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/2-content_indexing.ipynb\">Jupyter Notebook Python script<\/a>.\u00a0 You can also use <a href=\"http:\/\/www.telerik.com\/fiddler\">Fiddler<\/a>, <a href=\"https:\/\/www.getpostman.com\/\">Postman<\/a> or your favorite REST client.<\/p>\n<p><figure id=\"attachment_4606\" aria-labelledby=\"figcaption_attachment_4606\" class=\"wp-caption alignnone\" ><img decoding=\"async\" class=\"wp-image-4606 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/parsedtaxcodewithkeywords.png\" alt=\"\" width=\"780\" height=\"393\" \/><figcaption id=\"figcaption_attachment_4606\" class=\"wp-caption-text\">Target Content Enriched with Keyword Metadata<\/figcaption><\/figure><\/p>\n<h4>Azure Search Custom Analyzer, Scoring, Synonyms, and Suggesters<\/h4>\n<p>We found that tweaking the custom search analyzer helped the Azure Search indexer better capture the meaning of our source text.\u00a0 <a 
href=\"https:\/\/docs.microsoft.com\/en-us\/rest\/api\/searchservice\/custom-analyzers-in-azure-search\">Azure Search Custom Analyzer<\/a> has a number of elements one can customize. In our project, we had many terms that were in fact numbers and letters with a dash or other character between them. We found that the standard Azure Search analyzer interpreted these as separate words and mistakenly split them apart. For example, &#8220;1099-DIV&#8221; would be interpreted as &#8220;1099&#8221; and &#8220;DIV&#8221;. In this case, the char filter settings in the custom analyzer allowed us to call out this pattern via <a href=\"http:\/\/docs.activestate.com\/komodo\/4.4\/regex-intro.html\">regex<\/a>, and to tell the custom analyzer to leave these strings together as one unit. Weighting and boosting certain fields in the content can also improve performance. We specifically boosted the Title and Keywords fields via a scoring profile in the index definition JSON.<\/p>\n<p>Here is an excerpt of our Azure Search index definition JSON, including the custom analyzer.<\/p>\n<pre class=\"font-size:9 line-height:10 lang:js decode:true\">{\r\n    \"name\": \"taxcodejune\",\r\n    \"fields\":\r\n    [\r\n        {\r\n            \"name\": \"Index\",\r\n            \"type\": \"Edm.String\",\r\n            \"searchable\": false,\r\n            \"filterable\": false,\r\n            \"retrievable\": true,\r\n            \"sortable\": true,\r\n            \"facetable\": false,\r\n            \"key\": true,\r\n            \"indexAnalyzer\": null,\r\n            \"searchAnalyzer\": null,\r\n            \"analyzer\": null,\r\n            \"synonymMaps\": []\r\n        }\r\n    ],\r\n    \"scoringProfiles\":\r\n    [\r\n        {\r\n            \"name\": \"boostexperiment\",\r\n            \"text\":\r\n            {\r\n                \"weights\":\r\n                {\r\n                    \"Title\": 1,\r\n                    \"Keywords\": 1\r\n                }\r\n            }\r\n        }\r\n],\r\n  \"analyzers\": 
[\r\n    {\r\n      \"@odata.type\": \"#Microsoft.Azure.Search.CustomAnalyzer\",\r\n      \"name\": \"english_search_analyzer\",\r\n      \"tokenizer\": \"english_search\",\r\n      \"tokenFilters\": [\r\n        \"lowercase\"\r\n      ],\r\n      \"charFilters\": [\"form_suffix\"]\r\n    },\r\n    {\r\n      \"@odata.type\": \"#Microsoft.Azure.Search.CustomAnalyzer\",\r\n      \"name\": \"english_indexing_analyzer\",\r\n      \"tokenizer\": \"english_indexing\",\r\n      \"tokenFilters\": [\r\n        \"lowercase\"\r\n      ],\r\n      \"charFilters\": [\"form_suffix\"]\r\n    }\r\n  ],\r\n  \"tokenizers\": [\r\n    {\r\n      \"@odata.type\": \"#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer\",\r\n      \"name\": \"english_indexing\",\r\n      \"language\": \"english\",\r\n      \"isSearchTokenizer\": false\r\n    },\r\n    {\r\n      \"@odata.type\": \"#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer\",\r\n      \"name\": \"english_search\",\r\n      \"language\": \"english\",\r\n      \"isSearchTokenizer\": false\r\n    }\r\n  ],\r\n  \"tokenFilters\": [],\r\n  \"charFilters\": [\r\n     {\r\n       \"name\":\"form_suffix\",\r\n       \"@odata.type\":\"#Microsoft.Azure.Search.PatternReplaceCharFilter\",\r\n       \"pattern\":\"([0-9]{4})-([A-Z]*)\",\r\n       \"replacement\":\"$1$2\"\r\n     }\r\n  ]\r\n}<\/pre>\n<p>Azure Search Custom Analyzer offers more elements to customize than we used in this instance, and you can read <a href=\"https:\/\/docs.microsoft.com\/en-us\/rest\/api\/searchservice\/custom-analyzers-in-azure-search\">more details about it in the API documentation<\/a>.\u00a0For more frequent queries,\u00a0you can use the <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/azure-search-how-to-add-suggestions-auto-complete-to-your-search-applications\/\">suggester functionality<\/a> to suggest and autocomplete.<\/p>\n<p>We also found that adding\u00a0synonyms and acronyms to the\u00a0metadata improved our 
performance, given the jargon-intensive specialized content of the US Tax Code. When a user&#8217;s query uses a synonym of a keyword in the original content, this synonym and acronym metadata helps match the query to its intended answer. For our project we wrote a <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/AugmentingSearch_CreatingASynonymMap.ipynb\">custom script in a Jupyter Notebook to generate synonyms and acronyms<\/a>, comparing our target content to synonym and acronym references. We added these to the metadata for our content. With this acronym-heavy corpus, adding synonyms and acronyms helped performance substantially.\u00a0 We also created a <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/AugmentingSearch_UploadingSynonymMapToAzureSearch.ipynb\">Jupyter Notebook<\/a> to demonstrate how to upload the synonym map you created to an Azure Search service.<\/p>\n<h4>Batch Testing, Measurement and Indexer Redeployment<\/h4>\n<p>At every iteration step, we tested the impact of the changes and optimized our choices.\u00a0 To upload reparsed content interactively and update the custom analyzer \u2014 a step that we took many, many times as we optimized our custom search engine \u2014 we <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/Python\/azsearch_mgmt.py\">created a management script<\/a>. This saved time and eliminated errors. Then we used the <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/blob\/master\/JupyterNotebooks\/3-azure_search_query.ipynb\">batch-testing script featured in this Jupyter Notebook<\/a>, which retrieves the top &#8216;N&#8217; search results for all questions in the batch. 
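In outline, such a batch run issues one search request per question against the Azure Search REST API. Below is a minimal standard-library sketch of that loop; the service name, API key, and api-version are placeholders, not values from our project:

```python
import json
import urllib.parse
import urllib.request

SERVICE = "your-search-service"      # placeholder: your Azure Search service name
INDEX = "taxcodejune"
API_KEY = "YOUR-QUERY-KEY"           # placeholder: a query or admin key
API_VERSION = "2017-11-11"           # use the api-version your service supports

def build_query_url(question, top=5):
    """Build the search URL for one question; $top caps the number of results."""
    params = urllib.parse.urlencode(
        {"api-version": API_VERSION, "search": question, "$top": top}
    )
    return (f"https://{SERVICE}.search.windows.net"
            f"/indexes/{INDEX}/docs?{params}")

def batch_search(questions, top=5):
    """Return {question: [top-N documents]} for a batch of test questions."""
    results = {}
    for q in questions:
        req = urllib.request.Request(build_query_url(q, top),
                                     headers={"api-key": API_KEY})
        with urllib.request.urlopen(req) as resp:
            results[q] = json.load(resp)["value"]  # matching documents
    return results
```

Running `batch_search` over the full question set produces the per-question result lists that the measurement step scores against ground truth.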
This approach allowed us to review and measure the impact of each refinement step we took, from content enrichment to tweaking the custom analyzer settings.<\/p>\n<p>We compared all of the retrieved results per question against the ground-truth answers from the training set, collated the measurements from each iterative run manually, and confirmed whether or not the optimization changes had improved performance.<\/p>\n<h3>Performance Improvement for Each Refinement<\/h3>\n<p>Once we had our search index deployed, we measured performance and compared every refinement step&#8217;s impact on it. Some refinements were more beneficial than others \u2014 a function of our source content and, in part, the extent and character of our ground-truth answers.<\/p>\n<p>Overall, we found that adding titles to our extracted keywords was the most beneficial refinement.\u00a0 Given this result, we boosted the weight of the title and keyword metadata and saw even more benefits.\u00a0 Second most impactful was creating a char filter in the custom analyzer to recognize strings like &#8216;1099-DIV&#8217; and avoid splitting them. Finally, the third most impactful addition was adding acronyms and synonyms to our metadata. 
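The comparison against ground truth reduces to a few lines of scoring code. Here is a minimal sketch of recall@N and mean reciprocal rank (MRR), assuming each question maps to a single correct answer id and each result set is a ranked list of ids:

```python
def recall_at_n(results, truth, n=5):
    """Fraction of questions whose correct answer appears in the top n results."""
    hits = sum(1 for q, answer in truth.items() if answer in results[q][:n])
    return hits / len(truth)

def mean_reciprocal_rank(results, truth):
    """Average of 1/rank of the correct answer (0 when it is missed entirely)."""
    total = 0.0
    for q, answer in truth.items():
        ranked = results[q]
        total += 1.0 / (ranked.index(answer) + 1) if answer in ranked else 0.0
    return total / len(truth)

# Toy batch: ranked result ids per question, plus the ground-truth answer ids
results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h", "i"]}
truth = {"q1": "a", "q2": "e", "q3": "z"}
```

Re-running these metrics after each change (title keywords, char filter, synonyms) is what lets you rank the refinements by impact.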
Based on these refinements, we optimized results to a performance level that was ready for human testing and feedback.<\/p>\n<p>To illustrate our iterative approach to performance improvement, we generated a toy example showing hypothetical improvements in the recall of correct answers. It highlights the hypothetical gain from each additional augmentation and analyzer customization.<\/p>\n<p><figure id=\"attachment_4831\" aria-labelledby=\"figcaption_attachment_4831\" class=\"wp-caption alignnone\" ><img decoding=\"async\" class=\"wp-image-4831 size-large\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/03\/results-1024x190.jpg\" alt=\"\" width=\"780\" height=\"145\" \/><figcaption id=\"figcaption_attachment_4831\" class=\"wp-caption-text\">Hypothetical Recall Performance Improvement from Baseline by Approach<\/figcaption><\/figure><\/p>\n<p>The next experimentation and measurement phase is to combine the features listed above, then evaluate and compare relevance in the final presentation to the user. The overall relevance metrics will likely vary based on users\u2019 expectations, background, and usage behavior.<\/p>\n<h3>Future Work<\/h3>\n<p>Follow-on work for the project includes recognizing intent using the <a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/language-understanding-intelligent-service\/\">Language Understanding Intelligent Service (LUIS)<\/a>. 
\u00a0By looking at frequent queries, we can identify and characterize intents using distinctive phrases. For example, a number of phrases are associated with retirement accounts, including &#8216;Roth IRA&#8217; and &#8216;401K&#8217;; when these phrases appear, we could augment the metadata with the topic &#8216;retirement.&#8217;<\/p>\n<p>The team plans to prototype the question and answer experience with the help of the <a href=\"https:\/\/docs.microsoft.com\/en-us\/bot-framework\/overview-introduction-bot-framework\">Bot Framework<\/a>.\u00a0 The bot will express its intended purpose up front, following the <a href=\"https:\/\/docs.microsoft.com\/en-us\/bot-framework\/bot-design-principles\">design principles<\/a> outlined in the documentation and the <a href=\"https:\/\/github.com\/michhar\/bot-education-samples\/tree\/master\/Node\/bot-azure-search\">code base<\/a>.<\/p>\n<p>After launch, the prototype will be used to display <em>realistic usage information and metrics<\/em> with Azure Application Insights and the Bot Framework Analytics dashboard.<\/p>\n<h3>Conclusions<\/h3>\n<p>In this code story, we described creating a custom domain search question and answer experience using Azure Search, Cognitive Services, and custom testing and measurement code.\u00a0 We have shared our custom code in this <a href=\"https:\/\/github.com\/catalystcode\/customsearch\">GitHub repository<\/a>, where you can find the end-to-end examples in <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/tree\/master\/JupyterNotebooks\">Jupyter Notebooks<\/a> as well as the <a href=\"https:\/\/github.com\/CatalystCode\/CustomSearch\/tree\/master\/Python\">individual Python scripts<\/a>. We have also provided guidelines for designing your own custom search experience to help you get started.\u00a0 We hope this makes creating a custom search experience faster and more effective.\u00a0 We invite you to contribute to 
the <a href=\"https:\/\/github.com\/catalystcode\/customsearch\">GitHub repository<\/a> and\u00a0to provide feedback in the comments section below.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We address the challenge of creating a custom search experience for a specific domain area. We also provide a guide for creating your own custom search experience by leveraging Azure Search and Cognitive Services and sharing custom code for iterative testing, measurement and indexer redeployment.<\/p>\n","protected":false},"author":21400,"featured_media":12266,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[13,14,19],"tags":[92,132,211,231,249,250],"class_list":["post-3802","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bots","category-cognitive-services","category-machine-learning","tag-azure-search","tag-conversation-as-a-platform","tag-information-extraction","tag-language-understanding-intelligent-service-luis","tag-microsoft-bot-framework-mbf","tag-microsoft-cognitive-services"],"acf":[],"blog_post_summary":"<p>We address the challenge of creating a custom search experience for a specific domain area. 
We also provide a guide for creating your own custom search experience by leveraging Azure Search and Cognitive Services and sharing custom code for iterative testing, measurement and indexer redeployment.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/3802","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21400"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=3802"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/3802\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/12266"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=3802"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=3802"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=3802"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}