{"id":13703,"date":"2021-06-14T12:15:02","date_gmt":"2021-06-14T19:15:02","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cse\/?p=13703"},"modified":"2021-06-15T11:29:16","modified_gmt":"2021-06-15T18:29:16","slug":"entity-disambiguation-using-search-engine","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/entity-disambiguation-using-search-engine\/","title":{"rendered":"Entity Disambiguation Using Search Engine"},"content":{"rendered":"<h3 aria-level=\"2\"><span data-contrast=\"none\">Background<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Entity disambiguation or resolving misspelled\/ambiguous tokens has always been an interesting subject with its own challenges. With the rise of digital assistant devices such as Microsoft Cortana, Google Home, Amazon Alexa etc., the overall accuracy of the device has become important. The communication quality with a device contains the capability to understand huma<\/span><span data-contrast=\"none\">n\u2019s<\/span><span data-contrast=\"none\">\u00a0intent, distinguishing correct entities with their correct spelling form and finally fulfilling the human\u2019s request with an appropriate response.\u00a0 Similarly, with the challenges in optical character recognition (OCR) results the best recognition results can only be achieved if the source image is of decent quality<\/span><b><span data-contrast=\"none\">.\u00a0<\/span><\/b><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Challenges and Objectives<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40,&quot;335559740&quot;:276}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">A common challenge when mapping an entity name from a document (electronic or scanned), a written or a transcribed spoken sentence to its canonical form in a system of record, is in handling 
variations in spelling, spelling mistakes, errors due to OCR transcription of a poor-quality scan, or differences in entity form between a transcription of speech and the canonical form. These variations require sub-token matching, word breaking or collation of sub-tokens to increase the likelihood of matching when searching for the most relevant canonical form of the entity.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Having misspelled entities\u00a0means\u00a0we may not be able to fulfill the initial speech request. Therefore, we should either resolve the misspelled entities to their correct form before attempting to query a data source, or prepare the target data source to fulfill queries containing misspelled entities with\u00a0a higher\u00a0accuracy rate.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Solution<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">This document proposes a methodology to disambiguate misspelled entities by comparing the search retrieval performance of different custom search analyzers in a search engine. With a well-chosen analyzer, even if the query contains misspelled entities, the search engine can respond to the request with higher precision and recall than with the default settings. This method can be applied to any search engine service that supports custom search analyzers.\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">In this document we experiment with a speech-to-text scenario. Similar approaches can be applied to OCR and any machine entity extraction system. 
The implemented solution can be found in <a href=\"https:\/\/github.com\/Azure-Samples\/EntityDisambiguation\/\">EntityDisambiguation <\/a><\/span><span data-contrast=\"none\">repository.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Architecture\u00a0deep-dive<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">In this section, we will discuss the motivation behind the proposed solution. Consider a speech-to-text scenario where a user is interacting with a device and asking to make a phone call.\u00a0The figure below indicates the overall architecture\u00a0from\u00a0the\u00a0user speech,\u00a0to searching\u00a0the data source and responding\u00a0to the user\u2019s request.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><span data-contrast=\"none\">User: Hey device, call\u202f\u201cJ<\/span><b><span data-contrast=\"none\">ea<\/span><\/b><span data-contrast=\"none\">n\u202fH<\/span><b><span data-contrast=\"none\">e<\/span><\/b><span data-contrast=\"none\">ng\u201d\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Device: Sorry I cannot find \u201cJ<\/span><b><span data-contrast=\"none\">oh<\/span><\/b><span data-contrast=\"none\">n\u202fH<\/span><b><span data-contrast=\"none\">a<\/span><\/b><span data-contrast=\"none\">ng\u201d\u202f<\/span><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/Architecture.png\"><img decoding=\"async\" class=\"alignnone wp-image-13704\" 
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/Architecture-300x169.png\" alt=\"Image Architecture\" width=\"662\" height=\"373\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Architecture-300x169.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Architecture.png 598w\" sizes=\"(max-width: 662px) 100vw, 662px\" \/><\/a><\/p>\n<p><span data-contrast=\"auto\">The speech is converted to text and then it will be sent to a natural language understanding service such as LUIS. We assume that our LUIS (Language Understanding Intelligent Service) model is trained well, and we will expect that it can identify the intent and extract the entities. It can identify that \u201ccall J<strong>ea<\/strong>n H<strong>e<\/strong>ng\u201d is a \u201ccall\u201d intent with an accuracy score of 0.88. It also can identify that \u201cJ<strong>ea<\/strong>n H<strong>e<\/strong>ng\u201d part of this query is a \u201cpersonName\u201d. 
However, when we query the search engine for \u201cJ<strong>ea<\/strong>n H<strong>e<\/strong>ng\u201d, the search may fail to retrieve this record from the data source, since what exists in the search index is \u201cJ<strong>oh<\/strong>n H<strong>a<\/strong>ng\u201d and not \u201cJ<strong>ea<\/strong>n H<strong>e<\/strong>ng\u201d.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint\">{\r\n  \"query\": \"call jean heng\",\r\n  \"topScoringIntent\": {\r\n    \"intent\": \"Call\",\r\n    \"score\": 0.8851808\r\n  },\r\n  \"intents\": [\r\n    {\r\n      \"intent\": \"Call\",\r\n      \"score\": 0.8851808\r\n    },\r\n    {\r\n      \"intent\": \"None\",\r\n      \"score\": 0.07000262\r\n    }\r\n  ],\r\n  \"entities\": [\r\n    {\r\n      \"entity\": \"jean heng\",\r\n      \"type\": \"builtin.personName\",\r\n      \"startIndex\": 5,\r\n      \"endIndex\": 13\r\n    }\r\n  ]\r\n}<\/pre>\n<p aria-level=\"3\"><span class=\"TextRun SCXW229480454 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW229480454 BCX0\">Sample response from LUIS. Intent can be <\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">identified<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">\u00a0with high confidence (0.88)<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">\u00a0and the entity\u00a0<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">is<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">\u00a0a\u00a0<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">personName<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">. 
However, as the name is misspelled, when we query our data source (search engine) the name cannot be\u00a0<\/span><span class=\"NormalTextRun SCXW229480454 BCX0\">found.<\/span><\/span><span class=\"EOP SCXW229480454 BCX0\" data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Methodology<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">This section presents an overview of the proposed methodology to improve the search retrieval for misspelled entities, namely personName. Different <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/search-analyzers\">search analyzers<\/a> use different tokenizers and token filters. A <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/index-add-custom-analyzers#tokenizers\">tokenizer<\/a><\/span><span data-contrast=\"none\">\u00a0divides continuous text into a sequence of tokens, such as breaking a sentence into words, and a token filter is used to filter out or modify the tokens generated by a tokenizer<\/span><span data-contrast=\"none\">. As a result, different search analyzers may behave differently, and their performance on misspelled person names may\u00a0vary. Our approach is to measure the performance of the search engine in the retrieval of a misspelled\u00a0person&#8217;s name\u00a0when the search engine uses a single analyzer or a combination of analyzers.\u00a0 We begin by\u00a0creating a search index using different search analyzers. 
Then we will\u00a0ingest the\u00a0person names into the created search index.\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p aria-level=\"3\"><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Analysis of the\u00a0Search\u00a0Index\u00a0Schema\u00a0<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">In this section we will be creating a search schema for our search index. We first identify several search <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/search-analyzers\">analyzers<\/a><\/span><span data-contrast=\"none\">. Then we will identify the fields of the search index needed in this experiment.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Analyzers\u00a0<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">An analyzer is a component of the full-text search engine responsible for processing text in query strings and indexed documents. Different analyzers manipulate text in diverse ways depending on the scenario. By default, Azure Search uses <a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/core\/org\/apache\/lucene\/analysis\/standard\/StandardAnalyzer.html\">Standard Lucene<\/a> for parsing both the records\/documents and the queries. 
We will be creating several custom search analyzers that we think may be the best candidates for the experiment.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><span data-contrast=\"none\">In the next section, we will run our experiments by sending a query to the search index for all the combinations of the created search analyzers to verify which analyzer can improve the name retrieval.\u00a0Most of the custom analyzers\u00a0use\u00a0<a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-common\/org\/apache\/lucene\/analysis\/core\/LowerCaseTokenizer.html\">case-folding<\/a><\/span><span data-contrast=\"none\">\u00a0to match terms\u00a0that differ only in letter case. Additionally, they use\u00a0<a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-common\/org\/apache\/lucene\/analysis\/miscellaneous\/ASCIIFoldingFilter.html\">ASCII\u00a0folding<\/a> to convert numeric, alphabetic and symbolic\u00a0Unicode\u00a0characters to\u00a0their ASCII\u00a0equivalents, where one exists.\u00a0The <a href=\"https:\/\/docs.microsoft.com\/en-us\/rest\/api\/searchservice\/test-analyzer\">Analyze REST API<\/a><\/span><span data-contrast=\"none\">\u00a0can be used to verify how each\u00a0<\/span><span data-contrast=\"none\">analyzer\u00a0analyzes<\/span><span data-contrast=\"none\">\u00a0text. 
In this document, we chose\u00a0`Person Name`\u00a0as our experimented entity and we are creating Search Analyzers in the subsequent section.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">In our experiment we used the following search analyzers<\/span><span data-contrast=\"none\">: <a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-phonetic\/org\/apache\/lucene\/analysis\/phonetic\/package-tree.html\">Phonetic<\/a>, <a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-common\/org\/apache\/lucene\/analysis\/ngram\/EdgeNGramTokenizer.html\">Edge-N-Gram<\/a>, <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/index-add-custom-analyzers\">Microsoft<\/a>, <a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-common\/org\/apache\/lucene\/analysis\/core\/LetterTokenizer.html\">Letter<\/a>, CamelCase, <a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/analyzers-common\/org\/apache\/lucene\/analysis\/standard\/UAX29URLEmailTokenizer.html\">URL-Email<\/a> and default Analyzer (<a href=\"https:\/\/lucene.apache.org\/core\/6_6_1\/core\/org\/apache\/lucene\/analysis\/standard\/StandardAnalyzer.html\">Standard Lucene<\/a>).<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Search\u00a0Index Fields<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Now that we have created our search analyzers, we will create our search index. We will be adding several fields, each of them corresponding to a search analyzer. 
The following table represents the fields added to the search index.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr aria-rowindex=\"1\">\n<td style=\"width: 37.785%;\" data-celllook=\"69905\"><span data-contrast=\"auto\">Field\u202f<\/span><span data-contrast=\"none\">\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"69905\"><span data-contrast=\"none\">Analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"2\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">phonetic\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">phonetic-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"3\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">edge_n_gram\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">edge-n-gram-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"4\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">microsoft\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span 
data-contrast=\"auto\">en.microsoft\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"5\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">letter\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">letter-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"6\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">ngram\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">ngram-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"7\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">camelcase\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">camel-case-pattern-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"8\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">stemming\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">stemming-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr 
aria-rowindex=\"9\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">url_email\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">url-email-analyzer\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"10\">\n<td style=\"width: 37.785%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">standard_lucene\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 62.1064%;\" data-celllook=\"4369\"><span data-contrast=\"auto\">standard.lucene\u202f<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre class=\"prettyprint\"><\/pre>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Data Ingestion<\/span><span data-contrast=\"none\">\u00a0<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">We used\u00a0the\u00a0midterm election\u00a0candidates search results\u00a0<a href=\"https:\/\/www.kaggle.com\/eliasdabbas\/midterm-elections-candidates-search-results-pages\">dataset<\/a><\/span><span data-contrast=\"auto\">\u00a0and ingested 400 names into the search index. 
For every record, the name was added to every field, and each field was assigned a different analyzer.\u00a0The\u00a0JSON (JavaScript Object Notation)\u00a0shown below\u00a0is the result of a search against the search index with the query:\u00a0\u201cJean Heng\u201d.\u202fThe data\u00a0ingested\u00a0into the search index will be used in the next section to verify the search retrieval precision and recall for different analyzers.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint\">{\r\n  \"@search.score\": 4.589165,\r\n  \"stndard_lucene\": \"Jean Heng\",\r\n  \"phonetic\": \"Jean Heng\",\r\n  \"edge_n_gram\": \"Jean Heng\",\r\n  \"letter\": \"Jean Heng\",\r\n  \"ngram\": \"Jean Heng\",\r\n  \"camelcase\": \"Jean Heng\",\r\n  \"steming\": \"Jean Heng\",\r\n  \"url_email\": \"Jean Heng\",\r\n  \"text_microsoft\": \"Jean Heng\"\r\n}<\/pre>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Experiment<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">In the previous section, we created a search index containing 400 names.\u202fIn this section<\/span><span data-contrast=\"none\">,<\/span><span data-contrast=\"none\">\u00a0we will run experiments to identify the\u00a0highest-performing\u00a0analyzer(s) for the name field.\u202fThe goal is to\u00a0improve search retrieval so that,\u00a0even for a misspelled name such as \u201cJ<strong>oh<\/strong>n H<strong>a<\/strong>ng\u201d,\u00a0we\u00a0are able to\u00a0retrieve \u201cJ<strong>ea<\/strong>n H<strong>e<\/strong>ng\u201d from the search engine<\/span><span data-contrast=\"none\">.\u202f<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span 
data-contrast=\"none\">Measuring the retrieval performance\u00a0<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">To measure the search accuracy, we need to know how the search engine\u202fcan retrieve a search result if a misspelled name is passed as a\u202fquery; it is\u202fimportant\u202ffor us that the\u202fretrieved\u202fresult is accurate.\u202fThe following table contains the confusion matrix measures we will need to\u202fcalculate\u00a0the\u202fsearch performance<\/span><span data-contrast=\"none\">.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<table style=\"border-collapse: collapse; width: 80.7291%; height: 263px;\">\n<tbody>\n<tr style=\"height: 85px;\" aria-rowindex=\"1\">\n<td style=\"width: 52.9844px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"auto\">TP\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 189.984px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"none\">True\u202fPositive\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 677.031px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"none\">Given a misspelled name is searched,<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">a\u00a0name\u202fis\u00a0found\u202fand it matches the expected\u00a0name.\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr style=\"height: 85px;\" aria-rowindex=\"2\">\n<td style=\"width: 52.9844px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"auto\">FP\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 
189.984px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"none\">False\u202fPositive\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 677.031px; height: 85px;\" data-celllook=\"4369\"><span data-contrast=\"none\">Given a misspelled name is\u202fsearched,<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">a\u00a0name\u00a0is\u00a0found but it does not match the expected\u202fname.\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr style=\"height: 44px;\" aria-rowindex=\"3\">\n<td style=\"width: 52.9844px; height: 44px;\" data-celllook=\"4369\"><span data-contrast=\"auto\">TN\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 189.984px; height: 44px;\" data-celllook=\"4369\"><span data-contrast=\"none\">True Negative\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 677.031px; height: 44px;\" data-celllook=\"4369\"><span data-contrast=\"none\">Given a non-existent\u202fname is searched,\u202fno name\u202fis\u00a0found.\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr style=\"height: 44px;\" aria-rowindex=\"4\">\n<td style=\"width: 52.9844px; height: 44px;\" data-celllook=\"4369\"><span data-contrast=\"auto\">FN\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 189.984px; height: 44px;\" data-celllook=\"4369\"><span data-contrast=\"none\">False Negative\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 677.031px; height: 44px;\" 
data-celllook=\"4369\"><span data-contrast=\"none\">Given a misspelled name is searched,\u202fno name\u00a0is\u00a0found.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\n<span class=\"TextRun SCXW109378115 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW109378115 BCX0\">With\u00a0<\/span><span class=\"NormalTextRun SCXW109378115 BCX0\">TP, FP, TN and FN\u00a0<\/span><span class=\"NormalTextRun SCXW109378115 BCX0\">defined,\u00a0<\/span><span class=\"NormalTextRun SCXW109378115 BCX0\">we can calculate both recall and precision<\/span><\/span><span class=\"TextRun SCXW109378115 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW109378115 BCX0\">\u00a0using the following formulas<\/span><span class=\"NormalTextRun SCXW109378115 BCX0\">:<\/span><\/span><span class=\"EOP SCXW109378115 BCX0\" data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/span><\/p>\n<p><figure id=\"attachment_13713\" aria-labelledby=\"figcaption_attachment_13713\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/FORMILA-PR.png\"><img decoding=\"async\" class=\"wp-image-13713 \" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/FORMILA-PR-300x135.png\" alt=\"Image FORMILA PR\" width=\"256\" height=\"115\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/FORMILA-PR-300x135.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/FORMILA-PR.png 454w\" sizes=\"(max-width: 256px) 100vw, 256px\" \/><\/a><figcaption id=\"figcaption_attachment_13713\" class=\"wp-caption-text\">Precision (top) and Recall (bottom) 
formulas<\/figcaption><\/figure><\/p>\n<p><span class=\"TextRun SCXW18515155 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW18515155 BCX0\">Since both precision and recall are crucial factors of the search retrieval, we will be using<\/span><span class=\"NormalTextRun SCXW18515155 BCX0\">\u00a0the<\/span><span class=\"NormalTextRun SCXW18515155 BCX0\">\u00a0F1 score, which takes both precision and recall into account. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. The following formula is how\u00a0<\/span><span class=\"NormalTextRun SCXW18515155 BCX0\">we calculate<\/span><span class=\"NormalTextRun SCXW18515155 BCX0\">\u00a0the F1 score<\/span><span class=\"NormalTextRun SCXW18515155 BCX0\">.<\/span><\/span><span class=\"EOP SCXW18515155 BCX0\" data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><figure id=\"attachment_13714\" aria-labelledby=\"figcaption_attachment_13714\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/F1.png\"><img decoding=\"async\" class=\" wp-image-13714\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/F1-300x47.png\" alt=\"Image F1\" width=\"428\" height=\"67\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/F1-300x47.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/F1-1024x160.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/F1-768x120.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/F1.png 1270w\" sizes=\"(max-width: 428px) 100vw, 428px\" \/><\/a><figcaption id=\"figcaption_attachment_13714\" class=\"wp-caption-text\">F1 Score formula<\/figcaption><\/figure><\/p>\n<h3 aria-level=\"3\"><span 
data-contrast=\"none\">Misspelled\u00a0Names and Expected\u00a0Names<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">In the previous section, we ingested 400 names into our search engine. In this section<\/span><span data-contrast=\"none\">,<\/span><span data-contrast=\"none\">\u00a0we will create a dataset having 450 {misspelled, expected} tuples. The first 400 tuples\u00a0correspond\u00a0to names existing in our search index. The last 50 tuples do not exist in our search index and are used to measure the TN (true negative) cases of the experiment.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p aria-level=\"3\"><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Generate Misspelled\u00a0Names<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">The most accurate method to generate misspelled names is to use telemetry data<\/span><span data-contrast=\"none\">\u00a0from the system you are working with. For instance, find the names that were not correctly spelled in the logs and capture them for use in the experiment. 
If you do not have access to telemetry data\u00a0to create misspelled names, you can use\u00a0the\u00a0Azure speech-to-text\u00a0API.\u00a0The following pseudocode shows how to generate a dataset of misspelled and expected names from a corpus of names using the speech-to-text API.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint\">Begin \r\n\r\nmisspelled_names  = empty list \r\n\r\nexpected_names    = empty list \r\n\r\nnames             = names_corpus \r\n\r\nWhile more data is needed for the experiment, do \r\n\r\n      Call speech API providing a name from names \r\n\r\n      If the response from API is equal to the correct name \r\n\r\n                 Continue \r\n\r\n      Else \r\n\r\n                 Insert misspelled name into misspelled_names \r\n\r\n                 Insert correct name into expected_names \r\n\r\nEnd<\/pre>\n<p><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">Finally, we will make a dataset of names<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">, l<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">ike<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">\u202f<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">in the\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">table\u202f<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">below,\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">where the first 400 records in the expected name column\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">are\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">ingested in<\/span><span class=\"NormalTextRun SCXW44031999 
BCX0\">to<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">\u00a0<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">A<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">zure\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">S<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">earch<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">(<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">as discussed in\u00a0<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">the previous<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">\u00a0section<\/span><\/span><span class=\"TextRun SCXW44031999 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW44031999 BCX0\">)<\/span><span class=\"NormalTextRun SCXW44031999 BCX0\">.<\/span><\/span><span class=\"EOP SCXW44031999 BCX0\" data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<table style=\"border-collapse: collapse; width: 58.5296%; height: 345px;\">\n<tbody>\n<tr aria-rowindex=\"1\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"69905\"><span data-contrast=\"auto\">#\u202f<\/span><span data-contrast=\"none\">\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"69905\"><span data-contrast=\"none\">Expected\u00a0name\u202f\u00a0<\/span><span 
data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"69905\"><span data-contrast=\"none\">Misspelled\u202fname\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"2\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">1\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Tobye\u202fSchimpke\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">tobe\u202fshimpok\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"3\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">2\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Sidney\u202fMcElree\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Sidney Mc. 
Elree\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"4\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">&#8230;\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">\u202f\u00a0<\/span><span data-ccp-props=\"{&quot;335551550&quot;:2,&quot;335551620&quot;:2}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"5\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">400\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Kimmie Fridlington\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Kim Fridlington\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"6\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">401\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">NOT_FOUND\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Netta Niezen\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"7\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span 
data-contrast=\"auto\">&#8230;\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<tr aria-rowindex=\"8\">\n<td style=\"width: 12.6338%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">450\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 45.3961%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">NOT_FOUND\u202f\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<td style=\"width: 41.8986%; text-align: center;\" data-celllook=\"4369\"><span data-contrast=\"auto\">Aluin\u00a0Donnett\u202f<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">Feature\u00a0Selection<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Having 9 different\u00a0fields,\u00a0each of them using different analyzers\u00a0without considering the empty<\/span><b><span data-contrast=\"none\">\u202f<\/span><\/b><span data-contrast=\"none\">set,\u00a0we\u202fwill have\u202f511\u202fcombinations\u00a0of the fields to query against in our experiment.\u202fWe also have\u00a0\u00a0\u00a0400\u202fdocuments\u202fin our search engine. 
To cover all possible\u00a0scenarios,\u00a0we would make 400\u202f\u00d7 511 = 204,400 calls to\u00a0Azure\u00a0Search. To shorten the experiment's execution time, we can also use a feature elimination technique [5]. For instance, we can run the experiment in several phases. In each phase<\/span><span data-contrast=\"none\">,<\/span><span data-contrast=\"none\"> we remove the fields whose F1 score is significantly lower than that of the other fields, so the set of field subsets shrinks as the experiment proceeds. Whether we use all 511 subsets or a feature elimination technique, we end up with a list of features in which each element contains one or more fields (e.g., features = [\u201cngram\u201d, \u201cphonetic, ngram\u201d, \u2026, \u201cphonetic, ngram, stemming\u201d, \u2026]).<\/span><b><span data-contrast=\"none\">\u202f\u202f\u00a0<\/span><\/b><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p aria-level=\"3\"><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"3\"><span data-contrast=\"none\">F1 Scores Calculation<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">Having a dataset of `misspelled_names` and `expected_names` (created in section\u202f4.2.1) and\u202f<\/span><span data-contrast=\"auto\">a list\u00a0<\/span><span data-contrast=\"none\">of features (created in the previous section), we can now calculate the F1 score for each feature. For every feature in the list of features and for every `misspelled_name` in our dataset, we perform the following steps:<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<ol>\n<li><span class=\"TextRun SCXW6682916 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW6682916 
BCX0\">S<\/span><span class=\"NormalTextRun SCXW6682916 BCX0\">end a query to our\u00a0<\/span><span class=\"NormalTextRun SCXW6682916 BCX0\">search\u00a0<\/span><span class=\"NormalTextRun CommentStart SCXW6682916 BCX0\">index\u00a0<\/span><span class=\"NormalTextRun SCXW6682916 BCX0\">with the following payload:<\/span><\/span>\n<pre class=\"prettyprint\">{\r\n  \"queryType\": \"full\",\r\n  \"search\": \"MISSPELLED NAME\",\r\n  \"searchFields\": \"FEATURE\/S\"\r\n}<\/pre>\n<\/li>\n<li><span class=\"NormalTextRun SCXW230466395 BCX0\">Take the highest-ranked result and\u00a0<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">c<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">ompare it against<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">\u00a0<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">the\u202fcorresponding\u202f<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">`<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">expected_name<\/span><span class=\"NormalTextRun SCXW230466395 BCX0\">`.<\/span><\/li>\n<li><span class=\"TextRun SCXW14841854 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW14841854 BCX0\">Mark the result based on the\u00a0<\/span><span class=\"NormalTextRun SCXW14841854 BCX0\">following<\/span><span class=\"NormalTextRun SCXW14841854 BCX0\">:\u202f\u00a0<\/span><\/span><span class=\"EOP SCXW14841854 BCX0\" data-ccp-props=\"{&quot;134233279&quot;:true}\">\u00a0<\/span>\n<ul>\n<li><span class=\"NormalTextRun BCX0 SCXW43900721\"><strong>TP<\/strong>:\u202fA name\u00a0<\/span><span class=\"NormalTextRun BCX0 SCXW43900721\">was found<\/span><span class=\"NormalTextRun BCX0 SCXW43900721\">, and it matches the\u202fexpected\u202f<\/span><span class=\"NormalTextRun BCX0 SCXW43900721\">name.<\/span><\/li>\n<li><span class=\"NormalTextRun SCXW84430020 BCX0\"><strong>FP<\/strong>:\u202fA name\u00a0<\/span><span class=\"NormalTextRun SCXW84430020 
BCX0\">was found<\/span><span class=\"NormalTextRun SCXW84430020 BCX0\">\u00a0but it does not match the\u202fexpected\u202f<\/span><span class=\"NormalTextRun SCXW84430020 BCX0\">name.<\/span><span class=\"NormalTextRun SCXW84430020 BCX0\">\u202f<\/span><\/li>\n<li><span class=\"TextRun BCX0 SCXW168420611\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun BCX0 SCXW168420611\"><strong>TN<\/strong>:\u202fNo name was found, and none was expected.<\/span><\/span><\/li>\n<li><span class=\"NormalTextRun BCX0 SCXW25784445\"><strong>FN<\/strong>: No name\u00a0<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">was found<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">\u00a0but\u00a0<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">we<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">\u00a0expected<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">\u202fto find\u202f<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">one<\/span><span class=\"NormalTextRun BCX0 SCXW25784445\">.<\/span><\/li>\n<\/ul>\n<\/li>\n<li><span class=\"NormalTextRun BCX0 SCXW157284153\">Using the formula discussed in\u00a0<\/span><span class=\"NormalTextRun BCX0 SCXW157284153\">the previous<\/span><span class=\"NormalTextRun BCX0 SCXW157284153\">\u00a0section<\/span><span class=\"NormalTextRun BCX0 SCXW157284153\">,\u00a0<\/span><span class=\"NormalTextRun BCX0 SCXW157284153\">calculate<\/span><span class=\"NormalTextRun BCX0 SCXW157284153\">\u00a0the precision, recall, and F1 for each feature.\u202f<\/span><\/li>\n<\/ol>\n<p><span class=\"TextRun SCXW26573843 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW26573843 BCX0\">Finally, for each feature, we will have a result\u202flike\u202fthe one\u202f<\/span><span class=\"NormalTextRun SCXW26573843 BCX0\">shown<\/span><span class=\"NormalTextRun SCXW26573843 BCX0\">\u00a0in the following schema, consisting of different 
measures for the combination of different search\u00a0<\/span><span class=\"NormalTextRun SCXW26573843 BCX0\">analyzers.<\/span><\/span><span class=\"EOP SCXW26573843 BCX0\" data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint\">{\r\n  \"fn\": 26,\r\n  \"tp\": 365,\r\n  \"tn\": 18,\r\n  \"fp\": 11,\r\n  \"fields\": \"camelcase-url_email-text_microsoft\",\r\n  \"precision\": 0.6206896551724138,\r\n  \"recall\": 0.9335038363171355,\r\n  \"f1\": 0.7456165238608637\r\n}<\/pre>\n<h3><span class=\"TextRun SCXW156705585 BCX0\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"none\"><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">Select\u00a0<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">t<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">he\u00a0<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">B<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">est\u00a0<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">A<\/span><span class=\"NormalTextRun SCXW156705585 BCX0\" data-ccp-parastyle=\"heading 3\">nalyzers<\/span><\/span><span class=\"EOP SCXW156705585 BCX0\" data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">After\u202frunning\u202fthe experiments mentioned in the previous section, we selected feature(s)\u202fwith the highest F1 score.\u202fIf several features\u00a0exist\u00a0with\u00a0a similar score,\u202fwe can increase the number of test data for misspelled names and expected names.\u00a0\u00a0Then\u00a0we can\u00a0add those extra records into our\u00a0Azure\u00a0Search and run the experiment again.\u202fWe can\u00a0also\u00a0decrease our features using a feature elimination method.\u202f\u00a0<\/span><span 
data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Comparing the F1 scores, we noticed that the combination\u202fof\u202f<\/span><b><span data-contrast=\"none\">{\u201ccamelCase\u201d, \u201curl_email\u201d, \u201cmicrosoft\u201d}<\/span><\/b><span data-contrast=\"none\">\u00a0has the highest F1 score,\u00a0<\/span><b><span data-contrast=\"none\">0.74,<\/span><\/b><span data-contrast=\"none\">\u00a0whereas if we rely on the default analyzer (standard_lucene), the F1 score is\u202f<\/span><b><span data-contrast=\"none\">0.69<\/span><\/b><span data-contrast=\"none\">. Therefore, if we specify these search analyzers in the search query, we gain an improvement of approximately 5 percentage points in the F1 score for name retrieval from the search engine. Figure\u202f17 shows, in ascending order, the F1 scores of 50 of the 511 tested features whose F1 score is higher than that of the default analyzer.\u00a0<\/span><b><span data-contrast=\"none\">\u00a0<\/span><\/b><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><figure id=\"attachment_13718\" aria-labelledby=\"figcaption_attachment_13718\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM.png\"><img decoding=\"async\" class=\"size-medium wp-image-13718\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM-300x219.png\" alt=\"Analyzers with F1 higher than default \" width=\"300\" height=\"219\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM-300x219.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM-1024x747.png 1024w, 
https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM-768x560.png 768w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2021\/06\/Screen-Shot-2021-06-08-at-10.30.48-AM.png 1086w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"figcaption_attachment_13718\" class=\"wp-caption-text\">Analyzers with F1 higher than default (standard_lucene)<\/figcaption><\/figure><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Summary<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">In this document we discussed how we can disambiguate misspelled entities using\u202fAzure\u202fSearch. We discussed how we can use Microsoft LUIS to disambiguate entities from the intents, and how we can\u202fimprove entity\u202fretrieval\u202fby experimenting with different search analyzers. Then we<\/span><span data-contrast=\"none\">\u00a0created a search index using nine search analyzers and ingested a dataset of people names into that index. We further executed the\u202fexperiment\u202fby sending misspelled people names to the search index and based on the\u202fretrieval\u202fperformance we have selected the highest performing search analyzers. Experimenting with different search analyzers we could specify the search analyzers with the best precision and recall.<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">Hence,\u00a0we could improve the\u202fF1 score of the search retrieval by 5%.\u202f<\/span><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"none\">This experiment was executed for person names. 
For any other\u00a0field\u00a0(e.g., address, city, or country),\u00a0a\u00a0similar experiment should be run to make sure the highest-performing analyzers are selected. Additionally, this approach can be combined with a spell check service such as Bing Spell Check to attempt to resolve misspelled entities even before calling the search service.\u00a0<\/span><b><span data-contrast=\"none\">\u00a0<\/span><\/b><span data-ccp-props=\"{&quot;335551550&quot;:6,&quot;335551620&quot;:6}\">\u00a0<\/span><\/p>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">The Team<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"none\">I would like to express my gratitude to the\u00a0amazing\u00a0team who helped me\u00a0with the experiments and implementation of this method.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<ul>\n<li><span data-contrast=\"auto\">\u00a0<\/span><span data-contrast=\"auto\">\u00a0<\/span><a href=\"https:\/\/www.linkedin.com\/in\/mokarian\/\"><span data-contrast=\"none\">Maysam Mokarian<\/span><\/a><span data-contrast=\"auto\">\u00a0,\u00a0<\/span><a href=\"mailto:mamokari@microsoft.com\"><span data-contrast=\"none\">mamokari@microsoft.com<\/span><\/a><span data-contrast=\"auto\">\u00a0\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\">\u00a0\u00a0<\/span><a href=\"https:\/\/www.linkedin.com\/in\/msolhab\/\"><span data-contrast=\"none\">Mona Soliman Habib<\/span><\/a><span data-contrast=\"auto\">\u00a0,\u00a0<\/span><a href=\"mailto:Mona.Habib@microsoft.com\"><span data-contrast=\"none\">Mona.Habib@microsoft.com<\/span><\/a><span data-contrast=\"auto\">\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\">\u00a0\u00a0<\/span><a href=\"https:\/\/www.linkedin.com\/in\/jitghosh\/\"><span data-contrast=\"none\">Jit 
Ghosh<\/span><\/a><span data-contrast=\"auto\">\u00a0,\u00a0<\/span><a href=\"mailto:pghosh@microsoft.com\"><span data-contrast=\"none\">pghosh@microsoft.com<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\">\u00a0\u00a0<\/span><a href=\"https:\/\/www.linkedin.com\/in\/margaryta-ostapchuk-09082a73\/?originalSubdomain=ca\"><span data-contrast=\"none\">Margaryta Ostapchuk<\/span><\/a><span data-contrast=\"auto\">\u00a0,\u00a0<\/span><a href=\"mailto:mostap@microsoft.com\"><span data-contrast=\"none\">mostap@microsoft.com<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\">\u00a0\u00a0<\/span><a href=\"https:\/\/www.linkedin.com\/in\/eric-rozell-48b63822\/\"><span data-contrast=\"none\">Eric Rozell<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/li>\n<\/ul>\n<h3 aria-level=\"2\"><span data-contrast=\"none\">Resources:<\/span><span data-ccp-props=\"{&quot;335559738&quot;:40}\">\u00a0<\/span><\/h3>\n<p><span data-contrast=\"auto\">[1]\u00a0<\/span><a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/search\/index-add-custom-analyzers\"><span data-contrast=\"none\">https:\/\/docs.microsoft.com\/en-us\/azure\/search\/index-add-custom-analyzers<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[2]\u00a0<\/span><a href=\"https:\/\/www.luis.ai\/\"><span data-contrast=\"none\">https:\/\/www.luis.ai\/<\/span><\/a><span data-contrast=\"auto\">\u00a0<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[3]\u00a0<\/span><a href=\"https:\/\/web.archive.org\/web\/20191114213255\/https:\/\/www.flinders.edu.au\/science_engineering\/fms\/School-CSEM\/publications\/tech_reps-research_artfcts\/TRRA_2007.pdf\"><span data-contrast=\"none\">Powers, David M W (2011)<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post proposes a methodology to 
disambiguate misspelled entities by comparing the search retrieval performance with different custom search analyzers in a search engine.<\/p>\n","protected":false},"author":53988,"featured_media":13728,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[14,1,19],"tags":[],"class_list":["post-13703","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cognitive-services","category-cse","category-machine-learning"],"acf":[],"blog_post_summary":"<p>This blog post proposes a methodology to disambiguate misspelled entities by comparing the search retrieval performance with different custom search analyzers in a search engine.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13703","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/53988"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=13703"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13703\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13728"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=13703"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=13703"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=13703"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tru
e}]}}