{"id":13308,"date":"2021-01-13T04:39:54","date_gmt":"2021-01-13T12:39:54","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cse\/?p=13308"},"modified":"2021-01-14T05:21:58","modified_gmt":"2021-01-14T13:21:58","slug":"evaluation-framework-for-information-extraction","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/ise\/evaluation-framework-for-information-extraction\/","title":{"rendered":"Evaluation Framework for Information Extraction"},"content":{"rendered":"<h4>This blog post was co-authored with Bianca Furtuna and Prasanna Muralidharan.<\/h4>\n<h2>Introduction<\/h2>\n<p>Information extraction is the process of extracting entities, relations, assertions, topics, and additional information from textual data. For example, we may want to extract medical information from doctors\u2019 clinical notes (See figure 1) and later correlate that with the patient health trajectory. Similarly, we may want to extract topics out of financial reports written by market analysts for information retrieval. In Microsoft\u2019s Commercial Software Engineering team (CSE), we often collaborate with strategic customers on information extraction problems such as Named Entity Recognition (NER). Each new project encompasses specific ways of gathering the raw data, multiple models to experiment with, and different strategies for evaluating the result. In addition, different customers will have their data stored differently and are likely to have different requirements on how these systems should be tested and evaluated.<\/p>\n<p>The approach we take for each project follows best practices for building and deploying Machine Learning (ML) applications encompassing hypotheses, experiments, data, code, and models of high quality that can be reliably managed and reproduced. These best practices ensure that a structured and well-defined experimentation and evaluation flow is set up at the beginning of the project. 
Such a setup enables a clear definition of how performance will be measured, and how all models are compared in a consistent manner. In addition, it allows for error analysis of each candidate model and can support models in any language.<\/p>\n<p>In this blog post we cover the process, requirements, and the design of such an evaluation framework for Information Extraction and specifically Named Entity Recognition.<\/p>\n<p><figure id=\"attachment_13312\" aria-labelledby=\"figcaption_attachment_13312\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/ta4h-1.png\"><img decoding=\"async\" class=\"wp-image-13312 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/ta4h-1.png\" alt=\"Text Analytics for Health demo\" width=\"839\" height=\"289\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/ta4h-1.png 839w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/ta4h-1-300x103.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/ta4h-1-768x265.png 768w\" sizes=\"(max-width: 839px) 100vw, 839px\" \/><\/a><figcaption id=\"figcaption_attachment_13312\" class=\"wp-caption-text\">Figure 1 \u2013 Example clinical text with highlighted entities and relations. (<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/cognitive-services\/text-analytics\/how-tos\/text-analytics-for-health?tabs=relation-extraction\">Source<\/a>)<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<h2>Why build an evaluation framework?<\/h2>\n<p>Evaluating ML models is hard, and evaluation logic can become very complex. 
In real-world cases, the implementation of the evaluation logic should be agreed upon by various stakeholders to make sure the model is optimized on the right business and technical metrics.<\/p>\n<p>To overcome these challenges, we propose an evaluation framework that is modular and performs robust and consistent evaluations. The framework has the following characteristics:<\/p>\n<ul>\n<li>Allows for experimentation with different datasets and models.<\/li>\n<li>Facilitates collaboration between team members, as a consistent evaluation flow lets them work together on developing the model.<\/li>\n<li>Integrates easily into <a href=\"https:\/\/github.com\/microsoft\/MLOpsPython\">MLOps frameworks<\/a> for continuous evaluation in a production setting.<\/li>\n<li>Provides the capability to perform exhaustive tests of the evaluation logic.<\/li>\n<\/ul>\n<h2>Design<\/h2>\n<p><span style=\"font-family: arial, helvetica, sans-serif;\">The framework is provided as a Python package with four main parts:<\/span><\/p>\n<ol>\n<li>An internal representation (of documents, spans, and tokens)<\/li>\n<li>Abstract class for formatting datasets (<span style=\"font-family: 'andale mono', monospace;\">DatasetFormatter<\/span>)<\/li>\n<li>Abstract class for models (<span style=\"font-family: 'andale mono', monospace;\">BaseModel<\/span>)<\/li>\n<li>Abstract class for evaluation logic (<span style=\"font-family: 'andale mono', monospace;\">ModelEvaluator<\/span>)<\/li>\n<\/ol>\n<p><span style=\"font-family: arial, helvetica, sans-serif;\">The different classes are shown in Figure 2.<\/span><\/p>\n<p><figure id=\"attachment_13315\" aria-labelledby=\"figcaption_attachment_13315\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram.png\"><img decoding=\"async\" class=\"wp-image-13315 size-full\" 
src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram-e1608215774910.png\" alt=\"Image eval framework class diagram\" width=\"1117\" height=\"464\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram-e1608215774910.png 1117w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram-e1608215774910-300x125.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram-e1608215774910-1024x425.png 1024w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/eval-framework-class-diagram-e1608215774910-768x319.png 768w\" sizes=\"(max-width: 1117px) 100vw, 1117px\" \/><\/a><figcaption id=\"figcaption_attachment_13315\" class=\"wp-caption-text\">Figure 2 &#8211; Class Diagram<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<p>Let\u2019s go deeper into what each part looks like:<\/p>\n<h3>Data Objects<\/h3>\n<p>The framework contains data objects which are passed between the different modules. These classes represent standard NLP objects like raw text, spans, and tokens. Objects can be serialized and translated to different representations.<\/p>\n<p>The main data object is the Document class. 
The Document class has the following fields:<\/p>\n<ul>\n<li><strong>Document id<\/strong><\/li>\n<li><strong>Spans<\/strong>: The start and end offsets of entities or phrases<\/li>\n<li><strong>Text<\/strong>: The raw text of this document<\/li>\n<li><strong>Tokens<\/strong>: A tokenized representation of the raw text, using a predefined tokenizer<\/li>\n<li><strong>Metadata<\/strong>: Additional metadata on the document<\/li>\n<\/ul>\n<p>The Document object resembles objects used by other frameworks, such as <a href=\"https:\/\/spacy.io\/\">spaCy<\/a> and <a href=\"https:\/\/prodi.gy\/\">Prodigy<\/a>:<\/p>\n<pre class=\"prettyprint\">{\r\n    \"spans\": [{\r\n            \"start\": 11,\r\n            \"end\": 15,\r\n            \"label\": \"PERSON\",\r\n            \"token_start\": 3,\r\n            \"token_end\": 3,\r\n            \"text\": \"Lisa\"\r\n        }\r\n    ],\r\n\r\n    \"tokens\": [\"My\", \"name\", \"is\", \"Lisa\"],\r\n    \"meta\": {\r\n        \"id\": 0\r\n    },\r\n    \"text\": \"My name is Lisa\"\r\n}\r\n<\/pre>\n<p>For Named Entity Recognition, the Document and Span objects can be translated to and from <a href=\"https:\/\/en.wikipedia.org\/wiki\/Inside%E2%80%93outside%E2%80%93beginning_(tagging)\">BIO\/IOB and BILUO\/BIOES<\/a> tags, allowing easy integration with models that expect such input and with datasets stored in this structure.<\/p>\n<p>&nbsp;<\/p>\n<h3>Dataset Formatter<\/h3>\n<p>The formatter abstraction is used to translate any given input data into a unified 
data representation. Its implementation should include the loading of the original dataset from a file or a stream, and the logic for translating it into an Iterable of Document objects. Such an abstraction permits the evaluation of models on multiple types of datasets, while still maintaining one evaluation flow. Here\u2019s the <span style=\"font-family: 'andale mono', monospace;\">DatasetFormatter<\/span> code:<\/p>\n<pre class=\"prettyprint\">from abc import ABC, abstractmethod\r\nfrom typing import Iterable\r\n\r\n\r\nclass DatasetFormatter(ABC):\r\n\r\n    @abstractmethod\r\n    def to_documents(self) -&gt; Iterable[Document]:\r\n        \"\"\"\r\n        Translate a dataset structure into an iterable of documents,\r\n        to be used by models and for evaluation\r\n        \"\"\"\r\n        pass\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h3>Base Model<\/h3>\n<p>The model abstraction allows a user to experiment with multiple types of models. For example, we can wrap a <a href=\"https:\/\/sklearn-crfsuite.readthedocs.io\/en\/latest\/\">CRF<\/a> model, a <a href=\"https:\/\/spacy.io\/\">spaCy<\/a> model and a <a href=\"https:\/\/huggingface.co\/transformers\/usage.html#named-entity-recognition\">PyTorch<\/a> model and compare the three given the same dataset and the same evaluator:<\/p>\n<pre class=\"prettyprint\">from abc import ABC, abstractmethod\r\nfrom typing import Iterable\r\n\r\n\r\nclass BaseModel(ABC):\r\n    def __init__(self):\r\n        pass\r\n\r\n    @abstractmethod\r\n    def fit(self, documents: Iterable[Document]):\r\n        \"\"\"Train the wrapped model on documents with annotated spans\"\"\"\r\n        pass\r\n\r\n    @abstractmethod\r\n    def predict(self, documents: Iterable[Document]):\r\n        \"\"\"Return documents holding the spans predicted by the wrapped model\"\"\"\r\n        pass\r\n<\/pre>\n<h3>Model Evaluator<\/h3>\n<p>The <span style=\"font-family: 'andale mono', monospace;\">ModelEvaluator<\/span> abstraction allows a user to implement different evaluation strategies, while re-using previously implemented dataset connectors or models. 
We can also use it to evaluate existing predictions stored in a file, whether they were generated earlier or outside this framework.<\/p>\n<pre class=\"prettyprint\">from abc import ABC, abstractmethod\r\nfrom typing import Dict, Iterable\r\n\r\n\r\nclass ModelEvaluator(ABC):\r\n    \"\"\"\r\n    Abstract class to hold logic for evaluation functions\r\n    \"\"\"\r\n\r\n    @abstractmethod\r\n    def evaluate(\r\n        self, annotations: Iterable[Document], predictions: Iterable[Document]\r\n    ) -&gt; Dict:\r\n        \"\"\"\r\n        Evaluate a model on a corpus of documents, each containing annotated and predicted spans,\r\n        and return the overall model metrics\r\n        :param annotations: Iterable of documents with annotated spans\r\n        :param predictions: Iterable of documents with predicted spans\r\n        :return: metrics\r\n        \"\"\"\r\n        pass\r\n<\/pre>\n<h2>Example use case: Detecting private entities in text<\/h2>\n<p>Detecting names, organizations and locations is a common problem in NLP. Using this evaluation framework, we can evaluate different NER models on different datasets, using a predefined evaluation strategy.<\/p>\n<h3>Datasets<\/h3>\n<p>If we have multiple datasets in different formats, we can create a formatter for each one. One common format for NER is the one used in the <a href=\"https:\/\/www.clips.uantwerpen.be\/conll2003\/ner\/\">CoNLL-2003 shared task<\/a>. Another common format is the <a href=\"https:\/\/brat.nlplab.org\/standoff.html\">brat standoff format<\/a>. Different annotation tools like <a href=\"https:\/\/prodi.gy\/\">Prodigy<\/a> or <a href=\"https:\/\/github.com\/doccano\/doccano\">Doccano<\/a> have their own JSONL output format.<\/p>\n<p>Let\u2019s assume we want to train a model on a brat standoff formatted dataset, and then fine-tune it on manually labeled data. 
By subclassing <span style=\"font-family: 'andale mono', monospace;\">DatasetFormatter<\/span>, we would create two classes: <span style=\"font-family: 'andale mono', monospace;\">BratFormatter<\/span> and <span style=\"font-family: 'andale mono', monospace;\">DoccanoFormatter<\/span>. The logic for transforming each type of data into the unified representation of <span style=\"font-family: 'andale mono', monospace;\">Iterable[Document]<\/span> will be written in the <span style=\"font-family: 'andale mono', monospace;\">to_documents<\/span>\u00a0function in each class. In addition, we use spaCy to tokenize the input dataset.<\/p>\n<p>Here\u2019s a na\u00efve implementation of the <span style=\"font-family: 'andale mono', monospace;\">BratFormatter<\/span>:<\/p>\n<pre class=\"prettyprint\">from pathlib import Path\r\nfrom typing import Iterable\r\n\r\nimport spacy\r\n\r\nfrom nlp_eval import Span, Document\r\nfrom nlp_eval.formatting import DatasetFormatter\r\n\r\n\r\nclass BratFormatter(DatasetFormatter):\r\n    def __init__(self, files_path: Path):\r\n        \"\"\"\r\n        Translator from the brat standoff format (https:\/\/brat.nlplab.org\/standoff.html)\r\n        to the internal representation of this package\r\n        :param files_path: Path containing txt and ann files.\r\n        \"\"\"\r\n        super().__init__()\r\n        self.files_path = files_path\r\n        # Assumption: a blank English pipeline is enough, as it is only used for tokenization\r\n        self.nlp = spacy.blank(\"en\")\r\n\r\n    def to_documents(self) -&gt; Iterable[Document]:\r\n        for txt_path in Path(self.files_path).glob(\"*.txt\"):\r\n            document_id = txt_path.stem\r\n            annotation_file_path = Path(self.files_path, f\"{document_id}.ann\").resolve()\r\n\r\n            with open(str(annotation_file_path), encoding=\"utf-8-sig\") as f_ann:\r\n                ann = f_ann.readlines()\r\n\r\n            with open(str(txt_path), encoding=\"utf-8-sig\") as f_txt:\r\n                # Replace newlines one-to-one so that character offsets stay valid\r\n                text = f_txt.read().replace('\\n', ' ')\r\n            spans = []\r\n            for line in ann:\r\n                # Naive parsing: keep only text-bound annotations (\"T\" lines)\r\n                if not line.startswith(\"T\"):\r\n                    continue\r\n                annotation = line.split()\r\n                spans.append(\r\n                    Span(\r\n                        label=annotation[1],\r\n                        start=int(annotation[2]),\r\n                        # brat end offsets are exclusive, like the Span example above\r\n                        end=int(annotation[3]),\r\n                        text=\" \".join(annotation[4:]),\r\n                    )\r\n                )\r\n            yield Document(text=text, spans=spans, document_id=document_id, tokens=self.nlp(text))\r\n<\/pre>\n<p>Now that we have our two datasets ready for modeling, let\u2019s discuss the different models we\u2019d like to experiment with.<\/p>\n<h3>Models<\/h3>\n<p>Let\u2019s assume we want to experiment with three models &#8211; a simple CRF model (using the <a href=\"https:\/\/sklearn-crfsuite.readthedocs.io\/en\/latest\/\">sklearn-crfsuite<\/a> package), a <a href=\"https:\/\/spacy.io\/\">spaCy<\/a> model and finally a transformer-based model using packages like <a href=\"https:\/\/github.com\/flairNLP\/flair\">Flair<\/a> or <a href=\"https:\/\/huggingface.co\/transformers\/usage.html#named-entity-recognition\">transformers<\/a>. We can use the <span style=\"font-family: 'andale mono', monospace;\">BaseModel<\/span>\u00a0abstract class to form three new classes, one for each model type. This gives all models a consistent API, so we can experiment with them using the exact same flow. While these model wrappers add some overhead, they help ensure that we are comparing apples to apples, and they make it easier to swap the production model in our pipeline for another, as all models have identical APIs.<\/p>\n<p>The two abstract methods in <span style=\"font-family: 'andale mono', monospace;\">BaseModel<\/span> are <span style=\"font-family: 'andale mono', monospace;\">fit<\/span>\u00a0and <span style=\"font-family: 'andale mono', monospace;\">predict<\/span>, whose names follow the convention of the popular scikit-learn package. 
Each method translates the unified Document representation into the input expected by the specific model, and translates the model\u2019s output back into the same representation, so that we can use a unified evaluation for all models.<\/p>\n<p>Here\u2019s a small example of how to adapt a spaCy NER model into a <span style=\"font-family: 'andale mono', monospace;\">BaseModel<\/span>:<\/p>\n<pre class=\"prettyprint\">import random\r\nfrom typing import Iterable, List\r\n\r\nimport spacy\r\n\r\nfrom nlp_eval import Span, Document\r\n\r\n\r\nclass SpacyNERModel(BaseModel):\r\n    def __init__(self, nlp=None):\r\n        self.nlp = nlp\r\n        super().__init__()\r\n\r\n    def fit(self, documents: Iterable[Document]):\r\n        # spaCy simple training style, taken from https:\/\/spacy.io\/usage\/training#training-simple-style\r\n        train_data = SpacyNERModel._documents_to_spacy_train_data(documents)\r\n        if self.nlp is None:\r\n            self.nlp = spacy.blank(\"en\")\r\n        # A blank pipeline has no NER component, so add one and register the labels (spaCy v2 API)\r\n        if \"ner\" not in self.nlp.pipe_names:\r\n            self.nlp.add_pipe(self.nlp.create_pipe(\"ner\"))\r\n        ner = self.nlp.get_pipe(\"ner\")\r\n        for _, annotations in train_data:\r\n            for _, _, label in annotations[\"entities\"]:\r\n                ner.add_label(label)\r\n        optimizer = self.nlp.begin_training()\r\n        for _ in range(20):  # number of training epochs\r\n            random.shuffle(train_data)\r\n            for text, annotations in train_data:\r\n                self.nlp.update([text], [annotations], sgd=optimizer)\r\n        self.nlp.to_disk(\"\/model\")\r\n\r\n    def predict(self, documents: Iterable[Document]):\r\n        doc_tuples = [(d.text, d) for d in documents]\r\n        for doc, original_doc in self.nlp.pipe(doc_tuples, as_tuples=True):\r\n            predicted_spans = [Span.from_spacy_span(ent) for ent in doc.ents]\r\n            spacy_doc = Document(text=original_doc.text, tokens=doc, spans=predicted_spans)\r\n            yield spacy_doc\r\n\r\n    @staticmethod\r\n    def _documents_to_spacy_train_data(documents: Iterable[Document]):\r\n        training_data = []\r\n        for document in documents:\r\n            document.handle_overlapping_spans()\r\n            training_data.append((document.text, SpacyNERModel._spans_to_spacy_spans(document.spans)))\r\n        return training_data\r\n\r\n    @staticmethod\r\n    def _spans_to_spacy_spans(spans: List[Span]):\r\n        return {\"entities\": [(span.start, span.end, span.label) for span in spans]}\r\n<\/pre>\n<p>In the <span style=\"font-family: 'andale mono', monospace;\">fit<\/span> function, we translate the documents into the input format that spaCy expects. Then we train a spaCy model and save it.\u00a0In the <span style=\"font-family: 'andale mono', monospace;\">predict<\/span> function, we run batch prediction on the test documents, and translate the output into an iterable of <span style=\"font-family: 'andale mono', monospace;\">Document<\/span> objects.<\/p>\n<h3>Evaluation<\/h3>\n<p>Lastly, for evaluation, we could use the <span style=\"font-family: 'andale mono', monospace;\">ModelEvaluator<\/span> abstract class to implement the evaluation we require. Since in most cases we would only have one evaluation strategy, we can consider either using the evaluator object directly (making it non-abstract) or creating a class which implements it.<\/p>\n<p>In this scenario, let\u2019s use the <a href=\"https:\/\/github.com\/chakki-works\/seqeval\">seqeval<\/a> package to calculate different NER metrics.<\/p>\n<pre class=\"prettyprint\">from typing import Iterable\r\n\r\nfrom seqeval.metrics import classification_report\r\nfrom seqeval.scheme import BILOU\r\n\r\n\r\nclass NERSimpleEvaluator(ModelEvaluator):\r\n    \"\"\"\r\n    Contains the evaluator functions for Named Entity Recognition (NER) using the seqeval framework\r\n    \"\"\"\r\n    def evaluate(self, annotations: Iterable[Document], predictions: Iterable[Document], schema=BILOU):\r\n        \"\"\"\r\n        Evaluate a model on a corpus of documents, each containing annotated and predicted spans\r\n        \"\"\"\r\n        annot_list = []\r\n        pred_list = []\r\n\r\n        for annot, pred in zip(annotations, predictions):\r\n            annot_list.append(annot.get_biluo())\r\n            pred_list.append(pred.get_biluo())\r\n\r\n        # output_dict=True returns the metrics as a dict, as declared by ModelEvaluator.evaluate\r\n        return classification_report(\r\n            annot_list, pred_list, mode=\"strict\", scheme=schema, output_dict=True\r\n        )<\/pre>\n<p>&nbsp;<\/p>\n<h3>Experiment flow<\/h3>\n<p>Now that 
we have the different building blocks defined, let\u2019s look at an example end-to-end flow (summarized in the sequence diagram below):<\/p>\n<p><figure id=\"attachment_13316\" aria-labelledby=\"figcaption_attachment_13316\" class=\"wp-caption aligncenter\" ><a href=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/evaluation-flow.png\"><img decoding=\"async\" class=\"size-full wp-image-13316\" src=\"https:\/\/devblogs.microsoft.com\/cse\/wp-content\/uploads\/sites\/55\/2020\/12\/evaluation-flow.png\" alt=\"Image evaluation flow\" width=\"975\" height=\"362\" srcset=\"https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/evaluation-flow.png 975w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/evaluation-flow-300x111.png 300w, https:\/\/devblogs.microsoft.com\/ise\/wp-content\/uploads\/sites\/55\/2020\/12\/evaluation-flow-768x285.png 768w\" sizes=\"(max-width: 975px) 100vw, 975px\" \/><\/a><figcaption id=\"figcaption_attachment_13316\" class=\"wp-caption-text\">Figure 3 &#8211; Evaluation flow<\/figcaption><\/figure><\/p>\n<p>&nbsp;<\/p>\n<p>Here is an example flow in code, using a formatter for loading the CoNLL-2003 dataset and the SpacyNERModel:<\/p>\n<p>1. Read the dataset into a list of <span style=\"font-family: 'andale mono', monospace;\">Document<\/span> objects:<\/p>\n<pre class=\"prettyprint\">formatter = CONLL2003Formatter()\r\nformatter.download()\r\ndoc_iter = formatter.to_documents(fold=\"eng.testb\")\r\n# run flow only on a subset of the examples\r\ndocuments = list(doc_iter)[:10]<\/pre>\n<p>2. Optionally save the dataset to JSONL format<\/p>\n<pre class=\"prettyprint\">Document.save_dataset(\r\n    documents=documents, output_format=\"jsonl\", output_file=\"testb.jsonl\"\r\n)\r\n<\/pre>\n<p>3. Load model for prediction<\/p>\n<pre class=\"prettyprint\">import spacy\r\n\r\n# assumes a trained spaCy pipeline is available under the name or path \"mymodel\"\r\nmodel = SpacyNERModel(nlp=spacy.load(\"mymodel\"))\r\npredicted_docs = model.predict(documents)\r\n<\/pre>\n<p>4. 
Evaluate results<\/p>\n<pre class=\"prettyprint\">evaluator = NERSimpleEvaluator()\r\nresults = evaluator.evaluate(annotations=documents, predictions=predicted_docs)\r\nprint(results)\r\n<\/pre>\n<p>Another example flow is when we already have predictions stored in a file, and we use the evaluator to compare stored annotations with stored predictions.<\/p>\n<p>&nbsp;<\/p>\n<h2>Advantages of using the evaluation framework<\/h2>\n<h3>Reproducibility<\/h3>\n<p>When leveraging a structured evaluation pipeline, it is straightforward to achieve full reproducibility of experiment runs. For this purpose we used <a href=\"https:\/\/mlflow.org\/\">MLflow<\/a> to track dataset identifiers, hyperparameters and output metrics for every experiment run, and collected a comparable set of results for experiments with different model types.<\/p>\n<h3>Operationalization<\/h3>\n<p>The proposed evaluation framework can be installed as a Python package. As such, it can be used in production environments to evaluate models or installed by individual team members for their own experimentation. Since the models in the package all inherit from the BaseModel class, they are easily interchangeable, so one could replace the model in the production pipeline with another, without having to create adapters for the new model.<\/p>\n<h3>Testing<\/h3>\n<p>By creating evaluators, models and datasets as defined objects, unit testing and other forms of testing become easier. For example, we could create a unit test for the to_documents function of a DatasetFormatter, verifying it reads the data correctly. 
We could write simple tests for <span style=\"font-family: 'andale mono', monospace;\">fit<\/span> and <span style=\"font-family: 'andale mono', monospace;\">predict<\/span> on real or mock models, and we can thoroughly test our evaluation strategy, to verify we\u2019re not getting wrong results due to bugs in evaluation.<\/p>\n<p>For example, here\u2019s one unit test for validating that metrics (<span style=\"font-family: 'andale mono', monospace;\">precision<\/span>, <span style=\"font-family: 'andale mono', monospace;\">recall<\/span> and <span style=\"font-family: 'andale mono', monospace;\">f_1<\/span> loaded from the <span style=\"font-family: 'andale mono', monospace;\">metrics<\/span> object) are calculated correctly:<\/p>\n<pre class=\"prettyprint\">import pytest\r\n\r\n\r\n@pytest.mark.parametrize(\r\n    \"tp, fp, fn, tn, e_prec, e_rec, e_f1\",\r\n    [  # simple example\r\n        (4, 1, 4, 3, 0.8, 0.5, 8 \/ 13),\r\n        # zero edge case\r\n        (0, 0, 0, 0, 0, 0, 0),\r\n        # some zero counts\r\n        (0, 0, 4, 23, 0, 0, 0),\r\n    ],\r\n)\r\ndef test_calculate_metrics_returns_correct_values(tp, fp, fn, tn, e_prec, e_rec, e_f1):\r\n\r\n    metrics = ModelMetrics.calculate_metrics(tp, fp, fn, tn)\r\n\r\n    # compare floats with a tolerance rather than exact equality\r\n    assert metrics.precision == pytest.approx(e_prec)\r\n    assert metrics.recall == pytest.approx(e_rec)\r\n    assert metrics.f_1 == pytest.approx(e_f1)\r\n<\/pre>\n<h2>Summary<\/h2>\n<p>Experimenting with ML models is always a challenge. Data scientists often use models from different frameworks, write custom logic and have different assumptions when building the experimentation pipelines. Often, the data scientist owns the evaluation flow from data collection through modeling to results, and different team members working on the same problem might evaluate their models in different ways. Lastly, potential bugs in the flow or specifically in the evaluation code could cause the experiment results to be wrong, which might lead to wrong conclusions or time wasted on experimentation. 
Mainly for those reasons, we propose a more structured way of experimenting with and evaluating models.<\/p>\n<p>Creating an evaluation framework for your ML project is a great step towards rigorous experimentation and operationalization. The proposed framework introduces structure into the experiment flow, and creates consistency when using different datasets, models, or evaluators. The main limitation, however, is the overhead of translating data and model APIs from their original formats to the framework\u2019s format.<\/p>\n<p>&nbsp;<\/p>\n<p>Icons made by <a title=\"Becris\" href=\"https:\/\/creativemarket.com\/Becris\">Becris<\/a>, <a title=\"Icongeek26\" href=\"https:\/\/www.flaticon.com\/free-icon\/note_3566260?term=notes&amp;page=2&amp;position=50\">Icongeek26<\/a>, <a title=\"Linector\" href=\"https:\/\/www.flaticon.com\/authors\/linector\">Linector<\/a> and <a title=\"Freepik\" href=\"http:\/\/www.freepik.com\/\">Freepik<\/a> from <a title=\"Flaticon\" href=\"https:\/\/www.flaticon.com\/\"> www.flaticon.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post we cover the process, requirements, and the design of an evaluation framework for NLP and Information Extraction. We cover the reasoning behind such a framework, and discuss its implementation with examples from a Named Entity Recognition evaluation point of view. 
<\/p>\n","protected":false},"author":21453,"featured_media":13324,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,19],"tags":[3297,3296,267,268,274],"class_list":["post-13308","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cse","category-machine-learning","tag-evaluation","tag-ml","tag-named-entity-recognition","tag-natural-language-processing","tag-nlp"],"acf":[],"blog_post_summary":"<p>In this blog post we cover the process, requirements, and the design of an evaluation framework for NLP and Information Extraction. We cover the reasoning behind such a framework, and discuss its implementation with examples from a Named Entity Recognition evaluation point of view. <\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13308","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/users\/21453"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/comments?post=13308"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/posts\/13308\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media\/13324"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/media?parent=13308"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/categories?post=13308"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/ise\/wp-json\/wp\/v2\/tags?post=13308"}],"curies":[{"name":"
wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}