Text Analytics for Extractive Summarization

We’re delighted to announce that Text Analytics now supports extractive summarization! In general, there are two approaches to automatic text summarization: extractive and abstractive. The Text Analytics API provides extractive summarization starting in version 3.2-preview.1.

Extractive Summarization Analysis

Text Analytics for Extractive Summarization is a feature in Azure Text Analytics that produces a text summary by extracting sentences that collectively represent the most important or relevant information within the original content. This feature is designed to shorten content that could be considered too long to read. Extractive summarization condenses articles, papers, or documents to key sentences.

Text Analytics for Extractive Summarization supports the following features:

Extracted sentences: These sentences collectively convey the main idea of the document. They are original sentences extracted from the input document’s content. Each extracted sentence has a rank score, an offset (the position where the sentence starts in the input document), and a length.
Rank score: An indicator of how relevant a sentence is to the main idea of the document. The model assigns each sentence a score between 0 and 1 (inclusive) and returns the highest-scored sentences per request. For example, if you request a three-sentence summary, the service returns the three highest-scored sentences, provided that the input document has at least three sentences.
Maximum sentences: The maximum number of sentences to return. The default value is three, which means the three sentences with the highest rank scores are returned as the extractive summarization result.
Sorting algorithm: The extracted sentences can be sorted by offset or by rank score. By default, they are sorted by offset. Sorting is applied after the maximum sentence count, so the service first selects the sentences with the highest rank scores and then sorts them.
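
To make the interaction between maximum sentences and sorting concrete, here is a minimal, self-contained sketch of the behavior described above: keep the highest-scored sentences first, then order the survivors by offset. It does not call the service and is not the service's implementation. The class SummarySelectionSketch, the type ScoredSentence, and the method selectSummary are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SummarySelectionSketch {

    // Hypothetical stand-in for a scored sentence; this is not the SDK's SummarySentence type.
    public static final class ScoredSentence {
        public final String text;
        public final int offset;
        public final double rankScore;

        public ScoredSentence(String text, int offset, double rankScore) {
            this.text = text;
            this.offset = offset;
            this.rankScore = rankScore;
        }
    }

    // Step 1: keep the maxSentences sentences with the highest rank scores.
    // Step 2: sort the kept sentences by offset (the default order described above).
    public static List<ScoredSentence> selectSummary(List<ScoredSentence> sentences, int maxSentences) {
        List<ScoredSentence> byRank = new ArrayList<>(sentences);
        byRank.sort(Comparator.comparingDouble((ScoredSentence s) -> s.rankScore).reversed());

        int count = Math.min(maxSentences, byRank.size());
        List<ScoredSentence> top = new ArrayList<>(byRank.subList(0, count));
        top.sort(Comparator.comparingInt((ScoredSentence s) -> s.offset));
        return top;
    }

    public static void main(String[] args) {
        // Made-up sentences, offsets, and scores, chosen for illustration only.
        List<ScoredSentence> sentences = Arrays.asList(
            new ScoredSentence("First.", 0, 0.40),
            new ScoredSentence("Second.", 20, 0.90),
            new ScoredSentence("Third.", 40, 0.20),
            new ScoredSentence("Fourth.", 60, 1.00));

        // With a maximum of two sentences, "Second." and "Fourth." are kept,
        // and they are printed in document order (offset 20, then offset 60).
        for (ScoredSentence s : selectSummary(sentences, 2)) {
            System.out.printf("offset: %d, rank score: %f, text: %s%n", s.offset, s.rankScore, s.text);
        }
    }
}
```

Note that the highest-scored sentence ("Fourth.") appears last in the result because ordering by offset is applied only after the top sentences have been chosen.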

Next, we will walk through a sample usage of extractive summarization.

An example

You can find complete samples for Java, C#, and Python on our website.

In this section, we will show how to use extractive summarization in Java. To use Text Analytics for Extractive Summarization, start by creating a Text Analytics client, and then use the client to send the input documents to the Text Analytics service, which returns analyzed output that includes the features described above.

Create a Text Analytics client:

TextAnalyticsClient client = new TextAnalyticsClientBuilder()
                                 .credential(new AzureKeyCredential("{key}"))
                                 .endpoint("{endpoint}")
                                 .buildClient();

Prepare a batch of documents as input:

List<String> documents = Arrays.asList(
    "<first document input string you want to analyze>",
    "<second document input string you want to analyze>");

Next, let’s create an extractive summarization action and pass it in a call to beginAnalyzeActions:

SyncPoller<AnalyzeActionsOperationDetail, AnalyzeActionsResultPagedIterable> syncPoller =
  client.beginAnalyzeActions(documents,
    new TextAnalyticsActions().setExtractSummaryActions(new ExtractSummaryAction()),
    "en",
    null);

syncPoller.waitForCompletion();

Because this is a long-running operation, we call getFinalResult() on the poller to retrieve the results once waitForCompletion() returns:

syncPoller.getFinalResult().forEach(actionsResult -> {
  System.out.println("Extractive Summarization action results:");
  for (ExtractSummaryActionResult actionResult : actionsResult.getExtractSummaryResults()) {
    for (ExtractSummaryResult documentResult : actionResult.getDocumentsResults()) {
      System.out.println("\tExtracted summary sentences:");
      for (SummarySentence summarySentence : documentResult.getSentences()) {
        System.out.printf("\t\t Sentence text: %s, length: %d, offset: %d, rank score: %f.%n",
          summarySentence.getText(), summarySentence.getLength(), summarySentence.getOffset(), summarySentence.getRankScore());
      }
    }
  }
});

For example, consider the following article:

“At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI Cognitive Services, I have been working with a team of amazing scientists and engineers to turn this quest into a reality. In my role, I enjoy a unique perspective in viewing the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z). At the intersection of all three, there’s magic—what we call XYZ-code as illustrated in Figure 1—a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pretrained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today. Over the past five years, we have achieved human performance on benchmarks in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious aspiration to produce a leap in AI capabilities, achieving multisensory and multilingual learning that is closer in line with how humans learn and understand. I believe the joint XYZ-code is a foundational component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks.”

The extracted summary sentences are displayed as follows:

Extractive Summarization action results:
  Extracted summary sentences:
    Sentence text: At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, human-centric approach to learning and understanding., length: 160, offset: 0, rank score: 1.000000.
    Sentence text: In my role, I enjoy a unique perspective in viewing the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z)., length: 192, offset: 324, rank score: 0.958233.
    Sentence text: At the intersection of all three, there’s magic—what we call XYZ-code as illustrated in Figure 1—a joint representation to create more powerful AI that can speak, hear, see, and understand humans better., length: 203, offset: 517, rank score: 0.929475.
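
The offset and length reported for each sentence can also be used to locate that sentence in the original input. The following is a minimal sketch, assuming that offset and length are counted in UTF-16 code units (the unit Java's String.substring uses). The class SentenceSpanSketch, the method sliceSentence, and the sample document are hypothetical and are not part of the Text Analytics library.

```java
public class SentenceSpanSketch {

    // Returns the part of the document described by an offset and a length.
    // Assumption: both values count UTF-16 code units, the unit used by String.substring.
    public static String sliceSentence(String document, int offset, int length) {
        return document.substring(offset, offset + length);
    }

    public static void main(String[] args) {
        // Made-up document and span values, chosen for illustration only.
        String document = "Cats sleep a lot. Dogs bark loudly. Birds sing.";

        // The second sentence starts at index 18 and is 17 characters long.
        System.out.println(sliceSentence(document, 18, 17)); // prints "Dogs bark loudly."
    }
}
```

This can be handy for highlighting the summary sentences in place within the source text rather than showing them as a separate list.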

Summary

This article introduced Text Analytics for Extractive Summarization, a new feature in the Azure Text Analytics service that produces a text summary by extracting the sentences that collectively represent the most important or relevant information in the original document. We also showed how to use it in Java by calling beginAnalyzeActions.

For more information about using this feature in each language, see the following resources:

.NET: Document Reference | README | Samples
Java: Document Reference | README | Samples
JavaScript: Document Reference | README | Samples
Python: Document Reference | README | Samples

Azure SDK Releases

Azure SDK Blog Contributions

Thank you for reading this Azure SDK blog post! We hope that you learned something new and welcome you to share this post. We are open to Azure SDK blog contributions. Please contact us at azsdkblog@microsoft.com with your topic and we’ll get you set up as a guest blogger.