We’re excited to announce the latest beta release of the Azure Text Analytics client libraries for Python, .NET, Java, and JavaScript. The Text Analytics libraries provide access to the Azure Cognitive Service for Language, which offers Natural Language Processing (NLP) features for understanding and analyzing text. This blog post reviews several newly supported features and explores how to use them in Python.
Highlighted features
The beta Text Analytics client libraries target the REST API version 2022-10-01-preview and include many new features and quality improvements. For a list of everything that’s new, see What’s new in Azure Cognitive Service for Language?. This blog post highlights the following features:
- Abstractive summarization
- Named entity recognition (NER) resolutions
- Healthcare FHIR bundles with document type
- Automatic language and script detection
To try out these new features in Python, run the following command to install the client library:
pip install azure-ai-textanalytics --pre
Note: This blog post was written for version 5.3.0b1 of the Python client library. You can find the same capabilities in the client libraries for .NET (version 5.3.0-beta.1), Java (version 5.3.0-beta.1), and JavaScript (version 1.1.0-beta.1).
Abstractive summarization
Abstractive summarization is an NLP technique for automatic text summarization where the output summary is composed of novel sentences generated by either rephrasing or using new words. It can be helpful in reducing the time and effort needed to find concise and relevant information in a piece of text. This technique is different from extractive summarization—a feature introduced in a previous version (5.2.0b4) of the client libraries—where the summary is generated by extracting the most important sentences verbatim from the source.
Using the Python client library, we first create the client and then access the feature through the begin_analyze_actions API with the AbstractSummaryAction.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import (
    TextAnalyticsClient,
    AbstractSummaryAction,
)

client = TextAnalyticsClient(
    endpoint="<endpoint>",
    credential=AzureKeyCredential("<api-key>"),
)

document = [
    "At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, "
    "human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI Cognitive "
    "Services, I have been working with a team of amazing scientists and engineers to turn this quest into a "
    "reality. In my role, I enjoy a unique perspective in viewing the relationship among three attributes of "
    "human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z). At the "
    "intersection of all three, there's magic-what we call XYZ-code as illustrated in Figure 1-a joint "
    "representation to create more powerful AI that can speak, hear, see, and understand humans better. "
    "We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, "
    "spanning modalities and languages. The goal is to have pretrained models that can jointly learn "
    "representations to support a broad range of downstream AI tasks, much in the way humans do today. "
    "Over the past five years, we have achieved human performance on benchmarks in conversational speech "
    "recognition, machine translation, conversational question answering, machine reading comprehension, "
    "and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious "
    "aspiration to produce a leap in AI capabilities, achieving multisensory and multilingual learning that "
    "is closer in line with how humans learn and understand. I believe the joint XYZ-code is a foundational "
    "component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks."
]

poller = client.begin_analyze_actions(
    document,
    actions=[
        AbstractSummaryAction(),
    ],
)

# Block until the long-running operation completes.
document_results = poller.result()
for abstract_summary_results in document_results:
    # One result per action; only a single action was passed.
    result = abstract_summary_results[0]
    for summary in result.summaries:
        print(f"{summary.text}\n")
Abstractive summarization is a long-running operation, meaning we send the initial request and then poll the service until the result is ready. In the Python library, long-running operations are prefixed with begin_ and return a poller object on which you can call result() to get the final result. Once the operation has completed, we can check the generated summary in the output:
Output:
Microsoft is taking a more holistic, human-centric approach to learning and understanding.
The goal is to create more powerful AI that can speak, hear, see, and understand humans better.
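To make the polling pattern described above concrete, here is a minimal, self-contained sketch of what a poller does conceptually. ToyPoller and make_fake_service are hypothetical names invented for this illustration; they are not part of azure-ai-textanalytics, and the real poller's internals differ.

```python
import time
from typing import Any, Callable, Tuple


class ToyPoller:
    """Hypothetical stand-in for an SDK poller (illustration only)."""

    def __init__(self, get_status: Callable[[], Tuple[str, Any]], interval: float = 0.01):
        self._get_status = get_status  # callable returning (status, payload)
        self._interval = interval      # seconds to wait between polls
        self.polls = 0                 # number of status checks made so far

    def result(self) -> Any:
        # Keep asking for the job status until it reaches a terminal state.
        while True:
            self.polls += 1
            status, payload = self._get_status()
            if status == "succeeded":
                return payload
            if status == "failed":
                raise RuntimeError("operation failed")
            time.sleep(self._interval)


def make_fake_service(ready_after: int) -> Callable[[], Tuple[str, Any]]:
    """Simulate a service that reports 'running' until the given call number."""
    calls = {"n": 0}

    def fake_status() -> Tuple[str, Any]:
        calls["n"] += 1
        if calls["n"] < ready_after:
            return "running", None
        return "succeeded", ["summary text"]

    return fake_status


poller = ToyPoller(make_fake_service(ready_after=3))
print(poller.result())  # blocks until the simulated job completes
```

The caller only sees a blocking result() call; the repeated status checks stay hidden inside the poller, which is the same shape as the begin_ methods in the client library.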
If desired, the maximum length of the generated summary can be specified by using the sentence_count keyword argument.
poller = client.begin_analyze_actions(
    document,
    actions=[
        AbstractSummaryAction(
            sentence_count=5
        ),
    ],
)
Note: Abstractive summarization is a gated preview feature. You can apply to use this feature at Apply for access to Language Service previews.
Named entity recognition (NER) resolutions
NER, a long-time feature of the Cognitive Service for Language, is used to recognize and categorize entities in written text. The latest beta release builds on NER by adding a preview feature for resolving entities to standard formats. Resolutions provide predictable formats for common quantifiable types and can normalize values to a single, well-known format.
To try out NER resolutions, you can use the recognize_entities method or the RecognizeEntitiesAction as input to the begin_analyze_actions API. You must pass the preview model version 2022-10-01-preview to receive resolutions in the response.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient, ResolutionKind

client = TextAnalyticsClient(
    endpoint="<endpoint>",
    credential=AzureKeyCredential("<api-key>"),
)

docs = ["A total of twenty-six USD is due for service between December 24, 2022 and January 25, 2023."]

# The preview model version is required to receive resolutions.
result = client.recognize_entities(docs, model_version="2022-10-01-preview")[0]

for entity in result.entities:
    print(f"The entity is '{entity.text}' and categorized as '{entity.category}' with subcategory '{entity.subcategory}'.")
    for res in entity.resolutions:
        if res.resolution_kind == ResolutionKind.CURRENCY_RESOLUTION:
            print(f"...The resolution kind is '{res.resolution_kind}' "
                  f"with a value of '{res.value}', unit of '{res.unit}', and ISO4217 code of '{res.iso4217}'.")
        if res.resolution_kind == ResolutionKind.TEMPORAL_SPAN_RESOLUTION:
            print(f"...The resolution kind is '{res.resolution_kind}' "
                  f"that begins on '{res.begin}' and ends on '{res.end}' with duration '{res.duration}'.")
Output:
The entity is 'twenty-six USD' and categorized as 'Quantity' with subcategory 'Currency'.
...The resolution kind is 'CurrencyResolution' with a value of '26.0', unit of 'United States dollar', and ISO4217 code of 'USD'.
The entity is 'between December 24, 2022 and January 25, 2023' and categorized as 'DateTime' with subcategory 'DateRange'.
...The resolution kind is 'TemporalSpanResolution' that begins on '2022-12-24' and ends on '2023-01-25' with duration 'P32D'.
With NER resolutions, we were able to recognize “twenty-six” as the number 26 and further resolve the quantity as United States dollar currency. We also resolved the dates into a standard YYYY-MM-DD format and recognized that they were specified as a range, with a duration of P32D, or 32 days, between them.
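Because the resolved values use standard formats, they can be checked with the Python standard library alone. The following is a minimal sketch that confirms the duration matches the resolved date range; parse_day_duration is a hypothetical helper written for this example, and it only handles the day-only form of an ISO 8601 duration seen in the output above.

```python
import re
from datetime import date, timedelta


def parse_day_duration(iso_duration: str) -> timedelta:
    """Parse a day-only ISO 8601 duration such as 'P32D'.

    Only the 'P<n>D' form is handled; anything else raises ValueError.
    """
    match = re.fullmatch(r"P(\d+)D", iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    return timedelta(days=int(match.group(1)))


# Values copied from the sample output above.
begin = date.fromisoformat("2022-12-24")
end = date.fromisoformat("2023-01-25")
duration = parse_day_duration("P32D")

# The span between the two dates and the reported duration agree.
print(end - begin == duration)  # True
```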
For a full description of possible resolution kinds, see the service documentation at Resolve entities to standard formats.
Healthcare FHIR bundles with document type
FHIR, or Fast Healthcare Interoperability Resources, is a standard that defines how healthcare information can be exchanged by different computer systems. It’s intended to facilitate interoperability between healthcare information systems so that providers have seamless access to patient healthcare data, helping them give the best patient care possible.
FHIR describes a set of modular data components called “resources”. Resources are any content that is exchangeable. A “bundle” is a container for a collection of resources. In Text Analytics for Health, the latest beta release provides the option to receive a FHIR bundle for a particular text document. The FHIR feature can be accessed through the begin_analyze_healthcare_entities method, or if you’re performing multiple actions on text, the RecognizeHealthcareEntitiesAction as input to begin_analyze_actions.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient, HealthcareDocumentType

client = TextAnalyticsClient(
    endpoint="<endpoint>",
    credential=AzureKeyCredential("<api-key>"),
)

poller = client.begin_analyze_healthcare_entities(
    documents=[
        "Prescribed 100mg ibuprofen, taken twice daily."
    ],
    fhir_version="4.0.1",
    document_type=HealthcareDocumentType.DISCHARGE_SUMMARY,
)

result = list(poller.result())
print(result[0].fhir_bundle)
As shown in the above example, you can also pass the document_type, which specifies the type of document the source text originates from and can influence the FHIR bundle returned. Possible document types include “ClinicalTrial”, “DischargeSummary”, “ProgressNote”, “Imaging”, and more, all of which are described in the reference documentation.
FHIR bundle output:
{
"resourceType": "Bundle",
"id": "6ff89ed4-6b74-4dd2-ac77-e52881330339",
"meta": {
"profile": [
"http://hl7.org/fhir/4.0.1/StructureDefinition/Bundle"
]
},
"identifier": {
"system": "urn:ietf:rfc:3986",
"value": "urn:uuid:6ff89ed4-6b74-4dd2-ac77-e52881330339"
},
"type": "document",
"entry": [
{
"fullUrl": "Composition/0892cc03-9774-44dd-b6d0-910b2e0e3916",
"resource": {
"resourceType": "Composition",
"id": "0892cc03-9774-44dd-b6d0-910b2e0e3916",
"status": "final",
"type": {
"coding": [
{
"system": "http://loinc.org",
"code": "11526-1",
"display": "Pathology study"
}
],
"text": "Pathology study"
},
"subject": {
"reference": "Patient/89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"type": "Patient"
},
"encounter": {
"reference": "Encounter/c5c8c292-c921-4915-baeb-6242eab9db3d",
"type": "Encounter",
"display": "unknown"
},
"date": "2023-01-05",
"author": [
{
"reference": "Practitioner/9b23f2a6-ac1c-4f13-8eb5-fc3ae28984af",
"type": "Practitioner",
"display": "Unknown"
}
],
"title": "Pathology study",
"section": [
{
"title": "General",
"code": {
"coding": [
{
"system": "",
"display": "Unrecognized Section"
}
],
"text": "General"
},
"text": {
"status": "additional",
"div": "<div>\r\n\t\t\t\t\t\t\t<h1>General</h1>\r\n\t\t\t\t\t\t\t<p>Prescribed 100mg ibuprofen, taken twice daily.</p>\r\n\t\t\t\t\t</div>"
},
"entry": [
{
"reference": "List/59cc01c9-290b-425c-90fa-70419d09b9e2",
"type": "List",
"display": "General"
}
]
}
]
}
},
{
"fullUrl": "Practitioner/9b23f2a6-ac1c-4f13-8eb5-fc3ae28984af",
"resource": {
"resourceType": "Practitioner",
"id": "9b23f2a6-ac1c-4f13-8eb5-fc3ae28984af",
"name": [
{
"text": "Unknown",
"family": "Unknown"
}
]
}
},
{
"fullUrl": "Patient/89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"resource": {
"resourceType": "Patient",
"id": "89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"gender": "unknown"
}
},
{
"fullUrl": "Encounter/c5c8c292-c921-4915-baeb-6242eab9db3d",
"resource": {
"resourceType": "Encounter",
"id": "c5c8c292-c921-4915-baeb-6242eab9db3d",
"meta": {
"profile": [
"http://hl7.org/fhir/us/core/StructureDefinition/us-core-encounter"
]
},
"status": "finished",
"class": {
"system": "http://terminology.hl7.org/CodeSystem/v3-ActCode",
"display": "unknown"
},
"subject": {
"reference": "Patient/89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"type": "Patient"
},
"period": {
"start": "2023-01-05",
"end": "2023-01-05"
}
}
},
{
"fullUrl": "MedicationStatement/881e92fd-9dcf-4849-ad9c-b4a6b934c6c2",
"resource": {
"resourceType": "MedicationStatement",
"id": "881e92fd-9dcf-4849-ad9c-b4a6b934c6c2",
"extension": [
{
"extension": [
{
"url": "offset",
"valueInteger": 17
},
{
"url": "length",
"valueInteger": 9
}
],
"url": "http://hl7.org/fhir/StructureDefinition/derivation-reference"
}
],
"status": "active",
"medicationCodeableConcept": {
"coding": [
{
"system": "http://www.nlm.nih.gov/research/umls",
"code": "C0020740",
"display": "ibuprofen"
},
{
"system": "http://www.nlm.nih.gov/research/umls/aod",
"code": "0000019879"
},
{
"system": "http://www.whocc.no/atc",
"code": "M01AE01"
},
{
"system": "http://www.nlm.nih.gov/research/umls/ccpss",
"code": "0046165"
},
{
"system": "http://www.nlm.nih.gov/research/umls/chv",
"code": "0000006519"
},
{
"system": "http://www.nlm.nih.gov/research/umls/csp",
"code": "2270-2077"
},
{
"system": "http://www.nlm.nih.gov/research/umls/drugbank",
"code": "DB01050"
},
{
"system": "http://www.nlm.nih.gov/research/umls/gs",
"code": "1611"
},
{
"system": "http://www.nlm.nih.gov/research/umls/lch_nw",
"code": "sh97005926"
},
{
"system": "http://loinc.org",
"code": "LP16165-0"
},
{
"system": "http://www.nlm.nih.gov/research/umls/medcin",
"code": "40458"
},
{
"system": "http://www.nlm.nih.gov/research/umls/mmsl",
"code": "d00015"
},
{
"system": "http://www.nlm.nih.gov/research/umls/msh",
"code": "D007052"
},
{
"system": "http://www.nlm.nih.gov/research/umls/mthspl",
"code": "WK2XYI10QM"
},
{
"system": "http://ncimeta.nci.nih.gov",
"code": "C561"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nci_ctrp",
"code": "C561"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nci_dcp",
"code": "00803"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nci_dtp",
"code": "NSC0256857"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nci_fda",
"code": "WK2XYI10QM"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nci_nci-gloss",
"code": "CDR0000613511"
},
{
"system": "http://www.nlm.nih.gov/research/umls/nddf",
"code": "002377"
},
{
"system": "http://www.nlm.nih.gov/research/umls/pdq",
"code": "CDR0000040475"
},
{
"system": "http://www.nlm.nih.gov/research/umls/rcd",
"code": "x02MO"
},
{
"system": "http://www.nlm.nih.gov/research/umls/rxnorm",
"code": "5640"
},
{
"system": "http://snomed.info/sct",
"code": "E-7772"
},
{
"system": "http://snomed.info/sct",
"code": "C-603C0"
},
{
"system": "http://snomed.info/sct",
"code": "387207008"
},
{
"system": "http://www.nlm.nih.gov/research/umls/usp",
"code": "m39860"
},
{
"system": "http://www.nlm.nih.gov/research/umls/uspmg",
"code": "MTHU000060"
},
{
"system": "http://hl7.org/fhir/ndfrt",
"code": "4017840"
}
],
"text": "ibuprofen"
},
"subject": {
"reference": "Patient/89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"type": "Patient"
},
"context": {
"reference": "Encounter/c5c8c292-c921-4915-baeb-6242eab9db3d",
"type": "Encounter",
"display": "unknown"
},
"dosage": [
{
"text": "100mg",
"timing": {
"repeat": {
"frequency": 2,
"period": 1,
"periodUnit": "d"
},
"code": {
"text": "twice daily"
}
},
"doseAndRate": [
{
"doseQuantity": {
"value": 100
}
}
]
}
]
}
},
{
"fullUrl": "List/59cc01c9-290b-425c-90fa-70419d09b9e2",
"resource": {
"resourceType": "List",
"id": "59cc01c9-290b-425c-90fa-70419d09b9e2",
"status": "current",
"mode": "snapshot",
"title": "General",
"subject": {
"reference": "Patient/89e2fa86-a2f7-4c2c-ae63-a494dc42263b",
"type": "Patient"
},
"encounter": {
"reference": "Encounter/c5c8c292-c921-4915-baeb-6242eab9db3d",
"type": "Encounter",
"display": "unknown"
},
"entry": [
{
"item": {
"reference": "MedicationStatement/881e92fd-9dcf-4849-ad9c-b4a6b934c6c2",
"type": "MedicationStatement",
"display": "ibuprofen"
}
}
]
}
}
]
}
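A bundle like the one above is easiest to consume by filtering its entries on resourceType. The following is a minimal sketch, assuming the bundle is available as a plain Python dictionary shaped like the JSON output; the bundle literal is a trimmed-down copy of that output, and resources_of_type is a hypothetical helper, not part of the client library.

```python
from typing import Any, Dict, Iterator

# Trimmed-down copy of the bundle shown above (most fields omitted).
bundle: Dict[str, Any] = {
    "resourceType": "Bundle",
    "type": "document",
    "entry": [
        {"resource": {"resourceType": "Patient", "gender": "unknown"}},
        {
            "resource": {
                "resourceType": "MedicationStatement",
                "status": "active",
                "medicationCodeableConcept": {"text": "ibuprofen"},
                "dosage": [
                    {
                        "text": "100mg",
                        "timing": {
                            "repeat": {"frequency": 2, "period": 1, "periodUnit": "d"},
                            "code": {"text": "twice daily"},
                        },
                    }
                ],
            }
        },
    ],
}


def resources_of_type(bundle: Dict[str, Any], resource_type: str) -> Iterator[Dict[str, Any]]:
    """Yield every resource in the bundle whose resourceType matches."""
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") == resource_type:
            yield resource


for med in resources_of_type(bundle, "MedicationStatement"):
    name = med["medicationCodeableConcept"]["text"]
    for dose in med.get("dosage", []):
        repeat = dose["timing"]["repeat"]
        # Prints: ibuprofen: 100mg, 2x per 1d
        print(f"{name}: {dose['text']}, {repeat['frequency']}x per {repeat['period']}{repeat['periodUnit']}")
```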
Automatic language and script detection
Language detection is now available as part of other Language features (in the Python client library, this applies to long-running operations, that is, any method prefixed with begin_). You can opt in to automatic language detection by passing the "auto" language hint at the method or document level.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient, RecognizeEntitiesAction

client = TextAnalyticsClient(
    endpoint="<endpoint>",
    credential=AzureKeyCredential("<api-key>"),
)

docs = [
    "Microsoft was founded by Bill Gates and Paul Allen",
    "Microsoft fue fundado por Bill Gates y Paul Allen",
    "Microsoft wurde gegründet von Bill Gates und Paul Allen"
]

poller = client.begin_analyze_actions(
    docs,
    actions=[RecognizeEntitiesAction()],
    language="auto",
)

document_results = poller.result()
for doc, results in zip(docs, document_results):
    for result in results:
        print(f"Document '{doc}' was detected as '{result.detected_language.name}'.")
        # Print each recognized entity and its category.
        for entity in result.entities:
            print(f"...Entity '{entity.text}' is a '{entity.category}'.")
Output:
Document 'Microsoft was founded by Bill Gates and Paul Allen' was detected as 'English'.
...Entity 'Microsoft' is a 'Organization'.
...Entity 'Bill Gates' is a 'Person'.
...Entity 'Paul Allen' is a 'Person'.
Document 'Microsoft fue fundado por Bill Gates y Paul Allen' was detected as 'Spanish'.
...Entity 'Microsoft' is a 'Organization'.
...Entity 'Bill Gates' is a 'Person'.
...Entity 'Paul Allen' is a 'Person'.
Document 'Microsoft wurde gegründet von Bill Gates und Paul Allen' was detected as 'German'.
...Entity 'Microsoft' is a 'Organization'.
...Entity 'Bill Gates' is a 'Person'.
...Entity 'Paul Allen' is a 'Person'.
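The example above sets the hint at the method level. For the document-level option, here is a minimal sketch, assuming documents are passed as dictionaries with id, text, and language keys (a form the Python client accepts in place of plain strings). to_documents is a hypothetical helper written for this example.

```python
from typing import Dict, List, Optional


def to_documents(
    texts: List[str],
    language_hints: Optional[Dict[int, str]] = None,
    default: str = "auto",
) -> List[Dict[str, str]]:
    """Build document dictionaries with a per-document language hint.

    language_hints maps a document's position to an explicit hint;
    every other document falls back to the default ("auto").
    """
    hints = language_hints or {}
    return [
        {"id": str(index), "text": text, "language": hints.get(index, default)}
        for index, text in enumerate(texts)
    ]


texts = [
    "Microsoft was founded by Bill Gates and Paul Allen",
    "Microsoft fue fundado por Bill Gates y Paul Allen",
]

# The language of the second document is already known, so only the
# first document is left to automatic detection.
documents = to_documents(texts, language_hints={1: "es"})
for document in documents:
    print(document["id"], document["language"])
```

The resulting list could then be passed as the first argument to begin_analyze_actions instead of the list of strings, without the method-level language keyword.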
If automatic language detection is unable to recognize the language used in the input text, it’s also possible to supply a fallback or default language hint to use for the text document. In code, this hint can be provided with the autodetect_default_language keyword argument:
poller = client.begin_analyze_actions(
    docs,
    actions=[RecognizeEntitiesAction()],
    language="auto",
    autodetect_default_language="en",
)
In the preceding scenario, if the detected language is unknown, “English” is used as the language hint.
In addition to language detection, script detection is now in preview and can detect whether your text uses a script like “Latin”. To use script detection, pass model_version="2022-04-10-preview" into the detect_language client method:
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="<endpoint>",
    credential=AzureKeyCredential("<api-key>"),
)

doc = ["Tumhara naam kya hai?"]
response = client.detect_language(doc, model_version="2022-04-10-preview")
print(f"Detected Language: {response[0].primary_language.name}")
print(f"Detected Script: {response[0].primary_language.script}")
Output:
Detected Language: Hindi
Detected Script: Latin
The output shows the detected language and, through the detected script, that the input was romanized Hindi.
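To illustrate the difference between a language and a script, the following sketch inspects which Unicode scripts the letters of a string belong to. This is a crude local heuristic based on Unicode character names, written only for illustration; it is unrelated to how the service performs script detection, and letter_scripts is a hypothetical helper.

```python
import unicodedata
from typing import Set


def letter_scripts(text: str) -> Set[str]:
    """Return the leading word of the Unicode name of each letter.

    For most letters this leading word is the script name,
    for example 'LATIN' or 'DEVANAGARI'.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


# Romanized Hindi is written with Latin letters only.
print(letter_scripts("Tumhara naam kya hai?"))  # {'LATIN'}
```

The same Hindi words written in Devanagari would yield a different script for the same language, which is why the service reports language and script as separate values.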
Summary
The new beta of the Text Analytics client libraries has been released and supports many exciting features from the Azure Cognitive Service for Language. In this article, we highlighted features like abstractive summarization, NER resolutions, FHIR bundles, and automatic language and script detection. The Azure SDK team is excited for you to try out the client libraries and encourages questions and feedback.
For more information, see the following resources:
- Cognitive Service for Language documentation
- Python: PyPi | Documentation | Samples
- .NET: NuGet | Documentation | Samples
- Java: Maven | Documentation | Samples
- JavaScript: npm | Documentation | Samples
Congratulations on the progress you’re making with this service. When I call language/analyze-text/jobs?api-version=2022-10-01-preview with non-English input requesting FHIR structured output, I get the error message: “Fhir Structuring is only supported for the following languages: en”. Can you tell us when other languages will be supported? Thanks!
Thanks @Mike Francis. FHIR only supports English for the time being. You can check out the “What’s new” page in the Language service documentation for any updates regarding new features or additional language support.