Large Language Model Prompt Engineering for Complex Summarization

John Stewart

In this post we’ll demonstrate some prompt engineering techniques for creating summaries of medical research publications. The recent explosion in the popularity of Large Language Models (LLMs) such as ChatGPT has opened the floodgates to an enormous and ever-growing list of possible new applications across numerous fields. On a recent engagement, our team created a demo of how to use the Azure OpenAI service to leverage LLM capabilities in generating summaries of medical documents for non-specialist readers.

Background

Every day, hundreds of new specialist medical papers are published on sites such as PubMed. For patients or caregivers with a keen interest in new research impacting their condition, the complex jargon and language can be difficult to comprehend. Consequently, many journals require submitters to produce a separate short Plain Language Summary for the non-specialist reader. Our customer asked us to prototype using GPT to produce Plain Language Summaries, freeing up time for researchers and editors to focus on publishing new research.

Hypothesis

A model like OpenAI’s text-davinci-003 (Davinci-3), a close relative of the models behind ChatGPT, could produce a passable Plain Language Summary of medical text describing a drug study, which an author or editor could then refine in a short time. We targeted a complete summary, including important details from the source text like patient population, treatment outcomes, and how the research impacts disease treatment. In particular, we wanted the following:

  • Summaries should be approximately 250 words
  • Specialist medical terms should be replaced with common language
  • Complex medical concepts should be explained ‘in-context’ with a short plain language definition
  • The summary should explain the study aim, protocol, subject population, outcome, and impact on patient treatment and future research
  • The summary should be informative enough for the reader to get a full understanding of the source paper

Setup

We are using the LangChain Python library as a harness for our use of Azure OpenAI and GPT-3. Ensure you have a fresh virtual environment set up and install the needed dependencies by running pip install -r requirements.txt from the root of the GitHub project. We will use the pdfminer library to convert a source paper PDF into plain text for ingestion into Azure OpenAI GPT. The prompt-engineering exercise uses a fabricated article generated by ChatGPT. It’s important to note that we are currently limited in the amount of text GPT can process in a single call (see Caveats below).

import os
from dotenv import load_dotenv

load_dotenv()  # set your Azure OpenAI keys in your own .env file
DEPLOYMENT_NAME = os.getenv("OPENAI_DEPLOYMENT_NAME")

from langchain.llms import AzureOpenAI
from langchain.document_loaders import TextLoader
from langchain.chains import LLMChain
from langchain import PromptTemplate
from pdfminer.high_level import extract_text  # converts source PDFs to plain text

# reserve 500 tokens of the model's limit for the completion (see Caveats below)
llm = AzureOpenAI(deployment_name=DEPLOYMENT_NAME, model_name="text-davinci-003", max_tokens=500)

# load the plain text of our fabricated example study
study_txt = TextLoader('./bleegblorgumab.txt').load()[0].page_content
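For reference, this is how pdfminer's extract_text turns a source PDF into plain text (a quick sketch; 'paper.pdf' is a hypothetical filename, since our example study was generated directly as text):

pdf_txt = extract_text('./paper.pdf')  # figures and images are dropped; tables come out as unformatted text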

Initial basic “TL;DR” prompt

We start with a simple instruction to GPT. Since GPT models are trained to follow instructions, the model should ‘know’ what to do…

prompt_template = """
Write a Plain Language Summary of the medical study below:
{study_text}
"""
chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
summary = chain.run(study_txt)
print(summary)
# count the number of words in the summary
print(f"PLS length: {len(summary.split())}")
    This open-label study evaluated the efficacy and safety of Bleegblorgumab, a monoclonal antibody specifically designed to target HER2, in combination with Prednisolone in 50 HER2+ metastatic breast cancer (MBC) patients who had received prior HER2-directed therapy. The results showed that the combination was generally well-tolerated with fatigue, nausea and peripheral neuropathy being the most common adverse events. The overall response rate (ORR) was 70%, with 35 patients achieving partial response and 7 patients achieving complete response. The median progression-free survival (PFS) was 12.5 months, and the median overall survival (OS) was not reached at the time of data analysis. These findings suggest that Bleegblorgumab in combination with Prednisolone may offer a valuable therapeutic option for HER2+ MBC patients who have previously received HER2-directed therapy. However, larger randomized controlled trials are needed to confirm these findings and further evaluate the long-term safety and efficacy.
    PLS length: 143

This is a good start, but the language is still way too technical. Also, we wanted ~250 words, so it needs to be longer. OpenAI suggests being as explicit as possible in your prompt commands. Let’s get more explicit!

prompt_template = """
Summarize the medical text below for a layperson. Simplify all medical jargon to plain language. 
Write 250 words at a 6th grade reading level.
{study_text}
"""

chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
summary = chain.run(study_txt)
print(summary)
    This study looked at the combination of Bleegblorgumab and Prednisolone in 50 people with HER2+ metastatic breast cancer (MBC). HER2+ MBC is a type of breast cancer that is aggressive and has a poor prognosis. Bleegblorgumab is a medicine that targets HER2. The study looked at how well the combination of Bleegblorgumab and Prednisolone worked, how long it worked and how safe it was. The results showed that the combination worked well, with 70% of people having a partial or complete response, and 92% having disease control. The median progression-free survival was 12.5 months and the median overall survival had not been reached at the time of this study. Side effects were mostly mild to moderate and included fatigue, nausea, and peripheral neuropathy. The results of this study show that the combination of Bleegblorgumab and Prednisolone may provide a valuable therapeutic option for people with HER2+ MBC and further studies are needed to confirm the findings and evaluate the long-term safety and efficacy.

OK, we reduced the amount of ‘medical jargon’, but we didn’t get a more complete summary; we got a LESS complete one. 🙁 Let’s be explicit about what we expect in the output…

More is Better?

prompt_template = """
{study_text}

Write a Plain Language Summary of the above medical study for a layperson.
Translate any medical terms to simple english explanations.
Include the following:
- What was the purpose of the study?
- What did the researchers do?
- What did they find?
- What does this mean for me?
Write 250 words at a 6th grade reading level.
"""

chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
summary = chain.run(study_txt)
print(summary)
    This study looked at the use of a new drug combination to treat metastatic breast cancer (MBC) that is HER2-positive. HER2-positive MBC is a type of breast cancer that is aggressive and hard to treat. The drug combination used in this study was Bleegblorgumab and Prednisolone. Bleegblorgumab is a monoclonal antibody specifically designed to target HER2, while Prednisolone is an immunomodulatory medication. The researchers wanted to find out if this drug combination would be effective in treating HER2-positive MBC.

    The researchers enrolled fifty patients with HER2-positive MBC who had received prior HER2-directed therapy. The patients received Bleegblorgumab intravenously at a dose of 10 mg/kg every three weeks, and Prednisolone orally at 10 mg/day. The researchers monitored the patients regularly to assess the response to the treatment. The primary endpoint of the study was the overall response rate (ORR), while the secondary endpoints included progression-free survival (PFS), overall survival (OS), and safety.

    The results showed that the combination of Bleegblorgumab and Prednisolone was effective in treating HER2-positive MBC. The overall response rate was 70%, with 35 patients achieving a partial response and 7 patients achieving a complete response. The median progression-free survival was 12.5 months, and the median overall survival had not been reached at the time of data analysis. The combination was generally well-tolerated, with the most common adverse events being fatigue, nausea, and peripheral neuropathy.

    This study shows that the combination of Bleegblorgumab and Prednisolone may be a promising treatment option for HER2-positive MBC. The combination demonstrated efficacy, with a high overall response rate, and was generally well-tolerated. However, larger randomized controlled trials are needed to confirm these findings and further evaluate the long-term safety and efficacy of the combination.

This is quite a bit better! It includes some context from the patient’s perspective, and it’s got a nice disclaimer at the end.

One thing to note: we’re now placing the source study text above our prompt instructions. This helps with the recency problem in LLMs: tokens near the end of the context window tend to carry more weight in the completion, so putting the prompt last helps the LLM stay on task.

Let’s do a bit more refining. We want the output in the first person (‘we’) and in active voice. We would also like bullet points instead of paragraph style. Let’s see what we can do…

More Prompt Context

prompt_template = """
You are a medical researcher writing a Plain Language Summary of your study for a layperson.
{study_text}
Write a Plain Language Summary of the above medical study for a layperson.
Translate any medical terms to simple english explanations.
Use first-person 'We'.  Use short bullet points.
Answer these questions:
- What was the purpose of the study?
- What did the researchers do?
- What did they find?
- What does this mean for me?
Write 250 words at a 6th grade reading level.
"""

chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
summary = chain.run(study_txt)
print(summary)
    We conducted a study to find out if a new treatment, Bleegblorgumab in combination with Prednisolone, could help people with a specific type of breast cancer called HER2+ MBC. HER2+ MBC is a type of breast cancer that is more aggressive and has a poorer prognosis than other types of breast cancer. 

    We enrolled 50 people with HER2+ MBC who had already tried other treatments for their cancer. We gave Bleegblorgumab intravenously every three weeks, and Prednisolone orally every day. We then monitored the participants for any changes in their cancer and side effects. 

    We found that the combination of Bleegblorgumab and Prednisolone was effective in treating HER2+ MBC. 70% of patients achieved either a complete or partial response to the treatment, and 92% experienced disease control. The median progression-free survival was 12.5 months, and the median overall survival was not reached at the time of data analysis, which suggests a potential survival benefit. The treatment was generally well-tolerated, and the most commonly reported side effects were fatigue, nausea, and peripheral neuropathy. 

    These results suggest that the combination of Bleegblorgumab and Prednisolone may offer a valuable therapeutic option for HER2+ MBC patients who have previously received other HER2-directed therapies. However, larger randomized controlled trials are needed to confirm these findings and further evaluate the long-term safety and efficacy of this combination.

Good but not perfect

We got first person, but no bullet points!

There’s a lot more we could do here, both in engineering the initial prompt and in using a multi-step ‘chain’ of LLM calls to produce exactly the output we want. We’ll save a full treatment for our next blog post!
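As a teaser, here is a minimal sketch of what a two-step chain might look like, using LangChain's SimpleSequentialChain and a hypothetical second prompt that reformats the draft into bullets:

from langchain.chains import SimpleSequentialChain

# step 1: the summarization chain from the previous section
summarize_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))

# step 2: a hypothetical follow-up prompt that rewrites the draft as bullets
bullets_template = """
{draft_summary}

Rewrite the Plain Language Summary above as short bullet points.
Keep the first-person 'We' and the 6th grade reading level.
"""
bullets_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(bullets_template))

# pipe the draft summary from step 1 into step 2
overall_chain = SimpleSequentialChain(chains=[summarize_chain, bullets_chain])
print(overall_chain.run(study_txt))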

Caveats

Length of paper

The Azure OpenAI Davinci-3 model has a limit of 4097 tokens of combined input and output. Given that we want around 250 words of output and a word averages roughly two tokens in dense medical text (a conservative estimate; OpenAI’s rule of thumb for ordinary English is closer to 1.3 tokens per word), we reserve 500 tokens for the completion response. That leaves ~3500 tokens for input, or about 1700 words. Any source text longer than that will need to be manually edited to fit.
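To check whether a given source text fits that budget before calling the model, we could count tokens with OpenAI's tiktoken library (a minimal sketch; tiktoken is an assumption here and is not in the requirements file above):

import tiktoken

MODEL_TOKEN_LIMIT = 4097
COMPLETION_TOKENS = 500  # reserved for the ~250-word summary

# use the tokenizer that matches our model
enc = tiktoken.encoding_for_model("text-davinci-003")
input_budget = MODEL_TOKEN_LIMIT - COMPLETION_TOKENS

study_tokens = len(enc.encode(study_txt))
print(f"study: {study_tokens} tokens, budget: {input_budget}")
if study_tokens > input_budget:
    print("Too long: trim the source text before summarizing.")

Note that the prompt template itself consumes a few dozen more tokens on top of the study text.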

Graphs and Figures

The Davinci-3 model was trained on billions of words of text, but it accepts only plain text as input; it cannot see the graphs, charts, and figures in a source PDF. For this experiment, we assume the input is plain text only, plus whatever unformatted table data was automatically extracted from the source PDF.

Need for Human Review

It bears repeating that the goal of this experiment is to produce a draft summary; any output must be reviewed, edited, and approved by a responsible human party.

Key Takeaways

Prompt engineering for Large Language Models such as OpenAI’s GPT is a rapidly evolving area of research and engineering practice. We have found through trial and error that generating summaries of text with GPT can be improved using these guidelines:

  1. Place your ‘source’ text to be summarized above the prompt to help mitigate the ‘recency problem’
  2. Give detailed, direct prompts that specify the output, such as number of words, reading level, etc.
  3. If you can, guide the model by providing the start of the output you want. In our example, you could provide the first question of a question/answer-style summary as the start of the model’s generation (see the sketch after this list)
  4. If necessary, consider ‘multi-step’ summary generation to ensure more consistent, focused output
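For example, guideline 3 could be as simple as ending the prompt with the first question and the word 'We', which primes the model to continue the answer in a first-person, question/answer format (a hypothetical variant of our earlier prompt, not one we benchmarked):

prompt_template = """
{study_text}

Write a Plain Language Summary of the above medical study for a layperson.

What was the purpose of the study?
We"""

chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(prompt_template))
print(chain.run(study_txt))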

Future Directions

In the future we hope to bring you examples of handling longer text input, using GPT-4, and chat-based interactions for refining the summarization output. We may also be able to exploit GPT-4’s ability to ‘understand’ charts, graphs, and tables. We also hope to explore using GPT to ‘grade itself’ by evaluating the generated output against our ideal summary criteria. In this way we hope to move from qualitative prompt tinkering to quantitative Prompt Engineering.

Acknowledgements

Authored by John Stewart, Microsoft Commercial Software Engineering
