LLMLingua: 20x Prompt Compression for Enhanced Inference Performance

Prasun Mishra
6 min read · Jan 27, 2024


Image source: www.llmlingua.com

Natural language processing (NLP) is trending towards longer prompts for large language models (LLMs), in some cases reaching 40k tokens. This shift is driven by techniques such as:

1) In-context learning, which improves LLM performance by providing additional examples and context

2) The chain-of-thought (CoT) approach, which relies on longer prompts to carry iterative reasoning and maintain conversation context

3) Retrieval-Augmented Generation (RAG), which combines document retrieval with LLM generation and produces longer prompts that carry the retrieved material

4) Copilot-style tools, which help users craft longer, more specific, and more effective prompts than basic queries

Long prompts are computationally costly, produce diminishing returns, and run up against practical limits on how efficiently LLMs can process them.

As the number of documents in the prompt increases, performance drops sharply.

Traditional text compression techniques:

Text compression has long been a focus of NLP research, with toolkits offering techniques such as space removal, stop-word filtering, and stemming. However, while efficient, these approaches can severely curtail context and introduce unwanted ambiguity, posing challenges for subsequent processing.

What is LLMLingua?

LLMLingua, a project developed by Microsoft, reduces prompt length for large language models (LLMs) using a coarse-to-fine methodology. After aligning the small model's distribution with the target LLM, it uses a well-trained small language model, such as GPT-2-small or LLaMA-7B, to detect and remove unimportant tokens from the prompt, enabling efficient compression. LLMs can still extract the key semantic information from the compressed prompts, even when they are hard for humans to read.

LLMLingua compresses the prompt using a three-step process.

What is LongLLMLingua?

An extension of LLMLingua that specializes in handling extremely long prompts, for situations demanding even greater compression at the potential cost of some information detail.

For tasks requiring moderate prompt-size reduction without significant information loss, LLMLingua is a good choice; for use cases with extremely long prompts where aggressive compression is necessary, LongLLMLingua can be the solution, with possible trade-offs in information completeness.

LLMLingua is available as a Python library and can be easily installed with the pip command:

!pip install llmlingua
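
Once installed, a minimal usage sketch looks like this (a sketch based on the library's PromptCompressor interface; the placeholder context, question, and target_token value are illustrative, not from the original article):

from llmlingua import PromptCompressor

# Load the small language model used for compression
# (pass model_name="..." to pick a smaller or different model)
llm_lingua = PromptCompressor()

long_context = "..."  # your long document or retrieved passages go here

# Compress the context down to roughly 200 tokens
result = llm_lingua.compress_prompt(
    long_context,
    instruction="Answer the question using the context.",
    question="What were the main outcomes of the war?",
    target_token=200,
)

print(result["compressed_prompt"])  # send this to your LLM instead of the full prompt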

How does it work?

LLMLingua tackles long LLM prompts via a three-step process:

  1. Coarse-Grained Selection: Ranks documents by relevance and keeps only the top ones, optionally scoring them by "perplexity", i.e., how predictable the question is given each document (lower is better).
  2. Token-Level Compression: Uses a smaller model to analyze segments (sentences/paragraphs) and identify unimportant tokens via token-level perplexity, conditioned on the context and query, producing a concise representation (a simplified sketch follows this list).
  3. Budget & Distribution (Optional): Dynamically adjusts the compression budget based on resource constraints and LLM-specific token distributions (different models prefer different token lengths).
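
To make step 2 concrete, here is a simplified, self-contained sketch of the underlying idea (this is not LLMLingua's actual implementation, only an illustration): a small model such as GPT-2 scores each token's surprisal, and the most predictable tokens are dropped. LLMLingua adds question conditioning, budget control, and iterative refinement on top of this.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The Mexican-American War was fought between 1846 and 1848 over the annexation of Texas."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits

# Surprisal (negative log-likelihood) of each token given its left context
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_nll = -log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

# Keep the first token plus any token whose surprisal is above the median;
# highly predictable tokens carry little information and are dropped
threshold = token_nll.median()
kept = [input_ids[0, 0].item()] + [
    tok.item() for tok, nll in zip(input_ids[0, 1:], token_nll) if nll >= threshold
]
print(tokenizer.decode(kept))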

Benefits:

  • Reduced Prompt Length: Up to 20x compression with little to no accuracy loss.
  • Improved Efficiency: Faster response times and lower costs.
  • Preserved Semantics: Retains key information for effective LLM communication.
  • Adaptability: Works with various LLMs and optimizes performance.

Risks:

  • Information Loss: Compression may omit crucial details, impacting LLM understanding.
  • Misinterpretation: Smaller model misinterpretations can mislead the main LLM.
  • Limited Fit: May not be ideal for tasks requiring complex context or specific details.
  • LLM Specificity: Optimizing for diverse LLMs can be resource-intensive.
  • Bias: Inherited biases can skew prompt compression, leading to inaccuracies.
  • Potential Misuse: Efficiency gains could be used for harmful prompt manipulation.

How to use LLMLingua?

LLMLingua requires users to divide prompts into Questions, Context, and Instructions. If a component is absent, it should be left blank.

  • Instruction: Directives given by the user to LLMs, like task descriptions. Placed before the context and question modules, the instruction module exhibits a high sensitivity to compression.
  • Context: This module offers additional context to address the question, such as documents, demonstrations, web search results, or API call results. Located between the instruction and question modules, its sensitivity to compression is relatively low.
  • Question: Directives from the user to LLMs, including inquiries, questions, or requests. Positioned after the instruction and context modules, the question module is highly sensitive to compression.

As you can see, the maximum compression occurs in the Context. For instance, if the prompt contains three different examples for performing a task step by step, we can likely reduce them to two or even one example.
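
As a rough sketch of how this maps onto the library's compress_prompt call (the parameter values and example strings here are illustrative, not from the article), the demonstrations are passed as context and compressed aggressively, while the instruction and question are passed separately and largely preserved:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# Three step-by-step demonstrations; most of the compression budget is spent here
demonstrations = [
    "Example 1: ... step-by-step solution ...",
    "Example 2: ... step-by-step solution ...",
    "Example 3: ... step-by-step solution ...",
]

result = llm_lingua.compress_prompt(
    demonstrations,                                 # context: compressed aggressively
    instruction="Solve the problem step by step.",  # kept nearly intact
    question="What is 17 * 23?",                    # kept nearly intact
    target_token=150,
)

print(result["compressed_prompt"])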

Below is a snippet from the code covering the compression settings; the important items are:

  1. target_token=300 # Target compressed token length is 300
  2. context_budget="+100" # give the context-level (coarse-grained) selection a budget of about 100 tokens beyond the target
  3. dynamic_context_compression_ratio=0.3 # a higher value removes a larger share of tokens from less relevant documents, prioritizing key information

# Create an LLMLingua postprocessor with compression settings
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,  # Target compressed token length
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # Enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)

Here is the complete code. I have added comments; it is otherwise easy to follow. Once you insert your OpenAI API key in the indicated place and run it, you can compare the GPT-3.5 response without LLMLingua and with LongLLMLingua. You will observe that the original context has 3,033 tokens and the compressed context has 358 tokens, a compression ratio of 8.47x. You can experiment with the configuration shown above to find a good balance for your use case.

Code courtesy: Colab notebook on LLMLingua

#Install required libraries and packages
!pip install cohere
!pip install llama_index
!pip install llmlingua
!pip install accelerate
# Import necessary tools
from llama_index import (
    VectorStoreIndex,  # Allows working with vector stores
    download_loader,  # Downloads loaders for loading data
    load_index_from_storage,  # Loads indices from storage
    StorageContext,  # Manages storage contexts
)
import openai # Grants access to OpenAI's features
# Prepare Wikipedia data which will be used in this example
WikipediaReader = download_loader("WikipediaReader") # Get a Wikipedia reader
loader = WikipediaReader() # Create a Wikipedia reader instance
documents = loader.load_data(pages=['Mexican–American_War']) # Load specific page
# Set up OpenAI API key
openai.api_key = 'YOUR_API_KEY' # Replace with your actual OpenAI API key
# Create a vector store index from the documents
index = VectorStoreIndex.from_documents(documents)
# Create a retriever to retrieve relevant documents
retriever = index.as_retriever(similarity_top_k=3) # Retrieve top 3 most relevant documents
# Define the question to answer
question = "What were the main outcomes of the war"
# Retrieve contexts based on the question
contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts] # Get content of retrieved documents
# ------------------------------------------------------------------------
# RESPONSE WITHOUT LLMLingua
# ------------------------------------------------------------------------
# Import LLM and prompt template tools
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate
# Create an OpenAI LLM instance
llm = OpenAI(model="gpt-3.5-turbo-16k")
# Define a prompt template for question answering
template = (
    "We have provided context information below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)
qa_template = PromptTemplate(template)
# Format the prompt with context and question
prompt = qa_template.format(context_str="\n\n".join(context_list), query_str=question)
# Get response from LLM without LLMLingua compression
response = llm.complete(prompt)
print(str(response))
# ------------------------------------------------------------------------
# RESPONSE WITH LLMLingua COMPRESSION
# ------------------------------------------------------------------------
# Import necessary tools for LLMLingua
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor
from llama_index.schema import QueryBundle  # Wraps the query string for node postprocessors
# Create an LLMLingua postprocessor with compression settings
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,  # Target compressed token length
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # Enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)
# Retrieve nodes again (required for LLMLingua processing)
retrieved_nodes = retriever.retrieve(question)
# Create a response synthesizer
synthesizer = CompactAndRefine()
# Apply LLMLingua compression
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)
# Compare original and compressed contexts
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])
original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)
print("Compressed Contexts:")
print(compressed_contexts)
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compression Ratio:", f"{original_tokens / compressed_tokens:.2f}x")
# Generate the answer from the compressed context
response = synthesizer.synthesize(question, new_retrieved_nodes)
print(str(response))

Paper: https://arxiv.org/abs/2310.05736

Demo: https://huggingface.co/spaces/microsoft/LLMLingua

Website: https://llmlingua.com/

#AI #MachineLearning #LargeLanguageModels #NaturalLanguageProcessing #DeepLearning #Technology #Innovation #PromptEngineering #ModelOptimization #InferenceAcceleration #LLMLingua #MicrosoftResearch #Compression #PerformanceOptimization #DataScience #CloudComputing #Research #Development #NLP #MLEngineer #DataScientist #RAG


Prasun Mishra

Hands-on ML practitioner. AWS Certified ML Specialist. Kaggle expert. BIPOC DS Mentor. Working on interesting NLP use cases!