Massive Text Embedding Benchmark (MTEB) helps you find the optimal embedding for your RAG LLM use case

Prasun Mishra
5 min read · Mar 25, 2024


By now, most of you will have deployed or experimented with LLMs, and probably with a RAG architecture. Retrieval-Augmented Generation (RAG) optimizes LLM utility, boosting retrieval accuracy, performance, flexibility, and explainability.

Embeddings and vector databases are vital in RAG solutions. Vector databases handle high-dimensional embeddings, enabling efficient storage and retrieval of unstructured data based on semantic similarity. I covered RAG in my previous post.
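
To make this concrete, here is a minimal sketch of the embedding-plus-similarity step that a vector database performs under the hood. It assumes the sentence-transformers library and an illustrative small model (all-MiniLM-L6-v2); the toy documents are placeholders, and a real RAG pipeline would store the vectors in a vector database rather than a NumPy array.

```python
# Minimal sketch: semantic retrieval with embeddings and cosine similarity.
# Model choice and the toy "documents" are illustrative, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our return window is 30 days from the delivery date.",
    "Reset the router by holding the power button for 10 seconds.",
    "Standard shipping takes 3-5 business days.",
]
query = "How do I reboot my router?"

# Normalizing the vectors makes the dot product equal to cosine similarity.
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(f"Best match ({scores[best]:.3f}): {documents[best]}")
```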

In this post, we will take a closer look at embedding performance and see how we can leverage the Massive Text Embedding Benchmark (MTEB) Leaderboard to find the optimal embedding model for a given use case.

Purpose of MTEB?

As per the authors of the MTEB paper:

  • Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks.
  • It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking.
  • This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation.
  • To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
  • Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date.
  • We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.

MTEB Leaderboard:

We can access the MTEB leaderboard here.

Huggingface MTEB leaderboard

The MTEB Leaderboard has 9 tabs, covering the Overall score plus one tab for each of the 8 task types; a sketch of evaluating a model on one of these tasks follows this list:

  1. Overall
  2. Bitext Mining (Metric: F1)
  3. Classification (Metric: Accuracy)
  4. Clustering (Metric: Validity Measure (v_measure))
  5. Pair Classification (Metric: Average Precision based on Cosine Similarities (cos_sim_ap))
  6. Reranking (Metric: Mean Average Precision (MAP))
  7. Retrieval (Metric: Normalized Discounted Cumulative Gain @ k (ndcg_at_10))
  8. STS (Metric: Spearman correlation based on cosine similarity)
  9. Summarization (Metric: Spearman correlation based on cosine similarity)
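
If you want to go beyond reading the leaderboard and reproduce a score yourself, the open-source mteb package lets you run any of these tasks against a model of your choice. A rough sketch is below; the task and model names are just examples, and the exact API may vary between mteb versions, so check the package documentation.

```python
# Sketch: scoring one embedding model on a single MTEB task.
# Task and model names are examples; the mteb API may differ slightly across versions.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model so the run stays cheap

# SciFact is one of the BEIR retrieval datasets included in MTEB's Retrieval task type.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)  # per-task scores, e.g. ndcg_at_10 for retrieval tasks
```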

Reading the MTEB leaderboard:

  • Each model is ranked based on various task types, with the “Overall” tab representing an average performance across these tasks.
  • For simplicity, let’s consider building a customer support bot for an ecommerce/retail setting. This bot would provide answers to FAQs, troubleshoot product issues, and offer product specifications to end customers. For instance, if a WiFi modem/router buyer searches for “Modem model ABC123, Wi-Fi connection keeps dropping, how to fix,” the recommender should present a list of the best possible answers and references from FAQs, product specification guides, and community forums addressing such issues.
  • This task falls under Retrieval, and the metric used is NDCG@10 (Normalized Discounted Cumulative Gain at k = 10), which evaluates the quality of ranked lists, particularly in information retrieval and recommendation systems. It considers both the relevance of items in the list and their position within the ranking. Here, NDCG@10 assesses how well the list prioritizes solutions related to the Wi-Fi connection dropping issue for modem model ABC123 within the top 10 results (a worked computation of this metric appears in the sketch after this list).
  • When shortlisting candidate models, focus on models with strong performance in retrieval metrics. Since the domain involves information retrieval from documents such as FAQs, product specifications, community groups, and chats on issue fixing, consider models trained on similar datasets.
  • For the Retrieval task type, MTEB draws on BEIR, a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks: Fact-checking (FEVER, Climate-FEVER, SciFact); Question-Answering (NQ, HotpotQA, FiQA-2018); Bio-Medical IR (TREC-COVID, BioASQ, NFCorpus); News Retrieval (TREC-NEWS, Robust04); Argument Retrieval (Touché-2020, ArguAna); Duplicate Question Retrieval (Quora, CQADupStack); Citation Prediction (SCIDOCS); Tweet Retrieval (Signal-1M); Entity Retrieval (DBPedia).
  • So models performing well on HotpotQA and FiQA-2018 may be suitable for our use case.
  • Now, let’s shift our attention to the Retrieval tab. Here, we observe the ranking of embedding models based on their performance on these two datasets, highlighted in red.
  • This suggests that if we select 3 to 4 models to experiment with, GritLM/GritLM-7B emerges as the top-performing embedding model. Additionally, Salesforce/SFR-Embedding-Mistral (2) and intfloat/e5-mistral-7b-instruct (3) should also be considered. Moreover, voyage-lite-02-instruct (5) appears to be a lightweight option.
  • Now, let’s take another close look at the Overall tab:
  • Note that voyage-lite-02-instruct (no. 5) has a lower embedding dimension of 1,024 compared to the 4,096 of the other three embedding models, and a maximum token length of 4,000 compared to 32,768.
  • Higher embedding dimensions and longer input limits mean larger vectors to compute, store, and search, demanding additional memory and processing power at both indexing and query time. The extra capacity offers enhanced expressiveness, but it also increases computational cost.
  • For tasks emphasizing efficiency over capturing every nuance in language, lower-dimensional embeddings with a shorter context window may offer a favorable trade-off (see the back-of-the-envelope storage estimate after this list).
  • The next step involves experimenting with these 3–4 embedding models and selecting the best one; a rough sketch of such an experiment follows this list. Access model details via the Hugging Face repositories linked from the leaderboard.
  • Prioritize models with explainability features mentioned in their repository descriptions, and consider fine-tuning the chosen model on use-case-specific datasets for optimal performance.
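
As a back-of-the-envelope check on the dimensionality point above, the raw storage of a vector index grows linearly with the embedding dimension. The corpus size and float32 assumption below are hypothetical, just to show the order of magnitude.

```python
# Rough index-size estimate: uncompressed float32 vectors, hypothetical 1M-chunk corpus.
num_chunks = 1_000_000
for dims in (1024, 4096):
    size_gb = num_chunks * dims * 4 / 1e9  # 4 bytes per float32 component
    print(f"{dims}-dim embeddings for {num_chunks:,} chunks ≈ {size_gb:.1f} GB")
# ≈ 4.1 GB at 1024 dims vs ≈ 16.4 GB at 4096 dims, before any index overhead.
```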
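
And here is a rough sketch of the experiment suggested above: encode a small in-domain evaluation set with each shortlisted model, rank passages by cosine similarity, and score the ranking with NDCG@10 via scikit-learn. The corpus, query, graded relevance labels, and the (deliberately small) candidate models are placeholders; you would swap in your own FAQ/product data and your shortlisted models.

```python
# Sketch: comparing shortlisted embedding models on a tiny in-domain retrieval set with NDCG@10.
# Corpus, query, relevance grades, and candidate models are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import ndcg_score

corpus = [
    "ABC123 modem: fix for Wi-Fi connection dropping every few minutes.",
    "ABC123 modem firmware update instructions.",
    "Return policy for networking products.",
    "Community thread: ABC123 router keeps disconnecting from Wi-Fi.",
]
query = "Modem model ABC123, Wi-Fi connection keeps dropping, how to fix"
true_relevance = np.array([[3, 1, 0, 2]])  # made-up graded relevance per corpus item

candidates = [
    "sentence-transformers/all-MiniLM-L6-v2",  # placeholders; swap in your shortlist,
    "BAAI/bge-small-en-v1.5",                  # e.g. GritLM-7B, SFR-Embedding-Mistral, ...
]

for name in candidates:
    model = SentenceTransformer(name)
    doc_vecs = model.encode(corpus, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)
    scores = query_vec @ doc_vecs.T  # cosine similarities, shape (1, n_docs)
    print(f"{name}: NDCG@10 = {ndcg_score(true_relevance, scores, k=10):.3f}")
```

I am eager to hear from you and learn about your experience in discovering the most optimal embeddings on the MTEB leaderboard.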

#RAG #Large Language Models (LLMs) #Generative AI (GenAI) #Word Embeddings #Sentence Embeddings #Vector Space Models #Vector Databases #Information Retrieval (IR) #Question Answering (QA) #Text Summarization #Machine Translation (MT) #Sentiment Analysis #Named Entity Recognition (NER) #Text Classification #Dialogue Systems #MTEB Leaderboard



Prasun Mishra

Hands-on ML practitioner. AWS Certified ML Specialist. Kaggle expert. BIPOC DS Mentor. Working on interesting NLP use cases!