E5 Text Embeddings
Multilingual E5 Text Embeddings: A Technical Report. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024
Improving Text Embeddings with Large Language Models. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024
Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
LLM-based Models
| Model | BEIR | # of layers | Embedding dimension | Hugging Face |
|---|---|---|---|---|
| E5-mistral-7b-instruct | 56.9 | 32 | 4096 | intfloat/e5-mistral-7b-instruct |
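The snippet below is a minimal sketch (not the official usage script) of encoding one instruction-prefixed query and one passage with e5-mistral-7b-instruct: the model uses last-token pooling followed by L2 normalization, and queries carry a task instruction. The instruction wording and text contents here are illustrative; the model card additionally appends an EOS token to each input, so consult it for the exact preprocessing.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Minimal sketch: embed an instruction-prefixed query and a passage with
# e5-mistral-7b-instruct, using last-token pooling and L2 normalization.
model_id = "intfloat/e5-mistral-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")  # requires accelerate

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # keeps the last non-pad index at attention_mask.sum() - 1

texts = [
    "Instruct: Given a web search query, retrieve relevant passages\nQuery: what is a text embedding",
    "A text embedding maps a piece of text to a dense vector that captures its meaning.",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
batch = {k: v.to(model.device) for k, v in batch.items()}

with torch.no_grad():
    outputs = model(**batch)

# Last-token pooling: take the hidden state of the final non-padding token.
last_idx = batch["attention_mask"].sum(dim=1) - 1
embeddings = outputs.last_hidden_state[torch.arange(len(texts)), last_idx]
embeddings = F.normalize(embeddings.float(), p=2, dim=1)

print((embeddings[0] @ embeddings[1]).item())  # cosine similarity between query and passage
```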
English Pre-trained Models
| Model | BEIR | # of layers | Embedding dimension | Hugging Face |
|---|---|---|---|---|
| E5-small-v2 | 49.0 | 12 | 384 | intfloat/e5-small-v2 |
| E5-base-v2 | 50.3 | 12 | 768 | intfloat/e5-base-v2 |
| E5-large-v2 | 50.6 | 24 | 1024 | intfloat/e5-large-v2 |
| E5-small | 46.0 | 12 | 384 | intfloat/e5-small |
| E5-base | 48.8 | 12 | 768 | intfloat/e5-base |
| E5-large | 50.0 | 24 | 1024 | intfloat/e5-large |
| E5-small-unsupervised | 40.8 | 12 | 384 | intfloat/e5-small-unsupervised |
| E5-base-unsupervised | 42.9 | 12 | 768 | intfloat/e5-base-unsupervised |
| E5-large-unsupervised | 44.2 | 24 | 1024 | intfloat/e5-large-unsupervised |
Models with the -unsupervised suffix are pre-trained only on unlabeled datasets.
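For these models, inputs are prefixed with "query: " or "passage: ", and embeddings are obtained by mean pooling over non-padding tokens followed by L2 normalization. The snippet below is a minimal sketch using e5-small-v2 with the transformers library; the example texts are illustrative only.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Minimal sketch: E5 models expect "query: " / "passage: " prefixes and mean pooling.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2")

texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

print((embeddings[0] @ embeddings[1]).item())  # cosine similarity between query and passage
```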
Multilingual Pre-trained Models
| Model | BEIR | # of layers | Embedding dimension | Hugging Face |
|---|---|---|---|---|
| multilingual-e5-small | 46.6 | 12 | 384 | intfloat/multilingual-e5-small |
| multilingual-e5-base | 48.9 | 12 | 768 | intfloat/multilingual-e5-base |
| multilingual-e5-large | 51.4 | 24 | 1024 | intfloat/multilingual-e5-large |
| multilingual-e5-large-instruct | 52.5 | 24 | 1024 | intfloat/multilingual-e5-large-instruct |
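The multilingual models use the same "query: " / "passage: " prefixes (multilingual-e5-large-instruct instead expects an instruction-prefixed query; see its model card). Below is a minimal sketch using sentence-transformers; the cross-lingual query/passage pair is illustrative only.

```python
from sentence_transformers import SentenceTransformer

# Minimal sketch with sentence-transformers; normalize_embeddings=True yields unit vectors,
# so the dot product below equals cosine similarity.
model = SentenceTransformer("intfloat/multilingual-e5-base")

queries = ["query: how to make pizza dough"]  # English query
passages = ["passage: Mischen Sie Mehl, Wasser, Hefe und Salz und kneten Sie den Teig."]  # German passage

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # cross-lingual similarity scores
```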
Install Python Package Requirements
pip install -r requirements.txt
For e5-mistral-7b-instruct, transformers>=4.34 is required to load the Mistral model.
Evaluate on the BEIR Benchmark
After installing the required Python packages, run the following command on a GPU machine:
bash scripts/eval_mteb_beir.sh intfloat/e5-small-v2
By default, the evaluation script uses all available GPUs.
Caution: corpus encoding can take quite a long time (~10 hours).
For intfloat/e5-mistral-7b-instruct, it can take even longer (several days).
Evaluate on the MTEB Benchmark
Run the following command:
bash scripts/eval_mteb_except_retrieval.sh intfloat/e5-small-v2
For multilingual models, append the --multilingual flag:
bash scripts/eval_mteb_except_retrieval.sh intfloat/multilingual-e5-base --multilingual
Other Resources
The data for our proposed synthetic task, personalized passkey retrieval, is available at https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval.
Troubleshooting
If you encounter an out-of-memory (OOM) error, try reducing the batch size.
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
@article{wang2023improving,
title={Improving Text Embeddings with Large Language Models},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2401.00368},
year={2023}
}
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. It has also adopted the Microsoft Open Source Code of Conduct.