
Jina: First Open-Source 8K Text Embedding Model

What’s New?
Jina AI has launched ‘jina-embeddings-v2’, which it describes as the first open-source text embedding model to support an 8K (8,192-token) context length. It rivals OpenAI’s ‘text-embedding-ada-002’ (Ada) model in performance across tasks including classification, reranking, retrieval, and summarization.

Why Does It Matter?
The 8K context length of jina-embeddings-v2 lets a single embedding cover a long document in full, which helps in scenarios where understanding the broader context is essential for accurate conclusions. Furthermore, its open-source nature allows for ongoing community development and innovation in this domain.

Key Takeaways:

  1. Open-Source: Free to use and can be run locally, in contrast to OpenAI’s proprietary Ada model, which promotes community-driven development.
  2. Competitive Performance: Delivers performance on par with the Ada model across various tasks.
  3. Extended Context: The 8K context length enables detailed analysis of long texts, unlocking applications in healthcare, law, and finance.


jina-embeddings-v2-base-en is an English embedding model with a maximum sequence length of 8,192 tokens. It is based on a BERT architecture called JinaBert, which supports the symmetric bidirectional variant of ALiBi to handle longer sequences. The underlying jina-bert-v2-base-en backbone is pretrained on the C4 dataset and then further trained on Jina AI’s curated collection of more than 400 million sentence pairs from various domains.
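To make the ALiBi idea concrete, here is a minimal sketch of a symmetric bidirectional bias. Instead of learned position embeddings, each attention head adds a penalty to its attention scores that grows linearly with the distance between two tokens. The slope schedule below is the one from the original ALiBi recipe and is an assumption here, as are the function names; Jina AI’s actual implementation may differ in its details.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence starting at 2^(-8/num_heads).

    Assumption: this follows the original ALiBi recipe and expects
    num_heads to be a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len, slope):
    """Bias for one head: -slope * |i - j|, identical in both directions."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = symmetric_alibi_bias(4, slopes[0])

# Tokens three positions apart get the same penalty in either direction.
print(bias[0][3], bias[3][0])
```

Because the bias depends only on the distance between two tokens, the same rule applies unchanged to any sequence length.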

Although the model was trained with a sequence length of 512 tokens, ALiBi lets it extrapolate to sequences of 8K tokens or even longer. This makes it suitable for tasks such as long document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search.
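As an illustration of how embeddings are typically used for retrieval and reranking, the sketch below ranks documents by cosine similarity to a query. The three-dimensional vectors, the file names, and the helper functions are hand-written stand-ins, not real model output. In practice each vector would come from encoding a whole document with the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical data).
doc_vecs = {
    "contract.txt": [0.9, 0.1, 0.0],
    "lab_report.txt": [0.1, 0.9, 0.2],
    "earnings_call.txt": [0.2, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With a long-context model, a full contract or report can be embedded as one vector rather than split into many short chunks first, which is where the 8K window matters for this kind of search.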

At 137 million parameters, the model offers fast inference while outperforming Jina AI’s smaller model. Running inference on a single GPU is recommended. Jina AI also offers other embedding models.
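A rough back-of-the-envelope calculation shows why a single GPU is enough. The sketch below estimates the memory needed for the weights alone at two common precisions. The helper name is hypothetical, and the figures leave out activations and framework overhead, which grow with batch size and sequence length.

```python
PARAMS = 137_000_000  # parameter count quoted in the article


def weight_memory_mb(num_params, bytes_per_param):
    """Memory for the weights only, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


fp32_mb = weight_memory_mb(PARAMS, 4)  # 32-bit floats, 4 bytes each
fp16_mb = weight_memory_mb(PARAMS, 2)  # 16-bit floats, 2 bytes each

print(f"fp32 weights: {fp32_mb:.0f} MB")
print(f"fp16 weights: {fp16_mb:.0f} MB")
```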
