The main idea behind semantic chunking is to split a given text based on how similar adjacent pieces of it are in meaning. The text is first split into sentences, each sentence is turned into a vector embedding, and the cosine similarity between consecutive embeddings is computed. A similarity threshold is then set, for example 0.8, and whenever the cosine similarity between two consecutive segments falls below that threshold, a split is made at that point.
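A minimal sketch of this procedure, assuming sentence-transformers for the embeddings (the model name, naive sentence splitting, and threshold value are illustrative choices, not part of the original pipeline):

```python
# Minimal sketch of threshold-based semantic chunking.
# Model name, sentence splitting, and threshold are illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text: str, threshold: float = 0.8) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    sentences = [s.strip() for s in text.split(".") if s.strip()]  # naive sentence split
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # cosine similarity of consecutive sentences (embeddings are L2-normalized)
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            # meaning shifts here, so close the current chunk and start a new one
            chunks.append(". ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks
```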
With just a handful of examples, a fine-tuned open-source embedding model can provide greater accuracy at a lower price than proprietary models such as OpenAI's text-embedding-3 suite.
Triplets: Text triplets consisting of (query, positive context, negative context). In our case, the dataset is generated with an LLM (Claude 3.5) by asking it to produce a query together with the positive context from which that query can be answered. Negative contexts are obtained through in-batch negative sampling.
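As a rough illustration of the generation step, here is a sketch using the Anthropic Python SDK; the prompt, model id, and JSON output format are assumptions for illustration, not the exact setup used:

```python
# Sketch of LLM-based (query, positive context) generation.
# Prompt wording and model id are illustrative assumptions.
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def generate_pair(context: str) -> dict:
    prompt = (
        "Given the following passage, write one question that can be answered "
        "only from the passage. Return JSON with keys 'query' and 'positive_context', "
        f"where 'positive_context' is the passage itself.\n\nPassage:\n{context}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative Claude 3.5 model id
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```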
The loss function chosen was MultipleNegativesRankingLoss.
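A minimal fine-tuning sketch with sentence-transformers, assuming the (query, positive context) pairs above are already available (the base model, batch size, and epochs are illustrative). MultipleNegativesRankingLoss treats the other positives in each batch as negatives for a given query, which is the in-batch sampling described above:

```python
# Minimal fine-tuning sketch; base model and hyperparameters are illustrative.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# (query, positive context) pairs generated by the LLM; placeholder data shown here.
pairs = [
    ("What is the deal size?", "The deal was valued at $50M ..."),
    ("Who acquired the company?", "Acme Corp announced the acquisition of ..."),
]
train_examples = [InputExample(texts=[q, pos]) for q, pos in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss uses the other positives in the batch as negatives
# (in-batch negative sampling), so explicit negative contexts are optional.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```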
| Rank | Model | Model Size (Million Parameters) | Memory Usage (GB, fp32) | Average |
|---|---|---|---|---|
| 1 | stella_en_1.5B_v5 | 1543 | 5.75 | 61.01 |
| 8 | stella_en_400M_v5 | 435 | 1.62 | 58.97 |
| 13 | gte-large-en-v1.5 | 434 | 1.62 | 57.91 |
| 25 | text-embedding-3-large (OpenAI) | – | – | 55.44 |
| 33 | bge-large-en-v1.5 | 335 | 1.25 | 54.29 |
The evaluation metrics chosen were Hit-Rate@k and Recall@k. Rank-aware metrics did not make much sense, since the gold-labelled dataset does not include relevance ranks for the retrieved contexts.
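A sketch of how these two metrics can be computed, assuming each query has a set of gold contexts and a ranked list of retrieved contexts (function names and data layout are illustrative):

```python
# Sketch of Hit-Rate@k and Recall@k; data structures are illustrative.

def hit_rate_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    # Fraction of queries where at least one gold context appears in the top-k results.
    hits = sum(1 for ret, g in zip(retrieved, gold) if g & set(ret[:k]))
    return hits / len(gold)

def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    # Average fraction of gold contexts recovered within the top-k results.
    recalls = [len(g & set(ret[:k])) / len(g) for ret, g in zip(retrieved, gold) if g]
    return sum(recalls) / len(recalls)

# Example: one query, two gold contexts, only one of them retrieved in the top 2.
retrieved = [["ctx_a", "ctx_c", "ctx_b"]]
gold = [{"ctx_a", "ctx_b"}]
print(hit_rate_at_k(retrieved, gold, k=2))  # 1.0
print(recall_at_k(retrieved, gold, k=2))    # 0.5
```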
This section contains the results of the experiments we performed to test the deal extraction pipeline.
The results will be updated as more batches of human-labelled data become available.