Jina

I don't know if I'm doing something wrong.
Python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="jinaai/jina-embeddings-v2-base-en",
    # jina v2 ships custom modeling code; without trust_remote_code it can
    # load as a plain BERT and produce much worse embeddings
    trust_remote_code=True,
)

nodes = [
    TextNode(text="first question to match", id_="1"),
    TextNode(text="this is a simulation", id_="2"),
]
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=True)

vector_retriever = index.as_retriever(similarity_top_k=10)
matches = vector_retriever.retrieve("first question to match")
for node in matches:
    print(node.get_score())
    print(node.get_text())

Any advice on how to improve this? I find BM25 does better with real content, so I'm trying hybrid search, but I'm quite disappointed with the semantic search.
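For reference, a hybrid setup like this can be wired up with BM25Retriever plus QueryFusionRetriever. A minimal sketch, assuming the llama-index-retrievers-bm25 package is installed and reusing the nodes and index from above (with num_queries=1 no LLM query generation is performed, only fusion of the two result lists):

Python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# keyword (lexical) retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

# fuse lexical and semantic results with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=10), bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # skip LLM query generation; just fuse the two result lists
    mode="reciprocal_rerank",
    use_async=False,
)

matches = hybrid_retriever.retrieve("first question to match")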
5 comments
You could try lowering the chunk size a bit (512 is usually another good choice; the default is 1024).
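A minimal sketch of setting the chunk size on the splitter (the 512/50 values are just a common starting point, not tuned for your data):

Python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# split documents into ~512-token chunks with a little overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents([Document(text="your source text here")])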

It could also depend on your queries. Short queries are less helpful. Sometimes I'll add a step to get the LLM to rewrite the query, as in the sketch below.
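The rewrite step can be a single LLM call before retrieval. A minimal sketch, assuming the llama-index-llms-openai package and an OpenAI key are configured (the model name and prompt wording are illustrative):

Python
from llama_index.core import PromptTemplate
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # any LlamaIndex LLM works here
rewrite_prompt = PromptTemplate(
    "Rewrite the following search query to be more descriptive and specific, "
    "keeping its original intent. Return only the rewritten query.\n"
    "Query: {query}"
)

# expand the short query before handing it to the retriever
rewritten = llm.predict(rewrite_prompt, query="first question to match")
matches = vector_retriever.retrieve(rewritten)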

Once you have a large number of documents, a top-k of 2 (the default) probably isn't going to cut it. I'd increase it, as well as use a reranker to filter back down to a smaller subset.
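For the reranking step, something like SentenceTransformerRerank can trim a broad top-k back down. A minimal sketch, assuming sentence-transformers is installed (the cross-encoder model name is just a common default):

Python
from llama_index.core.schema import QueryBundle
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

# retrieve broadly, then rerank down to the best few
query = "first question to match"
candidates = index.as_retriever(similarity_top_k=10).retrieve(query)
reranked = reranker.postprocess_nodes(candidates, query_bundle=QueryBundle(query))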
My chunks are 1-3 sentences, as are my queries.
I have a database of question stems,
and my query is a different question stem.
The goal is to look for similar questions.