Jina

I don't know if I'm doing something wrong.
Python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="jinaai/jina-embeddings-v2-base-en",
    # jina v2 ships custom modeling code; without trust_remote_code it can
    # load as a plain BERT and produce much worse embeddings
    trust_remote_code=True,
)

nodes = [
    TextNode(text="first question to match", id_="1"),
    TextNode(text="this is a simulation", id_="2"),
]
index = VectorStoreIndex(nodes, embed_model=embed_model, show_progress=True)

vector_retriever = index.as_retriever(similarity_top_k=10)
matches = vector_retriever.retrieve("first question to match")
for node in matches:
    print(node.get_score())
    print(node.get_text())

Any advice on how to improve this? I find BM25 does better with real content, so I'm trying hybrid search, but I'm quite disappointed with the semantic search.
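For reference, a hybrid setup like this can be wired up with BM25Retriever plus QueryFusionRetriever. A minimal sketch, assuming the llama-index-retrievers-bm25 package is installed and reusing the nodes and index from above (with num_queries=1 no LLM query generation is performed, only fusion of the two result lists):

Python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# keyword (lexical) retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

# fuse lexical and semantic results with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=10), bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # skip LLM query generation; just fuse the two result lists
    mode="reciprocal_rerank",
    use_async=False,
)

matches = hybrid_retriever.retrieve("first question to match")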
5 comments
You could try lowering the chunk size a bit (512 is usually another good choice; the default is 1024).
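A minimal sketch of setting the chunk size on the splitter (the 512/50 values are just a common starting point, not tuned for your data):

Python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

# split documents into ~512-token chunks with a little overlap
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents([Document(text="your source text here")])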

It could also depend on your queries. Short queries are less helpful. Sometimes I'll add a step to get the LLM to rewrite the query, as in the sketch below.
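The rewrite step can be a single LLM call before retrieval. A minimal sketch, assuming the llama-index-llms-openai package and an OpenAI key are configured (the model name and prompt wording are illustrative):

Python
from llama_index.core import PromptTemplate
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # any LlamaIndex LLM works here
rewrite_prompt = PromptTemplate(
    "Rewrite the following search query to be more descriptive and specific, "
    "keeping its original intent. Return only the rewritten query.\n"
    "Query: {query}"
)

# expand the short query before handing it to the retriever
rewritten = llm.predict(rewrite_prompt, query="first question to match")
matches = vector_retriever.retrieve(rewritten)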

Once you have a large number of documents, a top-k of 2 (the default) probably isn't going to cut it. I'd increase it, as well as use a reranker to filter back down to a smaller subset.
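For the reranking step, something like SentenceTransformerRerank can trim a broad top-k back down. A minimal sketch, assuming sentence-transformers is installed (the cross-encoder model name is just a common default):

Python
from llama_index.core.schema import QueryBundle
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3
)

# retrieve broadly, then rerank down to the best few
query = "first question to match"
candidates = index.as_retriever(similarity_top_k=10).retrieve(query)
reranked = reranker.postprocess_nodes(candidates, query_bundle=QueryBundle(query))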
My chunks are 1-3 sentences, as are my queries.
I have a database of question stems,
and my query is a different question stem.
The goal is to look for similar questions.