Hey hope everyone is doing good!

I have a question, or rather an issue. I have implemented a pipeline for uploading documents with embeddings to Azure Cognitive Search using the CognitiveSearchVectorStore class and IngestionPipeline. Everything goes well, but the search results using the default query engine (as_query_engine) are absolutely terrible. It can't answer simple questions like who the CEO of the company I'm working for is (while there are a lot of documents mentioning who it is). Meanwhile, when searching with the Azure portal's search explorer, the results are perfect.

I've used the setup from the docs, following the IngestionPipeline example.

Plain Text
....
azure_vector_store = CognitiveSearchVectorStore(
    search_or_index_client=azure_index_client,
    index_name=azure_index_name,
    filterable_metadata_field_keys=metadata_fields,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="content",
    embedding_field_key="embedding",
    metadata_string_field_key="li_jsonMetadata",
    doc_id_field_key="li_doc_id",
    embedding_dimensionality=embedding_dimensionality,
)
....
text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1000,
    chunk_overlap=50,
    tokenizer=tiktoken.encoding_for_model(llm_model).encode,
    include_metadata=True,
)
pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        CleanHTMLTransform(),
        embed_model
    ],
    cache=cache,
)

pipelined_nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
azure_vector_store.add(pipelined_nodes)
17 comments
When you query, try checking response.source_nodes to see which nodes are being pulled in to answer the query
the source nodes are among the least relevant chunks of the docs, with scores
0.028588563203811646
0.018051709979772568
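For anyone following along, a minimal way to eyeball what the query engine retrieved is to loop over response.source_nodes and print each score next to a snippet of the chunk. The sketch below uses stand-in objects so it runs without llama_index installed; a real llama_index Response exposes the same .source_nodes / .score / .node.get_content() shape.

```python
from types import SimpleNamespace

# Stand-ins mimicking llama_index's NodeWithScore; the texts and scores here
# are illustrative, not real retrieval output.
def make_node(score, text):
    return SimpleNamespace(score=score, node=SimpleNamespace(get_content=lambda t=text: t))

response = SimpleNamespace(source_nodes=[
    make_node(0.0286, "...some chunk that never mentions the CEO..."),
    make_node(0.0181, "...another barely related chunk..."),
])

# Print each retrieved node's score and the first 80 characters of its text.
for n in response.source_nodes:
    print(f"{n.score:.4f}  {n.node.get_content()[:80]}")
```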
Plain Text
query_engine = index.as_query_engine(vector_store_query_mode="hybrid")
response = query_engine.query("....")

removed the query string for privacy reasons
The score might be distance?
so smaller is better?
is the default order least to most relevant or something? On the Azure portal the same query returns a score of 10 lol
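On the score magnitudes: per the Azure docs, hybrid mode fuses the keyword and vector rankings with Reciprocal Rank Fusion (RRF), scoring each document as the sum of 1 / (k + rank) over the rankings it appears in, with k = 60. Fused scores therefore sit around 0.01–0.03 no matter how good the match is, while the portal's ~10 is a raw BM25 relevance score, so the two aren't comparable; higher is still better in both. A minimal RRF sketch (the doc ids are made up for illustration):

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids into one RRF-scored ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first -- so with RRF, bigger is still better.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

keyword_ranking = ["doc_ceo", "doc_a", "doc_b"]  # BM25 order (hypothetical)
vector_ranking = ["doc_ceo", "doc_b", "doc_c"]   # vector-similarity order (hypothetical)

fused = rrf([keyword_ranking, vector_ranking])
print(fused[0])  # doc_ceo wins with 2/61, i.e. ~0.0328 -- same magnitude as the scores above
```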
Either that or the query code is not correct -- but I don't have access to this feature to test
I can point you towards the source code if you want to debug further
If you make any fixes, open a PR and I can get it merged for you πŸ’ͺ
Hey, so I haven't had much time to figure out exactly how to fix it, but I think https://github.com/run-llama/llama_index/blob/362a79cc8791580695a0e8b70511faeb4573386d/llama_index/vector_stores/cogsearch.py#L555

is missing:
query_type="semantic",
semantic_configuration_name="default",

Plain Text
        results = self._search_client.search(
            search_text=search_query,
            query_type="semantic",
            semantic_configuration_name="default",
            vectors=vectors,
            top=query.similarity_top_k,
            select=select_fields,
            filter=odata_filter,
        )
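One caveat with that fix: for semantic_configuration_name="default" to work, the index itself has to define a semantic configuration with that name. A sketch of what that looked like with the azure-search-documents preview SDK of the time; the exact model classes shifted between preview versions, so treat the names below as an assumption rather than the definitive API:

```python
# Index-side counterpart of query_type="semantic" + semantic_configuration_name="default".
# Assumes an azure-search-documents 11.4.0bX preview SDK; verify class names
# against the version you have installed.
from azure.search.documents.indexes.models import (
    PrioritizedFields,
    SemanticConfiguration,
    SemanticField,
    SemanticSettings,
)

semantic_settings = SemanticSettings(
    configurations=[
        SemanticConfiguration(
            name="default",  # must match semantic_configuration_name in the query
            prioritized_fields=PrioritizedFields(
                # "content" matches the chunk_field_key used when creating the store
                prioritized_content_fields=[SemanticField(field_name="content")],
            ),
        )
    ]
)
# Attach via SearchIndex(..., semantic_settings=semantic_settings) before creating the index.
```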
ooo interesting. If you add that, the query makes more sense?
the results are better after adding those fields