Hey hope everyone is doing good!

I have a question, or rather an issue. I have implemented a pipeline for uploading documents with embeddings to Azure Cognitive Search using the CognitiveSearchVectorStore class and IngestionPipeline. Everything goes well, but the search results using the default query engine (as_query_engine) are absolutely terrible. It can't answer simple questions like who the CEO of the company I'm working for is (while there are a lot of documents mentioning who it is). Meanwhile, when searching with the Azure portal's search explorer, the results are perfect.

I've used the setup from the docs, following the IngestionPipeline example.

Plain Text
....
azure_vector_store = CognitiveSearchVectorStore(
    search_or_index_client=azure_index_client,
    index_name=azure_index_name,
    filterable_metadata_field_keys=metadata_fields,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="content",
    embedding_field_key="embedding",
    metadata_string_field_key="li_jsonMetadata",
    doc_id_field_key="li_doc_id",
    embedding_dimensionality=embedding_dimensionality,
)
....
text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1000,
    chunk_overlap=50,
    tokenizer=tiktoken.encoding_for_model(llm_model).encode,
    include_metadata=True,
)
pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        CleanHTMLTransform(),
        embed_model
    ],
    cache=cache,
)

pipelined_nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
azure_vector_store.add(pipelined_nodes)
17 comments
When you query, try checking response.source_nodes to see which nodes are being pulled in to answer the query
the source nodes are among the least relevant chunks of the docs, with scores
0.028588563203811646
0.018051709979772568
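For anyone following along, a minimal way to eyeball what the query engine retrieved is to loop over response.source_nodes and print each score next to a snippet of the chunk. The sketch below uses stand-in objects so it runs without llama_index installed; a real llama_index Response exposes the same .source_nodes / .score / .node.get_content() shape.

```python
from types import SimpleNamespace

# Stand-ins mimicking llama_index's NodeWithScore; the texts and scores here
# are illustrative, not real retrieval output.
def make_node(score, text):
    return SimpleNamespace(score=score, node=SimpleNamespace(get_content=lambda t=text: t))

response = SimpleNamespace(source_nodes=[
    make_node(0.0286, "...some chunk that never mentions the CEO..."),
    make_node(0.0181, "...another barely related chunk..."),
])

# Print each retrieved node's score and the first 80 characters of its text.
for n in response.source_nodes:
    print(f"{n.score:.4f}  {n.node.get_content()[:80]}")
```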
Plain Text
query_engine = index.as_query_engine(vector_store_query_mode="hybrid")
response = query_engine.query("....")

removed the query string for privacy reasons
The score might be distance?
so smaller is better?
is the default order least to most relevant or something? On the Azure portal the same query returns a score of 10 lol
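On the score magnitudes: per the Azure docs, hybrid mode fuses the keyword and vector rankings with Reciprocal Rank Fusion (RRF), scoring each document as the sum of 1 / (k + rank) over the rankings it appears in, with k = 60. Fused scores therefore sit around 0.01–0.03 no matter how good the match is, while the portal's ~10 is a raw BM25 relevance score, so the two aren't comparable; higher is still better in both. A minimal RRF sketch (the doc ids are made up for illustration):

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids into one RRF-scored ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first -- so with RRF, bigger is still better.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

keyword_ranking = ["doc_ceo", "doc_a", "doc_b"]  # BM25 order (hypothetical)
vector_ranking = ["doc_ceo", "doc_b", "doc_c"]   # vector-similarity order (hypothetical)

fused = rrf([keyword_ranking, vector_ranking])
print(fused[0])  # doc_ceo wins with 2/61, i.e. ~0.0328 -- same magnitude as the scores above
```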
Either that or the query code is not correct -- but I don't have access to this feature to test
I can point you towards the source code if you want to debug further
If you make any fixes, open a PR and I can get it merged for you πŸ’ͺ
Hey, so I haven't had much time to figure out exactly how to fix it, but I think https://github.com/run-llama/llama_index/blob/362a79cc8791580695a0e8b70511faeb4573386d/llama_index/vector_stores/cogsearch.py#L555

is missing:
query_type="semantic",
semantic_configuration_name="default",

Plain Text
        results = self._search_client.search(
            search_text=search_query,
            query_type="semantic",
            semantic_configuration_name="default",
            vectors=vectors,
            top=query.similarity_top_k,
            select=select_fields,
            filter=odata_filter,
        )
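One caveat with that fix: for semantic_configuration_name="default" to work, the index itself has to define a semantic configuration with that name. A sketch of what that looked like with the azure-search-documents preview SDK of the time; the exact model classes shifted between preview versions, so treat the names below as an assumption rather than the definitive API:

```python
# Index-side counterpart of query_type="semantic" + semantic_configuration_name="default".
# Assumes an azure-search-documents 11.4.0bX preview SDK; verify class names
# against the version you have installed.
from azure.search.documents.indexes.models import (
    PrioritizedFields,
    SemanticConfiguration,
    SemanticField,
    SemanticSettings,
)

semantic_settings = SemanticSettings(
    configurations=[
        SemanticConfiguration(
            name="default",  # must match semantic_configuration_name in the query
            prioritized_fields=PrioritizedFields(
                # "content" matches the chunk_field_key used when creating the store
                prioritized_content_fields=[SemanticField(field_name="content")],
            ),
        )
    ]
)
# Attach via SearchIndex(..., semantic_settings=semantic_settings) before creating the index.
```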
ooo interesting. If you add that, the query makes more sense?
the results are better after adding those fields