Hello, I am having issues with a very simple example: if I load a single page into a vector store, everything is fine. However, if I split this page, for example into the sections of a Wikipedia article, and load it into a vector store, the answers are now pretty bad, as if the query engine were using a single section to answer everything. Would you have an idea where my issue is?

Plain Text
import chromadb
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    set_global_service_context,
)
from llama_index.embeddings import LangchainEmbedding
from llama_index.llms import OpenAI
from llama_index.vector_stores import ChromaVectorStore

# create client and a new collection
client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = client.get_or_create_collection("split_doc_collection")

# load the manually split text files (one file per section)
docs = SimpleDirectoryReader("../data/split_doc_collection", recursive=True).load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create and download the embeddings instance
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)

model = "gpt-3.5-turbo"
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    chunk_overlap=50,
    # system_prompt is assumed to be defined earlier in the script
    llm=OpenAI(model=model, temperature=0.5, system_prompt=system_prompt),
    embed_model=embed_model,
)

# and set the service context globally
set_global_service_context(service_context)

# build the index over the documents
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)
12 comments
You could try the following:
  • Retrieve more chunks per query: query_engine = index.as_query_engine(similarity_top_k=5)
  • Try increasing or decreasing the chunk_size value (see the sketch below).
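For example, a rough sketch of both ideas, reusing the names from your snippet (the chunk_size value and the query string are only illustrative, not recommendations):

Plain Text
# illustrative only: rebuild the service context with a different chunk_size
service_context = ServiceContext.from_defaults(
    chunk_size=512,  # try values smaller or larger than your current 1024
    chunk_overlap=50,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.5),
    embed_model=embed_model,
)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)

# retrieve more chunks per query so the answer can draw on several sections
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the article")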
Thanks a lot, that's helpful. But what I am trying to understand is why, if I feed the text content in a single file, it does fine, but if I split it in chunks manually (which are about the chunk_size of the model) it does much worse. It seems to me that I am not using the right architecture.
but if I split it in chunks manually (which are about the chunk_size of the model) it does much worse.

By this, do you mean you are creating nodes with a chunk size of 4096, in the case of GPT-3.5?
When you create nodes with a larger chunk size, the LLM you chose only has a limited context window per call, so it may only be able to fit a single record, and that is why you are not getting good results.
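As a rough sanity check (this token-counting helper is my own assumption, using tiktoken, not part of your setup), you can see how many of your manually split files would even fit in a single gpt-3.5-turbo call:

Plain Text
# rough check: how big are the manually split files, in tokens?
# (assumes tiktoken is installed; 4096 is gpt-3.5-turbo's original context window)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

for doc in docs:
    n_tokens = len(enc.encode(doc.text))
    print(doc.metadata.get("file_name"), n_tokens)

# if each section is already close to 4096 tokens, only one section (plus the
# prompt) fits per LLM call, so the answer ends up based on a single section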


You could try creating Documents only and letting the framework split them into nodes itself,
and see if that improves the performance.
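Roughly like this, as a sketch (the directory path is just a placeholder for wherever the unsplit article lives): point the reader at the full article and let the chunk_size / chunk_overlap from your service context drive the splitting:

Plain Text
# load the full, unsplit article; one Document per file
docs = SimpleDirectoryReader("../data/full_article").load_data()

# VectorStoreIndex chunks the Documents into nodes itself,
# using the chunk_size / chunk_overlap from the service context
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)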
What I mean is: let's say we have a long Wikipedia article divided into small sections. When I feed the full document in one go, summarizing works well. However, if I split each section into a separate text file, the result is terrible, as it seems to focus only on one of them. I tried similarity_top_k=5 and the DocumentSummaryIndex; neither works well.
I am doing this to test the idea of having multiple documents on the same topic in one place and querying them. For some reason it works better if I append all the documents into a single one.
It's probably just a configuration issue, or me using the wrong index type?
Many thanks for helping :)
Can you show me how you are splitting your texts manually?
I just take each section of an article and write it to a new text file. Something else I tried is having 5 versions of the same page, each containing a subset of the facts. Ideally the summary would contain a mix of all the facts, but it focuses only on the subset from one of the pages (which can vary).
You also mentioned that you set the chunk size to the model size. Can you tell me what value you set?
So you are creating the Document objects yourself here?