Hello, I am having issues with a very simple example: if I load a single page into a vector store, everything is fine. However, if I split this page, for example into the sections of a Wikipedia article, and load it into a vector store, the answers are now pretty bad, as if the query engine were using a single section to answer everything. Would you have an idea where my issue is?

Plain Text
import chromadb
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    set_global_service_context,
)
from llama_index.embeddings import LangchainEmbedding
from llama_index.llms import OpenAI
from llama_index.vector_stores import ChromaVectorStore

# create client and a new collection
client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = client.get_or_create_collection("split_doc_collection")

# load the manually split text files (one file per section)
docs = SimpleDirectoryReader("../data/split_doc_collection", recursive=True).load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create and download the embeddings instance
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)

model = "gpt-3.5-turbo"
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    chunk_overlap=50,
    # system_prompt is assumed to be defined earlier in the script
    llm=OpenAI(model=model, temperature=0.5, system_prompt=system_prompt),
    embed_model=embed_model,
)

# and set the service context globally
set_global_service_context(service_context)

# build the index over the documents
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)
12 comments
You could try the following:
  • Retrieve more chunks per query: query_engine = index.as_query_engine(similarity_top_k=5)
  • Try increasing or decreasing the chunk_size value (see the sketch below).
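For example, a rough sketch of both ideas, reusing the names from your snippet (the chunk_size value and the query string are only illustrative, not recommendations):

Plain Text
# illustrative only: rebuild the service context with a different chunk_size
service_context = ServiceContext.from_defaults(
    chunk_size=512,  # try values smaller or larger than your current 1024
    chunk_overlap=50,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.5),
    embed_model=embed_model,
)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)

# retrieve more chunks per query so the answer can draw on several sections
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Summarize the article")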
Thanks a lot, that's helpful. But what I am trying to understand is why, if I feed the text content in a single file, it does fine, but if I split it in chunks manually (which are about the chunk_size of the model) it does much worse. It seems to me that I am not using the right architecture.
but if I split it in chunks manually (which are about the chunk_size of the model) it does much worse.

By this, do you mean you are creating nodes with a chunk size of 4096, in the case of GPT-3.5?
When you create nodes with a larger chunk size, the LLM you chose only has a limited context window per call, so it may only be able to fit a single record, and that is why you are not getting good results.
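As a rough sanity check (this token-counting helper is my own assumption, using tiktoken, not part of your setup), you can see how many of your manually split files would even fit in a single gpt-3.5-turbo call:

Plain Text
# rough check: how big are the manually split files, in tokens?
# (assumes tiktoken is installed; 4096 is gpt-3.5-turbo's original context window)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

for doc in docs:
    n_tokens = len(enc.encode(doc.text))
    print(doc.metadata.get("file_name"), n_tokens)

# if each section is already close to 4096 tokens, only one section (plus the
# prompt) fits per LLM call, so the answer ends up based on a single section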


You could try creating Documents only and letting the framework split them into nodes itself,
and see if that improves the performance.
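Roughly like this, as a sketch (the directory path is just a placeholder for wherever the unsplit article lives): point the reader at the full article and let the chunk_size / chunk_overlap from your service context drive the splitting:

Plain Text
# load the full, unsplit article; one Document per file
docs = SimpleDirectoryReader("../data/full_article").load_data()

# VectorStoreIndex chunks the Documents into nodes itself,
# using the chunk_size / chunk_overlap from the service context
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)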
What I mean is: let's say we have a long Wikipedia article divided into small sections. When I feed the full document in one go, summarizing works well. However, if I split each section into a separate text file, the result is terrible, as it seems to focus only on one of them. I tried similarity_top_k=5 and the DocumentSummaryIndex; neither works well.
I am doing this to test the idea of having multiple documents on the same topic in one place and querying them. For some reason it works better if I append all the documents into a single one.
It's probably just a configuration issue, or me using the wrong index type?
Many thanks for helping :)
Can you show me how you are splitting your texts manually?
I just take each section of an article and write it to a new text file. Something else I tried is having 5 versions of the same page, each containing a subset of the facts. Ideally the summary would contain a mix of all the facts, but it focuses only on the subset from one of the pages (which can vary).
You also mentioned that you set the chunk size to the model size. Can you tell me what value you set?
So you are creating the Document objects yourself here?