Hey guys, I am trying to index data that is around 500MB in size. Is there any way I can do it without errors and save a little time (faster indexing)? Doing embeddings for data this large is a little hectic.
You can increase the embed_batch_size; the default is 10.
Thanks for the response @Logan M. Does it also work with a batch of files, like 50 PDFs of 300MB, in one go?
It will certainly help with speed, although you might hit rate limits too.
You can insert one doc at a time if you hit rate limits.
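For example, a rough sketch of that pattern (assuming the legacy llama_index API used elsewhere in this thread; "./data" is just a placeholder path):

Python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# build an empty index first, then insert one document at a time,
# so a rate-limit error only interrupts the current document
index = VectorStoreIndex.from_documents([])
for doc in documents:
    index.insert(doc)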
OK, is there any other method, or something I could add to this, so it can save some cost?
@Logan M any thoughts ?
Not sure what you mean here 🤔 Increasing the batch size is really the only way to speed up when you have such a large set of data to embed.
And does the cost of the embedding increase, or how will it be affected?
No cost change, it will just be faster
At the end of the day, it has to embed all the data, and the only cost is how many tokens you embed (which won't change with batch size)
Python
from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding

# send 42 texts per embedding request instead of the default 10
embed_model = OpenAIEmbedding(embed_batch_size=42)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context to avoid passing it into other objects every time
from llama_index import set_global_service_context
set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

response = query_engine.query("What is the document all about?")

print(response)


Even after passing the doc, I'm getting this response: "It is not possible to answer this question without prior knowledge of the document."
It doesn't know what "document" you are referring to 😅

Try something like "Summarize the text"
One more thing: can we pass this batch_size to the OpenAI API directly? If yes, any reference for the same?
And also @Logan M, does it improve the context in the response? Our main issue is that when we embed multiple docs of large size, the query response we are getting is not as expected.
Hmm, we can only set the batch size for the embeddings. Not sure how to pass it to the raw API.
When you have many documents, you'll probably want to either increase similarity_top_k (the default is 2) or split the data into multiple indexes and use a sub question or router query engine
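Roughly something like this (a sketch only, assuming the same legacy llama_index API as the snippet above; the folder names, tool names, and top_k value are placeholders):

Python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# hypothetical split: one folder (and one index) per group of documents
index_a = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data/set_a").load_data())
index_b = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data/set_b").load_data())

# option 1: retrieve more chunks per query (the default similarity_top_k is 2)
query_engine = index_a.as_query_engine(similarity_top_k=5)

# option 2: wrap each index as a tool and let a sub-question engine route between them
tools = [
    QueryEngineTool(
        query_engine=index_a.as_query_engine(),
        metadata=ToolMetadata(name="set_a", description="Questions about document set A"),
    ),
    QueryEngineTool(
        query_engine=index_b.as_query_engine(),
        metadata=ToolMetadata(name="set_b", description="Questions about document set B"),
    ),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("Compare the two document sets")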
Got it @Logan M, thanks. Also, do we have some references, a Colab notebook, or docs I can look at?
Tried both of the above, but the response I am getting is not completely accurate. I am trying to retrieve data from multiple documents.
@Logan M, is there a way to improve the query response and retrieve the right context?
What kinds of questions are you asking? What kinds of documents?
I have multiple PDFs that contain text (some in table formats, bullet points), but while I am doing Q&A with them it gets confused and mixes up the responses.
@Logan M I believe this could be an issue with the embeddings as well. Is there any method to make the embeddings more accurate?
The biggest improvement you can make is better pre-processing and cleaning of your text before creating the index 🤔 Other than that, chunk_size or similarity_top_k could help.
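A small sketch of those two knobs (same legacy llama_index API as above; the chunk size and top_k values are just examples):

Python
from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader

# smaller chunks than the default can give tighter, more focused retrieval hits
service_context = ServiceContext.from_defaults(chunk_size=512)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# pull more chunks per query than the default of 2
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("Summarize the text")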
Any reference? I mean, any docs I can look into?
@Logan M I believe most of the community has this issue of pre-processing before creating an index.
No clear guide at the moment.

It's really just about preprocessing documents into clear sections.

Adding additional metadata may also help with retrieval
https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/documents_and_nodes/usage_metadata_extractor.html
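For the metadata side, something like this sketch (assuming the same legacy llama_index API; the "source"/"category" keys and values are hypothetical):

Python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()

# hypothetical: tag each document so retrieved chunks carry extra context
for doc in documents:
    doc.metadata["source"] = doc.metadata.get("file_name", "unknown")
    doc.metadata["category"] = "invoices"  # placeholder label

index = VectorStoreIndex.from_documents(documents)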
Thanks, I will look into this.
@Logan M any tips for cleaning my text before creating the index?