Hey guys, I am trying to index data that is around 500MB in size. Is there any way I can do it without errors and save a little time (faster indexing)? Doing embeddings for data this large is a little hectic.
You can increase the embed_batch_size; the default is 10.
Thanks for the response @Logan M. Does it also work with a batch of files, like 50 PDFs of 300MB, in one go?
It will certainly help with speed, although you might hit rate limits too.
You can insert one doc at a time if you hit rate limits.
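For example, a rough sketch of that pattern (assuming the legacy llama_index API used elsewhere in this thread; "./data" is just a placeholder path):

Python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

# build an empty index first, then insert one document at a time,
# so a rate-limit error only interrupts the current document
index = VectorStoreIndex.from_documents([])
for doc in documents:
    index.insert(doc)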
OK, is there any other method, or something I could add to this, so it can save some cost?
@Logan M any thoughts ?
Not sure what you mean here 🤔 Increasing the batch size is really the only way to speed up when you have such a large set of data to embed.
And does the cost of the embedding increase, or how will it be affected?
No cost change, it will just be faster
At the end of the day, it has to embed all the data, and the only cost is how many tokens you embed (which won't change with batch size)
Python
from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding

# send 42 texts per embedding request instead of the default 10
embed_model = OpenAIEmbedding(embed_batch_size=42)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context to avoid passing it into other objects every time
from llama_index import set_global_service_context
set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

response = query_engine.query("What is the document all about?")

print(response)


Even after passing the doc, I'm getting this response: "It is not possible to answer this question without prior knowledge of the document."
It doesn't know what "document" you are referring to 😅

Try something like "Summarize the text"
One more thing: can we pass this batch_size to the OpenAI API directly? If yes, any reference for the same?
And also @Logan M, does it improve the context in the response? Our main issue is that when we embed multiple docs of large size, the query response we are getting is not as expected.
Hmm, we can only set the batch size for the embeddings. Not sure how to pass it to the raw API.
When you have many documents, you'll probably want to either increase similarity_top_k (the default is 2) or split the data into multiple indexes and use a sub question or router query engine
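Roughly something like this (a sketch only, assuming the same legacy llama_index API as the snippet above; the folder names, tool names, and top_k value are placeholders):

Python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# hypothetical split: one folder (and one index) per group of documents
index_a = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data/set_a").load_data())
index_b = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data/set_b").load_data())

# option 1: retrieve more chunks per query (the default similarity_top_k is 2)
query_engine = index_a.as_query_engine(similarity_top_k=5)

# option 2: wrap each index as a tool and let a sub-question engine route between them
tools = [
    QueryEngineTool(
        query_engine=index_a.as_query_engine(),
        metadata=ToolMetadata(name="set_a", description="Questions about document set A"),
    ),
    QueryEngineTool(
        query_engine=index_b.as_query_engine(),
        metadata=ToolMetadata(name="set_b", description="Questions about document set B"),
    ),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("Compare the two document sets")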
Got it @Logan M, thanks. Also, do we have some references, a Colab notebook, or docs I can look at?
Tried both of the above, but the response I am getting is not completely accurate. I am trying to retrieve data from multiple documents.
@Logan M, is there a way to improve the query response and retrieve the right context?
What kinds of questions are you asking? What kinds of documents?
I have multiple PDFs that contain text (some in table formats, bullet points), but while I am doing Q&A with them it gets confused and mixes up the responses.
@Logan M I believe this could be an issue with the embeddings as well. Is there any method to make the embeddings more accurate?
The biggest improvement you can make is better pre-processing and cleaning of your text before creating the index 🤔 Other than that, chunk_size or similarity_top_k could help.
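A small sketch of those two knobs (same legacy llama_index API as above; the chunk size and top_k values are just examples):

Python
from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader

# smaller chunks than the default can give tighter, more focused retrieval hits
service_context = ServiceContext.from_defaults(chunk_size=512)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# pull more chunks per query than the default of 2
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("Summarize the text")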
Any reference? I mean, any docs I can look into?
@Logan M I believe most of the community has this issue of pre-processing before creating an index.
No clear guide at the moment.

It's really just about preprocessing documents into clear sections.

Adding additional metadata may also help with retrieval
https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/documents_and_nodes/usage_metadata_extractor.html
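For the metadata side, something like this sketch (assuming the same legacy llama_index API; the "source"/"category" keys and values are hypothetical):

Python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()

# hypothetical: tag each document so retrieved chunks carry extra context
for doc in documents:
    doc.metadata["source"] = doc.metadata.get("file_name", "unknown")
    doc.metadata["category"] = "invoices"  # placeholder label

index = VectorStoreIndex.from_documents(documents)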
Thanks, I will look into this.
@Logan M any tips for cleaning my text before creating the index?