Vector index building

Or maybe it would be a better idea to use another type of index, != GPTSimpleVectorIndex?
Nah a vector index should be fine.

Even though it's creating many documents, the calls to openai to generate embeddings are batched. Is it just taking a long time to build the index?
Yup, it never ends, I'm hitting OpenAI rate limits. What I can see in the DEBUG logs is that it's sending a request to OpenAI for each chat text line. It doesn't seem to be grouping the text lines into bigger bundles.
OK, not true, I take it back. It's grouping the lines actually.
OK good 😅 at least that's working
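(For reference, the DEBUG output mentioned above comes through Python's standard `logging` module, so something like this should surface the embedding requests:)

```python
import logging
import sys

# Route llama_index's DEBUG-level logs (including embedding requests) to stdout
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
```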
Do you have a paid account? I know the initial $5 credit limits to 60 requests per minute
Actually I'm using the initial $5 one. What I see is that the text that gets sent to OpenAI has too much overhead. But I guess that's something the WhatsApp loader is doing, not LlamaIndex itself. I'll try to check it.
I noticed that the input seemed to be ~1,600 in length. I'm using this: PromptHelper(max_input_size=4096, max_chunk_overlap=20, num_output=num_output)
Looks like there's another param somewhere else that gets read to build the text batches?
Yes, in the service_context object you can also set the chunk size (default is 3900 tokens):

```python
ServiceContext.from_defaults(..., chunk_size_limit=1500)
```
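(Putting that together with the PromptHelper from earlier — a minimal sketch for the 0.5-era API; the 256 stands in for the num_output variable used above:)

```python
from llama_index import PromptHelper, ServiceContext

# chunk_size_limit caps the token count of each text chunk (default ~3900);
# it does not control how many chunks get batched per embedding request
prompt_helper = PromptHelper(max_input_size=4096, max_chunk_overlap=20, num_output=256)
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    chunk_size_limit=1500,
)
```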
Hm... it seems it's grouping documents in groups of 10.
(chunk_size_limit doesn't affect it)
Chunk size limit just changes the size of each text chunk, not the grouping 😅

I'm not sure if there's a control for the grouping 🤔
Trying to figure out how GPTSimpleVectorIndex builds the text string that gets sent to OpenAI…
But that's the split part. I would guess the 'join' of documents prior to sending them upstream is something else (?)
For creating embeddings, looks like it starts around here
https://github.com/jerryjliu/llama_index/blob/main/gpt_index/indices/vector_store/base.py#L172

Maybe one option is splitting your documents into nodes ahead of time and passing them in, rather than using from documents, so that the nodes get batched better?

https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#parse-the-documents-into-nodes
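(From the linked usage-pattern page, that pre-splitting looks roughly like this — a sketch assuming a version where the index constructor accepts a list of nodes directly, and a `documents` list coming from the loader:)

```python
from llama_index import GPTSimpleVectorIndex
from llama_index.node_parser import SimpleNodeParser

# Split Documents into Nodes ahead of time instead of using from_documents()
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Build the index from the pre-parsed nodes
index = GPTSimpleVectorIndex(nodes)
```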
Oh, interesting. If I do that, I get the same log message that I see when using from_documents ("Adding chunk: …"). From what I can see, the nodes' text is the same as the documents' text.
Probably because the text is short and doesn't need splitting. Batching seems to be happening later on.
For what it's worth, I just learned that the loader 'just' returns a list of Documents.
There it is, the magic number 10.
oooo you found it!
So you were right, no way to change it.
Actually, it looks like it's an optional parameter to the class: https://github.com/jerryjliu/llama_index/blob/170150eb5cfe73000c511d97c604ddb5a6f2e9ab/gpt_index/embeddings/base.py#L51

But would need to trace the call stack a long way up to see how to set it (if it's even possible) 🤔
oooo I think I see how to set it, I'll try to write an example
It comes from the ServiceContext:
```python
from llama_index import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding(embed_batch_size=50))
```
something like that maybe??
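(For completeness, the batch size then flows into the index build through the service context — a sketch, assuming the 0.5-era from_documents API and a `documents` list from the loader:)

```python
from llama_index import GPTSimpleVectorIndex

# Embeddings are now requested in batches of 50 instead of the default 10
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
```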
writing exactly that.. 🙂
Amazing! 👏
I'm just missing one detail: this 'extra_info' attribute on Document... what is it used for? For my case it seems useless.
Just to track info about the document in a dict (maybe a filename, or page number)
OK, I guess it does make sense to stick that into the embedding somehow under certain circumstances.
The useful part is that the extra info will show up in response.source_nodes, if you want to better track where answers came from
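(A quick sketch of that round trip — extra_info set at load time surfaces again on the source nodes; the exact attribute layout varies between versions:)

```python
from llama_index import Document, GPTSimpleVectorIndex

# Attach provenance metadata when creating the document
doc = Document("Alice: see you at 9", extra_info={"filename": "chat.txt"})

index = GPTSimpleVectorIndex.from_documents([doc])
response = index.query("When are they meeting?")

# The extra_info travels along with the nodes used to answer
for source in response.source_nodes:
    print(source.extra_info)  # in some versions: source.node.extra_info
```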
Makes sense, however not so much when sending that to the model...
Woot, 2MB of text input generated an 800MB index.
Yup! Each embedding is 1536 dimensions, big numbers
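(The numbers roughly add up if the loader produced one short Document per chat message, so each message got its own 1536-dim vector stored as JSON text — a back-of-the-envelope sketch; the ~80-byte line and ~20-byte-per-float figures are guesses:)

```python
# Why 2MB of chat text can turn into an ~800MB JSON index
text_bytes = 2 * 1024 * 1024      # 2MB WhatsApp export
avg_msg_bytes = 80                # guess: short chat lines, one embedding each
num_embeddings = text_bytes // avg_msg_bytes  # ~26,000 vectors

dims = 1536                       # text-embedding-ada-002 dimensions
bytes_per_float_json = 20         # each float serialized as decimal text

total_bytes = num_embeddings * dims * bytes_per_float_json
print(f"~{total_bytes / 1e6:.0f} MB")  # ~805 MB
```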