Vector index building

Or maybe it would be a better idea to use another type of index, != GPTSimpleVectorIndex?
Nah a vector index should be fine.

Even though it's creating many documents, the calls to openai to generate embeddings are batched. Is it just taking a long time to build the index?
Yup, it never ends, I'm hitting OpenAI rate limits. What I can see in the DEBUG logs is that it's sending a request to OpenAI for each chat text line. It doesn't seem to be grouping the text lines into bigger bundles.
OK, not true, I take it back. It's grouping the lines actually.
OK good 😅 at least that's working
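(For reference, the DEBUG output mentioned above comes through Python's standard `logging` module, so something like this should surface the embedding requests:)

```python
import logging
import sys

# Route llama_index's DEBUG-level logs (including embedding requests) to stdout
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
```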
Do you have a paid account? I know the initial $5 credit limits to 60 requests per minute
Actually I'm using the initial $5 one. What I see is that the text that gets sent to OpenAI has too much overhead. But I guess that's something the WhatsApp loader is doing, not LlamaIndex itself. I'll try to check it.
I noticed that the input seemed to be ~1,600 in length. I'm using this: PromptHelper(max_input_size=4096, max_chunk_overlap=20, num_output=num_output)
Looks like there's another param somewhere else that gets read to build the text batches?
Yes, in the service_context object you can also set the chunk size (default is 3900 tokens):

```python
ServiceContext.from_defaults(..., chunk_size_limit=1500)
```
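(Putting that together with the PromptHelper from earlier — a minimal sketch for the 0.5-era API; the 256 stands in for the num_output variable used above:)

```python
from llama_index import PromptHelper, ServiceContext

# chunk_size_limit caps the token count of each text chunk (default ~3900);
# it does not control how many chunks get batched per embedding request
prompt_helper = PromptHelper(max_input_size=4096, max_chunk_overlap=20, num_output=256)
service_context = ServiceContext.from_defaults(
    prompt_helper=prompt_helper,
    chunk_size_limit=1500,
)
```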
Hm... it seems it's grouping documents in groups of 10.
(chunk_size_limit doesn't affect it)
Chunk size limit just changes the size of each text chunk, not the grouping 😅

I'm not sure if there's a control for the grouping 🤔
Trying to figure out how GPTSimpleVectorIndex builds the text string that gets sent to OpenAI…
But that's the split part. I would guess the 'join' of documents prior to sending them upstream is something else (?)
For creating embeddings, looks like it starts around here
https://github.com/jerryjliu/llama_index/blob/main/gpt_index/indices/vector_store/base.py#L172

Maybe one option is splitting your documents into nodes ahead of time and passing them in, rather than using from documents, so that the nodes get batched better?

https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#parse-the-documents-into-nodes
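(From the linked usage-pattern page, that pre-splitting looks roughly like this — a sketch assuming a version where the index constructor accepts a list of nodes directly, and a `documents` list coming from the loader:)

```python
from llama_index import GPTSimpleVectorIndex
from llama_index.node_parser import SimpleNodeParser

# Split Documents into Nodes ahead of time instead of using from_documents()
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# Build the index from the pre-parsed nodes
index = GPTSimpleVectorIndex(nodes)
```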
Oh, interesting. If I do that, I get the same log message that I see when using from_documents ("Adding chunk: …"). From what I can see, the nodes' text is the same as the documents' text.
Probably because the text is short and doesn't need splitting. Batching seems to be happening later on.
For what it's worth, I just learned that the loader 'just' returns a list of Documents.
There it is, the magic number 10.
oooo you found it!
So you were right, no way to change it.
Actually, it looks like it's an optional parameter to the class: https://github.com/jerryjliu/llama_index/blob/170150eb5cfe73000c511d97c604ddb5a6f2e9ab/gpt_index/embeddings/base.py#L51

But would need to trace the call stack a long way up to see how to set it (if it's even possible) 🤔
oooo I think I see how to set it, I'll try to write an example
It comes from the ServiceContext:
```python
from llama_index import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding(embed_batch_size=50))
```
something like that maybe??
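(For completeness, the batch size then flows into the index build through the service context — a sketch, assuming the 0.5-era from_documents API and a `documents` list from the loader:)

```python
from llama_index import GPTSimpleVectorIndex

# Embeddings are now requested in batches of 50 instead of the default 10
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
```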
writing exactly that.. 🙂
Amazing! 👏
I'm just missing one detail: this 'extra_info' attribute on Document... what is it used for? For my case it seems useless.
Just to track info about the document in a dict (maybe a filename, or page number)
OK, I guess it does make sense to stick that into the embedding somehow under certain circumstances.
The useful part is that the extra info will show up in response.source_nodes, if you want to better track where answers came from
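(A quick sketch of that round trip — extra_info set at load time surfaces again on the source nodes; the exact attribute layout varies between versions:)

```python
from llama_index import Document, GPTSimpleVectorIndex

# Attach provenance metadata when creating the document
doc = Document("Alice: see you at 9", extra_info={"filename": "chat.txt"})

index = GPTSimpleVectorIndex.from_documents([doc])
response = index.query("When are they meeting?")

# The extra_info travels along with the nodes used to answer
for source in response.source_nodes:
    print(source.extra_info)  # in some versions: source.node.extra_info
```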
Makes sense, however not so much when sending that to the model...
Woot, 2MB of text input generated an 800MB index.
Yup! Each embedding is 1536 dimensions, big numbers
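(The numbers roughly add up if the loader produced one short Document per chat message, so each message got its own 1536-dim vector stored as JSON text — a back-of-the-envelope sketch; the ~80-byte line and ~20-byte-per-float figures are guesses:)

```python
# Why 2MB of chat text can turn into an ~800MB JSON index
text_bytes = 2 * 1024 * 1024      # 2MB WhatsApp export
avg_msg_bytes = 80                # guess: short chat lines, one embedding each
num_embeddings = text_bytes // avg_msg_bytes  # ~26,000 vectors

dims = 1536                       # text-embedding-ada-002 dimensions
bytes_per_float_json = 20         # each float serialized as decimal text

total_bytes = num_embeddings * dims * bytes_per_float_json
print(f"~{total_bytes / 1e6:.0f} MB")  # ~805 MB
```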