Rate limit

At a glance

The community member is new to creating VectorStore indexes using the OpenAI API and is hitting an OpenAI token rate limit when trying to index a large number of files. They want to learn how to create larger VectorStore indexes or merge different VectorStores.

The comments suggest lowering the embedding batch size, which reduces the number of tokens sent per request and can help stay under the rate limit. The original poster confirms they have a paid OpenAI account but still reach the limit, so they had to divide the files into smaller folders, resulting in multiple VectorStores.

The community members discuss the possibility of merging different VectorStores, but there doesn't seem to be a direct way to do this. Instead, they suggest wrapping each index as a tool in an agent or subquestion query engine, although this may not be the ideal solution.

The community members recommend that the original poster try changing the batch size and recreating the index with all the files, which seems to have worked based on the final comments.

There is also a discussion about whether OpenAI stores the data sent through the API, and the community members confirm that OpenAI retains it for up to 30 days but does not use it for training.

Hi guys, I am really new to this type of technology. I have managed to create a VectorStore index from my data using the OpenAI API, and I am really interested in learning how to create larger VectorStore indexes. The reason is that I now have a lot of files, and when I try to build the index I get an OpenAI token limit error. So I was wondering how I can merge/load different VectorStores, or how I can load a lot more files.

This is how I am loading the files (failing code due to the token limit):

Python
# Legacy llama_index (gpt-index) API
from langchain.llms import OpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)


def construct_index(directory_path):
    num_outputs = 1024

    # LLM used to synthesize answers at query time
    llm_predictor = LLMPredictor(
        llm=OpenAI(
            temperature=0.1, model_name="text-davinci-003", max_tokens=num_outputs
        )
    )

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

    # Load every file in the directory as a Document
    docs = SimpleDirectoryReader(directory_path).load_data()

    # from_documents() chunks the documents into nodes and embeds them;
    # the constructor's `nodes` argument expects Node objects, not Documents
    index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)

    # Persist the index to disk so it can be reloaded later
    index.storage_context.persist(persist_dir="index")

    return index
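
For reference, an index persisted this way can be reloaded later without re-embedding anything; a minimal sketch using the same legacy llama_index API and the persist_dir from above:

Python
from llama_index import StorageContext, load_index_from_storage

# Point the storage context at the directory the index was persisted to
storage_context = StorageContext.from_defaults(persist_dir="index")

# Rebuild the in-memory index from the persisted stores
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()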


Thanks in advance for the help 😉
12 comments
Do you have a paid OpenAI account? I know the trial usage is very rate-limited

In any case, you can try lowering the embedding batch size (the default is 10)
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#batch-size
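For example, a minimal sketch assuming the legacy llama_index API, where the embedding model is configured through the ServiceContext:

Python
from llama_index import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# Embed 1 chunk per API request instead of the default 10, so each
# request stays under the tokens-per-minute rate limit (just slower)
embed_model = OpenAIEmbedding(embed_batch_size=1)
service_context = ServiceContext.from_defaults(embed_model=embed_model)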
Yes, I have a paid OpenAI account, but I still reach the limit. So I have to divide the files into 4 smaller folders, but this results in having 4 different vector stores
Lowering the batch size will likely help then (like lowering to 1)

Although it will just be slower
Is there a way I can merge different VectorStores?
Yeah, there's no real merge function, but you could wrap each index as a tool in an agent or subquestion query engine

Although normally you'd want to do this and sort your data into specific topics per index
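For example, a rough sketch of the sub-question approach, assuming `index1` and `index2` are two of the already-built indexes (the tool names and descriptions here are made up):

Python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Wrap each separately built index as a query engine tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=index1.as_query_engine(),
        metadata=ToolMetadata(name="docs_part_1", description="Files from folder 1"),
    ),
    QueryEngineTool(
        query_engine=index2.as_query_engine(),
        metadata=ToolMetadata(name="docs_part_2", description="Files from folder 2"),
    ),
]

# The engine breaks a query into sub-questions, routes each one to the
# relevant tool, and combines the answers
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)
response = query_engine.query("What do the two document sets say about X?")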
Embeddings are super cheap though, if you need to recreate the index
So do you recommend I change the batch size and recreate the index with all the files?
I believe so! Probably worth a shot
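Concretely, that would mean rebuilding one index over all the files with the smaller batch size; a sketch under the same assumptions as above (`all_files` is a hypothetical directory holding everything):

Python
from llama_index import GPTVectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# One service context with the smaller embedding batch size
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(embed_batch_size=1)
)

# Index every file from a single directory ("all_files" is hypothetical)
docs = SimpleDirectoryReader("all_files").load_data()
index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
index.storage_context.persist(persist_dir="index")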
Will do, and I'll tell you how it goes!! Thanks for your time
@Logan M It worked!! Thanks for the advice. I have a question: does OpenAI store the data I sent? I read that they don't, but I just want to double-check. Thanks for your time
They store it for up to 30 days apparently, but they state that it's not used for any training data 🤷‍♂️
Also, glad it works now!