Rate limit

At a glance

The community member is new to creating VectorStore indexes using the OpenAI API and is hitting an OpenAI token rate limit when trying to index a large number of files. They want to learn how to create larger VectorStore indexes or merge different VectorStores.

The comments suggest lowering the embedding batch size, which reduces the number of tokens sent per request and can help stay under the rate limit. The original poster confirms they have a paid OpenAI account but still reach the limit, so they had to divide the files into smaller folders, resulting in multiple VectorStores.

The community members discuss the possibility of merging different VectorStores, but there doesn't seem to be a direct way to do this. Instead, they suggest wrapping each index as a tool in an agent or subquestion query engine, although this may not be the ideal solution.

The community members recommend that the original poster try changing the batch size and recreating the index with all the files, which seems to have worked based on the final comments.

There is also a discussion about whether OpenAI stores the data sent through the API, and the community members confirm that OpenAI retains it for up to 30 days but does not use it for training.

Hi guys, I am really new to this type of technology. I have managed to create a VectorStore index from my data using the OpenAI API, and I am really interested in learning how to create larger VectorStore indexes. The reason is that I now have a lot of files, and when I try to build the index I get an OpenAI token limit error. So I was wondering how I can merge/load different VectorStores, or how I can load a lot more files.

This is how I am loading the files (failing code due to the token limit):

Python
# Legacy llama_index (gpt-index) API
from langchain.llms import OpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)


def construct_index(directory_path):
    num_outputs = 1024

    # LLM used to synthesize answers at query time
    llm_predictor = LLMPredictor(
        llm=OpenAI(
            temperature=0.1, model_name="text-davinci-003", max_tokens=num_outputs
        )
    )

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

    # Load every file in the directory as a Document
    docs = SimpleDirectoryReader(directory_path).load_data()

    # from_documents() chunks the documents into nodes and embeds them;
    # the constructor's `nodes` argument expects Node objects, not Documents
    index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)

    # Persist the index to disk so it can be reloaded later
    index.storage_context.persist(persist_dir="index")

    return index
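
For reference, an index persisted this way can be reloaded later without re-embedding anything; a minimal sketch using the same legacy llama_index API and the persist_dir from above:

Python
from llama_index import StorageContext, load_index_from_storage

# Point the storage context at the directory the index was persisted to
storage_context = StorageContext.from_defaults(persist_dir="index")

# Rebuild the in-memory index from the persisted stores
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()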


Thanks in advance for the help 😉
12 comments
Do you have a paid OpenAI account? I know the trial usage is very rate-limited

In any case, you can try lowering the embedding batch size (the default is 10)
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#batch-size
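For example, a minimal sketch assuming the legacy llama_index API, where the embedding model is configured through the ServiceContext:

Python
from llama_index import ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding

# Embed 1 chunk per API request instead of the default 10, so each
# request stays under the tokens-per-minute rate limit (just slower)
embed_model = OpenAIEmbedding(embed_batch_size=1)
service_context = ServiceContext.from_defaults(embed_model=embed_model)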
Yes, I have a paid OpenAI account, but I still reach the limit. So I have to divide the files into 4 smaller folders, but this results in having 4 different vector stores
Lowering the batch size will likely help then (like lowering to 1)

Although it will just be slower
Is there a way I can merge different VectorStores?
Yeah, there's no real merge function, but you could wrap each index as a tool in an agent or subquestion query engine

Although normally you'd want to do this and sort your data into specific topics per index
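For example, a rough sketch of the sub-question approach, assuming `index1` and `index2` are two of the already-built indexes (the tool names and descriptions here are made up):

Python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Wrap each separately built index as a query engine tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=index1.as_query_engine(),
        metadata=ToolMetadata(name="docs_part_1", description="Files from folder 1"),
    ),
    QueryEngineTool(
        query_engine=index2.as_query_engine(),
        metadata=ToolMetadata(name="docs_part_2", description="Files from folder 2"),
    ),
]

# The engine breaks a query into sub-questions, routes each one to the
# relevant tool, and combines the answers
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)
response = query_engine.query("What do the two document sets say about X?")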
Embeddings are super cheap though, if you need to recreate the index
So do you recommend I change the batch size and recreate the index with all the files?
I believe so! Probably worth a shot
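Concretely, that would mean rebuilding one index over all the files with the smaller batch size; a sketch under the same assumptions as above (`all_files` is a hypothetical directory holding everything):

Python
from llama_index import GPTVectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding

# One service context with the smaller embedding batch size
service_context = ServiceContext.from_defaults(
    embed_model=OpenAIEmbedding(embed_batch_size=1)
)

# Index every file from a single directory ("all_files" is hypothetical)
docs = SimpleDirectoryReader("all_files").load_data()
index = GPTVectorStoreIndex.from_documents(docs, service_context=service_context)
index.storage_context.persist(persist_dir="index")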
Will do, and I'll tell you how it goes!! Thanks for your time
@Logan M It worked!! Thanks for the advice. I have a question: does OpenAI store the data I sent? I read that they don't, but I just want to double-check. Thanks for your time
They store it for up to 30 days apparently, but they state that it's not used for any training data 🤷‍♂️
Also, glad it works now!