How do I solve this issue Token indices

Nachos

How do I solve this issue: "Token indices sequence length is longer than the specified maximum sequence length for this model (4846 > 1024)"? It keeps coming up.

This is the function that I am running:

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 512
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-embedding-ada-002", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    documents = SimpleDirectoryReader(directory_path).load_data()[0]

    text_splitter = TokenTextSplitter(separator=" ", chunk_size=2048, chunk_overlap=20)
    text_chunks = text_splitter.split_text(documents.text)
    doc_chunks = [Document(t) for t in text_chunks]

    index = GPTSimpleVectorIndex(
        doc_chunks, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )

    index.save_to_disk('index.json')

    return index

4 comments

jerryjliu0

Hi @Nachos, I noticed you're using the embedding model "text-embedding-ada-002" in the LLMPredictor. You should choose a valid language model instead: https://platform.openai.com/docs/models/gpt-3

Nachos

Hey, thanks @jerryjliu0. I did try changing models earlier, but that did not help either. Is there a correct way to truncate the sequence length?

Nachos

Also, since I am using that model for embeddings, I believe it is the most apt one: https://platform.openai.com/docs/guides/embeddings/use-cases

jerryjliu0

Yeah, but you shouldn't use that model in llm_predictor. See https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html#custom-embeddings for how to define a custom embedding.
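
On the earlier question about keeping sequences under a length limit: here is a rough sketch of the general idea, not the library's own splitter. It cuts a token list into overlapping chunks that never exceed a chosen size. The function name split_tokens is hypothetical, and splitting on spaces is only a stand-in for a real tokenizer, so actual token counts will differ.

```python
from typing import List


def split_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into chunks of at most chunk_size items.

    Each chunk after the first starts with the last chunk_overlap
    tokens of the previous chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Assumption: one "token" per space-separated word, purely for illustration.
text = " ".join(f"word{i}" for i in range(4846))
tokens = text.split(" ")

chunks = split_tokens(tokens, chunk_size=512, chunk_overlap=20)
longest = max(len(chunk) for chunk in chunks)
print(len(chunks), longest)
```

The point of the sketch is only that the chunk size you pick has to stay at or below whatever limit applies downstream; with a real tokenizer you would count its tokens rather than words.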
