
Updated 2 years ago

How do I solve this "Token indices sequence length" issue?

How do I solve this issue: "Token indices sequence length is longer than the specified maximum sequence length for this model (4846 > 1024)." It keeps on coming.

This is the function that I am running:

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 512
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-embedding-ada-002", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    documents = SimpleDirectoryReader(directory_path).load_data()[0]

    text_splitter = TokenTextSplitter(separator=" ", chunk_size=2048, chunk_overlap=20)
    text_chunks = text_splitter.split_text(documents.text)
    doc_chunks = [Document(t) for t in text_chunks]

    index = GPTSimpleVectorIndex(
        doc_chunks, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )

    index.save_to_disk('index.json')

    return index
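One likely source of the 4846 > 1024 mismatch: the splitter above cuts on whitespace with chunk_size=2048, while the model counts tokenizer tokens, so a 2048-"word" chunk can easily exceed a 1024-token limit. As a minimal illustration of the idea (not the LlamaIndex API; split_into_chunks, its parameters, and the use of whitespace words as a stand-in for real tokens are all assumptions for the sketch), a chunker that bounds chunk length with overlap looks like this:

```python
def split_into_chunks(text, max_tokens=1024, overlap=20):
    """Split text into overlapping chunks of at most max_tokens words.

    Hypothetical sketch: whitespace words approximate tokenizer tokens,
    so real token counts may differ; a proper tokenizer would be more
    accurate, but the chunking logic is the same.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_tokens - overlap  # advance by chunk size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last chunk already covers the tail
    return chunks
```

Keeping max_tokens safely below the model's limit (with some headroom for the prompt) is what prevents the warning, regardless of which splitter produces the chunks.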
4 comments
Hi @Nachos, I noticed you're using the embedding model "text-embedding-ada-002" in the LLMPredictor. You should choose a valid language model: https://platform.openai.com/docs/models/gpt-3
Hey, thanks @jerryjliu0. I did try changing models earlier, but that did not help either. Is there a correct way to truncate the sequence length?
Also since I am using the said model for embeddings, I believe it is the most apt one: https://platform.openai.com/docs/guides/embeddings/use-cases