Find answers from the community

Nachos
Offline, last seen 2 months ago
Joined September 25, 2024
How do I solve this issue: "Token indices sequence length is longer than the specified maximum sequence length for this model (4846 > 1024)." The warning keeps appearing every time I run it.

This is the function that I am running:

# imports for the legacy llama_index / langchain API these calls come from
from langchain.llms import OpenAI
from llama_index import (
    Document,
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    SimpleDirectoryReader,
)
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 512
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-embedding-ada-002", max_tokens=num_outputs))
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # load the first document from the directory
    document = SimpleDirectoryReader(directory_path).load_data()[0]

    # split the document text into overlapping token chunks
    text_splitter = TokenTextSplitter(separator=" ", chunk_size=2048, chunk_overlap=20)
    text_chunks = text_splitter.split_text(document.text)
    doc_chunks = [Document(t) for t in text_chunks]

    index = GPTSimpleVectorIndex(
        doc_chunks, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )

    index.save_to_disk('index.json')

    return index
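For reference, the chunking step that `TokenTextSplitter(chunk_size=2048, chunk_overlap=20)` performs can be sketched in plain Python. This is a hypothetical stand-in, not the library's implementation: it splits on whitespace words instead of real tokenizer tokens, and the function name `split_with_overlap` is made up for illustration. The point is to show how `chunk_size` and `chunk_overlap` interact, since a `chunk_size` (2048) larger than the model's window is one way to end up with over-long token sequences.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=" "):
    """Split `text` into pieces of at most `chunk_size` units (whitespace
    words here, as a proxy for tokenizer tokens). Each new chunk re-uses
    the last `chunk_overlap` units of the previous chunk."""
    tokens = text.split(separator)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(separator.join(tokens[start:end]))
        if end == len(tokens):
            break
        # step back so the next chunk overlaps the tail of this one
        start = end - chunk_overlap
    return chunks

# example: 100 words, chunks of 30 with an overlap of 5
text = " ".join(str(i) for i in range(100))
chunks = split_with_overlap(text, chunk_size=30, chunk_overlap=5)
```

With these numbers the overlap means each chunk's first 5 words repeat the previous chunk's last 5, so whichever `chunk_size` you pick must stay under the model's maximum sequence length after overlap is accounted for.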
4 comments