Window

I'm encountering token limit errors with OpenAI when processing very large PDFs.

Here's my code (just the relevant snippets):

Plain Text
MODEL = "gpt-4-1106-preview"
EMBED_MODEL = "text-embedding-3-large"
llm = OpenAI(model=MODEL, temperature=0.1)
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
   original_text_metadata_key="original_text",
)
embed_model = OpenAIEmbedding()
client = qdrant_client.QdrantClient(QDRANT_URL, api_key=QDRANT_API_KEY)

pdf_reader = SimpleDirectoryReader(input_files=pdf_files)
documents = pdf_reader.load_data()
vector_store = QdrantVectorStore(client=client,            collection_name=collection_name,                  batch_size=20)
service_context = ServiceContext.from_defaults(llm=llm,                      node_parser=node_parser,                          embed_model=embed_model)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store,                         service_context=service_context)
refreshed_docs = index.refresh_ref_docs(documents)


And here's the error I'm getting:

Plain Text
WARNING - Retrying llama_index.embeddings.openai.get_embeddings in 1.6310027891256675 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 8212 tokens (8212 in your prompt; 0 for the completion). Please reduce your prompt or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}.


I've already tried using the newer OpenAI embedding model text-embedding-3-large and experimented with different values for embed_batch_size (10, 50, 100), but nothing has worked.
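For reference, this is roughly how I set the batch size in those experiments (a sketch; the exact value varied between runs):

Plain Text
# one of the attempts; embed_batch_size varied (10 / 50 / 100)
embed_model = OpenAIEmbedding(
    model=EMBED_MODEL,
    embed_batch_size=10,
)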

Does anyone have any suggestions?
This is an issue where, somewhere in your text, the window size of 3 is causing a chunk to be too large.

The likely solution is running your nodes through another splitter, like a token text splitter.
Thanks @Logan M

Would you be able to point me to some documentation for that?
Plain Text
sentence_window = SentenceWindowNodeParser(...)
token_splitter = TokenTextSplitter(chunk_size=7000)

nodes = sentence_window(documents)
nodes = token_splitter(nodes)

index = VectorStoreIndex(nodes=nodes, ...)
I don't know if you upgraded to v0.10 or not, so I left the imports out lol
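(For anyone finding this later: the imports for the snippet above depend on your version; treat this as a sketch.)

Plain Text
# pre-v0.10 (v0.9.x):
from llama_index.node_parser import SentenceWindowNodeParser, TokenTextSplitter

# v0.10+:
from llama_index.core.node_parser import SentenceWindowNodeParser, TokenTextSplitter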
Got it, thank you! I always appreciate your help @Logan M !

I'm planning to upgrade to v0.10 and run some tests this weekend!

I have one last question - the current code works fine for most small documents. The issue only occurred with a 550-page PDF. If I implement the changes you suggested, can I still use the modified code for other, smaller documents that were working before? Or might there be some side effects I should consider?
I think the above code is probably a good safeguard.

I think the root cause of the error is that some text parsed from the PDF doesn't have clear sentence boundaries for some reason.
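(One way to confirm this, as a diagnostic sketch assuming pre-0.10 imports: count tokens on the exact text each node sends to the embedding endpoint and look for outliers above the 8192 limit.)

Plain Text
import tiktoken
from llama_index.schema import MetadataMode

# cl100k_base is the tokenizer used by OpenAI's embedding models
enc = tiktoken.get_encoding("cl100k_base")

nodes = node_parser.get_nodes_from_documents(documents)
for node in nodes:
    # the text the embedding model actually receives
    text = node.get_content(metadata_mode=MetadataMode.EMBED)
    n_tokens = len(enc.encode(text))
    if n_tokens > 8192:
        print(node.node_id, n_tokens)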
Sorry for bothering again! My usage pattern is

Plain Text
service_context = ServiceContext.from_defaults(
    llm=llm,
    node_parser=node_parser,
    embed_model=embed_model,
)


and the node_parser is currently SentenceWindowNodeParser()

Because I am doing an "upsert" (update-or-create), I need to load the VectorStoreIndex from the Qdrant vector DB

Plain Text
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

and then perform the "upsert" operation
Plain Text
refreshed_docs = index.refresh_ref_docs(documents)


I am not loading the VectorStoreIndex with nodes, so I won't be able to use your example directly, @Logan M . Is there a way I can use your recommendation within my setup? Possibly chaining the node parsers somehow, if that's possible?
Hmm, tricky. Possible solutions are:
a) making a PR to add a chunk-size limit to the sentence window parser
b) subclassing the sentence window parser and modifying the logic a bit to add a chunk-size limit

Both will require looking at the source code 😅
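(A minimal sketch of option (b), assuming pre-0.10 imports; the class name, the 7000-token cap, and truncation-as-a-strategy are assumptions for illustration, not the library's prescribed extension point.)

Plain Text
import tiktoken
from llama_index.node_parser import SentenceWindowNodeParser

class CappedSentenceWindowNodeParser(SentenceWindowNodeParser):
    """Hypothetical subclass that truncates oversized nodes after parsing."""

    def get_nodes_from_documents(self, documents, **kwargs):
        nodes = super().get_nodes_from_documents(documents, **kwargs)
        enc = tiktoken.get_encoding("cl100k_base")
        max_tokens = 7000  # illustrative cap, safely below the 8192 embed limit
        for node in nodes:
            # cap the node text itself (a "sentence" with no real
            # boundaries can be arbitrarily long)
            tokens = enc.encode(node.text)
            if len(tokens) > max_tokens:
                node.text = enc.decode(tokens[:max_tokens])
            # cap the surrounding-window metadata as well
            w_tokens = enc.encode(node.metadata.get("window", ""))
            if len(w_tokens) > max_tokens:
                node.metadata["window"] = enc.decode(w_tokens[:max_tokens])
        return nodes

Because this is still a node parser, it should drop into the existing ServiceContext unchanged, so the refresh_ref_docs upsert flow above keeps working.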
@Logan M , I wanted to share an update with you. I managed to solve the issue by changing the loader. Initially, I was using this loader - https://llamahub.ai/l/file-pdf - but it was extracting a lot of content as strange symbols and unreadable text, possibly due to the encoding or the fonts in the file. I then switched to this loader - https://llamahub.ai/l/file-unstructured - which produced readable text and proper sentences, allowing the node parser to identify sentence boundaries correctly during ingestion.
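(In case it helps anyone: a sketch of how I wired in the unstructured loader, assuming the pre-0.10 download_loader mechanism; check llamahub for current usage.)

Plain Text
from llama_index import SimpleDirectoryReader, download_loader

# fetch the llamahub unstructured loader
UnstructuredReader = download_loader("UnstructuredReader")

# route .pdf files through UnstructuredReader instead of the default PDF loader
pdf_reader = SimpleDirectoryReader(
    input_files=pdf_files,
    file_extractor={".pdf": UnstructuredReader()},
)
documents = pdf_reader.load_data()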

Furthermore, I've noticed that using SentenceSplitter() results in fewer calls to the OpenAI embedding API than SentenceWindowNodeParser(), since it packs multiple sentences into each chunk instead of emitting one node per sentence, and the ingestion process completes in a reasonable amount of time for the document in question.

Also, with SentenceSplitter(), most of the nodes in Qdrant that I manually checked contained actual sentences and paragraphs. In contrast, with SentenceWindowNodeParser(), many nodes contained items like "---" or "..."
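(A sketch of the SentenceSplitter setup, assuming pre-0.10 imports; the chunk sizes here are the library defaults, not values I tuned.)

Plain Text
from llama_index.node_parser import SentenceSplitter

# packs sentences into fixed-size chunks instead of one node per sentence
node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(
    llm=llm, node_parser=node_parser, embed_model=embed_model
)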

Additionally, when a large number of calls to Qdrant are necessary, using gRPC mode helped avoid Qdrant timeouts:

Plain Text
client = qdrant_client.QdrantClient(
    host=QDRANT_HOST,
    grpc_port=6334,
    prefer_grpc=True,
    api_key=QDRANT_API_KEY,
)