
Updated 8 months ago

I am running into a weird issue when

I am running into a weird issue when trying to parse a CSV file: I hit the OpenAI token limit when generating embeddings with text-embedding-3-large. Does anything stand out in this code that would cause the issue?

Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding

embedding = OpenAIEmbedding(api_key="XXX", model="text-embedding-3-large")

node_parser = SentenceWindowNodeParser.from_defaults(window_size=3)
dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
docs = dir_reader.load_data(show_progress=True)
for doc in docs:
    doc.metadata["external_id"] = external_id

nodes = node_parser.get_nodes_from_documents(docs, show_progress=True)

print("Getting batched embeddings for nodes from embedding " + embedding.model_name + "..")
text_chunks = [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes]
embeddings = embedding.get_text_embedding_batch(text_chunks, show_progress=True)


This then errors out with

Plain Text
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 71420 tokens (71420 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
12 comments
I wouldn't use the sentence window node parser with a CSV
it's inherently not sentence-based
unless you manually parse the fields from your CSV into document objects, rather than relying on the simple directory reader
If you looked at the metadata for some of your nodes, I'm guessing it's huuuuge (71K tokens is a lot)
This is because there's no clear sentence boundary
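To confirm the oversized-node diagnosis before the API call fails, you can pre-scan the chunk sizes. A minimal sketch using the rough ~4-characters-per-token heuristic (an approximation; use tiktoken for exact counts) against the 8192-token limit from the error above:

```python
# Pre-flight check: flag chunks that will blow past the embedding model's
# context window before sending them to the API.
MAX_TOKENS = 8192  # limit reported in the BadRequestError above


def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4


def oversized_chunks(chunks):
    # Return (index, approximate token count) for every chunk over the limit.
    return [
        (i, approx_tokens(c))
        for i, c in enumerate(chunks)
        if approx_tokens(c) > MAX_TOKENS
    ]


# In the original code this would be the text_chunks list built with
# node.get_content(metadata_mode=MetadataMode.EMBED).
chunks = ["short text", "x" * 40000]
print(oversized_chunks(chunks))  # flags the second chunk (~10000 tokens)
```

Any flagged chunk is a node whose text plus embedded metadata is too large, which is what the sentence window parser produces here.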
ah got it makes sense
wondering how that complicates things on the querying side..
Querying csvs is pretty hard

I'd recommend putting it into a sqlite db and doing text-to-SQL if it's highly numerical
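A minimal sketch of the sqlite route using only the standard library (the CSV content and table name here are hypothetical; a text-to-SQL layer would generate queries like the final one from natural language):

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice, read from the real file.
csv_text = "city,population\nParis,2100000\nLyon,520000\n"

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Load the CSV into an in-memory sqlite table.
conn = sqlite3.connect(":memory:")
cols = ", ".join(f'"{c}" TEXT' for c in header)
conn.execute(f"CREATE TABLE t ({cols})")
conn.executemany(
    f"INSERT INTO t VALUES ({','.join('?' * len(header))})", data
)

# A text-to-SQL layer would translate e.g. "cities with over a million
# people" into a query like this:
result = conn.execute(
    "SELECT city FROM t WHERE CAST(population AS INT) > 1000000"
).fetchall()
print(result)  # [('Paris',)]
```

This sidesteps embeddings entirely for numerical questions, which is why it suits highly numerical CSVs.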

If it's just a CSV of question-answer pairs though, you could parse each row into a document
It really depends on the data
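The row-per-document approach can be sketched with the standard library. The `Document` dataclass below is a minimal stand-in for `llama_index.core.Document` (text plus metadata), and the CSV content is hypothetical:

```python
import csv
import io
from dataclasses import dataclass, field


# Minimal stand-in for a llama_index Document; swap in
# llama_index.core.Document in real code.
@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical question-answer CSV; in practice, open the real file.
csv_text = "question,answer\nWhat is RAG?,Retrieval-augmented generation.\n"

docs = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # One document per row keeps each chunk small and well-bounded,
    # so no single embedding input can exceed the token limit.
    docs.append(
        Document(
            text=f"Q: {row['question']}\nA: {row['answer']}",
            metadata={"source": "faq.csv"},
        )
    )

print(docs[0].text)
```

Because every document is one row, there is no need for a sentence-based parser at all, which avoids the huge-node problem from the original traceback.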
Gotcha, that's helpful. We are going to look at unstructured next
The built-in CSV reader seems to use the pandas read_csv method to take each row as a dataframe, I think
yea that sounds familiar