
Updated 8 months ago

I am running into a weird issue when

I am running into a weird issue when trying to parse a CSV file: I hit the OpenAI token limit when generating embeddings with text-embedding-3-large. Does anything stand out in this code that would cause the issue?

Plain Text
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding

embedding = OpenAIEmbedding(api_key="XXX", model="text-embedding-3-large")

node_parser = SentenceWindowNodeParser.from_defaults(window_size=3)
dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
docs = dir_reader.load_data(show_progress=True)
for doc in docs:
    doc.metadata["external_id"] = external_id

nodes = node_parser.get_nodes_from_documents(docs, show_progress=True)

print("Getting batched embeddings for nodes from embedding " + embedding.model_name + "..")
text_chunks = [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes]
embeddings = embedding.get_text_embedding_batch(text_chunks, show_progress=True)


This then errors out with

Plain Text
openai.BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 71420 tokens (71420 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
12 comments
I wouldn't use the sentence window node parser with a CSV
it's inherently not sentence-based
unless you manually parse the fields from your CSV into document objects, rather than relying on the simple directory reader
If you looked at the metadata for some of your nodes, I'm guessing it's huuuuge (71K tokens is a lot)
This is because there's no clear sentence boundary
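To confirm the oversized-node diagnosis before the API call fails, you can pre-scan the chunk sizes. A minimal sketch using the rough ~4-characters-per-token heuristic (an approximation; use tiktoken for exact counts) against the 8192-token limit from the error above:

```python
# Pre-flight check: flag chunks that will blow past the embedding model's
# context window before sending them to the API.
MAX_TOKENS = 8192  # limit reported in the BadRequestError above


def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4


def oversized_chunks(chunks):
    # Return (index, approximate token count) for every chunk over the limit.
    return [
        (i, approx_tokens(c))
        for i, c in enumerate(chunks)
        if approx_tokens(c) > MAX_TOKENS
    ]


# In the original code this would be the text_chunks list built with
# node.get_content(metadata_mode=MetadataMode.EMBED).
chunks = ["short text", "x" * 40000]
print(oversized_chunks(chunks))  # flags the second chunk (~10000 tokens)
```

Any flagged chunk is a node whose text plus embedded metadata is too large, which is what the sentence window parser produces here.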
ah got it makes sense
wondering how that complicates things on the querying side..
Querying csvs is pretty hard

I'd recommend putting it into a sqlite db and doing text-to-SQL if it's highly numerical
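A minimal sketch of the sqlite route using only the standard library (the CSV content and table name here are hypothetical; a text-to-SQL layer would generate queries like the final one from natural language):

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice, read from the real file.
csv_text = "city,population\nParis,2100000\nLyon,520000\n"

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

# Load the CSV into an in-memory sqlite table.
conn = sqlite3.connect(":memory:")
cols = ", ".join(f'"{c}" TEXT' for c in header)
conn.execute(f"CREATE TABLE t ({cols})")
conn.executemany(
    f"INSERT INTO t VALUES ({','.join('?' * len(header))})", data
)

# A text-to-SQL layer would translate e.g. "cities with over a million
# people" into a query like this:
result = conn.execute(
    "SELECT city FROM t WHERE CAST(population AS INT) > 1000000"
).fetchall()
print(result)  # [('Paris',)]
```

This sidesteps embeddings entirely for numerical questions, which is why it suits highly numerical CSVs.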

If it's just a CSV of question-answer pairs though, you could parse each row into a document
It really depends on the data
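The row-per-document approach can be sketched with the standard library. The `Document` dataclass below is a minimal stand-in for `llama_index.core.Document` (text plus metadata), and the CSV content is hypothetical:

```python
import csv
import io
from dataclasses import dataclass, field


# Minimal stand-in for a llama_index Document; swap in
# llama_index.core.Document in real code.
@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical question-answer CSV; in practice, open the real file.
csv_text = "question,answer\nWhat is RAG?,Retrieval-augmented generation.\n"

docs = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # One document per row keeps each chunk small and well-bounded,
    # so no single embedding input can exceed the token limit.
    docs.append(
        Document(
            text=f"Q: {row['question']}\nA: {row['answer']}",
            metadata={"source": "faq.csv"},
        )
    )

print(docs[0].text)
```

Because every document is one row, there is no need for a sentence-based parser at all, which avoids the huge-node problem from the original traceback.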
Gotcha, that's helpful. We are going to look at unstructured next
The built-in CSV reader seems to use the pandas read_csv method to take each row as a dataframe, I think
yea that sounds familiar