Node parser: SentenceWindowNodeParser

from llama_index import Prompt, ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(window_size=3)
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm, node_parser=node_parser)
qa_template = Prompt(template)  # `template` is defined elsewhere in the original code
documents = SimpleDirectoryReader('/home/kbillesk/chat_gdpr/api/data/', recursive=True).load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)
Ohhh wow, this error is coming from the embedding model.
The 3-sentence window is pushing the text past 8,192 tokens... which is kind of crazy?
Try lowering the sentence window to 2. If that doesn't work, either your data isn't really sentence-based, or our sentence splitter is doing a very bad job 😦
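For reference, lowering the window is a one-line change. A minimal sketch, reusing the legacy llama_index objects from the question above:

node_parser = SentenceWindowNodeParser.from_defaults(window_size=2)  # narrower window per node
service_context = ServiceContext.from_defaults(llm=llm, node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)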
Yeah, I think it seems crazy too. It's probably a single file causing the problem. Thanks!
If you do figure out the text causing the issue, let me know! Just curious whether it's bad text or our sentence splitter causing the issue.
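One way to track down the offending text is to tokenize each parsed node and flag anything past the embedding model's limit (8,191 tokens for text-embedding-ada-002). A sketch assuming tiktoken and the same node_parser as above:

import tiktoken
from llama_index.schema import MetadataMode

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by OpenAI embedding models
nodes = node_parser.get_nodes_from_documents(documents)
for node in nodes:
    # Measure the exact text the embedding model will receive, window metadata included.
    n_tokens = len(enc.encode(node.get_content(metadata_mode=MetadataMode.EMBED)))
    if n_tokens > 8191:
        print(node.ref_doc_id, n_tokens)  # which source document produced the oversized node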
I checked the data, and all of it is sentence-based. Some documents are downloaded HTML that includes irrelevant markup for menus; others are PDFs. I'm not entirely sure how SimpleDirectoryReader preprocesses the files, but perhaps I should clean up the HTML first?
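Stripping navigation and menu markup before indexing is a reasonable first step. A rough sketch using BeautifulSoup; the list of tags to drop is an assumption and will vary by site:

from pathlib import Path
from bs4 import BeautifulSoup
from llama_index import Document

documents = []
for path in Path('/home/kbillesk/chat_gdpr/api/data/').rglob('*.html'):
    soup = BeautifulSoup(path.read_text(encoding='utf-8'), 'html.parser')
    # Drop elements that usually hold menus and boilerplate rather than content.
    for tag in soup(['nav', 'header', 'footer', 'script', 'style']):
        tag.decompose()
    documents.append(Document(text=soup.get_text(separator=' ', strip=True)))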