Hey @yoelk, hmm, could you post the full stack trace?
Hey @jerryjliu0, sure. Here's the full output (note that I've printed the document store, which contains only a single 70 KB text file)
what's the code that's triggering this error?
documents = SimpleDirectoryReader(folder).load_data()
print(documents)
index = GPTSimpleVectorIndex(documents)
index.save_to_disk(filename)
st.session_state[f'{filename}'] = index
I guess the issue here is with creating the embeddings.
With other text files, after I print the document store I get the output below from the embedding process, which works fine:
2023-02-19 09:05:11.276 > [build_index_from_documents] Total LLM token usage: 0 tokens
2023-02-19 09:05:11.276 > [build_index_from_documents] Total embedding token usage: 98947 tokens
oh hm. under the hood, by default we just call the OpenAI embedding API. do you think it's just not able to recognize certain chars?
I can send you the original doc
Unfortunately I didn't manage to debug it
@yoelk nice find, that's very possible...
would you be able to send me some sample data/code? i can try to look into a fix
@jerryjliu0 It took me some time to find an example, but I tried different documents I found on the net and found the attached one, which causes the same error
thanks @yoelk ! i'll try taking a stab at this
Thanks, @jerryjliu0. I basically used the directory reader with only this file in the folder, and then I used the GPTSimpleVectorIndex with a chunk size of 256
Hey @jerryjliu0, did you get a chance to look at it? I'm still getting the same error on some files
INFO:openai:error_code=None error_message="[''] is not valid under any of the given schemas - 'input'" error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
@yoelk that usually means you're sending a blank string as a doc to OpenAI's embedding API
can you double check your Document objects and make sure that none of them contain blank strings?
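A minimal sketch of that check, assuming the text of each document is available as a plain string. The `find_blank_texts` helper and the sample data are hypothetical and not part of the library; how you pull the text out of a Document object depends on the library version you're on.

```python
from typing import Iterable, List


def find_blank_texts(texts: Iterable[str]) -> List[int]:
    """Return the positions of texts that are empty or whitespace-only."""
    return [i for i, text in enumerate(texts) if not text or not text.strip()]


# Hypothetical sample data standing in for the text of each loaded document.
sample_texts = ["first chunk", "", "   \n", "last chunk"]
blank_positions = find_blank_texts(sample_texts)
print(blank_positions)
```

Any position it reports is a string that would likely be rejected by the embedding endpoint with the `[''] is not valid` error shown above.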
@jerryjliu0 I agree, but that happens with the default chunking in GPTSimpleVectorIndex (I tried different chunk sizes and the issue reproduced with some of them). When I added LangChain's text splitter I had no issues.
Got it. You're saying you can repro this with the text above, right? I can try it out
sounds good. i was able to repro. will look into a fix!
in the meantime, yeah, you can try manually plugging in a LangChain text splitter
can confirm this also worked to fix it for me
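For anyone who wants to see the idea behind the workaround without pulling in another dependency, here is a rough stand-in sketch. It is not LangChain's splitter and not the library's default chunking; `split_into_chunks` and `max_chars` are hypothetical names, and it counts characters rather than tokens. The only point is that the chunker never emits a blank chunk, which is what the embedding call was choking on.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 256) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own chunk.
    Original line breaks are collapsed into single spaces.
    """
    chunks: List[str] = []
    current = ""
    # str.split() with no argument never yields empty strings, so runs of
    # blank lines or trailing whitespace cannot produce an empty chunk.
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input with the kind of blank runs that can trip up a chunker.
sample = "alpha\n\n\n\nbeta   gamma\n\n"
print(split_into_chunks(sample, max_chars=10))
```

The resulting chunks could then be wrapped in documents and passed to the index in place of the raw file, which is roughly what plugging in an external splitter does.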