I really liked this stuff you guys did

I really liked this stuff you guys did on putting RAG in production. The ideas seem promising, but I still find that they fail on various edge cases, particularly in technical documents. For example, the windowing idea, where you index one sentence at a time but the LLM then sees a larger portion of the document: I end up with a lot of garbage in my index :D. Also, PDFs with tables 🙄. Are you doing some more work / discussion around this? https://docs.llamaindex.ai/en/stable/end_to_end_tutorials/dev_practices/production_rag.html
Attachment: decouple_chunks.png
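For reference, the sentence-window setup under discussion looks roughly like this in LlamaIndex. This is a minimal sketch, not the tutorial's exact code: import paths vary across versions, and the data path, window size, and query are illustrative.

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Embed one sentence per node, but stash the surrounding sentences
# (3 on each side here) in metadata for use at query time.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("./data").load_data()  # illustrative path
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, replace each retrieved sentence with its stored window
# so the LLM sees the larger surrounding passage, not just one sentence.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("What does the maintenance section say?")
print(response)
```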
PDFs with tables are a universal problem 😅 Sadly no one has quite figured it out yet, but it's on our list to improve.

Unstructured has some basic table support; we recently did a demo here:
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table.html#joint-tabular-semantic-qa-over-tesla-10q
Curious what you mean by "garbage" with the sentence window -- probably because the data isn't actually easily split into sentences?
Technical documents often have lots of short sentences that don't make sense out of context, so embedding those is just going to add noise to the db
That's a fair point -- the auto-merging retriever is another similar approach that seems to also work well in my experience, that might make more sense for your data
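A minimal sketch of that auto-merging setup, assuming LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever; the data path, chunk sizes, and top-k are illustrative, and import paths vary across versions:

```python
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

documents = SimpleDirectoryReader("./data").load_data()  # illustrative path

# Chunk the docs into a hierarchy (2048 -> 512 -> 128 tokens here);
# only the leaf chunks get embedded, parents live in the docstore.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# When enough sibling leaves are retrieved, merge them into their parent
# chunk so the LLM sees one coherent passage instead of fragments.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```

For noisy technical text, the appeal is that short fragments can still be matched precisely, but they reach the LLM merged into their parent chunk whenever enough of their siblings are retrieved.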
As for tables, I built an index manually from PDF files by converting them to docx using Word itself, and then writing some code (with the help of AI, of course) to convert the docx to markdown, with markdown-formatted tables. I then used a markdown parser/splitter from LangChain to keep sections together and to gather section metadata, and finally converted those docs to LlamaIndex Documents. This makes the tables readable by humans, since they are well-formed markdown, and hopefully readable by the AI too. A rough sketch of the pipeline is below.
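Not the commenter's actual code, but a sketch under a few assumptions: the PDF-to-docx step was already done in Word, pypandoc stands in for the hand-written docx-to-markdown code, and LangChain's MarkdownHeaderTextSplitter does the section-aware split. The file name and header mapping are illustrative.

```python
import pypandoc  # requires a pandoc install on the machine
from langchain.text_splitter import MarkdownHeaderTextSplitter
from llama_index import Document

# docx -> GitHub-flavored markdown; pandoc renders simple tables
# as pipe tables, which stay readable to humans and to the LLM.
markdown_text = pypandoc.convert_file("report.docx", "gfm")

# Split on headings so each chunk stays within one section, and keep
# the heading trail as section metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(markdown_text)

# Convert the LangChain chunks into LlamaIndex Documents.
documents = [
    Document(text=section.page_content, metadata=section.metadata)
    for section in sections
]
```

One caveat: pipe tables can't represent merged cells, so tables with complex spans may still come out mangled.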
Did you have to use the UI in Word, or was there a way to achieve this purely in code?
Would be cool if llama-index had a pipeline that did something similar
I didn't find an immediate way to do it in code, but I also didn't need to for my use case.