I really liked this stuff you guys did

I really liked this stuff you guys did on putting RAG in production. The ideas seem promising, but I still find that they fail on various edge cases, particularly in technical documents. For example, the windowing idea, where you index one sentence at a time but the LLM then sees a larger portion of the document: I end up with a lot of garbage in my index :D. Also, PDFs with tables 🙄. Are you doing some more work / discussion around this? https://docs.llamaindex.ai/en/stable/end_to_end_tutorials/dev_practices/production_rag.html
Attachment: decouple_chunks.png
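For reference, the sentence-window setup under discussion looks roughly like this in LlamaIndex. This is a minimal sketch, not the tutorial's exact code: import paths vary across versions, and the data path, window size, and query are illustrative.

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Embed one sentence per node, but stash the surrounding sentences
# (3 on each side here) in metadata for use at query time.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("./data").load_data()  # illustrative path
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, replace each retrieved sentence with its stored window
# so the LLM sees the larger surrounding passage, not just one sentence.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("What does the maintenance section say?")
print(response)
```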
PDFs with tables are a universal problem 😅 Sadly no one has quite figured it out yet, but it's on our list to improve.

Unstructured has some basic table support; we recently did a demo here:
https://docs.llamaindex.ai/en/stable/examples/query_engine/sec_tables/tesla_10q_table.html#joint-tabular-semantic-qa-over-tesla-10q
Curious what you mean by "garbage" with the sentence window -- probably because the data isn't actually easily split into sentences?
Technical documents often have lots of short sentences that don't make sense out of context, so embedding those is just going to add noise to the db
That's a fair point -- the auto-merging retriever is another similar approach that seems to also work well in my experience, that might make more sense for your data
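A minimal sketch of that auto-merging setup, assuming LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever; the data path, chunk sizes, and top-k are illustrative, and import paths vary across versions:

```python
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

documents = SimpleDirectoryReader("./data").load_data()  # illustrative path

# Chunk the docs into a hierarchy (2048 -> 512 -> 128 tokens here);
# only the leaf chunks get embedded, parents live in the docstore.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# When enough sibling leaves are retrieved, merge them into their parent
# chunk so the LLM sees one coherent passage instead of fragments.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),
    storage_context,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(retriever)
```

For noisy technical text, the appeal is that short fragments can still be matched precisely, but they reach the LLM merged into their parent chunk whenever enough of their siblings are retrieved.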
As for tables, I built an index manually from PDF files by converting them to docx using Word itself, and then writing some code (with the help of AI, of course) to convert the docx to markdown, with markdown-formatted tables. I then used a markdown parser/splitter from LangChain to keep sections together and to gather section metadata, and finally converted those docs to LlamaIndex Documents. This makes the tables readable by humans, since they are well-formed markdown, and hopefully readable by the AI too. A rough sketch of the pipeline is below.
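Not the commenter's actual code, but a sketch under a few assumptions: the PDF-to-docx step was already done in Word, pypandoc stands in for the hand-written docx-to-markdown code, and LangChain's MarkdownHeaderTextSplitter does the section-aware split. The file name and header mapping are illustrative.

```python
import pypandoc  # requires a pandoc install on the machine
from langchain.text_splitter import MarkdownHeaderTextSplitter
from llama_index import Document

# docx -> GitHub-flavored markdown; pandoc renders simple tables
# as pipe tables, which stay readable to humans and to the LLM.
markdown_text = pypandoc.convert_file("report.docx", "gfm")

# Split on headings so each chunk stays within one section, and keep
# the heading trail as section metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(markdown_text)

# Convert the LangChain chunks into LlamaIndex Documents.
documents = [
    Document(text=section.page_content, metadata=section.metadata)
    for section in sections
]
```

One caveat: pipe tables can't represent merged cells, so tables with complex spans may still come out mangled.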
Did you have to use the UI in Word, or was there a way to achieve this purely in code?
Would be cool if llama-index had a pipeline that did something similar
I didn't find an immediate way to do it in code, but I also didn't need to for my use case.