Hello community. I am trying to build a RAG application focused on QA and summarization for research papers and articles. So far I have experimented with query retrieval for simple text documents. I realize the problem with research papers is that they contain many other features such as diagrams, tables, and images on top of normal text. I am wondering what the best way forward is:
  1. Somehow preprocess the data, so that for example a document 1.pdf is split into something like 1.txt, 1.jpg, 1_table, and so on?
  2. Create separate nodes, one each for text, images, tables, and text from web pages?
That is how I am thinking about it at a very high level, and I guess there would be many intermediate steps; a rough sketch of what I mean by option 2 is below. I assume this problem has already been tackled, so I was wondering about the best practices here.
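A rough sketch of what option 2 could look like, assuming the per-modality content (plain text, a table serialized to Markdown, a figure caption) has already been extracted in an earlier step; the file name, placeholder strings, and metadata keys are just for illustration:
Python
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# Placeholder content standing in for whatever the extraction step produced
body_text = "Abstract: ..."
table_as_markdown = "| method | accuracy |\n|---|---|\n| ours | 0.91 |"
figure_caption = "Figure 1: overview of the proposed architecture."

# One node per modality, tagged so retrieval hits can be traced back to their source
nodes = [
    TextNode(text=body_text, metadata={"source": "1.pdf", "modality": "text"}),
    TextNode(text=table_as_markdown, metadata={"source": "1.pdf", "modality": "table"}),
    TextNode(text=figure_caption, metadata={"source": "1.pdf", "modality": "image_caption"}),
]

# service_context here would be the local Mistral + gte setup described below
index = VectorStoreIndex(nodes, service_context=service_context)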

I am currently using the Mistral-7B-Instruct-v0.2 base model (quantized) with gte embeddings and working in a Google Colab environment for now, so it is a totally open-source approach. Any tips or notebooks would be highly appreciated.
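For reference, a minimal sketch of how that kind of fully open-source setup can be wired into a llama_index ServiceContext; the exact model names and the 4-bit quantization kwarg are assumptions and may need adjusting to whatever fits the Colab GPU:
Python
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.embeddings import HuggingFaceEmbedding

# Quantized Mistral-7B-Instruct as the generator (load_in_4bit needs bitsandbytes installed)
llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
    context_window=4096,
    max_new_tokens=512,
    model_kwargs={"load_in_4bit": True},
)

# gte embeddings for retrieval
embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large")

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)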
Yea, this is just a document ingestion question. I think unstructured may have some utilities for pulling non-text objects out of PDFs.

From there, you can either generate captions for these objects or embed the images using CLIP, etc. There are a few approaches for this.
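To make that concrete, here is one possible (untested) ingestion pass along those lines: unstructured partitions the PDF into text, table, and image elements, and a CLIP checkpoint loaded through sentence-transformers embeds the extracted figures. The hi_res strategy and the image-extraction kwargs are assumptions; they vary a bit between unstructured versions and need extra system dependencies (poppler, detectron2):
Python
from unstructured.partition.pdf import partition_pdf
from sentence_transformers import SentenceTransformer
from PIL import Image
import glob

# Layout-aware parsing; hi_res is needed to detect tables and figures
elements = partition_pdf(
    filename="1.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="extracted_images",
)

texts = [el.text for el in elements if el.category == "NarrativeText"]
tables = [el.metadata.text_as_html for el in elements if el.category == "Table"]

# Embed the extracted figures with CLIP so they can live in the same vector store
clip = SentenceTransformer("clip-ViT-B-32")
image_embeddings = [
    clip.encode(Image.open(path)) for path in sorted(glob.glob("extracted_images/*"))
]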
Yes, but it is also a crucial step if I understand it correctly. I was wondering whether this library, https://github.com/deepdoctection/deepdoctection, could be used in conjunction with LlamaIndex?
I guess there should be case-specific ingestion processes before even applying RAG principles.
Also, a follow-up question: is there some kind of blog / README for best practices in RAG implementation?
Hi, I have a question about document management. I am using Mistral-7B + the gte-large embedding model, and I am using the following code to index and query the documents. What is the way to update the indexes? The example Colab notebook, https://docs.llamaindex.ai/en/stable/module_guides/indexing/document_management.html, is for OpenAI, right?
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("/content/Data/").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

This had one document, test1.txt, and now I want to add another document, test2.txt, and reindex. Is there a way to do that for non-OpenAI models?
You can do index.insert(document) for new documents.

Or, if each document has a consistent ID, you can use index.refresh_ref_docs(documents). If you set filename_as_id=True in the directory reader, the ID should be consistent.
Do you have an example I can quickly have a look at?
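Not an official notebook, but a minimal sketch of the two options above with a local (non-OpenAI) service_context; the directory path is just the one from your snippet:
Python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# filename_as_id=True gives every document a stable ID (its file path),
# which is what refresh_ref_docs uses to spot new or changed files
documents = SimpleDirectoryReader("/content/Data/", filename_as_id=True).load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Option 1: insert a single new document
new_doc = SimpleDirectoryReader(
    input_files=["/content/Data/test2.txt"], filename_as_id=True
).load_data()[0]
index.insert(new_doc)

# Option 2: reload the whole folder and refresh; returns one bool per document,
# True where it was newly inserted or updated
documents = SimpleDirectoryReader("/content/Data/", filename_as_id=True).load_data()
refreshed = index.refresh_ref_docs(documents)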