Hello community. I am trying to build a RAG application focused on QA and summarization for research papers and articles. So far I have experimented with query retrieval for simple text documents. I realize the problem with research papers is that they have many other features, such as diagrams, tables, and images, on top of normal text. I am wondering what the best way forward is:
- Is it to somehow preprocess the data, for example splitting a document 1.pdf into something like 1.txt, 1.jpg, 1_table and so on?
- Create separate nodes, one each for text, images, tables, and text from webpages?
That is how I am thinking at a very high level; I guess there would be many intermediate steps. I assume this problem has already been tackled, so I was wondering what the best practices are here.
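To make the idea concrete, here is a rough sketch of what I mean by "separate nodes". It assumes the PDF has already been split into per-modality files by some extraction step that is not shown. The `Node` class, the `build_nodes` helper, the suffix mapping, and the `.csv` extension for tables are all names I made up for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical mapping from extracted-file suffix to node type (my assumption).
SUFFIX_TO_MODALITY = {
    ".txt": "text",
    ".jpg": "image",
    ".png": "image",
    ".csv": "table",
}

@dataclass
class Node:
    doc_id: str       # which source document the node came from
    modality: str     # "text", "image", or "table"
    content: str      # path to the extracted file
    metadata: dict = field(default_factory=dict)

def build_nodes(doc_id, artifact_paths):
    """Turn the files extracted from one document into typed nodes."""
    nodes = []
    for p in artifact_paths:
        path = Path(p)
        modality = SUFFIX_TO_MODALITY.get(path.suffix.lower())
        if modality is None:
            continue  # skip file types this sketch does not know about
        nodes.append(Node(doc_id, modality, str(path), {"source": f"{doc_id}.pdf"}))
    return nodes

nodes = build_nodes("1", ["1.txt", "1.jpg", "1_table.csv", "notes.md"])
for n in nodes:
    print(n.modality, n.content)
```

Each node type could then get its own embedding or summarization step before indexing, but I am not sure whether that is the recommended route.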
I am currently using a quantized Mistral-7B-instructv2 base model with gte embeddings, and working in a Google Colab environment for now, so it is a totally open-source approach. Any tips or notebooks would be highly appreciated.