
hello there, i have a question about enhancing data extraction quality from scanned documents on top of a typical RAG + reranker pipeline. i am currently using LlamaParse to convert tabular data into markdown tables, then indexing them. There are cases where the tabular data is not converted properly (e.g. table fonts are too small, the document was not scanned properly, etc.), making the markdown tables unusable. Since I am using gpt-4o, which can also take image input, in my pipeline:

questions:
  • Can I also extract the table as an image and put it in my pipeline? So if the markdown table is unusable, gpt-4o can also look at the image for data extraction.
  • Do I also have to manage how I chunk the markdown table and image in sequence if I have more than one table?
  1. yes! our multimodal tutorial should give you a sense of how to use our multimodal mode to extract both text and images (rough sketch below): https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb
  2. i'm not entirely sure what you mean, is this about making sure the tables are in order before you feed it to the LLM?
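for context, the parsing part of that notebook looks roughly like this. a minimal sketch, where the file name and download path are placeholders and the exact JSON keys can vary between llama_parse versions:

Python
from llama_parse import LlamaParse

# parse in markdown mode and also pull down the page screenshots
parser = LlamaParse(result_type="markdown")

json_results = parser.get_json_result("scanned_invoice.pdf")  # placeholder file name
pages = json_results[0]["pages"]  # one dict per page (markdown text, page number, ...)

# download the page-level screenshots; each entry points at an image file on disk
image_dicts = parser.get_images(json_results, download_path="parsed_images")

for page in pages:
    print(page["page"], page["md"][:200])  # page number and a preview of the markdown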
yes. i also realised that during the retrieval and reranking process, the table rows are not returned in order (e.g. rows from page x are returned instead of starting from page 1 onwards). i use the following code to check. could it be the way i am chunking my data?

Python
from llama_index.core.response.notebook_utils import display_source_node

# print each retrieved source node in the order the retriever returned it
for i, n in enumerate(response_table.source_nodes):
    print(f'Source node {i+1}')
    display_source_node(n, source_length=20000)
hey @jerryjliu0 i was reading the notebook but i don't understand why the paths of the images are being indexed instead of the images themselves. for my use case, i am reading documents directly from Azure Blob Storage and then using LlamaParse to parse them. I can save the document from blob storage as a PIL object, but what I am confused about is how I can index it together with the parsed result from LlamaParse?
@galvangjx re: the first point, that's just because by definition retrieval returns chunks ordered by embedding similarity instead of by page number. are you saying you'd want the chunks ordered by page number?
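if you do want page order, one option is to re-sort the retrieved nodes yourself before handing them to the LLM. a minimal sketch, assuming each node carries its page number in metadata under a key like "page_label" (the key depends on how you built the nodes):

Python
# re-sort retrieved nodes by page number instead of by similarity score
sorted_nodes = sorted(
    response_table.source_nodes,
    key=lambda n: int(n.node.metadata.get("page_label", 0)),
)

for i, n in enumerate(sorted_nodes):
    print(f"Source node {i+1} (page {n.node.metadata.get('page_label')})")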
re: the second point, llamaparse lets you download the page screenshots along with the text. in the notebook, we embed the text chunks but attach a link to the page screenshots through the metadata of the text chunks. if you want, you can store the page screenshots in blob storage instead of the local file system; you would just need the URL to blob storage.

then before calling the LLM during query time, you get the image paths attached to each text node and use them to load the original images
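to make that concrete, here is a rough sketch of both halves. it assumes the `pages` / `image_dicts` variables from the parsing sketch above, that there is one screenshot per page in page order (in practice you may need to match them on page number), and that the dict keys match your llama_parse version:

Python
from llama_index.core.schema import ImageNode, TextNode

# build time: one text node per page, with the page screenshot path stored in metadata
text_nodes = []
for page, image_dict in zip(pages, image_dicts):
    text_nodes.append(
        TextNode(
            text=page["md"],
            metadata={"page_num": page["page"], "image_path": image_dict["path"]},
        )
    )

# query time: after retrieval, recover the screenshot for each retrieved chunk
def load_image_nodes(retrieved_nodes):
    return [
        ImageNode(image_path=n.node.metadata["image_path"])
        for n in retrieved_nodes
        if "image_path" in n.node.metadata
    ]

the resulting ImageNodes (or the raw images they point to) are what you pass to the multimodal LLM alongside the retrieved text when synthesizing the answer.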
in a way, yes, i would like to chunk by page number. but that doesn't matter too much, because i preprocess the llama parse results by condensing and retaining only key information (e.g. basic invoice data and just the tables). what i was concerned about is that if i chunk tables from different pages, i want to ensure i am attaching the correct image of each table as metadata
while using AzStorageBlobReader with llamaparse as the file_extractor, i wasn't able to download the image screenshots and came across a '<bytes/buffer>' error. i reported it in the llamacloud chat though.
sorry, i'm still not getting how i can give the query engine the image during query time 😅. all i see is that just the link to each page screenshot is being passed in while creating a custom multimodal query engine. unless you mean the tool created in the later part does the actual loading of the screenshots?

this tutorial shows the images being embedded along with the text into a vector db, and it makes more sense to me - https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/#build-multi-modal-index-and-vector-store-to-index-both-text-and-images
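for reference, the core of that tutorial is roughly the following sketch; the qdrant path and collection names are placeholders, and a mixed list of your own Document / ImageDocument objects can be passed in place of the directory loader:

Python
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# separate vector stores for text embeddings and image embeddings
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(vector_store=text_store, image_store=image_store)

# the tutorial loads a folder containing both text and images; your own
# Document and ImageDocument objects can be indexed the same way
documents = SimpleDirectoryReader("./mixed_data/").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents, storage_context=storage_context)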
hey @jerryjliu0 i've managed to set up a multimodal rag pipeline following the tutorial linked in my previous message. now i'm trying to put everything together using my own data. I have a list of ImageDocument objects, which are the images i loaded from azure blob storage, and also a Document containing the markdown text produced by llamaparse.

when building the MultiModalVectorStoreIndex, my loaded images (from ImageDocument) are empty. can you help me understand why that is?

ValueError: Cannot build index from nodes with no content. Please ensure all nodes have content.
The difference is: the metadata of the ImageDocument loaded from azure blob storage is empty, but the metadata of the ImageDocument loaded from the local file system is populated. might this be a bug?
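one thing worth checking is whether the ImageDocument objects built from the blob bytes actually carry image content (a base64 string, a local path, or a URL); if none of those are set, the index has nothing to embed and raises exactly that error. a rough sketch, assuming `blob_client` is an already-configured azure BlobClient and the names are placeholders:

Python
import base64

from llama_index.core.schema import ImageDocument

# download the raw image bytes from Azure Blob Storage
image_bytes = blob_client.download_blob().readall()

# give the ImageDocument actual content: here a base64 string, but image_path
# or image_url would also work
image_doc = ImageDocument(
    image=base64.b64encode(image_bytes).decode("utf-8"),
    image_mimetype="image/png",            # adjust to the real format
    metadata={"file_name": "page_1.png"},  # placeholder metadata
)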