Tables plus text

harshit_alpha

Hey community members
I need some help from you guys. I am trying to create a bot for financial documents.

import logging
import os
from pathlib import Path

from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, ServiceContext, download_loader

logger = logging.getLogger(__name__)
INDEX_FILE = "index.json"  # assumed name, inferred from the log messages below

def ask(file):
    print(" Loading...")
    PDFReader = download_loader("PDFReader")
    loader = PDFReader()
    documents = loader.load_data(file=Path(file))
    print("Path: ", Path(file))

    # Check if the index file exists
    if os.path.exists(INDEX_FILE):
        # Load the index from the file
        logger.info("found index.json in the directory")
        index = GPTSimpleVectorIndex.load_from_disk(INDEX_FILE)
    else:
        logger.info("didnt find index.json in the directory")
        llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))

        service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=1024)
        index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

        # Save the index to the file
        index.save_to_disk(INDEX_FILE)

Above is my code snippet for generating an index for a PDF. I used PDFReader from LlamaHub to extract the text from the PDF. The bot answers well when asked about the text, but it fails when I ask for a value from a table in the PDF.

I tried different OpenAI text models, the best one being text-davinci-003. The bot still cannot answer questions about the values in the PDF's tables. This is because PDFReader simply converts the content of the PDF to plain text (it does not take any special steps to preserve the table structure). I want to know how I can successfully index both the text and the tables in the PDF using LangChain and LlamaIndex.

11 comments

Logan M

This is pretty tricky to do... I think the unstructuredIO reader on llamahub does some extra stuff to better parse different elements in a pdf?

Logan M

Might be worth a shot to see what it does

harshit_alpha

@Logan M I checked this out, and it's still giving me similar results. It doesn't perform anywhere near well on table-related questions. Thanks for the suggestion though, but we need to look for an alternative.

harshit_alpha

@everyone Can you guys please look into this and help me with table indexing inside the pdfs, so that my bot can answer for the values present inside the tables too?

Logan M

Like I said, this is an extremely tricky problem.

You need to first identify the table, and then extract the text + format it. There are some libraries out there that do this (unstructured is one of them, but I guess it didn't work? Lol)
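A minimal sketch of the "format it" step, assuming the table has already been extracted as a header plus a list of rows. The function name `table_rows_to_text` and the sample figures are made up for illustration. The idea is to pair every cell with its column name so a value keeps its context after the text is chunked and embedded.

```python
from typing import List, Sequence


def table_rows_to_text(header: Sequence[str], rows: Sequence[Sequence[str]]) -> List[str]:
    """Turn each table row into one self-describing line of text."""
    lines = []
    for row in rows:
        # Pair every cell with its column name, e.g. "FY2022: 120".
        cells = [f"{col}: {val}" for col, val in zip(header, row)]
        lines.append("; ".join(cells))
    return lines


# Made-up example table, not taken from any real filing.
header = ["Item", "FY2021", "FY2022"]
rows = [["Revenue", "100", "120"], ["Net income", "10", "15"]]
table_text = table_rows_to_text(header, rows)
print("\n".join(table_text))
```

Whether one line per row or one block per table retrieves better probably depends on the documents, so this is worth trying both ways.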

Logan M

This might help? https://camelot-py.readthedocs.io/en/master/
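A rough sketch of what using Camelot for the extraction step could look like, assuming the `camelot-py` package is installed and the PDF is text-based (Camelot does not read scanned pages). The helper names, the file name in the usage comment, and the page range are hypothetical. `camelot.read_pdf` returns a list of tables, each exposing a pandas DataFrame as `.df`; here every table is serialised to CSV text so it can be indexed like any other string.

```python
import csv
import io
from typing import List, Sequence


def rows_to_csv(rows: Sequence[Sequence[str]]) -> str:
    """Serialise table rows to CSV text (one line per row)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def extract_table_texts(pdf_path: str, pages: str = "1") -> List[str]:
    """Return one CSV string per table that Camelot detects on the given pages."""
    import camelot  # third-party; imported lazily so rows_to_csv works without it

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each table's .df holds the cell strings; convert it to plain row lists.
    return [rows_to_csv(t.df.values.tolist()) for t in tables]


# Usage (hypothetical file name):
#   table_texts = extract_table_texts("report.pdf", pages="1-3")
print(rows_to_csv([["Item", "FY2022"], ["Revenue", "120"]]))
```

Camelot's default `lattice` mode expects tables with ruling lines; passing `flavor="stream"` to `read_pdf` is the alternative for tables separated only by whitespace.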

Janis

You can do this with deepdoctection, as the framework supports detecting and segmenting tables. There is a loader available on LlamaHub, but you cannot index your tables separately yet. You can also check the suggestion I made in the integrations channel, where I propose making it possible to index everything based on the PDF layout.

harshit_alpha

Hi guys, I was thinking: where I am creating the Document object, is there a way I can create it myself (without using the LlamaIndex loaders)? Then we could make some changes to the text and the tables. Or do you guys have any suggestions?

Logan M

You can create the Document object yourself, like this

from llama_index import Document
doc = Document("doc text", doc_id="my id", extra_info={"key": "val"})

The extra info and ID are optional

harshit_alpha

Thanks @Logan M, that might actually help. Could you please share a link to any sample code (if you have any) where the Document object is created manually?

Logan M

Not really any sample code, but here's a small example of reading a json file

from llama_index import Document
import json

with open("file.json", "r") as f:
  data = json.load(f)

documents = []
for key, val in data.items():
  documents.append(Document(f"{key}: {val}"))
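The same pattern could be applied to the PDF case: keep the page text and the extracted table text as separate records, tag each with `extra_info`, and wrap them in `Document` objects. This is only a sketch; `build_records`, `to_documents`, and the `kind`/`page`/`table` keys are hypothetical names, and the `Document(...)` call mirrors the constructor shown earlier in this thread, which may differ in newer llama_index releases.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, Dict[str, str]]


def build_records(page_texts: List[str], table_texts: List[str]) -> List[Record]:
    """Pair every piece of text with metadata saying where it came from."""
    records: List[Record] = []
    for i, text in enumerate(page_texts, start=1):
        records.append((text, {"kind": "text", "page": str(i)}))
    for i, text in enumerate(table_texts, start=1):
        records.append((text, {"kind": "table", "table": str(i)}))
    return records


def to_documents(records: List[Record]):
    """Wrap the records in llama_index Document objects."""
    from llama_index import Document  # imported lazily; requires llama_index

    return [Document(text, extra_info=info) for text, info in records]


# Made-up inputs for illustration.
records = build_records(
    ["Intro paragraph from page 1."],
    ["Item: Revenue; FY2022: 120"],
)
print(records)
```

The resulting list from `to_documents(records)` would then go to `GPTSimpleVectorIndex.from_documents` in place of the loader output.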