I'm relatively new here, but I'm happy to share a few ideas.
It sounds like you're looking at a standard RAG pipeline: loading the source data, chunking/parsing/embedding it, storing it, and then querying it. Some existing tools might accomplish what you need with clever queries and configuration right out of the box.
I like LlamaIndex. If you're looking for something more customizable than an Auto-GPT-style solution, it's a great option.
Here's what I would do:
- Familiarize yourself with the data and look for patterns and structure, which should inform your parsing/chunking strategies and increase chances of success.
- Organize the source data into a data directory, split and sorted by type. Avoid processing a large PDF all at once; if memory issues arise, consider splitting it up (see the first sketch after this list) or converting it to another format. LlamaParse might be helpful. Alternatively, you could use a VM in GCP/AWS with lots of RAM for a one-time conversion.
- Iterate over the directory with something like UnstructuredReader. Parse the text into Nodes with a splitter like SentenceSplitter or SemanticSplitterNodeParser; this populates your doc store with text. Index the nodes into a VectorStoreIndex backed by a vector store. I personally like pgvector, but use whatever works for you. (The second sketch after this list walks through this end to end.)
- Connect it to a QueryEngine with just a few lines of code.
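On the PDF-splitting point, here's a minimal sketch using pypdf; the file names and the 200-pages-per-chunk figure are just placeholders you'd tune to your data:

```python
from pypdf import PdfReader, PdfWriter

# Split a very large PDF into smaller files of at most `pages_per_chunk`
# pages each, so downstream parsing never has to hold the whole thing.
reader = PdfReader("big_document.pdf")  # placeholder path
pages_per_chunk = 200                   # placeholder size

for start in range(0, len(reader.pages), pages_per_chunk):
    writer = PdfWriter()
    for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
        writer.add_page(reader.pages[i])
    out_path = f"big_document_part_{start // pages_per_chunk:03d}.pdf"
    with open(out_path, "wb") as f:
        writer.write(f)
```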
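And here's a rough end-to-end sketch of the load → parse → index → query flow with LlamaIndex and pgvector, not a definitive implementation. The directory path, Postgres connection details, chunk sizes, table name, and example query are all assumptions you'd swap for your own, and `embed_dim` has to match whichever embedding model you've configured:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import UnstructuredReader
from llama_index.vector_stores.postgres import PGVectorStore

# Load everything under ./data, routing PDFs and HTML through UnstructuredReader.
documents = SimpleDirectoryReader(
    input_dir="./data",  # placeholder path
    recursive=True,
    file_extractor={".pdf": UnstructuredReader(), ".html": UnstructuredReader()},
).load_data()

# Chunk into Nodes; tune chunk_size/chunk_overlap to the structure you found.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# pgvector-backed storage; all connection params below are placeholders,
# and embed_dim must match your embedding model's output dimension.
vector_store = PGVectorStore.from_params(
    database="rag_db",
    host="localhost",
    port="5432",
    user="postgres",
    password="postgres",
    table_name="doc_chunks",
    embed_dim=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embed and index the nodes (uses the embedding model configured in Settings).
index = VectorStoreIndex(nodes, storage_context=storage_context)

# The QueryEngine step really is just a couple of lines.
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the key terms in the 2021 agreement?")
print(response)
```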
More advanced techniques can add features like named-entity recognition (NER) and knowledge graphs.
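If you end up wanting the graph side of that, a KnowledgeGraphIndex sketch might look roughly like this; the directory path, triplet count, and query are placeholders, and it assumes an LLM and embedding model are already configured in Settings:

```python
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

# Load the documents and have the LLM extract (subject, relation, object)
# triplets per chunk into a simple in-memory graph store.
documents = SimpleDirectoryReader("./data").load_data()  # placeholder path
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,  # placeholder; raise for denser graphs
)

# Query the graph, pulling in the source text alongside the triplets.
kg_query_engine = kg_index.as_query_engine(include_text=True)
print(kg_query_engine.query("How are the main entities related?"))
```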
This is just my approach, but others with more experience may well have a (much) simpler solution.