I'm relatively new here, but I'm happy to share a few ideas.
It sounds like you're looking at a standard RAG pipeline: loading the source data, chunking/parsing/embedding it, storing it, and then querying it. Some existing tools might accomplish what you need with clever queries and configuration right out of the box.
I like LlamaIndex. If you're looking for something more customizable than an Auto-GPT-style solution, it's a great option.
Here's what I would do:
- Familiarize yourself with the data and look for patterns and structure, which should inform your parsing/chunking strategies and increase chances of success.
- Organize the source data into a data directory, split and sorted by type. Avoid processing a large PDF all at once; if memory issues arise, consider splitting it up (see the first sketch after this list) or converting it to another format. LlamaParse might be helpful. Alternatively, you could use a VM in GCP/AWS with lots of RAM for a one-time conversion.
- Iterate over the directory with something like UnstructuredReader. Parse the text into Nodes with a splitter like SentenceSplitter or SemanticSplitterNodeParser; this populates your doc store with text. Index the nodes into a VectorStoreIndex backed by a vector store. I personally like pgvector, but use whatever works for you. (The second sketch after this list walks through this end to end.)
- Connect it to a QueryEngine with just a few lines of code.
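On the PDF-splitting point, here's a minimal sketch using pypdf; the file names and the 200-pages-per-chunk figure are just placeholders you'd tune to your data:

```python
from pypdf import PdfReader, PdfWriter

# Split a very large PDF into smaller files of at most `pages_per_chunk`
# pages each, so downstream parsing never has to hold the whole thing.
reader = PdfReader("big_document.pdf")  # placeholder path
pages_per_chunk = 200                   # placeholder size

for start in range(0, len(reader.pages), pages_per_chunk):
    writer = PdfWriter()
    for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
        writer.add_page(reader.pages[i])
    out_path = f"big_document_part_{start // pages_per_chunk:03d}.pdf"
    with open(out_path, "wb") as f:
        writer.write(f)
```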
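And here's a rough end-to-end sketch of the load → parse → index → query flow with LlamaIndex and pgvector, not a definitive implementation. The directory path, Postgres connection details, chunk sizes, table name, and example query are all assumptions you'd swap for your own, and `embed_dim` has to match whichever embedding model you've configured:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import UnstructuredReader
from llama_index.vector_stores.postgres import PGVectorStore

# Load everything under ./data, routing PDFs and HTML through UnstructuredReader.
documents = SimpleDirectoryReader(
    input_dir="./data",  # placeholder path
    recursive=True,
    file_extractor={".pdf": UnstructuredReader(), ".html": UnstructuredReader()},
).load_data()

# Chunk into Nodes; tune chunk_size/chunk_overlap to the structure you found.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# pgvector-backed storage; all connection params below are placeholders,
# and embed_dim must match your embedding model's output dimension.
vector_store = PGVectorStore.from_params(
    database="rag_db",
    host="localhost",
    port="5432",
    user="postgres",
    password="postgres",
    table_name="doc_chunks",
    embed_dim=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embed and index the nodes (uses the embedding model configured in Settings).
index = VectorStoreIndex(nodes, storage_context=storage_context)

# The QueryEngine step really is just a couple of lines.
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the key terms in the 2021 agreement?")
print(response)
```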
More advanced techniques can add features like named-entity recognition (NER) and knowledge graphs.
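If you end up wanting the graph side of that, a KnowledgeGraphIndex sketch might look roughly like this; the directory path, triplet count, and query are placeholders, and it assumes an LLM and embedding model are already configured in Settings:

```python
from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.graph_stores import SimpleGraphStore

# Load the documents and have the LLM extract (subject, relation, object)
# triplets per chunk into a simple in-memory graph store.
documents = SimpleDirectoryReader("./data").load_data()  # placeholder path
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,  # placeholder; raise for denser graphs
)

# Query the graph, pulling in the source text alongside the triplets.
kg_query_engine = kg_index.as_query_engine(include_text=True)
print(kg_query_engine.query("How are the main entities related?"))
```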
This is just my approach, but others with more experience may well have a (much) simpler solution.