Guidance

Good evening folks - I am new to LlamaIndex and just playing around. Currently I have built a proof of concept that can load a PDF file (using PDFReader), and I can query that one PDF document. Now I'd like to expand this to multiple PDF files located in a folder. I don't have a background in this type of work, so the concepts of indexes and adding documents are confusing me. Do I need to pick a certain index type? Do I have to build a graph and then ingest the indices one by one? How can I go about what I'm trying to accomplish? I am looking for some rookie guidance. Thank you.
I think a single vector index is a good place to start, even with multiple documents (sketch below).

As you start to expand, or depending on the types of queries you expect, you could grow it into a graph index (with sub-indexes grouped by topic or something) or another index type.
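A minimal sketch of that single-index approach (not from the thread itself), using the same legacy LlamaIndex API that shows up later in this conversation; the folder path and query are placeholders:
Plain Text
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every file in the folder; SimpleDirectoryReader picks a parser
# per file type, PDFs included
documents = SimpleDirectoryReader('./pdfs').load_data()

# One vector index across all the documents
index = GPTSimpleVectorIndex(documents)
index.save_to_disk('index.json')

# A query retrieves the most similar chunks, regardless of which PDF they came from
response = index.query('What does the collection say about maintenance?')
print(response)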
If you want to keep document boundaries, definitely look at building a graph/composable index; there's a rough sketch of that below too.
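A rough sketch of the composable route. The composability API moved around a lot between early releases, so treat these names (ComposableGraph.from_indices, index_summaries) as assumptions to check against your installed version:
Plain Text
from llama_index import GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader
from llama_index.indices.composability import ComposableGraph

# One vector index per document, so each document keeps its own boundary
docs_a = SimpleDirectoryReader('./pdfs/manual_a').load_data()
docs_b = SimpleDirectoryReader('./pdfs/manual_b').load_data()
index_a = GPTSimpleVectorIndex(docs_a)
index_b = GPTSimpleVectorIndex(docs_b)

# Compose the sub-indexes under a list index; the summaries tell the graph
# which sub-index is relevant to a given query
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_a, index_b],
    index_summaries=['Operating manual for drone A', 'Operating manual for drone B'],
)
response = graph.query('What is the snow checklist for drone A?')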
Thank you, Logan. I will look into your guidance about the graph/composable index.
@Voldev I think you were also curious about this
@Vayu as well!
What I'm having trouble with is hallucinated answers. I built a similar system with LangChain alone, and the JSON agent would retrieve the information just fine, but when I embed the JSONs into an index with LlamaIndex, it seems to get only part of the information and make up the rest. Would this possibly help with that?
Hmm, I think that's mostly a prompt engineering problem.

Are you using gpt-3.5?
There might also be a better way to insert your JSON data into an index, depending on what it looks like πŸ€” (rough sketch below)
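For example (an assumption about the data's shape, not something shown in this thread): if the JSON is a list of records, building one Document per record keeps each drone's name and checklists in the same chunk instead of letting the splitter separate them:
Plain Text
import json

from llama_index import Document, GPTSimpleVectorIndex

# 'drones.json' is a placeholder filename for the made-up drone data
with open('drones.json') as f:
    records = json.load(f)

# Flatten each record to 'key: value' lines so the drone name is embedded
# alongside its checklists
documents = [
    Document('\n'.join(f'{k}: {json.dumps(v)}' for k, v in record.items()))
    for record in records
]
index = GPTSimpleVectorIndex(documents)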
The task involves some made-up data about drones; here is how I'm making the index and prompting:
Plain Text
import glob
import json
from pathlib import Path

from llama_index import Document, GPTSimpleVectorIndex, SimpleDirectoryReader


def prepare_data():
    '''Load the data from the directory and prepare it for indexing'''

    # Load from disk if the index already exists
    if Path('index.json').is_file():
        index = GPTSimpleVectorIndex.load_from_disk('index.json')
        # Check for files added since the index was last saved
        docs = glob.glob(config['data']['books'] + '/*')
        with open('docs.txt', 'r') as f:
            old_docs = json.load(f)
        # Read each new file's contents; wrapping the bare path in
        # Document() would index the path string instead of the text
        new_paths = [p for p in docs if p not in old_docs]
        for path in new_paths:
            index.insert(Document(Path(path).read_text()))
        index.save_to_disk('index.json')
        # Record the current file list so the next run can diff against it
        with open('docs.txt', 'w') as f:
            json.dump(docs, f)

        return index

    # First run: load all documents from the directory
    documents = SimpleDirectoryReader(config['data']['books']).load_data()
    # Make a record of all files contained in the directory
    docs = glob.glob(config['data']['books'] + '/*')
    # Save off the docs list as JSON
    with open('docs.txt', 'w') as f:
        json.dump(docs, f)

    index = GPTSimpleVectorIndex(documents, max_input_size=2048, num_output=2000, max_chunk_overlap=12)
    index.save_to_disk('index.json')
    return index


and the prompt:
Plain Text
    print(index.query(
        '''### USER QUERY: it just started snowing, what is the snow checklist?
         ### DRONE: AeroGuardian AG950
         ### DIRECTION: Find and provide the relevant weather checklist. Provide the number of elements in the checklist, then list each element. Answer in the form of a JSON object. '''))
The JSON data is in a few different forms, which is why I'm not just doing a regex lookup to begin with :p
There is a JSON entry with the drone name, exactly as written there. The same prompt into the LangChain JSON agent returns the exact checklist, no problem... hmmmm
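One aside worth checking here (not raised in the thread itself): the old GPTSimpleVectorIndex retrieved only the single most similar chunk by default, which produces exactly this pattern of getting part of a record and inventing the rest. Asking for more chunks per query may help:
Plain Text
# similarity_top_k raises the number of retrieved chunks (the old default was 1)
response = index.query('it just started snowing, what is the snow checklist?', similarity_top_k=3)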
@Ali Salih Thanks! πŸ™‚
Can someone comment on the data privacy aspect of using llama-index? For example, working with a private PDF file: what best practices should be used to keep the data private?
I would be interested in this as well!
That's what I'm working on currently. From what I know:

1 - You can't use any 3rd-party hosted service (like OpenAI, for example), except for downloading models from Hugging Face
2 - The LLM you run locally or load from Hugging Face needs to be licensed for commercial use (e.g., not Vicuna, I guess, because of its license [might be wrong on this one, but you get the point])

Optional:
3 - It can run locally without internet, as proof that it isn't connected to any 3rd-party program (see the sketch after this list)
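A rough, fully local sketch under those constraints. The model names are just examples, and the constructor kwargs follow the older LlamaIndex style used elsewhere in this thread, so check them against your installed version:
Plain Text
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from llama_index import (GPTSimpleVectorIndex, LangchainEmbedding,
                         LLMPredictor, SimpleDirectoryReader)
from transformers import pipeline

# Local generation model downloaded from Hugging Face (example model)
local_pipe = pipeline('text2text-generation', model='google/flan-t5-large', max_length=512)
llm_predictor = LLMPredictor(llm=HuggingFacePipeline(pipeline=local_pipe))

# Local embeddings, so no text ever leaves the machine
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
)

documents = SimpleDirectoryReader('./private_pdfs').load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, embed_model=embed_model)
# Once the model files are cached, queries run with no network access
response = index.query('Summarize the confidential report')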
@felixeatsramen @Ali Salih

Yeah, for private stuff, definitely check out the recent Colab notebook (you'll probably be more interested in the GPU section; the CPU is too slow)

https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing
Thanks, that's helpful. Did you test it with CPU only?
Since you're speaking about that @Logan M: using the CPU,
this error happens when I try to get the response to the question.
It says it might come from TensorFlow.
I installed TensorRT + cuDNN recently because I wanted to use the GPU version of the code; could it come from that?
In the bottom section of the notebook, I tested a Camel model running on a T4 GPU (15GB).
Hmmm, no idea haha. Maybe it crashed due to VRAM?
No, it's the CPU-based program
Ah, I wouldn't even bother with the CPU part tbh... it's too slow to be useful πŸ₯²