Guidance

Good evening folks - I am new to LlamaIndex and just playing around. Currently I have built a proof of concept that can load a PDF file (using PDFReader), and I can query that one PDF document. Now I'd like to expand this to multiple PDF files located in a folder. I don't have a background in this type of work, so the concepts of indexes and adding documents are confusing me. Do I need to pick a certain index type? Do I have to build a graph and then ingest the indices one by one? How can I go about what I'm trying to accomplish? I am looking for some rookie guidance. Thank you.
I think a single vector index is a good place to start, even with multiple documents (sketch below).

As you start to expand, or depending on the types of queries you expect, you could grow it into a graph index (with sub-indexes grouped by topic or something) or another index type.
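A minimal sketch of that single-index approach (not from the thread itself), using the same legacy LlamaIndex API that shows up later in this conversation; the folder path and query are placeholders:
Plain Text
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load every file in the folder; SimpleDirectoryReader picks a parser
# per file type, PDFs included
documents = SimpleDirectoryReader('./pdfs').load_data()

# One vector index across all the documents
index = GPTSimpleVectorIndex(documents)
index.save_to_disk('index.json')

# A query retrieves the most similar chunks, regardless of which PDF they came from
response = index.query('What does the collection say about maintenance?')
print(response)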
If you want to keep document boundaries, definitely look at building a graph/composable index; there's a rough sketch of that below too.
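A rough sketch of the composable route. The composability API moved around a lot between early releases, so treat these names (ComposableGraph.from_indices, index_summaries) as assumptions to check against your installed version:
Plain Text
from llama_index import GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader
from llama_index.indices.composability import ComposableGraph

# One vector index per document, so each document keeps its own boundary
docs_a = SimpleDirectoryReader('./pdfs/manual_a').load_data()
docs_b = SimpleDirectoryReader('./pdfs/manual_b').load_data()
index_a = GPTSimpleVectorIndex(docs_a)
index_b = GPTSimpleVectorIndex(docs_b)

# Compose the sub-indexes under a list index; the summaries tell the graph
# which sub-index is relevant to a given query
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_a, index_b],
    index_summaries=['Operating manual for drone A', 'Operating manual for drone B'],
)
response = graph.query('What is the snow checklist for drone A?')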
Thank you, Logan. I will look into your guidance about the graph/composable index.
@Voldev I think you were also curious about this
@Vayu as well!
What I'm having trouble with is hallucinated answers. I built a similar system with LangChain alone, and the JSON agent would retrieve the information just fine, but when I embed the JSONs into an index with LlamaIndex, it seems to get only part of the information and make up the rest. Would this possibly help with that?
Hmm, I think that's mostly a prompt engineering problem.

Are you using gpt-3.5?
There might also be a better way to insert your JSON data into an index, depending on what it looks like πŸ€” (rough sketch below)
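For example (an assumption about the data's shape, not something shown in this thread): if the JSON is a list of records, building one Document per record keeps each drone's name and checklists in the same chunk instead of letting the splitter separate them:
Plain Text
import json

from llama_index import Document, GPTSimpleVectorIndex

# 'drones.json' is a placeholder filename for the made-up drone data
with open('drones.json') as f:
    records = json.load(f)

# Flatten each record to 'key: value' lines so the drone name is embedded
# alongside its checklists
documents = [
    Document('\n'.join(f'{k}: {json.dumps(v)}' for k, v in record.items()))
    for record in records
]
index = GPTSimpleVectorIndex(documents)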
The task involves some made-up data about drones; here is how I'm making the index and prompting:
Plain Text
import glob
import json
from pathlib import Path

from llama_index import Document, GPTSimpleVectorIndex, SimpleDirectoryReader


def prepare_data():
    '''Load the data from the directory and prepare it for indexing'''

    # Load from disk if the index already exists
    if Path('index.json').is_file():
        index = GPTSimpleVectorIndex.load_from_disk('index.json')
        # Check for files added since the index was last saved
        docs = glob.glob(config['data']['books'] + '/*')
        with open('docs.txt', 'r') as f:
            old_docs = json.load(f)
        # Read each new file's contents; wrapping the bare path in
        # Document() would index the path string instead of the text
        new_paths = [p for p in docs if p not in old_docs]
        for path in new_paths:
            index.insert(Document(Path(path).read_text()))
        index.save_to_disk('index.json')
        # Record the current file list so the next run can diff against it
        with open('docs.txt', 'w') as f:
            json.dump(docs, f)

        return index

    # First run: load all documents from the directory
    documents = SimpleDirectoryReader(config['data']['books']).load_data()
    # Make a record of all files contained in the directory
    docs = glob.glob(config['data']['books'] + '/*')
    # Save off the docs list as JSON
    with open('docs.txt', 'w') as f:
        json.dump(docs, f)

    index = GPTSimpleVectorIndex(documents, max_input_size=2048, num_output=2000, max_chunk_overlap=12)
    index.save_to_disk('index.json')
    return index


and the prompt:
Plain Text
    print(index.query(
        '''### USER QUERY: it just started snowing, what is the snow checklist?
         ### DRONE: AeroGuardian AG950
         ### DIRECTION: Find and provide the relevant weather checklist. Provide the number of elements in the checklist, then list each element. Answer in the form of a JSON object. '''))
The JSON data is in a few different forms, which is why I'm not just doing a regex lookup to begin with :p
There is a JSON entry with the drone name, exactly as written there. The same prompt into the LangChain JSON agent returns the exact checklist, no problem... hmmmm
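One aside worth checking here (not raised in the thread itself): the old GPTSimpleVectorIndex retrieved only the single most similar chunk by default, which produces exactly this pattern of getting part of a record and inventing the rest. Asking for more chunks per query may help:
Plain Text
# similarity_top_k raises the number of retrieved chunks (the old default was 1)
response = index.query('it just started snowing, what is the snow checklist?', similarity_top_k=3)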
@Ali Salih Thanks! πŸ™‚
Can someone comment on the data privacy aspect of using llama-index? For example, working with a private PDF file: what best practices should be used to keep the data private?
I would be interested in this as well!
That's what I'm working on currently. From what I know:

1 - You can't use any 3rd-party hosted service (like OpenAI, for example), except for downloading models from Hugging Face
2 - The LLM you run locally or load from Hugging Face needs to be licensed for commercial use (e.g., not Vicuna, I guess, because of its license [might be wrong on this one, but you get the point])

Optional:
3 - It can run locally without internet, as proof that it isn't connected to any 3rd-party program (see the sketch after this list)
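A rough, fully local sketch under those constraints. The model names are just examples, and the constructor kwargs follow the older LlamaIndex style used elsewhere in this thread, so check them against your installed version:
Plain Text
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from llama_index import (GPTSimpleVectorIndex, LangchainEmbedding,
                         LLMPredictor, SimpleDirectoryReader)
from transformers import pipeline

# Local generation model downloaded from Hugging Face (example model)
local_pipe = pipeline('text2text-generation', model='google/flan-t5-large', max_length=512)
llm_predictor = LLMPredictor(llm=HuggingFacePipeline(pipeline=local_pipe))

# Local embeddings, so no text ever leaves the machine
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
)

documents = SimpleDirectoryReader('./private_pdfs').load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, embed_model=embed_model)
# Once the model files are cached, queries run with no network access
response = index.query('Summarize the confidential report')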
@felixeatsramen @Ali Salih

Yeah, for private stuff, definitely check out the recent Colab notebook (you'll probably be more interested in the GPU section; the CPU is too slow)

https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing
Thanks, that's helpful. Did you test it with CPU only?
Since you're speaking about that @Logan M: using the CPU,
this error happens when I try to get the response to the question.
It says it might come from TensorFlow.
I installed TensorRT + cuDNN recently because I wanted to use the GPU version of the code; could it come from that?
In the bottom section of the notebook, I tested a Camel model running on a T4 GPU (15GB).
Hmmm, no idea haha. Maybe it crashed due to VRAM?
No, it's the CPU-based program
Ah, I wouldn't even bother with the CPU part tbh... it's too slow to be useful πŸ₯²