This program is so vast I don't even know where to start

This program is so vast I don't even know where to start. I currently have a totally separate bot that uses text-embedding-ada-002 to create embeddings (from a long and large text document), and then I have a Python bot that answers questions against those embeddings using a Davinci model.

How would I go about recreating this using GPT Index? For my use case I would need the bot to answer extremely specifically (think legal statute: very fine-detail specific). What avenue should I start with?
38 comments
This sounds like a perfect use case. You can load your documents into a GPTSimpleVectorIndex and then just query the index
Plain Text
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# load every file in the ./data folder, build the vector index, then query it
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents)
index.query("<question_text>?")
what format are your documents in?
Anything really, currently they are in .txt but I have no issues swapping them to different formats
ok cool then this code should work well
Ok, thank you. So if I understood correctly that code will search the embeddings database for the answer? So now I would just need to figure out how to load them into the GPTSimpleVectorIndex.
and yeah you can just load them in using the SimpleDirectoryReader in that code snippet up there
or the updated way to do it is:
Plain Text
from gpt_index import GPTSimpleVectorIndex, download_loader

# fetch the SimpleDirectoryReader loader, then load everything in ./data
SimpleDirectoryReader = download_loader("SimpleDirectoryReader")

loader = SimpleDirectoryReader('./data')
documents = loader.load_data()
index = GPTSimpleVectorIndex(documents)
index.query("<question_text>?")
Yeah I'm just a bit confused on how does it know which documents to load?
Like here is an example code snippet of what I mean (the code determines which txt file to open) and then it creates the embeddings based on that:
[Attachment: image.png]
I'm quite new to programming so I'm not sure how the code you posted could be used to load a txt file into the index 🤔
no worries -- basically ./data here is a local folder
so you can put your txt files into a folder named './data' in the same directory as your code
and then run that code @Teemu
It basically just goes in and loads all the files in the folder
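For example, a layout like this would work (the file names here are just placeholders):
Plain Text
your_project/
├── bot.py          <- the script with the code above
└── data/
    ├── document1.txt
    └── document2.txt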
Ah thank you so much, that helps clear it up. I'm guessing I should build the bot in a totally different directory also?
up to you. I find it easier to keep everything in the same directory when building out the first version
just make sure the ./data folder only has your txt files in it
Ah alright. Do you think this architecture could also support sharing the bot, let's say, on a WordPress site?
I had some issues with my previous bot creating a lot of chunks which could be overwhelming
Hmm, the code flags GPTSimpleVectorIndex as undefined and the folder/files it creates don't display correctly: "GPTSimpleVectorIndex" is not defined (Pylance: reportUndefinedVariable)

Looks like maybe some sort of import error
having a lot of chunks isn't too much of an issue since the indexing is done beforehand. How many do you have?
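If chunking ever does become a problem, some gpt_index versions accept a chunk_size_limit argument when building the index; whether your installed version supports it is an assumption to verify:
Plain Text
# assumes your gpt_index version accepts chunk_size_limit -- check before relying on it
index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)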
did you add from gpt_index import GPTSimpleVectorIndex?
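For reference, the imports used across the snippets above can be combined into a single line; this is just a convenience sketch:
Plain Text
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader, download_loader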
The .cpython file should contain the embeddings, right? It says it's in binary or has an unsupported text encoding
Do GPT Index and the libraries it runs on support letters such as ' or '? I suspect that might be the issue...
the embeddings aren't in any file. They're generated in the index variable
you just have to put your original txt documents in the folder
Yeah, I just wanted to check the embeddings because my previous bot had some trouble with special characters not being supported, which ruined the semantic search. But I managed to run the code and this is what it said:

INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 11018 tokens
INFO:root:> [query] Total LLM token usage: 3554 tokens
INFO:root:> [query] Total embedding token usage: 6 tokens

So it looks like it was a success.
I guess the next step is to figure out how I can build a chatbot using that. I imagine it will look quite different from the last one.
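As a rough sketch, a first version of such a chatbot could simply wrap index.query in a loop; everything here besides index.query itself is placeholder:
Plain Text
# minimal command-line chatbot: repeatedly ask for a question and print the index's answer
while True:
    question = input("Ask a question (or type 'quit' to exit): ")
    if question.strip().lower() == "quit":
        break
    response = index.query(question)
    print(response)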
How would you access or view them? I have run the code a few times now and it seems to be generating them, but I want to make sure they have been generated properly.

Because when I try to query the index, it doesn't show "documents" as being defined anywhere. So what would be the correct way to generate an answer based on the indexed documents?
I see -- when you run a query, the response has a field called source_nodes that tells you which document(s) were used to construct the answer. You can check that out.
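As a sketch, checking the sources after a query could look like this; only source_nodes itself is confirmed above, the rest is plain printing:
Plain Text
response = index.query("<question_text>?")
print(response)                  # the generated answer
for node in response.source_nodes:
    print(node)                  # the chunk(s) the answer was built from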
As for viewing the embeddings themselves, try doing this once you've already generated your index once
Plain Text
# save to disk
index.save_to_disk('index.json')
# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')
that way, you're actually saving the index you generated. You can open up that json file and take a look
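A common pattern with those two calls is to rebuild the index only when no saved copy exists; 'index.json' is just the example filename used above:
Plain Text
import os

if os.path.exists('index.json'):
    # reuse the index built on a previous run
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
else:
    # first run: build from the documents and save for next time
    index = GPTSimpleVectorIndex(documents)
    index.save_to_disk('index.json')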
Yeah the encoding isn't perfect
[Attachment: image.png]
but I was able to query the index finally
but when asking even a one-sentence question I got a notification saying the maximum token length is 4097 tokens. But then when I ask other questions it answers with just one sentence...
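One approach from the gpt_index documentation of that era was to constrain prompt sizes with a PromptHelper and LLMPredictor when building the index; treat this as a sketch, since the exact class names and arguments depend on the installed gpt_index and langchain versions:
Plain Text
# sketch only -- verify these classes and arguments against your installed gpt_index version
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper
from langchain import OpenAI

max_input_size = 4096   # model context window
num_output = 256        # tokens reserved for the answer
max_chunk_overlap = 20  # overlap between chunks

prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=num_output))

documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)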