This sounds like a perfect use case. You can load your documents into a GPTSimpleVectorIndex and then just query the index:
import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# load every file in the ./data folder, build the index, then query it
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex(documents)
index.query("<question_text>?")
what format are your documents in?
Anything really, currently they are in .txt but I have no issues swapping them to different formats
ok cool then this code should work well
Ok, thank you. So if I understood correctly, that code will search the embeddings for the answer? So now I just need to figure out how to load my documents into the GPTSimpleVectorIndex.
and yeah you can just load them in using the SimpleDirectoryReader
in that code snippet up there
or the updated way to do it is:
from gpt_index import GPTSimpleVectorIndex, download_loader

SimpleDirectoryReader = download_loader("SimpleDirectoryReader")
loader = SimpleDirectoryReader('./data')
documents = loader.load_data()
index = GPTSimpleVectorIndex(documents)
index.query("<question_text>?")
Yeah, I'm just a bit confused about how it knows which documents to load.
Like here is an example code snippet of what I mean (the code determines which txt file to open) and then it creates the embeddings based on that:
I'm quite new to programming so I'm not sure how that code you posted could be used to load a txt file into 🤔
no worries -- './data' here is a local folder, so you can put your txt files into a folder named './data' in the same directory as your code and then run that code @Teemu
It basically just goes in and loads all the files in the folder
Ah thank you so much, that helps clear it up. I'm guessing I should build the bot in a totally different directory also?
up to you. I find it easier to keep everything in the same directory when building out the first version
just make sure the ./data folder only has your txt files in it
Ah, alright. Do you think this architecture could also support sharing the bot, let's say on a WordPress site?
I had some issues with my previous bot creating a lot of chunks, which could be overwhelming.
Hmm, the code has GPTSimpleVectorIndex as undefined, and the folder/files it creates don't display correctly: "GPTSimpleVectorIndex" is not defined (Pylance: reportUndefinedVariable)
Looks like maybe some sort of import error
having a lot of chunks isn't too much of an issue since the indexing is done beforehand. How many do you have?
did you add from gpt_index import GPTSimpleVectorIndex at the top of your file?
The file.cpython should contain the embeddings, right? It says it's in binary or has an unsupported text encoding.
Do GPT Index and the libraries it runs on support letters such as 'ä' or 'ö'? I suspect that might be the issue...
the embeddings aren't stored in any file -- they're generated and kept in the index variable
you just have to put your original txt documents in the folder
Yeah, I just wanted to check the embeddings because my previous bot had some trouble with special characters not being supported, which ruined the semantic search. But I managed to run the code, and this is what it said:
INFO:root:> [build_index_from_documents] Total LLM token usage: 0 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 11018 tokens
INFO:root:> [query] Total LLM token usage: 3554 tokens
INFO:root:> [query] Total embedding token usage: 6 tokens
So it looks like it was a success.
I guess the next step is to figure out how I can build a chatbot using that. I imagine it will look quite different from the last one.
How would you access or view them? I have run the code a few times now, and it seems to be generating them, but I want to make sure they have been generated properly.
Because when I try to query the index, it doesn't show "documents" as being defined anywhere. So what would be the correct way to generate an answer based on the indexed documents?
I see -- when you run a query, the response has a field called source_nodes that tells you which document(s) were used to construct the answer. You can check that out.
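Something like this -- the `response` / `source_nodes` / `source_text` field names are from gpt_index's query response at the time; the stub classes below just stand in for a real response so the snippet runs without an API call:

```python
class StubNode:
    """Stand-in for one retrieved chunk on a query response."""
    def __init__(self, source_text):
        self.source_text = source_text

class StubResponse:
    """Stand-in for the object index.query(...) returns."""
    def __init__(self, response, source_nodes):
        self.response = response          # the generated answer text
        self.source_nodes = source_nodes  # chunks used to build it

def show_sources(response):
    # Preview the first 200 characters of each source chunk
    return [node.source_text[:200] for node in response.source_nodes]

resp = StubResponse("An answer.", [StubNode("First chunk of a document...")])
```

With a real index you'd do `resp = index.query("<question_text>?")` and then loop over `resp.source_nodes` the same way.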
As for viewing the embeddings themselves, try this once you've already generated your index:
# save to disk
index.save_to_disk('index.json')
# load from disk
index = GPTSimpleVectorIndex.load_from_disk('index.json')
that way, you're actually saving the index you generated. You can open up that json file and take a look
Yeah, the encoding isn't perfect, but I was able to query the index finally.
But when asking even a one-sentence question, I got a notification saying the maximum token length is 4097 tokens. Yet when I ask other questions, it answers with just one sentence...