
Hello, everyone. Could someone explain to me how the system of searching for suitable "pieces of text" in a document works in GPTVectorStoreIndex?
And how can I make this selection more accurate? Sometimes it retrieves inappropriate passages.
It finds semantically similar text snippets
The comparison is between the prompt you send and the information you have in your knowledge base (vector store)
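Under the hood it is nearest-neighbour search over embeddings: every chunk of your documents is turned into a vector, your query is embedded the same way, and the top_k closest chunks are returned. A minimal sketch of the idea in plain numpy (not LlamaIndex's actual code; the embeddings are assumed to be precomputed):

Plain Text
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: 1.0 means the vectors point the same way
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, node_embs: list, top_k: int = 2) -> list:
    # Score every stored node against the query, return indices of the top_k best
    scores = [cosine_similarity(query_emb, emb) for emb in node_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]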
My problem is that the database consists of 60 pages of text (divided by paragraphs and headings).

And I noticed that the bot does indeed search for semantic matches.
But when I use 2-3 nodes (similarity_top_k=...), I don't get what I need for some "complicated" questions - I just get semantically similar nodes from another paragraph.

To clarify: the knowledge base and the questions are in Ukrainian.
Are you using OpenAI services, or an open-source LLM and embedding model?
my code snippet:

Plain Text
import os

from langchain.chat_models import ChatOpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

temperature = business_unit.temperature
if business_unit.max_tokens:
    llm_predictor = LLMPredictor(
        llm=ChatOpenAI(model_name=business_unit.gpt_model, temperature=temperature,
                       max_tokens=business_unit.max_tokens)
    )
else:
    llm_predictor = LLMPredictor(
        llm=ChatOpenAI(model_name=business_unit.gpt_model, temperature=temperature)
    )
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor,
                                               system_prompt=business_unit.system_prompt)
if os.path.exists(index_name):
    # Reuse the persisted index if it already exists on disk
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=index_name),
        service_context=service_context,
    )
else:
    # Otherwise build the index from the documents folder and persist it
    documents = SimpleDirectoryReader(documents_folder).load_data()
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir=index_name)
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(query_text)
Okay, for the code you can make the following change:

Plain Text
import os

from llama_index import (
    GPTVectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.llms import OpenAI

temperature = business_unit.temperature
if business_unit.max_tokens:
    # define LLM, add your conditions
    llm = OpenAI(model=business_unit.gpt_model, temperature=temperature,
                 max_tokens=business_unit.max_tokens)
else:
    # define LLM, add your conditions
    llm = OpenAI(model=business_unit.gpt_model, temperature=temperature)

service_context = ServiceContext.from_defaults(llm=llm,
                                               system_prompt=business_unit.system_prompt)
if os.path.exists(index_name):
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=index_name),
        service_context=service_context,
    )
else:
    documents = SimpleDirectoryReader(documents_folder).load_data()
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir=index_name)
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(query_text)
Thank you very much, but can you provide an explanation for this code?
I see you used OpenAI instead of ChatOpenAI.
I just replaced ChatOpenAI with LlamaIndex's OpenAI.

Now, for your problem, I can think of these reasons:
  • The OpenAI embedding model may not work well here. You can try exploring embedding models that give better support for the Ukrainian language (see the sketch after this list).
  • For response generation, I think OpenAI is the best one out there for any non-English language. GPT-3 is not good now; you should try GPT-3.5 or GPT-4.
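For example, something like this could swap in a multilingual open-source embedding model (a sketch: it assumes a llama_index version that ships HuggingFaceEmbedding, and intfloat/multilingual-e5-base is just one multilingual model you could try):

Plain Text
from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

# Example multilingual model; any Ukrainian-capable embedding model works here
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-base")

service_context = ServiceContext.from_defaults(
    llm=llm,                  # the LLM defined in the snippet above
    embed_model=embed_model,  # replaces the default OpenAI embeddings
    system_prompt=business_unit.system_prompt,
)

Note that after changing the embedding model you have to rebuild the index, since the stored vectors were produced by the old model.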
Changed that. The problem is the same.

Let me give you some examples (there may be mistakes, as I ran the text through a translator so as not to rewrite a lot):

response from my api:
{
  "user_question": "How do I create an event?",
  "response": {
    "response": "To create an event, you need to log in to your personal account. After that, in the upper right corner of the screen, click on the \"icon\" of the account. In the menu that opens, select the \"Manage\" section, and then go to the \"Create an event\" section. The next step is the \"Add event details\" tab, where you can specify the name of the event, date and time, description and other necessary details. Save the event by clicking the \"Save\" button.",
    "eval_result": 4.0,
    "llm_context": "Lesson creation instructions: http://surl.li/koogr\n\nCreating a test:\nPlease note: when creating a lesson, you can add a test task to it, which consists of different questions. At the second level of creating a subject \"Curriculum\", you can fill it with various test tasks. To fill in the selected topic, click \"Add test\". Next, click the \"Create a new test\" tab. When creating a new test, enter a name for it. Click the \"Create\" button. Select the \"Add questions\" tab. Next, select the question type.\nList of question types:\nTrue or False: indicate whether the statement is true or false.\nOne correct answer.\nMultiple correct answers.\nSorting answers (place in the correct order).\nFill in the blanks.\nShort open-ended answer (answer in one word).\nAn open-ended answer.\nMatching (forming correct answer pairs).\nThe created timetable should be saved by clicking the \"Save\" button.\nInstructions for creating a timetable: http://surl.li/kpvgn\n\n\nCreating a subject timetable:\nTo create a subject timetable, you should follow these steps: log in to your personal account. Click the \"icon\" of your account in the upper right corner of the screen. A menu will open on the right, in which select the \"Management\" section. Go to the section \"Class schedule\". Click \"Create timetable\". In the \"Select class\" tab, specify the class for which you are creating the timetable. Select the academic year. Specify the semester in the drop-down list. Select a shift. A window opens with the days of the week listed horizontally and the order of lessons in time listed vertically. For each time period, select a subject. In the tab opposite the subject name, select from the drop-down list what will be assessed in all lessons on the specified day and time."
  },
  "sendpulse_cont": [
    "{\"success\":true,\"data\":true}",
    "{\"success\":true,\"data\":true}"
  ]
}


------------------------------
the passage I need in the document:
Creating events: At the second level of creating a Curriculum subject, you have the opportunity to add various events: party, theme night, conference, concert, competition, tournament, etc. To create an event for your class, you should follow these steps: click "Add event". Be sure to enter the name of the event you plan to organise for your class. Next, select the type of event: party, theme night, conference, concert, competition, tournament, etc. Specify the duration of the event. Specify the location of the event. In the "Note" line, leave any clarifications, requests or any other necessary information about the event. After entering all the data, click the Create button. To finish setting up the event, click the Close button. Instructions for creating events: http://surl.li/koonb


As you can see, the llm_context does not contain the node I need. I don't understand why it works this way.
The answer is more or less correct, but the llm_context is very strange, and the link is wrong.
The main problem is that LlamaIndex sometimes gives irrelevant context. It always returns 2 passages (nodes), as specified by the argument, even when it does not find any matches in the knowledge base (example: for the query "Hello", the context is a bunch of text from the knowledge base unrelated to "Hello").
It should give ONLY relevant context, and ONLY if it finds it in the knowledge base.

Also, if there are more than two matches in the knowledge base, we still get only 2. That is, if the query context appears 10 times in the text of the database, we pass only 2 of them to ChatGPT.
We must pass ALL relevant text fragments.
Yes, it gives 2 nodes because that is the default value. You can increase it by updating the top_k value.

Plain Text
query_engine = index.as_query_engine(similarity_top_k=5)  # or any value for the number of nodes you want


But this will bring back the top 5 or 10 nodes, based on the value that you have set.
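If you also want to drop low-scoring nodes instead of always getting exactly top_k back, one option is a similarity cutoff. A sketch (the 0.7 threshold is an arbitrary starting point you would tune for your data):

Plain Text
from llama_index.indices.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=10,  # retrieve a generous candidate set first
    node_postprocessors=[
        # then keep only the nodes whose similarity clears the cutoff
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
)

With this, a query like "Hello" that matches nothing well can come back with no context nodes at all.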
Yes, I understand, but even when I request 10 nodes, the passage on "Creating events" that I need still isn't there.

It's also a bit strange that when there is only 1 relevant passage in the database, I still get 10 nodes, etc.
Also, I see that your llm_context contains items related to creating different kinds of items as well. They may be getting a high similarity value to your query.
Yes, you are right that it finds other semantically similar nodes.

But why isn't the passage I need among them? Maybe I need to improve my knowledge base?
Of course, you won't understand the text, but I'm talking about the formatting itself. Maybe it should be improved?
[Attachment: image.png]
You could try one thing to check: update the metadata for the particular node you want to be fetched. Add some information like "This node contains information about creating events", etc. (see the sketch below).

Then try querying and check whether that node shows up in the retrieved results.
If this works, then you can try the above-mentioned notebook.
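A sketch of what that could look like when building the documents (the metadata keys and values here are made up for illustration):

Plain Text
from llama_index import Document

doc = Document(
    text=event_passage_text,  # hypothetical variable holding the "Creating events" passage
    metadata={
        "topic": "Creating events",
        "description": "How to add a party, theme night, conference, etc. to a class",
    },
)

As far as I know, node metadata is included by default in the text that gets embedded, so descriptive metadata directly influences the similarity score.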
thx, I'll try.

I also wondered whether it would solve my problem if I added a note to each passage:
the questions this excerpt can answer (a list of all possible questions).
Yes, if this approach works, then you can use the above notebook and let the LLM generate the questions.
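If you do want to try automating it, a rough sketch (the prompt and the helper are hypothetical; llm is the LlamaIndex OpenAI instance from the snippet above):

Plain Text
from llama_index import Document

def with_generated_questions(text: str) -> Document:
    # Ask the LLM which questions this passage answers (hypothetical prompt)
    resp = llm.complete(
        "List 3 questions, in Ukrainian, that the following passage answers:\n\n" + text
    )
    # Keep the questions in metadata so they are embedded together with the passage
    return Document(text=text, metadata={"questions_answered": resp.text})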
But I think the LLM will not do a good job (in Ukrainian).

And I'll have to do it manually
And if you add the queries after the text, it will surely increase the semantic match 🤝. Do add them in metadata: if the node gets bigger, it will be divided into N parts, and all of them will contain the same metadata.

If you add them as a note in the text instead and the node gets divided, they will end up only in the last part.
thx ❤️
@WhiteFang_Jr

I noticed a problem with the nodes. They are not separated correctly: one node can contain parts of different paragraphs.
How can I properly separate the paragraphs in the database so that the correct nodes are formed?
Here's an example.

The red box is node "A", and the green box is the next node, "B".
[Attachment: image.png]
If you want each node to correspond to one topic, you'll have to do that manually: prepare the data on your side, create Document objects from it, and pass them to VectorStoreIndex.
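A sketch of that manual route, assuming the knowledge base is one text file with a known separator between topics (the file name and separator are hypothetical):

Plain Text
from llama_index import Document, GPTVectorStoreIndex

with open("knowledge_base.txt", encoding="utf-8") as f:  # hypothetical file
    raw_text = f.read()

# One Document per topic, so each node maps cleanly onto one topic
sections = [s.strip() for s in raw_text.split("\n\n\n") if s.strip()]
documents = [Document(text=section) for section in sections]

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)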
Can I specify any special characters in the database text that will separate nodes?
No, I don't think that will work. LlamaIndex actually separates the information based on chunk size, which has a default value of 1024 tokens.
Oh, this is very bad news =(((
Can you point me to this code in the llamaindex library? Maybe I can override it and modify it?
You want the code for how LlamaIndex currently chunks the information and makes nodes, right?
https://github.com/run-llama/llama_index/blob/37f8421ff8131275dbe40837890971bb924b76e1/llama_index/node_parser/simple.py#L73

This is where node formation starts. You'll have to follow the trace to where the division actually takes place.
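Conceptually, chunk-size-based splitting works something like this simplified sketch (not LlamaIndex's actual code), which is why a node can start or end mid-paragraph:

Plain Text
def naive_chunk(tokens: list, chunk_size: int = 1024, overlap: int = 20) -> list:
    # Slide a fixed-size window over the token stream; the boundaries ignore
    # paragraphs entirely, so chunks can cut straight through them
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks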
great, thanks
I just added these lines and that's it - now I have a proper node. It's not "cropped".

How does it work?
[Attachments: image.png, image.png]
I mean the grey lines
Yeah, okay. It's not about the lines themselves, it's about the blank lines: if there are three "\n" characters in a row, then it takes only one paragraph.
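That matches the splitter's defaults: the sentence splitter in this generation of LlamaIndex uses three consecutive newlines as its paragraph separator. If you want to control this explicitly, something like the following should work (a sketch, assuming SentenceSplitter and SimpleNodeParser from the same llama_index version):

Plain Text
from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(
    chunk_size=1024,               # default chunk size in tokens
    paragraph_separator="\n\n\n",  # three newlines mark a paragraph boundary
)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(
    llm=llm,  # the LLM defined earlier
    node_parser=node_parser,
)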