
Hello, everyone. Could someone explain to me how the system of searching for suitable "pieces of text" in a document works in GPTVectorStoreIndex?
And how can I make this selection more accurate? Sometimes it retrieves inappropriate passages.
It finds semantically similar text snippets
The comparison is between the prompt you send and the information you have in your knowledge base (vector store)
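Under the hood it is nearest-neighbour search over embeddings: every chunk of your documents is turned into a vector, your query is embedded the same way, and the top_k closest chunks are returned. A minimal sketch of the idea in plain numpy (not LlamaIndex's actual code; the embeddings are assumed to be precomputed):

Plain Text
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity: 1.0 means the vectors point the same way
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb: np.ndarray, node_embs: list, top_k: int = 2) -> list:
    # Score every stored node against the query, return indices of the top_k best
    scores = [cosine_similarity(query_emb, emb) for emb in node_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]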
My problem is that the database consists of 60 pages of text (divided by paragraphs and headings).

And I noticed that the bot does indeed search for semantic matches.
But when I use 2-3 nodes (similarity_top_k=...), I don't get what I need for some "complicated" questions - I just get semantically similar nodes from another paragraph.

To clarify: the knowledge base and the questions are in Ukrainian.
Are you using OpenAI services, or an open-source LLM and embedding model?
my code snippet:

Plain Text
import os

from langchain.chat_models import ChatOpenAI
from llama_index import (
    GPTVectorStoreIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

temperature = business_unit.temperature
if business_unit.max_tokens:
    llm_predictor = LLMPredictor(
        llm=ChatOpenAI(model_name=business_unit.gpt_model, temperature=temperature,
                       max_tokens=business_unit.max_tokens)
    )
else:
    llm_predictor = LLMPredictor(
        llm=ChatOpenAI(model_name=business_unit.gpt_model, temperature=temperature)
    )
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor,
                                               system_prompt=business_unit.system_prompt)
if os.path.exists(index_name):
    # Reuse the persisted index if it already exists on disk
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=index_name),
        service_context=service_context,
    )
else:
    # Otherwise build the index from the documents folder and persist it
    documents = SimpleDirectoryReader(documents_folder).load_data()
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir=index_name)
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(query_text)
Okay, for the code you can make the following change:

Plain Text
import os

from llama_index import (
    GPTVectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)
from llama_index.llms import OpenAI

temperature = business_unit.temperature
if business_unit.max_tokens:
    # define LLM, add your conditions
    llm = OpenAI(model=business_unit.gpt_model, temperature=temperature,
                 max_tokens=business_unit.max_tokens)
else:
    # define LLM, add your conditions
    llm = OpenAI(model=business_unit.gpt_model, temperature=temperature)

service_context = ServiceContext.from_defaults(llm=llm,
                                               system_prompt=business_unit.system_prompt)
if os.path.exists(index_name):
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=index_name),
        service_context=service_context,
    )
else:
    documents = SimpleDirectoryReader(documents_folder).load_data()
    index = GPTVectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    index.storage_context.persist(persist_dir=index_name)
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(query_text)
Thank you very much, but can you provide an explanation for this code?
I see you used OpenAI instead of ChatOpenAI.
I just replaced ChatOpenAI with LlamaIndex's OpenAI.

Now, for your problem, I can think of these reasons:
  • The OpenAI embedding model may not work well here. You can try exploring embedding models that give better support for the Ukrainian language (see the sketch after this list).
  • For response generation, I think OpenAI is the best one out there for any non-English language. GPT-3 is not good now; you should try GPT-3.5 or GPT-4.
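For example, something like this could swap in a multilingual open-source embedding model (a sketch: it assumes a llama_index version that ships HuggingFaceEmbedding, and intfloat/multilingual-e5-base is just one multilingual model you could try):

Plain Text
from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

# Example multilingual model; any Ukrainian-capable embedding model works here
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-base")

service_context = ServiceContext.from_defaults(
    llm=llm,                  # the LLM defined in the snippet above
    embed_model=embed_model,  # replaces the default OpenAI embeddings
    system_prompt=business_unit.system_prompt,
)

Note that after changing the embedding model you have to rebuild the index, since the stored vectors were produced by the old model.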
Changed that. The problem is the same.

Let me give you some examples (there may be mistakes, as I ran the text through a translator so as not to rewrite a lot):

response from my api:
{
  "user_question": "How do I create an event?",
  "response": {
    "response": "To create an event, you need to log in to your personal account. After that, in the upper right corner of the screen, click on the \"icon\" of the account. In the menu that opens, select the \"Manage\" section, and then go to the \"Create an event\" section. The next step is the \"Add event details\" tab, where you can specify the name of the event, date and time, description and other necessary details. Save the event by clicking the \"Save\" button.",
    "eval_result": 4.0,
    "llm_context": "Lesson creation instructions: http://surl.li/koogr\n\nCreating a test:\nPlease note: when creating a lesson, you can add a test task to it, which consists of different questions. At the second level of creating a subject \"Curriculum\", you can fill it with various test tasks. To fill in the selected topic, click \"Add test\". Next, click the \"Create a new test\" tab. When creating a new test, enter a name for it. Click the \"Create\" button. Select the \"Add questions\" tab. Next, select the question type.\nList of question types:\nTrue or False: indicate whether the statement is true or false.\nOne correct answer.\nMultiple correct answers.\nSorting answers (place in the correct order).\nFill in the blanks.\nShort open-ended answer (answer in one word).\nAn open-ended answer.\nMatching (forming correct answer pairs).\nThe created timetable should be saved by clicking the \"Save\" button.\nInstructions for creating a timetable: http://surl.li/kpvgn\n\n\nCreating a subject timetable:\nTo create a subject timetable, you should follow these steps: log in to your personal account. Click the \"icon\" of your account in the upper right corner of the screen. A menu will open on the right, in which select the \"Management\" section. Go to the section \"Class schedule\". Click \"Create timetable\". In the \"Select class\" tab, specify the class for which you are creating the timetable. Select the academic year. Specify the semester in the drop-down list. Select a shift. A window opens with the days of the week listed horizontally and the order of lessons in time listed vertically. For each time period, select a subject. In the tab opposite the subject name, select from the drop-down list what will be assessed in all lessons on the specified day and time."
  },
  "sendpulse_cont": [
    "{\"success\":true,\"data\":true}",
    "{\"success\":true,\"data\":true}"
  ]
}


------------------------------
the passage I need in the document:
Creating events: At the second level of creating a Curriculum subject, you have the opportunity to add various events: party, theme night, conference, concert, competition, tournament, etc. To create an event for your class, you should follow these steps: click "Add event". Be sure to enter the name of the event you plan to organise for your class. Next, select the type of event: party, theme night, conference, concert, competition, tournament, etc. Specify the duration of the event. Specify the location of the event. In the "Note" line, leave any clarifications, requests or any other necessary information about the event. After entering all the data, click the Create button. To finish setting up the event, click the Close button. Instructions for creating events: http://surl.li/koonb


As you can see, the llm_context does not contain the node I need. I don't understand why it works this way.
The answer is more or less correct, but the llm_context is very strange, and the link is wrong.
The main problem is that LlamaIndex sometimes gives irrelevant context. It always returns 2 passages (nodes), as specified by the argument, even when it does not find any matches in the knowledge base (example: for the query "Hello", the context is a bunch of text from the knowledge base unrelated to "Hello").
It should give ONLY relevant context, and ONLY if it finds it in the knowledge base.

Also, if there are more than two matches in the knowledge base, we still get only 2. That is, if the query context appears 10 times in the text of the database, we pass only 2 of them to ChatGPT.
We must pass ALL relevant text fragments.
Yes, it gives 2 nodes because that is the default value. You can increase it by updating the top_k value.

Plain Text
query_engine = index.as_query_engine(similarity_top_k=5)  # or any value for the number of nodes you want


But this will bring back the top 5 or 10 nodes, based on the value that you have set.
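If you also want to drop low-scoring nodes instead of always getting exactly top_k back, one option is a similarity cutoff. A sketch (the 0.7 threshold is an arbitrary starting point you would tune for your data):

Plain Text
from llama_index.indices.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=10,  # retrieve a generous candidate set first
    node_postprocessors=[
        # then keep only the nodes whose similarity clears the cutoff
        SimilarityPostprocessor(similarity_cutoff=0.7)
    ],
)

With this, a query like "Hello" that matches nothing well can come back with no context nodes at all.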
Yes, I understand, but even when I request 10 nodes, the passage on "Creating events" that I need still isn't there.

It's also a bit strange that when there is only 1 relevant passage in the database, I still get 10 nodes, etc.
Also, I see that your llm_context contains items related to creating different kinds of items as well. They may be getting a high similarity value to your query.
Yes, you are right that it finds other semantically similar nodes.

But why isn't the passage I need among them? Maybe I need to improve my knowledge base?
Of course, you won't understand the text, but I'm talking about the formatting itself. Maybe it should be improved?
[Attachment: image.png]
You could try one thing to check: update the metadata for the particular node you want to be fetched. Add some information like "This node contains information about creating events", etc. (see the sketch below).

Then try querying and check whether that node shows up in the retrieved results.
If this works, then you can try the above-mentioned notebook.
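A sketch of what that could look like when building the documents (the metadata keys and values here are made up for illustration):

Plain Text
from llama_index import Document

doc = Document(
    text=event_passage_text,  # hypothetical variable holding the "Creating events" passage
    metadata={
        "topic": "Creating events",
        "description": "How to add a party, theme night, conference, etc. to a class",
    },
)

As far as I know, node metadata is included by default in the text that gets embedded, so descriptive metadata directly influences the similarity score.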
thx, I'll try.

I also wondered whether it would solve my problem if I added a note to each passage:
the questions this excerpt can answer (a list of all possible questions).
Yes, if this approach works, then you can use the above notebook and let the LLM generate the questions.
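If you do want to try automating it, a rough sketch (the prompt and the helper are hypothetical; llm is the LlamaIndex OpenAI instance from the snippet above):

Plain Text
from llama_index import Document

def with_generated_questions(text: str) -> Document:
    # Ask the LLM which questions this passage answers (hypothetical prompt)
    resp = llm.complete(
        "List 3 questions, in Ukrainian, that the following passage answers:\n\n" + text
    )
    # Keep the questions in metadata so they are embedded together with the passage
    return Document(text=text, metadata={"questions_answered": resp.text})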
But I think the LLM will not do a good job (in Ukrainian).

And I'll have to do it manually
And if you add the queries after the text, it will surely increase the semantic match 🤝. Do add them in metadata: if the node gets bigger, it will be divided into N parts, and all of them will contain the same metadata.

If you add them as a note in the text instead and the node gets divided, they will end up only in the last part.
thx ❤️
@WhiteFang_Jr

I noticed a problem with the nodes. They are not separated correctly: one node can contain parts of different paragraphs.
How can I properly separate the paragraphs in the database so that the correct nodes are formed?
Here's an example.

The red box is node "A", and the green box is the next node, "B".
[Attachment: image.png]
If you want each node to correspond to one topic, you'll have to do that manually: prepare the data on your side, create Document objects from it, and pass them to VectorStoreIndex.
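A sketch of that manual route, assuming the knowledge base is one text file with a known separator between topics (the file name and separator are hypothetical):

Plain Text
from llama_index import Document, GPTVectorStoreIndex

with open("knowledge_base.txt", encoding="utf-8") as f:  # hypothetical file
    raw_text = f.read()

# One Document per topic, so each node maps cleanly onto one topic
sections = [s.strip() for s in raw_text.split("\n\n\n") if s.strip()]
documents = [Document(text=section) for section in sections]

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)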
Can I specify any special characters in the database text that will separate nodes?
No, I don't think that will work. LlamaIndex actually separates the information based on chunk size, which has a default value of 1024 tokens.
Oh, this is very bad news =(((
Can you point me to this code in the llamaindex library? Maybe I can override it and modify it?
You want the code for how LlamaIndex currently chunks the information and makes nodes, right?
https://github.com/run-llama/llama_index/blob/37f8421ff8131275dbe40837890971bb924b76e1/llama_index/node_parser/simple.py#L73

This is where node formation starts. You'll have to follow the trace to where the division actually takes place.
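Conceptually, chunk-size-based splitting works something like this simplified sketch (not LlamaIndex's actual code), which is why a node can start or end mid-paragraph:

Plain Text
def naive_chunk(tokens: list, chunk_size: int = 1024, overlap: int = 20) -> list:
    # Slide a fixed-size window over the token stream; the boundaries ignore
    # paragraphs entirely, so chunks can cut straight through them
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks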
great, thanks
I just added these lines and that's it - now I have a proper node. It's not "cropped".

How does it work?
[Attachments: image.png, image.png]
I mean the grey lines
Yeah, okay. It's not about the lines themselves, it's about the blank lines: if there are three "\n" characters in a row, then it takes only one paragraph.
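That matches the splitter's defaults: the sentence splitter in this generation of LlamaIndex uses three consecutive newlines as its paragraph separator. If you want to control this explicitly, something like the following should work (a sketch, assuming SentenceSplitter and SimpleNodeParser from the same llama_index version):

Plain Text
from llama_index import ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(
    chunk_size=1024,               # default chunk size in tokens
    paragraph_separator="\n\n\n",  # three newlines mark a paragraph boundary
)
node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
service_context = ServiceContext.from_defaults(
    llm=llm,  # the LLM defined earlier
    node_parser=node_parser,
)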