Are your queries pretty short? Do you have a lot of documents/nodes in your index?
Sounds a little weird. But I've noticed the embeddings for short queries (1-2 words) usually end up fetching the same nodes unless the query is highly specific to the documents in the index 🤔
My queries are generally goals e.g. get stronger/ be more productive/ sleep better
And my nodes are basically quotes about productivity/ success. There are about 20 word doc pages of them.
Almost always the same ones being returned though... And sometimes it won't even use nodes that contain the same word as the query..
See my query here.. maybe I am doing something embarrassing
I see! Some notes:
- try using
query = input(...).strip()
to get rid of any newlines in the user input (this actually has some effect on embeddings somehow lol)
- you don't need the mode argument for a vector index 👍
- you can try increasing
similarity_top_k
in your query. The default is 1 so it just fetches the 1 closest matching node
- if possible, it might work a little better to insert each quote as its own document. Then in your query, you can do something like
index.query("my query", similarity_top_k=10, response_mode="compact")
--- this will fetch the 10 closest quotes, and send as much of the matched quotes as possible in each LLM call to create the final answer
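To illustrate the "compact" idea: instead of one LLM call per matched quote, it packs as many quotes as fit into each call's context budget. This is a hypothetical sketch (a character budget instead of tokens, and not the actual llama_index internals):

```python
# Hypothetical sketch of the "compact" idea: group the retrieved quotes
# into as few batches as possible, where each batch becomes one LLM call.
# (Simplified stand-in, not llama_index's real implementation.)

def pack_quotes(quotes, budget=200):
    """Group quotes into batches whose total length stays under `budget`."""
    batches, current, used = [], [], 0
    for q in quotes:
        if current and used + len(q) > budget:
            batches.append(current)
            current, used = [], 0
        current.append(q)
        used += len(q)
    if current:
        batches.append(current)
    return batches

quotes = ["quote one " * 5, "quote two " * 5, "quote three " * 5]
batches = pack_quotes(quotes, budget=120)
print(len(batches))  # fewer LLM calls than quotes
```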
Ok super useful @Logan M - thank you. I will test & come back to you.
But for example I inputted - as a test - the following as a query: Should I aggressively avoid stimulants?
I inputted it because I know I have a quote in the documents specifically about this. From Sam Altman: {'Quotes': 'I have one big shot of espresso immediately when I wake up and one after lunch. I assume this is about 200mg total of caffeine per day. I tried a few other configurations; this was the one that worked by far the best. I otherwise aggressively avoid stimulants, but I will have more coffee if I’m super tired and really need to get something done.\n'}
The response I get is: Yes, it is still advisable to aggressively avoid stimulants. The context information suggests that the speaker has found that avoiding stimulants helps them to feel better and to be more productive, and that having good social skills and refining ideas can also be beneficial.
& that quote from Sam Altman is returned as a source node but only after many many many others which to me seem irrelevant e.g. "I look at like how I did on my to-do list for the year, and I write the one out for the next year. Because I know I’m going to do that every year, I stress about it less the other 364 days. The fundamental pattern that has always worked for me is take time, explore a lot of things, try a lot of things, try to have a beginner’s mind about what will work and what won’t work. But trust your intuitions, pursue a lot of things as cheaply and quickly as possible. Then be very honest with yourself about what’s working well and what’s not."
So I am slightly confused about how the similarity and context loading works..
Or it might be that I need to tweak (as you suggest) given that my use case is quotes..
Embeddings can be a bit of a pain to work with sometimes... was that test query using one doc per quote? Or you just put all your quotes in one (or a few) documents?
Maybe you'd have better luck with a keyword index?
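As a toy illustration of why keyword matching can win for short queries like these (hypothetical scoring, not the actual GPTKeywordTableIndex logic): rank quotes by how many query words they share, so a quote containing "aggressively avoid stimulants" beats a generically "productive-sounding" one:

```python
# Toy sketch: rank quotes by overlap with the query's keywords.
# (Hypothetical scoring function, not llama_index's keyword index.)

def keyword_score(query, text):
    stop = {"i", "a", "the", "should", "it", "is", "to", "and", "of", "but"}
    q_words = {w.strip("?.,").lower() for w in query.split()} - stop
    t_words = {w.strip("?.,").lower() for w in text.split()}
    return len(q_words & t_words)

quotes = [
    "I otherwise aggressively avoid stimulants, but I will have more coffee if I'm super tired.",
    "Take time, explore a lot of things, and trust your intuitions.",
]
query = "Should I aggressively avoid stimulants?"
best = max(quotes, key=lambda t: keyword_score(query, t))
print(best)
```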
That test query was with all quotes in one doc. I am loading them using the airtable loader.
The thing is I do want the summary of different quotes and I suppose it did at least return the right quote and form the answer based on it.
I am just confused as to exactly how the similarity works.
Thx so much for your help.
is the issue that I am not returning the actual source quotes? I'm trying via print(response.source_nodes), but maybe that is showing me something different?
I tried similarity_top_k=10 as you suggested and the source nodes still don't seem to match... very odd.
anyway, thank you so much
Printing the source nodes is correct yea.
Yea I suppose the embeddings are not working great for your quotes 😅 maybe a keyword index would be better tbh, assuming your goal is to type in a query and fetch related quotes?
Ha yes apparently...
So no the use case is actually to ask this corpus of quotes (which is currently 200 but ideally will be more) tips on productivity.
So I kind of want it to scan like 20-30 most relevant quotes and synthesise an answer
Thanks for your help yesterday @Logan M . I'm starting to think the issue is that the nodes are formatted as shown below.
Do you think somehow scraping just the quotes and separating them with lines would improve things?
[NodeWithScore(node=Node(text='(if possible... this author still working on it...), speaking publicly.\n"}}, {'id': 'recQMRSsnTXxKwT21', 'createdTime': '2023-01-21T17:47:05.000Z', 'fields': {'Quotes': '“He realised that just by being in Los Angeles he would be surrounded be the world’s top aeronautics thinkers. They could help him refine any ideas, and there would be plenty of recruits to join his next venture.”\xa0\n\n\n'}}, {'id': 'recQW22c9oSCHAbvj', 'createdTime': '2022-12-25T17:02:13.000Z', 'fields': {'Quotes': 'But the thing that matters almost more than anything in determining whether I’ll have a happy, satisfying day is this: no matter what time you get up, start the day with a real, sit-down breakfast.\n'}}, {'id': 'recRDsVenek5G2OEs', 'createdTime': '2022-12-12T20:34:27.000Z', 'fields': {'Quotes': 'I make sure to leave enough time in my schedule to think about what to work on. The best ways for me to do this are reading books, hanging out with interesting people, and spending time in nature.\n'}}
& presumably also the issue is that the nodes contain metadata about their relationship to one another and of course that is not mega relevant to my use case
+ (sorry for overload) is it definitely the case that what is in {context_str} is the source nodes?
--
from langchain.llms import OpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, QuestionAnswerPrompt

QA_PROMPT_TMPL = (
    "Based on the context information below, suggest ways to: {query_str}.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
)
QA_PROMPT = QuestionAnswerPrompt(QA_PROMPT_TMPL)

# build the predictor and index once, outside the loop
llm_predictor = LLMPredictor(llm=OpenAI(temperature=1, model_name="text-davinci-003"))
index = GPTSimpleVectorIndex(nodes, llm_predictor=llm_predictor)

while True:
    query_str = input("Tell human history your goal, and it will give you a plan.\n").strip()
    response = index.query(query_str, text_qa_template=QA_PROMPT, response_mode="default")
    print(response)
    print(response.source_nodes)
    print(" ")
Yea lots of extra data in there, might be a good idea to scrape the quotes! 🫡
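Something like this could pull just the quote text out of the airtable-style records (record and field names assumed from the node text pasted above), so each quote can go in as its own clean document:

```python
# Sketch: extract just the quote text from the airtable-style records
# (the 'fields' / 'Quotes' structure is assumed from the pasted node text),
# so each quote becomes its own document instead of one big blob.

records = [
    {"id": "recQW22c9oSCHAbvj", "fields": {"Quotes": "Start the day with a real, sit-down breakfast.\n"}},
    {"id": "recRDsVenek5G2OEs", "fields": {"Quotes": "Leave enough time in my schedule to think.\n"}},
    {"id": "recEmpty", "fields": {}},  # records with no quote get skipped
]

quote_texts = [
    r["fields"]["Quotes"].strip()
    for r in records
    if r.get("fields", {}).get("Quotes")
]
print(quote_texts)
# then e.g.: documents = [Document(t) for t in quote_texts]
```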
Yea, the node text goes into context_str!
But if the text is too big, it will get split a little more
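Roughly, the splitting works like this (a simplified stand-in for llama_index's text splitter, with sizes in characters instead of tokens): if a node's text won't fit in the space left in the prompt, it gets cut into smaller chunks that do fit.

```python
# Rough illustration of the splitting: break text on word boundaries
# into chunks no longer than chunk_size. (Simplified stand-in for the
# real text splitter; sizes here are characters, not tokens.)

def split_text(text, chunk_size):
    words, chunks, current = text.split(), [], ""
    for w in words:
        candidate = (current + " " + w).strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = w
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

big_node = "one quote " * 20  # ~200 characters of node text
chunks = split_text(big_node, chunk_size=80)
print(len(chunks), [len(c) for c in chunks])
```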