I think each blog being its own file makes sense. Let me explain the implications
Basically, each Document object inserted into llama-index gets broken into nodes according to the chunk_size (the default chunk_size is 1024). Each node will inherit any metadata from that document, as well as recording a link to the original document ID in node.ref_doc_id.
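As a rough pure-Python illustration of that behavior (not the actual llama-index splitter, which is token- and sentence-aware), the nodes end up looking something like this:

```python
# Hypothetical sketch of how a Document becomes nodes: each node keeps a
# window of the text, a copy of the document's metadata, and a back-link
# to the source document via ref_doc_id. This one naively slices on
# character count purely for illustration.
def split_into_nodes(doc_id, text, metadata, chunk_size=1024):
    nodes = []
    for start in range(0, len(text), chunk_size):
        nodes.append({
            "text": text[start:start + chunk_size],
            "metadata": dict(metadata),   # inherited from the document
            "ref_doc_id": doc_id,         # link back to the source document
        })
    return nodes

nodes = split_into_nodes("doc-1", "x" * 2500, {"file_name": "blog.txt"})
```

With 2500 characters and chunk_size=1024 you'd get three nodes, all carrying the same metadata and the same ref_doc_id.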
Filenames only matter if you are including them as metadata, or are setting them as the document ID.
In general, I think splitting the blogs into separate files makes sense
@Logan M So LlamaIndex treats each file as its own document automatically?
if they're all at the same level under a /data folder
Yup it does. Although some file types (like PDF) may get split into further documents.
It's also easy enough to write your own loader, if you need the flexibility
(I think you had written one earlier too!)
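A custom loader can be as small as a function that walks a folder and wraps each file in whatever document shape you need. A hypothetical sketch (the real llama-index loaders return Document objects; plain dicts are used here just to show the idea):

```python
from pathlib import Path

# Hypothetical minimal loader: one "document" per file, with the file
# name attached as metadata so you can filter on it or cite it later.
def load_folder(folder):
    docs = []
    for path in sorted(Path(folder).glob("*.txt")):
        docs.append({
            "doc_id": path.stem,
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"file_name": path.name},
        })
    return docs
```

The same pattern extends naturally if you later want to parse the Title/Description/Ingredients sections into separate metadata fields.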
LlamaIndex is still struggling to find the right information. So I have a list of supplements split into different files for now. Each file contains the supplement's title, description, and ingredients. The content of each file looks like this:
Title: supplement name
Description: supplement description and benefits and use cases
Ingredients: list of ingredients
And so far I've created 10 files for 10 products. I have the code set up like this:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo", max_tokens=500)
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=51200)
documents = SimpleDirectoryReader('data2').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(
    response_mode="tree_summarize",
    similarity_top_k=10,
)
response = query_engine.query("increase muscle mass")
However it keeps saying "based on the given context information, it cannot be determined which supplement, if any..."
Even though one of the products' descriptions explicitly states "increase muscle mass".
What am I not doing right? Or what are some areas of improvement so that LlamaIndex can find it? I've made the description very descriptive using ChatGPT so that it can be searched easily.
Product 1
Title: GAINZ
Description and Benefits:
This herbal blend called "GAINZ" is a potent mixture of herbs that targets a multitude of fitness and health goals. This concoction has been specially formulated to boost strength and aid in increasing muscle mass, making it an ideal supplement for fitness enthusiasts, athletes, or anyone looking to improve their physique.
Additionally, GAINZ is designed to prevent muscle atrophy, potentially helping individuals maintain muscle mass during periods of low physical activity or due to aging. It serves as an excellent pre-workout supplement, aiming to enhance endurance and stamina, thereby helping users achieve their workout goals more effectively.
Notably, GAINZ also doubles as a testosterone booster, which could further assist in muscle development and maintaining an active and vigorous lifestyle. Moreover, its anti-aging properties contribute to longevity, promoting a healthier, extended life.
Furthermore, GAINZ aims to stimulate protein synthesis, a vital process for muscle growth and recovery. Finally, to round out its fitness benefits, the blend also aims to alleviate muscle soreness, helping users recover faster after intense workouts.
This versatile herbal blend called "GAINZ", with its wide-ranging benefits, aims to enhance overall physical performance and increase muscle mass, serving as an all-in-one solution for fitness and health enthusiasts.
Ingredients:
The GAINZ blend's ingredients are currently confidential and not available to the public.
this is one of the files ^ however, LlamaIndex cannot find this when queried with "increase muscle mass"
it's pretty straightforward data, nothing too complex or deeply nested
I guess you are treating the query more like a Google search than an actual question? Maybe if you add some extra details to the query. Also not sure if I would use tree_summarize for this, but feel free to experiment with/without it.
query_preamble = "Given the provided product details, recommend a product that satisfies the following search query: "
response = query_engine.query(query_preamble + "increase muscle mass")
thanks! that did some magic! I will keep testing this out
it seems like it's hallucinating. It's recommending products outside of the knowledge base
I feel like it has a tendency to hallucinate the more I interact with it, even when I ask the same question. Like I just run it over and over
Maybe you have to be more specific in the prompt. Prompt engineering is fun
query_preamble = "Using the provided product details, and only those details, recommend a product (if any) that satisfies the following search query: "
response = query_engine.query(query_preamble + "increase muscle mass")
Are you using a chat engine? The base query engine has no connection/history between queries
no, just the default. But it does remember for some reason. Like it would say "my original answer still stands"
yeah when I run it 2 or more times, it would say "my original answer still stands"
or it would say "based on the new context"
I ran it twice and it said this for its latest answer:
"Given the new context, the original answer is not applicable. The provided product details do not include a specific product that targets muscle mass. Therefore, the original answer remains the same:
Original Answer: The product "GAINZ" would be a suitable recommendation for increasing muscle mass. It is a herbal blend specifically formulated to boost strength and aid in increasing muscle mass. It is designed as a pre-workout supplement to enhance endurance and stamina, helping users achieve their workout goals more effectively. Additionally, GAINZ doubles as a testosterone booster, further assisting in muscle development and maintaining an active lifestyle."
So what's happening here is actually under the hood, it's making more than one LLM call.
This is because when it fetches the top k pieces of text, all that text doesn't fit into a single LLM call. So, it splits it into chunks and gets an initial answer with the first chunk.
Then, it switches to a refine mode, where it presents new context and asks the LLM to either update or repeat its existing answer.
As you can see though, the LLM is... not following these instructions well, which is a little common for gpt-3.5
Lowering the top k might help avoid the refine process, or there are a few other response modes as well
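To see why lowering the top k helps, you can sketch the arithmetic: if the retrieved chunks don't all fit in one context window, the engine needs one initial call plus one refine call per extra batch. A toy illustration (the token numbers and packing are hypothetical; real token counting is model-specific):

```python
# Toy sketch: pack retrieved chunks greedily into prompts that fit the
# context window, and count LLM calls. One call per batch; every batch
# after the first is a "refine" call that may rewrite the answer.
def count_llm_calls(chunk_tokens, context_limit):
    calls, current = 1, 0
    for tokens in chunk_tokens:
        if current + tokens > context_limit:
            calls += 1          # overflow -> another (refine) call
            current = 0
        current += tokens
    return calls

# top_k=10 chunks of ~1000 tokens against a ~4000-token window
many = count_llm_calls([1000] * 10, 4000)
# top_k=3 fits in a single call, so no refine step at all
few = count_llm_calls([1000] * 3, 4000)
```

Fewer calls means no refine prompts for the model to mishandle, and a faster response.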
gotcha, I'll try out other LLM models
maybe switching to an older model
Okay, your suggestion of lowering the top k helped. I lowered it to 3
Awesome! It should be faster now too
how do I customize the output? Because it keeps saying "Please note that these recommendations are based solely on the provided product details and not on personal experience or scientific evidence. It is always recommended to consult with a healthcare professional before starting any new supplement or herbal blend." towards the end
I don't want it to say that
hmm, that might be because of OpenAI actually? Seems like a "safety" thing for certain topics
I wonder if there's just a way to detect that in the string and remove it haha
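If the disclaimer wording is fairly stable, a simple post-processing step on the response text can drop it. A sketch, assuming the boilerplate always starts with "Please note" and runs to the end of the answer:

```python
import re

# Strip a trailing boilerplate disclaimer from the model's answer.
# Assumes the disclaimer starts with "Please note" and continues to the
# end of the text; adjust the pattern if the model varies its wording.
def strip_disclaimer(text):
    return re.sub(r"\s*Please note.*$", "", text, flags=re.IGNORECASE | re.DOTALL).strip()

answer = ("GAINZ is a good fit for increasing muscle mass. "
          "Please note that these recommendations are based solely on the provided product details.")
cleaned = strip_disclaimer(answer)
```

This is brittle by nature (the model can rephrase), so prompt-level instructions and string cleanup work best together.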
I was able to do it this way:
query_preamble = "Given the provided product details, recommend several products from the provided data, and describe them, that satisfy the following search query or question: "
prompt = query_preamble + query.content
response = query_engine.query(prompt + ". Now output this data in a numbered list without including 'Product Name' and 'Description' keywords. Then add a summary at the end.")