At a glance
hey guys, I was wondering if anyone has recommendations for using LlamaIndex optimally to perform the following task:

  • Make a general query to a forum like Reddit
    • ex: “What’s the best way to lose weight?”
  • For post 1:
    • Collect all answers to the query (along with “upvote” statistics)
    • Combine similar answers and add up their upvotes
      • ex API response: { text: ‘Go on a diet’, upvotes: 10 }, { text: ‘Reduce caloric intake’, upvotes: 2 }, { text: ‘Exercise more’, upvotes: 4 }
      • Similar results get combined and the algorithm keeps a running count of the upvotes
        • ex: FINAL_RESULTS = { ‘Eat a healthy diet’: 12 }, { ‘Exercise’: 4 }
  • For posts 2…n:
    • Repeat, adding to the total FINAL_RESULTS set

The challenge I've had is to return accurate counts for FINAL_RESULTS and to have the program successfully merge like answers ("Go on a diet" and "Reduce caloric intake" should be merged into something like "Eat a healthy diet" above.)
27 comments
@Logan M this is a friend of mine. ❤️
My suggestion was to do all of the math and everything outside of the LLM: ask the LLM to categorize the texts for you and return a JSON document
i.e. if you upload posts 1...n to an index, ask the LLM to categorize all posts (in the same API format) into a JSON array, and then loop through that and add up the upvotes yourself
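(For illustration, a minimal sketch of the "LLM categorizes, you do the math" idea; the prompt wording and the OpenAI LLM import path are assumptions, not a fixed LlamaIndex API:)

Plain Text
import json
from collections import Counter

from llama_index.llms import OpenAI  # assumed import path for this era of llama_index

# hypothetical raw comments pulled from the forum API
comments = [
    {"text": "Go on a diet", "upvotes": 10},
    {"text": "Reduce caloric intake", "upvotes": 2},
    {"text": "Exercise more", "upvotes": 4},
]

llm = OpenAI(model="gpt-3.5-turbo")

# ask the LLM only for the categorization, as a JSON array of labels
prompt = (
    "Assign each comment a short category label, merging near-duplicates. "
    "Return ONLY a JSON array of labels, one per comment, in order.\n"
    + "\n".join(f"{i}: {c['text']}" for i, c in enumerate(comments))
)
labels = json.loads(llm.complete(prompt).text)

# do the math ourselves: sum upvotes per category
final_results = Counter()
for comment, label in zip(comments, labels):
    final_results[label] += comment["upvotes"]

print(dict(final_results))  # e.g. {'Eat a healthy diet': 12, 'Exercise': 4}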
Yea, definitely separating the math sounds like a good idea

the "similar answers" could be retrieved using embeddings, and then you process the scores manually, while asking the LLM to summarize the similar responses
how would you get similar answers w/ embeddings?
embed_model.get_text_embedding(...) and then use Weaviate / a vector store to do something like nearVector?
Could just construct a vector index on the fly, and retrieve all nodes above a score threshold
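(For the fully manual version, a sketch of scoring answers pairwise with embeddings; the OpenAIEmbedding import path is an assumption, and grouping the pairs above the threshold is left to the caller:)

Plain Text
import numpy as np

from llama_index.embeddings import OpenAIEmbedding  # assumed import path

embed_model = OpenAIEmbedding()

answers = ["Go on a diet", "Reduce caloric intake", "Exercise more"]
vecs = [np.array(embed_model.get_text_embedding(a)) for a in answers]

# cosine similarity between every pair; merge pairs above a chosen threshold (e.g. 0.8)
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        score = vecs[i] @ vecs[j] / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
        print(answers[i], "<->", answers[j], round(float(score), 3))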
i am reading more docs to get a sense of what you guys are talking about here.
Plain Text
from llama_index.vector_stores.types import VectorStoreQuery

# query_embedding and vector_store are assumed to already be defined
query_obj = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2)

query_result = vector_store.query(query_obj)
for similarity, node in zip(query_result.similarities, query_result.nodes):
    print(
        "\n----------------\n"
        f"[Node ID {node.node_id}] Similarity: {similarity}\n\n"
        f"{node.get_content(metadata_mode='all')}"
        "\n----------------\n\n"
    )


might be able to do something like this.
There's actually a postprocessor for this
https://gpt-index.readthedocs.io/en/stable/core_modules/query_modules/node_postprocessors/modules.html#similaritypostprocessor

So you could do

Plain Text
from llama_index import VectorStoreIndex
from llama_index.indices.postprocessor import SimilarityPostprocessor

postprocessor = SimilarityPostprocessor(similarity_cutoff=0.8)

# documents is assumed to be a list of Document objects built from the posts
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=50)
nodes = retriever.retrieve("query str")

filtered_nodes = postprocessor.postprocess_nodes(nodes)
Then feed that to a list index for summarization

Plain Text
from llama_index import ListIndex

index = ListIndex([x.node for x in filtered_nodes])
summary = index.as_query_engine(response_mode='tree_summarize', use_async=True).query("Summarize these reddit comments.")
print(str(summary))
ok I know I'm turning this into my thread, but this is a problem I'm trying to wrap my head around with vector indexes as well. For example, if his vector store index has 100 posts from Reddit and you only get the top_k=50 similar ones, that's an incomplete data set; what's the best way around that?
just set it to 10000000000 lol
is that a viable option lol
haha ok go crazy
would use a lot of memory I guess (since every node would be in memory)
but 🤷‍♂️
I started doing it last night but realized I was being silly: using a metadata_filter to exclude the node ids from my top_k=3 to produce some kind of paginating mechanism, how dumb is that?
Plain Text
from llama_index import Document, ListIndex, VectorStoreIndex
from llama_index.indices.postprocessor import SimilarityPostprocessor

query = "how to lose weight"

documents = [
    Document(text="exercise more"),
    Document(text="reduce caloric intake"),
    Document(text="go on a diet"),
    Document(text="eat less calories"),
    Document(text="Hire a nutritionist"),
    Document(text="Hit the gym"),
]

postprocessor = SimilarityPostprocessor(similarity_cutoff=0.8)

# build the vector index once over all the posts
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=50)
nodes = retriever.retrieve(query)

filtered_nodes = postprocessor.postprocess_nodes(nodes)

# summarize the surviving nodes with a list index
index = ListIndex([x.node for x in filtered_nodes])

summary = index.as_query_engine(response_mode="tree_summarize", use_async=True).query(
    "Summarize these reddit comments."
)
print(str(summary))


result: "These reddit comments suggest various ways to achieve weight loss or maintain a healthy lifestyle. The suggestions include eating fewer calories, going on a diet, reducing caloric intake, exercising more, hitting the gym, and hiring a nutritionist."
not quite what i was looking for. tried tuning some of the params. would there be a way to just get general common categories as text from a list of text inputs?

input: "eat less calories", "reduce caloric intake", "try not to eat so much"
output: "reduce calories"
Hmmm, you'd have to change the final query a bit

I.e. maybe something like "Given a set of reddit comments, reduce them to a single short, simple, key take-away"
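(Concretely, with the ListIndex from the example above, that would just change the final query; the rest of the pipeline stays the same:)

Plain Text
# same ListIndex as before, only the prompt changes
summary = index.as_query_engine(response_mode="tree_summarize", use_async=True).query(
    "Given a set of reddit comments, reduce them to a single short, simple, key take-away."
)
print(str(summary))  # ideally something like: "reduce calories"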
so i guess that's the thing - i've gotten great results with gpt and then llama index when it comes to condensing large data down into a general summary. But this is a more specific ask. Wasn't sure if it was easily doable with the existing tools
If you needed more structured control, you could try processing with a pydantic program, rather than a query engine
https://gpt-index.readthedocs.io/en/stable/examples/output_parsing/openai_pydantic_program.html
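(A rough sketch of what that could look like here, following the pattern from that notebook; the Categories schema and the prompt are made up for this use case:)

Plain Text
from typing import List

from pydantic import BaseModel
from llama_index.program import OpenAIPydanticProgram

# hypothetical output schema: one short category label per comment, in order,
# so the upvote math can still happen outside the LLM
class Categories(BaseModel):
    labels: List[str]

prompt_template_str = (
    "Assign each of the following reddit comments a short, general category "
    "label, merging near-duplicates into one label:\n{comments}"
)

program = OpenAIPydanticProgram.from_defaults(
    output_cls=Categories,
    prompt_template_str=prompt_template_str,
    verbose=True,
)

output = program(
    comments="eat less calories\nreduce caloric intake\ntry not to eat so much"
)
print(output.labels)  # e.g. ['reduce calories', 'reduce calories', 'reduce calories']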
ok, cool. i'll give that a shot