Relevant Topics

hey guys, I was wondering if any of you have ever thought about asking a model to return the most relevant/cited topics across its context. I tried, but it predictably failed, since the model couldn't correctly query the DB (ChromaDB in this case) for this specific task. Anyone working/worked on this who can share suggestions?

Thanks!
could you share more details? how did you try it, and what would you like to achieve?
hi Emanuel, I'm trying to design a way to submit prompts like:

Prompt: list the top 10 cited <elements such as words/topics/etc> in the provided documents
for instance, asking for the top 10 most used words in newspaper articles
I think this kind of information retrieval should somehow be supported at the database level, not only by the LLM
because obviously the plain model will just search for the most similar chunks in its context, without being able to retrieve what I'm looking for
idk, this is probably a weird and dumb question, but I struggled to find anything useful on this while diving into the documentation
also in the chromadb/pgvector/pinecone documentation at the query level
all the effort seems to be focused on similarity retrieval only
Yea this is less about retrieval, and more about prompt engineering no?
what kind of prompt engineering could you set up to ask for the frequency of words?
I struggle to come up with ideas in that sense
"List the most common topics in the provided context" πŸ€·β€β™‚οΈ

Something like a pydantic program may help here too, if you had something more structured in mind
https://gpt-index.readthedocs.io/en/stable/examples/output_parsing/openai_pydantic_program.html
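A rough sketch of what that could look like (untested, based on the linked docs; the imports and the `OpenAIPydanticProgram` API here match the 0.8.x-era LlamaIndex releases those docs describe, so adjust for your version):

```python
from typing import List

from pydantic import BaseModel
from llama_index.program import OpenAIPydanticProgram


class TopTopics(BaseModel):
    """Structured output: the most common topics in some text."""

    topics: List[str]


program = OpenAIPydanticProgram.from_defaults(
    output_cls=TopTopics,
    prompt_template_str=(
        "List the {top_n} most common topics in the following text:\n{text}"
    ),
)

# placeholder text -- in practice you'd pass your retrieved/concatenated context
result = program(top_n=10, text="<your retrieved context here>")
print(result.topics)
```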
thanks, I will dive into the article you provided!
regarding the prompt you suggested, it's kinda black-boxed, isn't it? I mean, what is going on behind it?
I mean, LLMs are black boxes to begin with πŸ˜… I don't think you'd want to hear a technical explanation of how LLMs and transformers work.

Basically, give them a prompt and some context, and it hopefully follows your instructions if the LLM is capable enough
yeah that's clear!
but I mean, isn't one of the purposes of frameworks like LlamaIndex to provide developers with tools to design less black-boxed apps with LLMs?
that's my point, but you are totally right
I just find vector DBs very limited in their query abilities at the moment, but that's probably due to their relative youth
by the way (last question, I swear!), if I feed my application thousands of newspaper articles and ask it what you said, "List the most common topics in the provided context", don't you think my costs will explode and the results will be poor without proper LlamaIndex tuning??
They definitely might! It comes down to the approach you want to take really. If you are using an LLM for this task, it needs to read every single piece of text in order to properly perform this task -- there's not really a way around that fact πŸ€”

However, you could look into using a smaller local model for the task, like an entity or keyword extractor (a few state-of-the-art models below)

Entities:
https://huggingface.co/tomaarsen/span-marker-mbert-base-multinerd

Keywords:
https://huggingface.co/tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec
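As a sketch, loading one of those models might look like this (assuming the `span-marker` package is installed; the exact keys in the prediction dicts may vary by version):

```python
from span_marker import SpanMarkerModel

# load the keyphrase model linked above (the entity model works the same way)
model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec"
)

text = "Vector databases rank document chunks by embedding similarity."
for pred in model.predict(text):
    # each prediction is a dict with the extracted span, its label, and a score
    print(pred["span"], pred["label"], pred["score"])
```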
exactly, processing every single piece of text is crazy. I was thinking about some approaches to handle this, but using smaller local task-specific models is definitely a thing!
In your opinion, should these kinds of model integrations be used during ingestion (i.e. metadata extraction) or during retrieval (i.e. augmentation strategies)?
I know I have a lot of questions, but I just find it all too incredible to just sit there and not ask ahaha
no worries! I think it depends on your use case as to where to use them

Using during ingestion for metadata extraction will help with retrieval

Using them during retrieval (like a node-postprocessor) can help modify retrieved chunks and change what gets sent to the LLM
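
For the retrieval side, a very rough sketch of a custom node-postprocessor that tags retrieved chunks with keywords from an external model (the duck-typed `postprocess_nodes` interface follows the usage-pattern docs linked below; `keyword_model` is a placeholder for e.g. the span-marker model above):

```python
from typing import List, Optional

from llama_index import QueryBundle
from llama_index.schema import NodeWithScore


class KeywordTagPostprocessor:
    """Tags each retrieved chunk with keywords before it reaches the LLM."""

    def __init__(self, keyword_model):
        self.keyword_model = keyword_model  # e.g. the span-marker model above

    def postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        for n in nodes:
            preds = self.keyword_model.predict(n.node.get_content())
            n.node.metadata["keywords"] = [p["span"] for p in preds]
        return nodes


# usage: query_engine = index.as_query_engine(
#     node_postprocessors=[KeywordTagPostprocessor(keyword_model)]
# )
```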
can you help me find the relevant pieces of LlamaIndex's documentation for understanding how to develop these kinds of data agents equipped with small local models?
I guess this feels less like a data-agent problem and more of a hardcoded pipeline problem right?

For example, node-postprocessor docs are here
https://gpt-index.readthedocs.io/en/stable/core_modules/query_modules/node_postprocessors/usage_pattern.html

And the docs on customizing metadata are here
https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/documents_and_nodes/usage_documents.html#advanced-metadata-customization
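
To illustrate the ingestion side, a minimal sketch of attaching extracted keywords as document metadata (the `Document` fields follow the metadata-customization docs above; `article_text` and `extracted_keywords` are placeholders):

```python
from llama_index import Document

article_text = "<full text of a newspaper article>"
extracted_keywords = ["inflation", "energy prices"]  # e.g. from the model above

doc = Document(
    text=article_text,
    metadata={"keywords": ", ".join(extracted_keywords)},
)

# by default metadata is visible to both the embedding model and the LLM;
# these lists let you hide specific keys from one or the other
doc.excluded_embed_metadata_keys = []
doc.excluded_llm_metadata_keys = []
```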
thanks! yea I mean, I was wondering how I can equip agents and/or other entities with, for instance, the models you linked here. Did you mean for them to be used not within LlamaIndex envs but as outside modules?
kinda like creating a tool with them for a data agent (that's where my question about data agents came from)
again, sorry for the dumb questions, I'm just trying to connect all of these pieces. Which, you know, are A LOT ahah
and thanks for the incredible support you're providing me
btw I probably need to keep diving
Yea so far the stuff I linked would lead you towards using it within the llamaindex env

For agents, you could setup a custom tool that performs whatever function you want (i.e. extracting keywords with some external model)

Examples of custom functions/tools for agents are here
https://gpt-index.readthedocs.io/en/stable/core_modules/agent_modules/tools/usage_pattern.html#using-with-our-agents
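
To make that concrete, a hedged sketch of wrapping the keyword extractor as an agent tool (`FunctionTool` and `OpenAIAgent` as in the linked usage-pattern docs, 0.8.x-era imports; untested):

```python
from span_marker import SpanMarkerModel
from llama_index.tools import FunctionTool
from llama_index.agent import OpenAIAgent

# the keyphrase model linked earlier in the thread
keyword_model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec"
)


def extract_keywords(text: str) -> str:
    """Extract the key phrases from a piece of text."""
    preds = keyword_model.predict(text)
    return ", ".join(p["span"] for p in preds)


keyword_tool = FunctionTool.from_defaults(fn=extract_keywords)
agent = OpenAIAgent.from_tools([keyword_tool], verbose=True)

print(agent.chat("Extract the keywords from this text: <some article text>"))
```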
It is definitely a LOT haha no worries