Hey everyone,
Keyword extractor question:

I am setting up a keyword extractor pipeline like so:
Plain Text
from llama_index.extractors import KeywordExtractor
from llama_index.ingestion import IngestionPipeline

# `llm` and the document list `d` are defined earlier in the script
extractors = [
    KeywordExtractor(keywords=5, llm=llm),
]

pipeline = IngestionPipeline(transformations=extractors)

# run the extractor over the first five documents and inspect one result
nodes = pipeline.run(documents=d[0:5])
nodes[1].metadata

Which ends up printing:
Plain Text
{'excerpt_keywords': "I'm sorry, but your request doesn't match the context provided. Could you please provide more information or clarify your question?"}

or
Plain Text
{'excerpt_keywords': "I'm sorry, but I don't have enough information to answer your question."}


I have tried changing the prompt in the KeywordExtractor definition, but it works for some documents and not for others.
I know for sure that the documents giving the above answers have content, and there is no meaningful difference between them and the ones that return actual keywords.

Not familiar enough with the code to know where the prompt might be going wrong here.
Any ideas / suggestions?

Will be glad to submit a PR if I can get some direction on where this might be fixed. I also noticed a TODO for the KeywordExtractor prompt that I'll look into as well.
oh very weird. What LLM are you using?

The default prompt is

Plain Text
template=f"""\
{{context_str}}. Give {self.keywords} unique keywords for this \
document. Format as comma separated. Keywords: """


Although it looks like this is not configurable at the moment
# TODO: figure out a good way to allow users to customize keyword template lol
it should just be a kwarg to customize imo, needs a PR to update
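A minimal sketch of what that kwarg could look like once a PR lands (`prompt_template` is hypothetical here, not a current argument; the placeholders mirror the default template above):

Plain Text
# hypothetical API -- prompt_template is NOT a real kwarg yet (see TODO above)
KeywordExtractor(
    keywords=5,
    llm=llm,
    prompt_template=(
        "{context_str}. Give {keywords} unique keywords for this "
        "document. Format as comma separated. Keywords: "
    ),
)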
on another note, I've been meaning to add a keyword extractor not based on an LLM, something like using this
https://huggingface.co/tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec
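A rough sketch of what that non-LLM extraction could look like with the linked model (assuming `pip install span-marker`; the example text is made up):

Plain Text
from span_marker import SpanMarkerModel

# load the keyphrase model linked above
model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-bert-base-uncased-keyphrase-inspec"
)

# predict() returns one dict per detected span, with "span" and "score" keys
entities = model.predict(
    "LlamaIndex ingestion pipelines run metadata extractors over documents."
)
keywords = [ent["span"] for ent in entities]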
@Logan M
Currently using GPT-4-32k through Azure, but I have tried with a local model as well, and sometimes it has really weird results. Can't figure out why for the life of me.
Sometimes it will even return that it "isn't allowed to help with that," which is interesting. (Based on the prompt above.)

I tried adjusting the prompt above as well, but no luck. Wondering if there's some artifact somewhere that is being included in the call and messing with the results?

Yeah, if I get some time to do it correctly, I will try to submit a PR for this so you guys can check it out. In the meantime, will update if I find a solution.
Yea, since it's using llm.predict() directly, the only things in the API call are the template and the context itself. I wonder if the context you are running over is causing the LLM to respond with "safe" answers.
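Roughly what that single call looks like per node (a sketch of the shape, not the exact extractor source):

Plain Text
from llama_index.prompts import PromptTemplate

# one user message per node: the template plus the node text, nothing else added
keywords = llm.predict(
    PromptTemplate(
        "{context_str}. Give 5 unique keywords for this document. "
        "Format as comma separated. Keywords: "
    ),
    context_str=node.get_content(),
)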
I would share the context here, but it's proprietary; after reviewing it, there's no obvious reason it would need a "safe" answer.
But I came to the same conclusion, lol, not sure what is going on there.

Appreciate your response!
Hmm yea pretty weird!

Yea no worries. My last piece of advice here is to just use gpt-3.5 instead of gpt-4 for this specific processing.

It will be a lot cheaper, and it might also be a little more flexible with responses?
Generally I save gpt-4 for cases that need complex reasoning and decision making
@Logan M
Lightbulb moment.
I attached a callback manager to check the prompting and figured it out.

So I set up a service context earlier in the script with the AzureOpenAI class because we are using an Azure account. Since I am using that same service context for the pipeline, the prompt I use as a replacement for the system prompt is being passed through to the keyword extractor as well. Shown here:
Plain Text
CBEvent(event_type=<CBEventType.LLM: 'llm'>, payload={<EventPayload.MESSAGES: 'messages'>: 
[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="You are a ...... my personal instructions.....", additional_kwargs={}), ChatMessage(role=<MessageRole.USER: 'user'>, 
content='    Give 5 unique keywords for this .... content I need keywords for ......
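For reference, a minimal sketch of the callback wiring that surfaces events like the one above (assuming the pre-0.10 `llama_index` package layout used in the rest of this thread):

Plain Text
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

# collect every event (including the extractor's LLM calls) for inspection
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
service_context = ServiceContext.from_defaults(
    llm=llm,  # the AzureOpenAI instance from earlier in the script
    callback_manager=CallbackManager([llama_debug]),
)

# after running the pipeline, dump the exact messages sent to the LLM
print(llama_debug.get_event_pairs(CBEventType.LLM))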


So I guess I have a twofold question for you, actually.

One: is there a way to pass the keyword extractor a service context? It doesn't look obvious from what I have seen, at least.
And two: is there a better way to edit the prompt I send with the QueryFusionRetriever than through the system prompt? It looked like it only accepted the query-gen prompt, but I might be missing something. I guess something more granular than the system prompt in the service context.
Edit: My custom instructions ask for things that aren't found in the content I am giving the LLM (expected behavior), and my prompt says not to answer in that case. That explains the "I can't help with that".

This seems like something someone else might potentially run into as well since it's not intuitive. That callback manager is a life saver πŸ™‚
hmmm interesting -- Is this some weird pass-by-reference bug happening?

I know you can attach a system prompt directly to an LLM. So if you pass that same LLM into the keyword extractor, it will use that system prompt

What if instead you setup the extractor like

Plain Text
extractors = [
    KeywordExtractor(keywords=5, llm=AzureOpenAI(...))
]
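On the second question above, a hedged sketch: QueryFusionRetriever takes a `query_gen_prompt` kwarg (a plain format string with `{num_queries}` and `{query}` placeholders), which is more granular than a service-context system prompt; the retriever list here is illustrative:

Plain Text
from llama_index.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [index.as_retriever()],  # assumes an existing `index`
    num_queries=4,
    query_gen_prompt=(
        "Generate {num_queries} search queries, one per line, "
        "related to the following input query:\nQuery: {query}\nQueries:\n"
    ),
)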
Yeah that fixed it. As usual, just not paying enough attention πŸ€¦πŸ»β€β™‚οΈ
Thanks for the help

Will work on the PR for the extractor prompt as a kwarg.
Awesome, that sounds great!