Oh awesome, thank you so much! Would I be able to create a custom one to, say, remove URLs before processing and then add them back in after the LLM generates a response?
For example, the text would be "blahblahbllahlink" and I'd either remove the actual URL and add it back in after, or replace the URL with a dummy URL to be mapped and added back in after
yea that's definitely possible. The input is a list of source nodes, and the output has to be a list of source nodes, but you can do anything you want in the actual function
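The replace-and-restore round trip can be sketched in plain Python first, independent of llama-index (the helper names `mask_urls`/`restore_urls` and the simplified regex are just assumptions for the sketch):

```python
import re

# Simplified URL regex for the sketch; a production pattern would be stricter.
URL_PATTERN = re.compile(r"https?://[^\s\"')]+")

def mask_urls(text: str, url_map: dict) -> str:
    """Replace each URL with a unique placeholder, recording the mapping."""
    def _sub(match: re.Match) -> str:
        # Index off the map size so placeholders stay unique across calls
        dummy = f"URLPLACEHOLDER{len(url_map)}"
        url_map[dummy] = match.group(0)
        return dummy
    return URL_PATTERN.sub(_sub, text)

def restore_urls(text: str, url_map: dict) -> str:
    """Swap the placeholders back for the original URLs."""
    for dummy, original in url_map.items():
        text = text.replace(dummy, original)
    return text

mapping: dict = {}
masked = mask_urls("see https://example.com/docs for details", mapping)
# masked == "see URLPLACEHOLDER0 for details"
restored = restore_urls(masked, mapping)
# restored == "see https://example.com/docs for details"
```

The same two steps would run on either side of the LLM call: mask the source-node text before it goes in, restore the placeholders in the generated response.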
Awesome thank you so much I will let you know how it goes
sorry, llama_index.indices.postprocessor does not have a BaseNodePostprocessor to import
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
that should be the import
I created a class that I believe should be working "import re
from typing import Dict, List, Optional

from llama_index import QueryBundle
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
from llama_index.schema import NodeWithScore


class URLReplacementPostprocessor(BaseNodePostprocessor):
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    dummy_template = "URLPLACEHOLDER{}"

    def __init__(self):
        # A dictionary mapping each dummy placeholder back to its original URL
        self.global_url_map: Dict[str, str] = {}

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        for n in nodes:
            # Find all URLs in the node content
            urls = self.url_pattern.findall(n.node.text)
            # Replace each URL with a dummy placeholder; index off the size of
            # the global map so placeholders stay unique across nodes
            for url in urls:
                dummy = self.dummy_template.format(len(self.global_url_map))
                n.node.text = n.node.text.replace(url, dummy)
                self.global_url_map[dummy] = url
        return nodes

    def restore_original_urls(self, text: str) -> str:
        """Replace the dummy placeholders in the provided text with the original URLs."""
        for dummy, original in self.global_url_map.items():
            text = text.replace(dummy, original)
        return text"
and it works when I apply the functions directly to the retrieved nodes, however when I try to call it like "node_postprocessor = URLReplacementPostprocessor()
chat = ContextChatEngine.from_defaults(
    service_context=service_context,
    retriever=Retriever(
        index=cast(VectorStoreIndex, index),
        embed_model=weighted_embed_model,
    ),
    node_postprocessors=[node_postprocessor],
)" and print out the source nodes, it does not seem to be working?
which version of llama-index are you on? I think node postprocessors were only added to the context chat engine rather recently
I can try updating and let you know but does this look correct?
At a high level it looks correct 🤔
ah okay thank you I am running it now with 8.2 and I will lyk
that seemed to do the trick thank you again!
Hi, I was running some more tests on the node postprocessor, and when inspecting the source nodes the URLs were replaced by the dummy URLs, however the actual response still contains the original URLs. How is this possible?
Also I am resetting the chat's memory after each iteration, so it should not have any recollection of the old URLs
that sounds extremely... impossible? 🤔
You could use the token counting handler to inspect each LLM input, to see what the actual text is looking like?
lol i can pm you the source nodes and response if you want to see, it is super weird and okay I will try that thank you so much!
lol I wanted to update you: I think the LLM is making up URLs, either based off the dummy URLs or based off the context, it's very weird
LOL niceeee 😅 Maybe you need to tweak how you insert the dummy URLs. Or give the LLM some explanation for the dummy placeholder in the system prompt?
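For the system-prompt route, a minimal sketch of a placeholder explanation (the wording below is only an assumption, not anything llama-index ships):

```python
# Hypothetical system-prompt addition; exact wording is illustrative only.
PLACEHOLDER_NOTE = (
    "Some URLs in the context have been replaced with placeholders of the "
    "form URLPLACEHOLDERn, e.g. URLPLACEHOLDER0. Copy any placeholder you "
    "cite verbatim into your answer and never invent or guess a real URL."
)
```

The app code would then run the response through restore_original_urls afterwards, so the final text shows real links again.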
Yes I'm gonna test it more tmrw and I'll keep u posted