Would it be possible to preprocess and post-process the sources the retriever retrieves before sending them to the LLM?
For sure! This is what node postprocessors are for

You can implement a custom post-processor to modify nodes before sending to the LLM
https://gpt-index.readthedocs.io/en/latest/core_modules/query_modules/node_postprocessors/usage_pattern.html#custom-node-postprocessor

index.as_query_engine(..., node_postprocessors=[DummyNodePostprocessor()])

There are a few pre-made ones as well

https://gpt-index.readthedocs.io/en/latest/core_modules/query_modules/node_postprocessors/modules.html
Oh awesome, thank you so much! Would I be able to create a custom one to, say, remove URLs before processing and then add them back in after the LLM generates a response?
For example, the text would be "blahblahbllahlink": either remove the actual URL and add it back in after, or replace the URL with a dummy URL to be mapped and added back in after.
yea that's definitely possible. The input is a list of source nodes, and the output has to be a list of source nodes, but you can do anything you want in the actual function
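The replace-then-restore idea discussed here can be sketched without any framework code. This is a minimal, illustrative sketch: the function names, the regex, and the `URLPLACEHOLDER{n}` format are assumptions for demonstration, not part of llama-index; the same logic can live inside a custom node postprocessor.

```python
import re

# Simple illustrative URL matcher (not the exact pattern used later in the thread).
URL_PATTERN = re.compile(r"https?://\S+")

def mask_urls(text: str, url_map: dict) -> str:
    """Swap each URL for a unique placeholder, recording the mapping."""
    def _swap(match):
        placeholder = "URLPLACEHOLDER{}".format(len(url_map))
        url_map[placeholder] = match.group(0)
        return placeholder
    return URL_PATTERN.sub(_swap, text)

def restore_urls(text: str, url_map: dict) -> str:
    """Put the original URLs back into generated text."""
    # Replace longer placeholders first so URLPLACEHOLDER1 never clobbers
    # the prefix of URLPLACEHOLDER10.
    for placeholder in sorted(url_map, key=len, reverse=True):
        text = text.replace(placeholder, url_map[placeholder])
    return text

url_map = {}
masked = mask_urls("Docs live at https://example.com/docs today", url_map)
restored = restore_urls(masked, url_map)
```

Numbering the placeholders off the size of the shared map (rather than restarting at 0 per text) keeps them unique across multiple source nodes, which matters for the restore step.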
Awesome thank you so much I will let you know how it goes
Sorry, `llama_index.indices.postprocessor` does not seem to have a base class to extend?
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
that should be the import
I created a class that I believe should be working:

"import re
from typing import Dict, List, Optional

from llama_index import QueryBundle
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
from llama_index.schema import NodeWithScore


class URLReplacementPostprocessor(BaseNodePostprocessor):
    # Matches http/https URLs
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    dummy_variable = "URLPLACEHOLDER{}"

    def __init__(self):
        # A dictionary to store original URLs keyed by their dummy representation
        self.global_url_map: Dict[str, str] = {}

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        for n in nodes:
            # Find all URLs in the node content
            urls = self.url_pattern.findall(n.node.text)

            # Replace each URL with a unique dummy variable and store the original URL.
            # Index off the map size so placeholders stay unique across nodes.
            for url in urls:
                dummy = self.dummy_variable.format(len(self.global_url_map))
                n.node.text = n.node.text.replace(url, dummy)
                self.global_url_map[dummy] = url

        return nodes

    def restore_original_urls(self, text: str) -> str:
        """Replace the dummy URLs in the provided text with the original URLs."""
        for dummy, original in self.global_url_map.items():
            text = text.replace(dummy, original)
        return text"
It works when I apply the functions directly to the retrieved nodes. However, when I call it like this and print out the source nodes, it does not seem to be working:

"node_postprocessor = URLReplacementPostprocessor()
chat = ContextChatEngine.from_defaults(
    service_context=service_context,
    retriever=Retriever(
        index=cast(VectorStoreIndex, index),
        embed_model=weighted_embed_model,
    ),
    node_postprocessors=[node_postprocessor],
)"
Which version of llama-index do you have? I think node postprocessors were only added to the context chat engine fairly recently.
I have version 0.8.0.
I can try updating and let you know, but does this look correct?
At a high level it looks correct πŸ€”
Ah okay, thank you. I am running it now with 0.8.2 and I will let you know.
that seemed to do the trick thank you again!
Hi, I was running some more tests on the node postprocessor. When inspecting the source nodes, the URLs were replaced by the dummy URLs; however, the actual response still contains the URLs. How is this possible?
Also, I am resetting the chat's memory after each iteration, so it should not have any recollection of the old URLs.
I have no idea lol
that sounds extremely... impossible? πŸ€”
You could use the token counting handler to inspect each LLM input, to see what the actual text is looking like?
lol I can PM you the source nodes and response if you want to see, it is super weird. Okay, I will try that, thank you so much!
lol I wanted to update you: I think the LLM is making up URLs, either based off the dummy URLs or based off the context. It's very weird.
LOL niceeee πŸ˜† Maybe need to tweak how you insert the dummy urls. Or give the LLM some explanation for the dummy placeholder in the system prompt?
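The tweak suggested here can be sketched as follows. This is an illustrative sketch only: the function name and placeholder scheme are assumptions. The idea is to make the placeholder itself look like a URL, using the reserved `.invalid` TLD (RFC 2606), so the model has less reason to hallucinate a realistic-looking link in its place.

```python
import re

# Simple illustrative URL matcher.
URL_PATTERN = re.compile(r"https?://\S+")

def mask_with_dummy_urls(text: str, url_map: dict) -> str:
    """Replace real URLs with URL-shaped placeholders the model should copy verbatim."""
    def _swap(match):
        # .invalid is reserved by RFC 2606, so these can never collide with real links.
        dummy = "https://placeholder.invalid/{}".format(len(url_map))
        url_map[dummy] = match.group(0)
        return dummy
    return URL_PATTERN.sub(_swap, text)
```

This pairs naturally with the other suggestion above: a system-prompt line such as "Links of the form https://placeholder.invalid/N must be copied exactly as written" tells the LLM what the placeholders mean.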
Yes, I'm gonna test it more tomorrow and I'll keep you posted.