Oh awesome, thank you so much! Would I be able to create a custom one to, say, remove URLs before processing and then add them back in after the LLM generates a response?
For example, the text would be "blahblahbllahlink" and I'd either remove the actual URL and add it back in after, or replace the URL with a dummy URL to be mapped and added back in after
yea that's definitely possible. The input is a list of source nodes, and the output has to be a list of source nodes, but you can do anything you want in the actual function
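The replace-and-restore round trip can be sketched in plain Python first, independent of llama-index (the helper names `mask_urls`/`restore_urls` and the simplified regex are just assumptions for the sketch):

```python
import re

# Simplified URL regex for the sketch; a production pattern would be stricter.
URL_PATTERN = re.compile(r"https?://[^\s\"')]+")

def mask_urls(text: str, url_map: dict) -> str:
    """Replace each URL with a unique placeholder, recording the mapping."""
    def _sub(match: re.Match) -> str:
        # Index off the map size so placeholders stay unique across calls
        dummy = f"URLPLACEHOLDER{len(url_map)}"
        url_map[dummy] = match.group(0)
        return dummy
    return URL_PATTERN.sub(_sub, text)

def restore_urls(text: str, url_map: dict) -> str:
    """Swap the placeholders back for the original URLs."""
    for dummy, original in url_map.items():
        text = text.replace(dummy, original)
    return text

mapping: dict = {}
masked = mask_urls("see https://example.com/docs for details", mapping)
# masked == "see URLPLACEHOLDER0 for details"
restored = restore_urls(masked, mapping)
# restored == "see https://example.com/docs for details"
```

The same two steps would run on either side of the LLM call: mask the source-node text before it goes in, restore the placeholders in the generated response.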
Awesome thank you so much I will let you know how it goes
sorry, llama_index.indices.postprocessor does not have a BaseNodePostprocessor to import
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
that should be the import
I created a class that I believe should be working "import re
from typing import Dict, List, Optional

from llama_index import QueryBundle
from llama_index.indices.postprocessor.types import BaseNodePostprocessor
from llama_index.schema import NodeWithScore


class URLReplacementPostprocessor(BaseNodePostprocessor):
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    dummy_template = "URLPLACEHOLDER{}"

    def __init__(self):
        # A dictionary mapping each dummy placeholder back to its original URL
        self.global_url_map: Dict[str, str] = {}

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        for n in nodes:
            # Find all URLs in the node content
            urls = self.url_pattern.findall(n.node.text)
            # Replace each URL with a dummy placeholder; index off the size of
            # the global map so placeholders stay unique across nodes
            for url in urls:
                dummy = self.dummy_template.format(len(self.global_url_map))
                n.node.text = n.node.text.replace(url, dummy)
                self.global_url_map[dummy] = url
        return nodes

    def restore_original_urls(self, text: str) -> str:
        """Replace the dummy placeholders in the provided text with the original URLs."""
        for dummy, original in self.global_url_map.items():
            text = text.replace(dummy, original)
        return text"
and it works when I apply the functions directly to the retrieved nodes, however when I try to call it like "node_postprocessor = URLReplacementPostprocessor()
chat = ContextChatEngine.from_defaults(
    service_context=service_context,
    retriever=Retriever(
        index=cast(VectorStoreIndex, index),
        embed_model=weighted_embed_model,
    ),
    node_postprocessors=[node_postprocessor],
)" and print out the source nodes, it does not seem to be working?
which version of llama-index are you on? I think node postprocessors were only added to the context chat engine rather recently
I can try updating and let you know but does this look correct?
At a high level it looks correct 🤔
ah okay thank you I am running it now with 8.2 and I will lyk
that seemed to do the trick thank you again!
Hi, I was running some more tests on the node postprocessor, and when inspecting the source nodes the URLs were replaced by the dummy URLs, however the actual response still contains the original URLs. How is this possible?
Also I am resetting the chat's memory after each iteration, so it should not have any recollection of the old URLs
that sounds extremely... impossible? 🤔
You could use the token counting handler to inspect each LLM input, to see what the actual text is looking like?
lol i can pm you the source nodes and response if you want to see, it is super weird and okay I will try that thank you so much!
lol I wanted to update you: I think the LLM is making up URLs, either based off the dummy URLs or based off the context, it's very weird
LOL niceeee 😅 Maybe you need to tweak how you insert the dummy URLs. Or give the LLM some explanation for the dummy placeholder in the system prompt?
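For the system-prompt route, a minimal sketch of a placeholder explanation (the wording below is only an assumption, not anything llama-index ships):

```python
# Hypothetical system-prompt addition; exact wording is illustrative only.
PLACEHOLDER_NOTE = (
    "Some URLs in the context have been replaced with placeholders of the "
    "form URLPLACEHOLDERn, e.g. URLPLACEHOLDER0. Copy any placeholder you "
    "cite verbatim into your answer and never invent or guess a real URL."
)
```

The app code would then run the response through restore_original_urls afterwards, so the final text shows real links again.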
Yes I'm gonna test it more tmrw and I'll keep u posted