Updated 2 months ago

Are there any tools in Llama_Index or Langchain that can be used for cleaning up nodes before embedding? I'm looking for something to help remove white space, special characters, etc.
12 comments
Isn't that as simple as just applying a regex to your text? 👀
text = re.sub(r"[^0-9A-Za-z ]", "", text)
something like that
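A slightly fuller sketch of that regex idea, in case it helps. The function name `clean_text` and the extra whitespace-collapsing step are assumptions added for illustration, not something from the thread. Note that this allowlist also drops punctuation and non-ASCII letters, so it may be too aggressive for some text.

```python
import re


def clean_text(text: str) -> str:
    """Collapse whitespace and drop characters outside a small allowlist."""
    # Turn any run of whitespace (newlines, tabs, repeated spaces) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop everything that is not an ASCII letter, digit, or space.
    text = re.sub(r"[^0-9A-Za-z ]", "", text)
    # Removing characters can leave doubled spaces behind, so collapse again.
    return re.sub(r" {2,}", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("Hello,\n\n  world!\t(v2.0)"))
```

If punctuation matters for your embeddings, widening the character class (for example adding `.,!?`) is probably a better starting point than stripping everything.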
yeah I should have been more clear about what I'm looking for haha. How do I do that with SimpleNodeParser or do I have to use a different node parser that gives me access to each node as it parses?
Or should I literally just iterate over every single node and update them after SimpleNodeParser returns the nodes?
I would just iterate after you get the parsed nodes yea 🙂
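Roughly what iterating after parsing could look like. This is a minimal sketch, not LlamaIndex code: `FakeNode` is a hypothetical stand-in that only assumes each parsed node exposes a writable `text` attribute, and `clean_nodes` and `strip_special` are made-up helper names. Dropping nodes that end up empty is an extra choice on my part, so skip that check if you need to keep every node.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a parsed node; only a writable `text` is assumed."""
    text: str


def strip_special(text: str) -> str:
    """Keep ASCII letters, digits, and spaces, then normalise the spacing."""
    kept = re.sub(r"[^0-9A-Za-z ]", "", text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(kept.split())


def clean_nodes(nodes: Iterable[FakeNode], cleaner: Callable[[str], str]) -> List[FakeNode]:
    """Apply `cleaner` to each node's text in place and drop nodes left empty."""
    cleaned = []
    for node in nodes:
        node.text = cleaner(node.text)
        if node.text:  # assumption: an empty node is not worth embedding
            cleaned.append(node)
    return cleaned


if __name__ == "__main__":
    parsed = [FakeNode("Hello, world!"), FakeNode("***"), FakeNode("  spaced   out  ")]
    for node in clean_nodes(parsed, strip_special):
        print(repr(node.text))
```

With real parsed nodes you would pass the list returned by the node parser in place of `parsed`, assuming the node objects let you assign to `text`.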
haha ok thanks!
maybe I'll make a PR to allow a custom parser to be passed to SimpleNodeParser
mmm maybe wait a day or two, something new coming out soon that will make this easier 🙂
ooooo can't wait!
I'll even add this as an example! 💪