
Updated 3 months ago

llama_index/llama-index-integrations/rea...

I'm playing around with the WholeSiteReader and, since I can't find anything in the code, I was wondering if anyone knows a way to filter out parts of a site. This doesn't seem to show anything for filters, but I'm hoping someone knows a way to add a filter through other means.
27 comments
nothing in that code to do it. what are you trying to filter out?
I'm trying to scrape a site, but I don't need every little variant of its pages, as a lot of the content repeats
I'm thinking I can track the 'visiting' part and, if a page is mostly a duplicate, skip it somehow
but I'm open to anything else
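One way to picture the "skip it if it's a duplicate" idea is to normalize each URL before deciding whether it has been seen. This is only a minimal sketch and not part of WholeSiteReader: `normalize_url` and `skip_duplicates` are hypothetical helper names, and dropping the query string is an assumption that would wrongly merge pages on sites where the query selects different content.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Collapse trivial variants of a URL into one key (hypothetical helper).

    Lowercases the scheme and host, drops the query string and fragment,
    and strips a trailing slash from the path.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def skip_duplicates(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each normalized key, in original order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key in seen:
            continue  # a variant of a page that was already queued
        seen.add(key)
        unique.append(url)
    return unique
```

This only catches URL-level variants. Pages with different URLs but near-identical content would need a content comparison instead.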
nothing in there that would let you do that. Just looks like it accepts:

Plain Text
   Attributes:
        prefix (str): URL prefix to focus the scraping.
        max_depth (int): Maximum depth for BFS algorithm.


where max_depth is how many link-to-link hops away from the starting base URL the crawl will go
default is 10, so if that's part of your issue you could pass a lower number
ah, I'm trying to filter out the shop part of the site
yeah, that won't help then. you will probably need to fork it or open a PR and add some kind of filtering method to it.

Can you paste an example? is it a tag?
yah, so the site is comma.ai and I want to keep its shop from being scraped, since it's just the shop and that info isn't needed
that's funny. i use comma ai and i was like huh... i'm now inside their discord lol
that threw me off for a second there
oh you use openpilot?!
yep, pre global 2019 subaru
I'm building this for the bot for FrogPilot discord server, I'm hoping in the future that other community discord servers will use it too, but I'm starting small
installer.comma.ai/jnewb1/subaru-preglobal-long (we did a bounty to get preglobal long support)
k. let me look at this scraper code, one sec. it might be easy to just add a filter
I was thinking about doing that and then making a PR. Seems like it'd make sense, but I don't know how they'd want to add a filter
i think you could do something like:

Plain Text
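from typing import List, Optional

def __init__(self, prefix: str, max_depth: int = 10, excluded_uris: Optional[List[str]] = None) -> None:
    # Keep the existing setup; store exclusions as a set (empty when nothing is passed).
    self.excluded_uris = set(excluded_uris) if excluded_uris is not None else set()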


Plain Text
reader = WholeSiteReader(prefix="https://comma.ai", max_depth=10, excluded_uris=["/shop"])


Plain Text
def uri_not_excluded(self, uri: str) -> bool:
    for excluded_uri in self.excluded_uris:
        if excluded_uri in uri:
            return False
    return True


Plain Text
if depth > self.max_depth or not self.uri_not_excluded(current_url):
    continue
i don't have a setup right now to test it, but i think something like that would work. You could make excluded_uris optional too, so the core behavior doesn't change for anyone else
default is set to None
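To sanity-check the idea without a browser setup, the pieces above can be put together as a standalone breadth-first crawl over an in-memory link map. This is only a rough sketch, not the WholeSiteReader code: `crawl`, `SITE`, and the example.com URLs are made up for illustration, and a dict lookup stands in for fetching a page and extracting its links.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional

# Toy link map standing in for real pages (all URLs are hypothetical).
SITE: Dict[str, List[str]] = {
    "https://example.com": [
        "https://example.com/docs",
        "https://example.com/shop",
        "https://other.example.org/page",
    ],
    "https://example.com/docs": ["https://example.com/docs/setup"],
    "https://example.com/shop": ["https://example.com/shop/item"],
}


def crawl(
    links: Dict[str, List[str]],
    base_url: str,
    prefix: str,
    max_depth: int = 10,
    excluded_uris: Optional[Iterable[str]] = None,
) -> List[str]:
    """Breadth-first crawl that skips any URL containing an excluded substring."""
    excluded = set(excluded_uris) if excluded_uris is not None else set()
    visited: List[str] = []
    added = {base_url}
    queue = deque([(base_url, 0)])
    while queue:
        current_url, depth = queue.popleft()
        # Same check as suggested above: too deep, or matches an exclusion.
        if depth > max_depth or any(part in current_url for part in excluded):
            continue
        visited.append(current_url)
        for href in links.get(current_url, []):
            # Only follow links under the prefix that have not been queued yet.
            if href.startswith(prefix) and href not in added:
                added.add(href)
                queue.append((href, depth + 1))
    return visited


pages = crawl(SITE, "https://example.com", "https://example.com", excluded_uris=["/shop"])
```

Because a skipped page is never expanded, everything reachable only through the shop is dropped as well. The match is a plain substring test, so "/shop" would also exclude a path like "/shopping"; a stricter version could compare path segments instead.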
if you ever wanna join a fork discord, hit me up, I'll invite you to frogpilot discord, where the bot is
Sounds good! let me know if it works