Scrapy - Offsite Request To Be Processed Based On A Regex

June 11, 2023

I have to crawl 5-6 domains. I want to write the crawler so that offsite requests are checked against a set of substrings, for example [aaa, bbb, ccc]: if the offsite URL conta...

Solution 1:

The offsite middleware already uses a regex internally, but it is not exposed. It compiles the domains you provide into a single regex, and the domains are escaped first, so putting regex syntax in allowed_domains will not work.

What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('')  # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)
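
To see why regex syntax in allowed_domains has no effect, here is a minimal standalone sketch that mirrors the pattern construction above. The helper name build_host_regex is hypothetical and not part of Scrapy; it only reproduces the escaping step for illustration.

```python
import re


def build_host_regex(allowed_domains):
    # Hypothetical helper: same pattern construction as get_host_regex above
    escaped = "|".join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


plain = build_host_regex(["example.com"])
print(bool(plain.search("sub.example.com")))  # subdomain of an allowed domain
print(bool(plain.search("example.org")))      # different domain

# A regex passed as a "domain" is escaped, so it is only matched literally
wildcard = build_host_regex([r".+\.com"])
print(bool(wildcard.search("example.com")))
```

The first check succeeds and the other two fail: the escaped pattern only matches a host that is literally spelled ".+\.com".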

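You can override it to return your own regex:

# middlewares.py
import re

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class MyOffsiteMiddleware(OffsiteMiddleware):
    def get_host_regex(self, spider):
        # An empty pattern matches every host, so nothing is filtered by default
        allowed_regex = getattr(spider, 'allowed_regex', '')
        return re.compile(allowed_regex)

# spiders/myspider.py
import scrapy


class MySpider(scrapy.Spider):
    allowed_regex = r'.+?\.com'

# settings.py
# The class above extends the spider middleware, so it is registered in
# SPIDER_MIDDLEWARES, and the built-in offsite middleware is disabled.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.MyOffsiteMiddleware': 666,
}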
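
For the substring case from the question, one possible way to build allowed_regex is sketched below. It assumes the substrings should be looked for in the request's hostname, since, as far as I know, the spider middleware tests the compiled regex against the hostname with search() rather than against the full URL. The helper substrings_to_regex and the sample hosts are hypothetical.

```python
import re


def substrings_to_regex(substrings):
    # Hypothetical helper: pattern that matches any host containing
    # at least one of the given substrings (each escaped literally)
    return "|".join(re.escape(s) for s in substrings)


allowed_regex = substrings_to_regex(["aaa", "bbb", "ccc"])
host_regex = re.compile(allowed_regex)

# Made-up hosts, checked the way the middleware is assumed to check them
for host in ["shop.aaa-store.com", "www.example.org", "bbb.net"]:
    print(host, bool(host_regex.search(host)))
```

The resulting string can be assigned to the spider's allowed_regex attribute. If the substrings need to match the path or query string instead, the hostname check itself would have to be changed, which this sketch does not cover.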
