Scrapy - Offsite Request To Be Processed Based On A Regex
I have to crawl 5-6 domains. I want to write the crawler so that offsite requests are handled based on whether they contain certain substrings, for example the set [aaa, bbb, ccc]: if the offsite url conta
Solution 1:
The offsite middleware already uses a regex by default, but it is not exposed. It compiles the domains you provide into a regex, and because each domain is escaped first, putting regex syntax in allowed_domains would not work.
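To see why escaping defeats a pattern placed in allowed_domains, here is a minimal sketch that rebuilds the same expression outside Scrapy, using the construction quoted from the middleware further down. The helper name build_host_regex and the hostnames are made up for illustration; only the standard re module is used.

```python
import re

def build_host_regex(allowed_domains):
    # Same construction as the middleware's pattern: every entry is escaped,
    # then joined into one alternation anchored to the whole hostname.
    joined = '|'.join(re.escape(d) for d in allowed_domains)
    return re.compile(r'^(.*\.)?(%s)$' % joined)

literal = build_host_regex(['example.com'])    # a plain domain
attempted = build_host_regex([r'.+?\.com'])    # someone trying to pass a regex

print(bool(literal.search('shop.example.com')))  # listed domain and its subdomains pass
print(bool(literal.search('notexample.com')))    # a different domain does not
print(bool(attempted.search('example.com')))     # the escaped "regex" only matches itself literally
```

The last check is the important one: after re.escape, the characters `.`, `+`, `?` and `\` are treated as literal text, so the entry no longer behaves like a pattern.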
What you can do, though, is extend that middleware and override its get_host_regex() method to implement your own offsite policy.
The original code in scrapy.spidermiddlewares.offsite.OffsiteMiddleware:
    def get_host_regex(self, spider):
        """Override this method to implement a different offsite policy"""
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
        return re.compile(regex)
You can just override it to return your own regex:
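    # middlewares.py
    import re
    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

    class MyOffsiteMiddleware(OffsiteMiddleware):
        def get_host_regex(self, spider):
            # Use the spider's own pattern instead of the escaped allowed_domains list
            allowed_regex = getattr(spider, 'allowed_regex', '')
            return re.compile(allowed_regex)

    # spiders/myspider.py
    import scrapy

    class MySpider(scrapy.Spider):
        allowed_regex = r'.+?\.com'

    # settings.py
    # The class above extends a spider middleware, so it is registered under
    # SPIDER_MIDDLEWARES, and the built-in offsite middleware is switched off.
    SPIDER_MIDDLEWARES = {
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
        'myproject.middlewares.MyOffsiteMiddleware': 666,
    }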
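To tie this back to the substrings in the question, the value assigned to allowed_regex could be generated from a list rather than written by hand. The sketch below is only an illustration: build_allowed_regex and the hostnames are hypothetical, and it assumes the middleware tests the compiled pattern with search() against the request's hostname (not the full URL), which is worth confirming against the Scrapy version in use.

```python
import re

def build_allowed_regex(substrings):
    """Return a pattern string that matches any text containing one of the substrings."""
    # re.escape keeps characters such as '.' or '-' inside a substring literal.
    return '|'.join(re.escape(s) for s in substrings)

# Substrings taken from the question; the hostnames below are invented.
ALLOWED_REGEX = build_allowed_regex(['aaa', 'bbb', 'ccc'])
host_regex = re.compile(ALLOWED_REGEX)

print(bool(host_regex.search('news.aaa-site.org')))  # contains "aaa"
print(bool(host_regex.search('unrelated.org')))      # contains none of the substrings
```

Note that an empty list produces an empty pattern, which matches every hostname, the same "allow all" behaviour as the original method when allowed_domains is unset.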