Scrapy - Linkextractor In Control Flow And Why It Doesn't Work
Solution 1:
I hate to answer my own question, but I think I figured it out. When I define the start_requests function, I might be overriding the rules behavior, so the rules didn't run. When I remove the __init__ and start_requests functions, the spider works as intended.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here
        yield item
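For what it's worth, the reason this fixes things: CrawlSpider applies its rules inside its built-in parse() callback, so any response that is routed to a different callback never goes through the rule machinery. If you do need a custom start_requests(), a minimal sketch (same spider as above) is to yield the initial requests without a callback, so they fall through to the default parse():

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def start_requests(self):
        # No callback argument: the request defaults to self.parse,
        # which is where CrawlSpider runs the rules. Passing
        # callback=self.parse_page here would bypass the rules.
        yield scrapy.Request('https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1')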
Solution 2:
It seems like your rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and override only start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute, so one of them is redundant in your case. Also, the __init__ definition is wrong. It should be like this:
def __init__(self, *args, **kwargs):
    super(YenibirisSpider, self).__init__(*args, **kwargs)
    ...
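Calling the parent __init__ matters here because CrawlSpider compiles its rules in its own __init__; skipping the super() call leaves the rules inert. A minimal sketch of a correct override, assuming (hypothetically) you want to build the start URL from a spider argument named query:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):
    name = 'yenibirisspider'

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def __init__(self, query='yazilim', *args, **kwargs):
        # CrawlSpider.__init__ calls _compile_rules(); without this
        # super() call the rules above would never be applied.
        super(YenibirisSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.yenibiris.com/is-ilanlari?q=%s&sayfa=1' % query]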
When are the LinkExtractors supposed to run in this script, and why are they not running?
A LinkExtractor extracts links from the corresponding response when that response is received.
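If you want to see exactly what a LinkExtractor would pull out of a given response, you can also call it directly. A small sketch (the extractor mirrors the one in the rule above):

from scrapy.linkextractors import LinkExtractor

def parse_page(self, response):
    le = LinkExtractor(allow=(r'.*&sayfa=\d+',))
    # extract_links() returns scrapy.link.Link objects for every
    # matching <a> element found in this response.
    for link in le.extract_links(response):
        self.logger.info(link.url)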
How can I make the spider follow the next pages, parse those pages, and parse the items in them with LinkExtractors?
The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for the webpage, and it should work as expected once you fix the mistakes in your code.
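You can sanity-check the pattern against a sample pagination URL with plain re (the URL below is just the one from start_urls with a different page number):

import re

url = 'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=2'
# LinkExtractor's allow patterns use re.search semantics,
# so this mirrors what the rule will test against each link.
print(bool(re.search(r'.*&sayfa=\d+', url)))  # True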
How can I implement the parse_page with the LinkExtractor?
I don't understand what you mean here.