
Scrapy - Linkextractor In Control Flow And Why It Doesn't Work

I'm trying to understand why my LinkExtractor doesn't work and when it actually runs in the crawl loop. This is the page I'm crawling. There are 25 listings on each page an…

Solution 1:

I hate to answer my own question, but I think I figured it out. By defining a start_requests method, I was overriding the rules behavior, so the rules never ran. When I remove the __init__ and start_requests methods, the spider works as intended.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YenibirisSpider(CrawlSpider):

    name = 'yenibirisspider'

    start_urls = [
        'https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'.*&sayfa=\d+',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Collect the detail-page URL of every listing on this results page.
        items = response.css('div.listViewRowsContainer div div div.jobTitleLnk a::attr(href)').getall()
        for item in items:
            yield scrapy.Request(
                url=item,
                method='GET',
                callback=self.parse_items
            )

    def parse_items(self, response):
        # crawling the item without any problem here
        yield item

Solution 2:

It seems like your Rule and LinkExtractor are correctly defined. However, I don't understand why you define both start_requests() and start_urls. If you don't override start_requests() and only set start_urls, the parent class's start_requests() generates requests for the URLs in the start_urls attribute, so one of them is redundant in your case. Also, your __init__ definition is wrong. It should be like this:

def __init__(self, *args, **kwargs):
    super(YenibirisSpider, self).__init__(*args, **kwargs)
    ...

When are the LinkExtractors supposed to run in this script, and why are they not running?

A LinkExtractor extracts links from the corresponding response as soon as that response is received.
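To make that control flow concrete, here is a plain-Python sketch (not Scrapy's actual source) of what CrawlSpider does with each Rule: every downloaded response is handed to the built-in parse callback, which runs each Rule's LinkExtractor over it and schedules the extracted links as new requests. The class names, the two-page site, and the URLs below are all made up for illustration.

```python
import re


class FakeLinkExtractor:
    """Stand-in for Scrapy's LinkExtractor: keeps hrefs matching `allow`."""

    def __init__(self, allow):
        self.allow = re.compile(allow)

    def extract_links(self, hrefs):
        return [h for h in hrefs if self.allow.search(h)]


def crawl_loop(start_urls, fetch, extractor):
    """Mimics the CrawlSpider loop: fetch a page, extract, enqueue."""
    seen = set(start_urls)
    queue = list(start_urls)
    while queue:
        url = queue.pop(0)
        hrefs = fetch(url)  # the downloader returns this page's links
        # The extractor runs once per received response, right here:
        for link in extractor.extract_links(hrefs):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical two-page site: page 1 links to page 2, page 2 is a dead end.
pages = {
    "https://example.com/jobs?q=x&sayfa=1": ["https://example.com/jobs?q=x&sayfa=2"],
    "https://example.com/jobs?q=x&sayfa=2": [],
}

visited = crawl_loop(
    ["https://example.com/jobs?q=x&sayfa=1"],
    pages.__getitem__,
    FakeLinkExtractor(r".*&sayfa=\d+"),
)
```

This is why overriding start_requests() with a custom callback breaks the rules: responses then bypass the built-in parse step where the extraction happens.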

How can I make the spider follow the next pages, parse them, and parse the items in them with LinkExtractors?

The regex .*&sayfa=\d+ in the LinkExtractor is appropriate for this webpage; it should work as expected once you fix the mistakes in your code.
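You can sanity-check the allow pattern on its own with the standard re module. As far as I know, LinkExtractor applies allow patterns with re.search semantics, so a plain search is a reasonable approximation; the page-number values in these URLs are illustrative.

```python
import re

# The same pattern used in the Rule's LinkExtractor above.
pattern = re.compile(r".*&sayfa=\d+")

paginated = "https://www.yenibiris.com/is-ilanlari?q=yazilim&sayfa=3"
first_page = "https://www.yenibiris.com/is-ilanlari?q=yazilim"

print(bool(pattern.search(paginated)))   # pagination link: matches
print(bool(pattern.search(first_page)))  # no &sayfa= parameter: no match
```

Note that the first page of results has no &sayfa= query parameter, so only the pagination links themselves are matched and followed.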

How can I implement the parse_page with the LinkExtractor?

I don't understand what you mean here.
