Can't Get Scrapy To Parse And Follow 301, 302 Redirects
Solution 1:
If you want to parse 301 and 302 responses, and follow them at the same time, ask for 301 and 302 to be processed by your callback and mimic the behavior of RedirectMiddleware.
Test 1 (not working)
Let's illustrate with a simple spider to start with (not working as you intend yet):
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        self.logger.info("got response for %r" % response.url)
Right now, the spider asks for 2 pages, and the 2nd one should redirect to http://example.com/
$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)
The 302 is handled by RedirectMiddleware automatically and it does not get passed to your callback.
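As an aside, that middleware's behavior is governed by the standard REDIRECT_ENABLED and REDIRECT_MAX_TIMES settings. Below is a minimal sketch of switching it off for a single spider (the spider name is hypothetical); note that even with the middleware disabled, 3xx responses still need handle_httpstatus_list (used in Test 2 below) to get past HttpErrorMiddleware:

import scrapy


class NoRedirectSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = "noredirect"
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    # standard Scrapy setting: disables RedirectMiddleware for this spider only
    custom_settings = {'REDIRECT_ENABLED': False}
    # without this, HttpErrorMiddleware would still filter the 302 out
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))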
Test 2 (still not quite right)
Let's configure the spider to handle 301s and 302s in the callback, using handle_httpstatus_list:
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
Let's run it:
$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)
Here, the 302 response reaches our callback, but the redirect is no longer followed: we're missing the redirection.
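(As a side note before the fix: if only some requests should see the raw 301/302, handle_httpstatus_list can also be passed per-request through Request.meta instead of as a spider attribute. A minimal sketch, with a hypothetical spider name:)

import scrapy


class MetaHandleSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = "meta-handle"

    def start_requests(self):
        # the same list, scoped to this single request via Request.meta
        yield scrapy.Request(
            'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
            meta={'handle_httpstatus_list': [301, 302]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))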
Test 3 (working)
Do the same as RedirectMiddleware but in the spider callback:
from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:
            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))
            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)
            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected
And run the spider again:
$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)
We got redirected to http://example.com/ and we also got the 302 response through our callback.
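If you are on Scrapy 1.4 or newer, the redirect-following part of the callback can be shortened with response.follow(), which joins a relative Location header against response.url for you. A minimal sketch under that version assumption; note it deliberately skips the 302 POST-to-GET conversion that the full version above mimics:

import scrapy


class FollowSpider(scrapy.Spider):
    # hypothetical shorter variant; requires Scrapy >= 1.4 for response.follow()
    name = "handle-follow"
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
        if 300 <= response.status < 400:
            # header values are bytes; decode them the same way RedirectMiddleware does
            location = response.headers['Location'].decode('latin1')
            # response.follow() resolves relative URLs; no POST->GET handling here
            yield response.follow(location, callback=self.parse)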