How To Add Proxies To Beautifulsoup Crawler
These are the definitions in the Python crawler:

```python
from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup
```
Solution 1:
Heads up that there is a less complex solution to this available now, shared here:
```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)
```
Then do your beautifulsoup as normal from the request response.
So if you want separate threads with different proxies, you just call different dictionary entries for each request (e.g. from a list of dicts).
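One way to sketch that per-request rotation (the proxy addresses below are placeholders, not real servers) is to cycle through a list of proxy dicts, handing the next one to each request:

```python
from itertools import cycle

# Hypothetical pool of proxy dicts; replace the addresses with your own.
proxy_pool = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]

# cycle() yields the dicts round-robin, so each request (or thread)
# can pull the next entry and pass it as requests.get(..., proxies=...)
rotation = cycle(proxy_pool)
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps back around to the first proxy
```

Each worker thread can simply call `next(rotation)` before its `requests.get()` call.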
This seems more straightforward to implement when your existing stack already uses requests and bs4, since it is just an extra keyword argument added to your existing requests.get() call. You don't have to initialize/install/open separate urllib2 handlers for each thread.
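Putting the pieces together, a minimal sketch (the proxy addresses and the `fetch_soup` helper are illustrative assumptions, not part of the original answer) looks like this:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder proxy addresses; substitute your own proxy servers.
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

def fetch_soup(url, proxies):
    """Fetch url through the given proxies and return a parsed soup."""
    resp = requests.get(url, proxies=proxies, timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

# Parsing is identical with or without a proxy; BeautifulSoup only
# ever sees the HTML text:
soup = BeautifulSoup("<html><title>Example</title></html>", "html.parser")
title = soup.title.string
```

The proxy only affects the transport step; everything after `resp.text` is ordinary BeautifulSoup.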
Solution 2:
Have a look at the example of BeautifulSoup using HTTP Proxy
http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
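That linked example predates Python 3 and uses urllib2's proxy handler; a rough modern equivalent (a sketch, assuming the standard-library `urllib.request` API and a placeholder proxy address) is:

```python
import urllib.request

# Placeholder proxy; ProxyHandler maps a scheme to a proxy URL.
proxy = urllib.request.ProxyHandler({"http": "http://10.10.1.10:3128"})
opener = urllib.request.build_opener(proxy)

# opener.open(url) would route the request through the proxy;
# you would then pass the response body to BeautifulSoup as usual.
```

Compared with Solution 1, this requires building a separate opener per proxy, which is the extra bookkeeping the requests approach avoids.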