
How To Add Proxies To Beautifulsoup Crawler

These are the definitions in the Python crawler:

```python
from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import
```

Solution 1:

Heads up that there is a less complex solution to this available now, shared here:

```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
```

Then parse the response with BeautifulSoup as usual.
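A minimal sketch of the parsing step. The HTML is inlined here so the snippet runs without a live proxy; in practice it would come from `response.text` of the `requests.get(..., proxies=proxies)` call above:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text from requests.get(..., proxies=proxies).
html = "<html><head><title>Example</title></head><body><a href='/next'>next</a></body></html>"

# Parse with the stdlib html.parser backend; lxml would also work if installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)       # the page title
print(soup.find("a")["href"])  # the first link's href
```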

So if you want separate threads using different proxies, just pass a different proxies dictionary with each request (e.g. drawn from a list of dicts).
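One way to sketch that rotation, assuming a hypothetical list of proxy addresses and round-robin assignment via `itertools.cycle` (the addresses and URLs below are placeholders):

```python
from itertools import cycle

# Hypothetical proxy addresses; substitute your own working proxies.
proxy_list = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]

# cycle() yields proxy dicts round-robin, one per request.
proxy_pool = cycle(proxy_list)

urls = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
assignments = [(url, next(proxy_pool)) for url in urls]

# Each worker would then call: requests.get(url, proxies=proxy)
for url, proxy in assignments:
    print(url, "->", proxy["http"])
```

With two proxies and three URLs, the third request wraps around to the first proxy again.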

This is more straightforward to implement when your existing code already uses requests and bs4, since it is just one extra keyword argument on your existing requests.get() call. You don't have to initialize, install, or open separate urllib2 handlers for each thread.

Solution 2:

Have a look at the example of BeautifulSoup using HTTP Proxy

http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
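For reference, a minimal sketch of the urllib2 ProxyHandler approach that style of example relies on, written here for Python 3, where urllib2 became urllib.request (the proxy address is a placeholder):

```python
import urllib.request  # urllib2 in the Python 2 crawler above

# Hypothetical proxy address; replace with a working proxy of your own.
proxy_handler = urllib.request.ProxyHandler({"http": "http://10.10.1.10:3128"})

# Build an opener that routes HTTP requests through the proxy.
opener = urllib.request.build_opener(proxy_handler)

# A real crawl would then do:
#   html = opener.open("http://example.org").read()
#   soup = BeautifulSoup(html, "html.parser")
print(type(opener).__name__)
```

Note that with this approach each thread needs its own opener if it is to use a different proxy, which is the extra setup the requests-based solution avoids.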
