Downloading A LOT Of Files Using Python
Is there a good way to download a lot of files en masse using python? This code is speedy enough for downloading about 100 or so files. But I need to download 300,000 files. Obviou
Solution 1:
The usual pattern with multiprocessing is to create a job() function that takes arguments and performs some potentially CPU-bound work.
Example (based on your code):

from multiprocessing import Pool
import urllib2

def job(url):
    # Save the file under the last path component of the URL.
    file_name = str(url.split('/')[-1])
    u = urllib2.urlopen(url)
    with open(file_name, 'wb') as f:
        f.write(u.read())

pool = Pool()
# flist comes from the question's code (not shown here).
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)

This will do a number of things:
- Create a multiprocessing pool with as many workers as you have CPUs or CPU cores.
- Create a list of inputs to the job() function.
- Map the list of inputs (urls) to job() and wait for all jobs to complete.
Python's multiprocessing.Pool.map will take care of splitting up your input across the number of workers in the pool.
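To make the splitting concrete, here is a minimal sketch in Python 3. It rests on a few assumptions that are not in the answer above: job() is a stand-in that only derives the file name (no network access), the example.invalid URLs are placeholders, and run_all() is a hypothetical helper. It also swaps in multiprocessing.pool.ThreadPool, which exposes the same map() API as Pool but uses threads, so the snippet runs as a plain script. The optional third argument to map() is chunksize, the number of inputs handed to a worker at a time.

```python
from multiprocessing.pool import ThreadPool


def job(url):
    # Stand-in for the real download: just derive the local file name.
    return url.split('/')[-1]


def run_all(urls, workers=4, chunksize=2):
    # ThreadPool has the same interface as Pool but runs jobs in threads.
    pool = ThreadPool(workers)
    try:
        # map() hands the inputs to the workers in chunks of `chunksize`
        # and blocks until every job has finished.
        return pool.map(job, urls, chunksize)
    finally:
        pool.close()
        pool.join()


# Placeholder URLs, for illustration only.
urls = ["ftp://example.invalid/dir/file{0}.txt".format(i) for i in range(10)]
names = run_all(urls)
```

map() returns the results in the same order as the inputs, regardless of which worker handled which chunk.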
Another neat little thing I've done for this kind of work is to use the progress library like this:

from multiprocessing import Pool
from progress.bar import Bar

def job(input):
    # do some work
    pass

pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
# imap yields results one at a time, so the bar can advance per job.
for i in pool.imap(job, inputs):
    bar.next()
bar.finish()

This gives you a progress bar on your console while the jobs run, so you have some idea of progress, ETA, etc.
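If installing an extra package is not an option, the same loop can be approximated with a plain counter. The sketch below is Python 3 and makes several assumptions: job() is a hypothetical stand-in that simulates one failure instead of downloading anything, run_with_progress() is an illustrative helper, and ThreadPool is used in place of Pool (same API, threads instead of processes). It uses imap_unordered(), which yields each result as soon as it is ready, and has job() return errors rather than raise them so one bad input does not stop the loop.

```python
import sys
from multiprocessing.pool import ThreadPool


def job(n):
    # Hypothetical stand-in for a download: input 3 fails on purpose.
    try:
        if n == 3:
            raise OSError("simulated failure")
        return n, None
    except OSError as exc:
        # Report the error to the caller instead of raising it.
        return n, str(exc)


def run_with_progress(inputs, workers=4):
    done, failed = 0, []
    pool = ThreadPool(workers)
    try:
        # Results arrive in completion order, not input order.
        for item, error in pool.imap_unordered(job, inputs):
            done += 1
            if error is not None:
                failed.append(item)
            # Rewrite the same console line with the running count.
            sys.stderr.write("\r{0}/{1} done".format(done, len(inputs)))
        sys.stderr.write("\n")
    finally:
        pool.close()
        pool.join()
    return done, failed


done, failed = run_with_progress(list(range(10)))
```

The failed list can then be fed back into a second pass to retry only the inputs that did not succeed.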
I also find the requests library very useful here; it offers a much nicer API for dealing with web resources and downloading content.
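As a rough sketch of what a requests-based job could look like (Python 3): note that requests handles HTTP and HTTPS only, not the ftp:// URLs used in the example above, so this assumes the files are also reachable over HTTP(S). The download() helper, its dest_dir and chunk_size parameters, and the 30-second timeout are illustrative choices, not part of the original answer.

```python
import os

import requests


def download(url, dest_dir=".", session=None, chunk_size=64 * 1024):
    # Reuse a Session when one is passed in so connections are pooled.
    if session is None:
        session = requests.Session()
    file_name = url.split("/")[-1]
    path = os.path.join(dest_dir, file_name)
    # stream=True avoids loading the whole body into memory at once.
    with session.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()  # turn HTTP 4xx/5xx into an exception
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path
```

A function like this can be passed to pool.map() or pool.imap() in place of job(), since it takes the URL as its first argument.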