Download .xls Files From A Webpage Using Python And Beautifulsoup
Solution 1:
The issues with your script as it stands are:
- The url has a trailing /, which gives an invalid page when requested, one that does not list the files you want to download.
- The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in the linked document.
- You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and do not need quoting.
- The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided (see the sketch after this list).
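As a minimal sketch of that last point, catching the specific exception (urllib.error.HTTPError here, with a hypothetical URL and filename used purely for illustration) keeps unexpected errors visible instead of silently swallowing them:

from urllib.error import HTTPError
from urllib.request import urlretrieve

# Hypothetical URL and filename, for illustration only
href = 'https://example.com/data.xls'
try:
    urlretrieve(href, 'data.xls')
except HTTPError as err:
    # Handle only the failure we expect; anything else still propagates
    print("Download failed with HTTP status %d for %s" % (err.code, href))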
A modified version of your code that will get the correct files and attempt to download them is as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
# Select all A elements whose href attribute starts with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser.
At first I thought this was a referral check (to prevent hotlinking), but if you watch the request in your browser (e.g. in the Chrome Developer Tools) you'll notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is joined to the filename using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need an https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
Solution 2:
I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:
import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer
from urlparse import urljoin  # Python 2.7 (urllib.parse on Python 3)

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']

response = requests.get(url)
# Parse only the anchor tags
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and any(t in link['href'] for t in file_types):
        # urljoin copes with both absolute and relative hrefs,
        # unlike simple string concatenation
        full_path = urljoin(url, link['href'])
        wget.download(full_path)
Solution 3:
I tried the code above and it still gives me urllib.error.HTTPError: HTTP Error 403: Forbidden.
I also tried adding a user agent; my modified code is below:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import Request, urlopen, urlretrieve

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)
#URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = r'E:\python\out'  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need an https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
Solution 4:
This worked best for me ... using Python 3:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need an https:// URL for this site
    href = href.replace('http://', 'https://')
    try:
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
    except HTTPError as err:
        # Skip dead links, but re-raise anything unexpected
        if err.code == 404:
            continue
        raise