
Extracting Src From Beautifulsoup Tag

I was trying to scrape Newegg for product name, description, price and image using BeautifulSoup. I have the following bs4.element.Tag and I want to extract the 'src' link from ta

Solution 1:

The src is in the img tag:

from bs4 import BeautifulSoup
tag = """<a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&amp;cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>"""

soup = BeautifulSoup(tag,"lxml")

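# soup.img is the first <img> in the parsed markup; index it like a dict to read an attribute.
src = soup.img["src"]
print(src)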

Which will give you:

http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg
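If a page holds several product links, or an img tag turns out to have no src attribute, indexing with ["src"] raises a KeyError. The following is a minimal sketch of one way to handle that, not part of the original answer. It assumes the built-in "html.parser" instead of lxml, and the markup is a shortened, hypothetical snippet modelled on the tag above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup modelled on the tag above; the second <img> has no src.
html = """
<a class="itemImage" href="#"><img alt="first" src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"/></a>
<a class="itemImage" href="#"><img alt="second"/></a>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag.get returns None instead of raising KeyError when the attribute is missing.
candidates = [img.get("src") for img in soup.find_all("img")]

# Keep only the images that actually carry a src.
srcs = [url for url in candidates if url]
print(srcs)
```

If you already hold a bs4.element.Tag for the a element, the same attribute access should work on it directly, for example tag.img.get("src").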

Solution 2:

Try regular expressions in Python. Reference: https://docs.python.org/2/library/re.html

import re
s = """
    <a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&amp;cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>"""
# Raw string so that \s reaches the regex engine without an invalid-escape warning.
src_list = re.findall(r"src=[^\s]*", s)

Output:

src_list = ['src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"']
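Note that each match still includes the src= prefix and the surrounding quotes. If only the URL is wanted, a capture group can isolate it. The sketch below uses a shortened, hypothetical version of the string above and assumes the attribute value is quoted and contains no quote characters of its own. A regex is fragile on real HTML, so treat this as an illustration, not a robust parser.

```python
import re

# Shortened, hypothetical version of the markup above.
s = '<a class="itemImage" href="#">\n<img alt="example" src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title="example"/>\n</a>'

# With a single capture group, re.findall returns only the captured text,
# here whatever sits between the quotes that follow src=.
urls = re.findall(r'src=["\']([^"\']+)["\']', s)
print(urls)
```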
