Python Beautifulsoup Extract Specific Urls
Is it possible to get only specific URLs? Like: next...
Solution 1:
Pass a compiled regular expression as the href filter:
soup.find_all('a', href=re.compile(r'http://www\.iwashere\.com/'))
which matches (for your example):
[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]
so any <a> tag with an href attribute whose value contains the string http://www.iwashere.com/. Beautiful Soup applies the pattern with search(), so prefix it with ^ if the value must start with that string.
You can loop over the results and pick out just the href attribute:
>>> for elem in soup.find_all('a', href=re.compile(r'http://www\.iwashere\.com/')):
...     print(elem['href'])
...
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
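For reference, here is a minimal self-contained sketch of the approach above. It assumes the bs4 package is installed and uses the built-in html.parser backend; the sample markup is borrowed from the gazpacho answer further down, and the ^ anchor is added here so that only values starting with the prefix match.

```python
import re

from bs4 import BeautifulSoup

# Sample markup (taken from the gazpacho example in Solution 3).
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
<a href="http://www.iwashere.com/wasnot.html">next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor with ^ so the href must start with the prefix, not merely contain it.
pattern = re.compile(r"^http://www\.iwashere\.com/")

# Collect only the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=pattern)]
print(urls)
```

Tags without an href attribute are skipped automatically, because the filter only matches when the attribute exists and the pattern is found in its value.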
To match all relative paths instead, use a negative look-ahead assertion that tests whether the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes must be a relative path:
soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
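To see what that look-ahead accepts and rejects, the pattern can be tried on plain strings with the standard re module alone. This is only a sketch; the sample href values below are made up for illustration.

```python
import re

# Same pattern as above: succeed only if the value does NOT start with
# a URL scheme ("http:", "mailto:", ...) or a protocol-relative "//".
RELATIVE_HREF = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values, chosen to cover each case.
samples = [
    "page.html",
    "/docs/index.html",
    "../up.html",
    "http://www.iwashere.com/washere.html",
    "mailto:someone@example.com",
    "//hostname/path",
]

relative = [href for href in samples if RELATIVE_HREF.search(href)]
print(relative)  # ['page.html', '/docs/index.html', '../up.html']
```

Note that the pattern is a zero-width test, so it also accepts an empty href and fragment-only values such as #top; filter those out separately if they are not wanted.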
Solution 2:
If you're using BeautifulSoup 4.0.0 or greater, you can use a CSS attribute selector, where ^= means "starts with":
soup.select('a[href^="http://www.iwashere.com/"]')
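A short runnable sketch of the selector approach, again assuming bs4 is installed and reusing the sample markup from the gazpacho answer below. No regular expression or escaping is needed, since the selector compares the prefix literally.

```python
from bs4 import BeautifulSoup

# Sample markup (taken from the gazpacho example in Solution 3).
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
<a href="http://www.iwashere.com/wasnot.html">next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] selects elements whose href starts with the given prefix.
links = soup.select('a[href^="http://www.iwashere.com/"]')
selected_urls = [a["href"] for a in links]
print(selected_urls)
```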
Solution 3:
You could solve this with partial matching in gazpacho:
Input:
html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""
Code:
from gazpacho import Soup
soup = Soup(html)
links = soup.find('a', {'href': "http://www.iwashere.com/"}, partial=True)
[link.attrs['href'] for link in links]
Which will output:
# ['http://www.iwashere.com/washere.html', 'http://www.iwashere.com/wasnot.html']