Why Is HTML Returned By Requests Different From The Real Page HTML?
I'm trying to scrape a webpage to get some data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio, the problem
Solution 1:
This happens because the page builds its DOM elements with JavaScript after the initial load. requests only downloads the HTML the server sends and never executes scripts, so the elements you see in the browser are not in its response. Instead, use Selenium with a webdriver and wait for the elements to be created before scraping.
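To make the difference concrete, here is a minimal sketch that uses only the standard library. The two HTML strings are hypothetical stand-ins, not the real eToro markup: STATIC_HTML imitates the kind of empty shell a plain HTTP request might return, and RENDERED_HTML imitates the DOM after the scripts have run. ClassFinder and find_text_by_class are illustrative helpers, and the parser assumes well-formed, paired tags.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1
            if self.depth == 1:
                self.matches.append("")  # start collecting text for a new match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def find_text_by_class(html, class_name):
    """Return the stripped text of each element that has class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return [text.strip() for text in finder.matches]


# Hypothetical shell, like what a script-free HTTP client would receive.
STATIC_HTML = (
    '<html><body><div id="app"></div>'
    '<script src="bundle.js"></script></body></html>'
)
# Hypothetical DOM after the browser has executed the page's JavaScript.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<p class="i-portfolio-table-hat-fullname">Facebook</p>'
    '<p class="i-portfolio-table-hat-fullname">Apple</p>'
    '</div></body></html>'
)

CLASS_NAME = "i-portfolio-table-hat-fullname"
static_names = find_text_by_class(STATIC_HTML, CLASS_NAME)
rendered_names = find_text_by_class(RENDERED_HTML, CLASS_NAME)
print(static_names)    # nothing to find before the scripts run
print(rendered_names)  # the names exist only in the rendered DOM
```

Running the same check against the text returned by requests for a real page is a quick way to tell whether the data is in the static HTML or is added later by JavaScript.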
Download the ChromeDriver executable that matches your installed Chrome version. If you place it in the same folder as your script, you can run:
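from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
chrome_options.add_argument("--headless")
chrome_driver = os.path.join(os.getcwd(), "chromedriver.exe")  # CHANGE THIS IF NOT SAME FOLDER
# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, pass executable_path=chrome_driver instead.
driver = webdriver.Chrome(service=Service(chrome_driver), options=chrome_options)
url = 'https://www.etoro.com/people/sparkliang/portfolio'
try:
    driver.get(url)
    # Wait up to 20 seconds for the JavaScript-rendered elements to appear
    jobs = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'p.i-portfolio-table-hat-fullname'))
    )
    html_text = driver.page_source  # rendered HTML, read only after the wait
    for job in jobs:
        print(job.text)
finally:
    driver.quit()  # always close the browser, even if the wait times out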
Here we use Selenium's WebDriverWait together with expected_conditions (imported as EC) to ensure that all the elements exist before we try to scrape the information we're looking for.
Outputs
Facebook
Apple
Walt Disney
Alibaba
JD.com
Mastercard
...