
Can't Find The Right Way To Grab Part Numbers From A Webpage Using Requests

I'm trying to write a script that parses the part numbers from a webpage using requests. If you open this link and click the Product list tab, you will see the part numbers.

Solution 1:

The hard part is getting the driver to click the 'Product list' button reliably. Here is a solution that repeats the click until the part numbers appear:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium import webdriver
import time

class NoPartsNumberException(Exception):
    pass

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)


driver.get("https://www.festo.com/cat/en-id_id/products_ADNH")
wait.until(ec.frame_to_be_available_and_switch_to_it(wait.until(ec.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
driver.switch_to.default_content()
wait.until(ec.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@name='CamosIF']")))

# Retry for at most 30 seconds: the page re-renders, so the click may need repeating.
endtime = time.time() + 30
while True:
    try:
        if time.time() > endtime:
            raise NoPartsNumberException('No parts number found')
        product_list = wait.until(ec.element_to_be_clickable((By.XPATH, "//div[@id='f24']")))
        product_list.click()
        part_numbers_elements = wait.until(ec.visibility_of_all_elements_located((By.XPATH, "//div[contains(@id, 'v471')]")))
        break
    except (TimeoutException, StaleElementReferenceException):
        # The element went stale or did not show up in time; try again.
        pass

part_numbers = [p.text for p in part_numbers_elements[1:]]
print(part_numbers)

# quit() ends the whole session; close() would only close the current window.
driver.quit()

This way the driver keeps clicking the 'Product list' button until the panel containing the part numbers opens, so you usually wait far less than the hardcoded 10-second sleep in your code.
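The retry loop above can also be factored into a small reusable helper. The following is only a sketch under stated assumptions: `retry_until` and its parameters are hypothetical names, not part of Selenium, and unlike the loop above it re-raises the last error once the deadline passes instead of raising a custom exception.

```python
import time


def retry_until(action, timeout=30.0, retry_on=(Exception,), pause=0.0):
    """Call action() until it succeeds or the deadline passes.

    Hypothetical helper: exceptions listed in retry_on trigger another
    attempt; after the deadline the last such exception is re-raised.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except retry_on:
            if time.monotonic() >= deadline:
                raise
            time.sleep(pause)


# Browser-free demonstration: the action fails twice, then succeeds.
attempts = {"count": 0}


def flaky_action():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("not ready yet")
    return "part numbers found"


outcome = retry_until(flaky_action, timeout=5, retry_on=(ValueError,))
print(outcome, attempts["count"])
```

With Selenium, `action` would be a function that clicks the tab and waits for the part numbers, and `retry_on` would be `(TimeoutException, StaleElementReferenceException)`.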


Solution 2:

To grab part numbers from the webpage using Selenium you need to:

  • Induce WebDriverWait for the object frame to be available and switch to it.

  • Induce WebDriverWait for the desired element to be clickable and click Accept all cookies.

  • Switch back to the default_content()

  • Induce WebDriverWait for the desired frame to be available and switch to it.

  • Induce WebDriverWait for the staleness_of() of the stale element.

  • Click the tab with the text Product list using execute_script().

  • You can use the following Locator Strategies:

    driver.get('https://www.festo.com/cat/en-id_id/products_ADNH')
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME,"object")))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.btn.btn-primary#accept-all-cookies"))).click()
    driver.switch_to.default_content()
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe#CamosIFId")))
    WebDriverWait(driver, 20).until(EC.staleness_of(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Product list']")))))
    driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[text()='Product list']"))))
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='ah']/img//following::div[2]")))])
    driver.quit()
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    ['539691', '539692', '539693', '539694']
    


Solution 3:

I believe you have covered the iframe and WebDriverWait concept well.

The site seems to re-render its content a few times before the right element can actually be retrieved and clicked, which is why you had to add a 10-second sleep.

There is a common belief that EC must be used with WebDriverWait. In fact, EC is only a collection of helper classes that retrieve an element once it has certain properties (e.g. visible, hidden, clickable).
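To illustrate the point, until() accepts any callable that takes the driver and returns something truthy when the condition is met. Below is a minimal sketch of a function-based condition. The names `first_match` and `FakeDriver` are hypothetical; the fake driver exists only so the snippet runs without a browser.

```python
def first_match(locator):
    """Build a wait condition: first element matching locator, else False."""
    def condition(driver):
        elements = driver.find_elements(*locator)
        return elements[0] if elements else False
    return condition


class FakeDriver:
    """Hypothetical stand-in for a WebDriver with canned lookup results."""

    def __init__(self, dom):
        self.dom = dom

    def find_elements(self, by, value):
        return self.dom.get((by, value), [])


# "css selector" is the string value behind By.CSS_SELECTOR.
driver = FakeDriver({("css selector", "#f24"): ["<Product list tab>"]})
found = first_match(("css selector", "#f24"))(driver)
missing = first_match(("css selector", "#nope"))(driver)
print(found, missing)
```

With a real driver, the same function would be passed as `wait.until(first_match((By.CSS_SELECTOR, "#f24")))`, and until() would keep calling it until it returns an element.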

In your case, ec.visibility_of_all_elements_located was a good choice. But once the element is retrieved, the DOM is re-rendered, and calling the WebElement click method then raises a StaleElementReferenceException. I also believe that a click through JS would simply be ignored, because the element passed in is no longer present.

Since until() decides when to return the element, why not use it and create our own EC class:

class SelectProductTab(object):
    def __init__(self, locator):
        self.locator = locator
        self._selected_background_image = 'url("IMG?i=ec2a883936d53541a030c2ddb511e7e8&s=p")'

    def __call__(self, driver):
        els = driver.find_elements(*self.locator)
        if len(els) > 0:
            els[0].click()
        else:
            return False
        return els[0] if self.__is_selected(els[0]) else False

    def __is_selected(self, el):
        return self._selected_background_image in el.get_attribute('style')

This class will do the following:

  1. Retrieve the element
  2. Click on it
  3. Ensure the desired tab is selected, i.e. that the click actually worked
  4. Once the tab is selected, return the element to the caller

One part is not handled by the class, because WebDriverWait already supports it: exception handling. In your case you will be facing StaleElementReferenceException, so tell WebDriverWait to ignore it:

wait = WebDriverWait(driver, 30, ignored_exceptions=(StaleElementReferenceException, ))

Then call until() with your own implementation of an EC class:

wait.until(SelectProductTab((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
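To make the interaction between the custom condition and ignored_exceptions easier to follow, here is a simplified model of the polling that until() performs. It is a sketch, not Selenium's actual source: `poll_until`, `StaleElement` and `FlakyTab` are hypothetical stand-ins so the snippet runs without a browser.

```python
import time


class StaleElement(Exception):
    """Stand-in for selenium's StaleElementReferenceException."""


def poll_until(condition, driver, timeout=5.0, interval=0.01, ignored=()):
    """Call condition(driver) until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(driver)
            if result:
                return result
        except ignored:
            pass  # an ignored exception counts as "not ready yet"
        if time.monotonic() > deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(interval)


class FlakyTab:
    """Hypothetical condition: stale once, unselected once, then selected."""

    def __init__(self):
        self.calls = 0

    def __call__(self, driver):
        self.calls += 1
        if self.calls == 1:
            raise StaleElement("DOM was re-rendered")
        return "tab element" if self.calls >= 3 else False


condition = FlakyTab()
result = poll_until(condition, driver=None, ignored=(StaleElement,))
print(result, condition.calls)
```

Without `ignored=(StaleElement,)`, the first call would propagate the exception instead of retrying, which mirrors why the wait above is created with ignored_exceptions.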

Full code:

with webdriver.Chrome(ChromeDriverManager().install(), options=options) as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.frame_to_be_available_and_switch_to_it(
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it(
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "iframe#CamosIFId")))))
    
    # Sleep was removed, click is now handled inside our own EC class + will ensure the tab is selected
    wait = WebDriverWait(driver, 30, ignored_exceptions=(StaleElementReferenceException, ))
    
    wait.until(SelectProductTab((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    
    # Keep this loop inside the with block: the driver is closed once the block exits.
    for elem in wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")))[1:]:
        print(elem.text)

Output:

539691
539692
539693
539694

Remember to add the following import:

from selenium.common.exceptions import StaleElementReferenceException
