By Group "DeepDiver"

The DeepDiver group set out to analyze customer reviews from several shopping websites, such as Amazon, Best Buy, and Target. We ran into a number of problems during the web-scraping process, and after some trial and error we succeeded in gathering the data we needed. This post covers the bugs we hit and how we solved them, mainly around locating elements and dealing with dynamic content.

The whole process worked like this: we launched Google Chrome via ChromeDriver, loaded the review page, and located the review title, content, and other information we needed. We then clicked the next-page button; each click automatically loaded 8 or 10 more reviews. Finally, we concatenated the reviews from each page and exported them to a .csv file for later use.
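A minimal sketch of that loop, assuming Amazon-style pagination; the URL, the page budget, and the next-page selector are placeholders, while the data-hook selectors are the ones used later in this post:

import time
from random import uniform

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

NUM_PAGES = 10  # hypothetical page budget

driver = webdriver.Chrome()  # launch Chrome via ChromeDriver
driver.get('https://www.example.com/product/reviews')  # placeholder URL

all_reviews = []
for _ in range(NUM_PAGES):
    time.sleep(uniform(5, 9))  # give the reviews time to load
    # harvest every review on the current page
    for card in driver.find_elements(By.CSS_SELECTOR, '[data-hook="review"]'):
        all_reviews.append({
            'title': card.find_element(By.CSS_SELECTOR, '[data-hook="review-title"]').text,
            'content': card.find_element(By.CSS_SELECTOR, '[data-hook="review-body"]').text,
        })
    # move to the next page of reviews ('li.a-last > a' is an assumed selector)
    driver.find_element(By.CSS_SELECTOR, 'li.a-last > a').click()

pd.DataFrame(all_reviews).to_csv('reviews.csv', index=False)
driver.quit()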

Problem 1: The review list was not loaded until the page was scrolled to the bottom

Solution: Every time a new page was opened, we had the browser scroll down to the review-list area near the bottom of the page, about 1500px above the footer, and then sleep for a while so the review content could load. Without this step we could not get the review content at all.

import time
from random import uniform

# Scroll down to the bottom of the page, minus 1500px
driver.execute_script('window.scrollTo(0, document.body.scrollHeight-1500);')
# Wait for the reviews to load
RANDOM_SLEEP_TIME = uniform(5, 9)
time.sleep(RANDOM_SLEEP_TIME)
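A fixed random sleep worked well enough, but an explicit wait is a common alternative we could have used instead: poll until the review elements actually exist rather than sleeping for a fixed interval. A sketch, reusing the review-body selector from below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# scroll near the review list, then block until at least one review body is present
driver.execute_script('window.scrollTo(0, document.body.scrollHeight-1500);')
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[data-hook="review-body"]'))
)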

Problem 2: Some reviews had only a title and no content

We saved the review titles and contents into two separate lists and found that their lengths differed: the title list contained more items than the content list, because some reviews had a title but no body.

# Gather info on each page (the buggy first version)
review_dict = {'title': [], 'content': []}
while True:
    # get the reviews' titles and contents as two parallel lists
    review_title = [title.text for title in driver.find_elements(By.CSS_SELECTOR, '[data-hook="review-title"]')]
    review_content = [content.text for content in driver.find_elements(By.CSS_SELECTOR, '[data-hook="review-body"]')]
    # concatenate onto the running lists with '+'
    review_dict['title'] = review_dict['title'] + review_title
    review_dict['content'] = review_dict['content'] + review_content
    # ... click the next-page button here; break when there are no more pages
len(review_dict['title'])
Out[32]: 1306
len(review_dict['content'])
Out[33]: 1084

Solution: Change the data structure. Save each review into its own dictionary, then append all the dictionaries for that page to a list. That way a missing field stays attached to its own review instead of silently misaligning two parallel lists.

reviews_current_page = []
for review in review_list:  # review_list holds the review-card elements on this page
    current_review = {}  # one dictionary per review
    current_review['username'] = review.find_element(By.CSS_SELECTOR, '[data-test="review-card--username"]').text
    current_review['title'] = review.find_element(By.CSS_SELECTOR, '[data-test="review-card--title"]').text
    current_review['content'] = review.find_element(By.CSS_SELECTOR, '[data-test="review-card--text"]').text
    current_review['date'] = review.find_element(By.CSS_SELECTOR, '[data-test="review-card--reviewTime"]').text
    current_review['stars'] = review.find_element(By.CLASS_NAME, 'ilnowk').text
    reviews_current_page.append(current_review)  # collect all reviews on this page
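With one dictionary per review, the page's list can then be extended into a master list and dumped to the .csv mentioned above. A sketch, where all_reviews is an assumed accumulator name:

import pandas as pd

all_reviews = []  # accumulated across all pages

# after scraping each page:
all_reviews.extend(reviews_current_page)

# pandas fills any missing keys with NaN, so uneven reviews stay aligned
pd.DataFrame(all_reviews).to_csv('reviews.csv', index=False)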

Problem 3: The class name of the element contained a dynamic hash and could not be located with CLASS_NAME

<h3 class="Heading__StyledHeading-sc-1mp23s9-0 jCmZWX" data-test="review-card--title" tabindex="-1">Love this toothbrush</h3>

Solution: The element's data-* attribute was stable, so we could use CSS_SELECTOR instead of CLASS_NAME to locate it.

current_review['title'] = review.find_element(By.CSS_SELECTOR, '[data-test="review-card--title"]').text

Problem 4: When jumping to the next page, the 'Load 8 more' button was not found

StaleElementReferenceException: stale element reference: element is not attached to the page document

Solution: There were two possibilities: either the element had not been loaded yet, or it had been removed from the DOM. We first replaced presence_of_element_located with element_to_be_clickable, but the error persisted:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_page_path = WebDriverWait(driver, RANDOM_SLEEP_TIME).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="pageBodyContainer"]/div[11]/div[8]/button')))

Then we found that the button's XPath changed after turning the page, from //*[@id="pageBodyContainer"]/div[11]/div[8]/button to //*[@id="pageBodyContainer"]/div[10]/div[8]/button, so XPath could not be relied on. The <button> itself had no stable class name; only its parent element had the data-test="load-more-btn" attribute. We therefore used a CSS selector: first locate the parent element, then use > to reach the child <button>.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# locate the stable parent via its data-test attribute, then its child <button>
next_page_path = WebDriverWait(driver, RANDOM_SLEEP_TIME).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[data-test="load-more-btn"] > button')))
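Even with a stable selector, a click can still race the re-render and go stale. A retry wrapper is one common guard; this sketch (the helper name is ours, not from the original code) re-finds the button and clicks again if the exception slips through:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_load_more(driver, timeout=10, retries=3):
    """Re-find and click the load-more button, retrying if it goes stale."""
    for _ in range(retries):
        try:
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-test="load-more-btn"] > button')))
            button.click()
            return True
        except StaleElementReferenceException:
            continue  # the DOM was re-rendered mid-click; look the button up again
    return False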

Problem 5: Missing elements when scraping

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[data-hook="helpful-vote-statement"]"}

Solution: If no one had found a review helpful, the "X people found this helpful" text was not displayed at all, so the program raised a NoSuchElementException. We needed a try/except to handle the missing element: if it was not found, we filled the field in manually.

from selenium.common.exceptions import NoSuchElementException

try:
    current_review['vote'] = review.find_element(By.CSS_SELECTOR, '[data-hook="helpful-vote-statement"]').text
except NoSuchElementException:
    # the element is absent when nobody has voted, so fill in a default
    current_review['vote'] = '0 people found this helpful'
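If the vote counts are needed as numbers during analysis, the leading integer can be pulled out of that text afterwards. A small hypothetical helper (not part of our scraper):

import re

def vote_count(text):
    """Extract the leading number from e.g. '12 people found this helpful'."""
    match = re.match(r'([\d,]+)', text)
    return int(match.group(1).replace(',', '')) if match else 0

vote_count('1,204 people found this helpful')  # -> 1204
vote_count('0 people found this helpful')      # -> 0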
