By Group "DeepDiver"

Simulating the Request

We need to work out the pattern of the request parameters so that we can construct and send the request ourselves.

import requests

def fetchurl(page, sku):
    # All reviews come from this one endpoint; only the sku differs per product
    url = "https://www.bestbuy.com/ugc/v2/reviews"
    # Pretend to be an ordinary browser so the request is not rejected
    headers = {
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36"
        }
    # Query parameters: which page, reviews per page, which product, sort order
    params = {
        "page": page,
        "pageSize": 20,
        "sku": sku,
        "sort": "MOST_RECENT",
        }

    r = requests.get(url, headers=headers, params=params)
    return r.json()

The request takes several query parameters, and we can set each one by hand to get the information we want. "page" is the review page to fetch, "pageSize" is the number of reviews returned per page, and "sku" is the unique identifier of each product, so to collect reviews for a different product we only need to change the SKU. "sort" controls the order in which reviews are returned.
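
For example, fetching page 2 of reviews for the Philips Sonicare 4100 (SKU 6485362, one of the two products scraped later in this post) is a single call; the page number here is just for illustration:

# Fetch page 2 of reviews for SKU 6485362 (illustrative values)
data = fetchurl(page=2, sku=6485362)
print(data.keys())   # inspect the top-level fields of the response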

Parsing the Data

Once the request succeeds, we need to parse the data it returns.

from bs4 import BeautifulSoup

def parseJson(jsonObj):
    # The reviews live under the "topics" key of the response
    data = jsonObj["topics"]

    reviewData = []
    for item in data:
        # review author
        review_author = item["author"]
        # star rating
        rating = item["rating"]
        # review title
        title = item["title"]
        # review body: strip the HTML markup, keep only the plain text
        content = BeautifulSoup(item["text"], "html.parser").text
        # submission time
        create_time = item["submissionTime"]

        dataItem = [create_time, review_author, title, content, rating]
        reviewData.append(dataItem)

    return reviewData

The response is JSON; requests parses it for us via r.json(), and we pick out the fields we want in this order: creation time, author, title, review content, rating.
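
To make the parsing concrete, here is a minimal sketch of the payload shape, inferred only from the keys parseJson reads; the real response contains many more fields, and every value below is made up:

sample = {
    "topics": [
        {
            "author": "SampleUser",            # hypothetical values throughout
            "rating": 5,
            "title": "Great toothbrush",
            "text": "<p>Works well.</p>",      # HTML that BeautifulSoup strips
            "submissionTime": "2022-03-01T12:00:00",
        }
    ]
}
rows = parseJson(sample)
# rows == [["2022-03-01T12:00:00", "SampleUser", "Great toothbrush", "Works well.", 5]]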

Saving the Data

After parsing the data, we save it to a CSV file for further processing.

import os
import pandas as pd

def save_data(data, path, filename):
    # Create the output directory if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)

    # Append the rows to the CSV; the header row is written separately by the caller
    dataframe = pd.DataFrame(data)
    dataframe.to_csv(os.path.join(path, filename), encoding='utf_8_sig',
                     mode='a', index=False, sep=',', header=False)
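
Because the file is opened in append mode with header=False, each call simply adds rows to the end of the file, so the column header has to be written exactly once up front; the main function below takes care of that. A quick illustrative call (path and filename are made up):

save_data([["2022-03-01T12:00:00", "SampleUser", "Great toothbrush", "Works well.", 5]],
          "./demo_output/", "reviews-demo.csv")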

Main Function

With all the functions defined, we can run the web scraping program to collect the product reviews. We call time.sleep between requests to avoid being blocked by the website.

import time
from random import uniform

if __name__ == "__main__":
    # Oral-B Pro 1000 and Philips Sonicare 4100
    skuList = [1493375, 6485362]
    for sku in skuList:
        startPage = 1
        endPage = 63
        path = "/Users/wangqingran/Desktop/Python course/7036 NLP/Group project/scraping_review_data/"
        filename = "reviews-" + str(sku) + ".csv"

        # Write the header row once before appending the review rows
        csvHeader = [["create_time", "review_author", "title", "content", "rating"]]
        save_data(csvHeader, path, filename)

        for p in range(startPage, endPage + 1):
            print("\rNow scraping page", p, "...", end="")
            html = fetchurl(p, sku)
            review = parseJson(html)
            save_data(review, path, filename)
            # Random pause between requests so the scraper looks less like a bot
            time.sleep(uniform(0.2, 0.6))
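
Network requests can still fail or return something unexpected, so in practice a small retry wrapper around fetchurl is useful. This is only a sketch of one option, not part of the original program; it assumes a failed request surfaces as a requests exception or a JSON decoding error:

import requests
import time

def fetchurl_with_retry(page, sku, retries=3):
    # Hypothetical helper: retry a failed request with an increasing delay
    for attempt in range(retries):
        try:
            return fetchurl(page, sku)
        except (requests.exceptions.RequestException, ValueError):
            if attempt == retries - 1:
                raise                      # give up after the last attempt
            time.sleep(2 ** attempt)       # wait 1s, then 2s, ...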
