This demonstration repository illustrates how to use Python to fetch news articles from Google based on given keywords. Subsequently, the fetched articles are processed by the GPT-3.5 model to generate a concise summary of the key points in the articles.
This demonstration utilizes two Python libraries to fetch the latest news articles from Google and retrieve their full content. The first library is [GoogleNews], which enables us to search for news articles based on keywords and retrieve their titles and URLs. The second library is [Newspaper3k], which allows us to download the HTML pages of articles and parse them to extract their textual content.
Import necessary libraries
```python
from GoogleNews import GoogleNews
import pandas as pd
import requests
from fake_useragent import UserAgent
import newspaper
from newspaper import fulltext
import re

# Define the keyword to search.
keyword = 'Sora'
```
Get news links from Google News
We search for news related to the Sora topic. The language is English, the region is the United States, and the time period is set to one day. We obtain two pages of news results from Google News and save them into a dataframe.
```python
# Perform news scraping from Google and extract the result into a Pandas dataframe.
googlenews = GoogleNews(lang='en', region='US', period='1d', encode='utf-8')
googlenews.clear()
googlenews.search(keyword)
googlenews.get_page(2)
news_result = googlenews.result(sort=True)
news_data_df = pd.DataFrame.from_dict(news_result)
```
But I got an error during the operation:
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>
The error typically occurs when there is an issue with SSL certificate verification. This can happen when the Python environment is unable to verify the SSL certificate of the website you are trying to access.
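To see which certificate bundle the interpreter is actually using, a quick check like the following can help (this snippet is my addition, not part of the original notebook; it assumes `certifi` is already installed):

```python
import ssl
import certifi

# Print the default locations Python's ssl module searches for CA certificates,
# and the path of the CA bundle shipped with certifi.
print(ssl.get_default_verify_paths())
print(certifi.where())
```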
To resolve this issue, I used the following solution:
```bash
pip install --upgrade certifi
```
P.S.: If updating `certifi` doesn't solve the issue, you can try setting SSL verification to `False` when making the request. However, this is not recommended for production code, as it bypasses SSL certificate verification. Here's how you can do it:

```python
import ssl
import certifi

ssl._create_default_https_context = ssl._create_unverified_context

# Your scraping code here
googlenews = GoogleNews(lang='en', region='US', period='1d', encode='utf-8')
googlenews.clear()
googlenews.search(keyword)
googlenews.get_page(2)
news_result = googlenews.result(sort=True)
news_data_df = pd.DataFrame.from_dict(news_result)
```
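Alternatively, rather than disabling verification, Python can be pointed explicitly at certifi's CA bundle. This is only a sketch of a common workaround, not part of the original notebook, and it assumes `certifi` is installed:

```python
import os
import certifi

# Direct the ssl module (SSL_CERT_FILE) and the requests library
# (REQUESTS_CA_BUNDLE) to certifi's CA bundle instead of turning
# certificate verification off.
os.environ['SSL_CERT_FILE'] = certifi.where()
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()
```

This keeps certificate verification enabled, which is safer than the unverified-context approach above.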
New version of the code:

```python
import ssl
import certifi

ssl._create_default_https_context = ssl._create_unverified_context

googlenews = GoogleNews(lang='en', region='US', period='1d', encode='utf-8')
googlenews.clear()
googlenews.search(keyword)
googlenews.get_page(2)
news_result = googlenews.result(sort=True)
news_data_df = pd.DataFrame.from_dict(news_result)
```
```python
# Display information of the dataframe.
news_data_df.info()
```
Here is the output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   title     20 non-null     object
 1   media     20 non-null     object
 2   date      20 non-null     object
 3   datetime  20 non-null     datetime64[ns]
 4   desc      20 non-null     object
 5   link      20 non-null     object
 6   img       20 non-null     object
dtypes: datetime64[ns](1), object(6)
memory usage: 1.2+ KB
```
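The table below is the rendered dataframe; a preview call such as the following would produce similar output (the exact call is my assumption, since the original only shows the rendered table):

```python
# Preview the first few scraped results.
news_data_df.head()
```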
And we can see the dataframe here:

| title | media | date | datetime | desc | link | img |
|---|---|---|---|---|---|---|
| OpenAI's Sora isn't the end of the world... yet | XDA Developers | 0 hours ago | 2024-02-27 10:48:56.943676 | | Link | |
| Japan Moon lander revives | The Manila Times | 0 hours ago | 2024-02-27 10:48:56.942748 | | Link | |
| A Sora Rival Raises Millions from NEA | The Information | 1 hours ago | 2024-02-27 09:48:56.947537 | | Link | |
| Dog Adopted After 900 Days in Shelter Returned... | Yahoo | 1 hours ago | 2024-02-27 09:48:56.946703 | | Link | |
| Sora's Leap into AI-Driven Video Generation Sp... | BNN Breaking | 1 hours ago | 2024-02-27 09:48:56.945744 | | Link | |
As you can see, the dataframe only contains links to the news articles. But what if we want the full text of each article?
Get the full text of the news from the link
To achieve this, we use the newspaper3k package, one of the most popular Python crawler frameworks on GitHub, which is well suited for scraping news web pages.
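For reference, newspaper3k also exposes a higher-level `Article` interface. The sketch below (with a placeholder URL) shows its typical download/parse flow; the loop that follows takes a slightly different route, fetching the HTML with `requests` and handing it to `fulltext()`:

```python
from newspaper import Article

# Minimal sketch of newspaper3k's Article workflow (placeholder URL).
url = 'https://example.com/some-news-article'
article = Article(url)
article.download()          # fetch the HTML
article.parse()             # extract title, authors, body text, etc.
print(article.title)
print(article.text[:500])   # first 500 characters of the article body
```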
```python
ua = UserAgent()
news_data_df_with_text = []

for index, headers in news_data_df.iterrows():
    news_title = str(headers['title'])
    news_media = str(headers['media'])
    news_update = str(headers['date'])
    news_timestamp = str(headers['datetime'])
    news_description = str(headers['desc'])
    news_link = str(headers['link'])
    print(news_link)
    news_img = str(headers['img'])
    try:
        # Download the article HTML with a browser-like User-Agent, then let
        # newspaper3k's fulltext() extract the article body.
        # html = requests.get(news_link).text
        html = requests.get(news_link, headers={'User-Agent': ua.chrome}, timeout=5).text
        text = fulltext(html)
        print('Text Content Scraped')
    except Exception:
        # Skip articles that fail to download or parse so we never append
        # a stale or undefined `text` value.
        print('Text Content Scraped Error, Skipped')
        continue
    news_data_df_with_text.append([news_title, news_media, news_update, news_timestamp,
                                   news_description, news_link, news_img, text])

news_data_with_text_df = pd.DataFrame(news_data_df_with_text,
                                      columns=['Title', 'Media', 'Update', 'Timestamp',
                                               'Description', 'Link', 'Image', 'Text'])
```
And here is an example of the result:

| Title | Media | Update | Timestamp | Description | Link | Image | Text |
|---|---|---|---|---|---|---|---|
| OpenAI's Sora isn't the end of the world... yet | XDA Developers | 0 hours ago | 2024-02-27 10:48:56.943676 | | Link | | This page is not available! |
| Japan Moon lander revives | The Manila Times | 0 hours ago | 2024-02-27 10:48:56.942748 | | Link | | ERROR 404\n\npage not found\n\nClick here to g... |
| A Sora Rival Raises Millions from NEA | The Information | 1 hours ago | 2024-02-27 09:48:56.947537 | | Link | | Sorry, we weren't able to find that.\n\nIf you... |
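As the Text column shows, several of the scraped pages are actually 404s, paywalls, or "not found" notices rather than real articles. Before passing the text to GPT-3.5 for summarization, it is worth filtering those rows out. The snippet below is only a sketch of such a cleanup step; the `MIN_CHARS` threshold is an arbitrary assumption and not part of the original notebook:

```python
# Drop rows whose scraped text is empty or suspiciously short; these usually
# correspond to 404 pages, paywalls, or consent screens rather than articles.
MIN_CHARS = 200  # hypothetical threshold, tune as needed
clean_df = news_data_with_text_df[
    news_data_with_text_df['Text'].str.len() > MIN_CHARS
].reset_index(drop=True)
print(f'Kept {len(clean_df)} of {len(news_data_with_text_df)} articles')
```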