Export News into a CSV with Python
In this tutorial, you’ll learn how to extract data from the NewsCatcher News API and write it into a CSV file.

Requirements

    Python 3
    Preinstalled packages:
```
requests==2.24.0
pandas==1.3.2
```
You can copy the package names and versions into a separate "requirements.txt" file, then install everything from the terminal with a single command:
```
pip install -r requirements.txt
```

Steps to accomplish

    1. Prepare the environment
    2. Make an API call
    3. Extract all found news articles
    4. Export data to CSV file
      1. From Python dictionary
      2. From Pandas table

1. Prepare the environment

This step consists of:
    importing packages
    setting environment variables (the API key, the URL of the News API, etc.)
    defining the correct work folder
The following code illustrates the above steps:
```python
# Import packages
# Default packages
import time
import csv
import os
import json

# Preinstalled packages
import requests
import pandas as pd


# Define the desired work folder, where you want to save your .csv files
# (keep the line matching your system and comment out the other)
# Windows example
os.chdir('C:\\Users\\user_name\\PycharmProjects\\extract_news_data')
# Linux example
# os.chdir('/mnt/c/Users/user_name/PycharmProjects/extract_news_data')

# URL of our News API
base_url = 'https://api.newscatcherapi.com/v2/search'

# Your API key
X_API_KEY = 'PUT_YOUR_API_KEY'
```
Receive your API key by registering at app.newscatcherapi.com

2. Make an API call

Let's take it easy and try to make a single call. For example, we would like to look for all mentions of three popular cryptocurrencies: Bitcoin, Ethereum, and Dogecoin.
In order to make a call, we need to set headers and parameters. In the parameters, I also filter for articles in English and narrow the search down to the top 10,000 most trusted news sources, based on the rank variable. The default time period is one week, so there is no need to set that parameter.
```python
# Put your API key in the headers to be authorized to perform a call
headers = {'x-api-key': X_API_KEY}

# Define your desired parameters
params = {
    'q': 'Bitcoin AND Ethereum AND Dogecoin',
    'lang': 'en',
    'to_rank': 10000,
    'page_size': 100,
    'page': 1
}

# Make a simple call with both headers and params
response = requests.get(base_url, headers=headers, params=params)

# Decode the received results
results = json.loads(response.text)
if response.status_code == 200:
    print('Done')
else:
    print(results)
    print('ERROR: API call failed.')
```
If the status_code is not 200, the error message should give you a clear idea of what went wrong.
Here are the results that we received:
Output of an API call
As you can see, we found 253 articles mentioning all three popular cryptocurrencies at once. Another parameter worth looking at is "total_pages": it shows how many API calls you will have to make in order to extract all found news articles. We will use it later in the guide. You can also explore further by looking at each article separately; all of them are stored under the "articles" JSON key.
If you want to know more about all the response data you get, check out our Search Endpoint page.
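To get oriented before paginating, here is a minimal sketch of how you might inspect those keys. The `sample_results` dictionary below is made-up illustration data mirroring the keys this guide relies on ("total_pages", "articles", and the per-article "title" and "clean_url" fields), not real API output:

```python
# Hypothetical sample mirroring the response keys used in this guide
sample_results = {
    'total_pages': 3,
    'articles': [
        {'title': 'Bitcoin, Ethereum and Dogecoin rally', 'clean_url': 'example.com'},
    ],
}

# "total_pages" tells you how many calls are needed to fetch everything
print(f"API calls needed: {sample_results['total_pages']}")

# Each found article lives in the "articles" key
for article in sample_results['articles']:
    print(article['title'], '-', article['clean_url'])
```

With a real response, the same lookups apply to the `results` variable from the call above.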

(Optional) Visualize data with the pandas package

To be able to look at all 100 articles at the same time, let's create a pandas table from the "articles" key.
```python
# Import data into pandas
pandas_table = pd.DataFrame(results['articles'])
```
Article represented by a pandas table
Now you can have a closer look at the first 100 articles before extracting the rest.

3. Extract all found news articles

At this stage, we are confident that an API call returns the expected results. The next step is to extract all found news articles using the "total_pages" value.
One thing to keep in mind is that I am using a free trial API key, where the frequency of API calls is limited to 1 call per second. So, to avoid being penalized for overuse, I make the code wait one second between calls.
```python
# Variable to store all found news articles
all_news_articles = []

# Ensure that we start from page 1
params['page'] = 1

# Infinite loop which ends when all articles are extracted
while True:

    # Wait for 1 second between each call
    time.sleep(1)

    # GET call from the previous section, enriched with some logs
    response = requests.get(base_url, headers=headers, params=params)
    results = json.loads(response.text)
    if response.status_code == 200:
        print(f'Done for page number => {params["page"]}')

        # Add the used parameters to each result to be able to explore afterwards
        for i in results['articles']:
            i['used_params'] = str(params)

        # Store all found articles
        all_news_articles.extend(results['articles'])

        # Cover all pages by incrementing the "page" value at each iteration
        params['page'] += 1
        if params['page'] > results['total_pages']:
            print("All articles have been extracted")
            break
        else:
            print(f'Proceed extracting page number => {params["page"]}')
    else:
        print(results)
        print(f'ERROR: API call failed for page number => {params["page"]}')
        break

print(f'Number of extracted articles => {str(len(all_news_articles))}')
In summary, we iterate through all available pages, extract the news articles, and store them in one variable called "all_news_articles". We also add the used parameters to each article, so when exploring you can see where each one comes from. You can always delete that part of the code if you do not want this information in your CSV file.
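The pagination logic above can also be factored into a small reusable helper. This is only a sketch, not part of the original guide: `fetch_page` is a hypothetical stand-in you would replace with the real `requests.get` call (plus the one-second pause and error handling shown above).

```python
def extract_all_articles(fetch_page, params):
    """Collect articles from every page. fetch_page(params) must return
    a dict with 'articles' and 'total_pages' keys."""
    collected = []
    params = dict(params, page=1)  # copy so the caller's dict stays untouched
    while True:
        results = fetch_page(params)
        # Record which parameters produced each article
        for article in results['articles']:
            article['used_params'] = str(params)
        collected.extend(results['articles'])
        if params['page'] >= results['total_pages']:
            return collected
        params['page'] += 1

# Stub fetcher with two fake pages, just to show the flow
def fetch_page(params):
    return {'articles': [{'title': f"article {params['page']}"}],
            'total_pages': 2}

articles = extract_all_articles(fetch_page, {'q': 'Bitcoin', 'page_size': 100})
```

Injecting the fetch function this way makes the loop easy to test without hitting the API.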

(Optional) Having multiple queries

Imagine that you want to extract news data for multiple queries at once. Instead of searching for articles where all three popular cryptocurrencies are mentioned together, you would like to look for each of them separately, adding "business" as a topic. In this case, you will have multiple parameter sets and one more level of iteration.
Here is what the params variable looks like:
```python
params = [
    {
        'q': 'Bitcoin',
        'lang': 'en',
        'to_rank': 10000,
        'topic': 'business',
        'page_size': 100,
        'page': 1
    },
    {
        'q': 'Ethereum',
        'lang': 'en',
        'to_rank': 10000,
        'topic': 'business',
        'page_size': 100,
        'page': 1
    },
    {
        'q': 'Dogecoin',
        'lang': 'en',
        'to_rank': 10000,
        'topic': 'business',
        'page_size': 100,
        'page': 1
    }
]
```
In the code, we add one more loop and pass "separated_param" to the requests.get function.
```python
# Variable to store all found news articles; "mp" stands for "multiple queries"
all_news_articles_mp = []

# Outer loop over each query's parameters
for separated_param in params:

    print(f'Query in use => {str(separated_param)}')

    # Infinite loop which ends when all articles are extracted
    while True:
        # Wait for 1 second between each call
        time.sleep(1)

        # GET call from the previous section, enriched with some logs
        response = requests.get(base_url, headers=headers, params=separated_param)
        results = json.loads(response.text)
        if response.status_code == 200:
            print(f'Done for page number => {separated_param["page"]}')

            # Add the used parameters to each result to be able to explore afterwards
            for i in results['articles']:
                i['used_params'] = str(separated_param)

            # Store all found articles
            all_news_articles_mp.extend(results['articles'])

            # Cover all pages by incrementing the "page" value at each iteration
            separated_param['page'] += 1
            if separated_param['page'] > results['total_pages']:
                print("All articles have been extracted")
                break
            else:
                print(f'Proceed extracting page number => {separated_param["page"]}')
        else:
            print(results)
            print(f'ERROR: API call failed for page number => {separated_param["page"]}')
            break

print(f'Number of extracted articles => {str(len(all_news_articles_mp))}')
```
One more important thing is to deduplicate the results. Right now we extract articles from three different queries, but, as we saw before, the same article can mention all three cryptocurrencies, so different queries can return the same articles. That is why an "_id" value is generated for each article: the ID is derived from both the title and the clean_url (the web domain name of the news source).
Here is how you can deduplicate results in Python:
```python
# Define variables
unique_ids = []
all_news_articles = []

# Iterate over each article and check whether we saw its _id before
for article in all_news_articles_mp:
    if article['_id'] not in unique_ids:
        unique_ids.append(article['_id'])
        all_news_articles.append(article)
```
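The same deduplication can also be written with a dictionary keyed on "_id": `setdefault` keeps the first article seen for each ID, and since Python 3.7 dicts preserve insertion order. A sketch with made-up IDs:

```python
# Hypothetical duplicated input: two queries returned the same article
articles_with_dupes = [
    {'_id': 'a1', 'title': 'Bitcoin rallies'},
    {'_id': 'a2', 'title': 'Ethereum update'},
    {'_id': 'a1', 'title': 'Bitcoin rallies'},
]

# Keep only the first article seen for each _id
unique = {}
for article in articles_with_dupes:
    unique.setdefault(article['_id'], article)
deduped = list(unique.values())
```

Membership checks on a dict are O(1), so this scales better than the list-based version when you have many articles.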

4. Export data to CSV file

4.1 From Python dictionary

You can save a file directly from the dictionary generated previously:
```python
field_names = list(all_news_articles[0].keys())

# Generate a CSV file from the list of dicts
with open('extracted_news_articles.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=field_names, delimiter=';')
    writer.writeheader()
    writer.writerows(all_news_articles)
```

4.2 From Pandas table

Or create a Pandas table, check the results and then generate a CSV:
```python
# Generate CSV from a Pandas table
# Create the Pandas table
pandas_table = pd.DataFrame(all_news_articles)

# Generate the CSV
pandas_table.to_csv('extracted_news_articles.csv', encoding='utf-8', sep=';')
```
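To sanity-check the export, you can read the file back; just remember that `sep=';'` must match the delimiter used when writing. Here is a self-contained sketch using dummy rows in place of real extracted articles (the file name `roundtrip_check.csv` is arbitrary):

```python
import pandas as pd

# Dummy rows standing in for real extracted articles
rows = [
    {'title': 'Bitcoin rallies', 'clean_url': 'example.com'},
    {'title': 'Ethereum update', 'clean_url': 'example.org'},
]

# Write with the same separator as in the guide, without the index column
pd.DataFrame(rows).to_csv('roundtrip_check.csv', sep=';', index=False)

# Read it back with the matching separator and compare
check = pd.read_csv('roundtrip_check.csv', sep=';')
```

If the row count and columns match what you wrote, the export is good to open in a spreadsheet or downstream script.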