Build a News Aggregator

This guide takes you through the process of building a news aggregator with the NewsCatcher news API.

News aggregators can do a lot of things, but at their core they all need to fetch, clean, and organize news from multiple sources. In this guide, you'll build a news aggregator using NewsCatcher's news API in less than 10 minutes.

Requirements

Here's what you'll need to follow along with this guide:
  • URLs of the news outlets you want to aggregate
  • Python 3
  • The following Python modules: newscatcherapi and tldextract

Set-Up

Install the required modules using pip:

```shell
pip install newscatcherapi tldextract
```

Get The Clean URLs

To get the clean URLs of the news outlets, you can use the following function:

```python
import tldextract

def clean_the_url(url):
    # Keep only the domain and suffix, e.g. 'https://www.nytimes.com/world' -> 'nytimes.com'
    clean_url = '.'.join(list(tldextract.extract(url))[1:]).replace('www.', '').lower()
    return clean_url
```
If you're working with international news outlets and need a specific regional website, use the lang parameter to narrow the results. For example, 'arabic.rt.com' and 'francais.rt.com' both have the same clean URL, 'rt.com', so the clean URL alone can't tell them apart.

Get The News Articles

Once you have the clean URLs of the news sources you want to aggregate, you can simply pass them as a list to the sources parameter of the v2/search endpoint to get the articles. And that's it!
Let's say you wanted to aggregate all news articles from The New York Times and The Guardian published in the last week.
All you would need to do is use the get_search() method to fetch the articles:
```python
from newscatcherapi import NewsCatcherApiClient

newscatcherapi = NewsCatcherApiClient(x_api_key='your_key_1')

clean_urls = ['nytimes.com', 'theguardian.com']

aggregated_articles = newscatcherapi.get_search(q='*',
                                                from_='1 week ago',
                                                page_size=100,
                                                sources=clean_urls)
```
Alternatively, if you're not working with Python, you can make a GET request with the clean URLs:

```shell
curl -XGET 'https://api.newscatcherapi.com/v2/search?q=*&page_size=100&sources=nytimes.com,theguardian.com&from=1%20week%20ago' -H 'x-api-key: your_key_1'
```
Which would yield a JSON/dictionary object that looks like this:

```json
{
    "status": "ok",
    "total_hits": 3183,
    "page": 1,
    "total_pages": 32,
    "page_size": 100,
    "articles": [
        {
            "title": "The E.U. unveils a plan to ban Russian oil imports.",
            "author": "Matina Stevis-Gridneff",
            "published_date": "2022-05-04 07:12:29",
            "published_date_precision": "full",
            "link": "https://www.nytimes.com/2022/05/04/world/eu-russia-oil-ban.html",
            "clean_url": "nytimes.com",
            ...
        }
        ...
    ],
    "user_input": {
        "q": "*",
        "search_in": [
            "title_summary"
        ],
        "lang": null,
        "not_lang": null,
        "countries": null,
        "not_countries": null,
        "from": "2022-05-02 05:28:11",
        "to": null,
        "ranked_only": "True",
        "from_rank": null,
        "to_rank": null,
        "sort_by": "relevancy",
        "page": 1,
        "size": 100,
        "sources": [
            "nytimes.com",
            "theguardian.com"
        ],
        "not_sources": null,
        "topic": null,
        "published_date_precision": null
    }
}
```
You can now extract the articles list from the response and use it however you like.
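As a minimal sketch, here's how you might pull the fields you care about out of a response like the one above (the response dict below is a truncated stand-in for the real API response):

```python
# A truncated stand-in for the JSON response shown above
response = {
    "status": "ok",
    "articles": [
        {
            "title": "The E.U. unveils a plan to ban Russian oil imports.",
            "link": "https://www.nytimes.com/2022/05/04/world/eu-russia-oil-ban.html",
            "clean_url": "nytimes.com",
        },
    ],
}

# Extract the articles list and use each entry however you like
for article in response["articles"]:
    print(f'{article["clean_url"]}: {article["title"]}')
```

With the real response, remember that total_pages tells you how many pages of results exist, so you may need to page through them to collect everything.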
Check out this nifty crypto news aggregator we made.
CryptoCatcher in action
Did we miss any outlets you need? Reach out to us at [email protected] and we'll add them without breaking a sweat.