Scroll Through All Pages To Get All Found Articles
Learn how to iterate through found pages and extract articles
Whenever you make an API call using the /v2/search or /v2/latest_headlines endpoints, the response includes key information about your search. Here is an example of what a search API call returns:
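The values below are illustrative (chosen to match the numbers discussed next), but the metadata fields are the ones the v2 endpoints return alongside the articles themselves:

```json
{
    "status": "ok",
    "total_hits": 1300,
    "page": 1,
    "total_pages": 13,
    "page_size": 100,
    "articles": [...]
}
```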
Based on the information above, we can say that, given page_size=100 and being on page 1, we are seeing found articles 1 through 100 out of 1,300. By incrementing the page parameter to 2, you will get articles 101 through 200 out of 1,300. You get the logic.
To summarize, your goal is to iterate through all found pages and extract articles.
The whole process can be divided into two parts (sketched in code right after this list):
1. Make one call and read the total number of pages from total_pages.
2. Increment the page parameter up to the total_pages value until all articles are collected.
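Before diving into the full implementations, here is a minimal sketch of that loop. fetch_page is a hypothetical helper standing in for the actual API call; complete, runnable versions follow in the sections below.

```python
# Minimal sketch of the pagination loop. fetch_page is a hypothetical
# helper that performs one API call and returns the parsed JSON response.
first_page = fetch_page(page=1)
all_articles = first_page["articles"]

# Pages are 1-indexed, so continue from page 2 up to total_pages
for page in range(2, first_page["total_pages"] + 1):
    all_articles.extend(fetch_page(page=page)["articles"])
```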
Python (SDK)
The Python library can be installed with pip install launched from the terminal. You can find all the details either on the PyPI website or in our GitHub repository.
```bash
pip install newscatcherapi
```
Once installed, the package can be imported directly into your Python application.
We prepared separate functions, get_search_all_pages and get_latest_headlines_all_pages, to simplify the process of extracting all articles.
```python
from newscatcherapi import NewsCatcherApiClient

newscatcherapi = NewsCatcherApiClient(x_api_key='your_key_1')

# /v2/search endpoint
all_articles = newscatcherapi.get_search_all_pages(
    q='\"Elon Musk\"',
    from_='13 days ago',
    countries='IN,GB',
    page_size=100,
    page=1
)
```
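The same pattern applies to /v2/latest_headlines via get_latest_headlines_all_pages. The call below is a sketch: it assumes the method accepts the standard latest-headlines query parameters such as countries, topic, and page_size.

```python
# /v2/latest_headlines endpoint -- a sketch assuming the standard
# latest-headlines parameters (countries, topic, page_size)
all_headlines = newscatcherapi.get_latest_headlines_all_pages(
    countries='IN,GB',
    topic='business',
    page_size=100,
    page=1
)
```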
Python (requests)
```python
# Preinstalled packages
import requests  # 2.24.0

# Default packages
import json
import time

# URL of our News API
base_url = "https://api.newscatcherapi.com/v2/search"

# Your API key
X_API_KEY = 'PUT_YOUR_API_KEY'

# Define your desired parameters
params = {
    "q": "\"Elon Musk\"",
    "from": "13 days ago",
    "countries": "IN,GB",
    "page_size": 100,
    "page": 1
}

# Put your API key in the headers in order to be authorized to perform a call
headers = {"x-api-key": X_API_KEY}

# Variable to store all found news articles
all_news_articles = {}

# Ensure that we start from page 1
params['page'] = 1

# Infinite loop which ends when all articles are extracted
while True:
    # Wait for 1 second between each call
    time.sleep(1)

    # GET call
    response = requests.get(base_url, headers=headers, params=params)
    results = json.loads(response.text.encode())

    if response.status_code == 200:
        print(f'Done for page number => {params["page"]}/{results["total_pages"]}')

        # Storing all found articles
        if not all_news_articles:
            all_news_articles = results
        else:
            all_news_articles['articles'].extend(results['articles'])

        # Ensuring to cover all pages by incrementing the "page" value at each iteration
        params['page'] += 1

        if params['page'] > results['total_pages']:
            print("All articles have been extracted")
            break
        else:
            print(f'Proceed extracting page number => {params["page"]}')
    else:
        print(results)
        print(f'ERROR: API call failed for page number => {params["page"]}')
        break

# Count the collected articles (not the top-level keys of the dict)
print(f'Number of extracted articles => {len(all_news_articles.get("articles", []))}')
```
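Once the loop finishes, all articles sit in the all_news_articles dictionary, so you can persist them with the standard library. A minimal sketch; the file name extracted_articles.json is an arbitrary choice for this example:

```python
# Write the combined results to disk; the file name is arbitrary
with open('extracted_articles.json', 'w', encoding='utf-8') as f:
    json.dump(all_news_articles, f, ensure_ascii=False, indent=2)
```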