Clustering News Articles

Group similar articles together to reduce noise

Version 3 introduces clustering, a feature that groups similar articles together. Clustering serves several purposes, such as identifying and eliminating duplicate articles published by different media sources. It also lets you focus on significant events instead of sifting through multiple articles that differ in wording but carry essentially the same content.

To fine-tune the precision of your search results, we provide two parameters:

  1. clustering_threshold: This parameter sets the level of similarity at which articles are considered similar. Raising the threshold makes the similarity requirement stricter, so fewer articles are grouped together and the results contain more, smaller clusters.

    For optimal results, we recommend experimenting with different values for this parameter, typically ranging between 0.6 and 0.9, depending on your specific use case.

  2. clustering_variable: By default, our clustering mechanism considers the entire content of the articles. With the clustering_variable parameter, you can instead build clusters based on the articles' title and summary, which can produce results that are more relevant to your specific requirements.
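As a rough illustration of how the two parameters might be combined in a request, here is a minimal sketch that only assembles the query parameters and sends nothing. The names clustering_threshold, clustering_variable, and page_size come from this page; the q parameter, the value "title", the 0 to 1 range check, and the helper build_clustering_params are assumptions made for the example, and the text above does not say whether a separate switch is needed to turn clustering on.

```python
from urllib.parse import urlencode


def build_clustering_params(query, threshold=0.6, variable=None, page_size=100):
    """Assemble query parameters for a clustered search (hypothetical helper)."""
    # Assumption: the threshold is a similarity score between 0 and 1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("clustering_threshold is assumed to lie in [0, 1]")

    params = {
        "q": query,  # assumed name of the search-query parameter
        "clustering_threshold": threshold,
        "page_size": page_size,
    }
    # Only send clustering_variable when overriding the default (full content).
    if variable is not None:
        params["clustering_variable"] = variable  # "title" is an assumed value
    return params


params = build_clustering_params("electric vehicles", threshold=0.8, variable="title")
print(urlencode(params))
```

Leaving clustering_variable out of the dictionary keeps the documented default behavior, so the same helper can be reused while you experiment with different threshold values.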

The clustering process occurs dynamically at the API level and takes into account the search filters you apply. This approach enables the API to generate the most appropriate clusters for your use case, rather than relying on a one-size-fits-all clustering method.

One crucial detail is that clustering occurs one page of results at a time. For instance, if your query matches 150 articles and you set the page_size parameter to 100, only the first 100 articles are clustered together. Some articles on the second page may belong in the same cluster as articles on the first page, but because each page is clustered separately, that cluster may be split in two. To ensure effective clustering, we recommend setting the page_size parameter to a value greater than the expected 'total_hits' for your query.
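The effect of page-by-page clustering can be sketched with a toy simulation. This is not the API's clustering algorithm: grouping by a made-up story label stands in for the real similarity comparison, and cluster_by_page, merger_clusters, and the sample data are hypothetical. It only shows how a group that straddles a page boundary ends up divided.

```python
from collections import defaultdict


def cluster_by_page(articles, page_size):
    """Cluster each page independently and return all clusters as lists of ids."""
    clusters = []
    for start in range(0, len(articles), page_size):
        page = articles[start:start + page_size]
        groups = defaultdict(list)
        # Stand-in for similarity: articles with the same story label match.
        for article_id, story in page:
            groups[story].append(article_id)
        clusters.extend(groups.values())
    return clusters


# 150 articles; ids 90..109 all cover the same story, the rest are unique.
articles = [(i, "merger" if 90 <= i < 110 else f"story-{i}") for i in range(150)]


def merger_clusters(page_size):
    """Return the clusters that contain articles from the shared story."""
    return [c for c in cluster_by_page(articles, page_size) if 90 in c or 109 in c]


split = merger_clusters(100)   # the story straddles the page boundary
whole = merger_clusters(150)   # all matching articles fit on one page
print(len(split), len(whole))  # prints: 2 1
```

With page_size set to 100, the 20 related articles are divided into two clusters of 10; with page_size set to 150, they form a single cluster.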
