Deduplicate Articles

Enhance search efficiency by filtering out duplicate articles.

Introduction

In the dynamic world of news, significant events and topics receive extensive coverage from numerous publications. Media outlets often republish articles, resulting in a flood of similar content. Sifting through this data to find original information takes considerable time and effort.

To overcome this challenge and let you retrieve only original articles, we have implemented a content-based deduplication process. This not only improves the quality of the data but also streamlines data management, saving you valuable time and resources. Furthermore, because we retain both the original articles and their duplicates, you keep the flexibility to conduct comprehensive analysis on the broader dataset.

Deduplication Process

The deduplication process involves several distinct steps, ranging from basic comparisons to advanced machine-learning techniques. Each step is designed to ensure that only unique and relevant news articles are delivered.

Deduplication with Hashes

First, we generate unique identifiers, or "hashes," based on an article's title and content. These hashes are compared against existing ones in our database. This initial step helps quickly identify and remove exact duplicates.
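The hashing step can be illustrated with a minimal sketch. The hash function (SHA-256), the normalization (lowercasing and collapsing whitespace), and the function names are assumptions made for this example; the actual implementation may differ.

```python
import hashlib


def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def article_hash(title: str, content: str) -> str:
    """Return a SHA-256 hex digest built from the title and content."""
    # A NUL separator keeps the title/content boundary unambiguous.
    payload = _normalize(title) + "\x00" + _normalize(content)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_exact_duplicates(articles):
    """Keep only the first article seen for each hash."""
    seen = set()
    unique = []
    for article in articles:
        digest = article_hash(article["title"], article["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique
```

Because identical input always produces the same digest, a set lookup is enough to detect exact duplicates without comparing full texts.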

Embedding-Based Text Comparison

Next, we transform articles that pass the initial filter into vector representations (embeddings) using Natural Language Processing (NLP). These vectors capture the meaning of the text, allowing us to identify semantically similar articles. By comparing these vectors, we can find duplicates that do not exactly match but are very similar in content.
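As a rough illustration of comparing embeddings, the sketch below uses cosine similarity, a common choice for this kind of comparison. The similarity measure, the threshold value, and the function names are assumptions for this example, and the vectors are expected to come from whatever embedding model is in use.

```python
import math

# Hypothetical cutoff; a real system would tune this value.
SIMILARITY_THRESHOLD = 0.9


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def is_semantic_duplicate(u, v, threshold=SIMILARITY_THRESHOLD):
    """Flag two embeddings as near-duplicates when they point the same way."""
    return cosine_similarity(u, v) >= threshold
```

Values close to 1 mean the two texts are close in meaning, while values near 0 mean they are unrelated.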

Levenshtein Distance Calculation

For the remaining articles, we use the Levenshtein distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another. This way, we catch near-duplicates that differ only by minor edits or typos. Articles separated by a small Levenshtein distance are likely duplicates.
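For reference, the Levenshtein distance can be computed with the standard dynamic-programming approach. This is a generic sketch of the metric itself, not a description of how it is applied internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```

For example, turning "kitten" into "sitting" takes three edits: two substitutions and one insertion.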

Identifying the Original Article

To find the original article among similar ones, we use a ranking algorithm that considers factors like domain, canonical URL, and author. The highest-ranked article is marked as the original. If a better candidate for the original article is later discovered, we update our records.
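The ranking idea can be pictured with a small scoring sketch. Everything specific here is hypothetical: the weights, the `trusted_domains` set, the `canonical_url` field, and the function names are invented for illustration and do not describe the actual ranking algorithm.

```python
def originality_score(article, trusted_domains):
    """Score an article on assumed signals: domain, canonical URL, author."""
    score = 0
    # Hypothetical signal: the article comes from a domain we trust as a source.
    if article.get("domain_url") in trusted_domains:
        score += 2
    # Hypothetical signal: the page declares itself as the canonical URL.
    canonical = article.get("canonical_url")
    if canonical and canonical == article.get("link"):
        score += 1
    # Hypothetical signal: the article names an author.
    if article.get("author"):
        score += 1
    return score


def pick_original(group, trusted_domains):
    """Return the highest-scoring article in a group of similar articles."""
    return max(group, key=lambda a: originality_score(a, trusted_domains))
```

Re-running `pick_original` whenever a new article joins a group mirrors the idea of updating the record when a better candidate appears.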

By default, when you search for articles using the v3 Search News endpoint (/api/search), you get all relevant articles, including possible duplicates. This is useful for detailed analysis and comprehensive research.

If you want only unique articles, set the exclude_duplicates parameter to true. This way, you filter out all duplicates and get just the original articles.

Each article object in the response contains additional fields:

  • duplicate_count: The number of duplicates associated with the original article.

  • duplicate_articles_group_id: A unique identifier for duplicates associated with the original article.

The deduplication feature supports only English-language articles. Applying the exclude_duplicates parameter to non-English articles triggers a validation error.

Here is an example request and response for filtering out duplicate articles in the search results:

Example Request

GET https://v3-api.newscatcherapi.com/api/search?lang=en&exclude_duplicates=true&q=Elon%20Musk
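The same request can be issued from Python using only the standard library. This is a minimal sketch: the function names are illustrative, and the `x-api-token` header name is an assumption, so check your account documentation for the exact authentication scheme.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://v3-api.newscatcherapi.com/api/search"


def build_search_url(query, lang="en", exclude_duplicates=True):
    """Build the search URL with properly encoded query parameters."""
    params = {
        "q": query,
        "lang": lang,
        # The API expects a lowercase boolean string.
        "exclude_duplicates": str(exclude_duplicates).lower(),
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def search_news(query, api_key):
    """Send the request and return the decoded JSON response."""
    request = urllib.request.Request(
        build_search_url(query),
        headers={"x-api-token": api_key},  # assumed header name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Note that `urlencode` encodes the space in the query as `+`, which is equivalent to `%20` in a query string.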

Example Response

 {
    "status": "ok",
    "total_hits": 7929,
    "page": 1,
    "total_pages": 80,
    "page_size": 100,
    "articles": [
        {
            "title": "Musk plans stock option grants to Tesla's high-performers, sources say",
            "author": "Yahoo Finance CA",
            "authors": [
                "Yahoo Finance CA"
            ],
            "journalists": [],
            "published_date": "2024-06-18 06:05:00",
            "published_date_precision": "full",
            "updated_date": "2024-06-18 06:05:00",
            "updated_date_precision": "full",
            "link": "https://headtopics.com/ca/musk-plans-stock-option-grants-to-tesla-s-high-performers-54410956",
            "domain_url": "headtopics.com",
            "full_domain_url": "headtopics.com",
            "name_source": "Head Topics",
            "is_headline": false,
            "paid_content": false,
            "parent_url": "https://headtopics.com/ca",
            "country": "US",
            "rights": "headtopics.com",
            "rank": 16951,
            "media": "https://i.headtopics.com/images/2024/6/18/yahoofinanceca/musk-plans-stock-option-grants-to-tesla-s-high-per-musk-plans-stock-option-grants-to-tesla-s-high-per-78239C706F15AA66DE6240CE7B6EEC78.webp",
            "language": "en",
            "description": "Tesla CEO Elon Musk told employees on Monday that the electric vehicle maker is working on stock-based compensation for high-performing employees, according ...",
            "content": "SAN FRANCISCO - Tesla CEO Elon Musk told employees on Monday that the electric vehicle maker is working on stock-based compensation for high-performing employees, according to two people who reviewed an internal memo...",
            "twitter_account": "headtopicscom",
            "all_links": [
                "https://twitter.com/headtopicscom",
                "https://energyindustrynews.net/energy/news-54412556",
                "https://www.linkedin.com/headtopics",
                "https://www.facebook.com/headtopics"
            ],
            "all_domain_links": [
                "energyindustrynews.net",
                "facebook.com",
                "linkedin.com",
                "twitter.com"
            ],
            "id": "52758118ebc3e1e0ac6314dabb770dc6",
            "score": 24.470715,
            "duplicate_count": 0,
            "duplicate_articles_group_id": "ba87d09e98254108a9d9178b17672018"
        }
        // more articles
    ]
}
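Once a response is parsed, the two deduplication fields can be used to organize results. The sketch below assumes `articles` is the list found under the "articles" key of the response; the helper names are illustrative.

```python
from collections import defaultdict


def group_by_duplicate_id(articles):
    """Map each duplicate_articles_group_id to the articles that carry it."""
    groups = defaultdict(list)
    for article in articles:
        groups[article.get("duplicate_articles_group_id")].append(article)
    return dict(groups)


def articles_with_duplicates(articles):
    """Return only the articles that have at least one known duplicate."""
    return [a for a in articles if a.get("duplicate_count", 0) > 0]
```

Grouping is mainly useful when duplicates are included in the results, since it lets you see every copy of a story side by side.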

Conclusion

Our deduplication process helps you receive unique, high-quality, relevant news articles. Because we store both original articles and their duplicates, you can choose between the complete dataset for detailed analysis and a duplicate-free set of articles. This ensures you get the most out of the news data, whether you are conducting in-depth research or looking for original content.
