Entity Disambiguation

Precision in Company News Tracking

Introduction

Entity Disambiguation is the task of accurately identifying and differentiating between entities (in this case, companies) that share the same or similar names. It is a core step in news tracking and information retrieval, where companies need precise and relevant news about their own business, their competitors, or their industry.

Why Entity Disambiguation Matters: A Real-World Example

Imagine a financial services company that tracks news articles about "Apple" to stay informed about the tech giant's latest developments. However, the term "Apple" can refer to the tech company, the fruit, or even a person's name. Without Entity Disambiguation, the company might receive a mix of irrelevant articles about fruit farming or unrelated individuals.

Entity Disambiguation solves this problem by using identifiers such as the company's domain URL (apple.com), founder names (Steve Jobs, Steve Wozniak), and other contextual information to filter out irrelevant articles. The financial services company then receives only articles about Apple Inc. and can base its decisions on accurate and relevant information.

How Entity Disambiguation Works

1. Smart Article Retrieval

The Entity Disambiguation system is built on top of our News API v3, which aggregates over 1 million articles daily. The Entity Disambiguation process begins by retrieving potentially relevant articles. It uses a detailed search query that combines:

  • The company's legal name

  • Website URL

  • Clean name from the Clearbit API

To identify company names that are also common English words, the system applies a result-count heuristic:

  1. It makes an API request to search for the full company name over a week-long period.

  2. If there are more than 10,000 results, the system categorizes the name as a common English word.

  3. For such cases, the search is limited to the ner_ORG field, which contains only organizations identified through Named Entity Recognition.

For example, for a company named "Riot," if the initial search returns more than 10,000 results in a week, the system would adjust its query to:

ORG_entity_name = '"riot" OR "riot.com" OR "Riot"'

This approach helps to significantly reduce the number of irrelevant articles retrieved in the initial stage.
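
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the function names, the `"all"` field label for an unrestricted search, and the returned dictionary shape are hypothetical. Only the 10,000-result threshold, the `ner_ORG` field, and the shape of the "Riot" query come from the description above.

```python
# Weekly result count above which a name is treated as a common English word.
COMMON_WORD_THRESHOLD = 10_000


def is_common_word(weekly_result_count: int, threshold: int = COMMON_WORD_THRESHOLD) -> bool:
    """Return True when a one-week search returns more than `threshold` results."""
    return weekly_result_count > threshold


def build_search_terms(name: str, domain: str) -> str:
    """Join quoted name variants and the domain with OR, dropping duplicates."""
    variants = [name.lower(), domain, name]
    unique = list(dict.fromkeys(variants))  # de-duplicate while keeping order
    return " OR ".join(f'"{v}"' for v in unique)


def plan_search(name: str, domain: str, weekly_result_count: int) -> dict:
    """Restrict the search to NER organizations only for common-word names."""
    # "all" is a hypothetical label meaning "no field restriction".
    field = "ner_ORG" if is_common_word(weekly_result_count) else "all"
    return {"field": field, "query": build_search_terms(name, domain)}
```

For a name such as "Riot" with 25,000 weekly results, `plan_search("Riot", "riot.com", 25_000)` selects the `ner_ORG` field and produces the same query string as the example above.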

2. Filter Flag Creation

Once the system has retrieved a selection of potentially relevant articles, it creates flags based on company identifiers found in the article. These flags include:

  • is_domain_present: Indicates whether the company's domain URL is mentioned in the article.

  • is_company_name_present_in_title: Checks if the company's name appears in the article's title.

  • is_company_name_present_in_ai_generated_summary: Indicates whether the company's name is present in the AI-generated summary of the article.

  • is_alias_present_in_content: Checks if any aliases of the company are mentioned in the article's content.

  • is_alias_present_in_title: Checks if any aliases of the company appear in the title.

  • founder_present: Indicates whether the company's founder is mentioned in the article.

  • founder_present_percent: Represents the percentage of founders mentioned if the company has multiple founders.
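
To make the flags concrete, here is a hedged sketch of how they could be derived. The input shapes (`article` with `title`, `summary`, and `content` keys; `company` with `name`, `domain`, `aliases`, and `founders`) are assumptions for illustration, as is the plain case-insensitive substring matching. The sketch also assumes that the founder flags are `None` when no founders are known and that the percentage is expressed on a 0 to 100 scale; the actual system may differ on both points.

```python
def build_flags(article: dict, company: dict) -> dict:
    """Compute presence flags using naive case-insensitive substring matching."""
    title = article.get("title", "").lower()
    summary = article.get("summary", "").lower()
    content = article.get("content", "").lower()

    name = company["name"].lower()
    aliases = [a.lower() for a in company.get("aliases", [])]
    founders = [f.lower() for f in company.get("founders", [])]

    # Founders whose full name appears somewhere in the article content.
    mentioned = [f for f in founders if f in content]

    return {
        "is_domain_present": company["domain"].lower() in content,
        "is_company_name_present_in_title": name in title,
        "is_company_name_present_in_ai_generated_summary": name in summary,
        "is_alias_present_in_content": any(a in content for a in aliases),
        "is_alias_present_in_title": any(a in title for a in aliases),
        # Assumption: flags are None when the company has no known founders.
        "founder_present": bool(mentioned) if founders else None,
        "founder_present_percent": (
            100 * len(mentioned) / len(founders) if founders else None
        ),
    }
```

Substring matching keeps the sketch short but would also match a name embedded in a longer word, so a real implementation would likely need tokenization or word-boundary checks.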

3. Semantic Similarity Analysis

A key part of the Entity Disambiguation process is calculating the similarity between the article text and the company's description. We use an embedding-based text comparison approach:

  1. We convert both the company description and relevant sentences from the article into vector representations (embeddings) using Natural Language Processing (NLP).

  2. These embeddings capture the meaning and relationships of the content, allowing us to gauge the semantic similarity between the company description and the article content.

  3. We use cosine similarity as our metric to measure the similarity between these vector representations.

This method allows us to identify content that is semantically related to the company, even if it uses different wording or phrasing.

The cosine similarity scores reported here are floating-point numbers between 0 and 1 (cosine similarity itself can be as low as -1 for vectors that point in opposite directions):

  • A score of 1 indicates that the two embeddings point in the same direction, meaning the texts are effectively identical in semantic content.

  • A score of 0 indicates that the texts are unrelated.

  • The higher the score, the more similar the two texts are semantically.
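
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below shows the calculation in plain Python on toy vectors; it does not reproduce the embedding model, which is not specified here, and returning 0.0 for a zero-length vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors that point in the same direction, such as `[1.0, 2.0]` and `[2.0, 4.0]`, score approximately 1 regardless of their length, while perpendicular vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` score 0.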

The semantic similarity analysis produces the following additional fields in the entity_disambiguation object:

  • average_cosine_similarity: Average similarity score between the company's description and all relevant sentences in the article.

  • highest_cosine_similarity: Highest similarity score among all relevant sentences in the article.

  • relevant_sentences: Array of objects containing sentences from the article identified as relevant to the company, along with their cosine similarity scores.

Remember that higher cosine similarity scores indicate stronger semantic similarity to the company's description. This can help in prioritizing and filtering the most relevant articles.
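
As an illustration of how these three fields relate to one another, the sketch below builds them from a list of already-scored sentences. The function name and the `(sentence, score)` input format are hypothetical, and returning `None` for the two scores when no relevant sentences exist is an assumption rather than documented behavior.

```python
def summarize_similarity(scored_sentences: list[tuple[str, float]]) -> dict:
    """Aggregate per-sentence scores into the similarity fields described above."""
    if not scored_sentences:
        # Assumption: no relevant sentences means no scores to report.
        return {
            "average_cosine_similarity": None,
            "highest_cosine_similarity": None,
            "relevant_sentences": [],
        }
    scores = [score for _, score in scored_sentences]
    return {
        "average_cosine_similarity": sum(scores) / len(scores),
        "highest_cosine_similarity": max(scores),
        "relevant_sentences": [
            {"sentence": sentence, "cosine_similarity": score}
            for sentence, score in scored_sentences
        ],
    }
```

Because the average is taken over all relevant sentences, a single strongly matching sentence raises `highest_cosine_similarity` much more than it raises `average_cosine_similarity`, which is why the two fields are useful together.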

For more information on cosine similarity and vector embeddings, check out Vector Embeddings and read about Cosine Similarity on Wikipedia.

Delivering Results: Structure and Frequency

The Entity Disambiguation system adds extra fields related to the disambiguation process to each article object. These enriched articles are organized into clusters of semantically similar content and delivered to clients as data dumps in an AWS S3 bucket.

Data Structure

Each article object within a cluster is enriched with entity disambiguation data, as illustrated in the example below:

{
  "title": "Implementing Neural Networks in TensorFlow (and PyTorch)",
  // ... (other standard article fields)
  "entity_disambiguation": {
    "average_cosine_similarity": 0.3923861011862755,
    "highest_cosine_similarity": 0.4942742586135864,
    "relevant_sentences": [
      {
        "sentence": "TensorFlow is a comprehensive ecosystem of tools, libraries, and community resources for building and deploying machine learning applications.",
        "cosine_similarity": 0.4942742586135864
      },
      // ... (other relevant sentences)
    ],
    "founder_present": null,
    "founder_present_percent": null,
    "is_domain_present": true,
    "is_company_name_present_in_title": true,
    "is_company_name_present_in_ai_generated_summary": true,
    "is_alias_present_in_content": true,
    "is_alias_present_in_title": true
  },
  "company_name": "TensorFlow",
  "company_aliases": "TensorFlow",
  "cluster_id": "16552780591689057479"
}

Delivery Frequency

The frequency of updates can range from daily to hourly, depending on client needs and system capacity. Typically, a new folder is created for each day, containing the latest data for all monitored companies.

Data Flexibility and Client Usage

The Entity Disambiguation system provides enriched article data with additional fields, allowing clients to implement their own filtering strategies based on their specific needs. Clients can use these fields to:

  • Filter articles based on the presence of the company's domain or name in the title or summary.

  • Prioritize articles with high semantic similarity to the company's description.

  • Focus on articles that mention company founders or specific aliases.

  • Group related articles, enabling trend analysis and comprehensive coverage of specific topics.

The system provides detailed similarity scores, allowing clients to set their own thresholds based on their specific requirements. This flexibility lets clients fine-tune their filtering process to balance precision and recall in their news-tracking efforts.
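
One possible client-side strategy is sketched below, assuming article objects shaped like the example in the Data Structure section. The rule itself (require the domain or company name to be present, then apply a minimum `highest_cosine_similarity`) and the 0.4 default threshold are illustrative choices only; each client would tune these against their own data.

```python
from collections import defaultdict


def keep_article(article: dict, min_highest_similarity: float = 0.4) -> bool:
    """Keep articles that name the company and clear a similarity threshold."""
    ed = article.get("entity_disambiguation", {})
    name_or_domain = (
        ed.get("is_domain_present")
        or ed.get("is_company_name_present_in_title")
        or ed.get("is_company_name_present_in_ai_generated_summary")
    )
    # Treat a missing or null score as 0.0 so the article is filtered out.
    score = ed.get("highest_cosine_similarity") or 0.0
    return bool(name_or_domain) and score >= min_highest_similarity


def group_by_cluster(articles: list[dict]) -> dict[str, list[dict]]:
    """Group articles by cluster_id for trend analysis."""
    clusters = defaultdict(list)
    for article in articles:
        # "unclustered" is a hypothetical fallback key for this sketch.
        clusters[article.get("cluster_id", "unclustered")].append(article)
    return dict(clusters)
```

Raising the threshold favors precision (fewer irrelevant articles), while lowering it favors recall (fewer missed articles).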

Benefits and Use Cases

Entity Disambiguation offers several key benefits:

  • Improved Accuracy: Ensures that clients receive only relevant articles, reducing noise and improving the accuracy of their information.

  • Time-Saving: Automatically filters out irrelevant articles, saving clients time and effort in manual sorting.

  • Customizable Filtering: Allows clients to set their own filtering criteria based on the provided flags, giving them flexibility in how they use the information.

Industries and companies that can particularly benefit from Entity Disambiguation include:

  • Financial Services: Analysts and investors can focus on relevant company news for informed decision-making.

  • Public Relations and Marketing: Agencies can effectively track mentions of their clients in the media.

  • Legal and Compliance: Firms can monitor relevant news about their clients or regulatory changes.

  • Investment Firms: Track news about companies they invest in to make timely decisions.

  • Corporate Communications Teams: Monitor media coverage to manage their brand image.

  • Regulatory Bodies: Track news about companies under their jurisdiction to ensure compliance.

Put Entity Disambiguation to Work

Entity Disambiguation is a powerful tool for companies and organizations that need to track precise and relevant news about specific entities. By leveraging advanced natural language processing techniques and customizable filtering options, it offers a flexible and accurate solution for navigating the complex landscape of digital news and information.

Reach out to our sales team to learn how Entity Disambiguation can enhance your news tracking.
