Exploration of Entity Identification and Graph-based Relationship Curation in the AI Incident Database
This blog post was contributed by Aiman Muzaffar, Akshay Thirumal Reddy, and Daniel Bazargun, students at The George Washington University School of Business, who analyzed the AI Incident Database and its entity management as part of their capstone coursework.
Editorial note: The AIID, its editors, and its policies address incident reports and alleged harms. Because AI incidents are largely unregulated and not subject to mandatory reporting, the AIID, like others who track AI incidents, usually has no access to ground-truth data regarding incidents or harms. It is important to acknowledge that the journalism on which the following work is based can be biased or inaccurate in its reporting of alleged AI incidents and harms.
Introduction
Identifying and organizing entities within the AI Incident Database (AIID) is a complex task. Most incidents, as reported by journalists, involve intricate relationships among developers, deployers, and alleged victims. Understanding these connections is essential for evaluating accountability and ethical considerations in the deployment of AI systems. At present, AIID entity resolution is performed manually by editorial staff, who sift through thousands of news articles to curate incident reports. This approach is labor-intensive and poses a growing logistical challenge as the database expands.
A more data-driven, automated approach would ease processing of the increasing volume and complexity of AIID incident data. This blog post proposes a new approach to AIID entity resolution, incorporating natural language processing (NLP) and machine learning (ML) techniques to extract entities, group them into meaningful clusters, and efficiently map the relationships between them.
Methodology
We use text embedding and existing entity clustering data, followed by graph visualization and graph clustering, to identify and establish relationships between entities.
Clustering with Text Embeddings
The aim of embedding and clustering existing entities is to automatically group them in a meaningful way that directly informs the construction of a relationship graph. The first step in the approach is to convert known entities into numeric vectors. The embedding model all-MiniLM-L6-v2 was used to generate high-dimensional numeric representations for the entities.
Clustering Methodology
K-means clustering was applied to the entity embeddings to generate 100 groups of entities, based on their similarity in the embedding space. Investigation of the clusters showed expected patterns and relationships among developers, deployers, and reports of harmed parties. Figure 1 below shows the clustering of unique victim entities, where each point represents an entity that is assigned to one of 100 clusters.
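The clustering step can be sketched with scikit-learn's K-means. Here random vectors stand in for the real 384-dimensional entity embeddings, and 10 clusters stand in for the post's 100, since the stand-in dataset is small:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the 384-dimensional entity embeddings from the previous step.
embeddings = rng.normal(size=(500, 384))

# The post uses n_clusters=100 on the full entity set; 10 suffices here.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)  # one cluster id per entity
```

Each entity's cluster id (`labels[i]`) is what gets carried forward into the labeling and graph-construction steps below.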
Figure 1: K-Means Scatter Plot of Alleged Victim Text Embeddings
Defining Relationships for the Clustered Entities
For each cluster, we manually generated a label based on the characteristics of the entities within it. These cluster labels aid interpretation and inform each entity's place in the relationship graph, where every entity becomes a node. See Table 1 for an example of cluster labels and graph nodes.
Table 1: Node Table
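The manual labeling step amounts to attaching a human-chosen label to each cluster id and propagating it to every entity in that cluster. A minimal sketch, with entirely hypothetical cluster names and entities:

```python
# Hypothetical manual labels for a few of the clusters.
cluster_labels = {
    0: "Social media platforms",
    1: "Autonomous vehicle developers",
    2: "Job applicants and employees",
}

# Hypothetical entity-to-cluster assignments from the K-means step.
entity_clusters = {"ExampleCo": 1, "Rideshare drivers": 2}

# Each entity inherits its cluster's label as a graph-node attribute.
nodes = {
    entity: {"cluster": cid, "cluster_label": cluster_labels[cid]}
    for entity, cid in entity_clusters.items()
}
```

The resulting per-entity attributes are what a node table like Table 1 records.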
The entity-relationship graph provides a structured visualization of the relationships among developers, deployers, and alleged victims involved in AI incident reports. Each node represents a distinct entity, such as a developer responsible for creating an AI system, a deployer utilizing the system, or an alleged victim affected by its deployment. The edges, defined in the edge table, establish relationships between these entities, such as "deployed by", connecting alleged developers and deployers to affected victims, as displayed in Table 2 below.
Table 2: Edge Table
The edges between nodes capture specific relationships, including:
- Developer → Deployed by → Deployer: This relationship links developers to deployers, illustrating how AI systems are transferred and utilized in real-world applications.
- Developer → Direct Alleged Harms → Victim: This edge highlights incidents where harm is allegedly caused directly by AI systems implemented by developers.
- Deployer → Allegedly Harms → Victim: This captures cases where deployers’ use of AI systems leads to allegations of harm, reflecting the broader impact of their operational decisions.
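The three relationship types above can be sketched as a small directed graph in NetworkX, with hypothetical entity names standing in for real AIID nodes:

```python
import networkx as nx

# Hypothetical entities standing in for rows of the node table (Table 1).
G = nx.DiGraph()
G.add_node("DevCo", role="developer")
G.add_node("DeployCo", role="deployer")
G.add_node("Affected users", role="alleged victim")

# Edges mirror the three relationship types listed above (Table 2).
G.add_edge("DevCo", "DeployCo", relation="deployed by")
G.add_edge("DevCo", "Affected users", relation="direct alleged harms")
G.add_edge("DeployCo", "Affected users", relation="allegedly harms")
```

From here, PyVis's `Network.from_nx(G)` can render the same graph as an interactive HTML page, which is how the interactive visualization described below can be produced.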
The final interactive visualization, a static image of which appears in Figure 2 below, is a powerful tool for identifying areas where ethical AI governance and accountability are likely to be most needed. The interactive graph was developed using PyVis and NetworkX.
Figure 2a: Entire AIID Relationship Graph, with green detail area highlighted in 2b
Figure 2b: Detail of AIID Relationship Graph for an AI developer and deployer
Conclusion
As AI applications continue to evolve, the tools and methodologies for understanding their impact must advance accordingly. Our current method provides a starting point and illustrates the utility of graph-based analyses for the analysis and reporting of AI incidents. Artifacts for recreating this analysis and visualization are available on GitHub. Of course, challenges remain in automated entity resolution and in understanding the relationships between entities. For example, manually labeling clusters remains time-consuming, and the text-embedding and clustering approaches must be better tuned for incident-report data.