Introducing John to Dr Watson

Unsupervised Name Matching

Written by Philipp Warmer

In everyday life, individuals are referred to by a variety of names. Which name is used largely depends on the context. A 'Dr. John H. Watson' might be called 'John' by close friends and 'Dr. Watson' in a professional setting. Although the two variants have no part in common, they refer to the same person. This is obvious to a human, but making the connection is challenging for a computer because it lacks the context.

Here we showcase an unsupervised workflow based on dirty-cat, spaCy and scikit-learn to determine the underlying name-groups of different name variants or, in layman's terms, to explain to the computer that 'John' and 'Dr. Watson' both refer to the name-group 'Dr. John H. Watson'. This is done in four steps:

1. Extract the names from the text
2. Determine the number of name clusters based on the name similarities
3. Use the number of clusters to determine meaningful name-groups
4. Assign each name to a name-group

Setting up the environment

Before we get started let's make sure we have all required libraries and the respective spaCy language model installed.

Package	Version
dirty-cat	0.2.0
en-core-web-sm	3.2.0
matplotlib	3.5.1
numpy	1.22.3
pandas	1.4.2
scikit-learn	1.0.2
seaborn	0.11.2
spacy	3.2.4

After installing spaCy the language model can be downloaded like this:

> python -m spacy download en_core_web_sm
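
To confirm that the environment matches the table above, a small check along the following lines may help. This is only a sketch: the helper name check_versions is my own, it uses nothing but the standard library, and it leaves out the spaCy language model, which is installed separately.

```python
from importlib import metadata
from typing import Optional

# Versions listed in the table above, keyed by distribution name.
EXPECTED_VERSIONS = {
    "dirty-cat": "0.2.0",
    "matplotlib": "3.5.1",
    "numpy": "1.22.3",
    "pandas": "1.4.2",
    "scikit-learn": "1.0.2",
    "seaborn": "0.11.2",
    "spacy": "3.2.4",
}


def check_versions(expected: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package to (expected version, installed version or None)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            installed = None
        report[package] = (wanted, installed)
    return report


if __name__ == "__main__":
    for package, (wanted, installed) in check_versions(EXPECTED_VERSIONS).items():
        status = "ok" if installed == wanted else "differs"
        print(f"{package:<14} expected {wanted:<8} installed {installed} ({status})")
```

A differing version does not necessarily break the workflow; the table only records what the code was run with.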

Let’s load the required packages

Now that all the libraries are installed, let us import them. I ran all of the following code in Python 3.9.11.

import numpy as np
import spacy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import AffinityPropagation
from dirty_cat import SimilarityEncoder, GapEncoder

If you can import all of the libraries without an error you are ready to go. Let's start!

Detecting and extracting names from text

First we want to get a list of names. The code snippet below extracts a list of person names from an input string. This step is included for completeness; in the next step we continue with a predefined list of names.

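def get_names(text:str)-> list[str]:
    """
    Returns a unique, sorted list of person names from a string of text.
    
    Parameters
    ----------
    text : string
    
    Returns
    -------
    out : sorted list of unique names extracted from text
    """
    # Load the small English pipeline (load it once outside the function for repeated calls)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    # Keep only the entities that spaCy labels as people
    names = {ent.text for ent in doc.ents if ent.label_ == 'PERSON'}
    # Sort after de-duplicating, because a set has no defined order
    return sorted(names)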

> names = get_names(some_book)

Determine the similarity of names

To decide how many individuals our names should be mapped onto, we first need to estimate the number of people to whom the names belong. For illustration purposes I’ve chosen two people, ‘Dr. John H. Watson’ and ‘Sherlock Holmes’. For each of them I’ve selected multiple name variants: first name only, last name only, with and without title, and one with a typo. Using this example list as input, we compute the n-gram similarities with the get_encodings function. Under the hood it uses the similarity encoder from the dirty-cat library. The get_clusters function then applies affinity propagation to these similarities. This yields an automated cluster assignment for each name and thus also the total number of clusters.

names = ['Watson', 'John', 'John H. Watson', 'Sherlock Holmes', 'olmes', 'Sherlock', 'Holmes', 'Dr. Watson']

def get_encodings(names:list[str])->np.ndarray:
    """
    Returns n-gram similarities from list of names. 
    
    Parameters
    ----------
    names : list of strings
    
    Returns:
    --------
    transformed_values : array
                      An array of n-gram similarities, one row per unique
                      name in sorted order
    """
    # np.unique de-duplicates and sorts, so rows follow sorted order, not input order
    sorted_values = np.unique(np.array(names))
    similarity_encoder = SimilarityEncoder(similarity='ngram')
    # The encoder expects a 2D array with one column
    transformed_values = similarity_encoder.fit_transform(
        sorted_values.reshape(-1, 1))
    return transformed_values
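
To build some intuition for what an n-gram similarity measures, here is a simplified, dependency-free sketch. It is not dirty-cat's implementation: I assume character 3-grams, pad each name with one space on either side, and score two names by the share of 3-grams they have in common (a Jaccard index). The function names are my own.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by all distinct n-grams of both strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to contain a single n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    print(ngram_similarity("John", "Dr. Watson"))    # 0.0, no shared 3-grams
    print(ngram_similarity("Watson", "Dr. Watson"))  # 0.6
    print(ngram_similarity("olmes", "Holmes"))       # typo still scores well above zero
```

The first pair shows why 'John' and 'Dr. Watson' cannot be linked by direct string comparison alone: they only become connected through variants such as 'John H. Watson' that overlap with both.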

def get_clusters(encodings, random_state:int = 0)-> tuple[np.ndarray, int]:
    """
    Returns the cluster identity for the input encodings and the total number of clusters.

    Parameters
    ----------
    encodings : array
             output from get_encodings

    random_state : int, default: 0
                seed for random number

    Returns:
    --------
    clusters : array
           Cluster identity for each encoding

    n_clusters : int
             Number of identified clusters

    """
    clustering = AffinityPropagation(random_state=random_state).fit(encodings)
    clusters = clustering.labels_
    n_clusters = len(set(clusters))
    return (clusters, n_clusters)

> encodings = get_encodings(names)
> clusters, n_clusters = get_clusters(encodings)
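
One detail to keep in mind when inspecting the result: get_encodings passes the names through np.unique, so the cluster labels follow the sorted, de-duplicated names rather than the input order. The sketch below pairs labels with names under that assumption. The helper name and the example labels are hypothetical; the labels are made up for illustration and are not output of the workflow.

```python
from collections import defaultdict


def group_names_by_cluster(names: list[str], labels: list[int]) -> dict[int, list[str]]:
    """Group names by cluster label, assuming labels follow sorted unique names."""
    # For plain Python strings this mirrors the order produced by np.unique.
    ordered = sorted(set(names))
    if len(ordered) != len(labels):
        raise ValueError("expected one label per unique name")
    groups = defaultdict(list)
    for name, label in zip(ordered, labels):
        groups[int(label)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ['Watson', 'John', 'John H. Watson', 'Sherlock Holmes',
             'olmes', 'Sherlock', 'Holmes', 'Dr. Watson']
    # Hypothetical labels, one per sorted unique name.
    example_labels = [1, 0, 1, 1, 0, 0, 1, 0]
    for label, members in group_names_by_cluster(names, example_labels).items():
        print(label, members)
```

With the real workflow, the second argument would be the clusters array returned by get_clusters.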

Let's see whether we retrieved the two expected name clusters for Dr. John H. Watson and Sherlock Holmes.

> print(n_clusters)
>> 2

Hurray! We got the right number of name clusters. Let’s next figure out the underlying name constituents.

Let's determine the name-groups

Now that we have automatically determined the number of name clusters we next use the function get_name_groups to determine the underlying name constituents as well as the activations for each name-group. Why do we want the activations? In short, an activation can be understood as how strongly a given name responds to a name-group, thereby giving us a measure of relatedness.

def get_name_groups(names:list[str], n_labels:int, n_name_parts:int=2, random_state:int=42)-> tuple[list[str], np.ndarray]:
    """
    Returns the name-groups and activations.
    
    Parameters
    ----------
    names : list of strings
          
    
    n_labels : int
            Number of wanted name-groups, output of get_clusters can be used
    
    n_name_parts : int default 2
                Number of wanted name-group parts, 2 as default as people typically have a first and a last name
                
    random_state : int default 42
                Seed for random numbers

    Returns:
    --------
    name_groups : list of strings
               name-groups consisting of n_name_parts
    
    name_activations : array
                    Activations for each name
           
    """
    # Pass the seed through instead of hard-coding it; the default of 42 is the
    # seed that was previously fixed in the function body
    enc = GapEncoder(n_components=n_labels, random_state=random_state)
    # The encoder expects a 2D array with one column
    name_activations = enc.fit_transform(np.array(names).reshape(-1,1))
    # Describe each name-group by its n_name_parts most characteristic name parts
    name_groups = enc.get_feature_names_out(n_labels=n_name_parts)
    return (name_groups, name_activations)

> name_groups, name_activations = get_name_groups(names, n_clusters)

So let's check if the name-groups make sense to us.

> print(name_groups)
>> ['sherlock, holmes', 'watson, john']

That looks promising. We determined two name-groups: 'sherlock, holmes', 'watson, john'. Let's now connect them to our list of names.

Let’s map each name to a name-group

Now we can map the name-group activation to each name. The figure below, generated with plot_topic_activations, shows the activation value for each name / name-group pair. Do we see activations that make sense to us?

def plot_topic_activations(name_activations:np.ndarray, name_groups:list[str], names:list[str])-> None:
    """
    Visualize mapping of name-groups to names
    
    Parameters
    ----------
    name_activations : array
                    Array of name activations from get_name_groups
          
    
    name_groups : list of strings
               List of name-groups from get_name_groups
    
    names : list of strings
         Names in the same order as the rows of name_activations

    Returns
    --------
    None
       Displays a heatmap showing the activations for each name, name group pair
           
    """
    data = pd.DataFrame(name_activations, columns=name_groups, index=names)
    sns.heatmap(data)
    plt.title('Activations for each name / name-group pair')
    plt.xlabel('Name-groups')
    plt.ylabel('Extracted names')
    plt.show()

> plot_topic_activations(name_activations, name_groups, names)

It looks like each of the extracted names has a clear name-group associated with it. All the ‘Holmes’ variants map to the ‘sherlock, holmes’ name-group and the ‘Watson’ variants to the ‘watson, john’ name-group. We are not done yet, though: let us connect them programmatically!

To do so, we use the get_clean_names function to select the largest activation value for each name. This way we pick the most strongly associated name-group for every name. Let's check whether we can connect John and Dr. Watson to the same name-group.

def get_clean_names(name_activations:np.array, name_groups:list[str], names:list[str])-> pd.DataFrame:
    """
    Returns mapping of name-groups to names
    
    Parameters
    ----------
    name_activations : array
                    Array of name activations from get_name_groups
          
    
    name_groups : list of strings
               List of name-groups from get_name_groups
    
    names : list of strings
         Names in the same order as the rows of name_activations

    Returns
    --------
    out : pd.DataFrame
       Dataframe contains mapping of name-groups to names
           
    """
    return (pd.DataFrame(name_activations, columns=name_groups, index=names)
               .idxmax(axis=1)
               .reset_index()
               .rename({'index':'extracted_names', 0:'name_group'}, axis=1)
            )

> matched_names = get_clean_names(name_activations, name_groups, names)

extracted_names      name_group
0    Watson            watson, john
1    John              watson, john
2    John H. Watson    watson, john
3    Sherlock Holmes   sherlock, holmes
4    olmes             sherlock, holmes
5    Sherlock          sherlock, holmes
6    Holmes            sherlock, holmes
7    Dr. Watson        watson, john
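
The selection performed by idxmax can also be written out by hand, which makes it easy to see how clear-cut each assignment is. The following is a minimal sketch using plain lists. The function name, the margin (the lead of the winning name-group over the runner-up) and the activation values are my own additions for illustration and are not output of the workflow.

```python
def assign_name_groups(activations, name_groups, names):
    """Map each name to (best name-group, margin over the second-best activation)."""
    assignments = {}
    for name, row in zip(names, activations):
        # Index of the largest activation; the first one wins on ties.
        best = max(range(len(row)), key=lambda i: row[i])
        others = [value for i, value in enumerate(row) if i != best]
        # With a single name-group there is no runner-up to compare against.
        margin = row[best] - max(others) if others else row[best]
        assignments[name] = (name_groups[best], margin)
    return assignments


if __name__ == "__main__":
    # Made-up activation values, one row per name and one column per name-group.
    example_groups = ['sherlock, holmes', 'watson, john']
    example_names = ['John', 'Holmes', 'olmes']
    example_activations = [[0.1, 2.0], [1.5, 0.2], [0.9, 0.8]]
    result = assign_name_groups(example_activations, example_groups, example_names)
    for name, (group, margin) in result.items():
        print(f"{name:<8} -> {group:<18} margin {margin:.2f}")
```

A small margin flags a name whose assignment is ambiguous and may deserve a manual look.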

That looks like a good mapping to me :-) We were able to successfully map all the ‘Holmes’ and ‘Watson’ variants to their respective name-group and thus brought John back to Dr. John H. Watson. ✅

Voila, let's wrap it up.

In this automated workflow we started out by highlighting how names can be extracted using named entity recognition. Next we selected a set of names, computed their n-gram similarity, from which we determined the number of meaningful name clusters. Afterwards we pulled out their underlying name-groups and mapped them back to the initial names using topic activations. This way we performed an unsupervised mapping of both 'John' and 'Dr. Watson' to the same underlying name-group: ‘watson, john’. In other words, we mapped ambiguous names to unambiguous name-groups.

This workflow runs smoothly for our test cases, but the real world of natural language processing is arguably messier. If this workflow doesn’t function well for your use case, here are a couple of knobs that can be tuned to make the pipeline more robust. Setting the obvious preprocessing of names aside, the main ones are:

Determining name clusters: One way to improve the unsupervised determination of name clusters is to make it work by consensus of multiple orthogonal approaches. The similarity encoding approach from dirty-cat could be supplemented with the word2vec algorithm (https://arxiv.org/abs/1301.3781) or the transformer-based universal sentence encoder (https://arxiv.org/abs/1803.11175). Then the number of name clusters is determined by majority vote, which is inherently more robust.
Generation of name-groups: While content, such as name-groups, can be generated by reconstructing a latent space using convolutional variational autoencoders or generative adversarial networks, any of those would require massive pre-training to be useful for our workflow. A more lightweight way would be to bundle potentially ambiguous name-groups together and use them as the input to the whole workflow, that is, to run it recursively either for a predefined number of iterations or until the activations converge.
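
Both ideas can be sketched without committing to specific models. In the sketch below, vote_on_cluster_count picks the cluster count that most approaches agree on, and iterate_until_stable re-runs a step until its output stops changing or an iteration limit is hit. Both helper names are hypothetical, the estimates are made-up numbers, and the toy step merely stands in for a full pass through the workflow.

```python
from collections import Counter
from typing import Callable, TypeVar

State = TypeVar("State")


def vote_on_cluster_count(estimates: dict[str, int]) -> int:
    """Return the cluster count proposed by the most approaches; fail on a tie."""
    if not estimates:
        raise ValueError("need at least one estimate")
    ranked = Counter(estimates.values()).most_common()
    winner, votes = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == votes:
        raise ValueError("tie between cluster counts; add another approach")
    return winner


def iterate_until_stable(step: Callable[[State], State], state: State,
                         max_iter: int = 10) -> tuple[State, int]:
    """Apply `step` until the state stops changing or `max_iter` is reached."""
    for iteration in range(1, max_iter + 1):
        new_state = step(state)
        if new_state == state:
            return state, iteration
        state = new_state
    return state, max_iter


if __name__ == "__main__":
    # Made-up estimates from three hypothetical approaches.
    estimates = {"ngram_similarity": 2, "word2vec": 2, "sentence_encoder": 3}
    print(vote_on_cluster_count(estimates))

    # Toy step: de-duplicate and sort name-groups, standing in for one workflow pass.
    merged, n_iter = iterate_until_stable(lambda g: tuple(sorted(set(g))),
                                          ("watson, john", "watson, john", "sherlock, holmes"))
    print(merged, n_iter)
```

For real activations, exact equality is too strict a stopping rule; comparing successive arrays within a tolerance would be the more natural choice.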

I hope you took something away from my assembly of spaCy, dirty_cat and sklearn. If you have any questions or input please feel free to get in touch.