Entity Resolution

What is entity resolution? ER identifies which records in one or more databases refer to the same real world entity.

How to do ER?

The traditional ER techniques are based on calculating string similarity. Typical attributes strings are name, address, etc.

The traditional ER methods do not always yield accurate results.

More recent studies use collective ER. Collective ER techniques takes relational similarity into account, in addition to string similarity.

They yield more accurate results, but at a higher computational cost.
The collective ER often uses the neighbors of nodes as the relation in record graph to solve the ER.

The basics of Entity Resolution

Source

What we need is an intelligent and mostly automated way to assign products under the right category name.
Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets.
Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further.
What's an entity? A unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.).

That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites.

Important questions in ER

How can we tell that these multiple references point to the same entity?
What if the attributes for each entity aren't the same across references?
What happens when there are more than two or three or ten references to the same entity?
Which one is the main (canonical) version?
Do we just throw the duplicates away?

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

Deduplication: eliminating duplicate (exact) copies of repeated data.
Record linkage: identifying records that reference the same entity across different sources.
Canonicalization: converting data with more than one possible representation into a standard form.

Python dedupe library provides great tools for ER.
dedupe library

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data.
dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it's primary task.
In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets.
Dedupe also scales fairly well.
How dedupe works?

Effective deduplication relies largely on domain expertise. This is for two main reasons:

Domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice.
domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look.

As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

dedupe cleverly exploits the structure of the input data to instead compare the records field by field.
dedupe lets the user nominate the features they believe will be most useful.

fields = [

{'field' : 'Name', 'type': 'String'},

{'field' : 'Phone', 'type': 'Exact', 'has missing' : True},

{'field' : 'Address', 'type': 'String', 'has missing' : True},

{'field' : 'Purchases', 'type': 'String'},

]

How dedupe works?

Source

Journalists, academics, and businesses work hard to get big masses of data to learn about what people or organizations are doing. Unfortunately, once we get the data, we often can’t answer our questions because we can’t tell who is who.

In much real-world data, we do not have a way of absolutely deciding whether two records, say John Smith and J. Smith are referring to the same person. If these were records of campaign contribution data, did a John Smith give two donations or did John Smith and maybe Jane Smith give one contribution apiece?

People are pretty good at making these calls, if they have enough information. For example, I would be pretty confident that the following two records are the about the same person.

first name | last name | address | phone |

--------------------------------------------------------------

bob | roberts | 1600 pennsylvania ave. | 555-0123 |

Robert | Roberts | 1600 Pensylvannia Avenue | |

If we have to decide which records in our data are about the same person or organization, then we could just go through by hand, compare every record, and decide which records are about the same entity. This is very, very boring and can takes a long time. Dedupe is a software library that can make these decisions about whether records are about the same thing about as good as a person can, but quickly.

Matching Records

If you look at the following two records, you might think it’s pretty clear that they are about the same person.

first name | last name | address | phone |

--------------------------------------------------------------

bob | roberts | 1600 pennsylvania ave. | 555-0123 |

Robert | Roberts | 1600 Pensylvannia Avenue | |

However, I bet it would be pretty hard for you to explicitly write down all the reasons why you think these records are about the same Mr. Roberts.

Record similarity

One way that people have approached this problem is by saying that records that are more similar are more likely to be duplicates. That’s a good first step, but then we have to precisely define what we mean for two records to be similar.

The default way that we do this in Dedupe is to use what’s called a string metric. A string metric is an way of taking two strings and returning a number that is low if the strings are similar and high if they are dissimilar. One famous string metric is called the Hamming distance. It counts the number of substitutions that must be made to turn one string into another. For example, roberts and Roberts would have Hamming distance of 1 because we have to substitute r for R in order to turn roberts into Roberts.

There are lots of different string metrics, and we actually use a metric called the Affine Gap Distance, which is a variation on the Hamming distance.

Record by record or field by field

When we are calculating whether two records are similar we could treat each record as if it was a long string.

record_distance = string_distance('bob roberts 1600 pennsylvania ave. 555-0123',

'Robert Roberts 1600 Pensylvannia Avenue')

Alternately, we could compare field by field

record_distance = (string_distance('bob', 'Robert')

+ string_distance('roberts', 'Roberts')

+ string_distance('1600 pennsylvania ave.', '1600 Pensylvannia Avenue')

+ string_distance('555-0123', ''))

The major advantage of comparing field by field is that we don’t have to treat each field string distance equally. Maybe we think that its really important that the last names and addresses are similar but it’s not as important that first name and phone numbers are close. We can express that importance with numeric weights, i.e.

record_distance = (0.5 * string_distance('bob', 'Robert')

+ 2.0 * string_distance('roberts', 'Roberts')

+ 2.0 * string_distance('1600 pennsylvania ave.', '1600 Pensylvannia Avenue')

+ 0.5 * string_distance('555-0123', ''))

Setting weights and making decisions

Say we set our record_distance to be this weighted sum of field distances, just as we had above. Let’s say we calculated the record_distance and we found that it was the beautiful number 8.

That number, by itself, is not that helpful. Ultimately, we are trying to decide whether a pair of records are duplicates, and I’m not sure what decision I should make if I see an 8. Does an 8 mean that the pair of records are really similar or really far apart, likely or unlikely to be duplicates. We’d like to define the record distances so that we can look at the number and know whether to decide whether it’s a duplicate.

Also, I really would rather not have to set the weights by hand every time. It can be very tricky to know which fields are going to matter and even if I know that some fields are more important I’m not sure how to quantify it (is it 2 times more important or 1.3 times)?

Fortunately, we can solve both problems with a technique called regularized logistic regression. If we supply pairs of records that we label as either being duplicates or distinct, then Dedupe will learn a set of weights such that the record distance can easily be transformed into our best estimate of the probability that a pair of records are duplicates.

Once we have learned these good weights, we want to use them to find which records are duplicates. But turns out that doing this the naive way will usually not work, and we’ll have to do something smarter.

Active learning

In order to learn those weights, Dedupe needs example pairs with labels. Most of the time, we will need people to supply those labels.

But the whole point of Dedupe is to save people’s time, and that includes making good use of your labeling time so we use an approach called Active Learning.

Basically, Dedupe keeps track of bunch unlabeled pairs and whether

the current learning blocking rules would cover the pairs
the current learned classifier would predict that the pairs are duplicates or are distinct

We maintain a set of the pairs where there is disagreement: that is pairs which classifier believes are duplicates but which are not covered by the current blocking rules, and the pairs which the classifier believes are distinct but which are blocked together.

Dedupe picks, at random from this disagreement set, a pair of records and asks the user to decide. Once it gets this label, it relearns the weights and blocking rules. We then recalculate the disagreement set.

Other field distances

We have implemented a number of field distance measures. See the details about variables.

Making Smart Comparisons

Say we have magic function that takes in a pair of records and always returns a False if a pair of records are distinct and True if a pair of records refer to the same person or organization.

Let’s say that this function was pretty slow. It always took one second to return.

How long would it take to duplicate a thousand records?

Within a dataset of thousand records, there are (1,000×999) / 2 = 499,500 unique pairs of records. If we compared all of them using our magic function it would take six days.

But, one second is a long time, let’s say we sped it up so that we can make 10,000 comparisons per second. Now we can get through our thousand-record-long dataset in less than a minute.

Feeling good about our super-fast comparison function, let’s take on a dataset of 100,000 records. Now there are (100,000×99,999) / 2 = 4,999,950,000 unique possible pairs. If we compare all of them with our super-fast comparison function, it will take six days again. If we want to work with moderately sized data, we have to find a way of making fewer comparisons.

Duplicates are rare

In real world data, nearly all possible pairs of records are not duplicates.

In this four-record example below, only two pairs of records are duplicates–(1, 2) and (3, 4), while there are four unique pairs of records that are not duplicates–(1,3), (1,4), (2,3), and (2,4). Typically, as the size of the dataset grows, the fraction of pairs of records that are duplicates gets very small very quickly.

first name	last name	address	phone	record_id
bob	roberts	1600 pennsylvania ave.	555-0123	1
Robert	Roberts	1600 Pensylvannia Avenue		2
steve	Jones	123 Cowabunga Lane	555-0000	3
Stephen	Janes	123 Cawabunga Ln	444-555-0000	4

If we could only compare records that were true duplicates, we wouldn’t run into the explosion of comparisons. Of course, if we already knew where the true duplicates were, we wouldn’t need to compare any individual records. Unfortunately we don’t, but we do quite well if just compare records that are somewhat similar.

Blocking

Duplicate records almost always share something in common. If we define groups of data that share something and only compare the records in that group, or block, then we can dramatically reduce the number of comparisons we will make. If we define these blocks well, then we will make very few comparisons and still have confidence that will compare records that truly are duplicates.

This task is called blocking, and we approach it in two ways: predicate blocks and canopies.

Predicate blocks

A predicate block is a bundle of records that all share a feature – a feature produced by a simple function called a predicate.

Predicate functions take in a record field, and output a set of features for that field. These features could be “the first 3 characters of the field,” “every word in the field,” and so on. Records that share the same feature become part of a block.

Let’s take an example. Let’s use a “first 3 character” predicate on the address field below..

first name	last name	address	phone	record_id
bob	roberts	1600 pennsylvania ave.	555-0123	1
Robert	Roberts	1600 Pensylvannia Avenue		2
steve	Jones	123 Cowabunga Lane	555-0000	3
Stephen	Janes	123 Cawabunga Ln	444-555-0000	4

That leaves us with two blocks - The ‘160’ block, which contains records 1 and 2, and the ‘123’ block, which contains records 3 and 4.

{'160' : (1,2) # tuple of record_ids

'123' : (3,4)

}

Again, we’re applying the “first three characters” predicate function to the address field in our data, the function outputs the following features – 160, 160, 123, 123 – and then we group together the records that have identical features into “blocks”.

Others simple predicates Dedupe uses include:

whole field
token field
common integer
same three char start
same five char start
same seven char start
near integers
common four gram
common six gram

Index Blocks

Dedupe also uses another way of producing blocks from searching and index. First, we create a special data structure, like an inverted index, that lets us quickly find records similar to target records. We populate the index with all the unique values that appear in field.

When blocking, for each record we search the index for values similar to the record’s field. We block together records that share at least one common search result.

Index predicates require building an index from all the unique values in a field. This can take substantial time and memory. Index predicates are also usually slower than predicate blocking.

Combining blocking rules

If it’s good to put define blocks of records that share the same ‘city’ field, it might be even better to block records that share both the ‘city’ field and the ‘zip code’ field. Dedupe tries these cross-field blocks. These combinations blocks are called disjunctive blocks.

Learning good blocking rules for given data

Dedupe comes with a long set of predicates, and when these are combined Dedupe can have hundreds of possible blocking rules to choose from. We will want to find a small set of these rules that covers every labeled duplicated pair but minimizes the total number pairs dedupe will have to compare.

While we approach this problem by using greedy algorithms, particularly Chvatal’s Greedy Set-Cover algorithm.

Grouping Duplicates

Once we have calculated the probability that pairs of record are duplicates or not, we still have a kind of thorny problem because it’s not just pairs of records that can be duplicates. Three, four, thousands of records could all refer to the same entity (person, organization, ice cream flavor, etc.,) but we only have pairwise measures.

Let’s say we have measured the following pairwise probabilities between records A, B, and C.

A -- 0.6 -- B -- 0.6 -- C

The probability that A and B are duplicates is 60%, the probability that B and C are duplicates is 60%, but what is the probability that A and C are duplicates?

Let’s say that everything is going perfectly and we can say there’s a 36% probability that A and C are duplicates. We’d probably want to say that A and C should not be considered duplicates.

Okay, then should we say that A and B are a duplicate pair and C is a distinct record or that A is the distinct record and that B and C are duplicates?

Well… this is a thorny problem, and we tried solving it a few different ways. In the end, we found that hierarchical clustering with centroid linkage gave us the best results. What this algorithm does is say that all points within some distance of centroid are part of the same group. In this example, B would be the centroid - and A, B, C and would all be put in the same group.

Unfortunately, a more principled answer does not exist because the estimated pairwise probabilities are not transitive.

Clustering the groups depends on us setting a threshold for group membership – the distance of the points to the centroid. Depending on how we choose that threshold, we’ll get very different groups, and we will want to choose this threshold wisely.

In recent years, there has been some very exciting research that solves the problem of turning pairwise distances into clusters, by avoiding making pairwise comparisons altogether. Unfortunately, these developments are not compatible with Dedupe’s pairwise approach. See, Michael Wick, et.al, 2012. “A Discriminative Hierarchical Model for Fast Coreference at Large Scale” and Rebecca C. Steorts, et. al., 2013. “A Bayesian Approach to Graphical Record Linkage and De-duplication”.

Choosing a Good Threshold

Dedupe can predict the probability that a pair of records are duplicates. So, how should we decide that a pair of records really are duplicates?

To answer this question we need to know something about Precision and Recall. Why don’t you check out the Wikipedia page and come back here.

There’s always a trade-off between precision and recall. That’s okay. As long as we know how much we care about precision vs. recall, we can define an F-score that will let us find a threshold for deciding when records are duplicates that is optimal for our priorities.

Typically, the way that we find that threshold is by looking at the true precision and recall of some data where we know their true labels - where we know the real duplicates. However, we will only get a good threshold if the labeled examples are representative of the data we are trying to classify.

So here’s the problem - the labeled examples that we make with Dedupe are not at all representative, and that’s by design. In the active learning step, we are not trying to find the most representative data examples. We’re trying to find the ones that will teach us the most.

The approach we take here is to take a random sample of blocked data, and then calculate the pairwise probability that records will be duplicates within each block. From these probabilities we can calculate the expected number of duplicates and distinct pairs, so we can calculate the expected precision and recall.

Special Cases

The process we have been describing is for the most general case–when you have a dataset where an arbitrary number of records can all refer to the same entity.

There are certain special cases where we can make more assumptions about how records can be linked, which if true, make the problem much simpler.

One important case we call Record Linkage. Say you have two datasets and you want to find the records in each dataset that refer to the same thing. If you can assume that each dataset, individually, is unique, then this puts a big constraint on how records can match. If this uniqueness assumption holds, then (A) two records can only refer to the same entity if they are from different datasets and (B) no other record can match either of those two records.