What we need is an intelligent and mostly automated way to assign products under the right category name.
Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets.
Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further.
What's an entity? A unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.).
That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites.
Important questions in ER
How can we tell that these multiple references point to the same entity?
What if the attributes for each entity aren't the same across references?
What happens when there are more than two or three or ten references to the same entity?
Which one is the main (canonical) version?
Do we just throw the duplicates away?
The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:
Deduplication: eliminating duplicate (exact) copies of repeated data.
Record linkage: identifying records that reference the same entity across different sources.
Canonicalization: converting data with more than one possible representation into a standard form.
Python dedupe library provides great tools for ER.
dedupe library
Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data.
dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.
It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it's primary task.
In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets.
Dedupe also scales fairly well.
Howdedupeworks?
Effective deduplication relies largely on domain expertise. This is for two main reasons:
Domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice.
domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look.
As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.
dedupe cleverly exploits the structure of the input data to instead compare the records field by field.
dedupe lets the user nominate the features they believe will be most useful.
dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches.
These uncertainPairs are identified using a combination of blocking , affine gap distance, and active learning.
Blocking is used to reduce the number of overall record comparisons that need to be made.
dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records.
In the case of our people dataset above (image above), the predicates might be things like:
the first three digits of the phone number
the full name
the first five characters of the name
a random 4-gram within the city name
Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase.
The blocks are developed by computing the edit distance between predicates across records.
dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper.
Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number.
This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).
The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema.
As the user labels more and more tuples, dedupegradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.
Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not.
In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), dedupe uses hierarchical clustering with centroidal linkage.
Records within some threshold distance of a centroid will be grouped together.
The final result is an annotated version of the original dataset that now includes a centroid label for each record.
Active Learning
You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.
Do these records refer to the same thing? (y)es / (n)o / (u)nsure / (f)inished
It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection.
In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.
Journalists, academics, and businesses work hard to get big masses of data to learn about what people or organizations are doing. Unfortunately, once we get the data, we often can’t answer our questions because we can’t tell who is who.
In much real-world data, we do not have a way of absolutely deciding whether two records, say John Smith and J. Smith are referring to the same person. If these were records of campaign contribution data, did a John Smith give two donations or did John Smith and maybe Jane Smith give one contribution apiece?
People are pretty good at making these calls, if they have enough information. For example, I would be pretty confident that the following two records are the about the same person.
bob | roberts | 1600 pennsylvania ave. | 555-0123 |
Robert | Roberts | 1600 Pensylvannia Avenue | |
If we have to decide which records in our data are about the same person or organization, then we could just go through by hand, compare every record, and decide which records are about the same entity. This is very, very boring and can takes a long time. Dedupe is a software library that can make these decisions about whether records are about the same thing about as good as a person can, but quickly.
Matching Records
If you look at the following two records, you might think it’s pretty clear that they are about the same person.
bob | roberts | 1600 pennsylvania ave. | 555-0123 |
Robert | Roberts | 1600 Pensylvannia Avenue | |
However, I bet it would be pretty hard for you to explicitly write down all the reasons why you think these records are about the same Mr. Roberts.
Record similarity
One way that people have approached this problem is by saying that records that are more similar are more likely to be duplicates. That’s a good first step, but then we have to precisely definewhat we mean for two records to be similar.
The default way that we do this in Dedupe is to use what’s called a string metric. A string metric is an way of taking two strings and returning a number that is low if the strings are similar and high if they are dissimilar. One famous string metric is called the Hamming distance. It counts the number of substitutions that must be made to turn one string into another. For example, roberts and Roberts would have Hamming distance of 1 because we have to substitute r for R in order to turn roberts into Roberts.
There are lots of different string metrics, and we actually use a metric called the Affine Gap Distance, which is a variation on the Hamming distance.
Record by record or field by field
When we are calculating whether two records are similar we could treat each record as if it was a long string.
record_distance = string_distance('bob roberts 1600 pennsylvania ave. 555-0123',
The major advantage of comparing field by field is that we don’t have to treat each field string distance equally. Maybe we think that its really important that the last names and addresses are similar but it’s not as important that first name and phone numbers are close. We can express that importance with numeric weights, i.e.
Say we set our record_distance to be this weighted sum of field distances, just as we had above. Let’s say we calculated the record_distance and we found that it was the beautiful number 8.
That number, by itself, is not that helpful. Ultimately, we are trying to decide whether a pair of records are duplicates, and I’m not sure what decision I should make if I see an 8. Does an 8 mean that the pair of records are really similar or really far apart, likely or unlikely to be duplicates. We’d like to define the record distances so that we can look at the number and know whether to decide whether it’s a duplicate.
Also, I really would rather not have to set the weights by hand every time. It can be very tricky to know which fields are going to matter and even if I know that some fields are more important I’m not sure how to quantify it (is it 2 times more important or 1.3 times)?
Fortunately, we can solve both problems with a technique called regularized logistic regression. If we supply pairs of records that we label as either being duplicates or distinct, then Dedupe will learn a set of weights such that the record distance can easily be transformed into our best estimate of the probability that a pair of records are duplicates.
Once we have learned these good weights, we want to use them to find which records are duplicates. But turns out that doing this the naive way will usually not work, and we’ll have to do something smarter.
Active learning
In order to learn those weights, Dedupe needs example pairs with labels. Most of the time, we will need people to supply those labels.
But the whole point of Dedupe is to save people’s time, and that includes making good use of your labeling time so we use an approach called Active Learning.
Basically, Dedupe keeps track of bunch unlabeled pairs and whether
the current learning blocking rules would cover the pairs
the current learned classifier would predict that the pairs are duplicates or are distinct
We maintain a set of the pairs where there is disagreement: that is pairs which classifier believes are duplicates but which are not covered by the current blocking rules, and the pairs which the classifier believes are distinct but which are blocked together.
Dedupe picks, at random from this disagreement set, a pair of records and asks the user to decide. Once it gets this label, it relearns the weights and blocking rules. We then recalculate the disagreement set.
Say we have magic function that takes in a pair of records and always returns a False if a pair of records are distinct and True if a pair of records refer to the same person or organization.
Let’s say that this function was pretty slow. It always took one second to return.
How long would it take to duplicate a thousand records?
Within a dataset of thousand records, there are (1,000×999) / 2 = 499,500 unique pairs of records. If we compared all of them using our magic function it would take six days.
But, one second is a long time, let’s say we sped it up so that we can make 10,000 comparisons per second. Now we can get through our thousand-record-long dataset in less than a minute.
Feeling good about our super-fast comparison function, let’s take on a dataset of 100,000 records. Now there are (100,000×99,999) / 2 = 4,999,950,000 unique possible pairs. If we compare all of them with our super-fast comparison function, it will take six days again. If we want to work with moderately sized data, we have to find a way of making fewer comparisons.
Duplicates are rare
In real world data, nearly all possible pairs of records are not duplicates.
In this four-record example below, only two pairs of records are duplicates–(1, 2) and (3, 4), while there are four unique pairs of records that are not duplicates–(1,3), (1,4), (2,3), and (2,4). Typically, as the size of the dataset grows, the fraction of pairs of records that are duplicates gets very small very quickly.
first name
last name
address
phone
record_id
bob
roberts
1600 pennsylvania ave.
555-0123
1
Robert
Roberts
1600 Pensylvannia Avenue
2
steve
Jones
123 Cowabunga Lane
555-0000
3
Stephen
Janes
123 Cawabunga Ln
444-555-0000
4
If we could only compare records that were true duplicates, we wouldn’t run into the explosion of comparisons. Of course, if we already knew where the true duplicates were, we wouldn’t need to compare any individual records. Unfortunately we don’t, but we do quite well if just compare records that are somewhat similar.
Blocking
Duplicate records almost always share something in common. If we define groups of data that share something and only compare the records in that group, or block, then we can dramatically reduce the number of comparisons we will make.If we define these blocks well, then we will make very few comparisons and still have confidence that will compare records that truly are duplicates.
This task is called blocking, and we approach it in two ways: predicate blocks and canopies.
Predicate blocks
A predicate block is a bundle of records that all share a feature – a feature produced by a simple function called a predicate.
Predicate functions take in a record field, and output a set of features for that field. These features could be “the first 3 characters of the field,” “every word in the field,” and so on. Records that share the same feature become part of a block.
Let’s take an example. Let’s use a “first 3 character” predicate on the address field below..
first name
last name
address
phone
record_id
bob
roberts
1600 pennsylvania ave.
555-0123
1
Robert
Roberts
1600 Pensylvannia Avenue
2
steve
Jones
123 Cowabunga Lane
555-0000
3
Stephen
Janes
123 Cawabunga Ln
444-555-0000
4
That leaves us with two blocks - The ‘160’ block, which contains records 1 and 2, and the ‘123’ block, which contains records 3 and 4.
{'160' : (1,2) # tuple of record_ids
'123' : (3,4)
}
Again, we’re applying the “first three characters” predicate function to the address field in our data, the function outputs the following features – 160, 160, 123, 123 – and then we group together the records that have identical features into “blocks”.
Others simple predicates Dedupe uses include:
whole field
token field
common integer
same three char start
same five char start
same seven char start
near integers
common four gram
common six gram
Index Blocks
Dedupe also uses another way of producing blocks from searching and index. First, we create a special data structure, like an inverted index, that lets us quickly find records similar to target records. We populate the index with all the unique values that appear in field.
When blocking, for each record we search the index for values similar to the record’s field. We block together records that share at least one common search result.
Index predicates require building an index from all the unique values in a field. This can take substantial time and memory. Index predicates are also usually slower than predicate blocking.
Combining blocking rules
If it’s good to put define blocks of records that share the same ‘city’ field, it might be even better to block records that share both the ‘city’ field and the ‘zip code’ field. Dedupe tries these cross-field blocks. These combinations blocks are called disjunctive blocks.
Learning good blocking rules for given data
Dedupe comes with a long set of predicates, and when these are combined Dedupe can have hundreds of possible blocking rules to choose from. We will want to find a small set of these rules that covers every labeled duplicated pair but minimizes the total number pairs dedupe will have to compare.
Once we have calculated the probability that pairs of record are duplicates or not, we still have a kind of thorny problem because it’s not just pairs of records that can be duplicates. Three, four, thousands of records could all refer to the same entity (person, organization, ice cream flavor, etc.,) but we only have pairwise measures.
Let’s say we have measured the following pairwise probabilities between records A, B, and C.
A -- 0.6 -- B -- 0.6 -- C
The probability that A and B are duplicates is 60%, the probability that B and C are duplicates is 60%, but what is the probability that A and C are duplicates?
Let’s say that everything is going perfectly and we can say there’s a 36% probability that A and C are duplicates. We’d probably want to say that A and C should not be considered duplicates.
Okay, then should we say that A and B are a duplicate pair and C is a distinct record or that A is the distinct record and that B and C are duplicates?
Well… this is a thorny problem, and we tried solving it a few different ways. In the end, we found that hierarchical clustering with centroid linkage gave us the best results. What this algorithm does is say that all points within some distance of centroid are part of the same group. In this example, B would be the centroid - and A, B, C and would all be put in the same group.
Unfortunately, a more principled answer does not exist because the estimated pairwise probabilities are not transitive.
Clustering the groups depends on us setting a threshold for group membership – the distance of the points to the centroid. Depending on how we choose that threshold, we’ll get very different groups, and we will want to choose this threshold wisely.
Dedupe can predict the probability that a pair of records are duplicates. So, how should we decide that a pair of records really are duplicates?
To answer this question we need to know something about Precision and Recall. Why don’t you check out the Wikipedia page and come back here.
There’s always a trade-off between precision and recall. That’s okay. As long as we know how much we care about precision vs. recall, we can define an F-score that will let us find a threshold for deciding when records are duplicates that is optimal for our priorities.
Typically, the way that we find that threshold is by looking at the true precision and recall of some data where we know their true labels - where we know the real duplicates. However, we will only get a good threshold if the labeled examples are representative of the data we are trying to classify.
So here’s the problem - the labeled examples that we make with Dedupe are not at all representative, and that’s by design. In the active learning step, we are not trying to find the most representative data examples. We’re trying to find the ones that will teach us the most.
The approach we take here is to take a random sample of blocked data, and then calculate the pairwise probability that records will be duplicates within each block. From these probabilities we can calculate the expected number of duplicates and distinct pairs, so we can calculate the expected precision and recall.
Special Cases
The process we have been describing is for the most general case–when you have a dataset where an arbitrary number of records can all refer to the same entity.
There are certain special cases where we can make more assumptions about how records can be linked, which if true, make the problem much simpler.
One important case we call Record Linkage. Say you have two datasets and you want to find the records in each dataset that refer to the same thing. If you can assume that each dataset, individually, is unique, then this puts a big constraint on how records can match. If this uniqueness assumption holds, then (A) two records can only refer to the same entity if they are from different datasets and (B) no other record can match either of those two records.