To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numerical value.
Creating manual similarity measures is easier when the number of features is low.
As the number and complexity of features increase, it becomes harder to measure similarity manually. In such cases, it’s better to use a supervised similarity measure.
Example: Suppose there are two features:
We could use root mean square error (RMSE) to calculate a similarity measure.
Note: Always scale (or normalize) the data before measuring similarity, so that no single feature dominates the metric. If you don’t have enough data to understand its distribution, simple scaling is enough.
Note: In general, you can prepare numerical data as described in Prepare data, and then combine the data by using Euclidean distance.
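As a minimal sketch of the notes above, the following Python snippet scales two numeric features to [0, 1] and combines them with RMSE. The feature ranges in `RANGES`, the helper names, and the conversion `1 - RMSE` from a distance to a similarity are assumptions made for illustration, not something the text prescribes.

```python
import math

# Hypothetical (min, max) range of each numeric feature, used for scaling.
RANGES = [(5.0, 15.0), (20.0, 220.0)]

def min_max_scale(value, lo, hi):
    """Scale a value into [0, 1] given the feature's observed range."""
    return (value - lo) / (hi - lo)

def scaled(example):
    """Scale every feature of an example with its own range."""
    return [min_max_scale(v, lo, hi) for v, (lo, hi) in zip(example, RANGES)]

def rmse(a, b):
    """Root mean square error between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def manual_similarity(ex1, ex2):
    """Assumed convention: similarity = 1 - RMSE of the scaled features."""
    return 1.0 - rmse(scaled(ex1), scaled(ex2))
```

Because both features are scaled to the same range before they are combined, a large raw difference in one feature (here the second one) cannot dominate the result.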
Categorical data can be either binary or multi-valued:
For binary data, similarity is 1 if the values match and 0 otherwise.
For multi-valued data, if you know all the category values, you can calculate similarity as the ratio of shared values to the total number of distinct values across both examples. This ratio is called Jaccard similarity.
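The Jaccard similarity described above can be sketched in a few lines of Python. The function name and the choice to treat two empty sets as identical are assumptions for this illustration.

```python
def jaccard_similarity(values_a, values_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        # Assumption: two examples with no values at all count as identical.
        return 1.0
    return len(a & b) / len(a | b)
```

For instance, two examples tagged `{"comedy", "action"}` and `{"comedy", "drama"}` share one value out of three distinct values, so their similarity is 1/3.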
Example: Postal code
Postal codes representing areas that are close to each other should have a higher similarity. To encode the information required to calculate this similarity accurately, you can convert the postal codes into latitude and longitude. For a pair of postal codes, calculate the difference between their latitudes and, separately, the difference between their longitudes. Then add the two differences to get a single numeric value.
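A rough sketch of this idea in Python follows. The lookup table `POSTAL_CENTROIDS`, its codes, and its coordinates are entirely made up for illustration, and the use of absolute differences is an assumption. Note that the result is a difference, so a smaller value means the two areas are more similar.

```python
# Hypothetical lookup from postal code to (latitude, longitude) in degrees.
# The codes and coordinates are invented for this example.
POSTAL_CENTROIDS = {
    "P001": (10.0, 20.0),
    "P002": (10.5, 20.25),
    "P003": (40.0, -70.0),
}

def postal_difference(code_a, code_b):
    """Sum of the absolute latitude and longitude differences."""
    lat_a, lon_a = POSTAL_CENTROIDS[code_a]
    lat_b, lon_b = POSTAL_CENTROIDS[code_b]
    return abs(lat_a - lat_b) + abs(lon_a - lon_b)
```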
Example: Color
Assume you have color data as text. Convert the textual values into numeric RGB values. Now you can find the difference in red, green, and blue values for two colors, and combine the differences into a single numeric value by using the Euclidean distance.
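The color example might look like this in Python. The small `RGB` lookup table and the function name are assumptions for illustration; a real dataset would need a fuller mapping from color names to RGB values.

```python
import math

# Illustrative mapping from color names to (red, green, blue) values.
RGB = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
}

def color_distance(name_a, name_b):
    """Euclidean distance between two colors in RGB space."""
    return math.dist(RGB[name_a], RGB[name_b])
```

As with postal codes, this yields a distance rather than a similarity: identical colors have distance 0, and black and white are the farthest apart in this table.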
Notes:
***If you create a similarity measure that doesn’t truly reflect the similarity between examples, your derived clusters will not be meaningful. This is often the case with categorical data and brings us to a supervised measure.***
| Requirement | Manual | Supervised |
|---|---|---|
| Eliminate redundant information in correlated features. | No, you need to separately investigate correlations between features. | Yes, the DNN eliminates redundant information. |
| Provide insight into calculated similarities. | Yes | No, embeddings cannot be deciphered. |
| Suitable for small datasets with few features. | Yes, designing a manual measure with a few features is easy. | No, small datasets do not provide enough training data for a DNN. |
| Suitable for large datasets with many features. | No, manually eliminating redundant information from multiple features and then combining them is very difficult. | Yes, the DNN automatically eliminates redundant information and combines features. |
The following figure shows how to create a supervised similarity measure:
Depending on your choice of labels, the resulting DNN is either an autoencoder DNN or a predictor DNN.
To train the DNN, you need to create a loss function by following these steps:
Note: When summing the losses, ensure that each feature contributes proportionately to the total loss. For example, if you convert color data to RGB values, then you have three outputs. Summing the loss for all three outputs would weight color three times as heavily as the other features. Instead, multiply the loss for each of the three outputs by 1/3.
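The weighting in this note can be sketched as follows. The `groups` argument (a mapping from feature name to the output indices that encode it) is a hypothetical structure, and squared error stands in for whatever per-output loss is actually used.

```python
def squared_error(predicted, actual):
    """Per-output loss; squared error is used here only as an example."""
    return (predicted - actual) ** 2

def total_loss(predicted, actual, groups):
    """Sum per-output losses so that every feature contributes equally.

    groups maps a feature name to the list of output indices that
    encode it, e.g. {"price": [0], "color": [1, 2, 3]}.
    """
    loss = 0.0
    for indices in groups.values():
        weight = 1.0 / len(indices)  # e.g. 1/3 for the three RGB outputs
        loss += sum(weight * squared_error(predicted[i], actual[i])
                    for i in indices)
    return loss
```

With an error of 1 on every output, `price` and `color` each contribute 1 to the total, even though color occupies three outputs.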
An online machine learning system has a continuous stream of new input data. You’ll need to train your DNN on the new data. However, if you retrain your DNN from scratch, then your embeddings will be different because DNNs are initialized with random weights. Instead, always warm-start the DNN with the existing weights and then update the DNN with new data.
This example shows how to generate the embeddings used in a supervised similarity measure. Imagine you have this housing data:
Before you use feature data as input, you need to preprocess the data. The preprocessing steps are based on the steps you took when creating a manual similarity measure. Here’s a summary:
To generate embeddings, you can choose either an autoencoder or a predictor. Remember, your default choice is an autoencoder. You choose a predictor instead if specific features in your dataset determine similarity. For completeness, let’s look at both cases.
As training labels for your DNN, choose the features that are most important in determining similarity between your examples. Let’s assume price is most important in determining similarity between houses.
Choose price as the training label, and remove it from the input feature data to the DNN. Train the DNN by using all other features as input data. For training, the loss function is simply the MSE between predicted and actual price. To learn how to train a DNN, see Training Neural Networks.
Train an autoencoder on our dataset by following these steps:
After training your DNN, whether predictor or autoencoder, extract the embedding for an example from the DNN. To do so, feed the feature data of the example to the DNN as input and read the outputs of the final hidden layer. These outputs form the embedding vector. Remember, the vectors for similar houses should be closer together than vectors for dissimilar houses.
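To make the extraction step concrete, here is a toy sketch in plain Python. The network has a single hidden layer, which is therefore also the final hidden layer. The weights are hypothetical stand-ins for trained parameters, and the layer sizes and ReLU activation are assumptions. A real DNN would be built and trained with a deep learning framework.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical "trained" parameters: 3 inputs -> 2 hidden units -> 1 output.
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
HIDDEN_B = [0.0, 0.1]
OUT_W = [[1.0, -1.0]]
OUT_B = [0.0]

def forward(features):
    """Return (final hidden layer activations, network output)."""
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    output = dense(hidden, OUT_W, OUT_B)
    return hidden, output

def embedding(features):
    """The embedding is the final hidden layer, not the network output."""
    hidden, _ = forward(features)
    return hidden
```

Calling `embedding([1.0, 0.0, 0.0])` returns a two-element vector, one value per hidden unit. The network output itself (the predicted label or reconstruction) is discarded.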
Next, you’ll see how to quantify the similarity for pairs of examples by using their embedding vectors.
You now have embeddings for any pair of examples. A similarity measure takes these embeddings and returns a number measuring their similarity. Remember that embeddings are simply vectors of numbers. To find the similarity between two vectors $a$ and $b$, you have three similarity measures to choose from: Euclidean distance, cosine, and dot product.
In contrast to the cosine, the dot product is proportional to the vector length. This is important because examples that appear very frequently in the training set (for example, popular YouTube videos) tend to have embedding vectors with large lengths. If you want to capture popularity, then choose dot product. However, the risk is that popular examples may skew the similarity metric. To balance this skew, you can raise the length to an exponent $\alpha < 1$ to calculate the dot product as $|a|^{\alpha}|b|^{\alpha}\cos(\theta)$, where $a$ and $b$ are the two embedding vectors and $\theta$ is the angle between them.
To better understand how vector length changes the similarity measure, normalize the vector lengths to 1 and notice that the three measures become proportional to each other.
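The following sketch implements the measures discussed above in plain Python, so you can check the normalization claim numerically. The function names are assumptions, and `damped_dot` is one possible reading of "raise the length to an exponent": both vector lengths are raised to `alpha` before being multiplied by the cosine.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine of the angle between a and b; ignores vector length."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Rescale a vector to length 1."""
    length = norm(a)
    return [x / length for x in a]

def damped_dot(a, b, alpha):
    """Dot product with both lengths raised to alpha (alpha=1: plain dot)."""
    return (norm(a) ** alpha) * (norm(b) ** alpha) * cosine(a, b)
```

For unit-length vectors the dot product equals the cosine, and the squared Euclidean distance equals `2 - 2 * cosine`, so all three measures rank pairs of examples in the same order. You can check this by comparing `dot`, `cosine`, and `euclidean(...) ** 2` on `normalize([3.0, 4.0])` and `normalize([4.0, 3.0])`.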
To summarize, a similarity measure quantifies the similarity between a pair of examples, relative to other pairs of examples. The table below compares the two types of similarity measures: