1. Basic Models• The main question we're exploring in this section is how to train some basic models in big data platforms (i.e. Spark cluster + HDFS)?• In the MLExpert notes, we reviewed some ML models but most of them are designed for a single processor and maybe not so large of a dataset.• The idea of this section is go over some algorithms we can use for large datasets and how some of those algorithms might have to be altered or parallelized in order to make them work most effectively in a large data setting.2. Gradient Boosted Trees• Let's say that we have a dataset of 100M examples and 200 features, and roughly there's about 200 split points per feature.• Note: Just as a reminder, in boosted trees, for each feature in each feature vector, we're going to evaluate some particular split point and store the best split point across all possible split points → That split point will be assigned to be officially the node's split point → Then, we'd recurse to the left and right nodes down from that node and repeat the same process.• The problem with the above approach is that we're not taking advantage of anything that we can do in parallel.– One pretty obvious thing we can do is to analyze every single feature in every single split point in parallel.– A less intuitive thing that we can do is to analyze every single node on a particular level in parallel as well.• One of major portions of the gradient boosted tree algorithm is to sort the features in order to find their split points.– By simply using the originally sorted split point feature list in a clever way, we can prevent ourselves from needing to resort every time we want to evaluate a particular split point.• For gradient boosted trees, we can't parallelize the ensemble technique (i.e. boosting), because the subsequent tree depends on the previous tree.– With Random Forest, we can actually parallelize the construction of each tree in the ensemble because there's no dependency.• In general, all of these parallelized operations can speed up the training upwards of 90%.3. Matrix Factorization• Matrix factorization can use ALS. • It uses ALS to approximate some user-item matrix, such that two embedding matrices (i.e. user and item embedding matrix) when multiplied together will approximate the original user-item matrix.• ALS works by alternating the optimization of the user embedding and the product embedding.• As it alternates, it performs OLS.• Let's say we have 100K items and 100M users → Their embeddings can get quite large.– We can split these embeddings up when we're solving for the alternate embedding.– For example, if we're performing OLS on the items' matrix, P, while keeping the users' matrix, U, fixed:* We can multiply different sections of U across P such that we can still follow alternating squares exactly while splitting it up across multiple machines.a
u11u21u22u12......a
u1Nu2N......a
p11p21p1kp2k..................a
u11u21u22u12u1Nu2N............uPuPMachine 1Machine 2• As these machines calculate partial results, they will communicate with each other to make sure that all of the machines have all of the total results.4. Logistic Regression• This is the equation for logistic regression → y =1
1+e-(𝛽1x+𝛽0)• If we have multiple machines available then:– We would be able to put the parameters of logistic regression (i.e. 𝛽1 and 𝛽0) across the cluster– Then send different data within a mini-batch to each machine– Each machine will calculate the partial mini-batch gradient and ship it back to some single machine– The single machine will aggregate (i.e. adding them up) the partial mini-batch gradients.* Note: We're going to only use gradient descent here. Most production-ready applications will use limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm → It's effectively a variation of gradient descent except they don't just use the gradient, they also incorporate the gradient of the gradient.5. Spark MLlib• Generally, ML algorithms (i.e. most basic ML algorithms except for deep learning) have variations which can utilize parallelization.– This means a faster convergence to optimum values.• Scaling up ML models becomes more important in some applications where there's huge amount of data and the model needs to trained constantly. A good example of that is stock price prediction models.• Spark MLlib has implemented the parallelized (and optimized) versions of many basic ML models.• Some examples include:– Linear/logistic regression– SVM– Gradient boosted trees / Random Forests– Naive Bayes– Collaborative Filtering– Bisecting k-means– SVD/PCA• Spark also offer ways to analyze the performance of each of these models. Back to Top