Data Hosting
Table of Contents
1. Data Hosting
2. Distributed Cache
1. Data Hosting

Let's say we have productionized our model and set up a recommendation service that is now serving real traffic.

A user requests a page from our app → the app calls the recommendation service with the user ID to get recommendations for her.

The recommendation service needs information about this user, but how does it actually get it?

To run the model, it needs the user's features. So far, we have only talked about two cases:

* No features, i.e. cold start
* Features are stored in HDFS

HDFS isn't ideal for hosting data in a low-latency situation: it is built for high-throughput batch reads of large files rather than fast lookups of individual records. It's probably a better idea to offer a low-latency data serving layer.

2. Distributed Cache

One way to solve the low-latency data serving problem is to use Airflow to keep a copy of the data fresh on the recommendation service itself.

→ Have a hook on HDFS that recognizes when HDFS is updated, so that every time an update happens, Airflow can tell the recommendation service to re-fetch the data from HDFS.
→ The recommendation service then fetches the data asynchronously, in bulk.

The problem is that we are now assuming the data can fit on the host (i.e. the recommendation service).

* That means either all data in memory (RAM), or in memory and on disk.

Tools we can use for something like that include Redis, Memcached, and Tarantool.

Now we have to worry about two directions in which we must scale.

* Let's say our in-memory DB is storing 30GB of data.
* Now, if we get 1000 Q/s and the ML model requires more features, bringing the in-memory data size to 80GB, we have to scale horizontally to handle the higher number of requests and vertically to increase memory size. Because every additional host must also hold the full dataset, this makes the whole thing very expensive.

To solve the in-memory problem, we can use a distributed cache.
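Before moving to the distributed setup, the single-host refresh flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a production implementation: `InMemoryFeatureStore`, `load_snapshot`, and `fake_hdfs_snapshot` are hypothetical names, the loader is a stand-in for a real bulk read from HDFS, and the call to `refresh_async` stands in for whatever notification Airflow would send after an HDFS update.

```python
import threading
from typing import Callable, Dict, List, Optional

Features = List[float]


class InMemoryFeatureStore:
    """Keeps the whole user -> features table in RAM on the serving host."""

    def __init__(self, load_snapshot: Callable[[], Dict[str, Features]]):
        # load_snapshot is a hypothetical callable that bulk-reads the
        # latest feature dump (for example from HDFS) into a dict.
        self._load_snapshot = load_snapshot
        self._table: Dict[str, Features] = {}
        self._lock = threading.Lock()
        self.version = 0

    def refresh(self) -> None:
        # Build the new table first (the slow part), then swap the
        # reference, so readers never see a half-loaded table.
        new_table = self._load_snapshot()
        with self._lock:
            self._table = new_table
            self.version += 1

    def refresh_async(self) -> threading.Thread:
        # Called when the orchestrator signals that new data is available.
        thread = threading.Thread(target=self.refresh, daemon=True)
        thread.start()
        return thread

    def get(self, user_id: str) -> Optional[Features]:
        # None means "no features for this user", i.e. the cold start case.
        return self._table.get(user_id)


def fake_hdfs_snapshot() -> Dict[str, Features]:
    # Stand-in data; a real loader would read files from HDFS.
    return {"user-1": [0.3, 0.7], "user-2": [0.9, 0.1]}


store = InMemoryFeatureStore(fake_hdfs_snapshot)
store.refresh_async().join()   # simulate one refresh notification
features = store.get("user-1")
```

Note that the whole table has to fit on the host, which is exactly the limitation a distributed cache is meant to remove.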
The distributed cache would follow the same process of bulk-loading data from HDFS, say every 24 hrs.

The difference is that every service would now call the same distributed cache. The cache is made up of multiple machines that form a cluster, and each machine in the cluster maintains some portion of the data.

With a distributed cache, we need clients on the recommendation service that know which hosts to call for a particular piece of data.

We also need to manage the membership of each node in the distributed cache cluster (we can use Zookeeper for that).

* Zookeeper would ensure that clients know not to call hosts that are no longer in the cluster, for example because the cluster scaled down or a host became unavailable.

Another important factor the clients need to handle is caching.

* If we have an extremely popular item, it's far more efficient to cache it on the clients themselves, and maybe store 1% of what the distributed cache is storing.

What are some tools for a distributed cache?

* Apache Ignite
* DynamoDB (DAX)
* Redis (AWS ElastiCache)
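The client-side responsibilities described above (routing a key to the right host, reacting to membership changes, and keeping hot items locally) can be sketched as follows. This is only an illustration under assumptions: it uses consistent hashing for routing, which is one common choice and not something the notes prescribe; `HashRing`, `CacheClient`, and the host names are hypothetical; plain dicts stand in for network calls to cache nodes; `remove_host` stands in for a membership update that would arrive from Zookeeper; and a small LRU is used as a rough proxy for "extremely popular" items.

```python
import bisect
import hashlib
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring that maps each key to one cache host."""

    def __init__(self, hosts: List[str], vnodes: int = 64):
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, host) pairs
        for host in hosts:
            self.add_host(host)

    def add_host(self, host: str) -> None:
        # Several virtual points per host spread the load more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{host}#{i}"), host))

    def remove_host(self, host: str) -> None:
        # Would be triggered by a membership change (hypothetically
        # delivered by Zookeeper); only this host's keys get remapped.
        self._ring = [(p, h) for p, h in self._ring if h != host]

    def host_for(self, key: str) -> str:
        if not self._ring:
            raise RuntimeError("no hosts in the cluster")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


class CacheClient:
    """Routes lookups to the owning host and keeps a tiny local cache."""

    def __init__(self, ring: HashRing, cluster: Dict[str, Dict[str, list]],
                 local_capacity: int = 2):
        self._ring = ring
        self._cluster = cluster        # host -> data, stand-in for RPCs
        self._capacity = local_capacity
        self.local_cache: "OrderedDict[str, list]" = OrderedDict()

    def get(self, key: str) -> Optional[list]:
        if key in self.local_cache:
            self.local_cache.move_to_end(key)   # mark as recently used
            return self.local_cache[key]
        host = self._ring.host_for(key)
        value = self._cluster[host].get(key)
        if value is not None:
            self.local_cache[key] = value
            if len(self.local_cache) > self._capacity:
                self.local_cache.popitem(last=False)  # evict least recent
        return value


hosts = ["cache-1", "cache-2", "cache-3"]
ring = HashRing(hosts)
cluster: Dict[str, Dict[str, list]] = {h: {} for h in hosts}
for user_id in ("user-1", "user-2", "user-3"):
    cluster[ring.host_for(user_id)][user_id] = [0.5]   # bulk load stand-in

client = CacheClient(ring, cluster)
value = client.get("user-1")
```

With consistent hashing, removing a host only remaps the keys that host owned; keys owned by the remaining hosts keep their placement, which limits how much data has to move when the cluster changes size.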