Processing Orchestration
Table of Contents
1. Processing Orchestration
2. Apache Airflow
  2.1. Apache Airflow: Web Server
    2.1.1. Apache Airflow: Scheduler
    2.1.2. Apache Airflow: Worker
3. Airflow Process
4. Airflow technologies
5. Airflow in action
1. Processing Orchestration

We generally don't want to run the processes discussed in previous sections only once → for instance, we may have to repeat them every 24 hours.

The problem is that we don't want to run all the different jobs manually every time.

It would also be nice to handle data dependencies automatically:
* Whatever should happen serially happens in order, and whatever can be done at the same time gets done in parallel.

Finally, we want to be able to scale up → i.e. manage potentially thousands of scheduled jobs.

2. Apache Airflow

We're going to use Apache Airflow to solve all the problems mentioned above.

Airflow uses the concept of a DAG (directed acyclic graph) to manage jobs.

2.1. Apache Airflow: Web Server

To show all of these DAGs and their previous runs, Airflow offers a web server.

It is a Flask app that lets users trigger DAGs. It also lets them browse DAG history (stored in the DB).

2.1.1. Apache Airflow: Scheduler

The scheduler:
* Monitors the DB to check task states
* Fetches DAGs from the DAG store
* Sends DAGs to the execution queue
* Writes DAG runs into the DB for history

2.1.2. Apache Airflow: Worker

The worker:
* Pulls tasks from the task queue
* Runs the tasks
* Stores task state in the DB (so the scheduler is aware of its progress)

3. Airflow Process

4. Airflow technologies

The workers are implemented as Celery workers. If one of the workers goes offline (for example, the host itself goes down or loses its network connection), Celery will quickly adapt and assign its tasks to other workers that take its place.

The queue is typically implemented with RabbitMQ. It supports clustering, so we can scale up the number of queues when the scheduler has a very heavy load to schedule, and it gives the queue high availability.

The scheduler is typically the single point of failure for Airflow.
In older versions, we would need an active/passive hot standby setup for it. More recent versions (Airflow 2.0 and later) support running schedulers concurrently → you can have many schedulers running at the same time, and Airflow ensures that tasks don't get scheduled twice by mistake.

For the database, we need an active/passive hot standby as well.

The DAG store is typically a distributed file store (e.g. S3).

The web server typically consists of multiple instances behind load balancers, so that if a single instance goes down, we can still schedule and view our DAGs because there is more than one host.

5. Airflow in action

This is the Airflow webserver UI:
In this example, the UI shows that a DAG (mlexpert_dag) is already running. This means that there is a bucket (an S3 bucket, if on AWS) with mlexpert_dag's file in it.

Both AWS and Google Cloud offer a fully managed Airflow service.

If we click on mlexpert_dag, we can see the DAG:
If you go to the Code tab, it pulls up the exact code that backs this DAG:
Note: The DAG class is imported from the airflow library. Lines 17-23 show how to define a DAG. The next lines (25-44) show how we can add tasks to the DAG through SSHOperator.
Note: For task t1, the worker of our Airflow instance would execute the command argument "spark-submit s3://bucket_name/aggregate_search_interaction.py <DATE>". The worker would send it to the name node of the Hadoop cluster, and the name node would be the one to actually run the file (the *.py file) itself.

Line 46 shows how we can arrange the tasks. In this case, it says that tasks 1 and 2 can run in parallel, and once they are done, task 3 can run.

We can see the graph representation of it in the Graph View tab.
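The ordering rule described for line 46 (tasks 1 and 2 in parallel, then task 3) can be illustrated without Airflow itself. The following is a minimal sketch using only Python's standard-library graphlib module (Python 3.9+). It is not Airflow's scheduler and not the code behind mlexpert_dag; the task names t1, t2, t3 and the helper execution_waves are hypothetical, chosen only to mirror the example. In an Airflow DAG file, this kind of dependency is commonly written as [t1, t2] >> t3.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map mirroring the example:
# each key is a task, each value is the set of tasks it must wait for.
dependencies = {
    "t1": set(),
    "t2": set(),
    "t3": {"t1", "t2"},
}


def execution_waves(deps):
    """Group tasks into waves.

    Tasks in the same wave have no unmet dependencies, so they could
    run in parallel; waves themselves must run one after another.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that can start now
        waves.append(ready)
        sorter.done(*ready)  # mark them finished to unblock dependents
    return waves


print(execution_waves(dependencies))
```

Running this prints [['t1', 't2'], ['t3']]: t1 and t2 form the first wave because neither waits on anything, and t3 becomes ready only after both are marked done. A real orchestrator would additionally track task state in a database and hand ready tasks to workers through a queue, as described in the sections above.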