4_processing_orchestration

Processing Orchestration

Table of Content

1. Processing Orchestration 2. Apache Airflow 2.1. Apache Airflow: Web Server 2.1.1. Apache Airflow: Scheduler 2.1.2. Apache Airflow: Worker 3. Airflow Process 4. Airflow technologies 5. Airflow in action

1. Processing Orchestration• The processes we discussed in previous sections, generally, we don't want to do them only one time → For instance, we may have to repeat them every 24 hrs.• The problem is that we don't want to run all different jobs manually every time.– It'd also be nice to handle data dependencies automatically as well. * Whatever that should happen serially, happens in order and whatever that be done at the same time, get done in parallel.– Finally, we want to be able to scale up → i.e. managing potentially thousands of scheduled jobs. 2. Apache Airflow• We're going to use Apache Airflow to solve all the problems mentioned above.• Airflow uses the concept of DAG to manage jobs. 2.1. Apache Airflow: Web Server• In order to show all of these DAGs and the previous runs of DAGs, Airflow offers a web server.• It's actually a Flask app that let users to trigger DAGs.• It also allows to browse DAG history (stored in DB). 2.1.1. Apache Airflow: Scheduler• The scheduler:– Monitors DBs to check task states– Fetch DAGs from DAG store– Send DAG to execution queue– Writes DAG runs into DB for history2.1.2. Apache Airflow: Worker• The Worker:– Pulls the task queue– Runs the tasks– Stores task state to the DB (so the Scheduler is aware of its progress). 3. Airflow Process

webserver

scheduler

queue

DAG store

Apache Airflow

1. user triggers DAG

2. fetch the DAG

typically a S3, or some distributed file store

3. schedule the DAG

worker

4. enqueue tasks (in parallel/serial)

5. get tasks

6. do the tasks

PostgreSQL

8. check (periodically) if task compeleted

This allows the scheduler to know when a task is completed so that it can schedule the subsequenct tasks that had dependencies.

7. task completed

If task completed then it schedules more tasks

Once all the tasks are completed, scheduler adds DAGs' history to the DB

9. Add DAG history

10. pull DAG status

11. DAG is completed

4. Airflow technologies• The workers are implemented through a Celery worker. Celery ensures that if one of the workers goes offline (such as the host itself going down or host losing network connection), Celery will quickly adapt and assign tasks to other new workers to come and take its place.• The queue is typically fronted by Rabbit MQ. It supports a cluster of queues so that we can scale up the number of queues if the scheduler has a very high load to be scheduled and it allows the queue to have a high availability. • The scheduler is typically the single point of failure for Airflow. – In older versions, we would need an active/passive hot standby situation.– In more recent versions, they support concurrency amongst the schedulers → This means that you can have many schedulers running at the same time. It'll ensure that tasks don't get scheduled twice by mistake. • For database, we need to have an active/passive hot standby as well.• DAG store is typically in a distributed file store (e.g. S3). • The webserver is typically going to consist of multiple instances behind some load balancers, such that if a single instance goes down, we can still schedule and view our DAGs because there's more than one host. 5. Airflow in action• This is the Airflow webserver UI:

• In this example, it shows that an instance (mlexpert_dag) is already running.– This means that there is an S3 (if on AWS) bucket with mlexpert_dag's file in it.• Both AWS and Google offer a fully managed Airflow service.• If we click on mlexpert_dag, we can see the DAG:

• If you go the code tab, it'll pull up the exact code that's backing this DAG:

• Note: The DAG is imported from the airflow library. The lines 17-23 shows how to define a DAG.• The next lines (25-44), shows how we can add tasks through SSHOperator to the DAG.

• Note: The worker t1 would execute the commend argument "spark-submit s3://bucket_name/aggregate_search_interaction.py <DATE>" in the worker of the Airflow instance that we have. – The worker would send to the name node of the Hadoop cluster. The name node of the Hadoop cluster would be the one to actually run this file (the *.py file) itself.• Line 46 shows how we can arrange the tasks. – In this case, it says that task 1, 2 can be done in parallel, and once they're done task 3 can be run.– We can see the graph representation of it in the Graph View tab. Back to Top