Dockerize Airflow Data Pipelines with Azure Containers

(This article assumes readers have a working knowledge of Apache Airflow and Docker.)

When it comes to building data pipelines, Apache Airflow works like a charm. With its robust set of operators, data engineers can integrate a wide variety of external data sources with their internal data systems. Its UI is a great plus for monitoring DAGs at an enterprise level, and organizations, especially startups with strong engineering talent, are embracing it in a big way.

As with any ETL or data pipeline, however, organizations struggle when it comes to provisioning scalable underlying infrastructure. Handling peak loads, which otherwise lead to long queues and failed jobs, is a problem that does not go away just by adopting Apache Airflow.

The Airflow community has provided the option of running Airflow data pipelines in Azure Container Instances to address this very issue. This option is a great remedy, especially for pipelines that involve ML algorithms and require heavy use of memory.

Dockerize Airflow

Just like the other operators Airflow provides, AzureContainerInstancesOperator is used to launch Azure Container Instances from Airflow. The complete documentation for this operator can be found at http://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_api/airflow/providers/microsoft/azure/operators/azure_container_instances/index.html. The parameters for this operator are self-explanatory, and if you need help with any of them, please comment.
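Below is a minimal sketch of a DAG that launches a container through this operator. The connection IDs, resource group, registry, image name, and region are placeholders rather than values from a real setup, and the exact import path varies by provider version (older releases expose the operator under azure_container_instances, as in the documentation link above).

```python
# Minimal DAG sketch; connection IDs, resource group, registry, and image
# names are placeholders, not values from this article.
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.container_instances import (
    AzureContainerInstancesOperator,
)

with DAG(
    dag_id="aci_pipeline_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_pipeline = AzureContainerInstancesOperator(
        task_id="run_pipeline_container",
        ci_conn_id="azure_default",           # Airflow connection holding Azure credentials
        registry_conn_id="acr_default",       # connection for the private container registry
        resource_group="my-resource-group",
        name="aci-pipeline-{{ ds_nodash }}",  # container group name, unique per run
        image="myregistry.azurecr.io/my-pipeline:latest",
        region="East US",
        cpu=2.0,
        memory_in_gb=8.0,
        remove_on_error=False,                # keep a failed container group for debugging
    )
```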

Sample Solution involving Airflow and Azure Container Instances

There is a flip side to this solution, though: unlike the usual DAGs, where we can use operators such as SFTPToS3Operator and S3ToSnowflakeOperator, we will not be able to use those Airflow operators in this DAG; the only one we use is AzureContainerInstancesOperator. Unlike the typical Airflow VM where the DAGs run, the Azure containers will not have Airflow installed. So we end up using plain Python functions to perform all the copy and load operations.
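For illustration, the entry point baked into the container image could be a plain Python script along these lines. The use of boto3 and the Snowflake connector, as well as every bucket, stage, table, and environment-variable name, is an assumption made for this sketch rather than anything prescribed above.

```python
# Hypothetical entry point bundled into the container image. It replaces the
# SFTPToS3/S3ToSnowflake operators with plain Python, since Airflow itself is
# not installed inside the container. All names and credentials are placeholders.
import os

import boto3
import snowflake.connector


def copy_file_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload an extracted file to S3."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)


def load_into_snowflake(stage_path: str, table: str) -> None:
    """Run a COPY INTO from an external stage into the target table."""
    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
        database=os.environ["SNOWFLAKE_DATABASE"],
        schema=os.environ["SNOWFLAKE_SCHEMA"],
    )
    try:
        conn.cursor().execute(f"COPY INTO {table} FROM {stage_path}")
    finally:
        conn.close()


if __name__ == "__main__":
    copy_file_to_s3("/tmp/extract.csv", "my-landing-bucket", "extracts/extract.csv")
    load_into_snowflake("@my_external_stage/extracts/", "RAW.EXTRACTS")
```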

I would recommend the following steps, in sequence, for dockerizing Airflow data pipelines:

1. Move the copy and load logic out of Airflow operators into plain Python scripts with a single entry point.
2. Package those scripts into a Docker image and push it to a container registry such as Azure Container Registry.
3. Create the Azure and registry connections in Airflow, and reference the image from an AzureContainerInstancesOperator task in the DAG.
4. Size CPU and memory per task and pass run-specific parameters to the container, as sketched below.
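A rough sketch of the last two steps follows, again with placeholder names. Whether fields such as name, command, and environment_variables accept Jinja templating, and whether command expects a list or a string, depends on the provider version, so treat those details as assumptions.

```python
# Illustrative sizing and parameterization of the container task for a
# memory-hungry ML step; all values are placeholders.
heavy_ml_step = AzureContainerInstancesOperator(
    task_id="train_model_container",
    ci_conn_id="azure_default",
    registry_conn_id="acr_default",
    resource_group="my-resource-group",
    name="aci-train-{{ ds_nodash }}",
    image="myregistry.azurecr.io/my-pipeline:latest",
    region="East US",
    command=["python", "train.py"],                  # entry point inside the image
    environment_variables={"RUN_DATE": "{{ ds }}"},  # pass the logical date to the script
    cpu=4.0,
    memory_in_gb=14.0,                               # size per task instead of per worker VM
)
```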

This way we can handle peak loads of Airflow jobs without them getting queued and failing. A scalable data pipeline like the one illustrated above is challenging to implement but brings great benefits to the data and infrastructure teams. Work like this requires the best of both ‘Data’ and ‘Engineering’!

Data enthusiast with a passion for building enterprise data solutions