(This article assumes readers have a working knowledge of Apache Airflow and Docker.)
When it comes to building data pipelines, Apache Airflow works like a charm. With its robust set of operators, data engineers can integrate a wide variety of external data sources into their internal data systems. Its UI is a great plus for monitoring DAGs at an enterprise level, and organizations, especially startups with strong engineering talent, are embracing it in a big way.
As with any ETL or data pipeline, however, organizations struggle to provision scalable underlying infrastructure. Handling peak loads, which lead to long queues and failed jobs, is a problem that does not go away just by adopting Apache Airflow.
The Airflow community has provided the option of running Airflow data pipelines in Azure Container Instances to address this very issue. This option is a great remedy, especially for pipelines that involve ML algorithms and make exhaustive use of memory.
Just like the other operators that Airflow provides, AzureContainerInstancesOperator is used to launch Azure Container Instances from Airflow. The complete documentation for this operator can be found at http://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/_api/airflow/providers/microsoft/azure/operators/azure_container_instances/index.html. The parameters for this operator are self-explanatory, and if any of you need help with them, please comment.
There is a flip side to this solution, though. Unlike usual DAGs, where we can use operators such as SFTPToS3Operator and S3ToSnowflakeOperator, in this DAG we cannot use any Airflow operators except AzureContainerInstancesOperator. Unlike the typical Airflow VM where DAGs run, the Azure containers will not have Airflow installed. So we end up using plain Python functions to perform all copy and load operations.
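Those plain Python functions might look like the sketch below. It is a toy extract/transform/load flow under stated assumptions: the input format, column names, and filter rule are invented for illustration, and `sqlite3` stands in for the cloud warehouse (in a real pipeline you would swap in your warehouse's connector, e.g. snowflake-connector-python):

```python
# Plain-Python ETL sketch for the containerized script (no Airflow operators).
# sqlite3 is a stand-in for the warehouse; column names are hypothetical.
import csv
import io
import sqlite3


def extract(raw_csv: str) -> list:
    """Parse the raw CSV payload pulled from the external source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows: list) -> list:
    """Keep only completed orders and cast the amount to float."""
    return [
        (r["order_id"], float(r["amount"]))
        for r in rows
        if r["status"] == "completed"
    ]


def load(rows: list, conn: sqlite3.Connection) -> int:
    """Load the transformed rows and return the row count in the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    raw = "order_id,status,amount\n1,completed,9.99\n2,cancelled,5.00\n3,completed,12.50\n"
    conn = sqlite3.connect(":memory:")
    print(load(transform(extract(raw)), conn))  # prints 2
```

The point is simply that everything the DAG's usual transfer operators would do must be hand-rolled inside the script the container runs.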
I would recommend the following steps, in sequence, for dockerizing Airflow data pipelines:
- Develop the Python code to ingest data from external sources into a data lake or blob storage, then transform and load it into your cloud data warehouse.
- Run this script from your local or remote machine with all libraries installed.
- Once the Python script works, create a Docker image with all dependencies and libraries installed.
- Place the Python script in a volume mounted to the container and execute it, to verify that the Docker image contains all dependencies.
- Once the Python script executes successfully, it is time to deploy the Docker image to Azure. This article describes the steps in detail: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-tutorial-quick-task.
- Copy the image name, create an Azure file share, and mount it as a volume to this container. Place the Python script in the Azure file share.
- Now develop the Airflow DAG, in which AzureContainerInstancesOperator is called to execute the Python script saved in the Azure file share.
- Once the DAG is triggered, you can monitor the creation of the container and its resource usage in the Azure portal.
This way we can handle peak loads of Airflow jobs without them getting queued and failing. A scalable data pipeline like the one illustrated above is challenging to implement but brings great benefits to the data and infra teams. Work like this requires the best of both ‘Data’ and ‘Engineering’!