Best Practices for designing & developing Data Pipelines using Apache Airflow
Apache Airflow is a thing of beauty when it comes to designing and developing Data Pipelines especially for Big Data loads. Data Engineering has now transcended to Data Ops and this space has more customized tool offerings for just about any data need. In spite of this glut of tools in the market, Apache Airflow still flourishes to be the best choice of tool for developers for orchestrating data pipelines.
Airflow’s code as you go approach and its exhaustive operators and highly intuitive UI makes it top the list. The ability to create custom operators, extend the existing functionalities and its managed offerings based on Kubernetes in Astronomer, Google Cloud Composer and Amazon Managed Workflow makes it even more lucrative.
While this flexibility and ability to maneuver offers it unmatched advantage, excessive and overengineering can often make the implementations complex and difficult to maintain. Per my experience of having worked on design , development, review and maintenance of pipelines developed in Airflow, i have listed below, few of the best practices for designing data pipelines in Airflow.
- Identify the choice of data suites be it SFTPs, data lake(S3,HDFS,Azure data lake), data warehouse (Snowflake, Redshift), processing(Snowflake, Redshift, Spark), database(SQL Server, Postgres), reporting(Power BI, Tableau), archival(S3) at the organizational or portfolio level and ensure developers stick to it. These needs to be unit tested before Prod deployment.
- Well defined Database, Schema and Table structures with all data types indicating the ones where data will be landed or staged, ones where the processed data will be stored, one where the temp data will be stored during processing and ones where the reporting and archived data will be stored is essential. Lack of governance here , can lead to multiple copies of tables created in different schemas.
- Well defined Data Environments such as Dev, Test & Prod and the availability of data there with comparable volumes.
- The Dev & Test Environment can have Sequential Executors with bare minimum configuration. The Prod Environment though needs to be Celery, Kubernetes or at least a Local Executor for parallelism.
- Dev & Test Environments can have SQL Lite database with Sequential Executors, however Prod needs to have MYSQL database with Celery or Kubernetes executor.
- SLA for each task needs to be tracked while in dev, SLA should set at the task level for monitoring and alerting when the threshold exceeds.
- Execution timeout for each task needs to be tracked while in dev and should be set at task level. The task fails, if it doesn't complete successfully within the execution time. This ensures the task unnecessarily doesn't queue up tasks of other dags.
- If possible, mandate the use of specific file formats such as parquet or orc with appropriate compression techniques such as snappy or gzip where possible.
- Using custom built operators, Python functions with Python operators should be a last resort and creative use of built-in operators should be emphasized.
- Use parametrized variables and pass them as Json objects. This can then be deserialized to get the individual variable values.
- Size and time the workloads in terms of SLA and ensure multiple workloads running in the same time period dont drag the system down and queue the other tasks causing inordinate delays.
- Logging at the Airflow VM level, could eventually lead to Log file exceeding the space and dags failing due to log file space issue. It is therefore recommended to log the entries remotely on a data lake such as S3 or HDFS.
- For multiple mission critical workloads, Real time error tracking with Sentry can be used.
- Auto monitoring and tracking the Airflow jobs through notifications on mails or slack is a must.
- Retry attempts and intervals should be specified according to the criticality of the workload.
- Developers , please bear in mind , you will not maintain & support the pipelines once deployed to Prod. The Support engineer will probably look at the pipeline code for the first time only when the dag fails in Production! So, build a Pipeline with no more than 5 tasks and where possible split them in multiple pipelines.
- Make the Pipelines Idem Potent where possible, this way, Support engineers do not have to rack their brains when job fails and they look at re-running the job after fix from the last failed step.
The list is dynamic and i intend to keep upgrading them as and when i stumble upon newer and better options!