Best Practices for designing & developing Data Pipelines using Apache Airflow
Apache Airflow is a thing of beauty when it comes to designing and developing data pipelines, especially for Big Data loads. Data Engineering has now transcended into DataOps, and this space offers customized tools for just about any data need. In spite of this glut of tools in the market, Apache Airflow remains the tool of choice for developers orchestrating data pipelines.
Airflow’s code-as-you-go approach, its exhaustive set of operators and its highly intuitive UI make it top the list…
(This article assumes readers have a working knowledge of Apache Airflow and Docker.)
When it comes to building data pipelines, Apache Airflow works like a charm. With its robust set of operators, data engineers can now integrate a wide variety of external data sources with their internal data systems. Its UI is a great plus for monitoring DAGs at an enterprise level, and organizations, especially startups with good engineering talent, are embracing it in a big way.
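To make the operator-based composition concrete, here is a minimal sketch of the idea at the heart of an Airflow DAG: tasks plus dependencies resolve to an execution order. The task names are hypothetical, and since Airflow itself may not be installed, this models only the dependency resolution Airflow’s scheduler performs, using the standard library:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract from two sources, transform, then load.
# In an Airflow DAG this would be written with operators and `>>`
# dependencies; here we model just the ordering the scheduler derives.
dag = {
    "transform": {"extract_s3", "extract_rds"},  # transform waits on both extracts
    "load_warehouse": {"transform"},             # load waits on transform
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # both extracts (in either order), then transform, then load_warehouse
```

In real Airflow code the same shape would read `[extract_s3, extract_rds] >> transform >> load_warehouse`, with each name bound to an operator instance.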
As with any ETL or data pipeline, however, organizations struggle when it comes to provisioning scalable underlying infrastructure. …
Renaissance in Indian IT Education
Every now and then we come across studies and reports which find that only about 10% of Indian engineering graduates have employable skills. Those studies point not only to a gap in soft skills but also to a lack of coding and design skills, which are generally considered to be the strength of Indian IT engineers.
It is this gap that the initial training programs of major IT companies try to address. While this model was successful to an extent in the past decade, till 2020 or so, the past few years have almost seen an…
Data Build Tool (DBT): A niche SQL-based Data Transformation tool for the Modern Data Warehouse
The data engineering landscape today is like a kaleidoscope, with a multitude of tools vying for their space. The Big 3 cloud players, AWS, Azure and GCP, have been pitching their specific products for each of the data needs in the Database, Data Lake, Data Warehouse, Dashboard & Data Science space. Niche players like Databricks, Snowflake and Apache Airflow have carved out a separate space for themselves and are taking the big guys head on in their respective product stacks.
Over the last few weeks, I…
When I started my career in BI more than a decade back, reporting consisted of two major players: SAP Business Objects and IBM Cognos. ‘Reporting’ was the general phrase; ‘Dashboards’ and ‘Stories’ weren’t much in use. It was usually connecting to Oracle or SQL Server, building reports, and publishing them to repositories or scheduling them to end users’ inboxes.
Centralized governance in the BI space was limited to account usage and was handled by the ‘Reporting admin’ team, which took care of all the installations, upgrades, migrations, etc.
Years later, BI & portfolio managers started facing a common problem…
Recently I implemented an in-house real-time data alert solution, which provides operators with real-time insights into their oil fields. I used AWS Big Data services to develop this solution.
While exploring the solution design and approach, I realized that the solution, with little tweaks, could well be cross-applied to other domains such as real-time log analysis, stock price movements, anomaly detection, health indicator monitoring and many more.
In this article, I will discuss a broad-based solution architecture for all such real-time data alert use cases.
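At the core of any such architecture sits a simple alerting rule applied to each incoming reading. A minimal sketch of that rule follows; the metric names and thresholds are hypothetical, not taken from the original solution, and in the AWS design this logic would run inside a stream consumer (for example, a Lambda function reading from a Kinesis stream):

```python
# Hypothetical thresholds for oil-field telemetry; real values would come
# from operator configuration, not be hard-coded.
THRESHOLDS = {"wellhead_pressure_psi": 5000, "pump_temperature_c": 90}

def check_alerts(reading: dict) -> list:
    """Return an alert message for every metric that breaches its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = reading.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds limit {limit}")
    return alerts

# One simulated reading: pressure breaches, temperature does not.
print(check_alerts({"wellhead_pressure_psi": 5200, "pump_temperature_c": 85}))
```

The same function shape ports directly to the other domains mentioned above: only the threshold table changes for log analysis, stock movements or health indicators.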
The source systems could comprise…
How to clear the Tableau Desktop Certified Associate Exam: Tips & Alerts
I recently cleared my Tableau Desktop Certified Associate Exam on the 9th of May 2020, and I just managed to scrape through with a passing score of 76%. Had I cleared the exam with a good score, I probably would not have bothered to write this article. Considering the difficulty level of the exam I faced, I thought I should share my inputs for aspirants of this exam. More so since the Tableau exam fee is on the higher side at $250. With $250, one could…
After reading and reviewing multiple articles and architecture reviews about AWS data pipelines involving Redshift, I could see a common need across all of them: the ability to stage data from S3, DynamoDB or RDS in Redshift and then build the warehouse tables on top of it.
This is a common pattern in data warehousing projects: data from multiple sources is first loaded into a staging schema or a staging database, and then transformed and loaded into the final data warehouse.
In the case of Redshift, the requirement to load all source data into Redshift as staging data…
Amazon Redshift gives us the option to load data from multiple sources such as S3, DynamoDB, EMR etc., and in different formats such as CSV, Avro, Parquet and so on.
Based on my working experience, I have realized that in an enterprise setup it is generally recommended to get all the data to S3, stage it there in logical buckets and folders, and then load it into Redshift.
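That load step is typically a Redshift COPY statement pointed at the staged S3 prefix. Below is a minimal sketch that builds one such statement; the bucket path, table name and IAM role ARN are all placeholders, and real scripts would execute this via a database driver rather than just print it:

```python
def build_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for CSV files staged under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )

sql = build_copy_sql(
    "staging.orders",                                    # hypothetical staging table
    "s3://my-datalake/staging/orders/",                  # hypothetical bucket/folder
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",   # placeholder role ARN
)
print(sql)
```

Staging each source under its own S3 prefix, as recommended above, is what makes a small parameterized builder like this reusable across all the loads.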
Before we write the scripts to load the data from S3 into Redshift, there are certain checklist items to be completed,
Once we spin up the Redshift cluster, we are excited to create tables, load data into them and query them to see the performance of this giant. Redshift has an inbuilt Query Editor through which we can fire queries. However, in an enterprise setup it is generally recommended to use SQL client tools like Aginity or SQL Workbench.
By default, Redshift is launched in the default VPC, with default subnets, route tables and an internet gateway enabled.
Now let us say I need to connect to this Redshift cluster from my Aginity client tool. …