After reading and reviewing multiple article and architecture reviews about AWS data pipelines involving Redshift, i could see a common need across all of them, the ability to stage data from S3 or Dynamo DB or RDS in Redshift and then build the warehouse tables on top of it.
Generally in the Datawarehousing projects this is a general phenomenon where data from multiple sources are first loaded into a staging schema or a staging database and then transformed and loaded to the final datawarehouse.
In case of Redshift, the requirement to load all source data into Redshift as staging data was seen as a limitation earlier. With the availability of the Federated query option in Redshift now this limitation is now overcome. In fact, many data architects i have discussed hail this as one of the top features of Redshift now.
Before i review the utility of this feature, let me quote what the Amazon docs say about this, “Federated queries in Amazon Redshift enable you to query and analyze data across operational databases, data warehouses, and data lakes. With the Federated Query feature, you can integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon S3 environments. Federated queries can work with external databases in Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL compatibility.”. In a nutshell, this feature relieves us from staging data from multiple sources(not all though)in Redshift and the need for ETLs to get the data into Redshift.
Based on my experience of designing and building data solutions for enterprises, this definitely arms Redshift with great flexibility, especially for POCs which require quick turnaround time in terms of data ingestion, collection and analysis. Clients are generally not much concerned about the data pipelines upfront and are keen to see the final end product in terms of an integrated Dashboard, Machine Learning use cases being realized. A decent working NFRs is what assures them at this stage.
This option can also be used in the projects where customers have a limited digital or cloud budget and want to quickly develop and realize a parallel working version of their legacy systems. Get the buy-in from their management for compelte digital revamp of their legacy systems as part of a separate larger RFPs.
However when we look at designing and building at enterprise data pipelines and solutions, federated queries have limited utility. These are my reasons for it,
- Generally in large organisations, different teams manage and own different data systems , say one team could be managing RDS, another S3 and Redhsift and so on.
- I have seen large organisations where they have multiple teams managing different instances of the same Database or Datawarehouse.
- Such distributed ownership of data sources are fatal for datawarehouse implementation. Especially when it comes to data availability, data quality and data reconciliation. This is precisely the reason why staging layer from multiple sources is required. This staging layer also needs to be owned and managed by the same team which managed Datawarehouse.
- The downtime of all the source systems along with Redshift has to be in sync for continued availability of the data and dashboards.
- Data security rules with localisation of data in a particular region has to be factored in as well.
In spite of the above general limitations that are applicable across cloud datawarehouses, the ability to query and join data from multiple sources within Redshift has come as a great shot in the arm for Redshift and makes it more even more flexible for clod natives.