Multi-Cloud Analytics with BigQuery Omni: No time to load!

Ravi Manjunatha
Google Cloud - Community
7 min read · Mar 27, 2022


With an increasing number of organizations adopting a multi-cloud strategy, data movement has become a point of contention for analytics teams. Data sprawl and the egress costs charged by the host cloud providers are some serious side effects of this.

Independent Software Vendors (ISVs) who develop multi-tenant data platforms for their customers find data spread across multiple clouds extremely challenging.

BigQuery Omni Reporting for Duty!

In line with GCP's core philosophy of building a truly open and multi-cloud platform, BigQuery Omni comes as a shot in the arm for data analytics teams and organizations alike that are implementing multi-cloud.

At its core, BigQuery decouples storage and compute. The essence of it is that, with BigQuery, the engineers wanted to bring the compute to where the big data resides rather than bring the data to where the compute resides. This is a paradigm shift from how traditional large-scale data warehouses operated. It enables independent scaling of compute vis-a-vis storage and is one of the key reasons for the immense scalability and flexibility BigQuery provides.

Extending this decoupling of storage from compute across cloud environments, BigQuery Omni makes it possible for teams to query data from BigQuery while the data resides in Amazon S3 or Azure Blob Storage. It is enabled by Anthos clusters that are fully managed by Google Cloud, so teams do not have to provision or manage anything extra. The Anthos clusters run in the same region where the data resides, so any latency due to shuffles is reduced further.

With this set-up, the traditional 3-step process of Locate, Load and Transform in a data pipeline is reduced to a 2-step process of just Locate and Transform. The Load step, which would have required making duplicate copies of the data and writing additional ETL pipelines, is now eliminated!

The consolidated data stack looks like the below, making BigQuery a single pane of glass for all analytics needs, irrespective of which cloud the data resides in.

Sharing the complete implementation steps below for a sample dataset in AWS S3:

  1. Create policies and roles in AWS IAM for enabling BigQuery Connections to access files in S3.
  2. Create a BigQuery External Connection to AWS S3
  3. Create an External table from BigQuery on the data in AWS S3

Let's get into the details of each step.

  1. Create policies and roles in AWS IAM for enabling BigQuery Connections to access files in S3.

Create a bucket in S3 in the us-east-1 region and load the data that has to be processed. For this demo, we will load the familiar employee and department datasets in CSV format, as found here.

Copy the S3 URIs, as these will be required for creating the tables in BigQuery later.
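
If you prefer to script this step, a minimal boto3 sketch could look like the below (the bucket name and local file names are placeholders):

# pip install boto3
import boto3

BUCKET = "bqomni-demo-bucket"  # placeholder bucket name

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket in us-east-1 (this region needs no LocationConstraint)
s3.create_bucket(Bucket=BUCKET)

# Upload the sample employee and department CSV files
for local_file in ["employee.csv", "department.csv"]:
    s3.upload_file(local_file, BUCKET, local_file)
    print(f"s3://{BUCKET}/{local_file}")  # note these URIs for the BigQuery tables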

Create a policy that allows BigQuery to list and get objects from the bucket: go to IAM > Policies > Create policy and name it, say, bqomniread:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::BUCKET_NAME"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::BUCKET_NAME/*"]
    }
  ]
}
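
If you are scripting the AWS setup, a hedged boto3 equivalent of the policy creation above could be the following (the bucket name is a placeholder; bqomniread matches the policy name used above):

# pip install boto3
import json
import boto3

iam = boto3.client("iam")

BUCKET = "bqomni-demo-bucket"  # placeholder bucket name

policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

response = iam.create_policy(
    PolicyName="bqomniread",
    PolicyDocument=json.dumps(policy_doc),
)
print(response["Policy"]["Arn"])  # keep this ARN handy for attaching to the role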

Create a new role that will be used from the BigQuery panel to connect to S3: go to IAM > Create new Role, select the options as below (attaching the bqomniread policy created above) and give your role a name, say bqomni_read_only. Copy the ARN of the role once it is created; this will be used in BigQuery while making the connection.

2. Create a BigQuery External Connection to AWS S3

We will now make a connection from BigQuery to AWS S3 using the BigQuery External Connection option.

Click Add Data > External data source > select AWS (via BigQuery Omni), and enter the ARN of the AWS role that we created earlier in the AWS role id field.

Once the connection is created, it will appear under External connections, like below,

Click on the connection to get the connection details,

BigQuery External Connection Details

The role bqomni_read_only now has to trust the connection we created so that BigQuery can assume it. To set up the trust, navigate to AWS IAM > click on the role created > Trust relationships > paste the JSON below, replacing IDENTITY_ID with the BigQuery Google Identity value from the BigQuery External Connection details shown above,

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "IDENTITY_ID"
        }
      }
    }
  ]
}
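
The same trust update, plus attaching the read policy to the role, can be scripted with boto3; the sketch below assumes the role and policy names used earlier and a placeholder AWS account id:

import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "bqomni_read_only"
POLICY_ARN = "arn:aws:iam::123456789012:policy/bqomniread"  # placeholder account id
IDENTITY_ID = "REPLACE_WITH_BIGQUERY_GOOGLE_IDENTITY"  # from the connection details

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {"accounts.google.com:sub": IDENTITY_ID}
            },
        }
    ],
}

# Overwrite the role's trust relationship with the policy above
iam.update_assume_role_policy(
    RoleName=ROLE_NAME, PolicyDocument=json.dumps(trust_policy)
)

# Ensure the S3 read policy created earlier is attached to the role
iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=POLICY_ARN)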

In addition to this, we now have to configure an identity provider for the BigQuery external connection. To do this, navigate to AWS IAM > Identity providers > enter the provider URL as shown below and enter the BigQuery Google Identity (from the BigQuery external connection) as the Audience. Once done, you will be asked to assign the role; select the bqomni_read_only role that was created earlier.
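
If you script this part as well, boto3 exposes the same operation; the thumbprint below is only a placeholder for the certificate thumbprint AWS expects for accounts.google.com:

import boto3

iam = boto3.client("iam")

IDENTITY_ID = "REPLACE_WITH_BIGQUERY_GOOGLE_IDENTITY"  # used as the audience

iam.create_open_id_connect_provider(
    Url="https://accounts.google.com",  # Google as the identity provider
    ClientIDList=[IDENTITY_ID],         # BigQuery Google Identity as the audience
    ThumbprintList=["0000000000000000000000000000000000000000"],  # placeholder
)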

Now we are all set to create the external tables from the BigQuery console.

3. Create an External table from BigQuery on the data in AWS S3.

Create a dataset, say bqomni, with the region selected as aws-us-east-1.

Navigate to the newly created dataset and create a new table; enter the details as below and select Automatically detect schema.
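
The same external table can also be created with DDL; below is a minimal sketch using the google-cloud-bigquery Python client, where the project id, connection id and bucket name are placeholders:

# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my-gcp-project.bqomni.department`
WITH CONNECTION `aws-us-east-1.bqomni-connection`   -- placeholder connection id
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['s3://bqomni-demo-bucket/department.csv']
)
"""

# The bqomni dataset lives in aws-us-east-1, so the job location must match
client.query(ddl, location="aws-us-east-1").result()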

An error message may pop up saying that the connection cannot assume the role and that, if you are creating this for the first time, you should try again after some time. This is usually due to propagation delays; wait a few minutes and come back to this connection again. It should work this time.

Once the table is created, you can test running queries on it. As expected, the top right will show that the 'query will process 0 B', since the data is not stored in BigQuery.

However, when we run the query, we will encounter the error below,

For BigQuery Omni, we need a reservation of 100 slots for the queries to work. The prospect of a monthly commitment of 100 slots initially left me disappointed at not being able to query the table, but this is where Flex slot reservations come to the rescue.

Buy 100 slots under Flex type, with details as shown below,

Once you have done your test runs and are thrilled at being able to connect to and process data residing in S3 from BigQuery, you can come back and cancel this commitment. Assuming you work on the queries for about an hour, you will incur around $5 for this activity.

Once your slots are purchased, the next step is to reserve them for the project where the bqomni dataset is created,

Under the Reservations tab, click on the row for the slots you created and select Create assignment; from the drop-down, select the project where bqomni is created.

Once the assignment is created, you can see it listed like below,
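
For teams that prefer to script this, the slot purchase, reservation and assignment can also be done with the google-cloud-bigquery-reservation client; the sketch below is only an approximation of the console flow above, with placeholder project and reservation names:

# pip install google-cloud-bigquery-reservation
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()

project = "my-gcp-project"  # placeholder project id
parent = f"projects/{project}/locations/aws-us-east-1"

# 1. Buy 100 Flex slots
client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
        slot_count=100,
    ),
)

# 2. Create a reservation that draws from those slots
res = client.create_reservation(
    parent=parent,
    reservation_id="bqomni-reservation",
    reservation=reservation.Reservation(slot_capacity=100),
)

# 3. Assign the reservation to the project that owns the bqomni dataset
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee=f"projects/{project}",
    ),
)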

Now we are all set to query our external table!

You can create the employee table in the same way, and then join the employee and department tables to get a combined output from two tables whose data resides in S3, as in the sketch below,
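
A hedged example of such a join, run through the Python client (the column names are assumptions about the sample CSVs):

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

sql = """
SELECT e.emp_name, d.dept_name              -- assumed column names
FROM `my-gcp-project.bqomni.employee` AS e
JOIN `my-gcp-project.bqomni.department` AS d
  ON e.dept_id = d.dept_id
"""

# Iterate over the query results, which are computed where the data lives
for row in client.query(sql, location="aws-us-east-1"):
    print(dict(row))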

You can load a few more, larger datasets into S3 and check the performance of this federated querying. As expected, it is a tad slower than if the data were to reside in BigQuery or elsewhere on GCP. It is, however, within a comparable range and is certainly worth implementing for production workloads.

Once you are done testing, do not forget to cancel the reservation and the Flex slots you created!
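
For reference, the corresponding cleanup calls in the reservation client look roughly like this (the parent path is a placeholder):

from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/my-gcp-project/locations/aws-us-east-1"  # placeholder

# Remove assignments, then reservations, then the Flex commitments
for res in client.list_reservations(parent=parent):
    for assignment in client.list_assignments(parent=res.name):
        client.delete_assignment(name=assignment.name)
    client.delete_reservation(name=res.name)

for commitment in client.list_capacity_commitments(parent=parent):
    client.delete_capacity_commitment(name=commitment.name)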

In a production scenario, if you do not have a flat-rate reservation and have only a few federated pipelines to start off with, Flex slots are the best way to begin.

Assuming you have an orchestration tool like Cloud Composer (managed Apache Airflow), here is how you can implement the above pipeline in an automated manner,
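
A minimal Airflow DAG sketch for this is shown below; the DAG id, schedule, project and SQL are assumptions, with the federated join query as the only task:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

JOIN_SQL = """
SELECT e.emp_name, d.dept_name
FROM `my-gcp-project.bqomni.employee` AS e
JOIN `my-gcp-project.bqomni.department` AS d
  ON e.dept_id = d.dept_id
"""

with DAG(
    dag_id="bqomni_s3_federated_query",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Run the federated query against the Omni dataset in aws-us-east-1
    join_employee_department = BigQueryInsertJobOperator(
        task_id="join_employee_department",
        configuration={"query": {"query": JOIN_SQL, "useLegacySql": False}},
        location="aws-us-east-1",
    )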

A similar implementation can be done for data residing in Azure Blob Storage as well.

Looker dashboards and BigQuery ML models can be created on these external tables much as they are on internal tables. This makes BigQuery Omni even more potent and empowering for existing BigQuery users!
