Reimagining Data Engineering & Data Analytics with Gemini

Ravi Manjunatha
Google Cloud - Community
8 min read · Mar 10, 2024


(Views expressed are personal; they are points of view and the art of the possible with Gen AI.)

AI pair programming and code assistance have been in vogue for some time now. The developer community has been exploring this space to augment its productivity. Faster CUT (Coding & Unit Testing) phases have been all the rage in recent months, and development teams cite a good jump in productivity from these offerings.

However, the non-CUT phases, such as Design, Governance (including Data Lineage), and even Run and Maintenance, are largely untapped and unexplored from a Gen AI perspective. In this article, I will share points of view and the art of the possible with Gemini for these phases of the Data Engineering and Analytics SDLC.

Gemini is a natively multimodal model. It can take images and videos in addition to text as inputs and generate text as output. It also has state-of-the-art reasoning and code generation capabilities. This makes it a potent combination, a Swiss Army knife that can be repurposed for a wide range of use cases.

Design Phase

Let us say I need to build a batch data pipeline in GCP. I could be new to Data Engineering, new to a cloud platform like GCP, or both.

I will need to review the design of my solution before I start coding and building it. I came up with this solution (in Lucidchart, in PowerPoint, or even as a handwritten or meeting-room rough sketch, as long as it is legible :-)

(Please follow the first few steps outlined in this official Google Colab documentation to load the libraries, configure your project, and define the helper functions.

If you are new to prompting and would rather try out these features in a UI to get some momentum, please use Google AI Studio. You can use the images shared in this article along with the prompts by typing them into the AI Studio console as well.)
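
For completeness, here is a minimal sketch of the setup the snippets below assume, based on the helper functions in the referenced Colab; the model name and helper implementations are assumptions and may differ in your environment.

import http.client
import typing
import urllib.request

import vertexai
from vertexai.generative_models import GenerativeModel, Image

# Assumed setup: replace with your own project ID and region.
vertexai.init(project="your-project-id", location="us-central1")

# Assumed model name; use the Gemini multimodal model available to you.
multimodal_model = GenerativeModel("gemini-1.0-pro-vision")

def load_image_from_url(image_url: str) -> Image:
    # Download the image bytes and wrap them as a Vertex AI Image part.
    with urllib.request.urlopen(image_url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        return Image.from_bytes(response.read())

def print_multimodal_prompt(contents: list):
    # Print the text parts of a mixed prompt and mark the image parts.
    for content in contents:
        if isinstance(content, Image):
            print("[image]")
        else:
            print(content)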

We will now ask Gemini to validate this architecture with the following steps,

data_pipeline_url = "https://storage.googleapis.com/<bucketname>/Datapipeline.JPG"
datapipeline = load_image_from_url(data_pipeline_url)

instructions = "Instructions: Consider the following image, datapipeline that contains Batch Data pipeline architetcure:"
prompt1 = "I need to design a Batch Data Pipeline validate if it contains all the required components such as scale, schedule, monitroing, governance\
."

prompt2 = """
Answer the question through these steps:
Step 1: Describe the Batch Data Pipeline architecture as depicted in the picture datapeline. Answer only from the image datapipeline, do not generate your own response.
Step 2: List out the standard features of the batch Data pipeline architecture that are missing as per the ask in prompt1
Step 3: Based on the response from Step3 can you recommend a solution to the question in prompt1 with required GCP data pipeline components

Answer and describe the steps taken:
"""

contents = [
    instructions,
    datapipeline,
    prompt1,
    prompt2,
]

responses = multimodal_model.generate_content(contents, generation_config={"max_output_tokens": 2048, "temperature": 0.0}, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

The following is the response generated by the above prompts,

Gemini is able to validate the architecture and give recommendations on additional services that need to be included to make my data pipeline more robust. This pattern can be applied to RFP solution validation as well. It is cloud and technology agnostic, so one can easily follow this pattern across different cloud providers, full-stack web development, and other tech stacks.

With this, one can easily explore Gemini as an AI pair architect or designer. I will be keen to see how this capability can be explored in non-IT areas such as housing, manufacturing, and electronics, where design is a key aspect of the solution.
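
As an illustration of how the pattern generalizes, below is a minimal sketch that wraps the steps above into a reusable helper; the function name validate_architecture and its parameters are hypothetical, and it reuses the load_image_from_url helper and multimodal_model from the setup above.

# Hypothetical wrapper around the validation pattern above; the function name
# and parameters are illustrative, not part of any SDK.
def validate_architecture(image_url: str, requirements: str) -> str:
    # Ask Gemini to review an architecture diagram against stated requirements.
    diagram = load_image_from_url(image_url)
    contents = [
        "Instructions: Consider the following architecture diagram:",
        diagram,
        f"Validate whether the design covers these requirements: {requirements}.",
        "List the components shown, the requirements that are not met, and recommend services that would close the gaps.",
    ]
    responses = multimodal_model.generate_content(
        contents,
        generation_config={"max_output_tokens": 2048, "temperature": 0.0},
        stream=True,
    )
    return "".join(response.text for response in responses)

# Example: the same helper could review an RFP solution diagram.
# print(validate_architecture(
#     "https://storage.googleapis.com/<bucketname>/rfp_solution.JPG",
#     "scale, schedule, monitoring, governance",
# ))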

Code Generation

Let us now explore code generation with Gemini. While one can write prompts and generate code with Gemini (as shared in this documentation), just as we do with other AI pair code assistants or code generators, what is fascinating is how it can generate code from just the architecture diagrams as well.

We will use the same batch data pipeline architecture to generate the code with the below prompt,

code_data_pipeline_url = "https://storage.googleapis.com/<your bucket name>/Datapipeline.JPG"
code_data_pipeline = load_image_from_url(code_data_pipeline_url)


instructions = "Instructions: Consider the following image, code_data_pipeline that contains Low Level design of a Batch Data Pipeline:"
prompt1 = "I need to generate python code to implement the batch data pipeline. "

prompt2 = """
Answer the question through these steps:
Step 1: Describe the Data Platform architecture as depicted in the image code_data_pipeline. Answer only from the image code_data_pipeline, do not generate your own response.
Step 2: Generate Python code to implement the batch data pipeline as per the image code_data_pipeline.
Answer and describe the steps taken:
"""

contents = [
    instructions,
    code_data_pipeline,
    prompt1,
    prompt2,
]

responses = multimodal_model.generate_content(contents, generation_config={"max_output_tokens": 2048, "temperature": 0.1}, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

Of course, it is rather unconventional to use architecture diagrams to generate code, at least as of now. But “architecture as code” as a way of augmenting developer productivity could be explored in the days to come.

Let us take a more relevant use case for generating code, “Dashboard Migration”.

Let us say I have dashboards built in licensed software tools, and I want to migrate them to open-source options to reduce costs and allow more customization.

Let us take this dashboard screen, a publicly available one built with a licensed software solution, and ask Gemini to generate the Python equivalent for migration.

code_visualisation_url = "https://storage.googleapis.com/<bucketname>/tableaucharts.JPG"
code_visualisation = load_image_from_url(code_visualisation_url)


instructions = "Instructions: Consider the following image, code_visualisation that contains a screenshot of Tableau Dashboard:"
prompt1 = "I need to generate python code to implement charts as in the Tableau Dashboard code_visualisation. "

prompt2 = """
Answer the question through these steps:
Step 1: Identify the charts in the image code_visualisation.
Step 2: For each chart, extract the data and generate Python code to implement the chart as per the image code_visualisation.
Answer and describe the steps taken:
"""

contents = [
    instructions,
    code_visualisation,
    prompt1,
    prompt2,
]

responses = multimodal_model.generate_content(contents, generation_config={"max_output_tokens": 2048, "temperature": 0.1}, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

While the code generated doesn't help me reproduce the exact same output, it helps me significantly boost my productivity in doing so.
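
For illustration, the generated code typically renders each chart with an open-source library such as Matplotlib or Plotly; the sketch below shows the general shape of such output, with made-up sample data rather than Gemini's actual response or values read from the dashboard.

# Hypothetical example of the kind of chart code such a migration produces.
# The categories and values are made-up sample data, not figures extracted
# from the Tableau screenshot.
import matplotlib.pyplot as plt

categories = ["Furniture", "Office Supplies", "Technology"]
sales = [742000, 719000, 836000]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, sales, color="steelblue")
ax.set_title("Sales by Category")
ax.set_ylabel("Sales ($)")
plt.tight_layout()
plt.show()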

Data Lineage

Let us now explore how data lineage and data governance can be tapped with Gemini.

I will consider a non-GCP product stack.

datalineage_url = "https://storage.googleapis.com/<yourbucektname>/datalineage.JPG"
datalineage = load_image_from_url(datalineage_url)


instructions = "Instructions: Consider the following image, datalineage that contains a screenshot of Data Lineage for a Business Lines:"
prompt1 = "I need to extract the data lineage from the picture datalineage. "

prompt2 = """
Answer the question through these steps:
Step 1: Identify the source and destination tables in each stage of the image datalineage.
Answer and describe the steps taken:
"""

contents = [
    instructions,
    datalineage,
    prompt1,
    prompt2,
]

responses = multimodal_model.generate_content(contents, generation_config={"max_output_tokens": 2048, "temperature": 0.0}, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

The output generated from the above prompt is as below,

I can control or format the output by giving specific instructions in the prompt or by providing few-shot samples.
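
For example, a format instruction (optionally with a small few-shot sample) can simply be appended to the same contents list. The snippet below is a minimal sketch that reuses the datalineage image, instructions, prompt1, and multimodal_model from the previous block; the JSON schema and the one-line example are illustrative assumptions.

# Minimal sketch of constraining the output format; the JSON schema and the
# one-line few-shot example are illustrative assumptions.
format_prompt = """
Return the lineage strictly as JSON, one object per edge, for example:
{"source": "raw_orders", "destination": "stg_orders", "stage": "staging"}
Do not include any text outside the JSON.
"""

contents = [
    instructions,
    datalineage,
    prompt1,
    format_prompt,
]

responses = multimodal_model.generate_content(
    contents,
    generation_config={"max_output_tokens": 2048, "temperature": 0.0},
    stream=True,
)

for response in responses:
    print(response.text, end="")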

For analyzing data in data warehouse platforms such as BigQuery, there are many native options built in; more information on this can be found in my article.

Support & Maintenance

From a Run & maintenance perspective understanding query run times, monitoring Data and ML pipeline runs is very crucial. GCP has many of these features natively enabled within the pattern. More info on this , can be found here.

GCP offers an open cloud solution, where non-native GCP query engines such as Spark can also be deployed. In this context, let us explore how we can use Gemini to analyze Spark event timelines and resource utilization,

sparkdag_url = "https://storage.googleapis.com/<bucketname>/sparkdag.JPG"
sparkdag = load_image_from_url(sparkdag_url)


instructions = "Instructions: Consider the following image, sparkdag that contains a screenshot of event timeline and resource utilisation:"
prompt1 = "I need to extract the resource utilisation from the image sparkdag. "

prompt2 = """
Answer the question through these steps:
Step 1: Identify the different activities involved in the image sparkdag.
Step 2: Check in the image sparkdag if the partitions are evenly distributed.
Step 3: Check which activity takes the most compute in sparkdag.
Answer and describe the steps taken:
"""

contents = [
    instructions,
    sparkdag,
    prompt1,
    prompt2,
]

responses = multimodal_model.generate_content(contents, generation_config={"max_output_tokens": 2048, "temperature": 0.0}, stream=True)

print("-------Prompt--------")
print_multimodal_prompt(contents)

print("\n-------Response--------")
for response in responses:
    print(response.text, end="")

The output from the above prompt can be seen as below,

In this article, I have shared the art of the possible with Gemini across the SDLC for Data Engineering and Data Analytics. I will be keen to see how fellow practitioners take this approach, cross-pollinate it into their respective tech stacks, and improve productivity.
