# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
NOTE: This notebook has been tested in the following environment:
Python version = 3.10
Overview#
Use this notebook to walk through an example use case of generating sample code by using BigQuery DataFrames and its integration with Generative AI support on Vertex AI.
Learn more about BigQuery DataFrames.
Objective#
In this tutorial, you create a CSV file containing sample code for calling a given set of APIs.
The steps include:
Defining an LLM model in BigQuery DataFrames, specifically the Gemini model, using bigframes.ml.llm.
Creating a DataFrame by reading in data from Cloud Storage.
Manipulating data in the DataFrame to build LLM prompts.
Sending DataFrame prompts to the LLM model using the predict method.
Creating and using a custom function to transform the output provided by the LLM model response.
Exporting the resulting transformed DataFrame as a CSV file.
Dataset#
This tutorial uses a dataset listing the names of various pandas DataFrame and Series APIs.
Costs#
This tutorial uses billable components of Google Cloud:
BigQuery
Generative AI support on Vertex AI
Cloud Functions
Learn about BigQuery compute pricing, Generative AI support on Vertex AI pricing, and Cloud Functions pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
Installation#
Install the following packages, which are required to run this notebook:
!pip install bigframes --upgrade --quiet
Before you begin#
Complete the tasks in this section to set up your environment.
Set up your Google Cloud project#
The following steps are required, regardless of your notebook environment.
Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.
Enable the following APIs in your project:
BigQuery API
BigQuery Connection API
Cloud Functions API
Cloud Run API
Artifact Registry API
Cloud Build API
Cloud Resource Manager API
Vertex AI API
If you are running this notebook locally, install the Cloud SDK.
Set your project ID#
If you don’t know your project ID, try the following:
Run gcloud config list.
Run gcloud projects list.
See the support page: Locate the project ID.
PROJECT_ID = "" # @param {type:"string"} # Set the project id ! gcloud config set project {PROJECT_ID}
Set the region#
You can also change the REGION variable used by BigQuery. Learn more about BigQuery regions.
REGION = "US" # @param {type: "string"}
Authenticate your Google Cloud account#
Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.
Vertex AI Workbench
Do nothing; you are already authenticated.
Local JupyterLab instance
Uncomment and run the following cell:
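A typical cell for a local JupyterLab instance (this exact command is an assumption based on the standard Cloud SDK flow; it relies on the Cloud SDK installed in the previous section):

# ! gcloud auth login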
Colab
Uncomment and run the following cell:
# from google.colab import auth
# auth.authenticate_user()
Import libraries#
import bigframes.pandas as bf
from google.cloud import bigquery
from google.cloud import bigquery_connection_v1 as bq_connection
Set BigQuery DataFrames options#
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bf.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bf.options.bigquery.location = REGION
If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bf.close_session(). After that, you can reuse bf.options.bigquery.location to specify another location.
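A minimal sketch of that reset (the "EU" location here is illustrative, not part of this tutorial):

bf.close_session()  # reset the current session
bf.options.bigquery.location = "EU"  # illustrative: any supported BigQuery location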
Define the LLM model#
BigQuery DataFrames provides integration with Gemini Models via Vertex AI.
This section walks through the steps required to use the model in your notebook.
Define the model#
Use bigframes.ml.llm to define the model:
from bigframes.ml.llm import GeminiTextGenerator

model = GeminiTextGenerator(model_name="gemini-2.0-flash-001")
Read data from Cloud Storage into BigQuery DataFrames#
You can create a BigQuery DataFrames DataFrame by reading data from any of the following locations (see the sketch after this list):
A local data file
Data stored in a BigQuery table
A data file stored in Cloud Storage
An in-memory pandas DataFrame
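Each of these locations maps to a read function in bigframes.pandas (imported above as bf). A minimal sketch; the local path, table name, and bucket below are illustrative, not part of this tutorial:

import pandas as pd

df_local = bf.read_csv("my_data.csv")  # a local data file (illustrative path)
df_table = bf.read_gbq("my-project.my_dataset.my_table")  # a BigQuery table (illustrative name)
df_gcs = bf.read_csv("gs://my-bucket/my_data.csv")  # a file in Cloud Storage (illustrative path)
df_mem = bf.read_pandas(pd.DataFrame({"API": ["DataFrame.values"]}))  # an in-memory pandas DataFrame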
In this tutorial, you create BigQuery DataFrames DataFrames by reading two CSV files stored in Cloud Storage, one containing a list of DataFrame API names and one containing a list of Series API names.
df_api = bf.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")
series_api = bf.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/series.csv")
Take a peek at a few rows of data for each file:
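For example (head(2) here is an assumption; any small preview works):

df_api.head(2)
series_api.head(2)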
Generate code using the LLM model#
Prepare the prompts and send them to the LLM model for prediction.
Prompt design in BigQuery DataFrames#
Designing prompts for LLMs is a fast-growing area, and you can read more in this documentation.
For this tutorial, you use a simple prompt to ask the LLM model for sample code for each of the API methods (or rows) from the last step’s DataFrames. The output is the new DataFrames df_prompt and series_prompt, which contain the full prompt text.
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
series_prompt_prefix = "Generate Pandas sample code for Series."

df_prompt = (df_prompt_prefix + df_api['API'])
series_prompt = (series_prompt_prefix + series_api['API'])

df_prompt.head(2)
0    Generate Pandas sample code for DataFrame.values
1    Generate Pandas sample code for DataFrame.dtypes
Name: API, dtype: string
Make predictions using the LLM model#
Use the BigQuery DataFrames DataFrame containing the full prompt text as the input to the predict method. The predict method calls the LLM model and returns its generated text in two new BigQuery DataFrames DataFrames, df_pred and series_pred.
Note: The predictions might take a few minutes to run.
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)
series_pred = model.predict(series_prompt.to_frame(), max_output_tokens=1024)
Once the predictions are processed, take a look at the sample output from the LLM, which provides code samples for the API names listed in the DataFrames dataset.
print(df_pred['ml_generate_text_llm_result'].iloc[0])
```python
import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Get the values as a NumPy array
values = df.values

# Print the values
print(values)
```
Manipulate LLM output using a remote function#
The output that the LLM provides often contains additional text beyond the code sample itself. Using BigQuery DataFrames, you can deploy custom Python functions that process and transform this output.
Running the cell below creates a custom function that you can use to process the LLM output data in two ways:
Strip the LLM text output to include only the code block.
Substitute import pandas as pd with import bigframes.pandas as bf so that the resulting code block works with BigQuery DataFrames.
@bf.remote_function(cloud_function_service_account="default")
def extract_code(text: str) -> str:
    try:
        # Keep only the text between the opening ```python line and the
        # closing ``` fence.
        res = text[text.find('\n') + 1 : text.find('```', 3)]
        # Rewrite the pandas import so the sample targets BigQuery DataFrames.
        res = res.replace("import pandas as pd", "import bigframes.pandas as bf")
        if "import bigframes.pandas as bf" not in res:
            res = "import bigframes.pandas as bf\n" + res
        return res
    except Exception:
        return ""
The custom function is deployed as a Cloud Function, and then integrated with BigQuery as a remote function. Save both of the function names so that you can clean them up at the end of this notebook.
CLOUD_FUNCTION_NAME = format(extract_code.bigframes_cloud_function)
print("Cloud Function Name " + CLOUD_FUNCTION_NAME)

REMOTE_FUNCTION_NAME = format(extract_code.bigframes_remote_function)
print("Remote Function Name " + REMOTE_FUNCTION_NAME)
Cloud Function Name projects/swast-scratch/locations/us-central1/functions/bigframes-6e7606963c3f06b8181b3cb9449a4363
Remote Function Name swast-scratch._63cfa399614a54153cc386c27d6c0c6fdb249f9e.bigframes_6e7606963c3f06b8181b3cb9449a4363
Apply the custom function to each LLM output DataFrame to get the processed results:
df_code = df_pred.assign(code=df_pred['ml_generate_text_llm_result'].apply(extract_code))
series_code = series_pred.assign(code=series_pred['ml_generate_text_llm_result'].apply(extract_code))
You can see the differences by inspecting the first row of data:
print(df_code['code'].iloc[0])
import bigframes.pandas as bf

# Create a DataFrame
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Get the values as a NumPy array
values = df.values

# Print the values
print(values)
Save the results to Cloud Storage#
BigQuery DataFrames lets you save a BigQuery DataFrames DataFrame as a CSV file in Cloud Storage for further use. Try that now with your processed LLM output data.
Create a new Cloud Storage bucket with a unique name:
import uuid

BUCKET_ID = "code-samples-" + str(uuid.uuid1())

!gcloud storage buckets create gs://{BUCKET_ID}
Creating gs://code-samples-773ee0f2-e302-11ee-8298-4201c0a8181f/...
Use to_csv to write each BigQuery DataFrames DataFrame as a CSV file in the Cloud Storage bucket:
df_code[["code"]].to_csv(f"gs://{BUCKET_ID}/df_code*.csv") series_code[["code"]].to_csv(f"gs://{BUCKET_ID}/series_code*.csv")
You can navigate to the Cloud Storage bucket browser to download the two files and view them.
Run the following cell, and then follow the link to your Cloud Storage bucket browser:
print(f'https://console.developers.google.com/storage/browser/{BUCKET_ID}/')
https://console.developers.google.com/storage/browser/code-samples-773ee0f2-e302-11ee-8298-4201c0a8181f/
Summary and next steps#
You’ve used BigQuery DataFrames’ integration with LLM models (bigframes.ml.llm) to generate code samples, and you’ve transformed the LLM output by creating and using a custom function in BigQuery DataFrames.
Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.
Cleaning up#
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.
Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:
# # Delete the BigQuery Connection
# from google.cloud import bigquery_connection_v1 as bq_connection
# client = bq_connection.ConnectionServiceClient()
# # Note: CONN_NAME is the name of the BigQuery connection backing the
# # remote function; set it to your connection's name before running.
# CONNECTION_ID = f"projects/{PROJECT_ID}/locations/{REGION}/connections/{CONN_NAME}"
# client.delete_connection(name=CONNECTION_ID)
# print(f"Deleted connection '{CONNECTION_ID}'.")
# # Delete the Cloud Function
# ! gcloud functions delete {CLOUD_FUNCTION_NAME} --quiet

# # Delete the Remote Function
# REMOTE_FUNCTION_NAME = REMOTE_FUNCTION_NAME.replace(PROJECT_ID + ".", "")
# ! bq rm --routine --force=true {REMOTE_FUNCTION_NAME}
# # Delete the Google Cloud Storage bucket and files
# ! gcloud storage rm gs://{BUCKET_ID} --recursive
# print(f"Deleted bucket '{BUCKET_ID}'.")