# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
NOTE: This notebook has been tested in the following environment:
Python version = 3.11
Overview#
This notebook demonstrates training a linear regression model on big data using BigQuery DataFrames ML. BigQuery DataFrames ML provides a scikit-learn-like API for ML powered by the BigQuery engine.
Learn more about BigQuery DataFrames.
Objective#
In this tutorial, we use BigQuery DataFrames to create a linear regression model that predicts the levels of Ozone in the atmosphere.
The steps include:
Creating a DataFrame from the BigQuery table.
Cleaning and preparing data using the bigframes.pandas module.
Creating a linear regression model using the bigframes.ml module.
Saving the ML model to BigQuery for future use.
Let’s formally define our problem as: Train a linear regression model to predict the level of ozone in the atmosphere given the measurements of other constituents and properties of the atmosphere.
Dataset#
In this tutorial we are going to use the bigquery-public-data.epa_historical_air_quality dataset. To quote the description of the dataset:
“The United States Environmental Protection Agency (EPA) protects both public health and the environment by establishing the standards for national air quality. The EPA provides annual summary data as well as hourly and daily data in the categories of criteria gases, particulates, meteorological, and toxics.”
There are several tables capturing data about the constituents of the atmosphere; see them in the BigQuery cloud console. Most tables hold tens of GBs of data, but that is not an issue for BigQuery DataFrames, as the data is processed efficiently inside BigQuery without being transferred to the client.
Costs#
This tutorial uses billable components of Google Cloud:
BigQuery (compute)
BigQuery ML
Learn about BigQuery compute pricing and BigQuery ML pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
Installation#
If you don’t already have the bigframes package installed, uncomment and execute the following cells to
Install the package
Restart the notebook kernel (Jupyter or Colab) to work with the package
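A minimal install cell might look like the following sketch (commented out like the other setup cells; it assumes a pip-based environment such as Colab or a local Jupyter kernel):

```python
# Install the BigQuery DataFrames package (uncomment to run)
# !pip install bigframes
```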
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython
#
# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)
Before you begin#
Complete the tasks in this section to set up your environment.
Set up your Google Cloud project#
The following steps are required, regardless of your notebook environment.
Select or create a Google Cloud project. When you first create an account, you get a $300 credit towards your compute/storage costs.
If you are running this notebook locally, install the Cloud SDK.
Set your project ID#
If you don’t know your project ID, try the following:
Run gcloud config list.
Run gcloud projects list.
See the support page: Locate the project ID.
PROJECT_ID = "" # @param {type:"string"}
Set the BigQuery location#
You can also change the LOCATION variable used by BigQuery. Learn more about BigQuery locations.
LOCATION = "US" # @param {type: "string"}
Set up APIs, IAM permissions and Authentication#
Follow the instructions at https://cloud.google.com/bigquery/docs/use-bigquery-dataframes#permissions.
Depending on your notebook environment, you might have to manually authenticate. Follow the relevant instructions below.
Vertex AI Workbench
Do nothing, you are already authenticated.
Local JupyterLab instance
Uncomment and run the following cell:
# ! gcloud auth login
# ! gcloud auth application-default login
Colab
Uncomment and run the following cell:
# from google.colab import auth
# auth.authenticate_user()
Import libraries#
import bigframes.pandas as bpd
Set BigQuery DataFrames options#
# NOTE: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# NOTE: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = LOCATION

# NOTE: For a machine learning model the order of the data is
# not important. So let's relax the ordering_mode to accept
# partial ordering. This allows BigQuery DataFrames to run cost
# and performance optimized jobs at the BigQuery engine.
bpd.options.bigquery.ordering_mode = "partial"
If you want to reset the location of the created DataFrame or Series objects, reset the session by executing bpd.close_session(). After that, you can reuse bpd.options.bigquery.location to specify another location.
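As a sketch, resetting the session and choosing another location might look like this (commented out so it doesn't disturb the current session; "EU" is just an example value):

```python
# Reset the session, then pick a new location for subsequent reads:
# bpd.close_session()
# bpd.options.bigquery.location = "EU"  # example: the EU multi-region
```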
Read data in BigQuery tables as DataFrame#
Let’s read the tables in the dataset to construct a BigQuery DataFrames DataFrame. We will combine measurements of various parameters of the atmosphere from multiple tables into a consolidated DataFrame to use for our model training and prediction. Both daily and hourly versions of the data are available, but since we want a model that is dynamic enough to capture the variance throughout the day, we choose the hourly version.

Note that we use the pandas APIs as we normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of the local environment.
dataset = "bigquery-public-data.epa_historical_air_quality"

hourly_summary_tables = [
    "co_hourly_summary",
    "hap_hourly_summary",
    "no2_hourly_summary",
    "nonoxnoy_hourly_summary",
    "o3_hourly_summary",
    "pm10_hourly_summary",
    "pm25_frm_hourly_summary",
    "pm25_nonfrm_hourly_summary",
    "pm25_speciation_hourly_summary",
    "pressure_hourly_summary",
    "rh_and_dp_hourly_summary",
    "so2_hourly_summary",
    "temperature_hourly_summary",
    "voc_hourly_summary",
    "wind_hourly_summary",
]
Let’s pick the index columns (which identify a measurement of an atmospheric parameter), the param column (which identifies which parameter the measurement pertains to), and the value column (which contains the measurement itself).
index_columns = ["state_name", "county_name", "site_num", "date_local", "time_local"]
param_column = "parameter_name"
value_column = "sample_measurement"
Let’s observe how much data each table contains:
for table in hourly_summary_tables:
    # get the bigframes global session
    bigframes_session = bpd.get_global_session()

    # get the bigquery table info
    table_info = bigframes_session.bqclient.get_table(f"{dataset}.{table}")

    # read the table as a dataframe
    df = bpd.read_gbq(f"{dataset}.{table}")

    # print metadata about the table
    print(
        f"{table}: "
        f"{round(table_info.num_bytes / 1_000_000_000, 1)} GB, "
        f"{round(table_info.num_rows / 1_000_000, 1)} million rows, "
        f"{df[param_column].nunique()} params"
    )
Let’s be mindful that the rows in each table may contain duplicates, which may introduce bias in any model trained on the raw data. We will make sure to drop the duplicates when we use the data for model training.
Since we want to predict the ozone level, we obviously pick the o3 table. Let’s also pick the tables about other gases: co, no2, and so2. Let’s also pick the pressure and temperature tables, as they seem like fundamental indicators for the atmosphere. Note that each of these tables captures measurements for a single parameter (i.e., the column parameter_name has a single unique value).
We are also interested in the nonoxnoy and wind tables, but they capture multiple parameters (i.e., the column parameter_name has more than one unique value). We will include their measurements in a later step, as they require extra processing to separate out the measurements for the individual parameters.
We skip the other tables in this exercise, either because they have very little or fragmented data, or because they seem uninteresting for the purpose of predicting ozone levels. You can take it as a separate exercise to train a linear regression model that includes those parameters.
Let’s maintain an array of DataFrames, one for each parameter, and eventually combine them into a single DataFrame.
Let’s process the tables with single parameter measurements first.
table_param_dict = {
    "co_hourly_summary": "co",
    "no2_hourly_summary": "no2",
    "o3_hourly_summary": "o3",
    "pressure_hourly_summary": "pressure",
    "so2_hourly_summary": "so2",
    "temperature_hourly_summary": "temperature",
}

# accumulate one DataFrame per parameter
params_dfs = []

for table, param in table_param_dict.items():
    param_df = bpd.read_gbq(
        f"{dataset}.{table}",
        columns=index_columns + [value_column]
    )
    param_df = (
        param_df
        .sort_values(index_columns)
        .drop_duplicates(index_columns)
        .set_index(index_columns)
        .rename(columns={value_column: param})
    )
    params_dfs.append(param_df)
The nonoxnoy table captures measurements for 3 parameters. Let’s analyze how many instances of each parameter it contains.
nonoxnoy_table = f"{dataset}.nonoxnoy_hourly_summary"
bpd.read_gbq(nonoxnoy_table, columns=[param_column]).value_counts()
We see that the NOy data is significantly sparser than NO and NOx, so we skip it and include only the NO and NOx data.
no_df = bpd.read_gbq(
    nonoxnoy_table,
    columns=index_columns + [value_column],
    filters=[(param_column, "==", "Nitric oxide (NO)")]
)
no_df = (
    no_df
    .sort_values(index_columns)
    .drop_duplicates(index_columns)
    .set_index(index_columns)
    .rename(columns={value_column: "no_"})
)
params_dfs.append(no_df)

nox_df = bpd.read_gbq(
    nonoxnoy_table,
    columns=index_columns + [value_column],
    filters=[(param_column, "==", "Oxides of nitrogen (NOx)")]
)
nox_df = (
    nox_df
    .sort_values(index_columns)
    .drop_duplicates(index_columns)
    .set_index(index_columns)
    .rename(columns={value_column: "nox"})
)
params_dfs.append(nox_df)
The wind table captures measurements for 2 parameters. Let’s analyze how many instances of each parameter it contains.
wind_table = f"{dataset}.wind_hourly_summary"
bpd.read_gbq(wind_table, columns=[param_column]).value_counts()
Let’s include the data for wind speed and wind direction.
wind_speed_df = bpd.read_gbq(
    wind_table,
    columns=index_columns + [value_column],
    filters=[(param_column, "==", "Wind Speed - Resultant")]
)
wind_speed_df = (
    wind_speed_df
    .sort_values(index_columns)
    .drop_duplicates(index_columns)
    .set_index(index_columns)
    .rename(columns={value_column: "wind_speed"})
)
params_dfs.append(wind_speed_df)

wind_dir_df = bpd.read_gbq(
    wind_table,
    columns=index_columns + [value_column],
    filters=[(param_column, "==", "Wind Direction - Resultant")]
)
wind_dir_df = (
    wind_dir_df
    .sort_values(index_columns)
    .drop_duplicates(index_columns)
    .set_index(index_columns)
    .rename(columns={value_column: "wind_dir"})
)
params_dfs.append(wind_dir_df)
Let’s observe each individual parameter and the number of data points it has.
for param_df in params_dfs:
    print(f"{param_df.columns.values}: {len(param_df)}")
Let’s combine data from all parameters into a single DataFrame. Since the measurements for each parameter may not be available for every (state, county, site, date, time) identifier, we will consider only those identifiers for which measurements of all parameters are available. To achieve this, we combine the measurements via an "inner" join.

We will also materialize this combined data via the cache method for efficient reuse in the subsequent steps.
df = bpd.concat(params_dfs, axis=1, join="inner").cache()
df.shape
Clean and prepare data#
Let’s temporarily bring the index columns back as DataFrame columns so we can process the index values during data preparation. We will reconstruct the index at model training time.
Observe the years from which we have consolidated data so far.
df["date_local"].dt.year.value_counts().sort_index().to_pandas()
In this tutorial we train a model on past data to predict ozone levels for future data. Let’s define the cut-off year as 2020. We will pretend that the data before 2020 has known ozone levels, and that from 2020 onwards the ozone levels are unknown, which we will predict using our model.
We should further separate the known data into training and test sets. The model is trained on the training set and then evaluated on the test set to make sure it generalizes beyond the training data. We could use the train_test_split method to randomly split the training and test data, but we leave that for you to try out. In this exercise, let’s split based on another cutoff year, 2017: the known data before 2017 is the training data, and the data from 2017 onwards is the test data. This way we stay with the idea that the model is trained on past data and then used to predict future values.
train_data_filter = (df.date_local.dt.year < 2017)
test_data_filter = (df.date_local.dt.year >= 2017) & (df.date_local.dt.year < 2020)
predict_data_filter = (df.date_local.dt.year >= 2020)

df_train = df[train_data_filter].set_index(index_columns)
df_test = df[test_data_filter].set_index(index_columns)
df_predict = df[predict_data_filter].set_index(index_columns)

df_train.shape, df_test.shape, df_predict.shape
Prepare your feature (or input) columns and the target (or output) column for the purpose of model training and evaluation:
X_train = df_train.drop(columns="o3")
y_train = df_train["o3"]

X_test = df_test.drop(columns="o3")
y_test = df_test["o3"]
Prepare the unknown data for prediction.
X_predict = df_predict.drop(columns="o3")
Create the linear regression model#
BigQuery DataFrames ML lets you seamlessly transition from exploring data to creating machine learning models through its scikit-learn-like API, bigframes.ml. BigQuery DataFrames ML supports several types of ML models.
In this notebook, you create a LinearRegression model, a type of regression model that generates a continuous value from a linear combination of input features.
When you create a model with BigQuery DataFrames ML, it is saved in an internal location and limited to the BigQuery DataFrames session. However, as you’ll see in the next section, you can use to_gbq to save the model permanently to your BigQuery project.
Create the model using bigframes.ml#
Please note that BigQuery DataFrames ML is backed by BigQuery ML, which uses automatic preprocessing to encode string values and scale numeric values when you pass the feature columns without transforms.
BigQuery ML also automatically splits the data for training and evaluation, although for datasets with fewer than 500 rows, all rows are used for training.
from bigframes.ml.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
Score the model#
Check how the model performs by using the score method. More information on BigQuery ML model scoring can be found here.
# On the training data
model.score(X_train, y_train)
# On the test data
model.score(X_test, y_test)
Predict using the model#
Use the model to predict the levels of ozone. The predicted levels are returned in the column predicted_o3.
df_pred = model.predict(X_predict)
df_pred.peek()
Save the model in BigQuery#
The model is saved locally within this session. You can save the model permanently to BigQuery for use in future sessions, and to make it shareable with others.
Create a BigQuery dataset to house the model, adding a name for your dataset as the DATASET_ID variable:
DATASET_ID = ""  # @param {type:"string"}

if not DATASET_ID:
    raise ValueError("Please define the DATASET_ID")

client = bpd.get_global_session().bqclient
dataset = client.create_dataset(DATASET_ID, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")
Save the model using the to_gbq method:
model.to_gbq(DATASET_ID + ".o3_lr_model", replace=True)
You can view the saved model in the BigQuery console under the dataset you created in the first step. Run the following cell and follow the link to view your BigQuery console:
print(f'https://console.cloud.google.com/bigquery?ws=!1m5!1m4!5m3!1s{PROJECT_ID}!2s{DATASET_ID}!3so3_lr_model')
Summary and next steps#
You’ve created a linear regression model using bigframes.ml.
Learn more about BigQuery DataFrames in the documentation and find more sample notebooks in the GitHub repo.
Cleaning up#
To clean up all Google Cloud resources used in this project, you can delete the Google Cloud project you used for the tutorial.
Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:
# # Delete the BigQuery dataset and associated ML model
# client.delete_dataset(DATASET_ID, delete_contents=True, not_found_ok=True)