BigQuery DataFrames (bigframes) is an open-source Python library that brings distributed computing to your data science workflow. By providing a familiar pandas- and scikit-learn-compatible API, BigFrames lets you analyze and model massive datasets where they live: directly in BigQuery.
Why Choose BigQuery DataFrames?
BigFrames eliminates the “data movement bottleneck.” Instead of downloading large datasets to a local environment, BigFrames translates your Python code into optimized SQL, executing complex transformations across the BigQuery fleet.
Petabyte-Scale Scalability: Effortlessly process datasets that far exceed local memory limits.
Familiar Python Ecosystem: Use the same read_gbq, groupby, merge, and pivot_table functions you already know from pandas.
Integrated Machine Learning: Access BigQuery ML’s powerful algorithms via a scikit-learn-like interface (bigframes.ml), including seamless Gemini AI integration.
Enterprise-Grade Security: Maintain data governance and security by keeping your data within the BigQuery perimeter.
Hybrid Flexibility: Easily move between distributed BigQuery processing and local pandas analysis with to_pandas().
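The idea of translating Python code into SQL can be pictured with a toy example. The sketch below is not how BigFrames is implemented: `LazyQuery` and its methods are hypothetical names invented for illustration, and the generated SQL is only meant to show the general shape of deferred execution, where chained method calls are recorded first and compiled into a single query afterwards.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LazyQuery:
    """Toy, hypothetical stand-in for a lazily evaluated DataFrame.

    Each method returns a new object that remembers the requested
    operation; no data is touched until the SQL is actually run.
    """

    table: str
    group_by: Optional[str] = None
    sum_col: Optional[str] = None
    sorted_by: Optional[str] = None
    descending: bool = False
    limit: Optional[int] = None

    def groupby_sum(self, key: str, value: str) -> "LazyQuery":
        # Record a "group by key, sum value" step.
        return replace(self, group_by=key, sum_col=value)

    def sort_values(self, column: str, ascending: bool = True) -> "LazyQuery":
        # Record the sort column and direction.
        return replace(self, sorted_by=column, descending=not ascending)

    def head(self, n: int) -> "LazyQuery":
        # Record a row limit.
        return replace(self, limit=n)

    def to_sql(self) -> str:
        # Compile all recorded steps into one SQL statement.
        if self.group_by and self.sum_col:
            select = f"SELECT {self.group_by}, SUM({self.sum_col}) AS {self.sum_col}"
        else:
            select = "SELECT *"
        parts = [select, f"FROM `{self.table}`"]
        if self.group_by:
            parts.append(f"GROUP BY {self.group_by}")
        if self.sorted_by:
            direction = "DESC" if self.descending else "ASC"
            parts.append(f"ORDER BY {self.sorted_by} {direction}")
        if self.limit is not None:
            parts.append(f"LIMIT {self.limit}")
        return "\n".join(parts)


query = (
    LazyQuery("bigquery-public-data.usa_names.usa_1910_2013")
    .groupby_sum("name", "number")
    .sort_values("number", ascending=False)
    .head(10)
)
print(query.to_sql())
```

Because nothing runs until the query is compiled, the whole chain can be sent to the warehouse as one statement, and only the small final result needs to travel back to the client.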
Core Components of BigFrames
BigQuery DataFrames is organized into specialized modules designed for the modern data stack:
bigframes.pandas: A high-performance, pandas-compatible API for scalable data exploration, cleaning, and transformation.
bigframes.bigquery: Specialized utilities for direct BigQuery resource management, including integrations with Gemini and other AI models in the bigframes.bigquery.ai submodule.
Quickstart: Scalable Data Analysis in Seconds
Install BigQuery DataFrames via pip:
pip install --upgrade bigframes
The following example demonstrates how to perform a distributed aggregation on a public dataset with millions of rows using just a few lines of Python:
import bigframes.pandas as bpd

# Initialize BigFrames and load a public dataset
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")

# Perform familiar pandas operations that execute in the cloud
top_names = (
    df.groupby("name")
    .agg({"number": "sum"})
    .sort_values("number", ascending=False)
    .head(10)
)

# Bring the final, aggregated results back to local memory if needed
print(top_names.to_pandas())
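The same pattern can be wrapped in a small reusable function. This is a minimal sketch: the helper name `top_values` and its parameters are hypothetical, and it uses only the calls shown in the quickstart. It assumes bigframes is installed and that Google Cloud credentials and a billing project are already configured in the environment.

```python
def top_values(table_id: str, key: str, value: str, n: int = 10):
    """Return the n largest per-key sums of `value` as a local pandas DataFrame.

    Hypothetical helper for illustration; not part of the bigframes API.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")

    # Imported lazily so the argument check above works even where
    # bigframes is not installed.
    import bigframes.pandas as bpd

    df = bpd.read_gbq(table_id)

    # The aggregation, sort, and limit are executed in BigQuery.
    result = (
        df.groupby(key)
        .agg({value: "sum"})
        .sort_values(value, ascending=False)
        .head(n)
    )

    # Only the n aggregated rows are downloaded to local memory.
    return result.to_pandas()
```

Calling `top_values("bigquery-public-data.usa_names.usa_1910_2013", "name", "number")` should reproduce the quickstart result. Keeping `to_pandas()` as the last step means the download is limited to the already reduced table rather than the full dataset.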
Explore the Documentation
User Documentation
API Reference