```python
# BigQuery table data on which notebook should be run
TABLE = 'bigquery-public-data.stackoverflow.comments'

# Change this up to test the scale, down to run the notebook faster
MAX_ROWS = 100000
```
```python
%%time
##############################
#        Pandas World        #
##############################
import pandas as pd

df = pd.read_gbq(TABLE, max_results=MAX_ROWS)[['id', 'text', 'score']]
df.head(10)
```
CPU times: user 2.34 s, sys: 307 ms, total: 2.65 s
Wall time: 17.8 s
| | id | text | score |
|---|---|---|---|
| 0 | 10 | It will help if you give some details of which... | 6 |
| 1 | 25 | infact it does. Look a the first lines of your... | 10 |
| 2 | 27 | "Currently + is implemented using StringBuffer... | 7 |
| 3 | 41 | I don't think that's the magic number he was r... | 18 |
| 4 | 59 | It's still very useful to know that magic numb... | 12 |
| 5 | 96 | This implementation is also nice if you wish t... | 9 |
| 6 | 108 | That's not full text searching, it's searching... | 6 |
| 7 | 109 | That's not full text searching, it's searching... | 6 |
| 8 | 137 | In vim you can open > 1 buffer. :e filename. T... | 9 |
| 9 | 154 | Sure, but what about a solution using O(1) mem... | 8 |
```python
# User defined function
# https://www.codespeedy.com/find-nth-prime-number-in-python/
def nth_prime(n: int) -> int:
    prime_numbers = [2, 3]
    i = 3
    if 0 < n <= 2:
        return prime_numbers[n - 1]
    elif n > 2:
        while True:
            i += 1
            status = True
            for j in range(2, int(i / 2) + 1):
                if i % j == 0:
                    status = False
                    break
            if status:
                prime_numbers.append(i)
                if len(prime_numbers) == n:
                    break
        return prime_numbers[n - 1]
    else:
        return -1
```
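Before applying this to 100,000 rows (or shipping it to BigQuery later), a quick local spot check is cheap insurance. The sketch below is self-contained: it uses a compact trial-division reimplementation with the same semantics as the `nth_prime` above (including the `-1` sentinel for non-positive input), not the notebook's exact code.

```python
# Compact trial-division stand-in with the same semantics as nth_prime above.
def nth_prime(n: int) -> int:
    if n <= 0:
        return -1  # same sentinel as the notebook's function
    primes = [2]
    candidate = 2
    while len(primes) < n:
        candidate += 1
        if all(candidate % p for p in primes):
            primes.append(candidate)
    return primes[n - 1]

# Spot checks against known primes.
assert [nth_prime(k) for k in range(1, 7)] == [2, 3, 5, 7, 11, 13]
assert nth_prime(25) == 97
assert nth_prime(0) == -1
```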
```python
%%time
df = df.assign(n_prime=df['score'].apply(nth_prime))
df.head(10)
```
CPU times: user 3.32 s, sys: 0 ns, total: 3.32 s
Wall time: 3.32 s
| | id | text | score | n_prime |
|---|---|---|---|---|
| 0 | 10 | It will help if you give some details of which... | 6 | 13 |
| 1 | 25 | infact it does. Look a the first lines of your... | 10 | 29 |
| 2 | 27 | "Currently + is implemented using StringBuffer... | 7 | 17 |
| 3 | 41 | I don't think that's the magic number he was r... | 18 | 61 |
| 4 | 59 | It's still very useful to know that magic numb... | 12 | 37 |
| 5 | 96 | This implementation is also nice if you wish t... | 9 | 23 |
| 6 | 108 | That's not full text searching, it's searching... | 6 | 13 |
| 7 | 109 | That's not full text searching, it's searching... | 6 | 13 |
| 8 | 137 | In vim you can open > 1 buffer. :e filename. T... | 9 | 23 |
| 9 | 154 | Sure, but what about a solution using O(1) mem... | 8 | 19 |
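Since `Series.apply` calls `nth_prime` once per row and comment scores repeat heavily, memoizing the function is an easy way to cut the local wall time. A minimal sketch with `functools.lru_cache`, again using a compact trial-division stand-in rather than the notebook's exact function:

```python
import functools

import pandas as pd


@functools.lru_cache(maxsize=None)
def nth_prime(n: int) -> int:
    # Compact stand-in for the notebook's nth_prime, with the same -1 sentinel.
    if n <= 0:
        return -1
    primes = [2]
    candidate = 2
    while len(primes) < n:
        candidate += 1
        if all(candidate % p for p in primes):
            primes.append(candidate)
    return primes[n - 1]


# Scores like those in the rows above; repeated values hit the cache.
scores = pd.Series([6, 10, 7, 18, 12, 9, 6, 6, 9, 8])
print(scores.apply(nth_prime).tolist())  # → [13, 29, 17, 61, 37, 23, 13, 13, 23, 19]
```

Note this only speeds up the local pandas path; the remote-function path later in the notebook pushes the computation into BigQuery instead.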
```python
%%time
##############################
#    BigQuery DataFrames     #
##############################
import bigframes.pandas as pd

df = pd.read_gbq(TABLE).head(MAX_ROWS)[['id', 'text', 'score']]
df.head(10)
```
CPU times: user 71.2 ms, sys: 0 ns, total: 71.2 ms
Wall time: 1.99 s
Query job ca4f5d64-fcb3-4388-b9f1-924c55c1aaa5 is DONE. 17.2 GB processed.
Query job a283cb39-41b1-44cd-a6c3-f2a2c6a55b25 is DONE. 17.2 GB processed.
| | id | text | score |
|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 |
| 1 | 49684807 | what you have tried so far . ?? | 1 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 |
| 7 | 16498565 | I am including these files on my header of the... | 0 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 |
10 rows × 3 columns
```python
# Tell the user what needs to be done offline before using BigQuery DataFrames
# remote functions
help(pd.remote_function)
```
Help on function remote_function in module bigframes.pandas:
remote_function(input_types: 'List[type]', output_type: 'type', dataset: 'Optional[str]' = None, bigquery_connection: 'Optional[str]' = None, reuse: 'bool' = True, name: 'Optional[str]' = None, packages: 'Optional[Sequence[str]]' = None)
Decorator to turn a user defined function into a BigQuery remote function. Check out
the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
.. note::
Please make sure the following is set up before using this API:
1. Have the below APIs enabled for your project:
* BigQuery Connection API
* Cloud Functions API
* Cloud Run API
* Cloud Build API
* Artifact Registry API
* Cloud Resource Manager API
This can be done from the cloud console (change `PROJECT_ID` to yours):
https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID
Or from the gcloud CLI:
`$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com`
2. Have the following IAM roles enabled for you:
* BigQuery Data Editor (roles/bigquery.dataEditor)
* BigQuery Connection Admin (roles/bigquery.connectionAdmin)
* Cloud Functions Developer (roles/cloudfunctions.developer)
* Service Account User (roles/iam.serviceAccountUser) on the service account `PROJECT_NUMBER-compute@developer.gserviceaccount.com`
* Storage Object Viewer (roles/storage.objectViewer)
* Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
3. Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:
1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
Alternatively, the IAM could also be set up via the gcloud CLI:
`$ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"`.
Args:
input_types (list(type)):
List of input data types in the user defined function.
output_type (type):
Data type of the output in the user defined function.
dataset (str, Optional):
Dataset in which to create a BigQuery remote function. It should be in
`<project_id>.<dataset_name>` or `<dataset_name>` format. If this
parameter is not provided then session dataset id is used.
bigquery_connection (str, Optional):
Name of the BigQuery connection. You should either have the
connection already created in the `location` you have chosen, or
you should have the Project IAM Admin role to enable the service
to create the connection for you if you need it. If this parameter is
not provided then the BigQuery connection from the session is used.
reuse (bool, Optional):
Reuse the remote function if it already exists.
`True` by default, which will result in reusing an existing remote
function and corresponding cloud function (if any) that was
previously created for the same udf.
Setting it to `False` would force creating a unique remote function.
If the required remote function does not exist then it would be
created irrespective of this param.
name (str, Optional):
Explicit name of the persisted BigQuery remote function. Use it with
caution, because two users working in the same project and dataset
could overwrite each other's remote functions if they use the same
persistent name.
packages (str[], Optional):
Explicit name of the external package dependencies. Each dependency
is added to the `requirements.txt` as is, and can be of the form
supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
Returns:
callable: A remote function object pointing to the cloud assets created
in the background to support the remote execution. The cloud assets can be
located through the following properties set in the object:
`bigframes_cloud_function` - The google cloud function deployed for the user defined code.
`bigframes_remote_function` - The bigquery remote function capable of calling into `bigframes_cloud_function`.
```python
from time import perf_counter

start = perf_counter()
# -------------------------------------------------------------------------------------
# User defined function
# https://www.codespeedy.com/find-nth-prime-number-in-python/
@pd.remote_function(reuse=False, cloud_function_service_account="default")
def nth_prime(n: int) -> int:
    prime_numbers = [2, 3]
    i = 3
    if 0 < n <= 2:
        return prime_numbers[n - 1]
    elif n > 2:
        while True:
            i += 1
            status = True
            for j in range(2, int(i / 2) + 1):
                if i % j == 0:
                    status = False
                    break
            if status:
                prime_numbers.append(i)
                if len(prime_numbers) == n:
                    break
        return prime_numbers[n - 1]
    else:
        return -1
# -------------------------------------------------------------------------------------
print(f"\nWall time: {(perf_counter() - start):.4f} s")
```
```python
%%time
# Let's apply the function to the dataframe
df = df.assign(n_prime=df['score'].apply(nth_prime))
df.head(10)
```
CPU times: user 55.8 ms, sys: 182 µs, total: 56 ms
Wall time: 54.5 ms
Query job 762aa155-a3ed-4bab-aca2-a0380ed96ebf is DONE. 17.2 GB processed.
Query job c0a2c187-364d-4978-97bc-30352828f624 is DONE. 17.2 GB processed.
| | id | text | score | n_prime |
|---|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 | -1 |
| 1 | 49684807 | what you have tried so far . ?? | 1 | 2 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 | -1 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 | -1 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 | -1 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 | -1 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 | -1 |
| 7 | 16498565 | I am including these files on my header of the... | 0 | -1 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 | -1 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 | -1 |
10 rows × 4 columns
```python
# We can see the path to the BQ remote function and the google cloud function
# that was created under the hood
print(nth_prime.bigframes_remote_function)
print(nth_prime.bigframes_cloud_function)
```
shobs-test.bigframes_temp_us.bigframes_343b7b4bb93ca8747dae20c22bdaec8b_p27heyce
projects/shobs-test/locations/us-central1/functions/bigframes-343b7b4bb93ca8747dae20c22bdaec8b-p27heyce
```python
# Let's simulate a scenario in which the user shares this remote function with a
# colleague who simply wants to reuse it. BigQuery DataFrames provides an API to
# do so via `read_gbq_function`. Usage details are available via the `help` command.
help(pd.read_gbq_function)
```
Help on function read_gbq_function in module bigframes.pandas:
read_gbq_function(function_name: 'str')
Loads a BigQuery function from BigQuery.
Then it can be applied to a DataFrame or Series.
.. note::
The return type of the function must be explicitly specified in the
function's original definition even if not otherwise required.
BigQuery Utils provides many public functions under the ``bqutil`` project on Google Cloud Platform
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs).
You can check out Community UDFs to use community-contributed functions.
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).
**Examples:**
Use the ``cw_lower_case_ascii_only`` function from Community UDFs.
(https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/cw_lower_case_ascii_only.sqlx)
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
id name
0 1 AURÉLIE
1 2 CÉLESTINE
2 3 DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]
>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")
>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
id name new_name
0 1 AURÉLIE aurÉlie
1 2 CÉLESTINE cÉlestine
2 3 DAPHNÉ daphnÉ
<BLANKLINE>
[3 rows x 3 columns]
Args:
function_name (str):
the function's name in BigQuery in the format
`project_id.dataset_id.function_name`, or
`dataset_id.function_name` to load from the default project, or
`function_name` to load from the default project and the dataset
associated with the current session.
Returns:
callable: A function object pointing to the BigQuery function read
from BigQuery.
The object is similar to the one created by the `remote_function`
decorator, including the `bigframes_remote_function` property, but
not including the `bigframes_cloud_function` property.
```python
EXISTING_REMOTE_FUNCTION = nth_prime.bigframes_remote_function

# Let's read the existing remote function in bigframes
nth_prime_existing = pd.read_gbq_function(EXISTING_REMOTE_FUNCTION)
```
```python
%%time
# Let's apply the existing function to the dataframe
df = df.assign(n_prime_again=df['score'].apply(nth_prime_existing))
df.head(10)
```
CPU times: user 70.8 ms, sys: 3.49 ms, total: 74.3 ms
Wall time: 75.2 ms
Query job f9a0e979-aeac-4ddd-a4c9-6720a5e91009 is DONE. 17.2 GB processed.
Query job 4d3da7ed-42e6-4b2b-b656-ac9ef6d2e871 is DONE. 17.2 GB processed.
| | id | text | score | n_prime | n_prime_again |
|---|---|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 | -1 | -1 |
| 1 | 49684807 | what you have tried so far . ?? | 1 | 2 | 2 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 | -1 | -1 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 | -1 | -1 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 | -1 | -1 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 | -1 | -1 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 | -1 | -1 |
| 7 | 16498565 | I am including these files on my header of the... | 0 | -1 | -1 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 | -1 | -1 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 | -1 | -1 |
10 rows × 5 columns
```python
# Clean up GCP assets created as part of bigframes remote_function
def cleanup_remote_function_assets(remote_udf, ignore_failures=False):
    """Clean up the GCP assets behind a bigframes remote function."""
    session = pd.get_global_session()

    # Clean up the BQ remote function
    try:
        session.bqclient.delete_routine(remote_udf.bigframes_remote_function)
    except Exception:
        # Swallow the exception only if the caller asked to ignore failures
        if not ignore_failures:
            raise

    # Clean up the cloud function
    try:
        session.cloudfunctionsclient.delete_function(
            name=remote_udf.bigframes_cloud_function
        )
    except Exception:
        # Swallow the exception only if the caller asked to ignore failures
        if not ignore_failures:
            raise


cleanup_remote_function_assets(nth_prime)
```