```python
# BigQuery table data on which notebook should be run
TABLE = 'bigquery-public-data.stackoverflow.comments'

# Change this up to test the scale, down to run the notebook faster
MAX_ROWS = 100000
```
```python
%%time
##############################
#        Pandas World        #
##############################
import pandas as pd

df = pd.read_gbq(TABLE, max_results=MAX_ROWS)[['id', 'text', 'score']]
df.head(10)
```
CPU times: user 2.34 s, sys: 307 ms, total: 2.65 s
Wall time: 17.8 s
| | id | text | score |
|---|---|---|---|
| 0 | 10 | It will help if you give some details of which... | 6 |
| 1 | 25 | infact it does. Look a the first lines of your... | 10 |
| 2 | 27 | "Currently + is implemented using StringBuffer... | 7 |
| 3 | 41 | I don't think that's the magic number he was r... | 18 |
| 4 | 59 | It's still very useful to know that magic numb... | 12 |
| 5 | 96 | This implementation is also nice if you wish t... | 9 |
| 6 | 108 | That's not full text searching, it's searching... | 6 |
| 7 | 109 | That's not full text searching, it's searching... | 6 |
| 8 | 137 | In vim you can open > 1 buffer. :e filename. T... | 9 |
| 9 | 154 | Sure, but what about a solution using O(1) mem... | 8 |
```python
# User defined function
# https://www.codespeedy.com/find-nth-prime-number-in-python/
def nth_prime(n: int) -> int:
    prime_numbers = [2, 3]
    i = 3
    if 0 < n <= 2:
        return prime_numbers[n - 1]
    elif n > 2:
        while True:
            i += 1
            status = True
            for j in range(2, int(i / 2) + 1):
                if i % j == 0:
                    status = False
                    break
            if status:
                prime_numbers.append(i)
                if len(prime_numbers) == n:
                    break
        return prime_numbers[n - 1]
    else:
        return -1
```
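Before applying this to 100,000 rows (or shipping it to BigQuery later), a quick local spot check is cheap insurance. The sketch below is self-contained: it uses a compact trial-division reimplementation with the same semantics as the `nth_prime` above (including the `-1` sentinel for non-positive input), not the notebook's exact code.

```python
# Compact trial-division stand-in with the same semantics as nth_prime above.
def nth_prime(n: int) -> int:
    if n <= 0:
        return -1  # same sentinel as the notebook's function
    primes = [2]
    candidate = 2
    while len(primes) < n:
        candidate += 1
        if all(candidate % p for p in primes):
            primes.append(candidate)
    return primes[n - 1]

# Spot checks against known primes.
assert [nth_prime(k) for k in range(1, 7)] == [2, 3, 5, 7, 11, 13]
assert nth_prime(25) == 97
assert nth_prime(0) == -1
```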
```python
%%time
df = df.assign(n_prime=df['score'].apply(nth_prime))
df.head(10)
```
CPU times: user 3.32 s, sys: 0 ns, total: 3.32 s
Wall time: 3.32 s
| | id | text | score | n_prime |
|---|---|---|---|---|
| 0 | 10 | It will help if you give some details of which... | 6 | 13 |
| 1 | 25 | infact it does. Look a the first lines of your... | 10 | 29 |
| 2 | 27 | "Currently + is implemented using StringBuffer... | 7 | 17 |
| 3 | 41 | I don't think that's the magic number he was r... | 18 | 61 |
| 4 | 59 | It's still very useful to know that magic numb... | 12 | 37 |
| 5 | 96 | This implementation is also nice if you wish t... | 9 | 23 |
| 6 | 108 | That's not full text searching, it's searching... | 6 | 13 |
| 7 | 109 | That's not full text searching, it's searching... | 6 | 13 |
| 8 | 137 | In vim you can open > 1 buffer. :e filename. T... | 9 | 23 |
| 9 | 154 | Sure, but what about a solution using O(1) mem... | 8 | 19 |
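Since `Series.apply` calls `nth_prime` once per row and comment scores repeat heavily, memoizing the function is an easy way to cut the local wall time. A minimal sketch with `functools.lru_cache`, again using a compact trial-division stand-in rather than the notebook's exact function:

```python
import functools

import pandas as pd


@functools.lru_cache(maxsize=None)
def nth_prime(n: int) -> int:
    # Compact stand-in for the notebook's nth_prime, with the same -1 sentinel.
    if n <= 0:
        return -1
    primes = [2]
    candidate = 2
    while len(primes) < n:
        candidate += 1
        if all(candidate % p for p in primes):
            primes.append(candidate)
    return primes[n - 1]


# Scores like those in the rows above; repeated values hit the cache.
scores = pd.Series([6, 10, 7, 18, 12, 9, 6, 6, 9, 8])
print(scores.apply(nth_prime).tolist())  # → [13, 29, 17, 61, 37, 23, 13, 13, 23, 19]
```

Note this only speeds up the local pandas path; the remote-function path later in the notebook pushes the computation into BigQuery instead.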
```python
%%time
##############################
#    BigQuery DataFrames     #
##############################
import bigframes.pandas as pd

df = pd.read_gbq(TABLE).head(MAX_ROWS)[['id', 'text', 'score']]
df.head(10)
```
CPU times: user 71.2 ms, sys: 0 ns, total: 71.2 ms
Wall time: 1.99 s
Query job ca4f5d64-fcb3-4388-b9f1-924c55c1aaa5 is DONE. 17.2 GB processed.
Query job a283cb39-41b1-44cd-a6c3-f2a2c6a55b25 is DONE. 17.2 GB processed.
| | id | text | score |
|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 |
| 1 | 49684807 | what you have tried so far . ?? | 1 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 |
| 7 | 16498565 | I am including these files on my header of the... | 0 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 |
10 rows × 3 columns
```python
# Tell the user what needs to be done offline before using BigQuery DataFrames
# remote functions
help(pd.remote_function)
```
Help on function remote_function in module bigframes.pandas:
remote_function(input_types: 'List[type]', output_type: 'type', dataset: 'Optional[str]' = None, bigquery_connection: 'Optional[str]' = None, reuse: 'bool' = True, name: 'Optional[str]' = None, packages: 'Optional[Sequence[str]]' = None)
Decorator to turn a user defined function into a BigQuery remote function. Check out
the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.
.. note::
Please make sure the following is set up before using this API:
1. Have the below APIs enabled for your project:
* BigQuery Connection API
* Cloud Functions API
* Cloud Run API
* Cloud Build API
* Artifact Registry API
* Cloud Resource Manager API
This can be done from the cloud console (change `PROJECT_ID` to yours):
https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID
Or from the gcloud CLI:
`$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com`
2. Have the following IAM roles enabled for you:
* BigQuery Data Editor (roles/bigquery.dataEditor)
* BigQuery Connection Admin (roles/bigquery.connectionAdmin)
* Cloud Functions Developer (roles/cloudfunctions.developer)
* Service Account User (roles/iam.serviceAccountUser) on the service account `PROJECT_NUMBER-compute@developer.gserviceaccount.com`
* Storage Object Viewer (roles/storage.objectViewer)
* Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the bigquery connection being used is not pre-created and is created dynamically with user credentials.)
3. Either the user has setIamPolicy privilege on the project, or a BigQuery connection is pre-created with necessary IAM role set:
1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
Alternatively, the IAM could also be set up via the gcloud CLI:
`$ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"`.
Args:
input_types (list(type)):
List of input data types in the user defined function.
output_type (type):
Data type of the output in the user defined function.
dataset (str, Optional):
Dataset in which to create a BigQuery remote function. It should be in
`<project_id>.<dataset_name>` or `<dataset_name>` format. If this
parameter is not provided then session dataset id is used.
bigquery_connection (str, Optional):
Name of the BigQuery connection. You should either have the
connection already created in the `location` you have chosen, or
you should have the Project IAM Admin role to enable the service
to create the connection for you if you need it. If this parameter is
not provided then the BigQuery connection from the session is used.
reuse (bool, Optional):
Reuse the remote function if it already exists.
`True` by default, which will result in reusing an existing remote
function and corresponding cloud function (if any) that was
previously created for the same udf.
Setting it to `False` would force creating a unique remote function.
If the required remote function does not exist then it would be
created irrespective of this param.
name (str, Optional):
Explicit name of the persisted BigQuery remote function. Use it with
caution, because two users working in the same project and dataset
could overwrite each other's remote functions if they use the same
persistent name.
packages (str[], Optional):
Explicit name of the external package dependencies. Each dependency
is added to the `requirements.txt` as is, and can be of the form
supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.
Returns:
callable: A remote function object pointing to the cloud assets created
in the background to support the remote execution. The cloud assets can be
located through the following properties set in the object:
`bigframes_cloud_function` - The google cloud function deployed for the user defined code.
`bigframes_remote_function` - The bigquery remote function capable of calling into `bigframes_cloud_function`.
```python
from time import perf_counter

start = perf_counter()
# -------------------------------------------------------------------------------------
# User defined function
# https://www.codespeedy.com/find-nth-prime-number-in-python/
@pd.remote_function(reuse=False, cloud_function_service_account="default")
def nth_prime(n: int) -> int:
    prime_numbers = [2, 3]
    i = 3
    if 0 < n <= 2:
        return prime_numbers[n - 1]
    elif n > 2:
        while True:
            i += 1
            status = True
            for j in range(2, int(i / 2) + 1):
                if i % j == 0:
                    status = False
                    break
            if status:
                prime_numbers.append(i)
                if len(prime_numbers) == n:
                    break
        return prime_numbers[n - 1]
    else:
        return -1
# -------------------------------------------------------------------------------------
print(f"\nWall time: {(perf_counter() - start):.4f} s")
```
```python
%%time
# Let's apply the function to the dataframe
df = df.assign(n_prime=df['score'].apply(nth_prime))
df.head(10)
```
CPU times: user 55.8 ms, sys: 182 µs, total: 56 ms
Wall time: 54.5 ms
Query job 762aa155-a3ed-4bab-aca2-a0380ed96ebf is DONE. 17.2 GB processed.
Query job c0a2c187-364d-4978-97bc-30352828f624 is DONE. 17.2 GB processed.
| | id | text | score | n_prime |
|---|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 | -1 |
| 1 | 49684807 | what you have tried so far . ?? | 1 | 2 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 | -1 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 | -1 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 | -1 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 | -1 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 | -1 |
| 7 | 16498565 | I am including these files on my header of the... | 0 | -1 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 | -1 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 | -1 |
10 rows × 4 columns
```python
# We can see the path to the BQ remote function and the google cloud function
# that was created under the hood
print(nth_prime.bigframes_remote_function)
print(nth_prime.bigframes_cloud_function)
```
shobs-test.bigframes_temp_us.bigframes_343b7b4bb93ca8747dae20c22bdaec8b_p27heyce
projects/shobs-test/locations/us-central1/functions/bigframes-343b7b4bb93ca8747dae20c22bdaec8b-p27heyce
```python
# Let's simulate a scenario in which the user shares this remote function with a
# colleague who simply wants to reuse it. BigQuery DataFrames provides an API to
# do so via `read_gbq_function`. Usage details are available via the `help` command.
help(pd.read_gbq_function)
```
Help on function read_gbq_function in module bigframes.pandas:
read_gbq_function(function_name: 'str')
Loads a BigQuery function from BigQuery.
Then it can be applied to a DataFrame or Series.
.. note::
The return type of the function must be explicitly specified in the
function's original definition even if not otherwise required.
BigQuery Utils provides many public functions under the ``bqutil`` project on Google Cloud Platform
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs#using-the-udfs).
You can check out Community UDFs to use community-contributed functions.
(See: https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community#community-udfs).
**Examples:**
Use the ``cw_lower_case_ascii_only`` function from Community UDFs.
(https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/cw_lower_case_ascii_only.sqlx)
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> df = bpd.DataFrame({'id': [1, 2, 3], 'name': ['AURÉLIE', 'CÉLESTINE', 'DAPHNÉ']})
>>> df
id name
0 1 AURÉLIE
1 2 CÉLESTINE
2 3 DAPHNÉ
<BLANKLINE>
[3 rows x 2 columns]
>>> func = bpd.read_gbq_function("bqutil.fn.cw_lower_case_ascii_only")
>>> df1 = df.assign(new_name=df['name'].apply(func))
>>> df1
id name new_name
0 1 AURÉLIE aurÉlie
1 2 CÉLESTINE cÉlestine
2 3 DAPHNÉ daphnÉ
<BLANKLINE>
[3 rows x 3 columns]
Args:
function_name (str):
the function's name in BigQuery in the format
`project_id.dataset_id.function_name`, or
`dataset_id.function_name` to load from the default project, or
`function_name` to load from the default project and the dataset
associated with the current session.
Returns:
callable: A function object pointing to the BigQuery function read
from BigQuery.
The object is similar to the one created by the `remote_function`
decorator, including the `bigframes_remote_function` property, but
not including the `bigframes_cloud_function` property.
```python
EXISTING_REMOTE_FUNCTION = nth_prime.bigframes_remote_function

# Let's read the existing remote function in bigframes
nth_prime_existing = pd.read_gbq_function(EXISTING_REMOTE_FUNCTION)
```
```python
%%time
# Let's apply the existing function to the dataframe
df = df.assign(n_prime_again=df['score'].apply(nth_prime_existing))
df.head(10)
```
CPU times: user 70.8 ms, sys: 3.49 ms, total: 74.3 ms
Wall time: 75.2 ms
Query job f9a0e979-aeac-4ddd-a4c9-6720a5e91009 is DONE. 17.2 GB processed.
Query job 4d3da7ed-42e6-4b2b-b656-ac9ef6d2e871 is DONE. 17.2 GB processed.
| | id | text | score | n_prime | n_prime_again |
|---|---|---|---|---|---|
| 0 | 11231597 | In your update, why are some of the system fun... | 0 | -1 | -1 |
| 1 | 49684807 | what you have tried so far . ?? | 1 | 2 | 2 |
| 2 | 7623925 | @Michael: It should work. Perhaps you looked i... | 0 | -1 | -1 |
| 3 | 34046685 | Will it work with SQL compact? Please excuse m... | 0 | -1 | -1 |
| 4 | 6426146 | do you know the equation to your pdf? | 0 | -1 | -1 |
| 5 | 60686114 | m sorry but at least you have to think about it. | 0 | -1 | -1 |
| 6 | 16631986 | i think also making disable this by only jquer... | 0 | -1 | -1 |
| 7 | 16498565 | I am including these files on my header of the... | 0 | -1 | -1 |
| 8 | 26601001 | wrong answer, you didn't understand the logic | 0 | -1 | -1 |
| 9 | 73255842 | Call the setOnClickListener before return row. | 0 | -1 | -1 |
10 rows × 5 columns
```python
# Clean up GCP assets created as part of bigframes remote_function
def cleanup_remote_function_assets(remote_udf, ignore_failures=False):
    """Clean up the GCP assets behind a bigframes remote function."""
    session = pd.get_global_session()

    # Clean up the BQ remote function
    try:
        session.bqclient.delete_routine(remote_udf.bigframes_remote_function)
    except Exception:
        # Swallow the exception only if the caller asked to ignore failures
        if not ignore_failures:
            raise

    # Clean up the cloud function
    try:
        session.cloudfunctionsclient.delete_function(
            name=remote_udf.bigframes_cloud_function
        )
    except Exception:
        # Swallow the exception only if the caller asked to ignore failures
        if not ignore_failures:
            raise


cleanup_remote_function_assets(nth_prime)
```