bigframes.bigquery.vector_search — bigframes documentation

bigframes.bigquery.vector_search(base_table: str, column_to_search: str, query: dataframe.DataFrame | series.Series, *, query_column_to_search: str | None = None, top_k: int | None = None, distance_type: Literal['euclidean', 'cosine', 'dot_product'] | None = None, fraction_lists_to_search: float | None = None, use_brute_force: bool | None = None, allow_large_results: bool | None = None) → dataframe.DataFrame

Performs a vector search over embeddings to find semantically similar entities.

This method calls the VECTOR_SEARCH() SQL function.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq

Pass a DataFrame of embeddings for which to find nearest neighbors. Its ARRAY<FLOAT64> column is used as the search query:

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2).sort_values("id")
  query_id  embedding  id my_embedding  distance
0      dog    [1. 2.]   1      [1. 2.]       0.0
1      cat  [3.  5.2]   2      [2. 4.]   1.56205
0      dog    [1. 2.]   4    [1.  3.2]       1.2
1      cat  [3.  5.2]   5    [5.  5.4]  2.009975

[4 rows x 5 columns]

Pass a Series of embeddings for which to find nearest neighbors:

>>> search_query = bpd.Series([[1.0, 2.0], [3.0, 5.2]],
...                            index=["dog", "cat"],
...                            name="embedding")
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2,
...             use_brute_force=True).sort_values("id")
     embedding  id my_embedding  distance
dog    [1. 2.]   1      [1. 2.]       0.0
cat  [3.  5.2]   2      [2. 4.]   1.56205
dog    [1. 2.]   4    [1.  3.2]       1.2
cat  [3.  5.2]   5    [5.  5.4]  2.009975

[4 rows x 4 columns]
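
To make the output above concrete, here is a minimal local sketch of what a brute-force Euclidean search computes. It is not how bigframes or BigQuery implement VECTOR_SEARCH(). The base rows are an assumption, reconstructed from the ids and embeddings that appear in the example outputs on this page (the real table may hold other rows), and brute_force_top_k is a hypothetical helper name.

```python
import math

# Assumed base-table rows (id -> my_embedding), reconstructed from the
# example outputs on this page; the real table may contain other rows.
BASE_ROWS = {
    1: [1.0, 2.0],
    2: [2.0, 4.0],
    3: [1.5, 7.0],
    4: [1.0, 3.2],
    5: [5.0, 5.4],
}


def brute_force_top_k(query_embedding, base_rows, top_k):
    """Hypothetical helper: score every base row, keep the top_k closest."""
    scored = [
        (row_id, math.dist(query_embedding, embedding))  # Euclidean distance
        for row_id, embedding in base_rows.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


queries = {"dog": [1.0, 2.0], "cat": [3.0, 5.2]}
results = {
    name: brute_force_top_k(vec, BASE_ROWS, top_k=2)
    for name, vec in queries.items()
}

for name, neighbors in results.items():
    for row_id, distance in neighbors:
        print(name, row_id, round(distance, 6))
```

Each query row contributes up to top_k result rows, which is why two queries with top_k=2 produce the four rows shown above.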

You can also specify the distance type and the name of the column in the query DataFrame that holds the embeddings. If you specify query_column_to_search, the search uses that column as the embeddings for which to find nearest neighbors. Otherwise, it uses the column_to_search value.

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]],
...                               "another_embedding": [[0.7, 2.2], [3.3, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             distance_type="cosine",
...             query_column_to_search="another_embedding",
...             top_k=2).sort_values("id")
  query_id  embedding another_embedding  id my_embedding  distance
1      cat  [3.  5.2]         [3.3 5.2]   1      [1. 2.]  0.005181
1      cat  [3.  5.2]         [3.3 5.2]   2      [2. 4.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   3    [1.5 7. ]  0.004697
0      dog    [1. 2.]         [0.7 2.2]   4    [1.  3.2]  0.000013

[4 rows x 6 columns]
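
The cosine output above lists base rows 1 and 2 at the same distance from the "cat" query. The following sketch, assuming the usual definition of cosine distance as one minus cosine similarity, suggests why: [2. 4.] is a scaled copy of [1. 2.], and cosine distance depends only on direction. The function name cosine_distance is illustrative and not part of bigframes.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


cat_query = [3.3, 5.2]  # the "another_embedding" value for "cat"

d1 = cosine_distance(cat_query, [1.0, 2.0])  # base row id 1
d2 = cosine_distance(cat_query, [2.0, 4.0])  # base row id 2, same direction

print(round(d1, 6), round(d2, 6))
```

With Euclidean distance the two rows would not tie, so the choice of distance_type can change which neighbors are returned.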

Parameters:

base_table (str) – The table to search for nearest neighbor embeddings.
column_to_search (str) – The name of the base table column to search for nearest neighbor embeddings. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL.
query (bigframes.dataframe.DataFrame | bigframes.series.Series) – A Series or DataFrame that provides the embeddings for which to find nearest neighbors.
query_column_to_search (str) – Specifies the name of the column in the query that contains the embeddings for which to find nearest neighbors. The column must have a type of ARRAY<FLOAT64>. All elements in the array must be non-NULL and all values in the column must have the same array dimensions as the values in the column_to_search column. Can only be set when query is a DataFrame.
top_k (int) – Specifies the number of nearest neighbors to return. Defaults to 10.
distance_type (str, default "euclidean") – Specifies the type of metric to use to compute the distance between two vectors. Possible values are “euclidean”, “cosine” and “dot_product”. Defaults to “euclidean”.
fraction_lists_to_search (float, range in [0.0, 1.0]) – Specifies the fraction of lists to search. A higher value leads to higher recall and slower performance; a lower value does the converse. It is only used when a vector index is also used. You can only specify fraction_lists_to_search when use_brute_force is set to False.
use_brute_force (bool) – Determines whether to use brute force search by skipping the vector index if one is available. Defaults to False.
allow_large_results (bool, optional) – Whether to allow large query results. If True, the query results can be larger than the maximum response size. Defaults to bpd.options.compute.allow_large_results.
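
Several of the parameters above constrain each other. The sketch below collects those documented constraints into a local pre-flight check. check_vector_search_args is a hypothetical helper, not part of bigframes, and it makes no claim about what errors the library itself raises.

```python
def check_vector_search_args(
    query_is_dataframe,
    query_column_to_search=None,
    fraction_lists_to_search=None,
    use_brute_force=None,
):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    problems = []

    # query_column_to_search can only be set when query is a DataFrame.
    if query_column_to_search is not None and not query_is_dataframe:
        problems.append("query_column_to_search requires a DataFrame query")

    if fraction_lists_to_search is not None:
        # Documented range is [0.0, 1.0].
        if not 0.0 <= fraction_lists_to_search <= 1.0:
            problems.append("fraction_lists_to_search must be in [0.0, 1.0]")
        # Only allowed when use_brute_force is False.
        if use_brute_force:
            problems.append(
                "fraction_lists_to_search cannot be combined with "
                "use_brute_force=True"
            )

    return problems


print(check_vector_search_args(query_is_dataframe=True,
                               fraction_lists_to_search=0.25))
print(check_vector_search_args(query_is_dataframe=False,
                               query_column_to_search="another_embedding",
                               fraction_lists_to_search=0.5,
                               use_brute_force=True))
```

The first call returns an empty list. The second reports two problems, one for each constraint it violates.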

Returns:

A DataFrame containing the vector search results.

Return type:

bigframes.dataframe.DataFrame