[DO NOT MERGE] Pickle vs. Arrow Type Coercion Difference by xinrong-meng · Pull Request #41706 · apache/spark

added 2 commits

June 22, 2023 16:23

xinrong-meng added a commit that referenced this pull request

Jun 29, 2023
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between type coercion rules of Arrow and Pickle(used by the default Python UDF) separately.

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
df.select(pandas_udf(lambda x: x, 'int')('value')).show()
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41503 from xinrong-meng/type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>

HyukjinKwon pushed a commit that referenced this pull request

Jul 3, 2023
…return type in Arrow Python UDF

### What changes were proposed in this pull request?
Explicit Arrow casting for the mismatched return type of Arrow Python UDF.

### Why are the changes needed?
A more standardized and coherent type coercion.

Please refer to #41706 for a comprehensive comparison between type coercion rules of Arrow and Pickle(used by the default Python UDF) separately.

See more at [[Design] Type-coercion in Arrow Python UDFs](https://docs.google.com/document/d/e/2PACX-1vTEGElOZfhl9NfgbBw4CTrlm-8F_xQCAKNOXouz-7mg5vYobS7lCGUsGkDZxPY0wV5YkgoZmkYlxccU/pub).

### Does this PR introduce _any_ user-facing change?
Yes.

FROM
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
df.select(pandas_udf(lambda x: x, 'int')('value')).show()
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
...
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int32
```

TO
```py
>>> df = spark.createDataFrame(['1', '2'], schema='string')
>>> df.select(pandas_udf(lambda x: x, 'int')('value')).show()
+---------------+
|<lambda>(value)|
+---------------+
|              1|
|              2|
+---------------+
```
### How was this patch tested?
Unit tests.

Closes #41800 from xinrong-meng/snd_type_coersion.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>