[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types by codesorcery · Pull Request #47083 · apache/spark


HyukjinKwon pushed a commit that referenced this pull request

Jul 3, 2024
…1.15,<2)

### What changes were proposed in this pull request?
 * Add a constraint for `numpy<2` to the PySpark package
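In `setuptools` terms, such a bound is expressed in the package's extras. A minimal sketch of the idea (the package name and extras layout here are illustrative, not Spark's actual `setup.py`):

```python
# Illustrative setup.py fragment: the "<2" upper bound keeps pip from
# resolving NumPy 2.x for these extras. Names are examples only.
from setuptools import setup

setup(
    name="example-pkg",
    version="0.1",
    extras_require={
        "ml": ["numpy>=1.15,<2"],
        "sql": ["pandas>=1.0.5", "numpy>=1.15,<2"],
    },
)
```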

### Why are the changes needed?

PySpark references some code that was removed in NumPy 2.0, so running PySpark with `numpy>=2` installed may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds a version bound for the older release branches, where that change will not be applied.

### Does this PR introduce _any_ user-facing change?
NumPy will be limited to `numpy<2` when installing `pyspark` with the extras `ml`, `mllib`, `sql`, `pandas_on_spark`, or `connect`.
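The effect of the bound can be sketched with a stdlib-only check (a hypothetical helper, not PySpark code, mirroring the `>=1.15,<2` range implied by the commit title):

```python
# Hypothetical helper (not PySpark code): check a NumPy version string
# against the ">=1.15,<2" range, using only the standard library.
def _major_minor(version: str) -> tuple:
    """Return the (major, minor) components of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])

def numpy_version_ok(version: str) -> bool:
    """True if the version satisfies >=1.15,<2."""
    return (1, 15) <= _major_minor(version) < (2, 0)

print(numpy_version_ok("1.26.4"))  # True: within the allowed range
print(numpy_version_ok("2.0.0"))   # False: excluded by the <2 upper bound
```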

### How was this patch tested?
Via existing CI jobs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <6949483+codesorcery@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

HyukjinKwon pushed a commit that referenced this pull request

Jul 3, 2024
…1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <6949483+codesorcery@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 44eba46)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

gaecoli pushed a commit to gaecoli/spark that referenced this pull request

Jul 10, 2024
…1.15,<2)

Closes apache#47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <6949483+codesorcery@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>


HyukjinKwon pushed a commit that referenced this pull request

Jul 30, 2024
…ptional dependencies

### What changes were proposed in this pull request?

This is a follow-up of #47083 to recover PySpark RDD tests.

### Why are the changes needed?

The `PySpark Core` tests should not fail due to missing optional dependencies.

**BEFORE**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

**AFTER**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
Tests passed in 189 seconds

Skipped tests in pyspark.tests.test_memory_profiler with python3:
    test_assert_vanilla_mode (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_assert_vanilla_mode) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_aggregate_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_aggregate_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_clear) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_map_in_pandas_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_map_in_pandas_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_iterator_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_iterator_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_window (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_window) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_multiple_actions (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_multiple_actions) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_registered (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_registered) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_with_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_with_arrow) ... skipped 'Must have memory-profiler installed.'
    test_profilers_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_profilers_clear) ... skipped 'Must have memory-profiler installed.'
    test_code_map (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_code_map) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_memory_profiler) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_function_api) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_udf_line_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_udf_line_profiler) ... skipped 'Must have memory-profiler installed.'

Skipped tests in pyspark.tests.test_rdd with python3:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock) ... skipped 'NumPy or Pandas not installed'

Skipped tests in pyspark.tests.test_serializers with python3:
    test_statcounter_array (pyspark.tests.test_serializers.NumPyTests.test_statcounter_array) ... skipped 'NumPy not installed'
    test_serialize (pyspark.tests.test_serializers.SciPyTests.test_serialize) ... skipped 'SciPy not installed'

Skipped tests in pyspark.tests.test_worker with python3:
    test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest.test_memory_limit) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux."
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultNonDaemonTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
```
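The underlying fix follows a common guarded-import pattern: probe for the optional dependency and skip the dependent tests, rather than crash, when it is absent. A generic stdlib-only sketch (not Spark's actual test harness; names are illustrative):

```python
# Generic sketch of skipping optional-dependency tests instead of failing.
# Not Spark's actual code; module names are illustrative.
import importlib.util

def module_available(name: str) -> bool:
    """Return True if the module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None

def run_optional_tests(dependency: str) -> str:
    """Run dependency-backed tests only when the dependency is present."""
    if not module_available(dependency):
        return f"skipped: {dependency} not installed"
    return f"ran tests backed by {dependency}"

print(module_available("sys"))  # True: 'sys' is always importable
print(run_optional_tests("not_a_real_module_xyz"))
```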

### Does this PR introduce _any_ user-facing change?

No. The failure only occurs during testing.

### How was this patch tested?

Pass the CIs and run the tests manually without the optional dependencies installed.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47526 from dongjoon-hyun/SPARK-48710.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

sunchao pushed a commit that referenced this pull request

Mar 10, 2026
…1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <6949483+codesorcery@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>