Unify Table representations by timsaucer · Pull Request #1256 · apache/datafusion-python

and others added 30 commits

September 18, 2025 15:11
docs/tests, add DataFrame view support, and improve Send/concurrency
support.

migrates the codebase from using `Table` to a
`TableProvider`-based API, refactors registration and access paths to
simplify catalog/context interactions, and updates documentation and
examples. DataFrame view handling is improved (`into_view` is now
public), the test-suite is expanded to cover new registration and async
SQL scenarios, and `TableProvider` now supports the `Send` trait across
modules for safer concurrency. Minor import cleanup and utility
adjustments (including a refined `pyany_to_table_provider`) are
included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule
improvements.

-- Registration logic & API

* Refactor table provider registration logic for improved clarity and
  simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces
  surprising side effects).
* Update table registration method to call `register_table` instead of
  `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more
  provider types can be registered uniformly.
* Improve error messages related to registration failures (missing
  PyCapsule name and DataFrame registration errors).
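
The uniform registration path described above can be pictured with a small stdlib-only toy model. It is not the datafusion-python implementation: `ToyContext`, `ToyDataFrame`, and `ToyView` are hypothetical stand-ins, and the `__datafusion_table_provider__` hook name is assumed here to be the method that `TableProviderExportable` objects expose.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Structural type: any object exposing the (assumed) export hook."""

    def __datafusion_table_provider__(self) -> object: ...


class ToyView:
    """Hypothetical view wrapper that knows how to export a provider."""

    def __init__(self, source: Any) -> None:
        self.source = source

    def __datafusion_table_provider__(self) -> object:
        # A real implementation would return a PyCapsule; a tuple stands in.
        return ("provider-for", self.source)


class ToyDataFrame:
    """Hypothetical DataFrame stand-in; it has no export hook of its own."""

    def into_view(self) -> ToyView:
        return ToyView(self)


class ToyContext:
    """Hypothetical context with a single registration entry point."""

    def __init__(self) -> None:
        self.tables: dict[str, object] = {}

    def register_table(self, name: str, table: Any) -> None:
        # Accept anything that exposes the hook; reject everything else
        # with a TypeError that says what was expected.
        if not isinstance(table, TableProviderExportable):
            raise TypeError(
                f"Expected an object with __datafusion_table_provider__, "
                f"got {type(table).__name__}"
            )
        self.tables[name] = table.__datafusion_table_provider__()
```

In this toy, `ToyContext().register_table("t", ToyDataFrame().into_view())` succeeds, while passing the bare `ToyDataFrame` raises `TypeError`.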

-- DataFrame ↔ TableProvider conversions

* Introduce utility functions to simplify table provider conversions and
  centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update
  `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling

* Update `FFI_TableProvider` initialization to accept an optional
  parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize
  pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help
  debugging.
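
To illustrate the missing-name check, here is a rough CPython-only sketch that builds capsules through `ctypes.pythonapi` and validates their names. The helpers `make_capsule` and `validate_capsule` are hypothetical, and the expected name `datafusion_table_provider` is an assumption; the actual check in this PR lives in Rust.

```python
import ctypes

# Assumed capsule name; treat it as illustrative only.
EXPECTED_NAME = b"datafusion_table_provider"

_api = ctypes.pythonapi
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_api.PyCapsule_GetName.restype = ctypes.c_char_p
_api.PyCapsule_GetName.argtypes = [ctypes.py_object]

# Dummy memory for the capsule to point at (the pointer must be non-NULL).
_payload = ctypes.c_int(0)


def make_capsule(name):
    """Create a capsule; `name` is bytes or None and must outlive the capsule."""
    return _api.PyCapsule_New(ctypes.addressof(_payload), name, None)


def validate_capsule(capsule):
    """Return the capsule name, raising a descriptive error if it is unusable."""
    name = _api.PyCapsule_GetName(capsule)
    if name is None:
        raise ValueError(
            "PyCapsule has no name; expected 'datafusion_table_provider'"
        )
    if name != EXPECTED_NAME:
        raise ValueError(
            f"Unexpected PyCapsule name {name.decode()!r}; "
            "expected 'datafusion_table_provider'"
        )
    return name.decode()
```

A capsule created with `make_capsule(None)` triggers the first error, which is the case the improved message targets.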

-- DeltaTable & specific integrations

* Update TableProvider registration for `DeltaTable` to use the correct
  registration method (matches the new API surface).

-- Tests, docs & minor fixes

* Add tests for registering a `TableProvider` from a `DataFrame` and
  from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the
  error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around
  `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from `TableProvider` trait objects to increase type flexibility; review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.
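
The `stacklevel` adjustment mentioned above can be sketched as follows. This `TableProvider` is a hypothetical stand-in, and treating `from_dataframe` as the replacement for `from_view` is an assumption made for illustration.

```python
import warnings


class TableProvider:
    """Hypothetical stand-in, not the real datafusion class."""

    def __init__(self, source):
        self.source = source

    @classmethod
    def from_dataframe(cls, df):
        return cls(df)

    @classmethod
    def from_view(cls, df):
        # stacklevel=2 attributes the warning to the caller of from_view
        # rather than to this line inside the library.
        warnings.warn(
            "TableProvider.from_view is deprecated; use from_dataframe instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls.from_dataframe(df)
```

With `stacklevel=1` (the default) the reported location would be inside `from_view` itself, which is rarely what a user needs when hunting down deprecated call sites.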

Notes:

* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow
  across the codebase; multiple commits centralize provider coercion and
  normalization.
* Introduced utility helpers (`coerce_table_provider`,
  `extract_table_provider`, `_normalize_table_provider`) to centralize
  extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer
  implementations, direct returns of DataFrame views where appropriate,
  and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for
  invalid registrations; added handling for `DataFrameWrapper` by
  converting it to a view.
* Added tests, including a schema registration test using a PyArrow
  dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with
  parameter details, added usage examples for `into_view`, and
  documented deprecations (e.g., `register_table_provider` →
  `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so
  warnings point to caller code; improved `__dir__` to return sorted,
  unique attributes.
* Cleanup: removed unused imports (including an unused error import from
  `utils.rs`) and other dead code to reduce noise.
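
As an illustration of the `__dir__` change, a minimal sketch of a forwarding wrapper that reports sorted, de-duplicated attribute names. The `Wrapper` class is hypothetical; the source does not say which class received this fix.

```python
class Wrapper:
    """Hypothetical proxy that forwards unknown attributes to an inner object."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion
        # if _inner itself has not been set yet.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)

    def __dir__(self):
        # Merge the wrapper's own names with the inner object's names,
        # dropping duplicates and returning them in sorted order.
        own = set(super().__dir__())
        forwarded = set(dir(self._inner))
        return sorted(own | forwarded)
```

Names such as `__class__` exist on both the wrapper and the inner object, so a naive list concatenation would report them twice; the set union avoids that.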
…d avoid documentation duplication

@timsaucer
