Readme Updates -2 by sumitmsft · Pull Request #5 · microsoft/mssql-python


bewithgaurav added a commit that referenced this pull request

Nov 10, 2025
- Add ColumnProcessor typedef and ColumnProcessors namespace with specialized handlers
- Build columnProcessors[] array once per batch (switch executed per column, not per cell)
- Fast path: Direct function call for common types (INT, VARCHAR, etc.)
- Slow path: Fallback switch for complex types (DECIMAL, DATETIME, GUID)
- Eliminates 99.99% of switch evaluations (20 vs 200,000 for 10K rows × 20 cols)
- All processor functions use direct Python C API (PyList_SET_ITEM, PyLong_FromLong, etc.)
- Expected improvement: 60% reduction in CPU cycles per cell, better branch prediction
- This completes the optimization sequence - file now matches ddbc_bindings_profiled_optimized.cpp line-by-line
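
The 20 vs 200,000 figure above can be reproduced with a small counting sketch. It is not the code from ddbc_bindings.cpp: `ColType`, `CellHandler`, `DispatchStats`, and the `Handle*` functions are hypothetical stand-ins, and the handlers do no real work. The only point is to show where the switch runs in each strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for SQL type codes (the real code switches on ODBC types).
enum class ColType { Int, Double, Other };

// Counts how often the type switch runs so the two strategies can be compared.
struct DispatchStats { std::size_t switchEvaluations = 0; };

// Hypothetical handler signature; the real ColumnProcessor takes different arguments.
using CellHandler = void (*)(std::size_t row, std::size_t col);

inline void HandleInt(std::size_t, std::size_t) {}
inline void HandleDouble(std::size_t, std::size_t) {}
inline void HandleOther(std::size_t, std::size_t) {}

// The only place the switch lives; every call is counted.
inline CellHandler SelectHandler(ColType type, DispatchStats& stats) {
    ++stats.switchEvaluations;
    switch (type) {
        case ColType::Int:    return HandleInt;
        case ColType::Double: return HandleDouble;
        default:              return HandleOther;
    }
}

// Before: the switch is evaluated for every cell.
inline DispatchStats FetchSwitchPerCell(const std::vector<ColType>& cols, std::size_t rows) {
    DispatchStats stats;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols.size(); ++c)
            SelectHandler(cols[c], stats)(r, c);
    return stats;
}

// After: the switch is evaluated once per column, then the hot loop only
// calls through the prebuilt table.
inline DispatchStats FetchSwitchPerColumn(const std::vector<ColType>& cols, std::size_t rows) {
    DispatchStats stats;
    std::vector<CellHandler> handlers;
    handlers.reserve(cols.size());
    for (ColType type : cols) handlers.push_back(SelectHandler(type, stats));
    assert(handlers.size() == cols.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols.size(); ++c)
            handlers[c](r, c);
    return stats;
}
```

With 20 columns and 10,000 rows, `FetchSwitchPerCell` records 200,000 switch evaluations and `FetchSwitchPerColumn` records 20.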

bewithgaurav added a commit that referenced this pull request

Nov 10, 2025
- Created typedef ColumnProcessor for function pointer type
- Added ColumnProcessors namespace with specialized inline processors:
  * ProcessInteger, ProcessSmallInt, ProcessBigInt, ProcessTinyInt, ProcessBit
  * ProcessReal, ProcessDouble
  * ProcessChar, ProcessWChar, ProcessBinary (handle LOBs, NULL, zero-length)
- Added ColumnInfoExt struct to pass metadata efficiently
- Build columnProcessors array once during cache_column_metadata
- Fast path: Direct function call via columnProcessors[col-1] (no switch)
- Slow path: Fallback switch for complex types (DECIMAL, DATETIME, GUID)
- Reduces switch evaluations from O(rows × columns) to O(columns)
- All processors use direct Python C API from OPT #1 and OPT #2
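
A minimal sketch of the fast-path/slow-path split described above, under several assumptions: the fields of `ColumnInfoExt`, the `ColumnProcessor` signature, the `SqlType` enum, and the use of `nullptr` to mark "no specialized processor" are all hypothetical, and cells are written into a `std::vector<std::string>` rather than through the Python C API so the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical type tags; the real bindings use ODBC SQL_* constants.
enum class SqlType { Integer, Double, Decimal };

// Hypothetical metadata; the real ColumnInfoExt fields are not shown in the commit.
struct ColumnInfoExt { SqlType type; };

// Simplified output row: each cell is rendered as a string.
using Row = std::vector<std::string>;

// Hypothetical signature. Column numbers are 1-based, as in columnProcessors[col-1].
typedef void (*ColumnProcessor)(Row& row, std::size_t col,
                                const ColumnInfoExt& info, double raw);

namespace ColumnProcessors {
inline void ProcessInteger(Row& row, std::size_t col, const ColumnInfoExt&, double raw) {
    row[col - 1] = std::to_string(static_cast<std::int64_t>(raw));
}
inline void ProcessDouble(Row& row, std::size_t col, const ColumnInfoExt&, double raw) {
    row[col - 1] = std::to_string(raw);
}
}  // namespace ColumnProcessors

// Built once per batch: one switch per column. Types without a specialized
// processor get nullptr (an assumption of this sketch).
inline std::vector<ColumnProcessor> BuildProcessors(const std::vector<ColumnInfoExt>& infos) {
    std::vector<ColumnProcessor> procs;
    procs.reserve(infos.size());
    for (const ColumnInfoExt& info : infos) {
        switch (info.type) {
            case SqlType::Integer: procs.push_back(ColumnProcessors::ProcessInteger); break;
            case SqlType::Double:  procs.push_back(ColumnProcessors::ProcessDouble);  break;
            default:               procs.push_back(nullptr);                          break;
        }
    }
    return procs;
}

// Slow path: a per-cell switch that only complex types reach.
inline void ProcessComplexType(Row& row, std::size_t col, const ColumnInfoExt& info, double raw) {
    switch (info.type) {
        case SqlType::Decimal: row[col - 1] = "decimal:" + std::to_string(raw); break;
        default:               row[col - 1] = "unsupported";                    break;
    }
}

inline Row ProcessRow(const std::vector<ColumnInfoExt>& infos,
                      const std::vector<ColumnProcessor>& procs,
                      const std::vector<double>& raw) {
    assert(infos.size() == procs.size() && procs.size() == raw.size());
    Row row(infos.size());
    for (std::size_t col = 1; col <= infos.size(); ++col) {
        ColumnProcessor proc = procs[col - 1];
        if (proc != nullptr) {
            proc(row, col, infos[col - 1], raw[col - 1]);                // fast path
        } else {
            ProcessComplexType(row, col, infos[col - 1], raw[col - 1]);  // slow path
        }
    }
    return row;
}
```

The null check costs one predictable branch per cell, and the type switch only runs for columns that have no specialized processor.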

bewithgaurav added a commit that referenced this pull request

Nov 10, 2025
Eliminates switch statement overhead from the hot loop by pre-computing a
function pointer dispatch table once per batch instead of switching on every cell.

Problem:
- Previous code evaluated the switch statement 100,000 times for 10,000 rows × 10 cols
- Each switch evaluation costs 5-12 CPU cycles
- Total overhead: 500K-1.2M cycles per batch

Solution:
- Extract 10 processor functions for common types (INT, VARCHAR, etc.)
- Build function pointer array once per batch (10 switch evaluations)
- Hot loop calls each processor through its function pointer (~1 cycle each)
- Complex types (Decimal, DateTime, Guid) use fallback switch

Implementation:
- Created ColumnProcessor typedef for function pointer signature
- Added ColumnInfoExt struct with metadata needed by processors
- Implemented 10 inline processor functions in ColumnProcessors namespace:
  * ProcessInteger, ProcessSmallInt, ProcessBigInt, ProcessTinyInt, ProcessBit
  * ProcessReal, ProcessDouble
  * ProcessChar, ProcessWChar, ProcessBinary
- Build processor array after OPT #3 metadata prefetch
- Modified hot loop to use function pointers with fallback for complex types

Performance Impact:
- Reduces dispatch overhead by 70-80%
- 100,000 switch evaluations → 10 setup switches + 100,000 direct calls
- Estimated savings: ~450K-1.1M cycles per 10,000-row batch
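
The commit does not show how its estimates were derived. One reading of the figures, assuming 100,000 cells, 10 columns, 5-12 cycles per switch evaluation, and ~1 cycle per call through the table, is sketched below. The function and struct names are illustrative only.

```cpp
#include <cstdint>

// Low and high ends of a cycle estimate.
struct CycleRange {
    std::int64_t low;
    std::int64_t high;
};

// Before: one switch evaluation per cell.
constexpr CycleRange SwitchPerCellCost(std::int64_t cells,
                                       std::int64_t lowPerSwitch,
                                       std::int64_t highPerSwitch) {
    return {cells * lowPerSwitch, cells * highPerSwitch};
}

// After: one switch evaluation per column during setup,
// plus one call through the table per cell.
constexpr CycleRange DispatchTableCost(std::int64_t cells, std::int64_t columns,
                                       std::int64_t lowPerSwitch,
                                       std::int64_t highPerSwitch,
                                       std::int64_t cyclesPerCall) {
    return {columns * lowPerSwitch + cells * cyclesPerCall,
            columns * highPerSwitch + cells * cyclesPerCall};
}

// Savings = cost before - cost after, for each end of the range.
constexpr CycleRange EstimatedSavings(std::int64_t cells, std::int64_t columns,
                                      std::int64_t lowPerSwitch,
                                      std::int64_t highPerSwitch,
                                      std::int64_t cyclesPerCall) {
    return {SwitchPerCellCost(cells, lowPerSwitch, highPerSwitch).low -
                DispatchTableCost(cells, columns, lowPerSwitch, highPerSwitch, cyclesPerCall).low,
            SwitchPerCellCost(cells, lowPerSwitch, highPerSwitch).high -
                DispatchTableCost(cells, columns, lowPerSwitch, highPerSwitch, cyclesPerCall).high};
}
```

Under these assumptions `SwitchPerCellCost(100000, 5, 12)` gives 500,000 to 1,200,000 cycles, and `EstimatedSavings(100000, 10, 5, 12, 1)` gives 399,950 to 1,099,880 cycles, which is in the same range as the estimate in the commit message.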

Builds successfully on macOS Universal2 (arm64 + x86_64)

bewithgaurav added a commit that referenced this pull request

Nov 12, 2025
- Moved typedef ColumnProcessor, struct ColumnInfoExt, and all 10 inline processor functions from ddbc_bindings.cpp to ddbc_bindings.h
- Added new 'INTERNAL: Performance Optimization Helpers' section in header
- Added forward declarations for ColumnBuffers struct and FetchLobColumnData function
- Enables true cross-compilation-unit inlining for performance optimization
- Follows C++ best practices for inline function placement

Addresses review comments #4, #5, #6 from subrata-ms
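
As a rough illustration of why the move to the header matters: a function defined in a header that several .cpp files include must be marked `inline` (or be a template) to avoid multiple-definition link errors, and having the body visible in every translation unit is what lets the compiler inline it there. The sketch below shows the general shape only. The guard name, the `ColumnInfoExt` fields, and the processor signature are hypothetical and are not taken from ddbc_bindings.h.

```cpp
#ifndef DDBC_INLINE_SKETCH_H  // hypothetical guard name
#define DDBC_INLINE_SKETCH_H

#include <cstdint>

// Hypothetical fields; the real struct carries whatever metadata the processors need.
struct ColumnInfoExt {
    std::int64_t columnSize;
    bool isLob;
};

// Hypothetical signature for the function pointer type.
typedef std::int64_t (*ColumnProcessor)(const ColumnInfoExt& info, std::int64_t raw);

namespace ColumnProcessors {

// `inline` permits this definition to appear in every translation unit
// that includes the header without violating the one-definition rule.
inline std::int64_t ProcessTinyInt(const ColumnInfoExt&, std::int64_t raw) {
    return raw & 0xFF;  // keep the low 8 bits, as a TINYINT-sized value
}

inline std::int64_t ProcessBit(const ColumnInfoExt&, std::int64_t raw) {
    return raw != 0 ? 1 : 0;  // normalize any non-zero value to 1
}

}  // namespace ColumnProcessors

#endif  // DDBC_INLINE_SKETCH_H
```

Whether a call made through a function pointer is actually inlined remains up to the compiler. The header placement only makes it possible.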

bewithgaurav added a commit that referenced this pull request

Nov 12, 2025
…der file

- Moved DateTimeOffset struct definition to header (required by ColumnBuffers)
- Moved ColumnBuffers struct definition to header (required by inline functions)
- Moved typedef ColumnProcessor, struct ColumnInfoExt, and all 10 inline processor functions to header
- Added new 'INTERNAL: Performance Optimization Helpers' section in header
- Added forward declaration for FetchLobColumnData function
- Enables true cross-compilation-unit inlining for performance optimization
- Follows C++ best practices for inline function placement

Addresses review comments #4, #5, #6 from subrata-ms
Build verified successful (universal2 binary for macOS arm64 + x86_64)