FEAT: Comprehensive Logging Framework with Python-C++ Bridge & Instrumentation by bewithgaurav · Pull Request #312 · microsoft/mssql-python

…nflict and include logger_bridge.hpp in ddbc_bindings.h


- Changed log filename format: timestamp now has no separator (YYYYMMDDHHMMSS)
- Added CSV header with metadata: script name, PID, Python version, OS info
- Converted log format to CSV: Timestamp, ThreadID, Level, Location, Source, Message
- Replaced trace ID with OS native thread ID for debugger compatibility
- Updated Python formatter to parse [Python]/[DDBC] tags into Source column
- Updated C++ logger_bridge to use makeRecord() for proper file/line attribution
- Logs are now easily parseable as CSV for analysis in Excel/pandas
- Counter logic for connection/cursor tracking kept internally but not displayed
- Updated LOGGING.md:
  * Changed log format examples from trace ID to CSV format
  * Updated filename format (YYYYMMDDHHMMSS with no separators)
  * Replaced trace ID section with Thread Tracking section
  * Added CSV parsing examples with pandas
  * Updated all log output samples to show CSV columns

- Updated MSSQL-Python-Logging-Design.md:
  * Changed file handler config to describe CSV format
  * Replaced Trace ID System with Thread Tracking System
  * Updated architecture to reflect OS native thread IDs
  * Added CSV formatter implementation details
  * Updated all code examples to use setup_logging() API
  * Changed log output examples to CSV format

- Thread tracking now uses OS native thread IDs (threading.get_native_id())
- CSV columns: Timestamp, ThreadID, Level, Location, Source, Message
- File header includes metadata (PID, script name, Python version, etc.)
- Easy analysis with pandas/Excel/CSV tools
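Because each line after the metadata header is plain CSV with the columns listed above, the log can be loaded with any CSV tool. A minimal sketch using the standard-library `csv` module (pandas `read_csv` works the same way); the sample rows are fabricated for illustration, and a real file additionally begins with the metadata header (script name, PID, Python version) described above:

```python
import csv
import io

# Fabricated sample in the documented column layout:
# Timestamp, ThreadID, Level, Location, Source, Message
sample_log = io.StringIO(
    "Timestamp,ThreadID,Level,Location,Source,Message\n"
    "20251113093045,4321,DEBUG,cursor.py:210,Python,Executing query: SELECT 1\n"
    "20251113093045,4321,DEBUG,ddbc_bindings.cpp:88,DDBC,SQLExecute succeeded\n"
)

rows = list(csv.DictReader(sample_log))
# The Source column separates driver-Python entries from C++ (DDBC) entries.
python_rows = [r for r in rows if r["Source"] == "Python"]
```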
- CSV format now mentioned only once as optional import capability
- Focus on log structure and content, not format
- Removed repetitive CSV parsing examples
- Single section 'Importing Logs as CSV (Optional)' in LOGGING.md
- Brief mention in design doc that format is importable as CSV

Features:
- Whitelist log file extensions: .txt, .log, .csv only
- Raise ValueError for invalid extensions
- Export driver_logger for use in application code
- Allow apps to use mssql-python's logger: from mssql_python.logging import driver_logger
- Updated documentation with usage examples
- Added validation in _setLevel() method

Benefits:
- Prevents accidental use of wrong file types
- Clear error messages for invalid extensions
- Unified logging - apps can use same logger as driver
- Same format and thread tracking for app logs
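The extension whitelist can be sketched as follows. This is an illustrative stand-in, not the driver's actual `_setLevel()` code; the helper name `validate_log_path` and the exact error message are assumptions, while the whitelisted extensions (.txt, .log, .csv) and the `ValueError` behavior come from the commit description:

```python
import os

ALLOWED_LOG_EXTENSIONS = {".txt", ".log", ".csv"}

def validate_log_path(path: str) -> str:
    """Reject log file paths whose extension is not whitelisted."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_LOG_EXTENSIONS:
        raise ValueError(
            f"Invalid log file extension {ext!r}; allowed: "
            + ", ".join(sorted(ALLOWED_LOG_EXTENSIONS))
        )
    return path
```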
- Added 'Executing query:' log at start of execute() method
- Removed duplicate log statement that was causing queries to appear twice
- executemany() uses existing detailed log (shows parameter set count)
- Each query now logged exactly once at DEBUG level
- Parameters excluded from basic query log (pending PII review)


… (Linux/macOS)

Problem:
- Linux/macOS performed double conversion for NVARCHAR columns
- SQLWCHAR → std::wstring (via SQLWCHARToWString) → Python unicode
- Created unnecessary intermediate std::wstring allocation

Solution:
- Use PyUnicode_DecodeUTF16() to convert UTF-16 directly to Python unicode
- Single-step conversion eliminates intermediate allocation
- Platform-specific optimization (Linux/macOS only)

Impact:
- Reduces memory allocations for wide-character string columns
- Eliminates one full conversion step per NVARCHAR cell
- Regular VARCHAR/CHAR columns unchanged (already optimal)
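For intuition: `PyUnicode_DecodeUTF16()` is the C-level counterpart of `bytes.decode("utf-16-le")`, so the single-step conversion can be illustrated from Python. The buffer below is fabricated to stand in for an NVARCHAR cell; on Linux/macOS the driver's SQLWCHAR buffers hold 2-byte UTF-16 code units:

```python
# Simulate the raw UTF-16 bytes of an NVARCHAR cell as fetched on Linux/macOS.
raw = "héllo, wörld".encode("utf-16-le")

# Old path (conceptually): SQLWCHAR -> std::wstring -> Python str (two steps).
# New path: decode the UTF-16 buffer directly into a Python str (one step),
# which is what PyUnicode_DecodeUTF16() does at the C level.
value = raw.decode("utf-16-le")
```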
Problem:
- All numeric conversions used pybind11 wrappers with overhead:
  * Type detection, wrapper object creation, bounds checking
  * ~20-40 CPU cycles overhead per cell

Solution:
- Use direct Python C API calls:
  * PyLong_FromLong/PyLong_FromLongLong for integers
  * PyFloat_FromDouble for floats
  * PyBool_FromLong for booleans
  * PyList_SET_ITEM macro (no bounds check - list pre-sized)

Changes:
- SQL_INTEGER, SQL_SMALLINT, SQL_BIGINT, SQL_TINYINT → PyLong_*
- SQL_BIT → PyBool_FromLong
- SQL_REAL, SQL_DOUBLE, SQL_FLOAT → PyFloat_FromDouble
- Added explicit NULL handling for each type

Impact:
- Eliminates pybind11 wrapper overhead for simple numeric types
- Direct array access via PyList_SET_ITEM macro
- Affects 7 common numeric SQL types
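The C-API calls named above can even be exercised from Python via `ctypes.pythonapi`, which is useful for sanity-checking their behavior. Note this is purely an illustration of the functions the C++ code now calls, not driver code; with `restype = py_object`, ctypes takes ownership of the new reference each function returns:

```python
import ctypes

api = ctypes.pythonapi

# PyLong_FromLong / PyFloat_FromDouble / PyBool_FromLong each return a *new*
# reference; setting restype to py_object hands that reference to ctypes.
api.PyLong_FromLong.restype = ctypes.py_object
api.PyLong_FromLong.argtypes = [ctypes.c_long]
api.PyFloat_FromDouble.restype = ctypes.py_object
api.PyFloat_FromDouble.argtypes = [ctypes.c_double]
api.PyBool_FromLong.restype = ctypes.py_object
api.PyBool_FromLong.argtypes = [ctypes.c_long]

n = api.PyLong_FromLong(42)     # what SQL_INTEGER cells now go through
f = api.PyFloat_FromDouble(2.5) # SQL_REAL / SQL_DOUBLE / SQL_FLOAT
b = api.PyBool_FromLong(1)      # SQL_BIT
```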

Problem:
--------
Column metadata (dataType, columnSize, isLob, fetchBufferSize) was accessed
from the columnInfos vector inside the hot row processing loop. For a query
with 1,000 rows × 10 columns, this resulted in 10,000 struct field accesses.

Each access involves:
- Vector bounds checking
- Large struct loading (~50+ bytes per ColumnInfo)
- Poor cache locality (struct fields scattered in memory)
- Cost: ~10-15 CPU cycles per access (L2 cache misses likely)

Solution:
---------
Prefetch metadata into tightly-packed local arrays before the row loop:
- std::vector<SQLSMALLINT> dataTypes (2 bytes per element)
- std::vector<SQLULEN> columnSizes (8 bytes per element)
- std::vector<uint64_t> fetchBufferSizes (8 bytes per element)
- std::vector<bool> isLobs (1 byte per element)

Total: ~190 bytes for 10 columns vs 500+ bytes with structs.

These arrays stay hot in L1 cache for the entire batch processing,
eliminating repeated struct access overhead.

Changes:
--------
- Added 4 prefetch vectors before row processing loop
- Added prefetch loop to populate metadata arrays (read columnInfos once)
- Replaced all columnInfos[col-1].field accesses with array lookups
- Updated SQL_CHAR/SQL_VARCHAR cases
- Updated SQL_WCHAR/SQL_WVARCHAR cases
- Updated SQL_BINARY/SQL_VARBINARY cases

Impact:
-------
- Eliminates O(rows × cols) metadata lookups
- 10,000 array accesses @ 3-5 cycles vs 10,000 struct accesses @ 10-15 cycles
- ~70% reduction in metadata access overhead
- Better L1 cache utilization (190 bytes vs 500+ bytes)
- Expected 15-25% overall performance improvement on large result sets
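The struct-of-arrays prefetch is language-agnostic, so its control flow can be sketched in Python (the cache-locality win itself is a C++ concern). The `column_infos` dicts below are fabricated stand-ins mirroring the ColumnInfo fields named above:

```python
# Hypothetical per-column metadata, mirroring the ColumnInfo fields
# (dataType, columnSize, isLob, fetchBufferSize) from the commit.
column_infos = [
    {"dataType": 4,  "columnSize": 10,  "isLob": False, "fetchBufferSize": 8},
    {"dataType": -9, "columnSize": 256, "isLob": False, "fetchBufferSize": 514},
]

# Prefetch: read each struct exactly once, BEFORE the row loop, into flat arrays.
data_types   = [c["dataType"]        for c in column_infos]
buffer_sizes = [c["fetchBufferSize"] for c in column_infos]
is_lobs      = [c["isLob"]           for c in column_infos]

metadata_reads = 0
for _row in range(3):            # stand-in for the fetched batch of rows
    for col in range(len(column_infos)):
        # Hot loop now touches tightly-packed arrays, not the structs.
        _dt, _buf, _lob = data_types[col], buffer_sizes[col], is_lobs[col]
        metadata_reads += 1
```

In the C++ change, `data_types`/`buffer_sizes` are `std::vector<SQLSMALLINT>` / `std::vector<uint64_t>`, which is what keeps the working set near 190 bytes for 10 columns.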

@bewithgaurav bewithgaurav changed the title FEAT: Logging Framework FEAT: Comprehensive Logging Framework with Python-C++ Bridge & Instrumentation

Nov 13, 2025
