[GH-1147] Fix null byte crash in semantic memory ingestion by o-love · Pull Request #1165 · MemMachine/MemMachine
Purpose of the change
Prevent asyncpg.CharacterNotInRepertoireError crashes caused by null bytes (\x00) in LLM-generated strings inserted into PostgreSQL TEXT columns.
Description
Implements defense-in-depth sanitization at two layers:
- Pydantic model layer (early catch): A field_validator on SemanticCommand and LLMReducedFeature strips null bytes as soon as LLM output is parsed, before it reaches any storage code.
- PostgreSQL storage layer (backstop): A shared sanitize_pg_text() utility is called in add_feature() and update_feature() to strip null bytes from the tag, feature, and value fields before insertion, catching any path that bypasses the models (e.g. REST API direct inserts).
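As a rough illustration of the model-layer check, the sketch below assumes Pydantic v2 and uses a hypothetical stand-in model, SemanticCommandSketch. Its field list and the helper name _strip_null_bytes are assumptions for illustration, not the definitions in this PR.

```python
from pydantic import BaseModel, field_validator


def _strip_null_bytes(value):
    """Remove null bytes from strings; pass any other value through unchanged."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    return value


class SemanticCommandSketch(BaseModel):
    # Hypothetical stand-in: the real SemanticCommand may define other fields.
    command: str
    tag: str
    feature: str
    value: str

    # mode="before" runs the validator on the raw parsed input, so the
    # cleaned string is what the model stores.
    @field_validator("tag", "feature", "value", mode="before")
    @classmethod
    def strip_null_bytes(cls, v):
        return _strip_null_bytes(v)


cmd = SemanticCommandSketch(
    command="add", tag="food\x00", feature="likes", value="piz\x00za"
)
```

With this shape, any string an LLM returns for tag, feature, or value is cleaned at parse time, before storage code ever sees it.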
The sanitize_pg_text utility logs a warning when it strips null bytes so data-quality issues are visible in monitoring.
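A minimal sketch of what such a storage-layer helper could look like follows. The name sanitize_pg_text_sketch, the field_name parameter, and the log message are assumptions for illustration; the actual signature of sanitize_pg_text is not shown in this description.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_pg_text_sketch(value: str, field_name: str = "text field") -> str:
    """Strip null bytes, which PostgreSQL TEXT columns cannot store.

    Hypothetical signature; logs a warning only when something was removed.
    """
    null_count = value.count("\x00")
    if null_count == 0:
        # Clean strings pass through untouched and produce no log output.
        return value
    logger.warning(
        "Stripped %d null byte(s) from %s before PostgreSQL insert",
        null_count,
        field_name,
    )
    return value.replace("\x00", "")
```

Keeping the clean-string path free of logging means the warning count in monitoring maps directly to the number of affected inserts.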
Fixes/Closes
Fixes #1147
Type of change
- Bug fix (non-breaking change which fixes an issue)
How Has This Been Tested?
- Unit Test
- Integration Test
Tests added:
- TestSanitizePgText: 10 unit tests covering clean passthrough, null byte stripping (single, multiple, leading, trailing, all-null), warning logging, and no-log for clean strings.
- TestSemanticCommandNullByteStripping: 4 tests verifying the Pydantic validator strips null bytes from the feature, tag, and value fields on SemanticCommand.
- TestLLMReducedFeatureNullByteStripping: 4 tests verifying the same for LLMReducedFeature.
- test_add_feature_with_null_bytes_does_not_crash: Regression test confirming null bytes in feature values do not crash storage insertion.
Test Results: 1329 passed, 9 skipped, 0 failures. Ruff lint and format clean.
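For readers who want a feel for the unit-test cases listed above, here is a hedged sketch using the standard library's unittest. The function under test, strip_nulls, and the logger name sanitize_sketch are hypothetical stand-ins, not the project's code or its actual test suite.

```python
import logging
import unittest

logger = logging.getLogger("sanitize_sketch")


def strip_nulls(value: str) -> str:
    # Hypothetical stand-in for the utility under test.
    if "\x00" in value:
        logger.warning("stripped null bytes")
        return value.replace("\x00", "")
    return value


class TestStripNullsSketch(unittest.TestCase):
    # Each entry maps a case name to (raw input, expected output).
    CASES = {
        "clean": ("hello", "hello"),
        "single": ("he\x00llo", "hello"),
        "multiple": ("h\x00e\x00llo", "hello"),
        "leading": ("\x00hello", "hello"),
        "trailing": ("hello\x00", "hello"),
        "all_null": ("\x00\x00", ""),
    }

    def test_stripping_cases(self):
        for name, (raw, expected) in self.CASES.items():
            with self.subTest(name):
                self.assertEqual(strip_nulls(raw), expected)

    def test_warning_logged_when_stripping(self):
        # assertLogs fails the test if no WARNING is emitted in the block.
        with self.assertLogs("sanitize_sketch", level="WARNING") as captured:
            strip_nulls("a\x00b")
        self.assertEqual(len(captured.records), 1)
```

Saved to a file, the sketch can be run with python -m unittest.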
Checklist
- My code follows the style guidelines of this project (See STYLE_GUIDE.md)
- I have performed a self-review of my own code
- I have commented my code
- My changes generate no new warnings
- I have added unit tests that prove my fix is effective or that my feature works
- New and existing unit tests pass locally with my changes
- I have checked my code and corrected any misspellings
Screenshots/Gifs
N/A
Further comments
- Not in scope: Fixing the LLM itself (upstream model behavior), per-feature error recovery in ingestion (separate follow-up), and Neo4j storage sanitization (handles null bytes differently).
- The sanitize_pg_text utility is intentionally kept separate from the Pydantic validators so it can serve as a backstop for all PostgreSQL insertion paths, not just model-validated ones.