feat: custom Markdown extensions for partition_md by claytonlin1110 · Pull Request #4292 · Unstructured-IO/unstructured
@claytonlin1110 a few small findings, after those are fixed this PR should be good to merge :)
Findings
- Medium:
partition_md()rejects the most useful form of Python-Markdown extensions. The new validation only acceptslist[str], but Python-Markdown supports passing actualExtensioninstances, which is the normal way to use configured/custom extensions. On this branch, a valid call likeextensions=[MyExtension(...)]would be treated as invalid, logged, and silently replaced with the default extensions, so the advertised feature does not actually support real custom extensions.
_default_extensions = ["tables", "fenced_code"] extensions = kwargs.pop("extensions", _default_extensions) if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)): logging.warning( "Ignoring invalid 'extensions' argument (expected list of strings): %r", extensions ) extensions = _default_extensions html = markdown.markdown(text, extensions=extensions)
- Medium: Invalid
extensionsvalues fail open instead of fail fast. Because the extension set directly changes the parsed element structure, silently swapping bad input for["tables", "fenced_code"]can produce the wrong partitioned output without surfacing an API error to the caller. That makes typos and unsupported values hard to diagnose, and the new test locks that behavior in rather than challenging it.
def test_partition_md_invalid_extensions_logs_and_falls_back(mocker: MockFixture): """Invalid `extensions` value is ignored with a warning and falls back to the default list.""" text = "# Heading" logger = mocker.patch("unstructured.partition.md.logging.warning") elements = partition_md(text=text, extensions="not-a-list") # type: ignore[arg-type] # Still parses something assert len(elements) > 0 # Warning was logged logger.assert_called_once()
- Low: The new tests are too weak for the feature they are adding.
test_partition_md_invalid_extensions_logs_and_falls_back()only proves that a warning happened and that some element was returned; it does not verify that the default extension set was actually used.test_partition_md_custom_extensions_parameter()only covers string extension names, so it would not catch the extension-instance regression above.