feat: custom Markdown extensions for partition_md by claytonlin1110 · Pull Request #4292 · Unstructured-IO/unstructured

@claytonlin1110 a few small findings, after those are fixed this PR should be good to merge :)

Findings

  • Medium: partition_md() rejects the most useful form of Python-Markdown extensions. The new validation only accepts list[str], but Python-Markdown supports passing actual Extension instances, which is the normal way to use configured/custom extensions. On this branch, a valid call like extensions=[MyExtension(...)] would be treated as invalid, logged, and silently replaced with the default extensions, so the advertised feature does not actually support real custom extensions.
    _default_extensions = ["tables", "fenced_code"]
    extensions = kwargs.pop("extensions", _default_extensions)
    if not (isinstance(extensions, list) and all(isinstance(ext, str) for ext in extensions)):
        logging.warning(
            "Ignoring invalid 'extensions' argument (expected list of strings): %r", extensions
        )
        extensions = _default_extensions

    html = markdown.markdown(text, extensions=extensions)
  • Medium: Invalid extensions values fail open instead of fail fast. Because the extension set directly changes the parsed element structure, silently swapping bad input for ["tables", "fenced_code"] can produce the wrong partitioned output without surfacing an API error to the caller. That makes typos and unsupported values hard to diagnose, and the new test locks that behavior in rather than challenging it.
def test_partition_md_invalid_extensions_logs_and_falls_back(mocker: MockFixture):
    """Invalid `extensions` value is ignored with a warning and falls back to the default list."""
    text = "# Heading"
    logger = mocker.patch("unstructured.partition.md.logging.warning")

    elements = partition_md(text=text, extensions="not-a-list")  # type: ignore[arg-type]

    # Still parses something
    assert len(elements) > 0
    # Warning was logged
    logger.assert_called_once()
  • Low: The new tests are too weak for the feature they are adding. test_partition_md_invalid_extensions_logs_and_falls_back() only proves that a warning happened and that some element was returned; it does not verify that the default extension set was actually used. test_partition_md_custom_extensions_parameter() only covers string extension names, so it would not catch the extension-instance regression above.