Add automated docs & notebooks freshness + normalization checks by C-Achard · Pull Request #3228 · DeepLabCut/DeepLabCut

added 6 commits

March 5, 2026 11:08

Introduce a new CLI tool (.github/tools/docs_and_notebooks_check.py) to scan notebooks and Markdown docs for staleness and verification metadata under the 'deeplabcut' namespace. Adds a default YAML config (.github/tools/docs_and_notebooks_report_config.yml), a README for the tool (.github/tools/docs_and_notebooks_tool_README.md), and an output ignore entry in .gitignore. The tool uses pydantic schemas, computes last_git_updated from git history, reads/writes notebook top-level metadata and Markdown frontmatter (idempotent updates), and supports report/check/update modes. Outputs machine- and human-readable reports (nb_docs_status.json / .md). Requires pydantic and PyYAML; designed to be safe-by-default for CI (read-only unless --write is passed).

Introduce a GitHub Actions workflow to scan docs and notebooks for staleness. The workflow runs on push and PRs to main, checks out full git history, uses Python 3.12, installs pydantic and pyyaml, and runs a read-only staleness report and an optional policy check using .github/tools/docs_and_notebooks_check.py with tools/staleness_config.yml. Results (JSON/MD) are uploaded as the staleness-report artifact. Workflow is limited to content read permissions and has a 10-minute timeout.

Rename OUTPUT_FILENAME from 'nb_docs_status' to 'docs_nb_checks' and use it for the default --out-dir (tmp/docs_nb_checks). Update the README to show the check command as a fenced code block and clarify allowlist behavior. Update .gitignore to ignore the new tmp/docs_nb_checks path.

Ensure DLC metadata is JSON-serializable by converting date/datetime fields to ISO strings and preserving exclude_none behavior. Uses pydantic v2 API (model_dump(mode="json", exclude_none=True)) and falls back to pydantic v1 via json.loads(meta.json(...)). Adds a docstring and clarifying comments. This prevents json.dumps from failing when writing .ipynb files and keeps compatibility across pydantic versions.

Update docs-and-notebooks tool to use nbformat for reading/writing notebooks, validate .ipynb files, and detect whether notebooks are normalized. Add notebook_is_normalized helper and ensure write_ipynb_meta uses nbformat.writes/validate. Introduce a new policy field require_notebook_normalized (and add it to the report config defaults) and enforce it to emit violations when notebooks are not normalized. Also update CI job to install nbformat and pin pydantic, and update the script header notes to list the new dependency. These changes let CI detect invalid or non-normalized notebooks and reduce formatting churn when normalizing files.
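To illustrate what "normalized" can mean here, the following is a minimal sketch that treats a notebook as normalized when re-serializing it reproduces the file text exactly. It deliberately uses only the standard-library json module as a stand-in; the tool itself goes through nbformat, and the names canonical_text and is_normalized as well as the chosen serialization settings are assumptions made for this example.

```python
import json


def canonical_text(raw_text: str) -> str:
    """Re-serialize notebook JSON in one fixed, deterministic layout.

    Assumption: sorted keys, one-space indent and a trailing newline stand in
    for whatever canonical form the real tool obtains from nbformat.
    """
    data = json.loads(raw_text)
    return json.dumps(data, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def is_normalized(raw_text: str) -> bool:
    """A file is normalized if rewriting it would not change a single byte."""
    return raw_text == canonical_text(raw_text)
```

A check like this lets CI flag notebooks whose next save would produce a formatting-only diff, which is the churn the policy field is meant to prevent.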

Add a local pre-commit hook 'dlc-docs-notebooks-check' that runs .github/tools/docs_and_notebooks_check.py to check DLC docs and notebooks for staleness, validate nbformat, and perform normalization. The hook targets Jupyter and Markdown files, passes filenames to the script, and declares additional dependencies (pydantic>=2,<3, pyyaml, nbformat>=5).

Use docs_and_notebooks_report_config.yml as the default config and resolve it relative to the script. Rename machine/human report outputs to docs_nb_checks.{json,md}. Add an optional --targets argument to the report and check subcommands; scan_files now accepts a targets list and filters scanned paths to only those targets. Make --config default a string path and adjust error-exit logic so parsing errors don't cause a non-zero exit in report mode. Minor doc/formatting tweaks.
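The --targets filtering described above could look roughly like the sketch below. The function name filter_to_targets and its signature are hypothetical; the only behavior taken from the commit message is that an optional targets list narrows the scanned paths to those targets.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def _norm(repo_root: Path, path) -> str:
    """Normalize a repo-relative (or absolute) path to a comparable string."""
    return os.path.normpath(os.path.join(repo_root, path))


def filter_to_targets(
    scanned: Iterable[Path],
    targets: Optional[Iterable[str]],
    repo_root: Path,
) -> list[Path]:
    """Keep only scanned paths that were explicitly requested.

    With no targets (None or empty), every scanned path is kept, which matches
    the behavior of a full scan.
    """
    scanned = list(scanned)
    if not targets:
        return scanned
    wanted = {_norm(repo_root, t) for t in targets}
    return [p for p in scanned if _norm(repo_root, p) in wanted]
```

Normalizing both sides before comparing means that "./docs/a.md" and "docs/a.md" select the same file, which matters when a pre-commit hook passes filenames in a different form than the scanner produces.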

Update references and examples in .github/tools/docs_and_notebooks_tool_README.md: change config reference to .github/tools/docs_and_notebooks_report_config.yml, update report output paths to tmp/docs_nb_checks/..., simplify the example 'check' command, and replace usages of tools/staleness.py with .github/tools/docs_and_notebooks_check.py in the update/example commands. Also tidy the 'Writes' section formatting.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Rename .github/tools/docs_and_notebooks_* to tools/ and update references. Updated workflow (.github/workflows/docs_and_notebooks_checks.yml) and pre-commit config to call tools/docs_and_notebooks_check.py and use tools/docs_and_notebooks_report_config.yml, updated the tool script's internal docs and the README paths, and tweaked the workflow name to "Docs & notebooks freshness and formatting checks".

Clarify and expand the Docs & Notebooks checks tool README: rename last_git_updated to last_content_updated (computed from git but ignoring metadata-only commits), add last_metadata_updated and verified_for metadata fields, and emphasize separation of content vs metadata. Document the META_COMMIT_MARKER requirement for metadata-only/normalization commits and provide suggested commit messaging and guardrails for update/normalize operations. Reorganize commands (report, check, update, normalize), note that update/normalize are write operations intended for maintainers only, and add CI guidance (use actions/checkout fetch-depth: 0) and required dependencies. Also include troubleshooting tips and mention deterministic notebook normalization and Pydantic model rebuild guidance.

Reduce job timeout from 10 to 5 minutes, upgrade actions/checkout to v6 and actions/setup-python to v6, and allow the staleness policy check step to continue-on-error. These changes use newer action releases, shorten runtime limits, and ensure the optional gate doesn't fail the workflow.

Include pydantic>2 in pyproject.toml dependencies to require Pydantic v2+ for project data models/validation and ensure compatibility with code expecting Pydantic v2 behavior.

Differentiate between absent and invalid DLC metadata in notebooks and markdown files. read_ipynb_meta now returns a has_dlc flag; parse_dlc_meta returns (meta, valid). scan_files uses the new flags to set rec.meta and append explicit warnings "missing_metadata" or "invalid_metadata" instead of treating all None as missing. Call sites in update_files and normalize_notebooks updated to unpack the extra return value. Misc cleanup around frontmatter handling and error/warning reporting.
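The absent-versus-invalid distinction can be sketched as follows. This is not the tool's code: read_meta, parse_meta and classify are hypothetical names, and the validation is a stdlib stand-in (an ISO date check on last_verified) for the pydantic schema the tool actually uses.

```python
from datetime import date
from typing import Any, Optional

NAMESPACE = "deeplabcut"


def read_meta(notebook: dict) -> tuple[Optional[Any], bool]:
    """Return (raw_meta, has_dlc): has_dlc is False only if the namespace is absent."""
    top_level = notebook.get("metadata", {})
    if NAMESPACE not in top_level:
        return None, False
    return top_level[NAMESPACE], True


def parse_meta(raw: Any) -> tuple[Optional[dict], bool]:
    """Return (meta, valid); meta is None whenever validation fails."""
    if not isinstance(raw, dict):
        return None, False
    try:
        if "last_verified" in raw:
            date.fromisoformat(str(raw["last_verified"]))
    except ValueError:
        return None, False
    return raw, True


def classify(notebook: dict) -> tuple[Optional[dict], list[str]]:
    """Map a notebook to (meta, warnings) with an explicit warning per failure mode."""
    raw, has_dlc = read_meta(notebook)
    if not has_dlc:
        return None, ["missing_metadata"]
    meta, valid = parse_meta(raw)
    if not valid:
        return None, ["invalid_metadata"]
    return meta, []
```

In both failure cases meta stays None, so only the warning tells a maintainer whether to add metadata or to fix what is already there.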

Add two tests to verify notebook metadata validation: one ensures a notebook missing the "deeplabcut" namespace triggers a "missing_metadata" warning and leaves meta as None; the other ensures an invalid "deeplabcut" namespace (bad last_verified value) triggers an "invalid_metadata" warning and leaves meta as None. Both tests create minimal notebooks in a temp git repo, commit them, run the tool scan, and assert the expected warnings and record kinds.

Change GitHub Actions test step to install the package with development extras (pip install -e .[dev]) instead of only pytest. Update pyproject.toml dev dependency group by adding pydantic>2 and nbformat>5 and removing black so CI/tests have the required dev libraries available.

Introduce a --no-step-summary CLI flag and use it in the GitHub Actions workflow to prevent writing to GITHUB_STEP_SUMMARY during the docs/notebooks check job. The script now respects args.no_step_summary when deciding whether to emit the step summary. When writing is enabled, the summary now includes the full Markdown content (the previous truncation to 220 lines was removed or commented out).
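A minimal sketch of the opt-out logic, assuming a hypothetical helper named write_step_summary. GITHUB_STEP_SUMMARY is the file path GitHub Actions exposes for job summaries; everything else here, including the return value, is an illustration rather than the script's actual code.

```python
import os


def write_step_summary(markdown: str, no_step_summary: bool = False) -> bool:
    """Append the full report to the job summary; return True if it was written.

    Nothing is written when the flag is set or when the variable is missing,
    for example when the script runs outside GitHub Actions.
    """
    target = os.environ.get("GITHUB_STEP_SUMMARY")
    if no_step_summary or not target:
        return False
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(markdown)
        if not markdown.endswith("\n"):
            handle.write("\n")
    return True
```

Appending rather than overwriting keeps summaries from earlier steps of the same job intact.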

Update GitHub Actions workflow to use actions/checkout@v6 and modify the editable install command used in the test step. Replaces "pip install -e .[dev]" with "pip install -e . --group dev" before running pytest, aligning the workflow with the newer checkout action and the revised dependency installation syntax.

Update CI workflows and developer tooling: bump actions/checkout and setup-python usages (checkout@v6, setup-python@v6), upgrade peaceiris/actions-gh-pages to v4, and update codespell workflow. Simplify python-package workflow to install dev extras once using `--group dev` and remove the duplicate install. Adjust pre-commit config to pass args to the name-tests-test hook. Fix dependency declarations in pyproject.toml (move/clean up pydantic and nbformat entries, remove duplicate pydantic line). Reformat tests/tools/docs_and_notebooks_checks/test_check_contracts.py for readability (multi-line calls, string quote consistency) and simplify one assertion.

Only set last_metadata_updated when an actual file write will occur (not unconditionally). For notebooks and markdown files the metadata stamping and merge now happen only when write is performed; when not writing, the in-memory record.meta is still updated so warnings and reports remain accurate. Rename the summary/report field and related text from "git_stale" to "content_stale" and adjust wording from "last_git_updated" to "last_git_touched / last_content_updated". A new import (from curses import meta) was also added.

Build the desired metadata without mutating last_metadata_updated and use a base "desired_base" when comparing/merging. Only set meta.last_metadata_updated (and produce the final merged metadata) if an actual file write will occur. Apply the same logic to both notebooks and markdown frontmatter, simplify variable names, and remove redundant in-branch rec.meta assignments so the record is consistently updated once at the end.
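The "stamp only on write" rule can be sketched as below. The function plan_metadata_update and its arguments are hypothetical, and metadata is modeled as a plain dict instead of the tool's pydantic model; the point is only that last_metadata_updated is excluded from the comparison and set solely when a write will happen.

```python
from datetime import date
from typing import Optional

STAMP = "last_metadata_updated"


def plan_metadata_update(
    current: Optional[dict],
    desired_base: dict,
    write: bool,
    today: date,
) -> tuple[dict, bool]:
    """Return (meta_for_record, will_write) without touching any file."""
    existing = dict(current or {})
    # Compare without the stamp so a stamp-only difference never counts as a change.
    comparable = {k: v for k, v in existing.items() if k != STAMP}
    merged = {**comparable, **desired_base}
    will_write = write and merged != comparable
    if will_write:
        # Only a real write refreshes the stamp.
        merged[STAMP] = today.isoformat()
    elif STAMP in existing:
        # Dry run or no change: keep the previous stamp in the in-memory record.
        merged[STAMP] = existing[STAMP]
    return merged, will_write
```

In a dry run the returned record still reflects the desired values, so reports stay accurate, but the stamp is left alone and repeated runs remain idempotent.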

Drop an unused/errant `from curses import meta` import and remove a redundant pre-metadata call to `write_ipynb_meta`. Notebooks are now written once after merging/updating DLC metadata (comment updated accordingly), reducing unnecessary I/O and avoiding potential name conflicts.

Pass an explicit "HEAD" ref to git log calls and add --fixed-strings to the grep invocation. This prevents Git from misinterpreting the path or commit selector ordering and ensures the META_COMMIT_MARKER is treated as a literal string (not a regex). Changes applied to git_last_touched and git_last_content_updated to make commit/date lookups more robust.

Add a reusable _git_log_date helper to build git log args and return parsed dates, and make git_last_touched delegate to it. Improve _parse_git_iso_date to handle plain ISO dates and trailing Z timezone markers. Update git_last_content_updated to call the new helper with grep/invert-grep args to skip META_COMMIT_MARKER and fall back to raw last-touched date when needed.
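The two pieces described in the last two commits, argument construction and date parsing, might be sketched like this. The helper names, the marker value and the use of --format=%cI are assumptions for the example; the git options themselves (--grep, --invert-grep, --fixed-strings, an explicit HEAD before the "--" separator) are standard git log options.

```python
from datetime import date, datetime
from typing import Optional, Sequence

META_COMMIT_MARKER = "[dlc-meta]"  # hypothetical value; the real marker is defined by the tool


def build_git_log_args(path: str, extra: Sequence[str] = ()) -> list[str]:
    """Build a git log command for the newest matching commit's date.

    The explicit HEAD ref and the "--" separator keep git from confusing
    the revision with the path.
    """
    return ["git", "log", "-1", "--format=%cI", *extra, "HEAD", "--", path]


def content_only_args(path: str) -> list[str]:
    """Same lookup, but skipping commits whose message contains the marker."""
    skip_meta = ["--fixed-strings", f"--grep={META_COMMIT_MARKER}", "--invert-grep"]
    return build_git_log_args(path, skip_meta)


def parse_git_iso_date(text: str) -> Optional[date]:
    """Parse a plain ISO date or an ISO timestamp, accepting a trailing Z."""
    value = text.strip()
    if not value:
        return None
    if value.endswith("Z"):
        # Older Python versions do not accept "Z" in fromisoformat.
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value).date()
    except ValueError:
        return None
```

A caller would run content_only_args first and fall back to the plain build_git_log_args result when every commit touching the file carries the marker, mirroring the fallback to the raw last-touched date mentioned above.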

C-Achard marked this pull request as ready for review

March 11, 2026 12:04

Split the hook command into a simple entry plus explicit args and replace types_or with a files regex to more precisely target docs, examples (JUPYTER/COLAB) and tools .md/.ipynb files. This improves pre-commit handling of the CLI arguments and ensures the hook only runs on relevant files; pass_filenames and additional_dependencies remain unchanged.

Clarify README language around metadata commits and notebook normalization: explicitly note git correctly reports rewritten files as “updated now”, require ack flag text capitalization, add a warning about future changes to the metadata-commit marker, rephrase allowlists line to indicate they are currently empty, promote the notebook note to IMPORTANT and reword for clarity, and make minor troubleshooting wording/capitalization tweaks.

Update docs_and_notebooks_check.py so that scan/parsing errors are still reported but do not fail checks by default, and add an opt-in strict mode. Key changes:

- Clarify behavior in top-level docs and Markdown output: scan errors are reported as non-fatal by default and labeled as "scan errors" in summaries.
- Add a policy config option fail_on_scan_errors (default false) and a CLI --strict-mode flag which forces check to fail on scan/parsing errors.
- Introduce collect_scan_issues helper to aggregate scan errors/warnings and print a brief summary to console.
- Adjust check command help text and check flow to respect the combined strict-mode/config setting; when strict, scan errors cause non-zero exit.
- Rename suggested commit message constant to SUGGESTED_TAGGED_COMMIT and update printed suggestions.
- Minor UX improvements: more explanatory messages after report generation and limited preview of scan errors.

All changes are confined to tools/docs_and_notebooks_check.py and focus on behavior and messaging around scan error handling and strictness.
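How the CLI flag and the config option might combine is sketched below. The flag name --strict-mode and the option fail_on_scan_errors come from the list above; build_parser, check_exit_code and the concrete exit codes 0 and 1 are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser exposing only the check subcommand and its strict flag."""
    parser = argparse.ArgumentParser(prog="docs_and_notebooks_check")
    sub = parser.add_subparsers(dest="command", required=True)
    check = sub.add_parser("check", help="enforce policy; scan errors are non-fatal by default")
    check.add_argument("--strict-mode", action="store_true",
                       help="also fail when scan/parsing errors occurred")
    return parser


def check_exit_code(n_violations: int, n_scan_errors: int,
                    strict_flag: bool, policy: dict) -> int:
    """Return the process exit code for the check command."""
    # Either source can turn strictness on; neither can turn it off.
    strict = strict_flag or bool(policy.get("fail_on_scan_errors", False))
    if n_violations:
        return 1
    if strict and n_scan_errors:
        return 1
    return 0
```

With this shape, a repository can stay lenient in its committed config while a single CI job opts in to strictness from the command line.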

Enforce pydantic v2 APIs and tighten frontmatter parsing/validation and error reporting. Changes include:

- Require pydantic>=2 and nbformat>=5 in docs.
- read_md_frontmatter now returns an optional error string and reports unterminated or non-mapping frontmatter.
- Propagate frontmatter parse errors into scan/update flows (mark invalid_metadata and add explicit error codes).
- read_ipynb_meta returns the raw DLC metadata and preserves presence flag; notebook metadata handling clarified.
- Use model_dump(mode='json') / model_validate everywhere (remove pydantic v1 fallbacks).
- Simplify meta_to_jsonable to assume pydantic v2 output.
- Add future-date validation for last_content_updated/last_metadata_updated/last_verified and report errors.
- Improve enforcement logic to treat invalid metadata separately from missing metadata and to avoid false passes.
- Fail-safe handling in update mode when frontmatter is invalid; adjust report header and JSON output path writing.
- Minor whitespace/cleanup and reordering of strict_mode evaluation.

Overall this makes metadata parsing more strict, yields clearer diagnostics for frontmatter issues, and migrates code paths to pydantic v2.
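As one example from the list above, the future-date validation could be sketched as follows. The three field names are taken from the commit message; the function name, the error-code format and the dict-based metadata are assumptions, and values are assumed to be valid ISO dates already.

```python
from datetime import date, datetime

DATE_FIELDS = ("last_content_updated", "last_metadata_updated", "last_verified")


def future_date_errors(meta: dict, today: date) -> list[str]:
    """Return one error code per date field that lies after today."""
    errors = []
    for field in DATE_FIELDS:
        value = meta.get(field)
        if value is None:
            continue
        if isinstance(value, datetime):
            # datetime is a date subclass, but the two cannot be compared directly.
            parsed = value.date()
        elif isinstance(value, date):
            parsed = value
        else:
            parsed = date.fromisoformat(str(value))
        if parsed > today:
            errors.append(f"future_date:{field}")
    return errors
```

Passing today in as an argument, rather than calling date.today() inside, keeps the check deterministic in tests.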

Adapt tests to recent API changes: replace SUGGESTED_META_COMMIT_MESSAGE with SUGGESTED_TAGGED_COMMIT, and update calls to read_md_frontmatter to unpack a third return value (fm, body, _). Keeps tests aligned with the tool's renamed constant and modified function signature.