feat: make telemetry off by default by claytonlin1110 · Pull Request #4281 · Unstructured-IO/unstructured

@claytonlin1110 There you go :)

Code Review: claytonlin1110:feat/telemetry-off-by-default

This PR changes telemetry from opt-out to opt-in (off by default), adds a new UNSTRUCTURED_TELEMETRY_ENABLED env var, updates docs and version. Below are the consolidated findings from all review agents, deduplicated and ranked by severity.


Critical Issues

1. Case-sensitivity asymmetry in opt-out vs opt-in logic

The opt-in check is case-insensitive, but opt-out is case-sensitive -- a dangerous inconsistency:

opt_out = os.getenv("SCARF_NO_ANALYTICS") == "true" or os.getenv("DO_NOT_TRACK") == "true"
opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1")
  • UNSTRUCTURED_TELEMETRY_ENABLED=TRUE will opt in (.lower() normalizes)
  • DO_NOT_TRACK=TRUE will NOT opt out (exact == "true" fails)
  • DO_NOT_TRACK=1 will NOT opt out

This violates the Console Do Not Track standard which specifies any non-empty value should be honored. The docstring claims "opt-out env vars are always respected and take precedence" but the code doesn't deliver on that promise for common case variations. This is a blocker.

2. Wasted nvidia-smi subprocess on every import (ordering bug)

The function shells out to nvidia-smi and computes python_version before checking whether telemetry is even enabled:

def scarf_analytics():
    try:
        subprocess.check_output("nvidia-smi")  # runs FIRST, every import
        gpu_present = True
    except Exception:
        gpu_present = False

    python_version = ".".join(platform.python_version().split(".")[:2])

    opt_out = ...
    opt_in = ...
    if opt_out or not opt_in:  # checked AFTER subprocess
        return

Since telemetry is now off by default, the vast majority of users pay the cost of a subprocess spawn on every import unstructured for zero benefit. This is a regression introduced by this PR -- the old code had the opt-out inside the try block so the subprocess cost was at least paired with the telemetry send. The env var check must be moved to the very top.

3. subprocess.check_output("nvidia-smi") is never mocked in tests

Every test actually shells out to nvidia-smi. This is slow, non-deterministic (different gpu= values on different machines), and noisy on stderr. The tests pass either way by accident, not by design.


Major Issues

4. Massive code duplication in dev/non-dev URL branches

The if "dev" in __version__ / else branches are 12 lines of identical code differing only in &dev=true vs &dev=false. This duplication existed before but the PR was already restructuring the function and had a natural opportunity to collapse it:

dev = "dev" in __version__
requests.get(
    "https://packages.unstructured.io/python-telemetry",
    params={
        "version": __version__,
        "platform": platform.system(),
        "python": python_version,
        "arch": platform.machine(),
        "gpu": str(gpu_present),
        "dev": str(dev).lower(),
    },
    timeout=10,
)

Using params= also fixes the pre-existing URL bug (issue #6) and the fragile string concatenation.

5. Mixing monkeypatch and unittest.mock.patch in tests

The tests use pytest's monkeypatch for env vars but unittest.mock.patch (context manager) for requests.get. This is inconsistent within the same class and with the rest of the test file. Should use one approach throughout -- monkeypatch.setattr is more Pythonic in a pytest codebase.

6. Test isolation gaps -- ambient env vars can mask test intent

it_does_not_send_when_do_not_track_is_set_even_if_opt_in doesn't clear SCARF_NO_ANALYTICS, and vice versa. If a developer's shell has either variable set, the test passes for the wrong reason.

7. dev vs non-dev URL path is never tested

The function has two distinct code paths based on "dev" in __version__ but no test mocks __version__ or asserts which dev= value appears in the URL. Half the code paths are untested.

8. Patch version bump for a behavioral breaking change

Going from 0.21.12 to 0.21.13 (patch) is inappropriate. Changing telemetry from on-by-default to off-by-default is a behavioral change that silently stops analytics for all existing users on upgrade. This should arguably be a minor bump (0.22.0) and deserves a migration note.


Minor Issues

9. Pre-existing bug preserved: missing = in URL query parameter

+ "&python"    # produces &python3.11 instead of &python=3.11
+ python_version

This is inherited from main but the PR is modifying this exact function. Boy scout rule applies.

10. Import-time HTTP request with 10-second timeout

When telemetry IS enabled, requests.get(..., timeout=10) runs at import time. In air-gapped environments or slow networks, import unstructured blocks for up to 10 seconds. Should be fire-and-forget in a daemon thread or use a much shorter timeout.

11. except Exception: pass is too broad

The catch-all silently swallows all exceptions including MemoryError, SystemError, etc. Should narrow to except (requests.RequestException, OSError) or at minimum log at debug level.

12. Missing test coverage for exception handling

No test verifies that the function handles requests.get exceptions gracefully.

13. Missing test coverage for explicit non-opt-in values

UNSTRUCTURED_TELEMETRY_ENABLED=false, =0, =yes, =on are not tested to verify they suppress telemetry.

14. URL assertions in tests are too shallow

Only checking "python-telemetry" in call_url and "version=" in call_url -- doesn't verify actual values, timeout parameter, or URL structure.


Nits

15. Redundant docstring + inline comment -- both say the same thing about "off by default, opt-in via env var."

16. subprocess.check_output("nvidia-smi") should use list form ["nvidia-smi"] and redirect stderr=subprocess.DEVNULL to avoid console noise.

17. Overly broad except Exception for nvidia-smi -- should narrow to except (FileNotFoundError, subprocess.CalledProcessError).

18. it_accepts_opt_in_value_1 should be a parametrized test covering "true", "True", "TRUE", "1" rather than a standalone method.


Recommended Refactored Implementation

All of issues 1-4, 6, 9, 11, 15-17 could be resolved with this rewrite:

def scarf_analytics():
    """Send a lightweight analytics ping. Off by default.

    Set UNSTRUCTURED_TELEMETRY_ENABLED=true to opt in.
    Opt-out env vars (DO_NOT_TRACK, SCARF_NO_ANALYTICS) are always respected and take precedence.
    """
    opt_out = (
        os.getenv("SCARF_NO_ANALYTICS", "").lower() in ("true", "1")
        or bool(os.getenv("DO_NOT_TRACK", ""))
    )
    opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1")
    if opt_out or not opt_in:
        return

    try:
        subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
        gpu_present = True
    except (FileNotFoundError, subprocess.CalledProcessError):
        gpu_present = False

    python_version = ".".join(platform.python_version().split(".")[:2])

    try:
        requests.get(
            "https://packages.unstructured.io/python-telemetry",
            params={
                "version": __version__,
                "platform": platform.system(),
                "python": python_version,
                "arch": platform.machine(),
                "gpu": str(gpu_present),
                "dev": str("dev" in __version__).lower(),
            },
            timeout=10,
        )
    except Exception:
        pass