feat: make telemetry off by default by claytonlin1110 · Pull Request #4281 · Unstructured-IO/unstructured
@claytonlin1110 There you go :)
Code Review: claytonlin1110:feat/telemetry-off-by-default
This PR changes telemetry from opt-out to opt-in (off by default), adds a new UNSTRUCTURED_TELEMETRY_ENABLED env var, updates docs and version. Below are the consolidated findings from all review agents, deduplicated and ranked by severity.
Critical Issues
1. Case-sensitivity asymmetry in opt-out vs opt-in logic
The opt-in check is case-insensitive, but opt-out is case-sensitive -- a dangerous inconsistency:
opt_out = os.getenv("SCARF_NO_ANALYTICS") == "true" or os.getenv("DO_NOT_TRACK") == "true" opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1")
UNSTRUCTURED_TELEMETRY_ENABLED=TRUEwill opt in (.lower()normalizes)DO_NOT_TRACK=TRUEwill NOT opt out (exact== "true"fails)DO_NOT_TRACK=1will NOT opt out
This violates the Console Do Not Track standard which specifies any non-empty value should be honored. The docstring claims "opt-out env vars are always respected and take precedence" but the code doesn't deliver on that promise for common case variations. This is a blocker.
2. Wasted nvidia-smi subprocess on every import (ordering bug)
The function shells out to nvidia-smi and computes python_version before checking whether telemetry is even enabled:
def scarf_analytics(): try: subprocess.check_output("nvidia-smi") # runs FIRST, every import gpu_present = True except Exception: gpu_present = False python_version = ".".join(platform.python_version().split(".")[:2]) opt_out = ... opt_in = ... if opt_out or not opt_in: # checked AFTER subprocess return
Since telemetry is now off by default, the vast majority of users pay the cost of a subprocess spawn on every import unstructured for zero benefit. This is a regression introduced by this PR -- the old code had the opt-out inside the try block so the subprocess cost was at least paired with the telemetry send. The env var check must be moved to the very top.
3. subprocess.check_output("nvidia-smi") is never mocked in tests
Every test actually shells out to nvidia-smi. This is slow, non-deterministic (different gpu= values on different machines), and noisy on stderr. The tests pass either way by accident, not by design.
Major Issues
4. Massive code duplication in dev/non-dev URL branches
The if "dev" in __version__ / else branches are 12 lines of identical code differing only in &dev=true vs &dev=false. This duplication existed before but the PR was already restructuring the function and had a natural opportunity to collapse it:
dev = "dev" in __version__ requests.get( "https://packages.unstructured.io/python-telemetry", params={ "version": __version__, "platform": platform.system(), "python": python_version, "arch": platform.machine(), "gpu": str(gpu_present), "dev": str(dev).lower(), }, timeout=10, )
Using params= also fixes the pre-existing URL bug (issue #6) and the fragile string concatenation.
5. Mixing monkeypatch and unittest.mock.patch in tests
The tests use pytest's monkeypatch for env vars but unittest.mock.patch (context manager) for requests.get. This is inconsistent within the same class and with the rest of the test file. Should use one approach throughout -- monkeypatch.setattr is more Pythonic in a pytest codebase.
6. Test isolation gaps -- ambient env vars can mask test intent
it_does_not_send_when_do_not_track_is_set_even_if_opt_in doesn't clear SCARF_NO_ANALYTICS, and vice versa. If a developer's shell has either variable set, the test passes for the wrong reason.
7. dev vs non-dev URL path is never tested
The function has two distinct code paths based on "dev" in __version__ but no test mocks __version__ or asserts which dev= value appears in the URL. Half the code paths are untested.
8. Patch version bump for a behavioral breaking change
Going from 0.21.12 to 0.21.13 (patch) is inappropriate. Changing telemetry from on-by-default to off-by-default is a behavioral change that silently stops analytics for all existing users on upgrade. This should arguably be a minor bump (0.22.0) and deserves a migration note.
Minor Issues
9. Pre-existing bug preserved: missing = in URL query parameter
+ "&python" # produces &python3.11 instead of &python=3.11 + python_version
This is inherited from main but the PR is modifying this exact function. Boy scout rule applies.
10. Import-time HTTP request with 10-second timeout
When telemetry IS enabled, requests.get(..., timeout=10) runs at import time. In air-gapped environments or slow networks, import unstructured blocks for up to 10 seconds. Should be fire-and-forget in a daemon thread or use a much shorter timeout.
11. except Exception: pass is too broad
The catch-all silently swallows all exceptions including MemoryError, SystemError, etc. Should narrow to except (requests.RequestException, OSError) or at minimum log at debug level.
12. Missing test coverage for exception handling
No test verifies that the function handles requests.get exceptions gracefully.
13. Missing test coverage for explicit non-opt-in values
UNSTRUCTURED_TELEMETRY_ENABLED=false, =0, =yes, =on are not tested to verify they suppress telemetry.
14. URL assertions in tests are too shallow
Only checking "python-telemetry" in call_url and "version=" in call_url -- doesn't verify actual values, timeout parameter, or URL structure.
Nits
15. Redundant docstring + inline comment -- both say the same thing about "off by default, opt-in via env var."
16. subprocess.check_output("nvidia-smi") should use list form ["nvidia-smi"] and redirect stderr=subprocess.DEVNULL to avoid console noise.
17. Overly broad except Exception for nvidia-smi -- should narrow to except (FileNotFoundError, subprocess.CalledProcessError).
18. it_accepts_opt_in_value_1 should be a parametrized test covering "true", "True", "TRUE", "1" rather than a standalone method.
Recommended Refactored Implementation
All of issues 1-4, 6, 9, 11, 15-17 could be resolved with this rewrite:
def scarf_analytics(): """Send a lightweight analytics ping. Off by default. Set UNSTRUCTURED_TELEMETRY_ENABLED=true to opt in. Opt-out env vars (DO_NOT_TRACK, SCARF_NO_ANALYTICS) are always respected and take precedence. """ opt_out = ( os.getenv("SCARF_NO_ANALYTICS", "").lower() in ("true", "1") or bool(os.getenv("DO_NOT_TRACK", "")) ) opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1") if opt_out or not opt_in: return try: subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL) gpu_present = True except (FileNotFoundError, subprocess.CalledProcessError): gpu_present = False python_version = ".".join(platform.python_version().split(".")[:2]) try: requests.get( "https://packages.unstructured.io/python-telemetry", params={ "version": __version__, "platform": platform.system(), "python": python_version, "arch": platform.machine(), "gpu": str(gpu_present), "dev": str("dev" in __version__).lower(), }, timeout=10, ) except Exception: pass