Skip to content

Windows: _get_cache_dir flakes on tmpdir 8.3 short names (PathTraversalError) #886

@danielmeppiel

Description

@danielmeppiel

Symptom

tests/unit/policy/test_cache_atomicity.py::TestCacheAtomicity::test_concurrent_writers_no_torn_files fails intermittently on the Windows unit-test job in build-release.yml with:

apm_cli.utils.path_security.PathTraversalError: Policy cache path 
'C:\Users\RUNNER~1\AppData\Local\Temp\tmpu3aaqijk\apm_modules\.policy-cache' 
resolves outside project root 
'C:\Users\RUNNER~1\AppData\Local\Temp\tmpu3aaqijk' 
-- refusing to read or write the cache here.

First observed in run 24861961730 (post-merge, commit 8665f4b4, PR #885). Linux + macOS jobs are unaffected.

Root cause

In src/apm_cli/policy/discovery.py::_get_cache_dir (introduced in #832), the path-security guard does:

def _get_cache_dir(project_root: Path) -> Path:
    base = project_root / "apm_modules"
    candidate = base / POLICY_CACHE_DIR
    try:
        ensure_path_within(candidate, project_root)
    except PathTraversalError:
        raise PathTraversalError(...)

ensure_path_within resolves the candidate path through Path.resolve() (which on Windows expands 8.3 short names like RUNNER~1 to the long form runneradmin and follows symlinks), but the project_root argument is compared as-passed.

When the test calls it with tempfile.mkdtemp(), Windows returns the short-name form (C:\Users\RUNNER~1\...). The candidate resolves to C:\Users\runneradmin\...\apm_modules\.policy-cache, the project root stays C:\Users\RUNNER~1\..., the prefix check fails, and PathTraversalError is raised.

The test is concurrent (ThreadPoolExecutor), which is why it manifests racily -- depending on which thread wins to resolve first / on the runner's filesystem state, sometimes one or more threads see the short-name path and trip the guard.

Why now

The bug has been latent since #832 merged. It surfaces only when:

  1. The Windows runner's tempfile.mkdtemp() returns a path with a ~1 short-name component (depends on user profile name length on the specific runner image).
  2. The concurrent test path _write_cache -> _get_cache_dir runs.

Recent main runs that ran the same Windows job have been mostly green by luck.

Suggested fix

Resolve project_root once at the top of _get_cache_dir before passing it to ensure_path_within, so both sides are normalized:

def _get_cache_dir(project_root: Path) -> Path:
    project_root = project_root.resolve()
    base = project_root / "apm_modules"
    candidate = base / POLICY_CACHE_DIR
    try:
        ensure_path_within(candidate, project_root)
    except PathTraversalError:
        raise PathTraversalError(...)
    ...

Optionally, the same defensive normalization belongs inside ensure_path_within itself in src/apm_cli/utils/path_security.py so future callers can't repeat the mistake -- but the local fix is sufficient to unblock the test.

How to verify

Add a regression test that constructs a project_root with an unresolved component (e.g. via tempfile.mkdtemp() on Windows, or a deliberately-symlinked tmp dir on POSIX) and asserts _get_cache_dir does not raise.

Run the existing concurrent test under stress (pytest --count=20 tests/unit/policy/test_cache_atomicity.py with pytest-repeat) on Windows.

Scope notes

  • Touch only src/apm_cli/policy/discovery.py::_get_cache_dir (and possibly path_security.py if going for the defensive fix).
  • Add a focused regression test, not a refactor of the cache layer.
  • No behavior change on Linux / macOS.

Refs

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions