Symptom
tests/unit/policy/test_cache_atomicity.py::TestCacheAtomicity::test_concurrent_writers_no_torn_files fails intermittently on the Windows unit-test job in build-release.yml with:
apm_cli.utils.path_security.PathTraversalError: Policy cache path
'C:\Users\RUNNER~1\AppData\Local\Temp\tmpu3aaqijk\apm_modules\.policy-cache'
resolves outside project root
'C:\Users\RUNNER~1\AppData\Local\Temp\tmpu3aaqijk'
-- refusing to read or write the cache here.
First observed in run 24861961730 (post-merge, commit 8665f4b4, PR #885). Linux + macOS jobs are unaffected.
Root cause
In src/apm_cli/policy/discovery.py::_get_cache_dir (introduced in #832), the path-security guard does:
def _get_cache_dir(project_root: Path) -> Path:
base = project_root / "apm_modules"
candidate = base / POLICY_CACHE_DIR
try:
ensure_path_within(candidate, project_root)
except PathTraversalError:
raise PathTraversalError(...)
ensure_path_within resolves the candidate path through Path.resolve() (which on Windows expands 8.3 short names like RUNNER~1 to the long form runneradmin and follows symlinks), but the project_root argument is compared as-passed.
When the test calls it with tempfile.mkdtemp(), Windows returns the short-name form (C:\Users\RUNNER~1\...). The candidate resolves to C:\Users\runneradmin\...\apm_modules\.policy-cache, the project root stays C:\Users\RUNNER~1\..., the prefix check fails, and PathTraversalError is raised.
The test is concurrent (ThreadPoolExecutor), which is why it manifests racily -- depending on which thread wins to resolve first / on the runner's filesystem state, sometimes one or more threads see the short-name path and trip the guard.
Why now
The bug has been latent since #832 merged. It surfaces only when:
- The Windows runner's
tempfile.mkdtemp() returns a path with a ~1 short-name component (depends on user profile name length on the specific runner image).
- The concurrent test path
_write_cache -> _get_cache_dir runs.
Recent main runs that ran the same Windows job have been mostly green by luck.
Suggested fix
Resolve project_root once at the top of _get_cache_dir before passing it to ensure_path_within, so both sides are normalized:
def _get_cache_dir(project_root: Path) -> Path:
project_root = project_root.resolve()
base = project_root / "apm_modules"
candidate = base / POLICY_CACHE_DIR
try:
ensure_path_within(candidate, project_root)
except PathTraversalError:
raise PathTraversalError(...)
...
Optionally, the same defensive normalization belongs inside ensure_path_within itself in src/apm_cli/utils/path_security.py so future callers can't repeat the mistake -- but the local fix is sufficient to unblock the test.
How to verify
Add a regression test that constructs a project_root with an unresolved component (e.g. via tempfile.mkdtemp() on Windows, or a deliberately-symlinked tmp dir on POSIX) and asserts _get_cache_dir does not raise.
Run the existing concurrent test under stress (pytest --count=20 tests/unit/policy/test_cache_atomicity.py with pytest-repeat) on Windows.
Scope notes
- Touch only
src/apm_cli/policy/discovery.py::_get_cache_dir (and possibly path_security.py if going for the defensive fix).
- Add a focused regression test, not a refactor of the cache layer.
- No behavior change on Linux / macOS.
Refs
Symptom
tests/unit/policy/test_cache_atomicity.py::TestCacheAtomicity::test_concurrent_writers_no_torn_filesfails intermittently on the Windows unit-test job inbuild-release.ymlwith:First observed in run 24861961730 (post-merge, commit
8665f4b4, PR #885). Linux + macOS jobs are unaffected.Root cause
In
src/apm_cli/policy/discovery.py::_get_cache_dir(introduced in #832), the path-security guard does:ensure_path_withinresolves the candidate path throughPath.resolve()(which on Windows expands 8.3 short names likeRUNNER~1to the long formrunneradminand follows symlinks), but theproject_rootargument is compared as-passed.When the test calls it with
tempfile.mkdtemp(), Windows returns the short-name form (C:\Users\RUNNER~1\...). The candidate resolves toC:\Users\runneradmin\...\apm_modules\.policy-cache, the project root staysC:\Users\RUNNER~1\..., the prefix check fails, andPathTraversalErroris raised.The test is concurrent (
ThreadPoolExecutor), which is why it manifests racily -- depending on which thread wins to resolve first / on the runner's filesystem state, sometimes one or more threads see the short-name path and trip the guard.Why now
The bug has been latent since #832 merged. It surfaces only when:
tempfile.mkdtemp()returns a path with a~1short-name component (depends on user profile name length on the specific runner image)._write_cache -> _get_cache_dirruns.Recent main runs that ran the same Windows job have been mostly green by luck.
Suggested fix
Resolve
project_rootonce at the top of_get_cache_dirbefore passing it toensure_path_within, so both sides are normalized:Optionally, the same defensive normalization belongs inside
ensure_path_withinitself insrc/apm_cli/utils/path_security.pyso future callers can't repeat the mistake -- but the local fix is sufficient to unblock the test.How to verify
Add a regression test that constructs a
project_rootwith an unresolved component (e.g. viatempfile.mkdtemp()on Windows, or a deliberately-symlinked tmp dir on POSIX) and asserts_get_cache_dirdoes not raise.Run the existing concurrent test under stress (
pytest --count=20 tests/unit/policy/test_cache_atomicity.pywithpytest-repeat) on Windows.Scope notes
src/apm_cli/policy/discovery.py::_get_cache_dir(and possiblypath_security.pyif going for the defensive fix).Refs
8665f4b4)src/apm_cli/utils/path_security.py::ensure_path_within