Skip to content

perf(discovery): prune excluded subtrees during traversal#870

Open
Roozi489 wants to merge 3 commits intomicrosoft:mainfrom
Roozi489:perf/discovery-prune
Open

perf(discovery): prune excluded subtrees during traversal#870
Roozi489 wants to merge 3 commits intomicrosoft:mainfrom
Roozi489:perf/discovery-prune

Conversation

@Roozi489
Copy link
Copy Markdown

@Roozi489 Roozi489 commented Apr 23, 2026

APM compilation was too slow on large workspaces with many generated files (like Unreal Engine project, where it was more than 30 mins). These changes take into account excluded directories when scanning folders which in my case led (along with #871) to an at least 10x speedup for apm compile.

Replace glob.glob(recursive=True) in find_primitive_files() with os.walk and early directory pruning so compilation.exclude patterns prevent traversal into large subtrees. Adds DEFAULT_SKIP_DIRS constant. 125 new unit tests.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve primitive-file discovery performance by replacing glob.glob(recursive=True) with an os.walk traversal that can prune excluded directories early, reducing time spent in large excluded subtrees.

Changes:

  • Reworked find_primitive_files() to use os.walk() and prune directories using DEFAULT_SKIP_DIRS + exclude_patterns.
  • Added a shared DEFAULT_SKIP_DIRS constant for unconditional directory skipping during traversals.
  • Added a new unit test module covering the new walk-based discovery helpers and expected behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
src/apm_cli/primitives/discovery.py Switches primitive discovery from glob.glob to os.walk, adds _glob_match + early directory pruning hooks, and threads exclude_patterns into local discovery.
src/apm_cli/constants.py Introduces DEFAULT_SKIP_DIRS shared skip set for traversal pruning.
tests/unit/primitives/test_discovery_walk.py Adds tests for the new walk-based discovery behavior and related helpers (also asserts behavior for ContextOptimizer).
CHANGELOG.md Adds an Unreleased entry describing the discovery perf change and new shared constant.

Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +453 to +460
# Match files against the requested patterns
for file_name in files:
file_path = current / file_name
rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
for pattern in patterns:
if _glob_match(rel_str, pattern):
all_files.append(file_path)
break
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exclude_patterns are only used to prune directories, but are no longer applied to matched files. This changes behavior from the previous implementation (callers used to should_exclude() each candidate) and means file-level excludes like **/*.instructions.md will no longer work unless they exclude an entire directory. Consider filtering matched files with should_exclude(file_path, base_path, exclude_patterns) before adding/returning them, while keeping directory pruning for perf.

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +396 to +414
"""Match a forward-slash relative path against a glob pattern with ** support.

fnmatch treats * as matching everything (including /), so ** works for
one-or-more path segments. However, ** is also supposed to match *zero*
segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
We handle that by also trying the pattern with the leading ``**/``
stripped (the "zero-segments" case).
"""
if '**/' not in pattern:
return fnmatch.fnmatch(rel_path, pattern)

prefix, suffix = pattern.split('**/', 1)
# Zero-segment case: ** matches nothing, so collapse it away
if fnmatch.fnmatch(rel_path, prefix + suffix):
return True
# One-or-more segment case: fnmatch * already crosses /
return fnmatch.fnmatch(rel_path, pattern)


Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_glob_match() uses fnmatch.fnmatch() on the full path string, but fnmatch allows * to match path separators. That makes patterns like **/.apm/instructions/*.instructions.md also match deeper paths such as .apm/instructions/subdir/x.instructions.md, which differs from glob semantics and could cause over-inclusion. Consider switching to a segment-aware matcher (e.g., compare path parts, or reuse the utils.exclude recursive matcher logic) so * does not cross directory boundaries.

Suggested change
"""Match a forward-slash relative path against a glob pattern with ** support.
fnmatch treats * as matching everything (including /), so ** works for
one-or-more path segments. However, ** is also supposed to match *zero*
segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
We handle that by also trying the pattern with the leading ``**/``
stripped (the "zero-segments" case).
"""
if '**/' not in pattern:
return fnmatch.fnmatch(rel_path, pattern)
prefix, suffix = pattern.split('**/', 1)
# Zero-segment case: ** matches nothing, so collapse it away
if fnmatch.fnmatch(rel_path, prefix + suffix):
return True
# One-or-more segment case: fnmatch * already crosses /
return fnmatch.fnmatch(rel_path, pattern)
"""Match a forward-slash relative path against a glob pattern.
This matcher is segment-aware: ``*`` and ``?`` only match within a single
path segment, while ``**`` matches zero or more complete segments. This
preserves standard glob semantics for patterns such as
``**/.apm/instructions/*.instructions.md`` so that the final ``*`` does
not match through nested directories.
Args:
rel_path (str): Relative path using forward slashes.
pattern (str): Glob pattern using forward slashes.
Returns:
bool: True when the path matches the pattern, otherwise False.
"""
path_parts: List[str] = [part for part in rel_path.split("/") if part]
pattern_parts: List[str] = [part for part in pattern.split("/") if part]
memo: Dict[tuple[int, int], bool] = {}
def _match(path_index: int, pattern_index: int) -> bool:
"""Recursively match path and pattern segments."""
key = (path_index, pattern_index)
if key in memo:
return memo[key]
if pattern_index == len(pattern_parts):
result = path_index == len(path_parts)
memo[key] = result
return result
current_pattern = pattern_parts[pattern_index]
if current_pattern == "**":
result = _match(path_index, pattern_index + 1)
if not result and path_index < len(path_parts):
result = _match(path_index + 1, pattern_index)
memo[key] = result
return result
if path_index >= len(path_parts):
memo[key] = False
return False
result = (
fnmatch.fnmatchcase(path_parts[path_index], current_pattern)
and _match(path_index + 1, pattern_index + 1)
)
memo[key] = result
return result
return _match(0, 0)

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +453 to +456
# Match files against the requested patterns
for file_name in files:
file_path = current / file_name
rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.walk() does not guarantee ordering for the files list. Since the previous glob.glob() path was deterministically sorted, this can introduce nondeterministic discovery order across platforms/filesystems. Consider iterating for file_name in sorted(files): (and/or sorting the final valid_files) to keep results stable.

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/constants.py Outdated
Comment on lines +35 to +53
# Directory names unconditionally skipped during file-tree traversal.
# These never contain APM primitives or user source files and can be
# very large (e.g. node_modules, .git objects). Used by discovery,
# compilation, and content hashing to avoid expensive walks.
# NOTE: .apm is intentionally absent -- it is where primitives live.
# ---------------------------------------------------------------------------
DEFAULT_SKIP_DIRS: frozenset = frozenset({
".git",
"node_modules",
"__pycache__",
".pytest_cache",
".venv",
"venv",
".tox",
"build",
"dist",
".mypy_cache",
"apm_modules",
})
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says DEFAULT_SKIP_DIRS is used by "discovery, compilation, and content hashing", but in this PR it is only referenced from primitives/discovery.py (no other DEFAULT_SKIP_DIRS usages found). Either update the compilation/content-hashing walkers to use this shared constant, or adjust the comment to avoid promising behavior that isn't implemented.

Copilot uses AI. Check for mistakes.
Comment on lines +241 to +261
def test_cached_glob_respects_exclude_patterns(self):
"""_cached_glob should not return files under excluded directories."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

# Create files in included and excluded dirs
(self.base / "src").mkdir()
(self.base / "src" / "app.py").touch()
(self.base / "vendor" / "lib").mkdir(parents=True)
(self.base / "vendor" / "lib" / "dep.py").touch()

optimizer = ContextOptimizer(
base_dir=str(self.base),
exclude_patterns=["vendor"],
)

matches = optimizer._cached_glob("**/*.py")
match_strs = [m.replace("\\", "/") for m in matches]

self.assertTrue(any("src/app.py" in m for m in match_strs))
self.assertFalse(any("vendor" in m for m in match_strs))

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests assume ContextOptimizer._cached_glob() filters via a pre-built file list and respects exclude_patterns, but the current implementation uses glob.glob(..., recursive=True) and does not filter out excluded directories. As written, this test will fail unless ContextOptimizer is updated accordingly (or the test is rewritten to match the actual behavior/entry point being changed).

Copilot uses AI. Check for mistakes.
Comment on lines +272 to +309
def test_directory_files_cache_skips_default_dirs(self):
"""_directory_files_cache must not include files from DEFAULT_SKIP_DIRS."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

(self.base / "src").mkdir()
(self.base / "src" / "ok.py").touch()
(self.base / "node_modules" / "pkg").mkdir(parents=True)
(self.base / "node_modules" / "pkg" / "bad.js").touch()
(self.base / "__pycache__").mkdir()
(self.base / "__pycache__" / "mod.pyc").touch()

optimizer = ContextOptimizer(base_dir=str(self.base))
optimizer._analyze_project_structure()
all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

self.assertTrue(any("ok.py" in s for s in all_files))
self.assertFalse(any("node_modules" in s for s in all_files))
self.assertFalse(any("__pycache__" in s for s in all_files))

def test_directory_files_cache_skips_custom_excludes(self):
"""_directory_files_cache must also respect user-supplied exclude_patterns."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

(self.base / "src").mkdir()
(self.base / "src" / "ok.py").touch()
(self.base / "Binaries" / "Win64").mkdir(parents=True)
(self.base / "Binaries" / "Win64" / "huge.dll").touch()

optimizer = ContextOptimizer(
base_dir=str(self.base),
exclude_patterns=["Binaries"],
)
optimizer._analyze_project_structure()
all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

self.assertTrue(any("ok.py" in s for s in all_files))
self.assertFalse(any("Binaries" in s for s in all_files))

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContextOptimizer does not define _directory_files_cache anywhere (searching the codebase finds no such attribute), so these tests will raise AttributeError. Either update ContextOptimizer to expose/maintain the cache being asserted here, or change the test to assert against the actual data structure it uses (e.g., _file_list_cache / _directory_cache, depending on the intended contract).

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md Outdated

### Changed

- `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation.
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entry does not follow the repo's Keep-a-Changelog rule of one line per PR ending with a PR number (e.g., (#123)), and it also claims DEFAULT_SKIP_DIRS is used by compilation even though this PR introduces no DEFAULT_SKIP_DIRS usage outside primitives/discovery.py. Please add the PR reference and align the wording with what actually changed (or update compilation to use the shared constant).

Copilot generated this review using guidance from repository custom instructions.
@Roozi489
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Replace glob.glob(recursive=True) in find_primitive_files() with an
os.walk loop that prunes directories before descending into them.

This allows compilation.exclude patterns (and the new DEFAULT_SKIP_DIRS
constant) to short-circuit traversal into large subtrees -- e.g. a UE5
game repo with 265 K files no longer hangs because node_modules / build
/ apm_modules are skipped at the first encountered node.

Changes:
- constants.py: add DEFAULT_SKIP_DIRS frozenset
- discovery.py: os.walk + early prune; thread exclude_patterns into
  traversal; add segment-aware _glob_match() helper that prevents *
  from crossing path separators (preserves real glob semantics);
  apply file-level exclude per file; sort os.walk file lists for
  deterministic discovery
- test_discovery_walk.py: new unit tests covering prune behaviour,
  exclude_patterns (directory + file level), symlink rejection, and
  segment-aware ** / * pattern matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment thread src/apm_cli/constants.py
Comment on lines +41 to +53
DEFAULT_SKIP_DIRS: frozenset = frozenset({
".git",
"node_modules",
"__pycache__",
".pytest_cache",
".venv",
"venv",
".tox",
"build",
"dist",
".mypy_cache",
"apm_modules",
})
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_SKIP_DIRS is annotated as an unparameterized frozenset, which loses type information for static checking. Consider annotating it as frozenset[str] (and optionally typing.Final) so callers get correct types and IDE help.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label Apr 23, 2026
@github-actions
Copy link
Copy Markdown

APM Review Panel Verdict

Disposition: APPROVE


Per-persona findings

Python Architect: Clean, well-scoped change. Replacing glob.glob(recursive=True) with os.walk + early pruning is the correct architectural move -- O(work) is now proportional to the files actually visited. Extracting DEFAULT_SKIP_DIRS to constants.py eliminates duplication between the old inline set in _should_skip_directory and the new traversal logic; the existing function now delegates cleanly and is retained only because tests/unit/primitives/test_discovery_parser.py still imports it (acceptable). One item of technical debt: the _is_under_directory(file_path, apm_modules_path) guard in scan_local_primitives (line 170) is now dead code -- apm_modules is in DEFAULT_SKIP_DIRS, so find_primitive_files prunes it before any file is collected. Not a bug, but worth a comment or follow-up cleanup. _glob_match is a focused, well-memoized recursive matcher; the logic is correct and tests cover the key boundary cases.

CLI Logging Expert: No concerns. No changes to output architecture. The two logger.debug("Excluded by pattern: %s", file_path) calls follow the established pattern and are routed through the module logger, not _rich_* helpers. Nothing to action here.

DevX UX Expert: No CLI surface changes; this is a pure internal performance fix. The CHANGELOG entry is clear and user-facing ("compilation.exclude patterns prevent traversal into excluded subtrees on large repos"). The UE5 use case in the commit body (265K files, no longer hangs) is the right level of concrete detail for release notes. No docs update required for an internal implementation change. Positive impact for enterprise and game-dev adoption segments.

Supply Chain Security Expert: Security posture maintained. os.walk defaults to followlinks=False, so symlinked directories are not descended -- no directory traversal via symlinked subtrees. Symlinked files collected in the walk are still rejected in the post-filter pass (file_path.is_symlink() check), same guarantee as before. base_path = Path(base_dir).resolve() is the correct anchoring idiom (symlink-resolved absolute path). One observation: user-supplied exclude_patterns from apm.yml could theoretically match .apm itself (e.g., exclude: [".apm"]), which would suppress primitive discovery entirely. This is user-misconfiguration, not a security vulnerability, and the test test_apm_dir_not_skipped correctly validates that .apm is absent from DEFAULT_SKIP_DIRS. No new auth or download surfaces touched. No path-security regressions.

Auth Expert: No authentication surfaces touched. No findings.

OSS Growth Hacker: Strong external-contributor story. Martin Matous is an outside contributor who filed a real pain-point fix (compile hangs on large monorepos) and iterated responsibly with a follow-up commit (platform-aware case matching). The UE5 angle (265K-file game repos) is a compelling vertical that APM should amplify in release notes and community posts. Side-channel to CEO: this is a good candidate for a "spotlight" callout in the vX.Y release announcement -- "first external contributor, $industry use case unlocked."

APM CEO: This is a textbook good PR -- focused scope, real-world motivation, external contributor, CHANGELOG present, no breaking changes. The fnmatch.fnmatch vs fnmatch.fnmatchcase switch in the follow-up commit is correct: on Linux both are case-sensitive (verified), on Windows fnmatch.fnmatch restores the case-insensitive behavior that the previous glob.glob implementation had. No positioning risk. No deprecation ceremony needed. Approve.


CEO arbitration

No specialist disagreements to resolve. All five active reviewers reached independent positive verdicts. The only items surfaced are minor technical debt (dead _is_under_directory guard) and an observation-only security note (user-misconfiguration path), neither of which blocks the merge. The performance motivation is real and externally validated; the implementation is proportional and does not over-engineer. Ship it.


Required actions before merge

None blocking. The PR is mergeable as-is.


Optional follow-ups

  • scan_local_primitives line 170: the _is_under_directory(file_path, apm_modules_path) guard is now dead code since apm_modules is pruned by DEFAULT_SKIP_DIRS at walk time. Either add a # defensive: apm_modules already pruned by DEFAULT_SKIP_DIRS comment or remove it in a follow-up cleanup PR.
  • Consider adding a note to docs/src/content/docs/reference/cli-commands.md or the compilation reference that compilation.exclude patterns are now applied at traversal time (not just post-filter), which means they affect performance as well as output. Low priority but useful for power users configuring large monorepos.
  • Release notes for the next version should highlight the UE5 / large-repo fix explicitly and credit the external contributor.

Generated by PR Review Panel for issue #870 · ● 666K ·

Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @Roozi489 !

@danielmeppiel danielmeppiel self-requested a review April 23, 2026 21:03
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI/CD failed, please correct per below so that we avoid cross-platform issues in Windows:

src/apm_cli/primitives/discovery.py:493:            rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
Error: Found raw str(path.relative_to()) calls. Use portable_relpath() from apm_cli.utils.paths instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

panel-review Trigger the apm-review-panel gh-aw workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants