feat(parser): shebang-based language detection for extension-less scripts (#237) by azizur100389 · Pull Request #238 · tirth8205/code-review-graph

azizur100389 · 2026-04-11T23:22:51Z

Summary

Adds a shebang fallback to CodeParser.detect_language() so extension-less Unix scripts are routed to the correct tree-sitter grammar based on their first line. Closes #237.

| Common path | Typical shebang | Result after this PR |
| --- | --- | --- |
| `.git/hooks/pre-commit` | `#!/bin/bash` | Parsed as bash |
| `bin/myapp` | `#!/usr/bin/env python3` | Parsed as python |
| `scripts/deploy` | `#!/bin/sh` | Parsed as bash |
| `.husky/pre-push` | `#!/usr/bin/env sh` | Parsed as bash |
| `tools/bootstrap` | `#!/usr/bin/env node` | Parsed as javascript |

Root cause

detect_language() was a single-line lookup against EXTENSION_TO_LANGUAGE. Any file with no extension returned None, which filters it out of both incremental_update() and full_build() before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling — all currently invisible to callers_of, get_impact_radius, detect_changes, and architecture mapping.
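To make the root cause concrete, here is a minimal sketch of the extension-only lookup as described above. The table contents are an illustrative subset and `detect_language_extension_only` is a hypothetical name, not the project's actual code.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset only (assumption); the real table in parser.py is larger.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript", ".sh": "bash"}


def detect_language_extension_only(path: Path) -> Optional[str]:
    """Hypothetical stand-in for the pre-fix lookup keyed on the suffix."""
    return EXTENSION_TO_LANGUAGE.get(path.suffix.lower())


for name in ("src/app.py", "scripts/deploy", ".git/hooks/pre-commit"):
    # Extension-less paths have an empty suffix, so the lookup yields None.
    print(name, "->", detect_language_extension_only(Path(name)))
```

Under these assumptions, only `src/app.py` resolves to a language; the two extension-less paths come back as `None` and would be filtered out before parsing.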

Fix

1. New module-level SHEBANG_INTERPRETER_TO_LANGUAGE table mapping common interpreter basenames to languages already registered in EXTENSION_TO_LANGUAGE. The table strictly routes extension-less scripts to existing grammars; it does not introduce new languages:
   - bash / sh / zsh / ksh / dash / ash → "bash"
   - python / python2 / python3 / pypy / pypy3 → "python"
   - node / nodejs → "javascript"
   - ruby, perl, lua, Rscript, php
2. New _SHEBANG_PROBE_BYTES = 256: the maximum number of bytes read from the head of a file when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.
3. New CodeParser._detect_language_from_shebang(path) static method. It opens the file, reads up to 256 bytes, verifies the #! prefix, cuts the line at the first newline and at the first NUL byte (defensive against binary content), and decodes UTF-8 strictly so that malformed content returns None instead of raising. Handles:
   - Direct form: #!/bin/bash
   - /usr/bin/env indirection: #!/usr/bin/env bash
   - Linux env -S flag: #!/usr/bin/env -S node --experimental-vm-modules
   - Trailing flags: #!/bin/bash -e
   - Interpreter basename extraction from any absolute path
   - CRLF line endings
4. detect_language(path) now tries the extension lookup first; if that returns None and path.suffix == "", it falls back to the shebang probe. Files with a known extension are never re-read, so extension-based detection remains authoritative.
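The steps above can be sketched roughly as follows. This is not the PR's code: the two tables are illustrative subsets, `language_from_shebang` is a hypothetical stand-in for `CodeParser._detect_language_from_shebang`, and the way `env` options are skipped is an assumption about one reasonable approach.

```python
from pathlib import Path
from typing import Optional

# Illustrative subsets (assumption); the real tables live in parser.py.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript", ".sh": "bash"}
SHEBANG_INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "ksh": "bash", "dash": "bash", "ash": "bash",
    "python": "python", "python2": "python", "python3": "python",
    "pypy": "python", "pypy3": "python",
    "node": "javascript", "nodejs": "javascript",
}
_SHEBANG_PROBE_BYTES = 256


def language_from_shebang(path: Path) -> Optional[str]:
    """Hypothetical probe: map a file's shebang line to a known language."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(_SHEBANG_PROBE_BYTES)
    except OSError:
        return None
    if not head.startswith(b"#!"):
        return None
    # Keep the first line only, then cut at the first NUL in case of binary data.
    line = head[2:].split(b"\n", 1)[0].split(b"\0", 1)[0]
    try:
        text = line.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError:
        return None
    tokens = text.split()  # whitespace split also drops a trailing "\r" (CRLF)
    if not tokens:
        return None
    interpreter = tokens[0].rsplit("/", 1)[-1]
    if interpreter == "env":
        # Assumed handling: skip options such as -S and take the first non-flag token.
        target = next((t for t in tokens[1:] if not t.startswith("-")), None)
        if target is None:
            return None
        interpreter = target.rsplit("/", 1)[-1]
    return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)


def detect_language(path: Path) -> Optional[str]:
    """Extension lookup first; shebang probe only for suffix-less paths."""
    language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
    if language is None and path.suffix == "":
        language = language_from_shebang(path)
    return language
```

Because the probe runs only when `path.suffix == ""`, a file such as `misleading.py` with a `#!/bin/bash` first line is still classified by its extension in this sketch, and the file is never opened.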

Non-regressions guaranteed by design

- .py files still parse as Python even if the first line is a misleading #!/bin/bash (locked in by test_detect_shebang_does_not_override_extension).
- Extension-less README / LICENSE files return None after a cheap 256-byte read that finds no shebang.
- Binary files whose first bytes are not #! return None without raising.
- Unknown interpreters (e.g. #!/usr/bin/env ocaml) return None, the same semantics as an unmapped extension.
- No performance impact on files with a known extension: the shebang probe only runs for the path.suffix == "" branch.

Tests added (`tests/test_parser.py::TestCodeParser` — 16 tests)

test_detect_shebang_bin_bash
test_detect_shebang_bin_sh_routed_to_bash
test_detect_shebang_env_bash
test_detect_shebang_env_python3
test_detect_shebang_direct_python
test_detect_shebang_node
test_detect_shebang_env_dash_s_flag
test_detect_shebang_ruby
test_detect_shebang_perl
test_detect_shebang_with_trailing_flags
test_detect_shebang_missing_returns_none
test_detect_shebang_empty_file_returns_none
test_detect_shebang_binary_content_returns_none
test_detect_shebang_unknown_interpreter_returns_none
test_detect_shebang_does_not_override_extension (regression guard against extension override)
test_parse_shebang_script_produces_function_nodes — end-to-end parse_file() check: an extension-less bash script is detected AND parsed into File + Function nodes, all tagged language="bash".

Test results

| Stage | Result |
| --- | --- |
| Stage 1: new targeted shebang tests | 16/16 passed |
| Stage 2: `tests/test_parser.py` full | 83/83 passed |
| Stage 3: adjacent `tests/test_multilang.py` | 151/151 passed |
| Stage 4: full suite | 748 passed (up from the 733 baseline, +15 net); 8 pre-existing Windows failures in `test_incremental`/`test_main`/`test_notebook` (verified identical on unchanged `main`) |
| Stage 5: `ruff check` | `code_review_graph/parser.py`: clean. `tests/test_parser.py`: 1 pre-existing `F841` on line 1038 (in `test_map_dispatch_qualified_reference`, unrelated to this PR; reproducible on unchanged `main` at line 901) |

Zero regressions. Purely additive fallback that only fires for files with no extension.

…ipts (tirth8205#237)

Add a shebang fallback to `CodeParser.detect_language()` so that extension-less Unix scripts (`bin/myapp`, `.git/hooks/pre-commit`, `scripts/deploy`, `.husky/pre-push`, `installer`, ...) are routed to the correct tree-sitter grammar based on their first line.

Root cause of tirth8205#237
------------------
`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE` keyed on `path.suffix.lower()`. Any file with no extension returned `None`, which filters it out of both `incremental_update()` and `full_build()` before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling, all currently invisible to `callers_of`, `get_impact_radius`, `detect_changes`, and architecture mapping.

Fix
---
1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common interpreter basenames to languages that are *already* registered:
   - bash / sh / zsh / ksh / dash / ash -> "bash"
   - python / python2 / python3 / pypy / pypy3 -> "python"
   - node / nodejs -> "javascript"
   - ruby, perl, lua, Rscript, php
   This file strictly *routes* extension-less files to existing languages; it does NOT introduce new grammars.

2. New `_SHEBANG_PROBE_BYTES = 256` constant: maximum bytes read from the head of a file when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.

3. New `CodeParser._detect_language_from_shebang(path)` static method. Opens the file, reads up to 256 bytes, verifies `#!` prefix, splits on the first newline AND first NUL byte (defensive against binary), and decodes UTF-8 strictly so malformed content returns None instead of raising. Handles:
   - direct form            #!/bin/bash
   - env indirection        #!/usr/bin/env bash
   - env -S flag (Linux)    #!/usr/bin/env -S node --experimental-vm-modules
   - trailing flags         #!/bin/bash -e
   - interpreter basename extraction from any absolute path
   - CRLF line endings (`.split(b"\n", 1)`)

4. `detect_language(path)` now tries the extension lookup first, and if it returns None AND `path.suffix == ""`, falls back to the shebang probe. Files with a *known* extension are NEVER re-read; extension-based detection remains authoritative.

Non-regressions guaranteed by the design
----------------------------------------
- `.py` files still parse as Python even if the first line is a misleading `#!/bin/bash` (`test_detect_shebang_does_not_override_extension`)
- Extension-less README / LICENSE files return None with a 256-byte read that finds no shebang.
- Binary files whose first bytes are not `#!` return None without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return None, the same semantics as an unmapped extension.

Tests added (tests/test_parser.py::TestCodeParser, 16 tests)
------------------------------------------------------------
- test_detect_shebang_bin_bash
- test_detect_shebang_bin_sh_routed_to_bash
- test_detect_shebang_env_bash
- test_detect_shebang_env_python3
- test_detect_shebang_direct_python
- test_detect_shebang_node
- test_detect_shebang_env_dash_s_flag
- test_detect_shebang_ruby
- test_detect_shebang_perl
- test_detect_shebang_with_trailing_flags
- test_detect_shebang_missing_returns_none
- test_detect_shebang_empty_file_returns_none
- test_detect_shebang_binary_content_returns_none
- test_detect_shebang_unknown_interpreter_returns_none
- test_detect_shebang_does_not_override_extension
- test_parse_shebang_script_produces_function_nodes (end-to-end parse_file check: extension-less bash script is detected AND parsed into File + Function nodes, all tagged language="bash")

Test results
------------
Stage 1 (new targeted shebang tests): 16/16 passed.
Stage 2 (tests/test_parser.py full): 83/83 passed.
Stage 3 (tests/test_multilang.py adjacent): 151/151 passed.
Stage 4 (full suite): 748 passed (up from 733), 8 pre-existing Windows failures in test_incremental (3) + test_main async coroutine detection (1) + test_notebook Databricks (4), verified identical on unchanged main.
Stage 5 (ruff check):
- code_review_graph/parser.py: clean
- tests/test_parser.py: 1 pre-existing F841 on line 1038 (test_map_dispatch_qualified_reference, unrelated to this PR; reproducible on unchanged main at line 901).

Zero regressions. Purely additive fallback that only fires for files with no extension.

azizur100389 wants to merge 1 commit into tirth8205:main from azizur100389:feat/shebang-detection
