feat(parser): shebang-based language detection for extension-less scripts (#237) #238
Open
azizur100389 wants to merge 1 commit into tirth8205:main
…ipts (tirth8205#237)

Add a shebang fallback to `CodeParser.detect_language()` so that extension-less Unix scripts (`bin/myapp`, `.git/hooks/pre-commit`, `scripts/deploy`, `.husky/pre-push`, `installer`, ...) are routed to the correct tree-sitter grammar based on their first line.

Root cause of tirth8205#237
------------------
`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE` keyed on `path.suffix.lower()`. Any file with no extension returned `None`, which filters it out of both `incremental_update()` and `full_build()` before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling, all of which are currently invisible to `callers_of`, `get_impact_radius`, `detect_changes`, and architecture mapping.

Fix
---
1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common interpreter basenames to languages that are *already* registered:
   - bash / sh / zsh / ksh / dash / ash -> "bash"
   - python / python2 / python3 / pypy / pypy3 -> "python"
   - node / nodejs -> "javascript"
   - ruby, perl, lua, Rscript, php
   This file strictly *routes* extension-less files to existing languages; it does NOT introduce new grammars.
2. New `_SHEBANG_PROBE_BYTES = 256` constant: the maximum number of bytes read from the head of a file when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.
3. New `CodeParser._detect_language_from_shebang(path)` static method. Opens the file, reads up to 256 bytes, verifies the `#!` prefix, splits on the first newline AND first NUL byte (defensive against binary), and decodes UTF-8 strictly so malformed content returns None instead of raising. Handles:
   - direct form              #!/bin/bash
   - env indirection          #!/usr/bin/env bash
   - env -S flag (Linux)      #!/usr/bin/env -S node --experimental-vm-modules
   - trailing flags           #!/bin/bash -e
   - interpreter basename extraction from any absolute path
   - CRLF line endings (`.split(b"\n", 1)`)
4. `detect_language(path)` now tries the extension lookup first, and if it returns None AND `path.suffix == ""`, falls back to the shebang probe. Files with a *known* extension are NEVER re-read; extension-based detection remains authoritative.

Non-regressions guaranteed by the design
----------------------------------------
- `.py` files still parse as Python even if the first line is a misleading `#!/bin/bash` (`test_detect_shebang_does_not_override_extension`).
- Extension-less README / LICENSE files return None with a 256-byte read that finds no shebang.
- Binary files whose first bytes are not `#!` return None without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return None, the same semantics as an unmapped extension.

Tests added (tests/test_parser.py::TestCodeParser, 16 tests)
-------------------------------------------------------------
- test_detect_shebang_bin_bash
- test_detect_shebang_bin_sh_routed_to_bash
- test_detect_shebang_env_bash
- test_detect_shebang_env_python3
- test_detect_shebang_direct_python
- test_detect_shebang_node
- test_detect_shebang_env_dash_s_flag
- test_detect_shebang_ruby
- test_detect_shebang_perl
- test_detect_shebang_with_trailing_flags
- test_detect_shebang_missing_returns_none
- test_detect_shebang_empty_file_returns_none
- test_detect_shebang_binary_content_returns_none
- test_detect_shebang_unknown_interpreter_returns_none
- test_detect_shebang_does_not_override_extension
- test_parse_shebang_script_produces_function_nodes (end-to-end parse_file check: extension-less bash script is detected AND parsed into File + Function nodes, all tagged language="bash")

Test results
------------
Stage 1 (new targeted shebang tests): 16/16 passed.
Stage 2 (tests/test_parser.py full): 83/83 passed.
Stage 3 (tests/test_multilang.py adjacent): 151/151 passed.
Stage 4 (full suite): 748 passed (up from 733), 8 pre-existing Windows failures in test_incremental (3) + test_main async coroutine detection (1) + test_notebook Databricks (4), verified identical on unchanged main.
Stage 5 (ruff check):
- code_review_graph/parser.py: clean
- tests/test_parser.py: 1 pre-existing F841 on line 1038 (test_map_dispatch_qualified_reference, unrelated to this PR; reproducible on unchanged main at line 901).

Zero regressions. Purely additive fallback that only fires for files with no extension.
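
For reviewers who want the probe from step 3 in one place, here is a rough standalone sketch of the logic as described in this message. It is not the code in `code_review_graph/parser.py`: the names `INTERPRETER_TO_LANGUAGE`, `PROBE_BYTES`, and `detect_language_from_shebang` are hypothetical, the table holds only the mappings spelled out above, and skipping every `-`-prefixed token after `env` is an assumed simplification of the `env -S` handling.

```python
from __future__ import annotations

import os
from pathlib import Path

# Hypothetical subset of the interpreter table: only the mappings
# that the commit message states explicitly.
INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "ksh": "bash", "dash": "bash", "ash": "bash",
    "python": "python", "python2": "python", "python3": "python",
    "pypy": "python", "pypy3": "python",
    "node": "javascript", "nodejs": "javascript",
}

# Upper bound on bytes read from the head of the file.
PROBE_BYTES = 256


def detect_language_from_shebang(path: Path) -> str | None:
    """Return a language name for a '#!' script, or None if undetermined."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(PROBE_BYTES)
    except OSError:
        # Missing or unreadable file: treat as "no language".
        return None

    if not head.startswith(b"#!"):
        return None

    # Keep only the first line, and cut at the first NUL byte so that
    # binary content cannot leak into the interpreter name.
    line = head[2:].split(b"\n", 1)[0].split(b"\0", 1)[0]
    try:
        text = line.decode("utf-8")  # strict decode: malformed bytes give None
    except UnicodeDecodeError:
        return None

    # str.split() with no argument also discards a trailing '\r' (CRLF files).
    tokens = text.split()
    if not tokens:
        return None

    interpreter = os.path.basename(tokens[0])
    if interpreter == "env":
        # Assumed simplification: skip flags such as -S and take the
        # first remaining token as the real interpreter.
        target = next((t for t in tokens[1:] if not t.startswith("-")), None)
        if target is None:
            return None
        interpreter = os.path.basename(target)

    return INTERPRETER_TO_LANGUAGE.get(interpreter)
```

Under these assumptions, a file starting with `#!/usr/bin/env -S node --experimental-vm-modules` resolves through `env` to `node`, while an unmapped interpreter such as `ocaml` falls through the final dictionary lookup and yields `None`.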
## Summary

Adds a shebang fallback to `CodeParser.detect_language()` so extension-less Unix scripts are routed to the correct tree-sitter grammar based on their first line. Closes #237.

| File | Shebang |
| --- | --- |
| `.git/hooks/pre-commit` | `#!/bin/bash` |
| `bin/myapp` | `#!/usr/bin/env python3` |
| `scripts/deploy` | `#!/bin/sh` |
| `.husky/pre-push` | `#!/usr/bin/env sh` |
| `tools/bootstrap` | `#!/usr/bin/env node` |

## Root cause

`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE`. Any file with no extension returned `None`, which filters it out of both `incremental_update()` and `full_build()` before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling, all of which are currently invisible to `callers_of`, `get_impact_radius`, `detect_changes`, and architecture mapping.

## Fix

1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common interpreter basenames to languages already registered in `EXTENSION_TO_LANGUAGE`. This file strictly routes extension-less scripts to existing grammars; it does not introduce new languages:
   - `bash` / `sh` / `zsh` / `ksh` / `dash` / `ash` → `"bash"`
   - `python` / `python2` / `python3` / `pypy` / `pypy3` → `"python"`
   - `node` / `nodejs` → `"javascript"`
   - `ruby`, `perl`, `lua`, `Rscript`, `php`
2. New `_SHEBANG_PROBE_BYTES = 256`: the maximum number of bytes read from the head of the file when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.
3. New `CodeParser._detect_language_from_shebang(path)` static method. Opens the file, reads up to 256 bytes, verifies the `#!` prefix, splits on the first newline AND first NUL byte (defensive against binary), and decodes UTF-8 strictly so malformed content returns `None` instead of raising. Handles:
   - Direct form: `#!/bin/bash`
   - `/usr/bin/env` indirection: `#!/usr/bin/env bash`
   - `env -S` flag: `#!/usr/bin/env -S node --experimental-vm-modules`
   - Trailing flags: `#!/bin/bash -e`
4. `detect_language(path)` now tries the extension lookup first, and if it returns `None` and `path.suffix == ""`, falls back to the shebang probe. Files with a known extension are never re-read; extension-based detection remains authoritative.

## Non-regressions guaranteed by design

- `.py` files still parse as Python even if the first line is a misleading `#!/bin/bash` (locked in by `test_detect_shebang_does_not_override_extension`).
- Extension-less README / LICENSE files return `None` after a cheap 256-byte read that finds no shebang.
- Binary files whose first bytes are not `#!` return `None` without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return `None`, the same semantics as an unmapped extension.
- … `path.suffix == ""` branch.

## Tests added (`tests/test_parser.py::TestCodeParser`, 16 tests)

- `test_detect_shebang_bin_bash`
- `test_detect_shebang_bin_sh_routed_to_bash`
- `test_detect_shebang_env_bash`
- `test_detect_shebang_env_python3`
- `test_detect_shebang_direct_python`
- `test_detect_shebang_node`
- `test_detect_shebang_env_dash_s_flag`
- `test_detect_shebang_ruby`
- `test_detect_shebang_perl`
- `test_detect_shebang_with_trailing_flags`
- `test_detect_shebang_missing_returns_none`
- `test_detect_shebang_empty_file_returns_none`
- `test_detect_shebang_binary_content_returns_none`
- `test_detect_shebang_unknown_interpreter_returns_none`
- `test_detect_shebang_does_not_override_extension` (regression guard against extension override)
- `test_parse_shebang_script_produces_function_nodes`: end-to-end `parse_file()` check that an extension-less bash script is detected AND parsed into File + Function nodes, all tagged `language="bash"`.

## Test results

| Stage | Scope | Result |
| --- | --- | --- |
| 1 | New targeted shebang tests | 16/16 passed |
| 2 | `tests/test_parser.py` full | 83/83 passed |
| 3 | `tests/test_multilang.py` | 151/151 passed |
| 4 | Full suite | 748 passed (up from 733); 8 pre-existing Windows failures in `test_incremental` / `test_main` / `test_notebook` (verified identical on unchanged `main`) |
| 5 | `ruff check` | `code_review_graph/parser.py`: clean. `tests/test_parser.py`: 1 pre-existing `F841` on line 1038 (in `test_map_dispatch_qualified_reference`, unrelated to this PR; reproducible on unchanged `main` at line 901) |

Zero regressions. Purely additive fallback that only fires for files with no extension.
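
The extension-first dispatch from step 4 of the fix, and the reason files with an extension never trigger a read, can be illustrated with a small sketch. This is not the PR's code: the three-entry `EXTENSION_TO_LANGUAGE` below is an illustrative stand-in for the real table, and passing the probe in as a `shebang_probe` callable is an assumption made here only so the example can record when the probe is invoked.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional

# Illustrative stand-in for the real extension table (contents assumed).
EXTENSION_TO_LANGUAGE = {".py": "python", ".sh": "bash", ".js": "javascript"}


def detect_language(
    path: Path,
    shebang_probe: Callable[[Path], Optional[str]],
) -> Optional[str]:
    """Extension lookup first; shebang probe only for suffix-less paths."""
    language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
    if language is None and path.suffix == "":
        # Only extension-less files pay for a file read.
        return shebang_probe(path)
    # Known extension: authoritative. Unknown extension: None, no probe.
    return language


if __name__ == "__main__":
    probed: list[str] = []

    def recording_probe(path: Path) -> Optional[str]:
        # Hypothetical stub: pretend every probed file has a bash shebang.
        probed.append(path.name)
        return "bash"

    print(detect_language(Path("tool.py"), recording_probe))
    print(detect_language(Path("data.xyz"), recording_probe))
    print(detect_language(Path("bin/myapp"), recording_probe))
    print(probed)
```

With the stub probe, only `bin/myapp` is recorded as probed; `tool.py` resolves from its extension and `data.xyz` returns `None` without the probe being called, which mirrors the "purely additive" claim above.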