feat(parser): shebang-based language detection for extension-less scripts (#237) #238

Open
azizur100389 wants to merge 1 commit into tirth8205:main from azizur100389:feat/shebang-detection

@azizur100389
Contributor

Summary

Adds a shebang fallback to CodeParser.detect_language() so extension-less Unix scripts are routed to the correct tree-sitter grammar based on their first line. Closes #237.

| Common path | Typical shebang | Before | After |
| --- | --- | --- | --- |
| .git/hooks/pre-commit | #!/bin/bash | 0 nodes | Parsed as bash |
| bin/myapp | #!/usr/bin/env python3 | 0 nodes | Parsed as python |
| scripts/deploy | #!/bin/sh | 0 nodes | Parsed as bash |
| .husky/pre-push | #!/usr/bin/env sh | 0 nodes | Parsed as bash |
| tools/bootstrap | #!/usr/bin/env node | 0 nodes | Parsed as javascript |

Root cause

detect_language() was a single-line lookup against EXTENSION_TO_LANGUAGE. Any file with no extension returned None, which filters it out of both incremental_update() and full_build() before parsing. Real-world repos rely heavily on extension-less scripts for entrypoints, git hooks, CI installers, and shell tooling — all currently invisible to callers_of, get_impact_radius, detect_changes, and architecture mapping.
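
To make the failure mode concrete, here is a minimal sketch of the extension-only lookup described above. The function name `detect_language_before` and the three-entry table are illustrative assumptions; the real `EXTENSION_TO_LANGUAGE` in `parser.py` is larger and its exact contents are not shown in this PR.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset only (assumption); the real table has more entries.
EXTENSION_TO_LANGUAGE = {".py": "python", ".sh": "bash", ".js": "javascript"}


def detect_language_before(path: Path) -> Optional[str]:
    """Extension-only lookup (hypothetical stand-in for the pre-fix method)."""
    # A path such as bin/myapp has suffix "", which is not a key in the
    # table, so the lookup yields None and the file is never parsed.
    return EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
```

With this lookup, `Path(".git/hooks/pre-commit")` and `Path("bin/myapp")` both have an empty suffix, so they fall through to `None` regardless of their contents.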

Fix

  1. New module-level SHEBANG_INTERPRETER_TO_LANGUAGE table mapping common interpreter basenames to languages already registered in EXTENSION_TO_LANGUAGE. This file strictly routes extension-less scripts to existing grammars — it does not introduce new languages:

    • bash / sh / zsh / ksh / dash / ash -> "bash"
    • python / python2 / python3 / pypy / pypy3 -> "python"
    • node / nodejs -> "javascript"
    • ruby, perl, lua, Rscript, php
  2. New _SHEBANG_PROBE_BYTES = 256 — maximum bytes read from the head when probing. Enough for any reasonable shebang line while keeping worst-case I/O tiny.

  3. New CodeParser._detect_language_from_shebang(path) static method. Opens the file, reads up to 256 bytes, verifies the #! prefix, splits on the first newline and on the first NUL byte (defensive against binary content), and decodes UTF-8 strictly so malformed content returns None instead of raising. Handles:

    • Direct form: #!/bin/bash
    • /usr/bin/env indirection: #!/usr/bin/env bash
    • Linux env -S flag: #!/usr/bin/env -S node --experimental-vm-modules
    • Trailing flags: #!/bin/bash -e
    • Interpreter basename extraction from any absolute path
    • CRLF line endings
  4. detect_language(path) now tries extension lookup first, and if it returns None and path.suffix == "", falls back to the shebang probe. Files with a known extension are never re-read — extension-based detection remains authoritative.
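
The four steps above can be sketched roughly as follows. This is not the code from the PR: the module-level function names are stand-ins for the `CodeParser` methods, both tables are small illustrative subsets, and returning None on `OSError` and skipping every dash-prefixed `env` argument are assumptions about details the description does not spell out.

```python
from pathlib import Path
from typing import Optional

# Illustrative subsets (assumption); the real tables have more entries.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript"}
SHEBANG_INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "python": "python", "python3": "python",
    "node": "javascript", "nodejs": "javascript",
}
_SHEBANG_PROBE_BYTES = 256


def detect_language_from_shebang(path: Path) -> Optional[str]:
    """Map a script's shebang interpreter to a language, or return None."""
    try:
        with path.open("rb") as fh:
            head = fh.read(_SHEBANG_PROBE_BYTES)
    except OSError:
        return None  # assumption: an unreadable file counts as "no shebang"
    if not head.startswith(b"#!"):
        return None
    # Keep the first line only, and stop at a NUL byte if one appears.
    first_line = head[2:].split(b"\n", 1)[0].split(b"\x00", 1)[0]
    try:
        # str.split() with no arguments also drops a trailing "\r" (CRLF).
        tokens = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        return None
    if not tokens:
        return None
    interpreter = tokens[0].rsplit("/", 1)[-1]  # basename of the path
    if interpreter == "env":
        # Skip env options such as -S and take the first non-flag argument.
        target = next((t for t in tokens[1:] if not t.startswith("-")), None)
        if target is None:
            return None
        interpreter = target.rsplit("/", 1)[-1]
    return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)


def detect_language(path: Path) -> Optional[str]:
    """Extension lookup first; shebang probe only for suffix-less paths."""
    language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
    if language is None and path.suffix == "":
        language = detect_language_from_shebang(path)
    return language
```

Because the probe is gated on `path.suffix == ""`, a file named `tool.py` that starts with `#!/bin/bash` is still reported as Python in this sketch, and the file is never opened.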

Non-regressions guaranteed by design

  • .py files still parse as Python even if the first line is a misleading #!/bin/bash (locked in by test_detect_shebang_does_not_override_extension)
  • Extension-less README / LICENSE files return None after a cheap 256-byte read that finds no shebang.
  • Binary files whose first bytes are not #! return None without raising.
  • Unknown interpreters (e.g. #!/usr/bin/env ocaml) return None — same semantics as an unmapped extension.
  • No performance impact on files with a known extension — the shebang probe only runs for the path.suffix == "" branch.

Tests added (tests/test_parser.py::TestCodeParser — 16 tests)

  • test_detect_shebang_bin_bash
  • test_detect_shebang_bin_sh_routed_to_bash
  • test_detect_shebang_env_bash
  • test_detect_shebang_env_python3
  • test_detect_shebang_direct_python
  • test_detect_shebang_node
  • test_detect_shebang_env_dash_s_flag
  • test_detect_shebang_ruby
  • test_detect_shebang_perl
  • test_detect_shebang_with_trailing_flags
  • test_detect_shebang_missing_returns_none
  • test_detect_shebang_empty_file_returns_none
  • test_detect_shebang_binary_content_returns_none
  • test_detect_shebang_unknown_interpreter_returns_none
  • test_detect_shebang_does_not_override_extension (regression guard against extension override)
  • test_parse_shebang_script_produces_function_nodes — end-to-end parse_file() check: an extension-less bash script is detected AND parsed into File + Function nodes, all tagged language="bash".

Test results

| Stage | Result |
| --- | --- |
| Stage 1: new targeted shebang tests | 16/16 passed |
| Stage 2: tests/test_parser.py full | 83/83 passed |
| Stage 3: adjacent tests/test_multilang.py | 151/151 passed |
| Stage 4: full suite | 748 passed (up from 733 baseline, +15 net); 8 pre-existing Windows failures in test_incremental/test_main/test_notebook (verified identical on unchanged main) |
| Stage 5: ruff check | code_review_graph/parser.py: clean. tests/test_parser.py: 1 pre-existing F841 on line 1038 (in test_map_dispatch_qualified_reference, unrelated to this PR; reproducible on unchanged main at line 901) |

Zero regressions. Purely additive fallback that only fires for files with no extension.

…ipts (tirth8205#237)

Add a shebang fallback to `CodeParser.detect_language()` so that
extension-less Unix scripts (`bin/myapp`, `.git/hooks/pre-commit`,
`scripts/deploy`, `.husky/pre-push`, `installer`, ...) are routed to the
correct tree-sitter grammar based on their first line.

Root cause of tirth8205#237
------------------
`detect_language()` was a single-line lookup against `EXTENSION_TO_LANGUAGE`
keyed on `path.suffix.lower()`.  Any file with no extension returned
`None`, which filters it out of both `incremental_update()` and
`full_build()` before parsing.  Real-world repos rely heavily on
extension-less scripts for entrypoints, git hooks, CI installers, and
shell tooling — all currently invisible to `callers_of`,
`get_impact_radius`, `detect_changes`, and architecture mapping.

Fix
---
1. New module-level `SHEBANG_INTERPRETER_TO_LANGUAGE` table mapping common
   interpreter basenames to languages that are *already* registered:
     - bash / sh / zsh / ksh / dash / ash -> "bash"
     - python / python2 / python3 / pypy / pypy3 -> "python"
     - node / nodejs -> "javascript"
     - ruby, perl, lua, Rscript, php
   This file strictly *routes* extension-less files to existing languages;
   it does NOT introduce new grammars.

2. New `_SHEBANG_PROBE_BYTES = 256` constant — maximum bytes read from the
   head of a file when probing.  Enough for any reasonable shebang line
   while keeping worst-case I/O tiny.

3. New `CodeParser._detect_language_from_shebang(path)` static method.
   Opens the file, reads up to 256 bytes, verifies `#!` prefix, splits on
   the first newline AND first NUL byte (defensive against binary), and
   decodes UTF-8 strictly so malformed content returns None instead of
   raising.  Handles:
     - direct form            #!/bin/bash
     - env indirection        #!/usr/bin/env bash
     - env -S flag (Linux)    #!/usr/bin/env -S node --experimental-vm-modules
     - trailing flags         #!/bin/bash -e
     - interpreter basename extraction from any absolute path
     - CRLF line endings (`.split(b"\n", 1)`)

4. `detect_language(path)` now tries the extension lookup first, and if it
   returns None AND `path.suffix == ""`, falls back to the shebang probe.
   Files with a *known* extension are NEVER re-read — extension-based
   detection remains authoritative.

Non-regressions guaranteed by the design
----------------------------------------
- `.py` files still parse as Python even if the first line is a misleading
  `#!/bin/bash`  (`test_detect_shebang_does_not_override_extension`)
- Extension-less README / LICENSE files return None with a 256-byte read
  that finds no shebang.
- Binary files whose first bytes are not `#!` return None without raising.
- Unknown interpreters (e.g. `#!/usr/bin/env ocaml`) return None — same
  semantics as an unmapped extension.

Tests added (tests/test_parser.py::TestCodeParser — 16 tests)
-------------------------------------------------------------
- test_detect_shebang_bin_bash
- test_detect_shebang_bin_sh_routed_to_bash
- test_detect_shebang_env_bash
- test_detect_shebang_env_python3
- test_detect_shebang_direct_python
- test_detect_shebang_node
- test_detect_shebang_env_dash_s_flag
- test_detect_shebang_ruby
- test_detect_shebang_perl
- test_detect_shebang_with_trailing_flags
- test_detect_shebang_missing_returns_none
- test_detect_shebang_empty_file_returns_none
- test_detect_shebang_binary_content_returns_none
- test_detect_shebang_unknown_interpreter_returns_none
- test_detect_shebang_does_not_override_extension
- test_parse_shebang_script_produces_function_nodes (end-to-end parse_file
  check: extension-less bash script is detected AND parsed into File +
  Function nodes, all tagged language="bash")

Test results
------------
Stage 1 (new targeted shebang tests):       16/16 passed.
Stage 2 (tests/test_parser.py full):        83/83 passed.
Stage 3 (tests/test_multilang.py adjacent): 151/151 passed.
Stage 4 (full suite):                       748 passed (up from 733),
  8 pre-existing Windows failures in test_incremental (3) + test_main
  async coroutine detection (1) + test_notebook Databricks (4) —
  verified identical on unchanged main.
Stage 5 (ruff check):
  - code_review_graph/parser.py: clean
  - tests/test_parser.py: 1 pre-existing F841 on line 1038
    (test_map_dispatch_qualified_reference, unrelated to this PR —
    reproducible on unchanged main at line 901).

Zero regressions. Purely additive fallback that only fires for files
with no extension.