
BDD Status

Checked 49 scenarios across 41 test files.

Feature: Tokenise source code

  • Tokenise a Python hello-world file
  • Tokenise a JavaScript hello-world file
  • Tokenise a TypeScript hello-world file
  • Tokenise a Rust hello-world file
  • Tokenise a C++ hello-world file
  • Tokenise a Fortran hello-world file
  • Tokenise a Vyper hello-world file

Feature: JSON output format

  • Output is a valid JSON array
  • Token types are non-empty strings
  • Concatenating token values reconstructs the original input
  • Unrecognised characters are reported on stderr not stdout
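The round-trip property above can be sketched as follows. The token field names and type strings here are assumptions for illustration; the tokeniser's actual JSON schema may differ:

```python
import json

# Hypothetical token stream for the input "x = 1" -- the real
# tokeniser's type names may differ.
source = "x = 1"
tokens = [
    {"type": "identifier", "value": "x"},
    {"type": "whitespace", "value": " "},
    {"type": "operator", "value": "="},
    {"type": "whitespace", "value": " "},
    {"type": "number", "value": "1"},
]

# Output is a valid JSON array whose token types are non-empty strings.
output = json.dumps(tokens)
parsed = json.loads(output)
assert isinstance(parsed, list)
assert all(isinstance(t["type"], str) and t["type"] for t in parsed)

# Concatenating token values reconstructs the original input.
assert "".join(t["value"] for t in parsed) == source
```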

Feature: Token type identification

  • Keywords are identified as keyword tokens
  • Identifiers are identified as identifier tokens
  • Whitespace is preserved as whitespace tokens
  • String literals are identified as string literal tokens
  • Operators are identified as operator tokens
  • Keywords are not misidentified as identifiers
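The last scenario guards against a classic lexer bug: treating any token that starts with a keyword as a keyword. One common fix, sketched here with an illustrative keyword set (not the tool's actual lists), is to match a whole word first and only then check it against the keyword set:

```python
import re

KEYWORDS = {"def", "return", "if", "else"}  # illustrative subset
IDENT = re.compile(r"[A-Za-z_]\w*")

def classify(word: str) -> str:
    """Classify a complete word token as keyword or identifier."""
    if not IDENT.fullmatch(word):
        raise ValueError(f"not a word token: {word!r}")
    return "keyword" if word in KEYWORDS else "identifier"

assert classify("def") == "keyword"
# A keyword prefix inside a longer name must remain an identifier.
assert classify("define") == "identifier"
```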

Feature: Comprehensive language tokenisation

  • Tokenise a Python file with function definition and control flow
  • Tokenise a JavaScript file with variable declarations and arrow functions
  • Tokenise a Rust file with struct and impl definitions

Feature: CLI error handling

  • Missing command-line arguments prints usage and exits non-zero
  • Input file not found exits with a clear error message
  • Invalid YAML lexer config exits with a clear error message

Feature: Lexer schema validation

  • Valid lexer YAML passes validation
  • Lexer YAML missing required field fails validation
  • Lexer YAML with a token missing both value and pattern fails validation
  • All bundled lexer files pass validation
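A minimal validator for the rules above might look like this. The schema (a top-level `name` plus a `tokens` list whose entries carry either `value` or `pattern`) is inferred from the scenario names and is an assumption, not the tool's documented format:

```python
def validate_lexer(config: dict) -> list[str]:
    """Return a list of validation errors for a lexer config dict.

    Assumed schema: required top-level fields 'name' and 'tokens';
    each token entry must define 'value' or 'pattern'.
    """
    errors = []
    for field in ("name", "tokens"):
        if field not in config:
            errors.append(f"missing required field: {field}")
    for i, tok in enumerate(config.get("tokens", [])):
        if "value" not in tok and "pattern" not in tok:
            errors.append(f"token {i} has neither 'value' nor 'pattern'")
    return errors

assert validate_lexer({"name": "dsl", "tokens": [{"value": "+"}]}) == []
assert validate_lexer({"tokens": [{}]}) == [
    "missing required field: name",
    "token 0 has neither 'value' nor 'pattern'",
]
```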

Feature: Custom lexer configuration

  • A custom lexer tokenises a simple DSL

Feature: Comment tokenisation

  • Single-line comments are tokenised as comment tokens
  • Multi-line comments are tokenised as comment tokens

Feature: Import statement tokenisation

  • Python import statement is tokenised correctly
  • JavaScript import statement is tokenised correctly
  • Rust use statement is tokenised correctly

Feature: Multi-character operator tokenisation

  • Equality operator is tokenised as a single token
  • Arrow operator is tokenised as a single token
  • Compound assignment operators are tokenised as single tokens
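Matching `==`, `->`, or `+=` as single tokens usually comes down to longest-match-first ordering. A sketch, with an illustrative operator set rather than the tool's actual one:

```python
# Sorting candidates longest-first ensures "==" and "->" win over
# their single-character prefixes "=" and "-".
OPERATORS = sorted(["=", "==", "-", ">", "->", "+", "+="],
                   key=len, reverse=True)

def match_operator(text: str, pos: int):
    """Return the longest operator starting at pos, or None."""
    for op in OPERATORS:
        if text.startswith(op, pos):
            return op
    return None

assert match_operator("a == b", 2) == "=="
assert match_operator("x -> y", 2) == "->"
assert match_operator("n += 1", 2) == "+="
```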

Feature: Number literal tokenisation

  • Integer literals are tokenised as number tokens
  • Float literals are tokenised as number tokens
  • Hexadecimal literals are tokenised as number tokens

Feature: Empty input handling

  • Empty file produces an empty token array
  • Whitespace-only file produces only whitespace tokens

Feature: Duplicate token deduplication

  • Duplicate value-based tokens are rejected during validation
  • Bundled lexers contain no duplicate token values

Feature: Regex pattern precompilation

  • Patterns are compiled once before tokenisation begins
  • Large file tokenisation completes within a reasonable time
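The idea behind these scenarios can be sketched as follows: compile every rule's regex once before the main loop, so match attempts inside the loop never pay compilation cost. The rule table is illustrative, not the tool's bundled lexer:

```python
import re

# Illustrative rules, compiled once up front rather than per match.
RULES = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("whitespace", r"\s+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def next_token(text: str, pos: int):
    """Return (type, value) for the first rule matching at pos."""
    for name, regex in COMPILED:
        m = regex.match(text, pos)
        if m:
            return name, m.group()
    return None

assert next_token("abc 42", 0) == ("identifier", "abc")
assert next_token("abc 42", 4) == ("number", "42")
```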

Feature: Efficient string slicing

  • String slice extraction uses direct slicing instead of character-by-character concatenation
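The distinction this scenario tests can be shown in miniature: scan to find the end of a run, then take one slice, instead of appending characters to a growing string (which copies the accumulator on every step). A hypothetical helper, not the tool's actual code:

```python
def take_while(text: str, pos: int, predicate) -> str:
    """Extract a run of characters by scanning for the end index
    and slicing once, avoiding per-character concatenation."""
    end = pos
    while end < len(text) and predicate(text[end]):
        end += 1
    return text[pos:end]  # a single slice instead of n string copies

assert take_while("hello world", 0, str.isalpha) == "hello"
assert take_while("123abc", 0, str.isdigit) == "123"
```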

Feature: CLI error messages

  • File not found produces a clean error message without a traceback
  • Invalid YAML produces a clean error message without a traceback
  • Unreadable file produces a clean error message without a traceback

49/49 scenarios covered.