This document is part of a repository that provides a practical diagnostic prompt for identifying the causes of AI mis-inference during code generation and for evaluating countermeasures (prompt improvements, additional metadata, explicit validation instructions, etc.) to give the AI. As a framework for understanding theoretically why AI mis-infers, we also devised a language ecosystem evaluation model.
This framework was developed as a thought experiment based on practical experience and observations in AI-assisted coding.
Current Status:
- ✅ Structured as a theoretical framework
- 🔄 Empirical validation in progress
- 💬 Community-driven extensions welcome
Design Philosophy:
- Tool-agnostic: Stable across changing AI coding agents and inference models
- Language-agnostic: Applicable regardless of programming language evolution
- Focus on structure: Identifies mis-inference-prone structures, not specific implementations
- Long-term stability: Principles over concrete tools that rapidly evolve
Scope:
- ✅ Structural vulnerabilities that cause AI mis-inference
- ⚠️ Non-functional requirements (security, performance) are distributed across the 9 axes insofar as they relate to mis-inference
- ℹ️ Project-specific non-functional requirements may require separate evaluation frameworks
Future Direction:
This framework's structural approach can be adapted to identify mis-inference patterns in other domains (security, performance, etc.) as observation-specific evaluation frameworks.
How to Use:
- Use diagnostic_prompt.md for AI-powered diagnostics
  - Note: Any AI model can be used for diagnostics
  - Recommended: Use with appropriate context and project-specific information
  - Limitation: Diagnostic quality depends on the AI model's training data and capabilities
- As a starting point for language selection discussions
- For practical evaluation in actual projects
  - Triggered at project start or when project dependencies are updated
- As inspiration for developing your own evaluation criteria
This paper presents a novel framework for evaluating programming language ecosystems in the AI coding era. Unlike traditional metrics focused on syntax or performance, our model prioritizes "fixability" — the ability of an ecosystem to support AI's correction loop through rich semantic information.
Key Contributions:
- 9-axis evaluation framework spanning static and runtime dimensions
- 4-layer semantic architecture (Core, Service, Dependency, Community)
- Unified theory treating compilers as semantic verification engines
- Practical observations showing tests function as specification complements in dynamic languages
The model presents a hypothesis that in AI-assisted development, language strength is determined not by initial code generation quality, but by the ecosystem's capacity to provide semantic transparency for iterative correction.
Keywords: AI-assisted coding, semantic information, language ecosystems, fixability, verification loop
The essence of AI coding is not the quality of initial generation,
but rather two capabilities: fixability and semantic transparency.
In an era where AI continuously generates code,
what matters is not "producing correct code in one shot",
but whether we can stably maintain the correction loop of:
- Correction
- Regeneration
- Verification
What's needed for this is not the language specification itself,
but the quality and quantity of semantic information provided by
the entire ecosystem: language, runtime, toolchain, and community.
Starting from this philosophy,
the evaluation model of 4 Layers × 9 Axes × Verification Loop was born.
AI is vulnerable to ambiguity, inconsistency, and lack of information.
Therefore, consistency, information richness, and stability of language ecosystems become crucial.
This model adopts the following principles for fair language comparison:
Rather than distorting the essence through scoring or ranking, we describe the following in words:
- Language characteristics
- Strengths for AI
- Weaknesses for AI
- Suitable use cases
- Design philosophy and culture are treated as "differences in values," not "good/bad"
- Enterprise → Stability, backward compatibility
- Web / Startup → Development speed, flexibility
- Research / Education → Expressiveness, experimental features
- System / Embedded → Transparency of runtime semantics
The quality required for AI-generated code also varies by use case.
This model evaluates not just language specifications alone,
but the entire ecosystem including language, implementation, toolchain, and community.
Examples:
- ❌ Python (specification only)
- ✅ Python ecosystem (CPython + pip + pytest + typing + community)
Similarly:
- JavaScript → Node.js / Bun / Deno ecosystem
- C# → .NET ecosystem
- Rust → Cargo ecosystem
This allows evaluation of the practical environment as a whole
that AI faces during actual coding.
The 9 axes are not independent; the following interactions exist:
- Axis8 Compatibility Culture ⇔ Axis2 Static Semantic Improvement
- Prioritizing backward compatibility may delay semantic refinement
- Axis3 Metadata Richness ⇔ Axis9 Semantic Extensibility
- Strict type systems can constrain extensibility
- Axis4 Accessibility & Automation ⇔ Learning Cost
- Rich APIs can become barriers for beginners
Notably, AI and humans have completely opposite vulnerabilities in terms of which layers matter most (see Section 5.3 for details).
- For AI: Collapse of the Community layer (Layer4) is critical; Core layer (Layer1) has relatively minor impact
- For Humans: Collapse of the Core layer (Layer1) is critical; Community layer (Layer4) has relatively minor impact
Therefore, "good languages for AI" and "good languages for humans" do not necessarily align.
Since optimal solutions for these trade-offs vary by use case,
we adopt qualitative evaluation rather than scoring.
The process by which AI generates, semantically verifies, corrects, and regenerates code
can be structured into the following 7 phases:
Phase1: Static Knowledge (Prior Knowledge)
Phase2: Generation (Initial Generation)
├ External Reference
└ Environment Semantics
Phase3: Static Semantic Verification
Phase4: Launch Check
Phase5: Test Execution
├ Phase5-1: Quality Validation (Application/Specification Dependent)
├ Phase5-2: Runtime Profiling Observation
└ Phase5-3: Runtime Profiling Semantics
Phase6: Test Feedback (Runtime Feedback)
Phase7: Regeneration (Corrective Generation)
├ External Reference
└ Environment Semantics
→ Return to Phase3
flowchart TD
A[Phase1 Static Knowledge]
B[Phase2 Generation]
C[Phase3 Static Semantic Verification]
D[Phase4 Launch Check]
E[Phase5 Test Execution]
F[Phase6 Test Feedback]
G[Phase7 Regeneration]
A --> B
B --> C
C -->|Failed| G
C -->|Passed| D
D -->|Failed| G
D -->|Passed| E
E --> F
F -->|Passed| H[End]
F -->|Failed| G
G --> C
The purpose is singular:
Enable AI to "fix" code.
Provide all semantic information necessary for that purpose.
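As a sketch (not part of the framework itself), the control flow above can be expressed as a minimal loop driver. Every hook name here (`static_verify`, `launch_check`, `run_tests`, `regenerate`) is a hypothetical stand-in for a real toolchain or model call:

```python
def fix_loop(code, static_verify, launch_check, run_tests, regenerate,
             max_iterations=5):
    """Drive the Phase3 -> Phase7 correction loop until every check passes.

    Each check hook takes the current code and returns None on success
    or a feedback string on failure; `regenerate` maps (code, feedback)
    to corrected code. All hooks are hypothetical stand-ins.
    """
    for _ in range(max_iterations):
        for check in (static_verify, launch_check, run_tests):
            feedback = check(code)                 # Phase3, Phase4, Phase5/6
            if feedback is not None:
                code = regenerate(code, feedback)  # Phase7
                break                              # return to Phase3
        else:
            return code                            # all phases passed
    raise RuntimeError("correction loop did not converge")
```

The loop's stability depends entirely on how much semantic information the check hooks can surface, which is exactly what the 9 axes evaluate.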
Detailed descriptions of each phase follow.
Phase1: Static Knowledge (Prior Knowledge)

- AI's training data (OSS, Q&A, official docs, blogs, etc.)
- Language specifications, standard libraries
- Common coding patterns
Phase2: Generation (Initial Generation)

- Initial code generation by AI
- From prompts and context
Phase3: Static Semantic Verification

- Type checking
- Syntax validation
- Linter-based verification
- Build/Compilation (Note: Phase4 cannot be reached if compilation fails)
- → If failed, go to Phase7
Note: This phase verifies "code semantics" (types, syntax, static analysis). Syntactic dependency resolution occurs here, but actual executability is verified in Phase4.
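A minimal Python sketch of Phase3-style verification using only the standard library. Real pipelines would invoke a type checker (e.g., mypy or pyright) and project linters; the bare-`except` rule below is only an illustrative lint:

```python
import ast

def static_semantic_check(source: str):
    """Phase3 sketch: return structured feedback rather than pass/fail.

    Real setups would also run a type checker and linters; this only
    demonstrates syntax verification plus one illustrative lint rule.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        # Machine-readable feedback that Phase7 can act on
        return {"ok": False, "line": err.lineno, "message": err.msg}
    for node in ast.walk(tree):
        # Illustrative semantic rule: bare `except:` hides error semantics
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            return {"ok": False, "line": node.lineno,
                    "message": "bare except hides error semantics"}
    return {"ok": True}
```

The important property is that failures come back with a location and a reason, not just a boolean: that is the feedback Phase7 regenerates from.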
Phase4: Launch Check

- Actual dependency resolution and loading
- Environment variables and configuration file verification
- Basic startup confirmation
- → If failed, go to Phase7
Note: This phase verifies "execution environment semantics" (dependency existence, environment, initialization). It covers cases where compilation succeeds but startup fails due to missing dependencies, an unconfigured environment, etc.
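Phase4 can be sketched in Python with stdlib checks for dependency existence and environment configuration; the module and variable names passed in are project-specific and purely illustrative:

```python
import importlib.util
import os

def launch_check(required_modules, required_env_vars):
    """Phase4 sketch: verify startup preconditions before running code.

    These are the failures static verification cannot catch: a
    dependency that does not resolve, or configuration that is absent.
    """
    problems = []
    for name in required_modules:
        # Does the dependency actually resolve in this environment?
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing dependency: {name}")
    for var in required_env_vars:
        # Is the required configuration present?
        if var not in os.environ:
            problems.append(f"unset environment variable: {var}")
    return problems  # empty list: startup preconditions satisfied
```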
Phase5: Test Execution

- Phase5-1: Quality Validation (depends on app/spec)
  - Unit tests
  - Integration tests
  - Code review
- Phase5-2: Runtime Profiling Observation
  - Execution performance
  - Resource usage
  - Logging
- Phase5-3: Runtime Profiling Semantics
  - Semantic interpretation based on observation results
  - Detection of anomalies or unexpected behaviors
  - Patterns from execution traces
- → If failed, go to Phase7
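Phase5-2's observation step can be sketched with stdlib instrumentation; the returned dictionary is the kind of machine-readable observation that Phase5-3 would interpret semantically:

```python
import time
import tracemalloc

def profile_run(fn, *args):
    """Phase5-2 sketch: turn one execution into structured observations.

    The dictionary returned is machine-readable data for Phase5-3
    (anomaly detection, regression checks, etc.); real profilers
    capture far richer traces.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # peak allocation in bytes
    tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak}
```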
Phase6: Test Feedback (Runtime Feedback)

- Error messages
- Test results
- Runtime logs
- Performance bottlenecks
Phase7: Regeneration (Corrective Generation)

- AI regenerates code based on feedback
- Reflect Phase3, Phase4, Phase5, or Phase6 feedback to improve code
- → Return to Phase3 validation loop
Note on terminology: Throughout this document, we use "verification loop," "correction loop," and "fix loop" to refer to the iterative process of AI-driven code generation, verification, and improvement. These terms are used contextually but refer to essentially the same process.
The 9 axes for evaluating language ecosystems are organized
along two dimensions: Implementation (Static) and Runtime.
Axis1: Public Knowledge Availability
Axis2: Static Semantic Consistency
Axis3: Semantic Metadata Richness
Axis4: Semantic Access & Automation
Axis5: Runtime Semantic Continuity
Axis6: Dependency Stability
Axis7: Runtime Specification Conformance
Axis8: Compatibility Culture
Axis9: Semantic Extensibility
Axis1: Public Knowledge Availability

| Dimension | Role | Examples | Contribution to Verification Loop |
|---|---|---|---|
| Static | Knowledge for AI pre-training | OSS, Q&A, Blogs | Phase1 Static Knowledge, Phase2 Generation |
| Runtime | Knowledge referenced in correction loop | API Docs, Specifications | Phase7 Regeneration basis strengthening |
Axis2: Static Semantic Consistency

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Static semantic consistency | Types, AST, Scopes | Phase3 Static Semantic Verification |
| Runtime | Runtime semantic consistency | Exceptions, Dynamic types | Phase5-3 Profiling Semantics, Phase5-1 Quality Validation |
Axis3: Semantic Metadata Richness

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Materials for static analysis | Type annotations, LSP, Contracts | Phase3 Semantic Verification |
| Runtime | Granularity of runtime observation | Profilers, Traces | Phase5-2 Profiling Observation |
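As a small illustration of Axis3's static dimension in Python: type annotations are metadata that a tool (or an AI) can query without ever running the code. The `transfer` function is purely illustrative:

```python
from typing import get_type_hints

def transfer(amount: int, currency: str) -> bool:
    """Illustrative function: its annotations are static metadata."""
    return amount > 0 and currency.isalpha()

# Axis3 in miniature: the annotations are machine-queryable, so a tool
# (or an AI) can verify call sites before the code ever runs.
hints = get_type_hints(transfer)
```

The richer and more consistent such metadata is across an ecosystem, the more verification Phase3 can perform statically.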
Axis4: Semantic Access & Automation

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Access to semantic APIs | Roslyn, tsserver, Symbol API | Phase3 Verification, Phase4 Launch Check |
| Runtime | Automated execution environment | Test runners, Profilers | Phase5 Test Execution |
Axis5: Runtime Semantic Continuity

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Consideration of runtime differences | Node/Bun, CPython/PyPy | Phase2 Generation |
| Runtime | Runtime semantic continuity | GC, JIT, Exception models | Phase5-3 Profiling Semantics |
Axis6: Dependency Stability

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Dependency consistency | Versions, ABI | Phase3 Verification, Phase4 Launch Check |
| Runtime | Runtime dependency behavior | Actual loading | Phase5-1 Quality Validation |
Axis7: Runtime Specification Conformance

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Semantic fixation based on specs | API Docs, RFCs | Phase2 Generation, Phase3 Verification |
| Runtime | Runtime spec conformance | Behavior per specification | Phase5-1 Quality Validation |
Axis8: Compatibility Culture

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Use of backward-compatible APIs | Deprecated API warnings | Phase2 Generation, Phase3 Verification |
| Runtime | Operation in legacy environments | LTS, Stable APIs | Phase4 Launch Check, Phase5-1 Quality Validation |
Axis9: Semantic Extensibility

| Dimension | Role | Examples | Contribution |
|---|---|---|---|
| Static | Extensible design | Interfaces, Abstractions | Phase2 Generation, Phase3 Verification |
| Runtime | Post-extension behavior validation | Plugins, Modules | Phase5-1 Quality Validation, Phase7 Regeneration |
Language ecosystems can be structured into the following 4 layers:
Layer1: Semantic Core Layer
Layer2: Semantic Service Layer
Layer3: Dependency Semantics Layer
Layer4: Community Semantics Layer
- Layers 1-3 represent "official" semantics
- Layer 4 represents "social" semantics
Layer1: Semantic Core Layer

- Type system
- Scope rules
- Memory model
- Evaluation strategy
- Backward compatibility policy
Related Axes: Axis2, Axis5, Axis8, Axis9
Contributions: Phase3, Phase4, Phase5, Phase7
Represents the language specification itself, including types, scopes, memory models, etc.
Layer2: Semantic Service Layer

- AST / Symbol API
- Type information API
- Diagnostics & Errors
- LSP
- Static analysis API
- Compiler
- Toolchain
- Custom attributes (Attribute / Annotation / Decorator)
- Comments (natural language / semantic metadata like XML comments)
- Macros / Source Generators
- Analyzer extension points
Related Axes: Axis3, Axis4, Axis5, Axis7, Axis9
Contributions: Phase2, Phase3, Phase4, Phase5, Phase7
Layer 2 is the interface layer that exposes the language specification (Layer 1) externally, enabling AI to acquire, interpret, and modify semantics. Services (mechanisms) that feed information back from the language side to the AI belong here.
Layer 2 integrates static, dynamic, and dependency semantics, serving as "machine-readable semantics" for AI.
It also includes services (toolchain) for verifying, analyzing, and manipulating dependencies. This is because Layer 2 represents "how to handle" dependencies, while Layer 3 represents the "content" of dependencies.
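A toy "Symbol API" built on Python's stdlib AST illustrates what Layer 2 exposes. Real services such as Roslyn, tsserver, or an LSP server are far richer; `list_symbols` and the sample source are hypothetical:

```python
import ast

SOURCE = """
class Account:
    def deposit(self, amount: int) -> None: ...

def audit(account): ...
"""

def list_symbols(source: str):
    """Toy Symbol API: list named definitions with their line numbers."""
    tree = ast.parse(source)
    return [(type(node).__name__, node.name, node.lineno)
            for node in ast.walk(tree)
            if isinstance(node, (ast.ClassDef, ast.FunctionDef))]
```

The point is that the language's semantics are machine-queryable at all: an AI never has to guess what symbols exist when the service layer can enumerate them.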
Layer3: Dependency Semantics Layer

- Standard library
- Package management systems (npm, pip, cargo, etc.)
- Version management and dependency resolution
- API lifetime (deprecation, breaking changes)
- Runtime compatibility (ABI, binary compatibility)
Related Axes: Axis6, Axis7, Axis8, Axis9
Contributions: Phase3, Phase4, Phase5, Phase7
In language ecosystems, this layer represents the management, versioning, and compatibility of external libraries and modules that code depends on. While Layer 2 represents the "mechanisms for handling" these, Layer 3 represents the "semantics of dependencies themselves."
Layer4: Community Semantics Layer

- OSS
- Q&A
- Blogs
- Best practices
- Coding conventions
Related Axes: Axis1, Axis6, Axis8, Axis9
Contributions: Phase1, Phase2, Phase6, Phase7
AI's direction of understanding (outside-in):

Community Semantics
↓
Dependency Semantics
↓
Semantic Service Layer
↓
Semantic Core Layer
AI absorbs semantics from the outside and understands toward the center (language specification).
In other words, AI infers by reverse-engineering from outer data:
- First, observe "how humans write" and "how it's used" in large quantities (training data source)
- Grasp semantics by observing massive API usage patterns
- Absorb the structural semantics of the language
- Infer the syntax of the language
Humans' direction of understanding (inside-out):

Semantic Core Layer
↓
Semantic Service Layer
↓
Dependency Semantics
↓
Community Semantics
Humans expand their understanding from the center (language specification) outward.
When semantic layers break, AI and humans have completely opposite vulnerabilities.
| Layer | AI Vulnerability | Human Vulnerability |
|---|---|---|
| 4 Community | Critical | Minor impact |
| 3 Dependency | Major impact | Moderate |
| 2 Service | Moderate | Major impact |
| 1 Core | Minor impact | Critical |
AI: Outer-layer dependent → Vulnerable when outer layers break
Human: Inner-layer dependent → Vulnerable when inner layers break
- What AI is good at:
  - Language conversion (Python → Rust → TS)
  - Pseudocode interpretation
  - Natural language to code generation
  - Semantic refactoring of code
- What humans are good at:
  - Fine nuances of syntax
  - Code style
  - Readability judgment
What is "language = spec + grammar" for humans becomes "language = spec + grammar + toolchain + examples + community + culture" for AI.
What humans consider "peripheral information" becomes part of semantics for AI.
In other words, in the AI era, languages are not defined solely by their specifications, but derive meaning through community practice.
We call this perspective Language-as-Ecosystem — treating the entire ecosystem as the definition of the language itself.
The essence of AI coding lies in:
- Fixability
- Semantic Transparency (ability to provide semantic information)
This model is a framework for structurally evaluating
how much of a "fixable environment" a language ecosystem can provide to AI.
Three Foundational Theories Supporting Language Ecosystems in the AI Era
— Semantic Layers, Test Culture, Compiler Redefinition —
This appendix summarizes the three pillars of philosophy, practical evidence, and unified theory
that support the main content (4 Layers × 9 Axes × Verification Loop).
- Appendix A: Semantic Layers are the True Essence (Core Philosophy)
- Appendix B: Tests Complement Specifications (Practical Observations)
- Appendix C: Redefining Compilers as Semantic Verification Engines (Unified Theory)
With these three together,
the AI-era language ecosystem evaluation model is completed as
a three-layer structure of Philosophy → Practice → Theory.
Appendix A: Semantic Layers are the True Essence
— Moving Beyond Syntax-Centrism —
- Language syntax functions as a user interface for both AI and humans
- Just character strings to AI
- Same structure if semantics match, regardless of different grammar
- No fundamental difference between Python, C#, or JavaScript
Syntax is appearance; it is not the essence.
What AI needs to run the correction loop is semantic information.
- AST
- Type information
- Contracts (pre/post/invariant)
- Metadata
- Runtime versions
- Standard libraries
- Breaking change history
- Runtime behavior
- Side effects
- Exception conditions
These become the primary information sources for AI's correction loop.
There's no need to embed semantics inside AI models.
A structure that follows external semantic layers is optimal.
This ensures:
- Even if models change
- Even if vendors change
- Even if language specifications evolve
The stability of the correction loop is maintained.
In the AI era, the true essence of code is the semantic layer,
and syntax is the user interface for both AI and humans.
Appendix B: Test Libraries Function as "Specification Complements"
— Practical Observations that Language Specifications Alone are Insufficient —
Especially in the following languages,
the quality of test libraries determines the success of AI coding:
- Python / Ruby / JavaScript (dynamic typing)
- Java / TypeScript (type erasure)
- Go / Java (complex runtime behavior)
In these languages:
AI cannot fix correctly unless runtime specifications are explicitly defined through tests.
For AI to fix code, it needs:
- Correct specifications
- Error location information
- Runtime reasons (exceptions, logs)
Language specifications alone often cannot satisfy these requirements.
Therefore:
- Tests
- Logs
- Traces
- Structured errors
- Reproducible execution environments
fill the gaps in specifications.
In languages with strong static semantics like C#:
- Type system is robust
- AST and metadata are rich
Therefore, tests can focus on:
- Behavioral specifications
- Concurrency correctness
- Side effect validation
These are supplements to dynamic semantics.
In Python or JavaScript:
- Types are weak
- Runtime behavior is ambiguous
Therefore, tests handle:
- Type information complementation
- Specification concretization
- Intent clarification
and function as de facto specifications.
However, this is:
Not an assertion that tests should replace specifications,
but an observation that they function as practical means to complement incomplete specifications.
Language specifications alone are insufficient.
Test culture, toolchains, and logs function as "specification complements"
supporting AI's correction loop.
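A minimal Python illustration of this observation: the dynamically typed `parse_amount` (a hypothetical function) says nothing about which inputs are legal or how results are rounded until its tests pin that down:

```python
def parse_amount(value):
    """Hypothetical dynamically typed function: the signature alone
    does not say which inputs are legal or how results are rounded."""
    return round(float(value), 2)

def test_parse_amount():
    # These assertions are the de facto specification: strings are
    # accepted, rounding is to two decimals, None is out of contract.
    assert parse_amount("3.456") == 3.46
    assert parse_amount(10) == 10.0
    try:
        parse_amount(None)
        raise AssertionError("None must be rejected")
    except TypeError:
        pass
```

Given only `parse_amount`, an AI must guess the contract; given the test, the contract is explicit and the correction loop has a fixed target.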
Appendix C: Compilation's Value Lies in Semantic Verification (the AI-era perspective)
— A Unified View of Static and Dynamic Languages —
Traditional primary purpose:
Compilation = Executable file generation
AI era perspective:
Compilation's primary value = Semantic verification phase
For AI, this verification information is what matters
This does not negate the traditional definition,
but rather represents a shift in value perspective within the context of AI coding.
In AI's correction loop, compilers provide:
- Core Semantics (types, scopes, evaluation strategies) verification
- Extended Semantics (Analyzer, Linter) integration
- Dependency consistency checking
- Executability assurance
In other words, from AI's perspective:
Compiler = An engine that establishes semantic consistency
and provides information necessary for fixes
Executable file generation is one of the artifacts obtained as a result of this verification.
Static languages → Semantic verification at compile time
Dynamic languages → Semantic verification via Linter + Tests
Unified view:
The two differ only in "where semantic verification occurs,"
but are fundamentally the same structure.
This enables
evaluation of static and dynamic languages within the same framework.
From the AI era perspective, compilation's primary value lies in semantic verification,
and this view enables unified treatment of static and dynamic languages.
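In Python terms, the same shift can be sketched with the built-in `compile` step: the structured diagnostic it raises, not the code object it returns, is what feeds the correction loop (`verify` is an illustrative helper):

```python
def verify(source: str):
    """Treat compilation as a verification phase: the structured
    diagnostic, not the code object, is the artifact the loop needs."""
    try:
        compile(source, "<generated>", "exec")
        return None  # semantically consistent at the syntax level
    except SyntaxError as err:
        # Location plus reason: exactly the feedback Phase7 regenerates from
        return f"line {err.lineno}: {err.msg}"
```

Whether this check happens in a compiler (static languages) or in linters and tests (dynamic languages), the structure of the feedback is the same.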
Fixability
The capacity of a language ecosystem to support iterative correction of AI-generated code through provision of rich semantic information.
Semantic Transparency
The degree to which a language ecosystem exposes semantic information (types, contracts, runtime behavior) to AI tools.
Correction Loop / Fix Loop
The iterative process of: Generation → Verification → Feedback → Regeneration, central to AI-assisted development.
AST (Abstract Syntax Tree)
A tree representation of the abstract syntactic structure of source code, used by compilers and static analyzers.
LSP (Language Server Protocol)
A protocol for communication between development tools and language servers, providing features like autocomplete and diagnostics.
ABI (Application Binary Interface)
The interface between program modules at the binary level, critical for dependency compatibility.
Type Erasure
A compilation technique in which generic type information is removed during compilation and is therefore unavailable at runtime (e.g., Java generics, TypeScript types), limiting runtime semantic richness.
JIT (Just-In-Time Compilation)
Runtime compilation technique that can affect semantic continuity between development and production environments.
Semantic Layers
The four-layer architecture of language ecosystems: Core, Service, Dependency, and Community semantics.
Static/Runtime Dimensions
The two perspectives for each evaluation axis: information available at implementation time vs. execution time.
De Facto Specifications
Tests and runtime validation that effectively serve as specifications in languages with weak static semantics.
Semantic Verification Engine
The redefined role of compilers in the AI era: not just code generation, but semantic consistency validation.