This repository contains the framework, benchmarking data, and comparative analysis for a research project investigating the efficacy of autonomous AI agents in software development. The study evaluates whether an agentic framework can match or exceed the code quality of junior-level developers.
The core of this project is a self-improving AI agent that writes, executes, and refines Python code autonomously based on terminal outputs and quality feedback.
Figure 1: Architecture of the autonomous agentic framework.
- Code Designer: Generates technical requirements and function compliance standards.
- Code Generator (Gemini): Produces the initial implementation using the Gemini 2.0 Flash model.
- Script Evaluator: Runs the script, analyzes outcomes, and triggers refinement loops if requirements aren't met.
- Web Search Tool: Enables the agent to retrieve up-to-date API information and fill training gaps.
- Library Installer: Automatically resolves and installs dependencies in isolated environments.
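The interaction between the Code Generator and the Script Evaluator can be sketched as a generate, run, and refine loop. This is a minimal illustration, not the framework's actual code: `run_script`, `refine_until_passing`, the `generate` callback, and the round limit are hypothetical names and defaults, and "requirements met" is reduced here to "the script exits with status 0".

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_script(source: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Write the candidate source to a temp file and run it in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        return subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )


def refine_until_passing(
    generate: Callable[[str, Optional[str]], str],
    requirements: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Regenerate with terminal feedback until the script exits cleanly.

    `generate(requirements, feedback)` stands in for the LLM call (assumption).
    Returns (passing source or None, number of rounds used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = generate(requirements, feedback)
        try:
            result = run_script(source)
        except subprocess.TimeoutExpired:
            feedback = "The script timed out."
            continue
        if result.returncode == 0:
            return source, round_no
        # Feed the terminal output back to the generator for the next round.
        feedback = result.stderr or result.stdout
    return None, max_rounds
```

In a real agent the `generate` callback would wrap the Gemini request, and the pass criterion would also include the quality checks described in this README.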
The study benchmarked AI-generated scripts against human-authored counterparts across five real-world scenarios covering four categories: CLI utilities, data parsers, HTTP servers, and AI interfaces.
The AI agent demonstrated significant advantages in several key industry-standard metrics:
| Metric | AI Agent Mean | Human Mean | Difference (AI - Human) |
|---|---|---|---|
| Pylint Score | 7.74 | 7.27 | +0.47 |
| Maintainability Index | 76.59 | 64.39 | +12.2 |
| Bug Density | 0.30 | 0.43 | -0.13 |
| Logical Lines of Code (LLOC) | 59.8 | 95.6 | -35.8 |
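The difference column is the AI mean minus the human mean. The short sketch below recomputes it from the table values and adds the relative change against the human mean, which is not reported in the table. The function name `differences` and the dictionary layout are illustrative.

```python
# Means copied from the table above, as (AI agent, human).
MEANS = {
    "Pylint Score": (7.74, 7.27),
    "Maintainability Index": (76.59, 64.39),
    "Bug Density": (0.30, 0.43),
    "LLOC": (59.8, 95.6),
}


def differences(table):
    """Return {metric: (AI - human, relative change in % of the human mean)}."""
    result = {}
    for metric, (ai, human) in table.items():
        diff = ai - human
        result[metric] = (round(diff, 2), round(100.0 * diff / human, 1))
    return result


if __name__ == "__main__":
    for metric, (absolute, relative) in differences(MEANS).items():
        print(f"{metric}: {absolute:+.2f} ({relative:+.1f}%)")
```

Note that a negative difference is the favorable direction for Bug Density and LLOC, while a positive difference is favorable for the Pylint score and the Maintainability Index.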
Key Findings:
- Structural Excellence: AI-generated code achieved a higher Maintainability Index (76.59 vs 64.39), suggesting more modular and self-documenting designs.
- Complexity Trade-off: AI code showed higher Cyclomatic Complexity (3.91 vs 1.63), which reflects a deliberate choice to use discrete functions rather than monolithic, nested loops.
- Cost Efficiency: Once developed, the AI agent's operating cost is less than 5% of a junior developer's salary.
A critical test of the framework involved designing a digital notch filter to attenuate specific frequencies.
Figure 2: Magnitude Spectrum and Time Series of AI-generated filter results.
Through the Web Search component, the agent autonomously corrected its coefficient calculations and achieved a 90% reduction in power at the target frequency, demonstrating its potential for technical and academic applications.
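As an illustration of what such a filter involves, here is a minimal standard-library sketch of a second-order (biquad) notch filter and a check of the power removed at the target frequency. The README does not state the filter topology, sample rate, target frequency, or quality factor, so the biquad form and every default below (50 Hz notch, 200 Hz reference tone, 1 kHz sampling, Q = 30) are assumptions and not the agent's generated script. A 90% power reduction corresponds to 10 dB of attenuation.

```python
import math


def notch_coefficients(f0: float, fs: float, q: float):
    """Biquad notch coefficients (b, a), normalized so that a[0] == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def apply_filter(b, a, x):
    """Direct-form I difference equation for a second-order section."""
    y = []
    for n, xn in enumerate(x):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2)
    return y


def tone_power(signal, freq: float, fs: float) -> float:
    """Power of one frequency component, measured with a single-bin DFT."""
    re = sum(s * math.cos(2.0 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    return (re * re + im * im) / len(signal) ** 2


def power_reduction(f0=50.0, keep=200.0, fs=1000.0, q=30.0, seconds=2.0):
    """Return (fraction of power removed at f0, fraction retained at `keep`)."""
    n_samples = int(fs * seconds)
    x = [
        math.sin(2.0 * math.pi * f0 * n / fs) + math.sin(2.0 * math.pi * keep * n / fs)
        for n in range(n_samples)
    ]
    b, a = notch_coefficients(f0, fs, q)
    y = apply_filter(b, a, x)
    half = n_samples // 2  # measure the second half, after the transient has decayed
    removed = 1.0 - tone_power(y[half:], f0, fs) / tone_power(x[half:], f0, fs)
    retained = tone_power(y[half:], keep, fs) / tone_power(x[half:], keep, fs)
    return removed, retained
```

Calling `power_reduction()` returns the two fractions for the default test signal. Measuring only the second half of the output matters because a narrow notch rings for a while before reaching steady state.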
This project addresses three primary Research Questions (RQs):
- RQ1: Can an AI agentic framework achieve code quality comparable to human developers in Python?
- RQ2: What are the relative strengths and weaknesses of AI-generated code compared to human-written code across different task types?
- RQ3: What is the correlation between framework components (e.g., web search) and the functional correctness of the generated code?
.
├── README.md
├── assets/                  # Images and diagrams
├── AIWritten/               # Scripts generated by the AI agent
│   ├── ai.py
│   ├── analyze_quality.sh   # Static analysis automation
│   ├── basic_cli.py
│   ├── code_quality_report.csv
│   ├── csv_tool.py
│   ├── server.py
│   └── todo_cli.py
└── HumanWritten/            # Human-authored benchmark scripts
    ├── ai.py
    ├── analyze_quality.sh
    ├── code_quality_report.csv
    ├── csv_parser.py
    ├── server.py
    ├── size.py
    └── todo.py
- Python 3.x
- LLM: Google Gemini API key (set as GOOGLE_API_KEY in environment).
- Static Analysis: Pylint, Radon, Bandit.
- Clone the repository: `git clone https://github.com/Alpsource/SQM_Test.git`
- Install dependencies: `pip install pylint radon bandit google-genai`
- Set your API key: `export GOOGLE_API_KEY='your_key_here'`
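Before running the agent, it can help to confirm that the key is actually visible to Python. The helper below is a hypothetical convenience and not part of the repository. It reads `GOOGLE_API_KEY` from the environment and fails with a clear message if the variable is missing or still holds the placeholder value.

```python
import os


def load_api_key(env=None) -> str:
    """Return the Gemini API key from the environment, or raise a clear error."""
    # `env` defaults to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key or key == "your_key_here":
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it before running the agent."
        )
    return key
```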