Code Quality Checks for LLM-Generated Code: The Missing Layer
Large language models produce code at an astonishing rate. A single prompt can generate hundreds of lines of working, syntax-correct code in seconds. Individual developers and entire teams are adopting LLM-assisted workflows to accelerate delivery, and the volume of generated code entering production repositories is climbing every quarter.
There is a problem that becomes obvious the moment you review a significant body of LLM-generated code: it passes every conventional quality check, but it still feels wrong. The linter is happy. The complexity metrics are within range. The test coverage threshold is met. But the codebase has odd dependencies that should not exist, modules that have no clear responsibility boundary, and interfaces that expose internal details for no reason.
This is not an accident. The tools we currently rely on for automated code quality were built for a world where humans write all the code. That world no longer exists. The gap between what our tools catch and what actually degrades code quality in LLM-generated code is large and growing.
What Existing Linting Tools Catch
The current generation of code quality tools operates at a level of granularity that maps well to human-written code. ESLint, Pylint, RuboCop, and their equivalents across languages check for:
- Style violations: indentation, naming conventions, unused imports, line length.
- Local complexity: cyclomatic complexity thresholds, function length, nesting depth, number of parameters.
- Known bug patterns: unsafe type coercions, unreachable code, resource leaks, null pointer risks.
- Basic architectural rules: import restrictions and dependency direction between layers, where a plugin or custom rule has been configured for them (e.g., controllers should not call repositories directly in some frameworks).
These checks catch real problems. Every team should run them in CI. But they operate on a function-by-function or file-by-file basis, with limited ability to assess the relationships between modules. They tell you if a function is too long. They do not tell you if that function is in the wrong module, or if it creates a dependency that couples two unrelated subsystems.
What the Tools Miss in LLM-Generated Code
LLMs generate code that is syntactically correct and locally reasonable, but architecturally undisciplined. The model does not hold a picture of your entire codebase. It sees a prompt, often with limited context, and generates the most statistically plausible completion. This produces patterns that no linter flags but every experienced engineer notices.
Odd dependency structure. The LLM will import and use whatever classes or functions the prompt suggests, regardless of whether they belong in that module. I have seen a data-access module import a notification service because the LLM decided that “when an order is saved, also send a confirmation” should be implemented as a direct call inside the repository method. A linter sees an import statement and a method call. Both are syntactically valid. The architectural violation is invisible at the file level.
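To make the order example concrete, here is a minimal sketch of what such a repository can look like. The class and method names (`OrderRepository`, `NotificationService`, `send_confirmation`) are hypothetical, and both classes are flattened into one file purely for illustration; in a real project they would live in separate modules.

```python
class NotificationService:
    """Stand-in for a notification module; records messages instead of sending."""

    def __init__(self):
        self.sent = []

    def send_confirmation(self, order_id):
        self.sent.append(order_id)


class OrderRepository:
    """Data-access class of the kind described above (hypothetical)."""

    def __init__(self, notifier):
        self._rows = {}
        # The data layer now holds a reference to the notification layer.
        self._notifier = notifier

    def save(self, order_id, payload):
        self._rows[order_id] = payload
        # Side effect hidden inside persistence: every caller of save()
        # now triggers a notification, whether it wants one or not.
        self._notifier.send_confirmation(order_id)
```

Nothing here would trip a style or complexity rule. The problem only shows up when you ask which module is allowed to know about which.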
Module boundary violations. LLMs have no sense of what belongs where. Given a prompt that mentions “payment processing,” the model might generate payment logic inside the order entity, inside the checkout controller, or inside a dedicated service class, depending on which pattern is most common in its training data and on whatever context the prompt happens to include. It does not know that your project separates read models from write models, or that your domain entities should not reference infrastructure types. The resulting code compiles and passes all low-level checks, but it erodes the module boundaries that keep the codebase maintainable.
Poorly defined interfaces. When an LLM generates a class or function signature, it tends to include every parameter that might conceivably be useful, because that covers more possible usage scenarios. The result is interfaces that are too broad, with parameters that no caller uses, and return types that expose internal state. These interfaces pass style checks and complexity metrics. They create coupling problems that surface months later.
Inconsistent abstraction levels. An LLM-generated method might start with a high-level business operation and then drop into low-level implementation details within the same function, because the training data contains both styles and the model mixes them. A linter will flag line length but not the fact that the method mixes SQL string building, business rule evaluation, and HTTP response formatting.
The Structural Reason Conventional Tools Fall Short
The reason is structural. Most linting and static analysis tools operate on the abstract syntax tree (AST) of individual files or functions. They can detect that a method has fifteen parameters or that a class has fifty methods. They cannot easily detect that a module in the data layer depends on a module in the presentation layer, because that dependency is not a local violation of a rule. It is a property of the module graph, and the AST of a single file does not contain the module graph.
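The difference between a per-file view and a module graph can be sketched in a few lines of Python using the standard `ast` module. This is an illustrative sketch, not a production analyzer: it assumes absolute imports only, treats the first dotted component as the "module", and the three-file project in `FILES` is made up.

```python
import ast
from collections import defaultdict


def imported_packages(source):
    """Return the top-level package names imported by one source file."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def build_module_graph(files):
    """Map each top-level package to the other project packages it imports.

    `files` maps a dotted module name (e.g. "data.orders") to its source text.
    Imports of packages outside the project are ignored.
    """
    internal = {name.split(".")[0] for name in files}
    graph = defaultdict(set)
    for name, source in files.items():
        owner = name.split(".")[0]
        graph[owner]  # make sure packages with no outgoing edges still appear
        for target in imported_packages(source):
            if target in internal and target != owner:
                graph[owner].add(target)
    return dict(graph)


# Hypothetical project: the data layer imports the notification layer.
FILES = {
    "data.orders": "from notifications.email import send\nimport json\n",
    "notifications.email": "import smtplib\n",
    "web.views": "from data.orders import save\n",
}
GRAPH = build_module_graph(FILES)
```

Each file's AST only yields a set of names. The edge from `data` to `notifications` becomes visible as a fact about the project only once those per-file sets are combined.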
Dedicated architectural analysis tools exist: Structure101, jQAssistant, ArchUnit for Java, and similar tools in other ecosystems. They model the full dependency graph and can enforce rules about which modules may depend on which. But these tools are underused compared to linters. They require upfront configuration of architectural boundaries, which teams rarely invest in until the codebase is already suffering.
LLM-generated code accelerates the timeline. With human-written code, architectural drift takes months to become visible. With LLM-generated code, it can happen in a single pull request: the model has no awareness of the intended architecture, and one request can produce a hundred new files shaped by prompt context alone.
What a Better Automated Quality Check Looks Like
From my experience working with codebases that have high volumes of LLM-generated code, I have come to believe that the automated quality tools we need operate at the module and dependency level, not the function level. They need to answer questions like:
- Does this new module create a dependency cycle where none existed before?
- Does this class import types from domains that are unrelated to its stated responsibility?
- Does this interface expose methods or types that belong in an internal implementation rather than a public contract?
- Is there a pattern of bidirectional coupling between two modules that should have a one-directional dependency?
- Are there modules that have no clear single responsibility, measured by the diversity of external types they depend on?
These questions are not hard to answer programmatically. Dependency graphs are computable. Import-level coupling metrics are straightforward to measure. The challenge is that, in most pipelines, nothing surfaces these metrics in CI and blocks a pull request the way a linting violation does.
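As one example, the first question in the list (does a change create a dependency cycle where none existed before?) reduces to cycle detection on a directed graph. The sketch below assumes the dependency graph is already available as a dict mapping each module name to the set of modules it imports; the function names and the two sample graphs are illustrative.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of modules, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in sorted(graph.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def introduces_cycle(before, after):
    """True when `after` contains a cycle and `before` did not."""
    return find_cycle(before) is None and find_cycle(after) is not None


# Hypothetical graphs for the base branch and the pull request branch.
BEFORE = {"data": set(), "notifications": {"data"}}
AFTER = {"data": {"notifications"}, "notifications": {"data"}}
```

A CI step could build both graphs, call `introduces_cycle`, and fail the build with the cycle path in the message.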
I have been experimenting with building a tool that addresses exactly this gap. The core idea is simple: generate a module-level dependency graph for the codebase, compute coupling and cohesion metrics for each module, and flag any new code that degrades those metrics beyond a configurable threshold. The tool does not care about line length or function nesting. It cares about whether a new import in the repository layer pulls in a dependency on the email service, or whether a new public method on the order entity leaks infrastructure types.
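The threshold idea can be illustrated with the simplest possible coupling metric: the number of outgoing dependencies per module (fan-out). This is a stand-in chosen for brevity, not the metric set the tool described above actually uses. The function names, the `max_increase` parameter, and the sample graphs are all assumptions, and the graph is again taken to be a dict from module name to the set of modules it imports.

```python
def fan_out(graph):
    """Number of distinct modules each module depends on."""
    return {module: len(deps) for module, deps in graph.items()}


def coupling_regressions(before, after, max_increase=0):
    """Modules whose fan-out grew by more than `max_increase`.

    Returns {module: (old_fan_out, new_fan_out)}. A module that did not
    exist in `before` is treated as having had a fan-out of zero.
    """
    old = fan_out(before)
    new = fan_out(after)
    return {
        module: (old.get(module, 0), count)
        for module, count in new.items()
        if count - old.get(module, 0) > max_increase
    }


# Hypothetical graphs before and after a pull request.
BEFORE = {"data": set(), "web": {"data"}}
AFTER = {"data": {"notifications"}, "web": {"data"}, "notifications": set()}
REGRESSIONS = coupling_regressions(BEFORE, AFTER)
```

With the default threshold of zero, any new outgoing dependency is reported. Raising `max_increase` trades strictness for fewer interruptions.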
The experiment is early, but the signal is already clear. The dependency-level checks catch architectural violations that no linter in my CI pipeline flags. A pull request that adds a direct dependency between the data access module and the notification module gets rejected on coupling grounds, before any human reviewer looks at it.
The Tooling Ecosystem Was Built for a Different Era
The tooling ecosystem for code quality is conservative by nature. Linting tools mature over decades. The patterns they detect are well-understood and codified in style guides and language specifications. Architectural quality is harder to codify because it depends on context. What constitutes a reasonable module boundary in one project is a violation in another.
LLMs have changed the economics of code generation faster than the tooling ecosystem can adapt. The volume of code that can be produced per unit time has increased sharply, but the tools that validate that code have not changed at the same pace. The result is a growing stock of generated code that passes all automated checks but contributes to long-term maintenance problems.
The problem is compounded by the fact that LLMs do not learn from feedback in the same way that human developers do. A junior developer who introduces a bad dependency in a pull request will be corrected in code review and will learn not to do it again. An LLM generating the next pull request carries no memory of the previous correction. The same architectural mistake can recur across dozens of pull requests, with no mechanism to prevent it beyond manual review.
Building Toward a Solution
The approach I am exploring is not the only path. Teams using ArchUnit or similar tools can codify architectural rules and enforce them in CI today, for languages that support them. Teams with mature architecture governance processes can extend those processes to require architectural review for any pull request that introduces new dependencies.
But these approaches require ongoing human effort to define and maintain the rules. What I think we need is a tool that can, with minimal configuration, analyze a codebase and detect the kinds of architectural issues that LLM-generated code is prone to introducing. The tool should not require a team to specify every allowed dependency upfront. It should be able to detect anomalies relative to the existing codebase structure and flag new code that deviates from it.
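The simplest form of "deviation from the existing structure" is a dependency edge that the codebase has never contained before. The sketch below treats the current main branch as an implicit rule set and flags anything outside it. It is a deliberately naive baseline under my own assumptions (graph as a dict from module name to the set of modules it imports, hypothetical function names and sample data), not a description of a finished tool.

```python
def edges(graph):
    """Flatten a module graph into a set of (source, target) pairs."""
    return {(src, dst) for src, deps in graph.items() for dst in deps}


def novel_edges(baseline, candidate):
    """Dependencies in `candidate` that never occur in `baseline`, sorted."""
    return sorted(edges(candidate) - edges(baseline))


# Hypothetical graphs: main branch versus a pull request branch.
BASELINE = {"web": {"services"}, "services": {"data"}, "data": set()}
CANDIDATE = {
    "web": {"services"},
    "services": {"data"},
    "data": {"notifications"},
    "notifications": set(),
}
FLAGGED = novel_edges(BASELINE, CANDIDATE)
```

No allowed-dependency list is written by hand here. The cost is that every legitimate new dependency is flagged once, and a brand-new module has no history, so all of its imports are reported.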
This is the problem I am trying to solve with my own tool. If you are working on something similar, or if you have encountered the same gap in your own CI pipeline, I would be interested to hear about your approach. The industry needs better solutions in this space, and the problem is not going to get smaller as LLM code generation becomes more widespread.
Recommendations for Teams Using LLM Code Generation
While the tooling ecosystem catches up, there are practical steps you can take to prevent architectural degradation from LLM-generated code.
Review dependency graphs as part of pull request review. Do not just review the diff file by file. Look at what new imports the change introduces and whether those imports create dependencies between modules that should be independent. This is tedious to do manually, but it catches the most common category of LLM-generated architectural violations.
Use architecture testing frameworks. ArchUnit (Java), NetArchTest (.NET), and similar tools let you write automated assertions about your module dependency structure. Enforce those assertions in CI. If your team has not invested in this yet, the growing volume of LLM-generated code is a strong reason to start.
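For ecosystems without such a framework, the same kind of assertion can be hand-rolled. The following Python sketch is not the ArchUnit or NetArchTest API; it is a plain function over a dependency graph (a dict from package name to the set of packages it imports), and the layer names in `ALLOWED` are hypothetical.

```python
# Hypothetical rule table: each package may import only the packages listed.
ALLOWED = {
    "web": {"services"},
    "services": {"data", "notifications"},
    "data": set(),
    "notifications": set(),
}


def forbidden_edges(graph, allowed):
    """Return sorted (source, target) pairs that the rule table does not permit."""
    return sorted(
        (src, dst)
        for src, deps in graph.items()
        for dst in deps
        if dst not in allowed.get(src, set())
    )


def check_layering(graph, allowed=ALLOWED):
    """Raise AssertionError listing every forbidden dependency."""
    violations = forbidden_edges(graph, allowed)
    assert not violations, f"forbidden dependencies: {violations}"


# A graph in which the data layer reaches into notifications.
BAD_GRAPH = {"web": {"services"}, "services": {"data"}, "data": {"notifications"}}
```

Called from a test function in the project's test suite, `check_layering` fails the build with the offending edges listed in the message.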
Invest in well-defined interfaces before generating code. The Hexagonal Architecture approach of defining port interfaces in the domain core before implementing adapters in infrastructure creates natural boundaries that LLMs are less likely to violate. When the prompt includes a clear interface definition, the generated code is more likely to respect it.
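A minimal sketch of that ordering, using Python's `typing.Protocol` for the port. The names `OrderNotifier`, `OrderService`, and `InMemoryNotifier` are illustrative assumptions, and everything is shown in one file for brevity.

```python
from typing import Protocol


class OrderNotifier(Protocol):
    """Port defined in the domain core before any adapter exists."""

    def order_saved(self, order_id: str) -> None: ...


class OrderService:
    """Domain service that depends only on the port, never on an adapter."""

    def __init__(self, notifier: OrderNotifier) -> None:
        self._notifier = notifier
        self._orders = {}

    def place(self, order_id: str, payload: dict) -> None:
        self._orders[order_id] = payload
        self._notifier.order_saved(order_id)


class InMemoryNotifier:
    """Adapter that would live in infrastructure; here it only records calls."""

    def __init__(self) -> None:
        self.seen = []

    def order_saved(self, order_id: str) -> None:
        self.seen.append(order_id)
```

Pasting the `OrderNotifier` definition into the prompt gives the model a concrete contract to implement against, rather than leaving it to decide where the notification call should live.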
Use the script pattern for code generation tasks that involve analysis or computation: ask the LLM to write a script that performs the work and then run that script, rather than asking the model to produce the result itself. The outputs are deterministic and verifiable, and the LLM stays in the role of code generator rather than computation engine. Because the generated scripts are typically self-contained and limited in scope, there is also less surface area for architectural mistakes.
Track architectural drift over time. Run a dependency graph analysis regularly and compare it to the previous state. If the number of cross-module dependencies is increasing faster than your team can address in refactoring, that is a signal to tighten automated checks or reduce the scope of LLM-generated code contributions.
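A drift report can be as small as an edge count per snapshot. The sketch below assumes each snapshot is a dependency graph (a dict from module name to the set of modules it imports) captured at some regular interval; the labels and sample data are made up.

```python
def edge_count(graph):
    """Total number of cross-module dependencies in one snapshot."""
    return sum(len(deps) for deps in graph.values())


def drift_report(snapshots):
    """Summarize dependency growth across chronologically ordered snapshots.

    `snapshots` is a list of (label, graph) pairs. Returns a list of
    (label, edge_count, change_since_previous) tuples; the first change is 0.
    """
    report = []
    previous = None
    for label, graph in snapshots:
        count = edge_count(graph)
        delta = 0 if previous is None else count - previous
        report.append((label, count, delta))
        previous = count
    return report


# Hypothetical weekly snapshots.
SNAPSHOTS = [
    ("week-1", {"web": {"services"}, "services": {"data"}}),
    ("week-2", {"web": {"services"}, "services": {"data", "notifications"}}),
    ("week-3", {"web": {"services", "data"}, "services": {"data", "notifications"},
                "data": {"notifications"}}),
]
REPORT = drift_report(SNAPSHOTS)
```

A steadily positive last column is the signal described above: dependencies are accumulating faster than refactoring removes them.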
The key insight is that LLM-generated code requires the same architectural discipline as human-written code, but it needs more automation to enforce that discipline because the model will not learn from past corrections. The quality checks that worked when humans wrote all the code are not sufficient anymore. The tools need to evolve, and teams need to adapt their processes now, before the architectural debt becomes unmanageable.