2.2 Common Vulnerability Patterns in AI-Generated Code
AI coding tools consistently reproduce a predictable set of security vulnerabilities. These are not random mistakes — they follow directly from how AI models are trained. An AI learns by studying billions of lines of code from public repositories and tutorials. If those examples mostly show insecure patterns, the AI will reproduce those same patterns. Each vulnerability described here was the dominant approach in older tutorials and public repositories, so it appears far more often in the training data than the secure alternative. The AI is not trying to write insecure code; it is simply generating the most statistically common completion for your prompt.
Veracode's 2025 benchmark tested over 100 LLMs and found that cross-site scripting tasks failed 86% of the time. SQL injection vulnerabilities appear in roughly 20% of AI-generated database code. Missing input validation is the most frequently cited security flaw across independent studies. Understanding why each insecure pattern occurs will help you recognize it in any AI-generated code — not just memorize a list.
The Five Patterns at a Glance
The five vulnerability patterns AI coding tools consistently reproduce
| Pattern | CWE | Severity | Why AI Does It | One-Line Fix |
|---|---|---|---|---|
| Missing Input Validation | CWE-20 | High | Introductory tutorials show working code first; validation is treated as a polish step that most beginner examples omit entirely. | Validate type, range, format, and length before passing any value to a downstream system. |
| SQL Injection via String Concatenation | CWE-89 | Critical | String concatenation was the dominant SQL pattern in tutorials from the 1990s and 2000s — statistically more common in training data than parameterized queries. | Use parameterized queries (prepared statements). Never build SQL strings from user input. |
| Hardcoded Credentials | CWE-798 | High | Quick-start guides and early tutorials hardcode credentials for simplicity. Loading from environment variables requires extra setup steps that beginner examples skip. | Store all secrets in environment variables or a secrets manager. Never put literal values in source code. |
| Insecure Cryptography | CWE-916 | Critical | MD5 and SHA-1 appear far more often in historical code than bcrypt or Argon2. The insecure function runs without errors and satisfies the prompt — the flaw is invisible at syntax level. | Use bcrypt, scrypt, or Argon2id for password hashing. Never use MD5 or SHA-1 for passwords. |
| XSS via Unsanitized Output | CWE-79 | High | Early web templates render user data directly into HTML for simplicity. Output escaping is a defensive step that most tutorial examples omit, making it statistically uncommon in training data. | Use textContent instead of innerHTML. In frameworks that escape by default (React JSX), avoid unsafe escape hatches. |
Severity ratings follow CVSS conventions. Failure rates from Veracode GenAI Code Security Report 2025 and Endor Labs research.
From Training Data to Vulnerability
Each pattern below maps directly to a class of code that appears frequently in training data — not because the patterns are correct, but because they are common. The diagram below shows how each bias in the training data leads to a specific vulnerability in generated code.
Missing Input Validation
What it is: AI generates functions that accept user-supplied values and use them directly — passing them to databases, file systems, or external services without first checking whether the values are valid, within an expected range, or even the correct data type.
Why AI does it: Introductory coding tutorials almost never include validation. They show you how to make something work first, and validation is treated as a separate "polish" step. Since most training examples omit it, the model omits it too. Research by Endor Labs identifies missing input validation (CWE-20) as the single most common security flaw in LLM-generated code across every language they tested.
One especially dangerous pattern in AI-assisted applications: the AI often adds validation only in the React (or other frontend) component, leaving the API endpoint completely unprotected. An attacker who calls the API directly — bypassing the frontend entirely — faces no checks at all. This is why server-side validation is always required, regardless of what the frontend does.
Missing Input Validation
Severity: High. The AI-generated endpoint accepts any user_id value and passes it directly to the database without checking whether it is a valid, positive integer.
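As a rough illustration of the server-side check this card calls for, the sketch below validates a user_id before it reaches any database call. The helper name `parse_user_id`, the 10-character length limit, and the upper bound are assumptions made for the example, not part of any framework.

```python
def parse_user_id(raw: object, max_id: int = 2**31 - 1) -> int:
    """Return raw as a positive integer, or raise ValueError."""
    # Type and format: only a string of ASCII digits is acceptable.
    if not isinstance(raw, str) or not raw.isascii() or not raw.isdigit():
        raise ValueError("user_id must be a string of digits")
    # Length: reject absurdly long input before converting it.
    if len(raw) > 10:
        raise ValueError("user_id is too long")
    value = int(raw)
    # Range: IDs start at 1 and must fit the assumed column size.
    if not 1 <= value <= max_id:
        raise ValueError("user_id is out of range")
    return value
```

An endpoint would call a helper like this on the raw request value and return a 400 response when it raises, whether or not the frontend already validated the field.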
SQL Injection via String Concatenation
What it is: AI builds database queries by joining strings together with the user's input embedded directly inside the query. An attacker can insert SQL syntax into their input to manipulate the query — bypassing authentication, extracting data, or destroying records.
Why AI does it: String concatenation was the dominant pattern in SQL tutorials from the 1990s and 2000s. The model has seen it far more often than the safer alternative (parameterized queries). The concatenated version also runs without errors on normal input, so there is no signal to indicate that anything is wrong — the flaw only surfaces when someone deliberately supplies malicious input.
Independent research analyzing over 435 Copilot-generated GitHub code snippets found SQL injection among the top security weakness categories. The Cloud Security Alliance (2025) summarizes the root cause directly: "If an unsafe pattern — such as string-concatenated SQL queries — appears frequently in the training set, the assistant will readily produce it."
SQL Injection via String Concatenation
Severity: Critical. User input is concatenated directly into the SQL query string. An attacker can inject SQL syntax to bypass authentication, dump all rows, or delete records.
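The difference between the two approaches can be sketched with Python's built-in sqlite3 module. The table, the function names, and the payload are made up for the example; the point is only that a placeholder keeps the input out of the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

def find_user_unsafe(name: str) -> list:
    # Vulnerable: the input becomes part of the SQL text itself.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(name: str) -> list:
    # Parameterized: the value is bound separately from the SQL text.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()

payload = "' OR '1'='1"
print(len(find_user_unsafe(payload)))  # 2: the injected OR clause matches every row
print(len(find_user_safe(payload)))    # 0: the payload is treated as a literal name
```

Both functions behave identically on ordinary input such as "alice", which is why the unsafe version gives no warning sign during normal testing.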
Hardcoded Credentials
What it is: API keys, passwords, database connection strings, and tokens written directly into source code. AI does this most often when asked to "connect to this database" or "set up this API integration." Once committed to a git repository, a hardcoded credential persists in the version history even after the file is edited or deleted.
Why AI does it: Quick-start guides and early tutorials hardcode credentials for simplicity — often with a comment like "replace this with your actual key." The model learns the pattern (credential directly in source) but not the warning comment. This risk is compounded by how AI coding agents work: they automatically read your project files — including .env and other config files — to understand your project's context. If those files contain real credentials, the model may copy those exact values directly into the code it generates. See Section 2.6 for the full explanation.
Researchers at the Chinese University of Hong Kong prompted Copilot to fill in redacted secrets from 900 real GitHub code snippets. The result: 2,702 valid secrets extracted at a 33.2% valid rate, including confirmed working Stripe payment keys and AWS Access Keys. GitGuardian found that AI-assisted commits leak secrets at twice the baseline rate of human-only commits.
Hardcoded Credentials
Severity: High. API key and database password written directly into the source file. When this file is committed to git, both secrets are permanently recorded in version history.
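One way to keep the secrets out of the source file is to read them from the environment at startup and fail loudly if they are missing. This is a minimal sketch; the variable names `PAYMENTS_API_KEY` and `DB_PASSWORD` and both helper functions are hypothetical.

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, or fail if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

def load_settings() -> dict:
    # No literal secrets appear here; values come from the process environment.
    return {
        "api_key": require_env("PAYMENTS_API_KEY"),
        "db_password": require_env("DB_PASSWORD"),
    }
```

Failing at startup is a deliberate choice: a missing secret surfaces immediately instead of being papered over with a default value that might later be committed.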
Insecure Cryptography
What it is: AI uses weak or outdated cryptographic algorithms, especially MD5 and SHA-1 for password hashing, because these are the most frequently referenced hash functions in historical code examples. The critical flaw is not just their age: MD5 and SHA-1 are fast hash functions. For password hashing, speed is the opposite of what you want, because every hash an attacker can compute per second is another password guess. (The card below explains why.)
Why AI does it: MD5 and SHA-1 appear far more often in training data than modern alternatives like bcrypt, scrypt, or Argon2id. The insecure function runs without errors, produces output that looks like a valid hash, and satisfies the prompt. The flaw is invisible at the syntax level — it only becomes a serious problem when the password database is leaked and attackers crack every hash in hours.
Weaknesses in MD5 have been known since the mid-1990s, and practical collisions were demonstrated in 2004. NIST formally deprecated SHA-1 in 2011, and Google demonstrated a full SHA-1 collision in 2017. Yet AI tools still routinely produce both for password hashing, because the training data reflects decades of widespread usage, not current best practices.
Insecure Password Hashing
Severity: Critical. MD5 is a fast hash function designed for data integrity checks, not password storage. Modern hardware can compute billions of MD5 hashes per second, making brute-force attacks trivially cheap.
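For contrast, here is a sketch of a deliberately slow, salted password hash using scrypt from Python's standard library (one of the three algorithms named in the table above). The cost parameters are illustrative assumptions, not a recommendation; production values should come from current guidance such as the OWASP Password Storage Cheat Sheet.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters (assumed for this sketch, not tuned for production).
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "maxmem": 64 * 1024 * 1024}

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key), using a fresh random salt per password."""
    salt = os.urandom(16)
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(key, expected)
```

Unlike MD5, each call here costs noticeable CPU time and memory, which is exactly the property that makes large-scale guessing expensive after a database leak.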
Cross-Site Scripting (XSS) via Unsanitized Output
What it is: AI generates HTML templates or API responses that embed user-controlled content without escaping it first. "Escaping" means converting special characters like < and > into harmless text so the browser displays them as characters rather than treating them as HTML tags. Without escaping, a user who stores a malicious script in their data can make that script run in every other user's browser that loads the page — stealing session cookies, performing actions on the victim's behalf, or displaying fake login forms.
Why AI does it: Early web development examples — and many quick-start templates — render user data directly into HTML for simplicity. Output escaping requires an extra function call that most tutorial authors omit. Veracode's 2025 benchmark measured an 86% failure rate for XSS-related tasks — the highest failure rate of any vulnerability category they tested.
XSS is so common in AI-generated code because the unsafe patterns (innerHTML, unescaped template literals, dangerouslySetInnerHTML) are what most basic examples use. The safer alternatives (textContent, framework auto-escaping) are rarely demonstrated in tutorial-style code, making them statistically uncommon in the training data.
XSS via Unsanitized Output
Severity: High. User-supplied content inserted directly into innerHTML. An attacker who stores a script tag in this field causes it to execute in every visitor's browser.
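The card describes the browser-side case, where the fix is textContent. The same idea applies when HTML is assembled on the server, and the sketch below shows that server-side analogue in Python using the standard library's `html.escape`. The function names and the surrounding markup are assumptions for the example.

```python
import html

def render_comment_unsafe(comment: str) -> str:
    # Vulnerable: user content is placed into the markup unchanged.
    return f'<p class="comment">{comment}</p>'

def render_comment_safe(comment: str) -> str:
    # Escaped: <, >, &, and quotes become harmless character references.
    return f'<p class="comment">{html.escape(comment)}</p>'

attack = "<script>alert(1)</script>"
print(render_comment_unsafe(attack))
# <p class="comment"><script>alert(1)</script></p>
print(render_comment_safe(attack))
# <p class="comment">&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

In practice a template engine with auto-escaping enabled does this for every value, which is safer than remembering to call an escape function by hand.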
How to Catch These Patterns
After every AI-assisted coding session, run through this quick checklist before committing:
- Grep for hardcoded secrets: `apiKey`, `password`, `secret`, `Bearer`, and suspiciously long base64-like strings
- Look for SQL string concatenation: `f"SELECT`, `"WHERE " +`, query variables built with `+` or template literals
- Check input validation: every function that accepts external input (request parameters, form data, file uploads) should have explicit validation before the value is used
- Check crypto imports in authentication code: `hashlib.md5`, `hashlib.sha1`, `crypto.createHash('md5')`, `MessageDigest.getInstance("MD5")`
- Search for `innerHTML` assignments and verify each instance either uses trusted content or is sanitized with DOMPurify
Static analysis rules for each pattern — run these with Semgrep or CodeQL before committing
| Pattern | Grep / Semgrep Target | What to Look For |
|---|---|---|
| Missing Validation | request.args.get, req.query, req.body, req.params | Is the returned value used directly without a type or range check? |
| SQL Injection | f"SELECT, f"WHERE, + " WHERE", execute( followed by string concat | Is user input ever part of the SQL string, not a parameter placeholder? |
| Hardcoded Credentials | apiKey =, password =, secret =, Bearer , strings matching sk_live_, AKIA | Any string literal that looks like a key, token, or password adjacent to a connection or auth call |
| Weak Crypto | hashlib.md5, hashlib.sha1, createHash('md5'), createHash('sha1'), getInstance("MD5") | Any use of MD5 or SHA-1 in authentication, password storage, or security-sensitive context |
| XSS | innerHTML, dangerouslySetInnerHTML, document.write, bypassSecurityTrust | Is the right-hand side user-controlled? If so, is it sanitized before assignment? |
Semgrep has pre-built rules covering these patterns. Run: semgrep scan --config p/owasp-top-ten on AI-generated code before committing.
Static analysis tools like Semgrep and CodeQL automate most of these checks. Running them as a pre-commit hook (a script that runs automatically before each git commit) or in your CI pipeline catches patterns that look syntactically correct but are structurally insecure — exactly the kind of flaw that manual code review most often misses in AI-generated output.
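To make the grep targets concrete, the sketch below turns a few of them into a line-based scanner. It is a crude heuristic for illustration only: the rule names and regular expressions are assumptions derived from the table above, and it is no substitute for Semgrep or CodeQL, which understand code structure and data flow.

```python
import re

# (rule name, pattern) pairs loosely based on the grep targets listed above.
RULES = [
    ("missing-validation", re.compile(r'''request\.args\.get|req\.(query|body|params)''')),
    ("sql-injection", re.compile(r'''f["'](SELECT|WHERE)|\+\s*["'] ?WHERE''', re.IGNORECASE)),
    ("hardcoded-credential", re.compile(r'''(apiKey|password|secret)\s*=\s*["']|sk_live_|AKIA''')),
    ("weak-crypto", re.compile(r'''hashlib\.(md5|sha1)|createHash\(["'](md5|sha1)|getInstance\("MD5"\)''')),
    ("xss", re.compile(r'''innerHTML|dangerouslySetInnerHTML|document\.write|bypassSecurityTrust''')),
]

def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Every hit is a prompt for a human to look at the line, not a verdict: `request.args.get` followed by proper validation is fine, and a matching `password =` may be a test fixture.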
Using an AI coding agent as a security reviewer
Static analysis catches known patterns, but it cannot reason about application-specific logic — whether an endpoint should require authentication, whether a particular data flow exposes sensitive information, or whether a cryptographic choice is appropriate for the threat model. A second AI agent, invoked separately with an explicit security-review role, can fill this gap.
The key principle: the reviewer must be a separate session from the code generator. The agent that wrote the code is optimizing for "make it work"; a reviewer agent must optimize for "find what's wrong." Same model, different job, separate context.
In practice, this looks like:
- Dedicated review commands. Tools like Claude Code provide a built-in `/review` command that examines your pending changes with a security-focused lens. Use it after generating code, not during.
- Custom security prompts. Start a new AI session with an explicit adversarial prompt: "You are a security reviewer. Examine the following code for missing input validation, SQL injection, hardcoded credentials, weak cryptography, XSS, missing authentication checks, and overly permissive configurations. For each issue found, explain the attack scenario and suggest a fix."
- Scope the review to the diff. Feed the reviewer only what changed (`git diff`), so it focuses on new code rather than getting lost in the broader codebase.
- Treat the AI review as a first pass, not a final verdict. An AI reviewer catches mechanical issues reliably (missing validation, obvious injection paths, hardcoded secrets), but it can miss subtle logic flaws, business-rule violations, and novel attack vectors. For critical components (authentication, payment, access control, cryptography), human review remains non-negotiable.
This two-agent pattern — one to write, one to audit — is significantly more effective than asking the same session to "check its own work," because the reviewer starts fresh without the sunk-cost bias of having generated the code.
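The custom-prompt and diff-scoping steps can be combined in a small helper that assembles the text to paste into a fresh reviewer session. This is a sketch under stated assumptions: the function names and the BEGIN/END markers are invented for the example, and how the result is sent to a model is left out because it depends on the tool you use.

```python
import subprocess

REVIEW_PROMPT = (
    "You are a security reviewer. Examine the following code for missing input "
    "validation, SQL injection, hardcoded credentials, weak cryptography, XSS, "
    "missing authentication checks, and overly permissive configurations. "
    "For each issue found, explain the attack scenario and suggest a fix."
)

def current_diff() -> str:
    """Return the output of `git diff` for the current working tree."""
    result = subprocess.run(
        ["git", "diff"], capture_output=True, text=True, check=True
    )
    return result.stdout

def build_review_request(diff: str) -> str:
    """Combine the adversarial prompt with only the changed lines."""
    if not diff.strip():
        raise ValueError("nothing to review: the diff is empty")
    return f"{REVIEW_PROMPT}\n\n--- BEGIN DIFF ---\n{diff}\n--- END DIFF ---"
```

Calling `build_review_request(current_diff())` inside a repository yields a self-contained request that carries none of the generating session's context.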
Sources:
- Veracode GenAI Code Security Report (2025)
- Endor Labs — Top 10 Risks for LLM-Generated Code (2025)
- Cloud Security Alliance — Security Implications of AI Code Assistants (2025)
- CWE-20: Improper Input Validation (MITRE)
- CWE-89: SQL Injection (MITRE)
- CWE-798: Use of Hard-coded Credentials (MITRE)
- CWE-916: Use of Password Hash With Insufficient Computational Effort (MITRE)
- CWE-79: Cross-site Scripting (MITRE)
- OWASP Password Storage Cheat Sheet
- Copilot Secrets Extraction via Code Completion (CUHK, 2024)
- State of Secrets Sprawl 2026 (GitGuardian)