Build vs. Buy: Choosing What to Build and What to Use

Every system you design will reach a moment where you must answer a deceptively simple question: Should we build this ourselves, or use something that already exists?

This is not just a technical question. It is a question about where your team's limited engineering time creates the most value. Getting it wrong in either direction is costly. Building something you could have bought wastes months of engineering effort on commodity infrastructure; buying something you should have built creates vendor dependency on the very capability that makes your product unique.

The goal of this tutorial is to give you a mental model and a practical framework for making this decision well — and to apply it to the AI systems you are likely building today.

The Central Question: Differentiator or Commodity?

Before any cost analysis or feature comparison, ask a single diagnostic question:

"Is this capability a core business differentiator, or a supporting utility?"

A differentiator is something that directly makes your product better than the alternatives for your specific users. A commodity is a well-understood problem with many good existing solutions — something that must exist in your system, but where your specific implementation provides no competitive advantage.


The practical default: buy for commodity functions, build for differentiators, and look for a hybrid option when a managed service covers 80% of your need at a fraction of the cost. Most modern systems are not purely "built" or "bought" — they are assembled from managed services, open-source libraries, and a thin layer of custom logic that makes them specific to their use case.

Some clear examples to calibrate your intuition:

Function	Usually Commodity?	Usually Build or Buy?
User authentication	Yes — security is complex, standards exist	Buy (Auth0, Clerk, NextAuth + GitHub OAuth)
Email delivery	Yes — deliverability is infrastructure work	Buy (SendGrid, Resend, AWS SES)
Payment processing	Yes — compliance complexity is extreme	Buy (Stripe, Braintree)
Product recommendation algorithm	Depends — if ML ranking is your differentiator, build	Build or hybrid (custom model on managed infra)
Core AI reasoning/domain logic	No — this is likely your product's core value	Build (fine-tune or engineer the prompt chain)
Vector search infrastructure	Mostly yes — but scale and cost affect the answer	Start with managed (Pinecone, Qdrant Cloud); self-host at scale
LLM API calls	Usually yes — hosting LLMs is specialized infrastructure	Buy until cost makes self-hosting justified (see below)
Fraud detection model	No — your data distribution is unique to your business	Build on managed ML infrastructure

What "Buy" Actually Costs: Total Cost of Ownership

The most common mistake developers make when evaluating a managed service is comparing only the subscription price against zero — treating "build it ourselves" as free. The real comparison is Total Cost of Ownership (TCO) over a meaningful time horizon, typically three to five years.

TCO for the "buy" path includes more than the license fee:

License/subscription fees — the number on the pricing page
Integration engineering — the time required to connect the service to your system; often 1–4 weeks of engineer time for a non-trivial integration
Ongoing customization — the work required when your needs outgrow what the service supports
Migration cost — the engineering effort required to switch vendors if the relationship ends or prices rise
Training — time for your team to become productive with the service's API and conventions

TCO for the "build" path includes:

Initial development — writing, testing, and shipping the solution
Ongoing maintenance — bug fixes, dependency upgrades, security patches; typically 15–20% of initial development cost per year
Operational overhead — monitoring, incident response, capacity planning
Opportunity cost — the features and improvements you did not ship while building this instead
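As a rough illustration, the two cost lists above can be folded into a small calculator. This is a minimal sketch rather than a costing method from this tutorial: the interface names, the `migrationReserve` field, and every dollar figure are hypothetical placeholders, and recurring costs are assumed to be flat from year to year.

```typescript
// Hypothetical cost model for the two lists above; every figure is a placeholder.
interface BuyCosts {
  annualSubscription: number;  // license/subscription fees, per year
  integration: number;         // one-time integration engineering
  annualCustomization: number; // ongoing customization, per year
  migrationReserve: number;    // one-time estimate for switching vendors later
  training: number;            // one-time team ramp-up
}

interface BuildCosts {
  initialDevelopment: number;  // writing, testing, and shipping
  maintenanceRate: number;     // fraction of initial cost per year (0.15 to 0.20 in the text)
  annualOperations: number;    // monitoring, incident response, capacity planning
  opportunityCost: number;     // estimated value of the work not shipped
}

function buyTco(costs: BuyCosts, years: number): number {
  const oneTime = costs.integration + costs.migrationReserve + costs.training;
  const recurring = (costs.annualSubscription + costs.annualCustomization) * years;
  return oneTime + recurring;
}

function buildTco(costs: BuildCosts, years: number): number {
  const oneTime = costs.initialDevelopment + costs.opportunityCost;
  const annualMaintenance = costs.initialDevelopment * costs.maintenanceRate;
  return oneTime + (annualMaintenance + costs.annualOperations) * years;
}

const exampleBuy: BuyCosts = {
  annualSubscription: 10000,
  integration: 15000,
  annualCustomization: 5000,
  migrationReserve: 20000,
  training: 5000,
};

const exampleBuild: BuildCosts = {
  initialDevelopment: 75000,
  maintenanceRate: 0.2,
  annualOperations: 10000,
  opportunityCost: 30000,
};

const buyTotal = buyTco(exampleBuy, 3);
const buildTotal = buildTco(exampleBuild, 3);
console.log({ buyTotal, buildTotal });
```

With these made-up inputs the three-year totals come to about 85,000 for buying and 180,000 for building. The point is that both sides of the comparison are fully loaded, not that these particular numbers apply to any real service.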

The TCO Crossover: Why 'Free' Can Be Expensive

Build costs are front-loaded: a high initial investment, then stable or declining costs as the system matures. Buy costs are back-loaded: low upfront, but they accumulate month after month. The crossover, the point where the cumulative cost of buying catches up with the cumulative cost of building, typically falls around 18 to 36 months for most mid-size software teams. It arrives earlier when the service is expensive or the feature is simple to build, and later when the service is cheap or the feature is complex to build.


A useful rule of thumb: managed services also have a crossover point in usage. Below that volume, the managed service costs less than the engineering effort to build and operate an equivalent. Above it, self-hosting or building wins economically. Knowing roughly where your current usage falls relative to that crossover is the most practical input to the build vs. buy decision.
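The time-based crossover described in this section can be estimated with a few lines of arithmetic. The sketch below assumes a deliberately simple linear model (one upfront cost plus a flat monthly cost for each path); the `CostCurve` type and all figures are hypothetical, and real pricing is rarely this tidy.

```typescript
// Hypothetical linear cost model: a one-time cost plus a flat monthly cost.
interface CostCurve {
  upfront: number; // initial build cost, or integration cost for a service
  monthly: number; // upkeep for the build path, fees for the buy path
}

function cumulativeCost(curve: CostCurve, month: number): number {
  return curve.upfront + curve.monthly * month;
}

// First month in which the cumulative build cost is no higher than the
// cumulative buy cost, or null if that never happens within the horizon.
function crossoverMonth(
  build: CostCurve,
  buy: CostCurve,
  horizonMonths: number
): number | null {
  for (let month = 1; month <= horizonMonths; month++) {
    if (cumulativeCost(build, month) <= cumulativeCost(buy, month)) {
      return month;
    }
  }
  return null;
}

const buildCurve: CostCurve = { upfront: 75000, monthly: 1250 };
const pricierService: CostCurve = { upfront: 15000, monthly: 3750 };
const cheaperService: CostCurve = { upfront: 15000, monthly: 1500 };

console.log(crossoverMonth(buildCurve, pricierService, 60)); // 24
console.log(crossoverMonth(buildCurve, cheaperService, 60)); // null
```

Under these assumptions the pricier service is overtaken at month 24, while the cheaper one is never overtaken within five years, so buying stays the economical choice for the whole horizon.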

Three Real-World Examples

Abstract frameworks are easy to agree with and hard to apply in practice. These three concrete examples illustrate how the decision plays out in systems you are likely to encounter.

Authentication: The Classic "Buy" Case

Authentication is the textbook example of a function you should almost always buy. The apparent simplicity is deceptive: a correct, production-grade authentication system must handle password hashing, session management, OAuth flows, CSRF protection, rate limiting on login endpoints, refresh token rotation, and account recovery — all correctly, and all in ways that attackers actively probe for weaknesses.

Most teams underestimate this complexity and discover the gaps only after a security incident.

Option	Type	Approximate Monthly Cost	Best For
NextAuth.js v5	Self-managed open source	$0 (infrastructure costs only)	Full control, no per-user pricing, comfortable with owning OAuth complexity
Clerk	Fully managed SaaS	$25–$1,025/mo (by MAU)	Best developer experience for Next.js, fastest setup (minutes not hours)
Auth0	Fully managed SaaS	$0–$5,000+/mo (by MAU)	Compliance certifications (SOC 2, HIPAA, FedRAMP), enterprise SSO needs
AWS Cognito	Managed, AWS-native	~$0.0055/MAU ($550/100k users)	Already deeply in AWS; large user base at predictable cost
Supabase Auth	Managed, includes database	$0–$599/mo	Already using Supabase for your database

The hybrid insight: WiseBuilder uses NextAuth v5 with GitHub OAuth — a well-reasoned hybrid decision. NextAuth is open-source and free, eliminating per-user pricing at scale. GitHub OAuth offloads credential security to GitHub's battle-tested infrastructure. The team owns the session storage (Prisma/PostgreSQL) and the integration code, but the hard parts — OAuth security, token management — are handled by well-maintained open-source libraries. This is the hybrid model in practice: use the commodity security layer that already exists; build only the integration logic that is specific to your application.

The rule: Unless authentication is somehow your product's differentiator (it almost never is), buy it. The cost of getting it wrong — a data breach, a compliance failure, a session fixation vulnerability — dramatically exceeds the cost of a managed service.

Vector Databases: The AI-Specific Build vs. Buy Decision

Vector databases are one of the most consequential build vs. buy decisions for teams building AI-powered systems. They are the backbone of Retrieval-Augmented Generation (RAG) — a technique where your system retrieves relevant information from a knowledge base and passes it to the LLM as context, rather than relying solely on the model's training data. Vector databases store embeddings (dense numerical representations of text that capture semantic meaning) and let you search them by similarity, so your AI can find the most relevant context for any query.
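To make "search by similarity" concrete, here is a minimal brute-force sketch of what a vector database does conceptually. It is an illustration only: production systems use approximate nearest-neighbor indexes rather than a linear scan, and the `EmbeddedChunk` type, the function names, and the three-dimensional toy embeddings are all assumptions made for this example.

```typescript
interface EmbeddedChunk {
  id: string;
  text: string;
  embedding: number[]; // toy vectors here; real embeddings have hundreds of dimensions
}

// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: score every chunk against the query and keep the k best.
function topKBySimilarity(
  query: number[],
  chunks: EmbeddedChunk[],
  k: number
): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

const corpus: EmbeddedChunk[] = [
  { id: "refunds", text: "How refunds work", embedding: [0.9, 0.1, 0.0] },
  { id: "shipping", text: "Shipping times", embedding: [0.0, 1.0, 0.0] },
  { id: "returns", text: "Returning an item", embedding: [0.7, 0.7, 0.0] },
];

const queryEmbedding = [1, 0, 0];
const nearest = topKBySimilarity(queryEmbedding, corpus, 2);
console.log(nearest.map((chunk) => chunk.id)); // ["refunds", "returns"]
```

In a RAG pipeline, the text of the returned chunks would be passed to the LLM as context alongside the user's question.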

The options range from fully managed cloud services to self-hosted databases to using an extension on your existing PostgreSQL instance.

Vector Database Options: From Managed to Self-Hosted

Vector database choices form a spectrum between operational simplicity (managed cloud) and cost/control (self-hosted). The right choice depends on your query volume, team's operational capacity, and whether you are still validating the product or running a proven production system.


The practical starting point: If you are building your first RAG system and are already running PostgreSQL, start with pgvector. It handles the early stages of any RAG product without adding new infrastructure or vendor relationships. This lets you focus on what actually determines retrieval quality at this stage: your chunking strategy (how you split documents) and your embedding model — both of which matter far more than the choice of vector database at early scale. Migrate to a purpose-built vector database only when performance profiling confirms that pgvector is the actual bottleneck.

LLM APIs: When Self-Hosting Becomes Worth It

LLM APIs are the most extreme example of the build vs. buy spectrum in AI systems. Using a hosted API (OpenAI, Anthropic, Google) is the fastest way to get started, but the pay-per-token cost structure, in which you are charged for every token (roughly a word fragment) sent to and received from the model, creates a compounding bill that can grow faster than your revenue.
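A back-of-the-envelope estimator shows how a pay-per-token bill scales with traffic. This is a sketch under stated assumptions: the type names are invented for the example, and the prices are placeholders rather than any provider's actual rates.

```typescript
interface TokenPricing {
  inputPerMillion: number;  // USD per 1M input tokens (placeholder value below)
  outputPerMillion: number; // USD per 1M output tokens (placeholder value below)
}

interface UsageProfile {
  requestsPerDay: number;
  avgInputTokens: number;  // average tokens sent per request
  avgOutputTokens: number; // average tokens received per request
}

function monthlyApiCost(
  pricing: TokenPricing,
  usage: UsageProfile,
  daysPerMonth: number = 30
): number {
  const inputTokens = usage.requestsPerDay * usage.avgInputTokens * daysPerMonth;
  const outputTokens = usage.requestsPerDay * usage.avgOutputTokens * daysPerMonth;
  return (
    (inputTokens / 1000000) * pricing.inputPerMillion +
    (outputTokens / 1000000) * pricing.outputPerMillion
  );
}

const placeholderPricing: TokenPricing = { inputPerMillion: 1.0, outputPerMillion: 4.0 };
const launchUsage: UsageProfile = {
  requestsPerDay: 100000,
  avgInputTokens: 1000,
  avgOutputTokens: 250,
};
const scaledUsage: UsageProfile = {
  requestsPerDay: 400000,
  avgInputTokens: 1000,
  avgOutputTokens: 250,
};

console.log(monthlyApiCost(placeholderPricing, launchUsage)); // 6000
console.log(monthlyApiCost(placeholderPricing, scaledUsage)); // 24000
```

Because the cost is linear in request volume and in tokens per request, a fourfold increase in traffic quadruples the bill unless prompts get shorter or requests move to a cheaper model.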

Scenario	Recommended Approach	Why
Under 100k daily requests, under $50k/yr projected	Hosted API (Claude Haiku, GPT-5.4 Mini)	Operational simplicity far outweighs cost premium at this scale
100k–2M daily requests	Hybrid: route simple requests to cheaper models, complex to premium	GPT-5.4 Mini at $0.75/1M input tokens is 3× cheaper than GPT-5.4; most requests don't need the expensive model
Strict data privacy (HIPAA, classified, financial PII)	Self-hosted open-weight model	Data cannot leave your infrastructure; regulatory requirement overrides cost optimization
Over 2M daily requests with stable request patterns	Self-hosted open-weight model (Llama 3, Mistral)	Cost per token can be 100–400× lower at full GPU utilization; the savings justify the GPU infrastructure investment

The compounding cost problem is real. A pattern observed across many teams: LLM API spend grows from $15,000 to $60,000 per month in three months as the product scales, without any obvious trigger. The fix is rarely to self-host everything — it is to implement model routing, the practice of automatically directing each request to the most cost-appropriate model based on its complexity. Route routine classification and summarization to Claude Haiku or GPT-5.4 Mini (the fast, inexpensive models). Route complex multi-step reasoning to the premium model. Batch non-latency-sensitive workloads for processing during off-peak hours. Teams that apply this consistently report monthly cost reductions of 70–80% with no measurable change in user-perceived quality.
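The routing idea above can be sketched as a small pure function. This is one possible shape, not a reference implementation: the task labels, the `fast` and `premium` tier names, and the assumption that each request arrives already tagged with its task type are all hypothetical. A real router might classify complexity dynamically.

```typescript
type TaskKind = "classification" | "summarization" | "multi_step_reasoning";
type ModelTier = "fast" | "premium";

interface RoutableRequest {
  task: TaskKind;
  latencySensitive: boolean;
}

interface RoutingDecision {
  tier: ModelTier;
  batch: boolean; // true when the request can wait for off-peak batch processing
}

function routeRequest(request: RoutableRequest): RoutingDecision {
  // Routine work goes to the inexpensive tier; only multi-step reasoning
  // is sent to the premium model.
  const tier: ModelTier =
    request.task === "multi_step_reasoning" ? "premium" : "fast";
  // Anything that is not latency-sensitive is a candidate for batching.
  return { tier, batch: !request.latencySensitive };
}

const liveClassification = routeRequest({ task: "classification", latencySensitive: true });
const nightlySummaries = routeRequest({ task: "summarization", latencySensitive: false });
const liveReasoning = routeRequest({ task: "multi_step_reasoning", latencySensitive: true });

console.log(liveClassification); // { tier: "fast", batch: false }
console.log(nightlySummaries);   // { tier: "fast", batch: true }
console.log(liveReasoning);      // { tier: "premium", batch: false }
```

Keeping the routing rule in one function makes it easy to audit which workloads reach the premium model and to adjust the rule as usage data accumulates.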

The key insight: before evaluating self-hosting, first ask whether your hosted API cost is actually optimized. Most teams that use a single model for all tasks — regardless of complexity — have significant cost reduction available without any infrastructure investment.

Vendor Lock-in: The Silent Risk

Every "buy" decision carries vendor lock-in risk. The risk is not that the service stops working today — it is that you become dependent on a vendor and then face one of these events:

The vendor raises prices by 2–5× (common as cloud services mature and investors demand returns)
The vendor is acquired and the product is deprecated or consolidated
The vendor's API changes in ways that break your integration
Your data volume grows until the vendor's per-unit pricing structure becomes untenable

The mitigation is not avoiding managed services — it is architecting behind an abstraction layer.


The abstraction pattern works like this: instead of calling your vector database's SDK directly from your business logic, you define a small interface in your own code — something like VectorStore.search(query, topK) — and implement it once for your current vendor. If you need to migrate, you write a new implementation of the same interface. Your business logic stays untouched.

This is not abstraction for its own sake. It is the specific engineering habit that turns a vendor decision from permanent into reversible.
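A minimal sketch of that pattern follows. The `VectorStore` interface mirrors the `search(query, topK)` shape mentioned above; everything else (the `upsert` method, the `SearchHit` type, the in-memory class, and the dot-product scoring) is a hypothetical stand-in. The methods are synchronous to keep the example short; an implementation backed by a networked service would return Promises instead.

```typescript
interface SearchHit {
  id: string;
  score: number;
}

// The only vector-store surface that business logic is allowed to see.
interface VectorStore {
  upsert(id: string, embedding: number[]): void;
  search(query: number[], topK: number): SearchHit[];
}

function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Stand-in implementation. A vendor-backed class would implement the same
// interface by wrapping that vendor's SDK.
class InMemoryVectorStore implements VectorStore {
  private readonly vectors: Record<string, number[]> = {};

  upsert(id: string, embedding: number[]): void {
    this.vectors[id] = embedding;
  }

  search(query: number[], topK: number): SearchHit[] {
    return Object.keys(this.vectors)
      .map((id) => ({ id, score: dotProduct(query, this.vectors[id]) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Business logic depends on the interface, never on a concrete vendor class.
function relevantDocumentIds(store: VectorStore, queryVector: number[]): string[] {
  return store.search(queryVector, 2).map((hit) => hit.id);
}

const demoStore = new InMemoryVectorStore();
demoStore.upsert("a", [1, 0]);
demoStore.upsert("b", [0, 1]);
demoStore.upsert("c", [0.8, 0.6]);

console.log(relevantDocumentIds(demoStore, [1, 0])); // ["a", "c"]
```

Migrating vendors then means writing one new class that implements `VectorStore`; `relevantDocumentIds` and the rest of the business logic do not change.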

A concrete example: WiseBuilder's LLM layer uses exactly this pattern. The lib/llm/factory.ts module returns a LanguageModel from the Vercel AI SDK, an abstraction that works identically whether the underlying provider is OpenAI, Anthropic, or Google (Gemini). Switching models or providers requires changing a single line in the factory function, not tracking down and updating every API call across the codebase.

How AI Coding Tools Change the Calculus

AI coding tools like Claude Code and GitHub Copilot meaningfully reduce the cost and time required to build custom solutions. This directly shifts the build vs. buy crossover.

Before AI coding tools: A feature that required 3 months of engineering time cost roughly $75,000 in labor (at a $300k/yr fully-loaded engineer cost). Buying a managed service for $10,000/year was clearly the better economic choice for most teams.

With AI coding tools: The same feature might take 2–4 weeks in a favorable case. At the same fully-loaded rate, the labor cost falls to roughly $12,000–$23,000. The economic advantage of buying a managed service narrows significantly, and more things are now worth building that were previously too expensive to justify.

This shift has practical implications:

Build vs. Buy Factor	Before AI Coding Tools	With AI Coding Tools
Initial development cost	High — typically 2–6 months of engineer time	Lower — AI accelerates greenfield development by 30–50%
Maintenance cost	15–20% of initial cost per year	Largely unchanged — AI helps, but systems still need humans to run them
Customization effort	Major effort to extend beyond what a vendor supports	Reduced — AI can generate custom logic faster, making bespoke solutions more practical
Security review burden	Manual review required	Still required — AI-generated code has well-documented security failure modes; review cannot be skipped
Net effect on crossover point	—	Crossover shifts earlier — more things are now worth building that were previously too expensive

The important caveat: AI coding tools reduce the cost of writing code, but they do not reduce the cost of operating infrastructure. A self-hosted vector database is not cheaper because Claude Code wrote the deployment scripts — it still requires monitoring, incident response, capacity planning, and a team that understands it when something breaks at 2 AM. The maintenance burden and operational overhead of custom systems remain real. AI shifts the build cost; it does not eliminate the operate cost.
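A quick worked example makes the caveat visible. The sketch below assumes, as a simplification, that maintenance scales with the pre-AI development cost (consistent with the "largely unchanged" row in the table above) and that operating cost is flat per year. The type names and all figures are hypothetical.

```typescript
interface BuildScenario {
  baselineDevCost: number;       // development cost without AI assistance
  devCostReduction: number;      // fraction saved with AI tools (0.3 to 0.5 in the table above)
  annualMaintenanceRate: number; // fraction of baseline development cost, per year
  annualOperatingCost: number;   // monitoring, on-call, capacity planning
}

interface CostBreakdown {
  development: number;
  maintenance: number;
  operations: number;
  total: number;
}

function buildCostBreakdown(scenario: BuildScenario, years: number): CostBreakdown {
  // Only the development line shrinks; the other two are untouched by AI tooling.
  const development = scenario.baselineDevCost * (1 - scenario.devCostReduction);
  const maintenance = scenario.baselineDevCost * scenario.annualMaintenanceRate * years;
  const operations = scenario.annualOperatingCost * years;
  return {
    development,
    maintenance,
    operations,
    total: development + maintenance + operations,
  };
}

const withoutAi = buildCostBreakdown(
  { baselineDevCost: 75000, devCostReduction: 0, annualMaintenanceRate: 0.2, annualOperatingCost: 20000 },
  3
);
const withAi = buildCostBreakdown(
  { baselineDevCost: 75000, devCostReduction: 0.4, annualMaintenanceRate: 0.2, annualOperatingCost: 20000 },
  3
);

console.log(withoutAi.total, withAi.total);
```

With these inputs, a 40% cut in development cost lowers the three-year total from roughly 180,000 to roughly 150,000, a reduction of about 17%, because maintenance and operations dominate the total.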

A Decision Framework

Before committing to build or buy, work through these eight questions:

Question	Why It Matters
Is this a core business differentiator?	If yes, build. If no, strong presumption toward buying. This is the most important question.
What is the 3-year total cost of ownership for each path?	Include integration, maintenance, operational overhead, and opportunity cost — not just the license fee vs. zero
Do we have (or can we hire) the expertise to build and maintain this?	A technically superior custom solution that your team cannot operate is a bad solution
What is our exit strategy if the vendor raises prices 3x?	If there is no realistic migration path, the lock-in risk is higher than the subscription price suggests
Does regulatory compliance prevent sending this data to a third party?	HIPAA, GDPR data residency, classified information — compliance requirements can make managed services legally unavailable
How urgent is time-to-market?	Managed services deploy 40–60% faster. In a competitive market with a narrow window, speed can outweigh long-term cost
Can we architect behind an abstraction layer?	If yes, the buy decision becomes more reversible and less risky
Is there a hybrid option that buys the commodity layer and builds only the differentiating 10–20%?	This is often the best option — it is not always a binary choice
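The eight questions can be captured as a checklist function. The sketch below is one possible encoding, not the tutorial's decision flow: the order in which the rules fire, the field names, and the wording of the notes are all assumptions, and a real decision would weigh these inputs with more nuance than a chain of `if` statements.

```typescript
interface DecisionInputs {
  isDifferentiator: boolean;
  buildTco: number; // 3-year fully loaded cost of building
  buyTco: number;   // 3-year fully loaded cost of buying
  teamCanOperate: boolean;
  hasExitStrategy: boolean;
  complianceBlocksThirdParty: boolean;
  timeToMarketCritical: boolean;
  canAbstractVendor: boolean;
  hybridAvailable: boolean;
}

type Recommendation = "build" | "buy" | "hybrid";

interface Decision {
  recommendation: Recommendation;
  notes: string[]; // risks to address whichever path is chosen
}

function recommend(inputs: DecisionInputs): Decision {
  const notes: string[] = [];
  if (!inputs.teamCanOperate) {
    notes.push("Team cannot yet operate a custom solution; plan hiring or training before building.");
  }
  if (!inputs.hasExitStrategy && !inputs.canAbstractVendor) {
    notes.push("No exit path and no abstraction layer; lock-in risk is high if buying.");
  }

  // Assumed ordering: hard constraints first, then differentiation, then economics.
  if (inputs.complianceBlocksThirdParty) {
    return { recommendation: "build", notes };
  }
  if (inputs.isDifferentiator) {
    return { recommendation: inputs.hybridAvailable ? "hybrid" : "build", notes };
  }
  if (inputs.timeToMarketCritical) {
    return { recommendation: "buy", notes };
  }
  if (inputs.teamCanOperate && inputs.buildTco < inputs.buyTco) {
    return { recommendation: "build", notes };
  }
  return { recommendation: "buy", notes };
}

const commodityFeature: DecisionInputs = {
  isDifferentiator: false,
  buildTco: 180000,
  buyTco: 85000,
  teamCanOperate: true,
  hasExitStrategy: true,
  complianceBlocksThirdParty: false,
  timeToMarketCritical: false,
  canAbstractVendor: true,
  hybridAvailable: false,
};

console.log(recommend(commodityFeature).recommendation); // "buy"
console.log(recommend({ ...commodityFeature, isDifferentiator: true, hybridAvailable: true }).recommendation); // "hybrid"
console.log(recommend({ ...commodityFeature, complianceBlocksThirdParty: true }).recommendation); // "build"
```

The value of writing the checklist down this way is mostly that it forces each input to be stated explicitly before the decision is made.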

The Build vs. Buy Decision Flow

Walk through these questions in order. Most decisions resolve at the first or second question — only a small fraction of features require the full analysis.


Applying Build vs. Buy When Working with AI Agents

AI coding agents introduce a specific version of the build vs. buy question: when the agent generates a solution, it makes an implicit build vs. buy decision for you — and it may not make the right one.

Agents generate code based on context. They default to patterns that are easy to implement, well-represented in their training data, and technically correct. They do not default to the patterns that are economically optimal or architecturally appropriate for your specific situation.

Common AI agent defaults that warrant review:

Agents frequently generate custom implementations of functions that already have mature libraries or managed services (e.g., custom retry logic when a battle-tested library exists, custom auth middleware when NextAuth or Clerk would work)
Agents rarely calculate the operational cost of the solution they produce — they optimize for correctness, not for total cost of ownership
Agents do not account for your team's operational capacity — a solution that requires a Kubernetes cluster is technically valid but inappropriate for a two-person team, yet the agent will generate it if the requirements point in that direction

The review habit: After an AI agent generates a solution that introduces a new external service or a significant new subsystem, pause and apply the decision framework. Ask: "Is the component this agent just built something we should actually be building, or should we be using a managed service?" — and equally: "Is the managed service this agent just wired up something we should own ourselves, given our scale and compliance requirements?"

A useful prompt to add to your agent workflow:

"Before you implement this, tell me: are you proposing to build something that already has a mature open-source library or managed service? If so, name the alternatives and explain why a custom implementation is better for our use case."

This forces the agent to surface the build vs. buy decision explicitly rather than burying it in the implementation details.
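One lightweight way to support the review habit is to flag any dependency an agent has added so that it gets a deliberate build vs. buy look. The sketch below compares two dependency maps of the kind found under `dependencies` in a package.json file. The function names, package selections, and version strings are illustrative assumptions, not part of any real project.

```typescript
type DependencyMap = Record<string, string>; // package name -> version range

// Names present after the agent's change that were absent before it.
function addedDependencies(before: DependencyMap, after: DependencyMap): string[] {
  return Object.keys(after)
    .filter((pkg) => !Object.prototype.hasOwnProperty.call(before, pkg))
    .sort();
}

function reviewReminder(added: string[]): string {
  if (added.length === 0) {
    return "No new dependencies to review.";
  }
  return (
    "New dependencies introduced: " +
    added.join(", ") +
    ". Apply the build vs. buy framework to each before merging."
  );
}

// Hypothetical snapshots taken before and after an agent-generated change.
const depsBefore: DependencyMap = { "next": "^15.0.0", "next-auth": "^5.0.0" };
const depsAfter: DependencyMap = {
  "next": "^15.0.0",
  "next-auth": "^5.0.0",
  "bullmq": "^5.0.0",
  "@pinecone-database/pinecone": "^4.0.0",
};

const flagged = addedDependencies(depsBefore, depsAfter);
console.log(reviewReminder(flagged));
```

A check like this only surfaces new packages; custom subsystems that an agent writes from scratch still need a human to notice them during review.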

Summary

Principle	What It Means in Practice
Differentiator vs. commodity	Buy commodity functions (auth, email, payments); build or own what makes your product uniquely valuable
TCO, not sticker price	The comparison is always fully-loaded build cost vs. fully-loaded buy cost over 3–5 years — not license fee vs. zero
Vendor lock-in is manageable	Abstract vendor-specific code behind your own interface so the vendor is swappable; this turns a permanent decision into a reversible one
Start managed, migrate when justified	For vector DBs and LLM APIs, start with a managed service to validate the product; self-host only when the cost crossover makes it economically justified
AI tools reduce build cost, not operate cost	AI coding tools shift the crossover point by reducing development time — but operational overhead persists; factor it into the TCO calculation
Hybrid is usually the answer	Buy the commodity platform layer; build only the differentiating 10–20% on top of it. Most production AI systems are composed this way
Review agent defaults explicitly	AI agents make implicit build vs. buy decisions; after any agent-generated solution, verify that new dependencies and subsystems are the right choice for your scale and requirements

The build vs. buy decision is ultimately a resource allocation decision: where does your team's limited engineering time create the most value? Answering that question well — with a clear-eyed view of total cost, vendor risk, and the distinction between what makes your product unique and what merely keeps it running — is one of the most impactful architectural judgments you will make.
