Build vs. Buy: Choosing What to Build and What to Use
Every system you design will reach a moment where you must answer a deceptively simple question: Should we build this ourselves, or use something that already exists?
This is not just a technical question. It is a question about where your team's limited engineering time creates the most value. Getting it wrong in either direction is costly: building something you could have bought wastes months of engineering effort on commodity infrastructure. Buying something you should have built creates vendor dependency on the very capability that makes your product unique.
The goal of this tutorial is to give you a mental model and a practical framework for making this decision well — and to apply it to the AI systems you are likely building today.
The Central Question: Differentiator or Commodity?
Before any cost analysis or feature comparison, ask a single diagnostic question:
"Is this capability a core business differentiator, or a supporting utility?"
A differentiator is something that directly makes your product better than the alternatives for your specific users. A commodity is a well-understood problem with many good existing solutions — something that must exist in your system, but where your specific implementation provides no competitive advantage.
The practical default: buy for commodity functions, build for differentiators, and look for a hybrid option when a managed service covers 80% of your need at a fraction of the cost. Most modern systems are not purely "built" or "bought" — they are assembled from managed services, open-source libraries, and a thin layer of custom logic that makes them specific to their use case.
Some clear examples to calibrate your intuition:
| Function | Usually Commodity? | Usually Build or Buy? |
|---|---|---|
| User authentication | Yes — security is complex, standards exist | Buy (Auth0, Clerk, NextAuth + GitHub OAuth) |
| Email delivery | Yes — deliverability is infrastructure work | Buy (SendGrid, Resend, AWS SES) |
| Payment processing | Yes — compliance complexity is extreme | Buy (Stripe, Braintree) |
| Product recommendation algorithm | Depends — if ML ranking is your differentiator, build | Build or hybrid (custom model on managed infra) |
| Core AI reasoning/domain logic | No — this is likely your product's core value | Build (fine-tune or engineer the prompt chain) |
| Vector search infrastructure | Mostly yes — but scale and cost affect the answer | Start with managed (Pinecone, Qdrant Cloud); self-host at scale |
| LLM API calls | Usually yes — hosting LLMs is specialized infrastructure | Buy until cost makes self-hosting justified (see below) |
| Fraud detection model | No — your data distribution is unique to your business | Build on managed ML infrastructure |
What "Buy" Actually Costs: Total Cost of Ownership
The most common mistake developers make when evaluating a managed service is comparing only the subscription price against zero — treating "build it ourselves" as free. The real comparison is Total Cost of Ownership (TCO) over a meaningful time horizon, typically three to five years.
TCO for the "buy" path includes more than the license fee:
- License/subscription fees — the number on the pricing page
- Integration engineering — the time required to connect the service to your system; often 1–4 weeks of engineer time for a non-trivial integration
- Ongoing customization — the work required when your needs outgrow what the service supports
- Migration cost — the engineering effort required to switch vendors if the relationship ends or prices rise
- Training — time for your team to become productive with the service's API and conventions
TCO for the "build" path includes:
- Initial development — writing, testing, and shipping the solution
- Ongoing maintenance — bug fixes, dependency upgrades, security patches; typically 15–20% of initial development cost per year
- Operational overhead — monitoring, incident response, capacity planning
- Opportunity cost — the features and improvements you did not ship while building this instead
The TCO Crossover: Why 'Free' Can Be Expensive
Build costs are front-loaded: a high initial investment, then stable or declining costs as the system matures. Buy costs are back-loaded: low upfront, but they accumulate month after month. The crossover, where the cumulative cost of buying catches up with the cumulative cost of building, typically falls around 18 to 36 months for most mid-size software teams. It shifts earlier if the service's pricing climbs steeply with usage or the feature is cheap to build, and later if the feature is complex to build.
A useful rule of thumb: managed services have a crossover point. Below it, the managed service costs less than the engineering effort to build and operate an equivalent. Above it, self-hosting or building wins economically. Knowing roughly where your current usage falls relative to that crossover is the most practical input to the build vs. buy decision.
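The crossover can be estimated with a few lines of arithmetic. The sketch below is a rough model, not a pricing tool: the cost categories follow the TCO lists above, but the type names, the simple compounding-growth assumption for the subscription, and every dollar figure in the usage example are illustrative assumptions.

```typescript
// Rough TCO crossover model. All figures used with it are assumptions.
interface BuildCosts {
  initialDevelopment: number;    // one-time cost to write, test, and ship
  annualMaintenanceRate: number; // fraction of initial cost per year, e.g. 0.15 to 0.20
  monthlyOperations: number;     // monitoring, incident response, capacity planning
}

interface BuyCosts {
  integration: number;         // one-time integration engineering
  monthlySubscription: number; // fee in the first month
  monthlyGrowthRate: number;   // assumed month-over-month growth of the bill
}

// Cumulative cost of the build path after `months` months.
function cumulativeBuildCost(c: BuildCosts, months: number): number {
  const monthlyMaintenance = (c.initialDevelopment * c.annualMaintenanceRate) / 12;
  return c.initialDevelopment + months * (monthlyMaintenance + c.monthlyOperations);
}

// Cumulative cost of the buy path after `months` months.
function cumulativeBuyCost(c: BuyCosts, months: number): number {
  let total = c.integration;
  let fee = c.monthlySubscription;
  for (let m = 0; m < months; m++) {
    total += fee;
    fee *= 1 + c.monthlyGrowthRate; // the bill grows as usage grows
  }
  return total;
}

// First month in which buying has cost at least as much as building,
// or null if that never happens within the horizon.
function crossoverMonth(build: BuildCosts, buy: BuyCosts, horizonMonths = 60): number | null {
  for (let m = 1; m <= horizonMonths; m++) {
    if (cumulativeBuyCost(buy, m) >= cumulativeBuildCost(build, m)) return m;
  }
  return null;
}

// Usage with made-up numbers: replace them with your own estimates.
const exampleBuild: BuildCosts = { initialDevelopment: 75000, annualMaintenanceRate: 0.2, monthlyOperations: 500 };
const exampleBuy: BuyCosts = { integration: 10000, monthlySubscription: 3000, monthlyGrowthRate: 0.02 };
console.log("Crossover month:", crossoverMonth(exampleBuild, exampleBuy));
```

Opportunity cost and migration cost are left out of this sketch because they are hard to express as a single number. If you can estimate them, add them to the relevant path before comparing.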
Three Real-World Examples
Abstract frameworks are easy to agree with and hard to apply in practice. These three concrete examples illustrate how the decision plays out in systems you are likely to encounter.
Authentication: The Classic "Buy" Case
Authentication is the textbook example of a function you should almost always buy. The apparent simplicity is deceptive: a correct, production-grade authentication system must handle password hashing, session management, OAuth flows, CSRF protection, rate limiting on login endpoints, refresh token rotation, and account recovery — all correctly, and all in ways that attackers actively probe for weaknesses.
Most teams underestimate this complexity and discover the gaps only after a security incident.
| Option | Type | Approximate Monthly Cost | Best For |
|---|---|---|---|
| NextAuth.js v5 | Self-managed open source | $0 (infrastructure costs only) | Full control, no per-user pricing, comfortable with owning OAuth complexity |
| Clerk | Fully managed SaaS | $25–$1,025/mo (by MAU) | Best developer experience for Next.js, fastest setup (minutes not hours) |
| Auth0 | Fully managed SaaS | $0–$5,000+/mo (by MAU) | Compliance certifications (SOC 2, HIPAA, FedRAMP), enterprise SSO needs |
| AWS Cognito | Managed, AWS-native | ~$0.0055/MAU ($550/100k users) | Already deeply in AWS; large user base at predictable cost |
| Supabase Auth | Managed, includes database | $0–$599/mo | Already using Supabase for your database |
The hybrid insight: WiseBuilder uses NextAuth v5 with GitHub OAuth — a well-reasoned hybrid decision. NextAuth is open-source and free, eliminating per-user pricing at scale. GitHub OAuth offloads credential security to GitHub's battle-tested infrastructure. The team owns the session storage (Prisma/PostgreSQL) and the integration code, but the hard parts — OAuth security, token management — are handled by well-maintained open-source libraries. This is the hybrid model in practice: use the commodity security layer that already exists; build only the integration logic that is specific to your application.
The rule: Unless authentication is somehow your product's differentiator (it almost never is), buy it. The cost of getting it wrong — a data breach, a compliance failure, a session fixation vulnerability — dramatically exceeds the cost of a managed service.
Vector Databases: The AI-Specific Build vs. Buy Decision
Vector databases are one of the most consequential build vs. buy decisions for teams building AI-powered systems. They are the backbone of Retrieval-Augmented Generation (RAG) — a technique where your system retrieves relevant information from a knowledge base and passes it to the LLM as context, rather than relying solely on the model's training data. Vector databases store embeddings (dense numerical representations of text that capture semantic meaning) and let you search them by similarity, so your AI can find the most relevant context for any query.
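To make "search by similarity" concrete, here is a minimal in-memory sketch of what a vector database does at its core: score every stored embedding against the query embedding and return the closest matches. The `EmbeddedChunk` type, the function names, and the three-dimensional toy vectors are assumptions for illustration. Real embeddings have hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than a full scan.

```typescript
// A stored piece of text together with its embedding vector.
interface EmbeddedChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K search: score every chunk, sort descending, keep the first k.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Usage with toy vectors (in practice these come from an embedding model).
const knowledgeBase: EmbeddedChunk[] = [
  { id: "refunds", text: "Refunds are processed within 5 days.", embedding: [0.9, 0.1, 0.0] },
  { id: "shipping", text: "Orders ship from two warehouses.", embedding: [0.1, 0.9, 0.0] },
  { id: "returns", text: "Items can be returned within 30 days.", embedding: [0.8, 0.2, 0.1] },
];
const contextForLlm = topK([1, 0, 0], knowledgeBase, 2).map((c) => c.text);
console.log(contextForLlm);
```

The retrieved texts are what a RAG pipeline would pass to the LLM as context. A full scan like this is fine for a prototype with a few thousand chunks, which is part of why the database choice matters less than chunking and embedding quality early on.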
The options range from fully managed cloud services to self-hosted databases to using an extension on your existing PostgreSQL instance.
Vector Database Options: From Managed to Self-Hosted
Vector database choices form a spectrum between operational simplicity (managed cloud) and cost/control (self-hosted). The right choice depends on your query volume, team's operational capacity, and whether you are still validating the product or running a proven production system.
The practical starting point: If you are building your first RAG system and are already running PostgreSQL, start with pgvector. It handles the early stages of any RAG product without adding new infrastructure or vendor relationships. This lets you focus on what actually determines retrieval quality at this stage: your chunking strategy (how you split documents) and your embedding model — both of which matter far more than the choice of vector database at early scale. Migrate to a purpose-built vector database only when performance profiling confirms that pgvector is the actual bottleneck.
LLM APIs: When Self-Hosting Becomes Worth It
LLM APIs are the most extreme example of the build vs. buy spectrum in AI systems. Using a hosted API (OpenAI, Anthropic, Google) is the fastest way to get started, but the cost structure is pay-per-token: you are charged for every token (roughly a word fragment) sent to and received from the model. That creates a compounding bill that can grow faster than your revenue.
| Scenario | Recommended Approach | Why |
|---|---|---|
| Under 100k daily requests, under $50k/yr projected | Hosted API (Claude Haiku, GPT-5.4 Mini) | Operational simplicity far outweighs cost premium at this scale |
| 100k–2M daily requests | Hybrid: route simple requests to cheaper models, complex to premium | GPT-5.4 Mini at $0.75/1M input tokens is 3× cheaper than GPT-5.4; most requests don't need the expensive model |
| Strict data privacy (HIPAA, classified, financial PII) | Self-hosted open-weight model | Data cannot leave your infrastructure; regulatory requirement overrides cost optimization |
| Over 2M daily requests with stable request patterns | Self-hosted open-weight model (Llama 3, Mistral) | Cost per token is 100–400× lower at full GPU utilization; cost justifies GPU infrastructure investment |
The compounding cost problem is real. A pattern observed across many teams: LLM API spend grows from $15,000 to $60,000 per month in three months as the product scales, without any obvious trigger. The fix is rarely to self-host everything — it is to implement model routing, the practice of automatically directing each request to the most cost-appropriate model based on its complexity. Route routine classification and summarization to Claude Haiku or GPT-5.4 Mini (the fast, inexpensive models). Route complex multi-step reasoning to the premium model. Batch non-latency-sensitive workloads for processing during off-peak hours. Teams that apply this consistently report monthly cost reductions of 70–80% with no measurable change in user-perceived quality.
The key insight: before evaluating self-hosting, first ask whether your hosted API cost is actually optimized. Most teams that use a single model for all tasks — regardless of complexity — have significant cost reduction available without any infrastructure investment.
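A minimal sketch of model routing might look like the following. The task categories mirror the ones named above, but the `LlmRequest` shape, the tier names, and the rule that task type alone decides the tier are simplifying assumptions. Production routers often use prompt length, a small classifier, or confidence scores instead.

```typescript
// Which class of model should handle the request.
type ModelTier = "fast" | "premium";

// Hypothetical request shape; adapt it to whatever your application already tracks.
interface LlmRequest {
  task: "classification" | "summarization" | "multi_step_reasoning";
  prompt: string;
  latencySensitive: boolean;
}

interface RoutingDecision {
  tier: ModelTier;
  batch: boolean; // true = queue for off-peak batch processing
}

// Routine tasks that the inexpensive model handles well (an assumption to validate
// against your own quality measurements).
const ROUTINE_TASKS: LlmRequest["task"][] = ["classification", "summarization"];

function routeRequest(req: LlmRequest): RoutingDecision {
  const isRoutine = ROUTINE_TASKS.indexOf(req.task) !== -1;
  return {
    tier: isRoutine ? "fast" : "premium",
    batch: !req.latencySensitive, // nobody is waiting, so defer to a batch job
  };
}

// Usage: a nightly summarization job goes to the fast tier and is batched.
const decision = routeRequest({
  task: "summarization",
  prompt: "Summarize today's support tickets.",
  latencySensitive: false,
});
console.log(decision);
```

Keeping the routing rule in one function makes it easy to log decisions and compare quality per tier before tightening or loosening the rule.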
Vendor Lock-in: The Silent Risk
Every "buy" decision carries vendor lock-in risk. The risk is not that the service stops working today — it is that you become dependent on a vendor and then face one of these events:
- The vendor raises prices by 2–5× (common as cloud services mature and investors demand returns)
- The vendor is acquired and the product is deprecated or consolidated
- The vendor's API changes in ways that break your integration
- Your data volume grows until the vendor's per-unit pricing structure becomes untenable
The mitigation is not avoiding managed services — it is architecting behind an abstraction layer.
The abstraction pattern works like this: instead of calling your vector database's SDK directly from your business logic, you define a small interface in your own code — something like VectorStore.search(query, topK) — and implement it once for your current vendor. If you need to migrate, you write a new implementation of the same interface. Your business logic stays untouched.
This is not abstraction for its own sake. It is the specific engineering habit that turns a vendor decision from permanent into reversible.
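The abstraction pattern described above can be sketched as a small interface plus one implementation per vendor. Everything here is hypothetical: the `VectorStore` method signatures, the `InMemoryVectorStore` stand-in (a real implementation would wrap a vendor SDK or pgvector queries), and the assumption that the query has already been converted to an embedding upstream.

```typescript
interface SearchHit {
  id: string;
  score: number;
}

// The only vector-store surface your business logic is allowed to see.
interface VectorStore {
  upsert(id: string, embedding: number[]): Promise<void>;
  search(queryEmbedding: number[], topK: number): Promise<SearchHit[]>;
}

// Stand-in implementation. A vendor-backed class would implement the same
// interface and translate these calls into that vendor's SDK calls.
class InMemoryVectorStore implements VectorStore {
  private readonly rows: { id: string; embedding: number[] }[] = [];

  async upsert(id: string, embedding: number[]): Promise<void> {
    const existing = this.rows.find((r) => r.id === id);
    if (existing) existing.embedding = embedding;
    else this.rows.push({ id, embedding });
  }

  async search(queryEmbedding: number[], topK: number): Promise<SearchHit[]> {
    return this.rows
      .map((r) => ({
        id: r.id,
        // Dot product as a simple similarity score.
        score: r.embedding.reduce((sum, v, i) => sum + v * (queryEmbedding[i] ?? 0), 0),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Business logic depends on the interface, never on a concrete vendor class.
async function findRelevantIds(store: VectorStore, queryEmbedding: number[]): Promise<string[]> {
  const hits = await store.search(queryEmbedding, 3);
  return hits.map((h) => h.id);
}
```

Migrating vendors then means writing one new class that implements `VectorStore` and changing the place where the store is constructed. `findRelevantIds` and everything built on it stay as they are.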
A concrete example: WiseBuilder's LLM layer uses exactly this pattern. The lib/llm/factory.ts module returns a LanguageModel from the Vercel AI SDK, an abstraction that works identically whether the underlying provider is OpenAI, Anthropic, or Google (Gemini). Switching models or providers requires changing a single line in the factory function, not tracking down and updating every API call across the codebase.
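The shape of such a factory can be illustrated without any SDK. The sketch below is not WiseBuilder's code and does not use the Vercel AI SDK's actual API: `ChatModel`, `makeStubModel`, and the model ID are hypothetical stand-ins, and the stub returns a canned string instead of calling a provider.

```typescript
// Hypothetical provider-agnostic model type (a stand-in for an SDK's model abstraction).
interface ChatModel {
  readonly provider: string;
  readonly modelId: string;
  complete(prompt: string): Promise<string>;
}

type ProviderName = "openai" | "anthropic" | "google";

// Stub constructor. In a real project each provider branch would build an SDK client.
function makeStubModel(provider: ProviderName, modelId: string): ChatModel {
  return {
    provider,
    modelId,
    async complete(prompt: string): Promise<string> {
      return `[${provider}/${modelId}] response to: ${prompt}`;
    },
  };
}

// The single place where the provider choice lives. Switching providers
// means editing this one constant (placeholder model ID, not a real one).
const ACTIVE_MODEL: { provider: ProviderName; modelId: string } = {
  provider: "anthropic",
  modelId: "example-model-id",
};

function getModel(): ChatModel {
  return makeStubModel(ACTIVE_MODEL.provider, ACTIVE_MODEL.modelId);
}

// Application code asks the factory for a model and never names a provider.
async function summarize(text: string): Promise<string> {
  return getModel().complete(`Summarize: ${text}`);
}
```

Because `summarize` only knows about `ChatModel`, nothing outside the factory changes when the provider does.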
How AI Coding Tools Change the Calculus
AI coding tools like Claude Code and GitHub Copilot meaningfully reduce the cost and time required to build custom solutions. This directly shifts the build vs. buy crossover.
Before AI coding tools: A feature that required 3 months of engineering time cost roughly $75,000 in labor (at a $300k/yr fully-loaded engineer cost). Buying a managed service for $10,000/year was clearly the better economic choice for most teams.
With AI coding tools: The same feature might take 2–4 weeks. At the same $300k/yr rate, the labor cost falls to roughly $12,000–$23,000. The economic advantage of buying a managed service narrows significantly, and more things are now worth building that were previously too expensive to justify.
This shift has practical implications:
| Build vs. Buy Factor | Before AI Coding Tools | With AI Coding Tools |
|---|---|---|
| Initial development cost | High — typically 2–6 months of engineer time | Lower — AI accelerates greenfield development by 30–50% |
| Maintenance cost | 15–20% of initial cost per year | Largely unchanged — AI helps, but systems still need humans to run them |
| Customization effort | Major effort to extend beyond what a vendor supports | Reduced — AI can generate custom logic faster, making bespoke solutions more practical |
| Security review burden | Manual review required | Still required — AI-generated code has well-documented security failure modes; review cannot be skipped |
| Net effect on crossover point | — | Crossover shifts earlier — more things are now worth building that were previously too expensive |
The important caveat: AI coding tools reduce the cost of writing code, but they do not reduce the cost of operating infrastructure. A self-hosted vector database is not cheaper because Claude Code wrote the deployment scripts — it still requires monitoring, incident response, capacity planning, and a team that understands it when something breaks at 2 AM. The maintenance burden and operational overhead of custom systems remain real. AI shifts the build cost; it does not eliminate the operate cost.
A Decision Framework
Before committing to build or buy, work through these eight questions:
| Question | Why It Matters |
|---|---|
| Is this a core business differentiator? | If yes, build. If no, strong presumption toward buying. This is the most important question. |
| What is the 3-year total cost of ownership for each path? | Include integration, maintenance, operational overhead, and opportunity cost — not just the license fee vs. zero |
| Do we have (or can we hire) the expertise to build and maintain this? | A technically superior custom solution that your team cannot operate is a bad solution |
| What is our exit strategy if the vendor raises prices 3x? | If there is no realistic migration path, the lock-in risk is higher than the subscription price suggests |
| Does regulatory compliance prevent sending this data to a third party? | HIPAA, GDPR data residency, classified information — compliance requirements can make managed services legally unavailable |
| How urgent is time-to-market? | Managed services deploy 40–60% faster. In a competitive market with a narrow window, speed can outweigh long-term cost |
| Can we architect behind an abstraction layer? | If yes, the buy decision becomes more reversible and less risky |
| Is there a hybrid option that buys the commodity layer and builds only the differentiating 10–20%? | This is often the best option — it is not always a binary choice |
The Build vs. Buy Decision Flow
Walk through these questions in order. Most decisions resolve at the first or second question — only a small fraction of features require the full analysis.
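One way to make the flow explicit is to encode it as a function. The sketch below covers only a subset of the eight questions, and the order of the checks after the first one, the field names, and the tie-breaking are assumptions rather than a reproduction of the decision flow itself.

```typescript
// Answers to a subset of the framework's questions (field names are illustrative).
interface DecisionAnswers {
  isCoreDifferentiator: boolean;
  complianceForbidsThirdParty: boolean;
  teamCanOperate: boolean;   // do we have, or can we hire, the expertise?
  hybridCoversNeed: boolean; // can we buy the commodity layer and build the rest?
  threeYearBuildTco: number;
  threeYearBuyTco: number;
}

type Recommendation = "build" | "buy" | "hybrid";

interface Decision {
  recommendation: Recommendation;
  reason: string;
}

function decide(a: DecisionAnswers): Decision {
  // Question 1 resolves most cases: differentiators are built.
  if (a.isCoreDifferentiator) {
    return { recommendation: "build", reason: "core business differentiator" };
  }
  // Compliance can make a managed service legally unavailable.
  if (a.complianceForbidsThirdParty) {
    return { recommendation: "build", reason: "data cannot go to a third party" };
  }
  // A custom solution the team cannot operate is a bad solution.
  if (!a.teamCanOperate) {
    return { recommendation: "buy", reason: "team lacks capacity to operate it" };
  }
  if (a.hybridCoversNeed) {
    return { recommendation: "hybrid", reason: "buy the commodity layer, build the rest" };
  }
  // Otherwise fall back to fully-loaded cost over three years.
  return a.threeYearBuildTco < a.threeYearBuyTco
    ? { recommendation: "build", reason: "lower 3-year TCO" }
    : { recommendation: "buy", reason: "lower or equal 3-year TCO" };
}

// Usage: a commodity feature, no compliance constraint, small team.
console.log(
  decide({
    isCoreDifferentiator: false,
    complianceForbidsThirdParty: false,
    teamCanOperate: false,
    hybridCoversNeed: true,
    threeYearBuildTco: 200000,
    threeYearBuyTco: 120000,
  }),
);
```

Questions such as exit strategy, time-to-market, and whether an abstraction layer is feasible are judgment calls that resist a boolean. Treat the function's output as a starting point for discussion, not a verdict.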
Applying Build vs. Buy When Working with AI Agents
AI coding agents introduce a specific version of the build vs. buy question: when the agent generates a solution, it makes an implicit build vs. buy decision for you — and it may not make the right one.
Agents generate code based on context. They default to patterns that are easy to implement, well-represented in their training data, and technically correct. They do not default to the patterns that are economically optimal or architecturally appropriate for your specific situation.
Common AI agent defaults that warrant review:
- Agents frequently generate custom implementations of functions that already have mature libraries or managed services (e.g., custom retry logic when a battle-tested library exists, custom auth middleware when NextAuth or Clerk would work)
- Agents rarely calculate the operational cost of the solution they produce — they optimize for correctness, not for total cost of ownership
- Agents do not account for your team's operational capacity — a solution that requires a Kubernetes cluster is technically valid but inappropriate for a two-person team, yet the agent will generate it if the requirements point in that direction
The review habit: After an AI agent generates a solution that introduces a new external service or a significant new subsystem, pause and apply the decision framework. Ask: "Is the component this agent just built something we should actually be building, or should we be using a managed service?" — and equally: "Is the managed service this agent just wired up something we should own ourselves, given our scale and compliance requirements?"
A useful prompt to add to your agent workflow:
"Before you implement this, tell me: are you proposing to build something that already has a mature open-source library or managed service? If so, name the alternatives and explain why a custom implementation is better for our use case."
This forces the agent to surface the build vs. buy decision explicitly rather than burying it in the implementation details.
Summary
| Principle | What It Means in Practice |
|---|---|
| Differentiator vs. commodity | Buy commodity functions (auth, email, payments); build or own what makes your product uniquely valuable |
| TCO, not sticker price | The comparison is always fully-loaded build cost vs. fully-loaded buy cost over 3–5 years — not license fee vs. zero |
| Vendor lock-in is manageable | Abstract vendor-specific code behind your own interface so the vendor is swappable; this turns a permanent decision into a reversible one |
| Start managed, migrate when justified | For vector DBs and LLM APIs, start with a managed service to validate the product; self-host only when the cost crossover makes it economically justified |
| AI tools reduce build cost, not operate cost | AI coding tools shift the crossover point by reducing development time — but operational overhead persists; factor it into the TCO calculation |
| Hybrid is usually the answer | Buy the commodity platform layer; build only the differentiating 10–20% on top of it. Most production AI systems are composed this way |
| Review agent defaults explicitly | AI agents make implicit build vs. buy decisions; after any agent-generated solution, verify that new dependencies and subsystems are the right choice for your scale and requirements |
The build vs. buy decision is ultimately a resource allocation decision: where does your team's limited engineering time create the most value? Answering that question well — with a clear-eyed view of total cost, vendor risk, and the distinction between what makes your product unique and what merely keeps it running — is one of the most impactful architectural judgments you will make.