LangSmith vs Langfuse vs Arize: which one should you actually pick for production AI agents?
All three are observability tools. None of them are governance tools. Here's the difference, and what we recommend depending on what you're trying to control.
If you're researching tools to run AI agents in production, you've probably landed on LangSmith, Langfuse, and Arize. They keep showing up in the same lists. The marketing pages all promise some version of the same thing: visibility, evaluation, monitoring.
The reality is messier. These tools share a category but solve different problems, and none of them solve the one most enterprise teams actually need: can I prove my agent did what it said it did, and stop it when it didn't? That's a governance question, not an observability one. We'll get to that.
This is an honest comparison. If you just need traces and dashboards, one of these three will do the job. If you need audit trails for a SOC 2 Type II auditor or a treasury team that wants to cap agent spend, none of them will.
The 30-second answer
| Tool | Best for | OSS? | Pricing model | Skip if |
|---|---|---|---|---|
| LangSmith | Teams already on LangChain / LangGraph | No (closed source) | Per-trace, free tier ~5k/mo | You don't use LangChain |
| Langfuse | OSS-first teams that want self-hosting | Yes (MIT) | Self-host free; cloud per-event | You need a managed-only solution |
| Arize | ML-first teams scaling beyond LLMs | Phoenix is OSS; Arize AI is commercial | Annual contracts, enterprise focus | You're a small startup needing self-serve pricing |
LangSmith
LangSmith is LangChain Inc.'s observability and evaluation platform. It's the obvious choice if you've already adopted LangChain or LangGraph — instrumentation is one decorator and traces show up automatically. The eval framework is genuinely strong: you can wire up regression tests, A/B prompt comparisons, and hit-rate metrics with relatively little code.
What it does well: Out-of-the-box LangChain integration. Trace visualization is the cleanest of the three. Eval datasets with version control. Prompt playground that handles your existing chains.
What annoys me: Closed source, so you're locked in. Pricing scales with traces, so it's easy to spend $200-500/mo on a chatty agent without realizing it. Limited value if you've intentionally avoided LangChain (a not-uncommon choice). Self-hosting is enterprise-only and gated behind sales calls.
Pricing reality: Free tier handles a small project. Plus tier kicks in around $39/user/mo, and the per-trace overage adds up faster than you expect on a multi-agent system that's making lots of LLM calls per task.
Langfuse
Langfuse is the open-source alternative. MIT-licensed, framework-agnostic, and you can run the whole stack on your own infra. That's a real differentiator if you're in a regulated environment or just don't want to ship traces to a third party.
What it does well: Self-hosting actually works (we've deployed it on a single Hetzner VPS). The TypeScript and Python SDKs are clean. Prompt management with versioning is in the OSS tier — most competitors charge for that. Active development, strong community on Discord.
What annoys me: The eval ecosystem is thinner than LangSmith's. The UI is functional but less polished. If you self-host, you're now operating Postgres + ClickHouse and dealing with backups yourself — which is a real cost.
Pricing reality: Self-hosted is free (your infra costs only). Cloud Hobby is $29/mo. Cloud Pro starts at $199/mo. Enterprise is custom — typical for a B2B OSS-with-cloud play.
Arize
Arize comes from the ML monitoring world, not the LLM world. They were monitoring traditional ML models for years before pivoting to add LLM observability. That heritage shows: the platform is built for teams that need drift detection, embedding analysis, and the kind of statistical rigor that comes from production ML, not just trace replay.
What it does well: Best-in-class drift and performance analytics if your agents touch ML models beyond just LLM calls. Phoenix (their OSS tracer) is great if you want a quick local-only setup for development. Strong RAG evaluation tooling specifically.
What annoys me: Heavy. Pricing is annual-contract enterprise — there's no quick "give us a card and start" path for the commercial product. The complexity is real if you're just running a couple of agent workflows; you'll feel like you're using the wrong tool. Documentation assumes ML maturity that LLM-only teams may not have.
Pricing reality: Phoenix OSS is free. Arize AI is annual contracts, typically $20k+ to start, scaling significantly with usage. Made for ML platform teams at companies with budget, not solo founders.
The honest tradeoff
Here's the part most comparison posts skip: none of these is a governance tool. They tell you what your agent did. They don't tell you whether your agent was allowed to do it, they don't enforce spending limits, and they don't produce the kind of tamper-evident audit log that a SOC 2 auditor or a treasury controller will accept.
That's fine if your agents are doing things like answering customer support tickets or summarizing documents. The blast radius is small. Observability is enough.
It stops being fine when agents make purchases, sign contracts, modify customer records, or operate inside regulated industries. At that point, the question stops being "what did the agent do?" and becomes "what was the agent permitted to do, and how do I prove it didn't exceed those permissions?"
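To make "how do I prove it" concrete: a tamper-evident audit log is typically a hash chain, where each entry commits to the hash of the entry before it, so editing or reordering any past record breaks verification. Here's a minimal illustrative sketch in plain Python (a toy, not any vendor's implementation; the function and field names are ours):

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry

def append_entry(log, event):
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"event": event, "prev_hash": prev_hash}
    # Canonical JSON (sorted keys) so the hash is deterministic.
    entry_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    log.append({**body, "hash": entry_hash})
    return log

def verify(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {"event": entry["event"], "prev_hash": prev_hash}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"agent": "billing-bot", "action": "refund", "amount": 40})
append_entry(log, {"agent": "billing-bot", "action": "close_ticket"})
assert verify(log)                     # intact chain verifies
log[0]["event"]["amount"] = 4000       # retroactive tampering...
assert not verify(log)                 # ...is detected
```

An ordinary trace store gives you none of this: whoever can write traces can also rewrite them, which is exactly what an auditor will flag.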
What we'd actually recommend
- Already on LangChain/LangGraph and just need visibility: LangSmith. The friction is lowest, the integration is real.
- Want to self-host or you're framework-agnostic: Langfuse. Open source, your data stays on your infra, and the cloud tier is reasonably priced.
- You have an ML platform team and traditional model monitoring needs alongside LLMs: Arize. Otherwise overkill.
- Need to enforce policies, control spend, or produce audit trails for compliance: Pair any of the three with a governance layer. Observability alone won't pass a SOC 2 review.
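To make "enforce spending limits" concrete: enforcement means the check happens before the action runs, not in a dashboard afterward. A toy sketch of a per-agent budget gate (all names hypothetical, not any vendor's API):

```python
class BudgetExceeded(Exception):
    """Raised when an action would push an agent past its cap."""
    pass

class SpendGate:
    """Toy per-agent spend cap, checked *before* the action executes."""
    def __init__(self, caps):
        self.caps = dict(caps)   # agent name -> cap in dollars
        self.spent = {}          # agent name -> running total

    def authorize(self, agent, amount):
        total = self.spent.get(agent, 0) + amount
        if total > self.caps.get(agent, 0):
            raise BudgetExceeded(f"{agent} would exceed its cap")
        self.spent[agent] = total   # commit only if under the cap
        return total

gate = SpendGate({"procurement-bot": 500})
gate.authorize("procurement-bot", 300)       # allowed
try:
    gate.authorize("procurement-bot", 300)   # would total $600 > $500
except BudgetExceeded:
    pass  # the purchase is blocked before money moves
```

The design point is the failure mode: observability tells you the agent spent $600 yesterday; a gate like this means it never could. Both are useful, but only one satisfies a treasury controller.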
If governance is the gap, take a look at Ledgerline.
Ledgerline is the accountability layer that sits underneath whichever observability tool you pick: hierarchical agent identity, per-agent spending caps with virtual cards, policy enforcement, and a hash-chained audit log designed for SOC 2 and internal compliance teams. Built specifically for the production-readiness gap that LangSmith, Langfuse, and Arize don't fill.
See how it works →

Bottom line
If your agents are read-only or low-stakes, pick the observability tool that fits your stack and move on. If your agents touch money, contracts, or regulated data, observability is necessary but not sufficient — you'll need a governance layer alongside it. We built Ledgerline because we kept hitting that gap ourselves.