Lovelaice framing

Vibe check

Also known as: Vibe checking, Vibe-based evaluation

Definition

Eyeballing a handful of AI outputs and concluding the model 'seems fine.' The dominant industry practice for AI quality assessment — and the practice that systematic evaluation replaces.

Vibe checking is not a term Lovelaice coined, but Lovelaice owns the framing of why it's the central methodological problem in shipping AI. A vibe check is the unstructured, intuition-based review of a small number of AI outputs that ends in a verdict like 'looks good.' It is what teams reach for when nobody has yet documented what 'good' actually means for the product, the user, and the domain. It is also the implicit methodology behind a generic LLM-as-judge prompted with 'is this helpful?' — automation that codifies a vibe, not a standard.

Origin

The framing behind the term

The term predates Lovelaice; the framing — that vibe checks are the diagnostic of an evaluation system that hasn't earned the right to automate yet — is Lovelaice's.

Why it matters

  • It doesn't scale. Reviewing 5 outputs by gut feel can't tell you what happens at 5,000.
  • It's inconsistent across team members. Two PMs reviewing the same response disagree more often than they agree.
  • It misses edge cases entirely. Vibe checks happen on happy paths; failures live in the long tail.
  • It produces false confidence. A generic LLM-as-judge running on vibe criteria achieves only 60-70% agreement with human evaluators. On a balanced pass/fail task, chance alone yields about 50% agreement, so that is barely better than random for anything beyond surface formatting.
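
To make the agreement figure concrete, here is a minimal sketch of how raw agreement between a judge and human reviewers can be compared against chance. Cohen's kappa is not mentioned in this entry; it is brought in here as one common chance-corrected measure. The function name and the ten example labels are hypothetical, made up purely for illustration.

```python
from collections import Counter


def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)

    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum((judge_counts[l] / n) * (human_counts[l] / n) for l in labels)

    # Kappa is undefined when chance agreement is already 100%.
    if expected == 1.0:
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels: 10 outputs, 7 of which the judge and human agree on.
human = ["pass"] * 5 + ["fail"] * 5
judge = ["pass", "pass", "pass", "pass", "fail",
         "pass", "pass", "fail", "fail", "fail"]

observed, kappa = agreement_and_kappa(judge, human)
print(round(observed, 2), round(kappa, 2))  # prints: 0.7 0.4
```

In this made-up sample, 70% raw agreement corresponds to a kappa of only 0.4 once the 50% expected by chance is removed, which is one way to see why a headline agreement number in the 60-70% range can overstate how much a judge actually tracks human verdicts.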

Industry prevalence

In roughly 90% of teams shipping AI features, quality evaluation is manual and vibe-based — no structured criteria, no systematic testing, no grounding in actually observed failures.

What replaces it

Structured evaluation: deterministic checks for measurable criteria (format, length, allowed values, required fields) and a surgical, validated LLM-as-judge for the subjective dimensions that genuinely require human-like judgment. See the Evaluation Ladder.
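
As a rough illustration of the deterministic half of structured evaluation, the sketch below checks one AI output for format, required fields, length, and allowed values. It assumes a hypothetical product whose model returns a JSON object with a "summary" and a "sentiment" field; the field names, the 280-character limit, the allowed sentiment values, and the function name are all assumptions for the example, not something this entry specifies.

```python
import json

# Hypothetical criteria for an imagined feature; adjust to the real product.
REQUIRED_FIELDS = ("summary", "sentiment")
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
MAX_SUMMARY_CHARS = 280


def deterministic_checks(raw_output: str) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    # Format: the output must parse as a JSON object.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["format: output is not valid JSON"]
    if not isinstance(data, dict):
        return ["format: top-level value is not a JSON object"]

    failures = []

    # Required fields: each one must be present.
    for field in REQUIRED_FIELDS:
        if field not in data:
            failures.append(f"required field: '{field}' is missing")

    # Length: the summary must be a string within the character budget.
    if "summary" in data:
        summary = data["summary"]
        if not isinstance(summary, str):
            failures.append("format: 'summary' is not a string")
        elif len(summary) > MAX_SUMMARY_CHARS:
            failures.append(f"length: 'summary' exceeds {MAX_SUMMARY_CHARS} characters")

    # Allowed values: sentiment must be one of the permitted labels.
    if "sentiment" in data:
        sentiment = data["sentiment"]
        if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
            failures.append(f"allowed values: 'sentiment' is {sentiment!r}")

    return failures


print(deterministic_checks('{"summary": "Works well.", "sentiment": "positive"}'))
print(deterministic_checks('{"summary": "Works well.", "sentiment": "great"}'))
```

Checks like these give the same verdict every time they run on the same output, so they can cover thousands of outputs cheaply. Only the dimensions they cannot express, such as tone or factual fit, would then be left for a narrowly scoped, validated LLM-as-judge.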