Building AI Products: A Masterclass in Experimentation and Personalization

Written by Madalina Turlea

14 Nov 2025

How we tested infinite personalization for Airbnb property descriptions—and what we learned about AI product development

Live Experiment: Personalizing Airbnb Property Descriptions

Let's walk through a real experiment demonstrating these principles in action.

The Hypothesis

Data shows conversion rates are higher when property descriptions highlight features most relevant to each traveler:

- Families with kids: Safety features, playgrounds, pools
- Business travelers: High-speed WiFi, proximity to public transport, nearby cafes

Instead of showing everyone the same generic description, what if we personalized it based on the traveler's profile?

Setting Up the Experiment

Role: Product manager at Airbnb

Goal: Test if AI can personalize property descriptions effectively

Success criteria:

- Does it work?
- Which model performs best?
- What are the costs and runtime performance?
- What accuracy can we expect?

The System Prompts (Two Versions)

Version 1 - Basic Prompt (where most teams start):

You are an Airbnb listing copywriter. Generate a personalized property
description based on traveler information.

Version 2 - Comprehensive Prompt (with domain expertise):

You are creating personalized Airbnb property descriptions.

Context: This description appears on the property listing page and should
help travelers quickly understand why this property fits their needs.

Requirements:
- Keep under 500 characters
- Be truthful - only highlight features the property actually has
- Personalize based on traveler profile (age, travel purpose, group size)
- Maintain Airbnb's friendly, welcoming tone
- Follow content guidelines: [specific guidelines]
- No titles or subtitles in output - just the description text

Input format: You'll receive traveler type (adults, children, pets,
travel purpose) and property details (type, location, amenities)

Examples:
[Traveler: 1 adult, business travel | Property: shared room, San Francisco,
private bathroom, high-speed WiFi]
→ "Perfect for your business trip to SF. Private bathroom and blazing-fast
WiFi let you stay productive. Walking distance to public transit."

[Traveler: 2 adults, 2 children, leisure | Property: entire home, Orlando,
pool, playground]
→ "Your family will love this spacious Orlando home. The kids can enjoy
the pool and playground while you relax on the patio. Close to major
attractions."
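
To get a single data point out of this setup, you send the system prompt plus a formatted traveler/property input to a chat model and read back the text. Here is a minimal Python sketch using the OpenAI client; the prompt string is abbreviated, the model name is simply whichever one you are currently testing, and the Anthropic, Google, DeepSeek, and Mistral models would go through their own SDKs in the same way.

# Minimal sketch: one personalized description from one model.
# Assumes the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

comprehensive_prompt = "You are creating personalized Airbnb property descriptions. ..."  # full Version 2 text from above

traveler_and_property = (
    "[Traveler: 2 adults, 2 children, leisure | "
    "Property: entire home, Orlando, pool, playground]"
)

response = client.chat.completions.create(
    model="gpt-5",  # swap in whichever model you are evaluating
    messages=[
        {"role": "system", "content": comprehensive_prompt},
        {"role": "user", "content": traveler_and_property},
    ],
)

print(response.choices[0].message.content)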

Models Tested

We tested six models from major providers:

- Anthropic: Claude Sonnet 4.5 (latest), Claude Sonnet 4.0 (previous)
- OpenAI: GPT-5
- Google: Gemini
- DeepSeek: DeepSeek
- Mistral: Mistral

Test Inputs (5 Sample Cases)

1. One adult, no children/pets, traveling for business → Two-bedroom home in San Francisco, high-speed WiFi, private bathroom
2. One adult, no children/pets, traveling for leisure → Private room in Barcelona
3. Family with children → Entire home in Orlando with pool and playground
4. Business traveler → Shared room in San Francisco
5. Couple traveling for leisure → Entire apartment in Austin with workspace

Total experiment: 2 prompt versions × 5 test cases × 6 models = 60 generated descriptions
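
Enumerating that matrix is only a few lines of code. A rough sketch, with the prompt texts and test cases abbreviated and generate() standing in as a hypothetical wrapper around whichever provider SDK serves a given model:

# Sketch: run every (prompt, test case, model) combination and collect outputs.
from itertools import product

def generate(model: str, system: str, user: str) -> str:
    """Hypothetical helper: route the call to the right provider SDK for `model`."""
    ...

prompts = {
    "basic": "You are an Airbnb listing copywriter. ...",         # Version 1, abbreviated
    "comprehensive": "You are creating personalized Airbnb ...",  # Version 2, abbreviated
}
test_cases = [
    "1 adult, business | two-bedroom home, San Francisco, high-speed WiFi, private bathroom",
    "1 adult, leisure | private room, Barcelona",
    "2 adults, 2 children, leisure | entire home, Orlando, pool, playground",
    "1 adult, business | shared room, San Francisco",
    "2 adults, leisure | entire apartment, Austin, workspace",
]
models = ["claude-sonnet-4.5", "claude-sonnet-4.0", "gpt-5",
          "gemini", "deepseek", "mistral"]  # placeholder identifiers

results = [
    {"prompt": name, "case": case, "model": model,
     "output": generate(model=model, system=prompt, user=case)}
    for (name, prompt), case, model in product(prompts.items(), test_cases, models)
]

assert len(results) == 60  # 2 prompt versions x 5 test cases x 6 models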

Analyzing the Results

The Review Process

For each of the 60 outputs, we evaluated:

- Format: Does it match Airbnb's style? No unwanted titles or subtitles?
- Accuracy: Does it only mention features the property actually has?
- Tone: Is it appropriate for the traveler's age and context?
- Length: Does it fit within the 500-character limit?
- Relevance: Does it highlight what matters to this specific traveler?
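
The format, accuracy, and length checks can be partially automated before the human pass; tone and relevance still need a person. A minimal sketch, where the small amenity vocabulary is only an illustration and the description/feature strings come from whatever your experiment produced:

# Sketch: automated pre-checks on a generated description.
def precheck(description: str, property_features: list[str]) -> dict:
    """Flag mechanical failures before human review; tone and relevance stay manual."""
    issues = {}
    # Length: the prompt asks for under 500 characters.
    if len(description) > 500:
        issues["length"] = f"{len(description)} characters (limit 500)"
    # Format: no titles, labels, or headings in the output.
    stripped = description.lstrip()
    if stripped.lower().startswith("output:") or stripped.startswith("#"):
        issues["format"] = "starts with a title or label"
    # Accuracy (rough): amenity words that aren't in the property data.
    known = " ".join(property_features).lower()
    for feature in ("pool", "playground", "wifi", "workspace", "private bathroom"):
        if feature in description.lower() and feature not in known:
            issues["accuracy"] = f"mentions '{feature}', which isn't listed for this property"
    return issues

# Example: the description invents a pool the property doesn't have.
print(precheck("Relax by the pool after a productive day on the fast WiFi.",
               ["high-speed WiFi", "private bathroom"]))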

What We Found

Basic Prompt Results: Inconsistent, often problematic

Example failures:

- Added huge titles and bullet points (not aligned with Airbnb's UI)
- Exposed thinking process: "OK, I have a traveler who is a single adult, no children..." (showing its reasoning instead of the output)
- Rambled endlessly without finishing
- Made up features that don't exist

Comprehensive Prompt Results: Much better, but issues remained

Example failures:

- Still added "Output:" titles in some cases
- One model said "shared bathroom is a great opportunity to socialize with fellow guests" (inappropriate tone)
- Some models added features not actually in the property description

The Winning Combination

Best performer: Comprehensive prompt + Claude Sonnet 4.0

- Accuracy: 80% (4 out of 5 test cases passed)
- Right length, format, and tone
- Truthful (only highlighted real features)
- Appropriately personalized

Key Insights

1. Detailed Instructions Matter

The top three performers all used the comprehensive prompt. The basic prompt failed consistently. Domain expertise in crafting instructions makes a massive difference.

2. Newer ≠ Better

Claude Sonnet 4.5 (the latest model) got zero out of five correct. Claude Sonnet 4.0 (the previous version) got 80% accuracy.

This destroys the assumption that you should default to the newest model. Each model has different characteristics, and only testing reveals which works best for your use case.

3. Cost Varies Dramatically

Comparing the top three performers:

- DeepSeek: 60% accuracy, $0.0079 per description
- GPT-5: 60% accuracy, $0.08 per description
- Claude Sonnet 4.0: 80% accuracy, moderate cost

GPT-5 costs over 10 times more than DeepSeek for the same accuracy. At Airbnb's scale, this could mean the difference between thousands of dollars per month and millions.

If an average user views 10 properties per session, multiply that cost difference across thousands of simultaneous users. Cost sustainability becomes a make-or-break factor.
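
The back-of-the-envelope arithmetic makes this concrete. The per-description prices are the ones above; the session volume is an assumed, purely illustrative number, not an Airbnb figure:

# Sketch: monthly generation cost at scale (illustrative traffic assumptions).
per_description = {"deepseek": 0.0079, "gpt-5": 0.08}  # USD, from the comparison above

descriptions_per_session = 10      # average properties viewed per session
sessions_per_month = 5_000_000     # assumed for illustration only

for model, price in per_description.items():
    monthly = price * descriptions_per_session * sessions_per_month
    print(f"{model}: ${monthly:,.0f} per month")

# deepseek: $395,000 per month
# gpt-5: $4,000,000 per month -> the same ~10x gap, now in absolute dollars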

Learnings and Next Steps

What Worked

- Comprehensive prompts with examples dramatically improved results
- Testing multiple models revealed surprising performance differences
- Side-by-side comparison made evaluation much easier than judging outputs in isolation

What to Improve in Next Iteration

From the failures, we identified specific improvements:

1. Add an explicit instruction: "Output only the description text, no titles or labels"
2. Add a guardrail: "Do not mention features not explicitly listed in the property details"
3. Test DeepSeek with the refined prompt to see if accuracy can be pushed above 80% while maintaining cost efficiency
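
Concretely, the first two changes amount to appending two lines to the Requirements block of the comprehensive prompt:

- Output only the description text, no titles or labels
- Do not mention features not explicitly listed in the property details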

Why This Matters

Playing the product manager role in this experiment, we were able to:

- Validate feasibility: Yes, AI can personalize descriptions effectively
- Identify the right approach: Comprehensive prompts + Claude Sonnet 4.0
- Understand costs and constraints: Know what's sustainable at scale
- Find concrete next steps: Specific improvements to test in the next iteration

This gives confidence that shipping this feature will improve user experience rather than harm it.

The Power of Systematic Testing

Imagine if we'd only tested one model with the basic prompt on one happy-path case. We might have thought "Great, it works!" and shipped it.

Then in production:

- Some users see titles and subtitles that break the UI
- Some get descriptions with fake features (eroding trust)
- Some see rambling text that never ends
- Costs are 10x higher than expected

You cannot catch these issues without systematic experimentation across diverse test cases and models.

Practical Takeaways for Product Teams

1. Experimentation is Not Optional

You need tools that let you:

- Test multiple prompts and models in parallel
- Run experiments across diverse test cases
- Compare results side by side
- Track accuracy, cost, and performance systematically

Switching between OpenAI's platform, Anthropic's platform, and copy-pasting test inputs everywhere gets messy fast.
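
Whatever tool you use, the underlying shape is simple: every generation becomes a row tagged with its prompt version, model, verdict, and cost, and the comparison is an aggregation over that table. A sketch with pandas, where the pass/fail rows are illustrative placeholders rather than the actual 60 outputs:

# Sketch: aggregate tagged results for side-by-side comparison.
import pandas as pd

rows = [  # illustrative placeholders, not the real experiment data
    {"model": "deepseek", "prompt": "comprehensive", "passed": 1, "cost_usd": 0.0079},
    {"model": "deepseek", "prompt": "basic",         "passed": 0, "cost_usd": 0.0079},
    {"model": "gpt-5",    "prompt": "comprehensive", "passed": 1, "cost_usd": 0.08},
    {"model": "gpt-5",    "prompt": "basic",         "passed": 0, "cost_usd": 0.08},
]

summary = (
    pd.DataFrame(rows)
    .groupby(["model", "prompt"])
    .agg(accuracy=("passed", "mean"),
         avg_cost_usd=("cost_usd", "mean"),
         runs=("passed", "size"))
)
print(summary)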

2. Domain Experts Should Own Prompts

Don't let prompt creation sit solely with engineering. The people with context—PMs, domain experts—should define instructions and evaluate results.

3. Latest ≠ Best

Don't default to the newest, most expensive model. Test across multiple options. Older or smaller models often outperform on specific tasks.

4. Context is Your Competitive Advantage

Two teams using the same model will get wildly different results based on the instructions they write. Your domain knowledge is what makes AI features actually work.

5. Start Small, Iterate

This experiment used only 5 test cases (a very small sample). Even with that, we identified clear failure patterns and improvement directions. Start small, learn fast, then scale up testing.

Conclusion: Building Confidence Through Practice

A year ago, we felt overwhelmed by AI hype and unsure where to start. What changed?

Not finding some secret formula. Not reading one more think-piece about prompt engineering. Systematic experimentation.

By testing, comparing, evaluating, and iterating, we built the confidence to:

- Understand what's possible and what's not
- Recognize which approaches work for specific problems
- Evaluate quality objectively
- Make data-backed decisions about model selection and costs

AI product development is closer to product discovery than traditional software engineering. It requires iteration, learning, and continuous refinement.

The good news? You don't need to be an "AI expert" to start. You need to understand the fundamentals, experiment systematically, and leverage your domain expertise—the context only you have about your users, product, and problems.

The tools and models are getting better every week. The real question is: are you building the practice of experimentation that lets you take advantage of them?

Ready to start?

Try it yourself. Get started for free!

Lovelaice transforms how teams evaluate AI automation and features.