Building AI Products: A Masterclass in Experimentation and Personalization
Written by Madalina Turlea
14 Nov 2025
How we tested infinite personalization for Airbnb property descriptions—and what we learned about AI product development
Live Experiment: Personalizing Airbnb Property Descriptions
Let's walk through a real experiment demonstrating these principles in action.
The Hypothesis
Data shows conversion rates are higher when property descriptions highlight features most relevant to each traveler:
- Families with kids: Safety features, playgrounds, pools
- Business travelers: High-speed WiFi, proximity to public transport, nearby cafes
Instead of showing everyone the same generic description, what if we personalized it based on the traveler's profile?
Setting Up the Experiment
Role: Product manager at Airbnb
Goal: Test if AI can personalize property descriptions effectively
Success criteria:
- Does it work?
- Which model performs best?
- What are the costs and runtime performance?
- What accuracy can we expect? (A simple way to track these per run is sketched below.)
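To make those criteria concrete, here is a minimal sketch of how each generated description could be recorded and scored. This is not the actual harness from the experiment; the field names and the accuracy helper are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One generated description, scored against the success criteria."""
    model: str            # e.g. "claude-sonnet-4-0" (identifier is illustrative)
    prompt_version: str   # "basic" or "comprehensive"
    test_case_id: int     # which of the 5 traveler/property inputs was used
    passed: bool          # did the output meet every review criterion?
    cost_usd: float       # provider-reported cost for this single generation
    latency_s: float      # wall-clock runtime

def accuracy(results: list[RunResult]) -> float:
    """Share of test cases a given model + prompt combination passed."""
    return sum(r.passed for r in results) / len(results)
```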
The System Prompts (Two Versions)
Version 1 - Basic Prompt (where most teams start):
You are an Airbnb listing copywriter. Generate a personalized property
description based on traveler information.
Version 2 - Comprehensive Prompt (with domain expertise):
You are creating personalized Airbnb property descriptions.
Context: This description appears on the property listing page and should
help travelers quickly understand why this property fits their needs.
Requirements:
- Keep under 500 characters
- Be truthful - only highlight features the property actually has
- Personalize based on traveler profile (age, travel purpose, group size)
- Maintain Airbnb's friendly, welcoming tone
- Follow content guidelines: [specific guidelines]
- No titles or subtitles in output - just the description text
Input format: You'll receive traveler type (adults, children, pets,
travel purpose) and property details (type, location, amenities)
Examples:
[Traveler: 1 adult, business travel | Property: shared room, San Francisco,
private bathroom, high-speed WiFi]
→ "Perfect for your business trip to SF. Private bathroom and blazing-fast
WiFi let you stay productive. Walking distance to public transit."
[Traveler: 2 adults, 2 children, leisure | Property: entire home, Orlando,
pool, playground]
→ "Your family will love this spacious Orlando home. The kids can enjoy
the pool and playground while you relax on the patio. Close to major
attractions."
Models Tested
We tested six models from major providers:
- Anthropic: Claude Sonnet 4.5 (latest), Claude Sonnet 4.0 (previous)
- OpenAI: GPT-5
- Google: Gemini
- DeepSeek: DeepSeek
- Mistral: Mistral
Test Inputs (5 Sample Cases)
- One adult, no children/pets, traveling for business → Two-bedroom home in San Francisco, high-speed WiFi, private bathroom
- One adult, no children/pets, traveling for leisure → Private room in Barcelona
- Family with children → Entire home in Orlando with pool and playground
- Business traveler → Shared room in San Francisco
- Couple traveling for leisure → Entire apartment in Austin with workspace
Total experiment: 2 prompt versions × 5 test cases × 6 models = 60 generated descriptions
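Enumerating that grid is straightforward. The sketch below assumes placeholder prompt strings (BASIC_PROMPT, COMPREHENSIVE_PROMPT), illustrative model identifiers, and a hypothetical provider-agnostic generate() helper rather than any specific SDK.

```python
from itertools import product

PROMPTS = {"basic": BASIC_PROMPT, "comprehensive": COMPREHENSIVE_PROMPT}  # placeholder strings

MODELS = [
    "claude-sonnet-4-5", "claude-sonnet-4-0",   # Anthropic
    "gpt-5",                                    # OpenAI
    "gemini", "deepseek", "mistral",            # Google, DeepSeek, Mistral
]  # identifiers are illustrative, not exact API model names

TEST_CASES = [
    "1 adult, business | two-bedroom home, San Francisco, high-speed WiFi, private bathroom",
    "1 adult, leisure | private room, Barcelona",
    "2 adults, 2 children, leisure | entire home, Orlando, pool, playground",
    "1 adult, business | shared room, San Francisco",
    "2 adults, leisure | entire apartment, Austin, workspace",
]

runs = []
for (prompt_name, system_prompt), model, case in product(PROMPTS.items(), MODELS, TEST_CASES):
    output = generate(model=model, system=system_prompt, user=case)  # hypothetical helper
    runs.append({"prompt": prompt_name, "model": model, "case": case, "output": output})

assert len(runs) == 2 * 6 * 5  # 60 generated descriptions
```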
Analyzing the Results
The Review Process
For each of the 60 outputs, we evaluated the following (a sketch of the mechanical checks follows the list):
- Format: Does it match Airbnb's style? No unwanted titles or subtitles?
- Accuracy: Does it only mention features the property actually has?
- Tone: Is it appropriate for the traveler's age and context?
- Length: Does it fit within the 500-character limit?
- Relevance: Does it highlight what matters to this specific traveler?
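The mechanical criteria (length, format, invented features) can be pre-screened automatically before the human review. This is a rough sketch, not the rubric we actually used; the keyword list and the regex are assumptions.

```python
import re

MAX_CHARS = 500
# Illustrative keywords; a real check would be driven by the property data.
AMENITY_KEYWORDS = ["pool", "playground", "wifi", "workspace", "private bathroom", "patio"]

def automatic_checks(output: str, listed_amenities: list[str]) -> dict[str, bool]:
    """Mechanical checks from the rubric above; tone and relevance still need a human pass."""
    text = output.lower()
    listed = " ".join(listed_amenities).lower()
    mentioned = [kw for kw in AMENITY_KEYWORDS if kw in text]
    return {
        "length_ok": len(output) <= MAX_CHARS,                         # 500-character limit
        "no_title": not re.match(r"^\s*(#+\s|output:|\*\*)", text),    # no headings or "Output:" labels
        "no_invented_features": all(kw in listed for kw in mentioned), # only amenities the listing has
    }
```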
What We Found
Basic Prompt Results: Inconsistent, often problematic
Example failures:
- Added huge titles and bullet points (not aligned with Airbnb's UI)
- Exposed thinking process: "OK, I have a traveler who is a single adult, no children..." (showing its reasoning instead of the output)
- Rambled endlessly without finishing
- Made up features that don't exist
Comprehensive Prompt Results: Much better, but still issues
Example failures:
- Still added "Output:" titles in some cases
- One model said "shared bathroom is a great opportunity to socialize with fellow guests" (inappropriate tone)
- Some models added features not actually in the property description
The Winning Combination
Best performer: Comprehensive prompt + Claude Sonnet 4.0
- Accuracy: 80% (4 out of 5 test cases passed)
- Right length, format, and tone
- Truthful (only highlighted real features)
- Appropriately personalized
Key Insights
1. Detailed Instructions Matter
The top three performers all used the comprehensive prompt. The basic prompt failed consistently. Domain expertise in crafting instructions makes a massive difference.
2. Newer ≠ Better
Claude Sonnet 4.5 (the latest model) got zero of the five test cases right. Claude Sonnet 4.0 (the previous version) got four of five, for 80% accuracy.
This destroys the assumption that you should default to the newest model. Each model has different characteristics, and only testing reveals which works best for your use case.
3. Cost Varies Dramatically
Comparing the top three performers:
- DeepSeek: 60% accuracy, $0.0079 per description
- GPT-5: 60% accuracy, $0.08 per description
- Claude Sonnet 4.0: 80% accuracy, moderate cost
GPT-5 costs over 10 times more than DeepSeek for the same accuracy. At Airbnb's scale, this could mean the difference between thousands of dollars per month and millions.
If an average user views 10 properties per session, multiply that cost difference across thousands of simultaneous users. Cost sustainability becomes a make-or-break factor.
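As a back-of-the-envelope illustration, the arithmetic below uses hypothetical traffic figures; only the per-description prices come from the experiment.

```python
# Hypothetical traffic assumptions, for illustration only.
sessions_per_day = 50_000
descriptions_per_session = 10                                        # properties viewed per session
monthly_calls = sessions_per_day * descriptions_per_session * 30     # 15,000,000 generations

# Observed per-description costs from the experiment above.
for model, cost_per_description in {"DeepSeek": 0.0079, "GPT-5": 0.08}.items():
    print(f"{model}: ${monthly_calls * cost_per_description:,.0f} per month")
# DeepSeek ≈ $118,500 per month vs GPT-5 ≈ $1,200,000 per month, at the same 60% accuracy
```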
Learnings and Next Steps
What Worked
- Comprehensive prompts with examples dramatically improved results
- Testing multiple models revealed surprising performance differences
- Side-by-side comparison made evaluation much easier than judging outputs in isolation
What to Improve in Next Iteration
From the failures, we identified specific improvements:
- Add explicit instruction: "Output only the description text, no titles or labels"
- Add guardrail: "Do not mention features not explicitly listed in the property details" (both additions are folded into the prompt in the sketch below)
- Test DeepSeek with the refined prompt to see if we can push accuracy above 80% while maintaining cost efficiency
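A minimal sketch of how those two guardrails could be appended for the next iteration, assuming the version-2 prompt lives in a string named COMPREHENSIVE_PROMPT (a name introduced here purely for illustration):

```python
# Guardrail lines drawn from the two improvements above; appended to the
# comprehensive prompt rather than replacing it.
GUARDRAILS = (
    "- Output only the description text, no titles or labels\n"
    "- Do not mention features not explicitly listed in the property details"
)

REFINED_PROMPT = COMPREHENSIVE_PROMPT + "\n" + GUARDRAILS  # COMPREHENSIVE_PROMPT: placeholder name
```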
Why This Matters
As product managers, we were able to:
- Validate feasibility: Yes, AI can personalize descriptions effectively
- Identify the right approach: Comprehensive prompts + Claude Sonnet 4.0
- Understand costs and constraints: Know what's sustainable at scale
- Find concrete next steps: Specific improvements to test in the next iteration
This gives confidence that shipping this feature will improve user experience rather than harm it.
The Power of Systematic Testing
Imagine if we'd only tested one model with the basic prompt on one happy-path case. We might have thought "Great, it works!" and shipped it.
Then in production:
- Some users see titles and subtitles that break the UI
- Some get descriptions with fake features (eroding trust)
- Some see rambling text that never ends
- Costs are 10x higher than expected
You cannot catch these issues without systematic experimentation across diverse test cases and models.
Practical Takeaways for Product Teams
1. Experimentation is Not Optional
You need tools that let you:
- Test multiple prompts and models in parallel
- Run experiments across diverse test cases
- Compare results side by side
- Track accuracy, cost, and performance systematically
Switching between OpenAI's platform, Anthropic's platform, and copy-pasting test inputs everywhere gets messy fast.
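Even without a dedicated tool, the fan-out itself is simple to sketch. Here generate() is the same hypothetical provider-agnostic helper assumed earlier, not a specific SDK call.

```python
from concurrent.futures import ThreadPoolExecutor

def compare_models(system_prompt: str, case: str, models: list[str]) -> dict[str, str]:
    """Send one prompt/test-case pair to several models at once and collect the outputs."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(generate, model=m, system=system_prompt, user=case)
                   for m in models}
        return {m: f.result() for m, f in futures.items()}
```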
2. Domain Experts Should Own Prompts
Don't let prompt creation sit solely with engineering. The people with context—PMs, domain experts—should define instructions and evaluate results.
3. Latest ≠ Best
Don't default to the newest, most expensive model. Test across multiple options. Older or smaller models often outperform on specific tasks.
4. Context is Your Competitive Advantage
Two teams using the same model will get wildly different results based on the instructions they write. Your domain knowledge is what makes AI features actually work.
5. Start Small, Iterate
This experiment used only 5 test cases (a very small sample). Even with that, we identified clear failure patterns and improvement directions. Start small, learn fast, then scale up testing.
Conclusion: Building Confidence Through Practice
A year ago, we felt overwhelmed by AI hype and unsure where to start. What changed?
Not finding some secret formula. Not reading one more think-piece about prompt engineering. Systematic experimentation.
By testing, comparing, evaluating, and iterating, we built the confidence to:
- Understand what's possible and what's not
- Recognize which approaches work for specific problems
- Evaluate quality objectively
- Make data-backed decisions about model selection and costs
AI product development is closer to product discovery than to traditional software engineering. It requires iteration, learning, and continuous refinement.
The good news? You don't need to be an "AI expert" to start. You need to understand the fundamentals, experiment systematically, and leverage your domain expertise—the context only you have about your users, product, and problems.
The tools and models are getting better every week. The real question is: are you building the practice of experimentation that lets you take advantage of them?