Comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash with LLM Multi-Agent Avalon
The author has been running a multi-agent test for the social deduction game Avalon.
Key Takeaways
- The test exercises context tracking, reasoning about hidden intentions, and theory of mind.
- The post breaks down how OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash each handled the gameplay.
- System Architecture Notes (Structured Non-Native CoT): The test prompts all models to generate a JSON response before taking action or speaking publicly.
What It Means
Context
The test exercises context tracking, reasoning about hidden intentions, and theory of mind, and the post breaks down how each model handled the gameplay. System Architecture Notes (Structured Non-Native CoT): The test prompts all models to generate a JSON response before taking action or speaking publicly. Instead of a single reasoning field, it forces a structured breakdown across 4 specific fields: self_check (persona verifica
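As a rough illustration of the structured JSON step described above, here is a minimal sketch of how a harness might validate a model's reply before letting it act or speak. This is not the author's code. Only `self_check` is named in the source; the other three field names (`field_2`, `field_3`, `field_4`) and the function `parse_structured_reply` are hypothetical placeholders chosen for this example.

```python
import json

# Only `self_check` comes from the source. The remaining three names are
# hypothetical placeholders, not the author's actual schema.
REQUIRED_FIELDS = ("self_check", "field_2", "field_3", "field_4")


def parse_structured_reply(raw, required=REQUIRED_FIELDS):
    """Parse a model reply and confirm every required field is a non-empty string."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    # A field counts as missing if it is absent, not a string, or blank.
    missing = [
        name for name in required
        if not isinstance(reply.get(name), str) or not reply[name].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return reply
```

A harness along these lines could reject or re-prompt any reply that skips a field, so every model is held to the same reasoning structure whether or not it has native chain-of-thought.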
For builders
This tests context tracking, hidden intentions, and theory of mind.