Comparing OAI 120B OSS, Qwen 3.5, and Gemini 3.0 Flash with LLM Multi-Agent Avalon

The author has been running a multi-agent test for the social deduction game Avalon.

Reddit LocalLLaMA · Mar 05, 2026 04:48 UTC · ~2 min + comments

Community

Community-submitted content. Signal comes from upvotes, not editorial vetting. Always check the linked source.

Key Takeaways

The game tests context tracking, hidden intentions, and theory of mind.
Here is a breakdown of how different models handled the gameplay.
System architecture note (structured non-native CoT): the test prompts all models to generate a JSON response before taking action or speaking publicly.

What It Means

Context

The game tests context tracking, hidden intentions, and theory of mind. System architecture note (structured non-native CoT): the test prompts all models to generate a JSON response before taking action or speaking publicly. Instead of a single reasoning field, it forces a structured breakdown across 4 specific fields: self_check (persona verifica
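
As a rough illustration of the structured-JSON step described above, here is a minimal Python sketch of how a harness might check a model's reply before letting it act or speak. Only `self_check` is named in the excerpt; the other three field names are cut off in the source, so `placeholder_2` through `placeholder_4` are stand-ins, and the function name `parse_structured_reasoning` is hypothetical. This is not the author's actual code.

```python
import json

# Only `self_check` appears in the source excerpt. The remaining three field
# names are truncated there, so these are placeholders, not the real schema.
REQUIRED_FIELDS = ("self_check", "placeholder_2", "placeholder_3", "placeholder_4")


def parse_structured_reasoning(raw: str, required=REQUIRED_FIELDS) -> dict:
    """Parse a model reply and require every reasoning field to be a non-empty string."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [
        name
        for name in required
        if not isinstance(data.get(name), str) or not data[name].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return data


if __name__ == "__main__":
    reply = json.dumps({name: "example text" for name in REQUIRED_FIELDS})
    print(sorted(parse_structured_reasoning(reply)))
```

A harness could reject or re-prompt on `ValueError`, so that no agent acts without first filling in every reasoning field.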

For builders

This tests context tracking, hidden intentions, and theory of mind.
