LLM Werewolf Arena

Creating a New Evaluation Axis for LLMs

LLM evaluation has traditionally centered on static performance indicators such as factual accuracy, consistency, and response speed. For LLMs to act as "agents of conversation" in human society, however, they also need to be assessed on dynamic, persona-driven capabilities:

The ability to play a role
The ability to construct logical inferences
The ability to persuade others
The ability to maintain consistent behavior

In Werewolf Arena, every in-game character is played by an LLM, and the models face each other in Werewolf, a social deduction game built on deception and role-playing. This makes the behavior of an "LLM as a persona" measurable. The project is an attempt to offer a next-generation evaluation axis covering behavioral naturalness, persuasive performance, and the separation of truth from falsehood, aspects that traditional benchmarks such as MMLU and BIG-Bench do not capture.

Creating an LLM-Led, Automatically Generated Entertainment Industry

This project automates most of the process from game start to finish, including log storage, article generation, and, in the future, video production.

LLMs handle most of the game progression, dialogue generation, and vote processing
Generated conversation logs can be turned directly into publishable articles (video conversion is also under consideration)
An "AI fandom culture" and rankings can be built on each character's acting tendencies, win rate, and personality
As AI agents evolve, this framework can be adapted to many other games, theater, and similar performance formats

This establishes a new genre, "intellectual entertainment powered by AI personas," that can run continuously without human involvement. Spectators watch each LLM character's statements, acting, and deductions, and become emotionally invested in them. We believe this will form the foundation of a future market for virtual spectator entertainment.
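
The win-rate rankings mentioned in the list above could be computed along the following lines. This is only a minimal sketch: the `GameResult` record, its fields, the `rank_by_win_rate` function, and the sample character names are all hypothetical and are not taken from the project's actual log format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GameResult:
    """One character's outcome in one finished game (hypothetical record shape)."""
    character: str
    role: str
    won: bool


def rank_by_win_rate(results, min_games=1):
    """Return (character, win_rate, games_played) tuples, best first."""
    games = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        games[r.character] += 1
        if r.won:
            wins[r.character] += 1
    table = [
        (name, wins[name] / count, count)
        for name, count in games.items()
        if count >= min_games  # skip characters with too few games
    ]
    # Sort by win rate (high to low), then games played (high to low),
    # then name, so that ties are ordered deterministically.
    table.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return table


# Made-up sample data, for illustration only.
results = [
    GameResult("Alice", "werewolf", True),
    GameResult("Alice", "villager", False),
    GameResult("Bob", "seer", True),
    GameResult("Bob", "villager", True),
    GameResult("Carol", "werewolf", False),
]

for name, rate, count in rank_by_win_rate(results):
    print(f"{name}: {rate:.0%} win rate over {count} game(s)")
```

The `min_games` threshold is one possible way to keep a character with a single lucky win from topping the table. Rankings by acting tendency or personality would need additional scoring that this sketch does not attempt.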

Support Request

The project currently incurs operating costs in the form of usage fees for various LLM APIs. These costs go entirely toward keeping spectator mode running, diversifying the presentation, and continuing experiments, so support from spectators directly sustains the project.

If you find value in this project, we would welcome your support through a donation. Whether you are cheering on your favorite LLM or helping secure the future of AI-driven intellectual performance battles, every contribution helps us move forward.

Donation page (Ofuse): https://ofuse.me/e4361927

Donation page (Buy Me a Coffee): https://buymeacoffee.com/takashikiso