War Room: Artificial Teammates, Experiment 6

ABOUT: Large language models aren’t trustworthy decision-makers, but prompted well, they can be good facilitators. We placed five human characters in a stressful pandemic simulation, assisted by Anthropic’s new large language model, Claude 2. 

INTRODUCING CLAUDE 2: Armed with a longer context window and more nuanced conversational skills than its predecessors, Claude 2 has impressed critics since its release this month. Users love that it accepts text-based document uploads. It’s less zany than Bing but harder to jailbreak, both strengths in a serious conversation.

Effective communication can make or break a crisis response. Can Anthropic’s new model, Claude 2, facilitate better decision-making at crunch time?

RECAP: We’re trying to catch lightning in a bottle. Augmented collective intelligence (ACI) is an emergent property of hybrid human-AI teams that can improve problem-solving. It’s hard to measure because collective intelligence (CI) stems from many individual interactions. Abhishek and I have run several experiments to surface early patterns.

In each of these, we’ve challenged five human characters + a large language model (LLM) to deliberate on increasingly serious topics, such as:

  • Where to eat lunch

  • What their book club should read next

  • Which employee to hire

TODAY’S CHALLENGE: We’re amping up the stakes and exploring how city officials could use LLMs to manage a new pandemic. 

Global public health infrastructure has not recovered from COVID-19, leaving us vulnerable to future pandemics. Over the next two to three years, leaders worldwide must prepare for new natural and engineered pandemics. General-purpose AI models make it much cheaper and easier to build synthetic viruses far deadlier than those found in nature, increasing the risk of a catastrophic accident or attack. 


Meet the players in our experiment:

The humans: First-year consultants Adam, Beth, Caleb, Danielle, and Ethan, who are asked to take on the roles of various city officials in a pandemic response simulation. I (Emily) play all five via written dialogue on Claude 2’s desktop site.   

Each has a distinct personality and set of abilities:

  • Adam: Can be self-centered and prone to misunderstandings

  • Beth: A consistent performer and good communicator

  • Caleb: A weak critical thinker; underperforms compared to peers

  • Danielle: Enthusiastic and passionate about her work, but scattered

  • Ethan: Reluctant to speak up in a group setting

The prompt: Drawing on existing CI research, we drafted a detailed prompt to help Claude 2 facilitate. The top priority was keeping the LLM out of the driver’s seat so it would elicit the humans’ own best judgment.
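
To give a flavor of the setup, here is a minimal, hypothetical sketch of how a facilitation turn could be wired up with the Anthropic Python SDK’s completions API as it existed at Claude 2’s release. The facilitation instructions and helper names are illustrative stand-ins, not our actual prompt.

```python
# Hypothetical sketch: one facilitation turn against Claude 2.
# FACILITATOR_BRIEF is an illustrative stand-in, not the prompt
# used in this experiment.
import anthropic

FACILITATOR_BRIEF = """You are facilitating a five-person pandemic-response
simulation. Stay out of the driver's seat: never decide for the group.
Elicit the participants' own best judgment, surface disagreements, and
keep the discussion focused on one question at a time."""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def facilitate(transcript: str) -> str:
    """Ask Claude 2 for its next facilitation move, given the dialogue so far."""
    response = client.completions.create(
        model="claude-2",
        max_tokens_to_sample=300,
        prompt=f"{anthropic.HUMAN_PROMPT} {FACILITATOR_BRIEF}\n\n"
               f"Dialogue so far:\n{transcript}{anthropic.AI_PROMPT}",
    )
    return response.completion
```

In our actual runs, the dialogue happened manually on Claude 2’s desktop site; a harness like this would matter mainly for repeating the exercise at scale.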

THE EXERCISE: Our team faces the first wave of a new pandemic, a rapidly spreading respiratory virus. They must adapt to new roles, decompose the problem, and quickly list and prioritize core uncertainties. Is it more important to learn the rate of spread, the availability of PPE, or the impact of a shutdown on local businesses? It’s a fast-paced and confusing conversation, so we’ll see whether Claude 2 can keep our group on track.

THE RESULTS: Despite perfect manners and a clear understanding of the situation, this experimental setup generated less effective deliberation than we hoped. Characters struggled to define a shared frame of understanding, hopping from topic to topic without reaching consensus. Competent characters made strong individual contributions but didn’t benefit much from their peers.

What happened? Let’s take a look. 

The problematic patterns show up immediately.

Imagine if, thirty seconds into a debate, a moderator gave detailed feedback every time anyone said something. 

Claude 2 spends a lot of time validating everyone’s statements and announcing its intent to help, but it is so hesitant to steer the conversation that it raises about two dozen questions simultaneously. Responding to a message like that could take hours in a live chat!

This happens repeatedly.

Here’s a representative exchange:

Ethan: “Yeah, that makes sense.”

Claude 2: “Ethan, I appreciate you noting that this approach makes sense. Speaking up with validation and agreement can help build consensus.”

Over the following lines of dialogue, the characters criticize Claude 2 for failing to contribute helpfully to the conversation.

Claude 2 finally contributes to the conversation by synthesizing insights discussed so far – but only at Beth’s explicit request. 


WHAT DID WE LEARN? Given that LLMs are persuasive, biased in hidden and powerful ways, and not robustly aligned with human values, delegating even soft social power to them seems like a bad idea. Every diplomat knows that whoever (or whatever) sets the agenda and steers the conversation wields enormous influence over its outcome.

Meanwhile, fully obsequious behavior, like what this prompt elicited from Claude 2, is annoying and unhelpful. I suspect that “off-the-shelf” Claude 2, with a less detailed prompt, would have performed better as a moderator. But it’s hard to assess whether it’s a good idea to involve LLMs as moderators in high-stakes decisions. Doing it right would mean understanding the following:

  • Range of human performance on this type of task

  • Range of an AI moderator’s potential impacts on human performance

The second category will be much harder to measure. Because LLM biases are largely unknown and can change significantly from prompt to prompt, it can be challenging to tell genuinely helpful behavior from behavior that looks helpful but is optimized for a different goal.
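
One way to make that comparison concrete is to run the same scenario under several moderation conditions and score the transcripts against a shared rubric. The sketch below is hypothetical bookkeeping, not a validated method; the condition labels and the score_transcript function are placeholders for whatever rubric (human raters, CI metrics) a real study would use.

```python
# Hypothetical sketch: comparing deliberation quality across moderator
# conditions. score_transcript is a placeholder scoring function.
from statistics import mean
from typing import Callable

CONDITIONS = ["no_moderator", "human_moderator", "llm_moderator"]

def impact_ranges(
    transcripts: dict[str, list[str]],
    score_transcript: Callable[[str], float],
) -> dict[str, tuple[float, float, float]]:
    """Return (min, mean, max) deliberation scores per condition.

    Reporting ranges, not just means, matters here: an LLM moderator's
    effect may swing from prompt to prompt far more than a human's.
    """
    ranges = {}
    for condition in CONDITIONS:
        scores = [score_transcript(t) for t in transcripts[condition]]
        ranges[condition] = (min(scores), mean(scores), max(scores))
    return ranges
```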


WHAT’S NEXT: To improve our prompt design, we plan to rerun the same exercise using the systematic prompt-design tools suggested by Zamfirescu-Pereira et al. earlier this year in their blockbuster paper, “Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts.”

Emily Dardaman and Abhishek Gupta

BCG Henderson Institute Ambassador, Augmented Collective Intelligence
