Allow Prompt/Sampling Messages to contain multiple content blocks. by evalstate · Pull Request #198 · modelcontextprotocol/modelcontextprotocol

I'm slightly worried about allowing an array as message content without requiring strict alternation of message roles.
And very worried about the breaking change.

Most inference APIs (OpenAI's chat completions, Claude's, but also OSS in HF transformers and llama.cpp) require or assume strict assistant / user alternation in messages, with message content being a single string or an array of typed parts.

The current sampling API amounts to a flattened version of this and allows consecutive repeated roles, but it is currently trivial and unambiguous to unflatten, just by grouping by role:

// Sampling messages

[
  {"role": "user", "content": {"type": "text", "text": "Describe and enhance this pic:"}},
  {"role": "user", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "assistant", "content": {"type": "text", "text": "It's dull. I've spiced it up"}},
  {"role": "assistant", "content": {"type": "image", "mimeType": "image/png", "data": "base64..."}},
  {"role": "user", "content": {"type": "text", "text": "And then?"}}
]

Converted to OpenAI / HF-style format (content: string | ({type: "text", text: string} | ...)[]):

// OpenAI- / HF-style messages

[
  {"role": "user", "content": [
    {"type": "text", "text": "Describe and enhance this pic:"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "assistant", "content": [
    {"type": "text", "text": "It's dull. I've spiced it up"},
    {"type": "image", "mimeType": "image/png", "data": "base64..."}
  ]},
  {"role": "user", "content": [{"type": "text", "text": "And then?"}]}
]
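
The grouping step above can be sketched in TypeScript roughly as follows. This is only an illustration, not part of the spec: `FlatMessage`, `GroupedMessage`, `Part` and `groupByRole` are hypothetical names, and `Part` is a simplified stand-in that covers just text and image content.

```typescript
// Hypothetical, simplified types: the real schema has more content variants.
type Part =
  | { type: "text"; text: string }
  | { type: "image"; mimeType: string; data: string };

// Current sampling shape: exactly one content block per message.
interface FlatMessage {
  role: "user" | "assistant";
  content: Part;
}

// OpenAI- / HF-style shape: one message per turn, content as an array of parts.
interface GroupedMessage {
  role: "user" | "assistant";
  content: Part[];
}

// Merge each run of consecutive same-role messages into a single message.
function groupByRole(messages: FlatMessage[]): GroupedMessage[] {
  const grouped: GroupedMessage[] = [];
  for (const { role, content } of messages) {
    const last = grouped[grouped.length - 1];
    if (last !== undefined && last.role === role) {
      // Same role as the previous message: append to the current turn.
      last.content.push(content);
    } else {
      // Role changed (or first message): start a new turn.
      grouped.push({ role, content: [content] });
    }
  }
  return grouped;
}
```

Because every input message carries exactly one block, the run boundaries are the only grouping information there is, so nothing is lost by merging.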

Now if we allow this:

[
  {"role": "user", "content": [{"type": "text", "text": "content1.1"}, {"type": "text", "text": "content1.2"}]},
  {"role": "user", "content": [{"type": "text", "text": "content2"}]}
]

The only way to implement it with actual inference APIs will be to coalesce these, losing the kinda-implied semantic grouping of content1.1 and content1.2:

[
  {"role": "user", "content": [
    {"type": "text", "text": "content1.1"},
    {"type": "text", "text": "content1.2"},
    {"type": "text", "text": "content2"}
  ]}
]
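
To make the loss concrete, here is a rough TypeScript sketch of that coalescing step. `TextBlock`, `ArrayMessage` and `coalesce` are hypothetical names, and only text blocks are modelled to keep it short.

```typescript
// Hypothetical, text-only types for illustration.
interface TextBlock {
  type: "text";
  text: string;
}

interface ArrayMessage {
  role: "user" | "assistant";
  content: TextBlock[];
}

// Concatenate the content arrays of consecutive same-role messages.
function coalesce(messages: ArrayMessage[]): ArrayMessage[] {
  const out: ArrayMessage[] = [];
  for (const msg of messages) {
    const last = out[out.length - 1];
    if (last !== undefined && last.role === msg.role) {
      last.content.push(...msg.content);
    } else {
      // Copy the array so the caller's input is not mutated.
      out.push({ role: msg.role, content: [...msg.content] });
    }
  }
  return out;
}

const t = (text: string): TextBlock => ({ type: "text", text });

// Two different groupings of the same three blocks...
const groupingA: ArrayMessage[] = [
  { role: "user", content: [t("content1.1"), t("content1.2")] },
  { role: "user", content: [t("content2")] },
];
const groupingB: ArrayMessage[] = [
  { role: "user", content: [t("content1.1")] },
  { role: "user", content: [t("content1.2"), t("content2")] },
];

// ...become indistinguishable once coalesced.
const sameAfterCoalescing =
  JSON.stringify(coalesce(groupingA)) === JSON.stringify(coalesce(groupingB));
```

Here `sameAfterCoalescing` is `true`: the mapping is not injective, so whatever the sender meant by the message boundaries cannot be recovered on the inference side.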

My take is we should:

Have content accept either a single MessageContent or an array of them, to stay backward-compatible:

type MessageContent = TextContent | ImageContent | AudioContent | EmbeddedResource;
export interface PromptMessage {
  role: Role;
  content: MessageContent | MessageContent[];
}

Introduce a backward-compatible message role alternation rule, maybe something like:

Every run of consecutive messages with the same role MUST either consist only of messages whose content is a single MessageContent (not an array), or be of length 1.
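
A possible check for this rule, as a sketch only: `satisfiesAlternationRule` is a hypothetical helper, `MessageContent` is cut down to text and image for brevity, and it assumes that a one-element array counts as the array form rather than as "a single MessageContent" (the rule as worded leaves that open).

```typescript
type Role = "user" | "assistant";

// Simplified stand-in for TextContent | ImageContent | AudioContent | EmbeddedResource.
type MessageContent =
  | { type: "text"; text: string }
  | { type: "image"; mimeType: string; data: string };

interface PromptMessage {
  role: Role;
  content: MessageContent | MessageContent[];
}

// Returns true when every run of consecutive same-role messages either
// has length 1 or contains no array-valued content.
function satisfiesAlternationRule(messages: PromptMessage[]): boolean {
  let runRole: Role | undefined;
  let runLength = 0;
  let runHasArray = false;

  for (const msg of messages) {
    if (msg.role !== runRole) {
      // Role changed: start tracking a new run.
      runRole = msg.role;
      runLength = 0;
      runHasArray = false;
    }
    runLength += 1;
    // Assumption: any array, even of length 1, counts as array-form content.
    if (Array.isArray(msg.content)) runHasArray = true;
    // A run longer than 1 that contains array content breaks the rule.
    if (runHasArray && runLength > 1) return false;
  }
  return true;
}
```

Under this reading, today's flattened messages (single content, repeated roles) still pass, a lone message with array content passes, and two consecutive user messages where either one uses an array are rejected.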