The benchmark suite

Twelve tasks, four phrase types, one shared frame

Every task lives at the intersection of a phrase category (idiom, collocation, noun compound, verbal MWE) and a semantic operation (detection, extraction, categorization, interpretation). Below: the canonical examples used in the paper, the prompt skeleton, and what each metric actually rewards.

Idiomatic expressions


IED · Detection

Idiomatic Expression Detection

Given context, choose the correct figurative reading from a 4-option MCQ. Tests whether the model recognises that a literal reading is wrong.

Example · context

"The fans waited for hours, hoping to catch a glimpse of the celebrity by the stage door."

Choices

✓briefly see

×grab and hold

×throw away

×put together

Metric — MCQ Accuracy · Source — MAGPIE / EPIE
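As a rough illustration of what MCQ Accuracy rewards, the sketch below scores a batch of option choices against gold labels. The function name `mcq_accuracy` and the use of letter labels are assumptions made for this example, not the benchmark's own scoring code.

```python
from typing import Sequence


def mcq_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose chosen option label matches the gold label.

    Hypothetical helper: labels are compared case-insensitively after
    trimming whitespace, which is an assumption of this sketch.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


if __name__ == "__main__":
    # Three of the four choices match the gold labels.
    print(mcq_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
```

Only the selected option counts here: a model that picks a near-miss distractor earns nothing, exactly like one that picks an unrelated option.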

IEE · Extraction

Idiomatic Expression Extraction

Spot the idiom span in a sentence that may or may not contain one. Open-ended — no choice list.

Sentence

"Honestly, fixing this build feels like beating a dead horse at this point."

→beating a dead horse

Metric — Exact Match (span)
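One possible reading of span-level Exact Match is sketched below. The normalisation steps (trimming, quote stripping, lowercasing, whitespace collapsing) and the names `normalize_span` and `exact_match` are assumptions of this example; the benchmark may normalise differently or not at all.

```python
import re


def normalize_span(text: str) -> str:
    """Trim, drop surrounding quotes, lowercase, and collapse whitespace.

    These normalisation choices are assumptions of this sketch.
    """
    text = text.strip().strip("\"'").strip().lower()
    return re.sub(r"\s+", " ", text)


def exact_match(predicted: str, gold: str) -> bool:
    """True only when the normalised spans are identical."""
    return normalize_span(predicted) == normalize_span(gold)


if __name__ == "__main__":
    gold = "beating a dead horse"
    print(exact_match('"Beating a  dead horse"', gold))  # same span, different surface form
    print(exact_match("a dead horse", gold))             # partial span
```

Under this reading a partially correct span scores zero, so the metric rewards getting the boundaries right, not just locating the expression.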

IEI · Interpretation

Idiomatic Expression Interpretation

Generate a paraphrase of the idiom that fits the surrounding context — the hardest of the three.

Sentence

"The merger came at the time of significant change: business positively turbulent."

→"a period of major upheaval and transition"

Metric — ROUGE-L · BERTScore-F1 · METEOR · BLEU
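To make the ROUGE-L component concrete, here is a minimal pure-Python sketch based on the longest common subsequence (LCS) of tokens. It assumes whitespace tokenisation, lowercasing, and a balanced F1; published ROUGE implementations differ in tokenisation, stemming, and the weighting of precision against recall, so scores from this sketch are illustrative only.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Balanced F1 over LCS-based precision and recall (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "a period of major upheaval and transition"
    print(rouge_l_f1("a time of major upheaval", reference))
```

Because the score depends on in-order token overlap, a faithful paraphrase that uses different words scores low. That is presumably why the embedding-based BERTScore-F1 is reported alongside it.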

Lexical collocations


LCC · Categorization

Lexical Collocation Categorization

Tag the collocation with one of the lexical-function semantic roles. We evaluate at five taxonomy granularities: 1, 2, 4, 8 and 16 categories.

Collocation

brute power blacking out the homes of 5,000 residents.

→Magn (intensifier)

Metric — Accuracy · Macro / Micro / Weighted F1 · Taxonomies — 1 / 2 / 4 / 8 / 16
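The three F1 averages reward different things, which the following sketch tries to make visible. It is a hand-rolled illustration, not the benchmark's scorer: the function name `f1_report` is hypothetical, and averaging over the union of gold and predicted labels is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_report(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro, micro, and support-weighted F1 for single-label data."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))  # assumption: union of both label sets
    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    support = Counter(gold)
    total = len(gold)
    return {
        "accuracy": sum(tp.values()) / total,
        "macro_f1": sum(per_label.values()) / len(labels),
        "micro_f1": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted_f1": sum(per_label[lab] * support[lab] for lab in labels) / total,
    }


if __name__ == "__main__":
    gold = ["Magn", "Magn", "Caus", "Caus"]
    pred = ["Magn", "Caus", "Caus", "Caus"]
    print(f1_report(gold, pred))
```

Macro F1 treats every category equally, so rare lexical functions weigh as much as frequent ones. Micro F1 counts every instance once, and in this single-label setting it coincides with accuracy. Weighted F1 scales each category's score by its gold support.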

LCE · Extraction

Lexical Collocation Extraction

Identify the collocation span in running text. Used downstream as the first step of sequential tasks.

Sentence

"The sky filled up the moment the alarm clock went off."

→filled up

Metric — Exact Match

LCI · Interpretation

Lexical Collocation Interpretation

Paraphrase the collocation in context — converting a lexical preference into a transparent gloss.

Sentence

"The president continued Friday to call into his sons' Boston accusation."

→"cast doubt on the validity of"

Metric — ROUGE-L · BERTScore-F1 · METEOR · BLEU

CR · Retrieval

Collocate Retrieval

Given a base word and a desired semantic function, retrieve the appropriate collocate. Probes lexical preference directly.

Prompt

Base: alarm · Function: cause-to-be (Caus)

→"raise"

Metric — Exact Match

CI · Identification

Collocation Identification

Binary check on a candidate pair — is it a conventionalised collocation, or just two co-occurring words?

Pair

heavy rain ↔ ?

✓collocation

Metric — Accuracy

Noun compounds


NCC · Compositionality

Noun Compound Compositionality

Decide whether the compound is fully compositional, partly so, or non-compositional.

Compound

"fair play" incorporates the concept of fairness, respect for others, and adherence to the rules.

×A · Compositional

✓B · Partly compositional

×C · None of the above

×D · Non-compositional

Metric — MCQ Accuracy · Source — NCTTI

NCE · Extraction

Noun Compound Extraction

Pick out the [N1 N2] span in a sentence — the compound boundary problem in the wild.

Sentence

"The timeless shape of the perfect Paris fashion belies its 13th-century roots in Rouen."

→Paris fashion

Metric — Exact Match

NCI · Interpretation

Noun Compound Interpretation

Produce a free-form paraphrase that preserves the modifier-head relation.

Compound

"She used a straightedge to draw a ruler line across the paper, ensuring her print was perfectly aligned."

→"ruler bar"

Metric — ROUGE-L · BERTScore-F1 · METEOR · BLEU

Verbal multi-word expressions


VMWE · Extraction

Verbal MWE Extraction

Drawn from PARSEME 1.1. Identifies six verbal-MWE subclasses (LVCs, IRVs, VIDs, …) — the only task with deeper internal sub-structure.

Sentence

"Harry licked his arm as the landlady filled clobber."

→"licked on" · "rilled bar"

Metric — Exact Match · Source — PARSEME 1.1 (English)

Sequential compositions


Six tasks chain extraction with a downstream operation. The model must extract a phrase and then categorize or interpret it in a single response, a setting that surfaces cascade failures invisible to single-step evaluation.

Idiom Ext. + Judgment

IEE → IED

Extract an idiom from a sentence, then classify whether the use is figurative.

Idiom Ext. + Interpretation

IEE → IEI

Extract the idiom, then produce a context-faithful paraphrase. Errors in step 1 propagate.

Coll. Ext. + Judgment

LCE → CI

Extract the collocate pair, then judge whether it is a true collocation.

Coll. Ext. + Interpretation

LCE → LCI

Extract the collocation, then gloss it. The most cascade-sensitive of the six.

NC Ext. + Compositionality

NCE → NCC

Extract the compound, then judge its compositionality grade.

NC Ext. + Interpretation

NCE → NCI

Extract the compound, then paraphrase its modifier-head relation.
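The cascade effect in these six chains can be sketched by scoring both steps per item and comparing joint accuracy with downstream accuracy conditioned on a correct extraction. Everything below is hypothetical: the `StepResult` record, the function name, and the simplification that the downstream step is judged as a binary correct or incorrect. The interpretation chains use graded metrics, so that simplification would not apply to them directly.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class StepResult:
    """Per-item outcome of a two-step chain (hypothetical record format)."""
    extraction_correct: bool
    downstream_correct: bool


def cascade_summary(results: Sequence[StepResult]) -> Dict[str, float]:
    """Summarise how often errors in step 1 block success in step 2."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    extracted = [r for r in results if r.extraction_correct]
    joint = sum(r.downstream_correct for r in extracted)
    return {
        "extraction_acc": len(extracted) / n,
        # Both steps must be right for an item to count.
        "joint_acc": joint / n,
        # Downstream quality once extraction errors are factored out.
        "downstream_acc_given_extraction": joint / len(extracted) if extracted else 0.0,
    }


if __name__ == "__main__":
    results = [
        StepResult(True, True),
        StepResult(True, False),
        StepResult(False, False),
        StepResult(False, True),
    ]
    print(cascade_summary(results))
```

A gap between the joint figure and the conditional figure is one way to see how much of the chained failure originates in extraction, not in the downstream operation itself.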

Prompt skeleton


Prompts are deliberately minimal: a one-line task statement, an optional handful of examples, then the input. The intent is to vary the task, not the prompt.

Task:    {task description}
Input:   {phrase + context}
Output:  {expected form}

— example 1 —
Input:   "She tried to break the ice at the meeting."
Output:  "make people feel less awkward"

All prompts (zero-shot and few-shot variants for every task) ship with the codebase.
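As a sketch of how the skeleton might be filled in programmatically, the helper below assembles a task statement, optional few-shot examples, and the final input, in that order. The function `build_prompt`, the ASCII example header, and the test sentence are assumptions for illustration; the prompts shipped with the codebase remain the authoritative versions.

```python
from typing import Iterable, Tuple


def build_prompt(
    task_description: str,
    input_text: str,
    examples: Iterable[Tuple[str, str]] = (),
) -> str:
    """Render a minimal prompt: task line, optional examples, then the input.

    Hypothetical helper; the example header is an ASCII stand-in for the
    header line shown in the skeleton.
    """
    lines = [f"Task:    {task_description}"]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        lines += [
            "",
            f"-- example {i} --",
            f'Input:   "{example_input}"',
            f'Output:  "{example_output}"',
        ]
    # The final Output field is left empty for the model to complete.
    lines += ["", f'Input:   "{input_text}"', "Output:"]
    return "\n".join(lines)


if __name__ == "__main__":
    few_shot = [
        ("She tried to break the ice at the meeting.",
         "make people feel less awkward"),
    ]
    # The input sentence below is an invented illustration, not benchmark data.
    print(build_prompt(
        "Paraphrase the idiom so that it fits the context.",
        "He let the cat out of the bag.",
        few_shot,
    ))
```

Passing an empty `examples` sequence yields the zero-shot variant, so only the task description changes between tasks while the frame stays fixed.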