Empirical findings

What 30 models tell us about phrasal semantics

Headlines first, then five readouts: where models lead, where they collapse, and where in-context learning helps versus hurts. All numbers below are taken from the paper's Table 5, Figures 4–6 and the appendix benchmark tables.

82.8 · Best IED accuracy (GPT-5)

−39 pp · LCC drop · 2 → 16 categories

66% · Median cascade error inflation

30 · Models compared

Headline benchmark — selected models

Selected results across the four operation families. Rows ordered by overall semantic competence (mean of column means). All numbers are zero-shot unless noted.

Model	IED	IEI R-L	LCC 8-cat	LCE	NCC	NCI R-L	VMWE
GPT-5	82.8	35.6	62.1	73.4	71.0	28.4	66.7
OpenAI o3	81.5	34.1	60.2	72.0	69.6	27.7	65.1
Claude Sonnet 4.5	79.0	33.8	58.7	71.1	68.2	27.0	63.4
Gemini 3.1 Pro	78.2	32.7	57.3	70.5	67.4	26.4	62.8
DeepSeek-R1	75.6	31.1	54.0	68.3	65.0	25.2	60.5
Kimi K2 Instruct	73.4	29.5	52.2	66.7	63.1	24.4	58.6
GLM-4.6	72.0	28.9	51.0	65.3	62.0	23.7	57.2
Qwen3-235B-A22B	71.2	28.1	49.6	64.5	61.0	23.0	56.4
Gemma 3 27B IT	63.1	23.4	42.7	56.3	52.5	19.6	48.8
Llama 3 8B	55.9	19.0	35.8	49.2	45.6	16.2	41.7
Random baseline	25.0	—	12.5	—	25.0	—	—

IED · NCC report MCQ accuracy. IEI · NCI report ROUGE-L on a held-out gloss. LCC is reported on the 8-category split. LCE · VMWE report exact-match span F1.
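The row order can be checked against the table itself. The sketch below assumes that "overall semantic competence" is each model's unweighted mean over the seven reported columns; this is only one plausible reading of "mean of column means", and the paper may normalize or weight the columns differently. The scores are copied from the table above; the variable names are illustrative.

```python
from statistics import mean

# Scores copied from the headline table, in column order:
# IED, IEI R-L, LCC 8-cat, LCE, NCC, NCI R-L, VMWE
scores = {
    "GPT-5":             [82.8, 35.6, 62.1, 73.4, 71.0, 28.4, 66.7],
    "OpenAI o3":         [81.5, 34.1, 60.2, 72.0, 69.6, 27.7, 65.1],
    "Claude Sonnet 4.5": [79.0, 33.8, 58.7, 71.1, 68.2, 27.0, 63.4],
    "Gemini 3.1 Pro":    [78.2, 32.7, 57.3, 70.5, 67.4, 26.4, 62.8],
    "DeepSeek-R1":       [75.6, 31.1, 54.0, 68.3, 65.0, 25.2, 60.5],
    "Kimi K2 Instruct":  [73.4, 29.5, 52.2, 66.7, 63.1, 24.4, 58.6],
    "GLM-4.6":           [72.0, 28.9, 51.0, 65.3, 62.0, 23.7, 57.2],
    "Qwen3-235B-A22B":   [71.2, 28.1, 49.6, 64.5, 61.0, 23.0, 56.4],
    "Gemma 3 27B IT":    [63.1, 23.4, 42.7, 56.3, 52.5, 19.6, 48.8],
    "Llama 3 8B":        [55.9, 19.0, 35.8, 49.2, 45.6, 16.2, 41.7],
}

# Assumption: overall competence = unweighted mean of a model's seven scores.
overall = {model: mean(row) for model, row in scores.items()}
ranking = sorted(overall, key=overall.get, reverse=True)

for model in ranking:
    print(f"{model:<20} {overall[model]:.1f}")
```

Because every column in this selection is strictly decreasing from top to bottom, any positive weighting of the columns would produce the same order.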

01 · Detection beats interpretation, every time

For every phrase category shown below, the surface-level discrimination score is more than twice the score for producing a faithful gloss, although the two are measured differently (accuracy versus ROUGE-L). The same model that picks the right MCQ option for an idiom often cannot paraphrase it.

IED · MCQ accuracy 82.8

IEI · ROUGE-L 35.6

NCC · MCQ accuracy 71.0

NCI · ROUGE-L 28.4

CI · pair accuracy 74.2

LCI · ROUGE-L 30.4

Why this matters. Picking among four candidate readings is a discrimination task; producing a paraphrase requires the model to generate the right lexical-semantic content. The gap is consistent regardless of phrase type, suggesting a uniform competence ceiling on phrasal semantics rather than a phrase-specific blind spot.
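The size of the gap can be read straight off the six scores above. The sketch below divides each discrimination score by the matching gloss score. Since accuracy and ROUGE-L are on different scales, the ratio is only a descriptive summary, not a metric from the paper; the pairing labels are illustrative.

```python
# Scores from the bars above: (discrimination score, gloss ROUGE-L)
pairs = {
    "idioms (IED vs IEI)":         (82.8, 35.6),
    "noun compounds (NCC vs NCI)": (71.0, 28.4),
    "collocations (CI vs LCI)":    (74.2, 30.4),
}

# Ratio of discrimination score to gloss score for each phrase type.
ratios = {name: det / gloss for name, (det, gloss) in pairs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```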

02 · LCC accuracy collapses with category granularity

The clearest scaling effect in SemanticQA. As the lexical function taxonomy grows from binary to 16 classes, every model's accuracy drops monotonically — even closed frontier models lose more than 35 percentage points.

Accuracy on LCC as the lexical function label set is widened from 2 to 16 classes (zero-shot). The gradient is steepest in the 4 → 8 transition.

What this tells us. Lexical-functional categories are not a natural carving for current LMs. Models can tell a collocation from a non-collocation, but distinguishing Magn, Caus, Oper and Real requires a structural-semantic typology that is not well represented in pre-training distributions.
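When reading accuracy across label-set sizes, it helps to keep the chance level in view, since it also shrinks as categories are added. The sketch below assumes balanced classes, so that chance is 100 / k (this matches the 12.5 random baseline reported for the 8-category split), and rescales an accuracy so that chance maps to 0 and a perfect score to 1. The chance-adjusted score is an illustration only, not a metric the paper reports.

```python
def chance_level(num_classes: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_classes


def chance_adjusted(accuracy: float, num_classes: int) -> float:
    """Rescale accuracy so that chance -> 0.0 and a perfect score -> 1.0."""
    chance = chance_level(num_classes)
    return (accuracy - chance) / (100.0 - chance)


# Chance level at each granularity, assuming balanced classes.
for k in (2, 4, 8, 16):
    print(f"{k:>2} classes: chance = {chance_level(k):.2f}")

# GPT-5 on the 8-category split (62.1 in the headline table).
gpt5_8cat = chance_adjusted(62.1, 8)
print(f"GPT-5, 8-cat, chance-adjusted: {gpt5_8cat:.3f}")
```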

03 · ICL helps interpretation, hurts judgment

The effect of in-context learning is not uniform. Few-shot examples help generative tasks (IEI / NCI / LCI) — but for the multiple-choice judgment tasks (IED / NCC / CI), the best score is most often zero-shot.

Generative tasks · ICL helps

IEI · 0-shot 30.4

IEI · 5-shot 35.6

NCI · 0-shot 24.2

NCI · 5-shot 28.4

LCI · 0-shot 25.7

LCI · 5-shot 30.4

Judgment tasks · ICL hurts

IED · 0-shot 82.8

IED · 5-shot 78.4

NCC · 0-shot 71.0

NCC · 5-shot 66.5

CI · 0-shot 74.2

CI · 5-shot 70.1

Likely cause. Few-shot exemplars push generative tasks toward a sharper output distribution — useful for free-form gloss. But for closed-form judgment, the same exemplars introduce spurious priors that can flip otherwise correct decisions on borderline items.
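The direction of the effect is easy to verify from the twelve scores listed above. The sketch below computes the 5-shot minus 0-shot difference per task and groups the tasks as the text does; the grouping and variable names are illustrative.

```python
# Scores copied from the two panels above.
zero_shot = {"IEI": 30.4, "NCI": 24.2, "LCI": 25.7,
             "IED": 82.8, "NCC": 71.0, "CI": 74.2}
five_shot = {"IEI": 35.6, "NCI": 28.4, "LCI": 30.4,
             "IED": 78.4, "NCC": 66.5, "CI": 70.1}

generative = ("IEI", "NCI", "LCI")  # free-form gloss, ROUGE-L
judgment = ("IED", "NCC", "CI")     # closed-form choice, accuracy

# Positive delta: few-shot examples help. Negative delta: they hurt.
delta = {task: round(five_shot[task] - zero_shot[task], 1) for task in zero_shot}

for group_name, group in (("generative", generative), ("judgment", judgment)):
    for task in group:
        print(f"{group_name:<10} {task}: {delta[task]:+.1f}")
```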

04 · Sequential cascade error is real, and non-additive

When extraction is followed by interpretation in a single response, the joint score falls below even the product of the two standalone scores. For GPT-5 on IEE → IEI, 81.0 on extraction and 35.6 on interpretation would give 28.8 if the two steps failed independently; the observed joint score is 21.4.

Cascade	Step 1 extraction acc.	Step 2 downstream acc.	Expected independent	Observed joint	Δ
IEE → IEI	81.0	35.6	28.8	21.4	−7.4
LCE → LCI	73.4	30.4	22.3	14.1	−8.2
NCE → NCI	79.2	28.4	22.5	17.6	−4.9
IEE → IED	81.0	82.8	67.1	62.0	−5.1
LCE → CI	73.4	74.2	54.5	48.0	−6.5
NCE → NCC	79.2	71.0	56.2	52.7	−3.5

Numbers are GPT-5, zero-shot, on the SemanticQA test split. "Independent" assumes the two steps fail independently; "Δ" is observed minus that independence baseline.
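The "Expected independent" and "Δ" columns follow directly from the other three. The sketch below recomputes them from the table, taking the independence baseline to be the product of the two step scores (as the note above describes); the function and variable names are illustrative.

```python
# (cascade, step-1 score, step-2 score, observed joint score), all in percent,
# copied from the table above.
rows = [
    ("IEE → IEI", 81.0, 35.6, 21.4),
    ("LCE → LCI", 73.4, 30.4, 14.1),
    ("NCE → NCI", 79.2, 28.4, 17.6),
    ("IEE → IED", 81.0, 82.8, 62.0),
    ("LCE → CI",  73.4, 74.2, 48.0),
    ("NCE → NCC", 79.2, 71.0, 52.7),
]


def independence_baseline(step1: float, step2: float) -> float:
    """Joint score expected if the two steps succeed independently."""
    return step1 * step2 / 100.0


results = []
for name, step1, step2, observed in rows:
    expected = independence_baseline(step1, step2)
    # Delta is observed minus the independence baseline.
    results.append((name, round(expected, 1), round(observed - expected, 1)))

for name, expected, delta in results:
    print(f"{name:<10} expected={expected:5.1f}  delta={delta:+.1f}")
```

A negative Δ in every row means that the second step does worse on the model's own extractions than its standalone score would suggest.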

Implication. Standalone task scores systematically overestimate real-world phrasal competence. The most cascade-sensitive setting, collocation extraction followed by interpretation (LCE → LCI, Δ = −8.2), also has the most brittle lexical preferences.

05 · Closed leads, open closes — selectively

Closed frontier models (GPT-5, o3, Claude Sonnet 4.5, Gemini 3.1 Pro) hold the top of every column. Open-weight models close the gap on extraction-style tasks but lag on categorization and interpretation, where structured semantic knowledge matters most.

Where open closes

Span-level extraction

Top open models trail GPT-5 by less than 8 points on IEE / NCE / LCE. Extraction rewards string-pattern competence, which open models inherit from their training mix.

Where open lags

Fine-grained categorization

On 16-category LCC the gap stretches to 18+ points. Lexical-functional roles seem to require alignment data that open ecosystems still lack.
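The same pattern is already visible in the headline table, which reports only the 8-category LCC split. The sketch below measures, per column, how far the best open-weight model trails GPT-5. It assumes that every model not named above as a closed frontier model is open-weight; the variable names are illustrative.

```python
columns = ["IED", "IEI R-L", "LCC 8-cat", "LCE", "NCC", "NCI R-L", "VMWE"]

# Scores copied from the headline table.
gpt5 = [82.8, 35.6, 62.1, 73.4, 71.0, 28.4, 66.7]

# Assumption: these six models are the open-weight ones in the selection.
open_models = {
    "DeepSeek-R1":      [75.6, 31.1, 54.0, 68.3, 65.0, 25.2, 60.5],
    "Kimi K2 Instruct": [73.4, 29.5, 52.2, 66.7, 63.1, 24.4, 58.6],
    "GLM-4.6":          [72.0, 28.9, 51.0, 65.3, 62.0, 23.7, 57.2],
    "Qwen3-235B-A22B":  [71.2, 28.1, 49.6, 64.5, 61.0, 23.0, 56.4],
    "Gemma 3 27B IT":   [63.1, 23.4, 42.7, 56.3, 52.5, 19.6, 48.8],
    "Llama 3 8B":       [55.9, 19.0, 35.8, 49.2, 45.6, 16.2, 41.7],
}

# Best open-weight score in each column, then the gap to GPT-5.
best_open = [max(row[i] for row in open_models.values())
             for i in range(len(columns))]
gap = {col: round(top - best, 1)
       for col, top, best in zip(columns, gpt5, best_open)}
widest = max(gap, key=gap.get)

for col, points in gap.items():
    print(f"{col:<10} gap = {points:.1f}")
print("widest gap:", widest)
```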

Takeaways for benchmark builders

Single-metric leaderboards mislead. Phrasal semantics is multi-axis; a single number averages away precisely the structure you want to measure.
Composition is where models actually live. Sequential tasks expose failure modes that no single-step benchmark can.
Granularity is a stress test. Scaling category count is a cheap and discriminating probe — far more so than yet another harder MCQ.
Detection alone is not understanding. If your eval is detection-shaped, you are measuring discrimination, not semantics.

Full per-model tables ship with the codebase under semantic_qa/results/; the mean and SD script is scripts/calc_mean_sd.py.