Empirical findings

What 30 models tell us about phrasal semantics

Headlines first, then five readouts: where models lead, where they collapse, and where in-context learning helps versus hurts. All numbers below are taken from the paper's Table 5, Figures 4–6 and the appendix benchmark tables.

Best IED accuracy (GPT-5) 82.8
LCC drop · 2 → 16 categories −39 pp
Median cascade error inflation 66%
Models compared 31

Headline benchmark — selected models

Selected results across the four operation families. Rows ordered by overall semantic competence (mean of column means). All numbers are zero-shot unless noted.

Model               IED    IEI R-L  LCC 8-cat  LCE    NCC    NCI R-L  VMWE
GPT-5               82.8   35.6     62.1       73.4   71.0   28.4     66.7
OpenAI o3           81.5   34.1     60.2       72.0   69.6   27.7     65.1
Claude Sonnet 4.5   79.0   33.8     58.7       71.1   68.2   27.0     63.4
Gemini 3.1 Pro      78.2   32.7     57.3       70.5   67.4   26.4     62.8
DeepSeek-R1         75.6   31.1     54.0       68.3   65.0   25.2     60.5
Kimi K2 Instruct    73.4   29.5     52.2       66.7   63.1   24.4     58.6
GLM-4.6             72.0   28.9     51.0       65.3   62.0   23.7     57.2
Qwen3-235B-A22B     71.2   28.1     49.6       64.5   61.0   23.0     56.4
Gemma 3 27B IT      63.1   23.4     42.7       56.3   52.5   19.6     48.8
Llama 3 8B          55.9   19.0     35.8       49.2   45.6   16.2     41.7
Random baseline     25.0   -        12.5       -      25.0   -        -

IED · NCC report MCQ accuracy. IEI · NCI report ROUGE-L on a held-out gloss. LCC is reported on the 8-category split. LCE · VMWE report exact-match span F1.
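For readers who want to see what the generative columns measure, here is a minimal sketch of sentence-level ROUGE-L as an F-measure over the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and a balanced F1; the benchmark's actual scorer, tokenizer and weighting may differ, and the function names are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a generated gloss and a reference gloss.

    Assumption: lowercased whitespace tokens and beta = 1 (balanced F1).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical gloss pair, not taken from the benchmark.
    print(round(rouge_l_f1("to die", "to die suddenly"), 3))
```

Because the score rewards in-order token overlap with a single reference, a paraphrase that is semantically right but lexically different can still score low, which is worth keeping in mind when comparing these columns against MCQ accuracy.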

01 · Detection beats interpretation, every time

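Across all four phrase categories, surface-level discrimination scores far higher than producing a faithful gloss: in the three pairs listed here, MCQ or pair accuracy is 2.3 to 2.5 times the ROUGE-L of the matching interpretation task. The same model that picks the right MCQ option for an idiom often cannot paraphrase it.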

IED · MCQ accuracy 82.8
IEI · ROUGE-L 35.6
NCC · MCQ accuracy 71.0
NCI · ROUGE-L 28.4
CI · pair accuracy 74.2
LCI · ROUGE-L 30.4
Why this matters. Picking among four candidate readings is a discrimination task; producing a paraphrase requires the model to generate the right lexical-semantic content. The gap is consistent regardless of phrase type, suggesting a uniform competence ceiling on phrasal semantics rather than a phrase-specific blind spot.

02 · LCC accuracy collapses with category granularity

The clearest scaling effect in SemanticQA. As the lexical function taxonomy grows from binary to 16 classes, every model's accuracy drops monotonically — even closed frontier models lose more than 35 percentage points.

[Figure: LCC accuracy (%, 0 to 100) vs. category count (2, 4, 6, 8, 16). Series: GPT-5, DeepSeek-R1, Kimi K2, Gemma 3 27B.]
Accuracy on LCC as the lexical function label set is widened from 2 to 16 classes (zero-shot). The gradient is steepest in the 4 → 8 transition.
What this tells us. Lexical-functional categories are not a natural carving for current LMs. Models can tell collocation from not-collocation, but distinguishing Magn, Caus, Oper and Real requires structural-semantic typology that isn't well-represented in pre-training distributions.
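Part of the drop is mechanical: the chance floor falls as the label set widens. The sketch below computes the uniform-guessing baseline for each category count on the figure axis, assuming balanced classes (an assumption; the class balance of the LCC splits is not stated here). The 8-class value reproduces the 12.5 random baseline in the headline table.

```python
def chance_accuracy(num_classes):
    """Expected accuracy (in %) of uniform random guessing over num_classes labels."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 100.0 / num_classes


# Category counts as they appear on the figure axis.
CATEGORY_COUNTS = [2, 4, 6, 8, 16]
chance = {k: round(chance_accuracy(k), 2) for k in CATEGORY_COUNTS}

# Margin over chance for GPT-5 on the 8-category split (62.1 in the headline table).
gpt5_margin_8cat = round(62.1 - chance_accuracy(8), 1)

if __name__ == "__main__":
    for k, c in chance.items():
        print(f"{k:>2} classes: chance = {c:.2f}%")
    print(f"GPT-5 margin over chance at 8 classes: {gpt5_margin_8cat} pp")
```

Reading raw accuracy next to the chance floor separates the part of the drop that comes from a harder guessing problem from the part that reflects missing lexical-function knowledge.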

03 · ICL helps interpretation, hurts judgment

The effect of in-context learning is not uniform. Few-shot examples help generative tasks (IEI / NCI / LCI) — but for the multiple-choice judgment tasks (IED / NCC / CI), the best score is most often zero-shot.

Generative tasks · ICL helps

IEI · 0-shot 30.4
IEI · 5-shot 35.6
NCI · 0-shot 24.2
NCI · 5-shot 28.4
LCI · 0-shot 25.7
LCI · 5-shot 30.4

Judgment tasks · ICL hurts

IED · 0-shot 82.8
IED · 5-shot 78.4
NCC · 0-shot 71.0
NCC · 5-shot 66.5
CI · 0-shot 74.2
CI · 5-shot 70.1
Likely cause. Few-shot exemplars push generative tasks toward a sharper output distribution — useful for free-form gloss. But for closed-form judgment, the same exemplars introduce spurious priors that can flip otherwise correct decisions on borderline items.
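The size of the effect can be read off the scores listed above. This short sketch computes the 5-shot minus 0-shot difference per task and the mean per task type; the numbers are copied from the two lists above, and the dictionary layout and function names are illustrative.

```python
# task: (task type, 0-shot score, 5-shot score), copied from the lists above.
ICL_SCORES = {
    "IEI": ("generative", 30.4, 35.6),
    "NCI": ("generative", 24.2, 28.4),
    "LCI": ("generative", 25.7, 30.4),
    "IED": ("judgment", 82.8, 78.4),
    "NCC": ("judgment", 71.0, 66.5),
    "CI": ("judgment", 74.2, 70.1),
}


def icl_deltas(scores):
    """5-shot minus 0-shot, in points, per task."""
    return {task: round(five - zero, 1) for task, (_, zero, five) in scores.items()}


def mean_delta_by_kind(scores):
    """Average 5-shot minus 0-shot difference per task type."""
    grouped = {}
    for kind, zero, five in scores.values():
        grouped.setdefault(kind, []).append(five - zero)
    return {kind: round(sum(vals) / len(vals), 1) for kind, vals in grouped.items()}


deltas = icl_deltas(ICL_SCORES)
means = mean_delta_by_kind(ICL_SCORES)

if __name__ == "__main__":
    for task, delta in deltas.items():
        print(f"{task}: {delta:+.1f}")
    print(means)
```

On these six tasks the two effects are similar in magnitude and opposite in sign: about +4.7 points on average for the generative tasks and about −4.3 for the judgment tasks.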

04 · Sequential cascade error is real, and non-additive

When extraction is followed by interpretation in a single response, errors compound beyond what independent failure predicts. A model that is 80% on extraction and 35% on interpretation would score 28% in the cascade if the two steps failed independently (0.80 × 0.35); it scores 18.

Cascade      Step 1 (extraction acc.)  Step 2 (downstream acc.)  Expected (independent)  Observed (joint)  Δ
IEE → IEI    81.0                      35.6                      28.8                    21.4              −7.4
LCE → LCI    73.4                      30.4                      22.3                    14.1              −8.2
NCE → NCI    79.2                      28.4                      22.5                    17.6              −4.9
IEE → IED    81.0                      82.8                      67.1                    62.0              −5.1
LCE → CI     73.4                      74.2                      54.5                    48.0              −6.5
NCE → NCC    79.2                      71.0                      56.2                    52.7              −3.5

Numbers are GPT-5, zero-shot, on the SemanticQA test split. "Independent" assumes the two steps fail independently; "Δ" is observed minus that independence baseline.
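The independence baseline is just the product of the two standalone scores. The following sketch recomputes the "Expected" and "Δ" columns from the Step 1, Step 2 and Observed values in the table above; the variable and function names are illustrative and not taken from the released codebase.

```python
# (cascade, step-1 score, step-2 score, observed joint score), copied from the table above.
CASCADES = [
    ("IEE -> IEI", 81.0, 35.6, 21.4),
    ("LCE -> LCI", 73.4, 30.4, 14.1),
    ("NCE -> NCI", 79.2, 28.4, 17.6),
    ("IEE -> IED", 81.0, 82.8, 62.0),
    ("LCE -> CI", 73.4, 74.2, 48.0),
    ("NCE -> NCC", 79.2, 71.0, 52.7),
]


def independent_joint(step1, step2):
    """Joint score (in %) if the two steps succeed or fail independently."""
    return step1 * step2 / 100.0


def cascade_rows(cascades):
    """Return (name, expected, observed, delta) with delta = observed - expected."""
    rows = []
    for name, step1, step2, observed in cascades:
        expected = independent_joint(step1, step2)
        rows.append((name, round(expected, 1), observed, round(observed - expected, 1)))
    return rows


rows = cascade_rows(CASCADES)
# The cascade with the most negative delta is the most cascade-sensitive one.
worst = min(rows, key=lambda row: row[3])

if __name__ == "__main__":
    for name, expected, observed, delta in rows:
        print(f"{name:<11} expected={expected:5.1f} observed={observed:5.1f} delta={delta:+.1f}")
    print("largest shortfall:", worst[0])
```

Every Δ is negative, so the observed joint score falls below the independence baseline in all six cascades, with the collocation extract-then-interpret cascade showing the largest shortfall.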

Implication. Standalone task scores systematically over-estimate real-world phrasal competence. The most cascade-sensitive setting — collocation extract + interpret — also has the most brittle lexical preferences.

05 · Closed leads, open closes — selectively

Closed frontier models (GPT-5, o3, Claude Sonnet 4.5, Gemini 3.1 Pro) hold the top of every column. Open-weight models close the gap on extraction-style tasks but lag on categorization and interpretation, where structured semantic knowledge matters most.

Where open closes

Span-level extraction

Top open models trail GPT-5 by less than 8 points on IEE / NCE / LCE. Extraction rewards string-pattern competence, which open models inherit from their training mix.

Where open lags

Fine-grained categorization

On 16-category LCC the gap stretches to 18+ points. Lexical-functional roles seem to require alignment data that open ecosystems still lack.
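The same comparison can be made column by column on the headline table. This sketch takes the GPT-5 row and the DeepSeek-R1 row (the highest-ranked model in that table outside the four closed frontier models) and computes the per-column gap. The headline table only carries the 8-category LCC split, so the 16-category gap quoted above cannot be reproduced from it; names in the code are illustrative.

```python
# Column order and values as in the headline table.
COLUMNS = ["IED", "IEI R-L", "LCC 8-cat", "LCE", "NCC", "NCI R-L", "VMWE"]
GPT5 = [82.8, 35.6, 62.1, 73.4, 71.0, 28.4, 66.7]
DEEPSEEK_R1 = [75.6, 31.1, 54.0, 68.3, 65.0, 25.2, 60.5]


def column_gaps(leader, challenger, columns):
    """Leader minus challenger, in points, for each column."""
    return {col: round(a - b, 1) for col, a, b in zip(columns, leader, challenger)}


gaps = column_gaps(GPT5, DEEPSEEK_R1, COLUMNS)
widest = max(gaps, key=gaps.get)
narrowest = min(gaps, key=gaps.get)

if __name__ == "__main__":
    for col, gap in gaps.items():
        print(f"{col:<10} {gap:.1f}")
    print("widest gap:", widest, "| narrowest gap:", narrowest)
```

Even at 8 categories, LCC is the column with the widest gap in this pairing, while the span-extraction column LCE sits near 5 points.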

Takeaways for benchmark builders

  1. Single-metric leaderboards mislead. Phrasal semantics is multi-axis; a single number averages away precisely the structure you want to measure.
  2. Composition is where models actually live. Sequential tasks expose failure modes that no single-step benchmark can.
  3. Granularity is a stress test. Scaling category count is a cheap and discriminating probe — far more so than yet another harder MCQ.
  4. Detection alone is not understanding. If your eval is detection-shaped, you are measuring discrimination, not semantics.

Full per-model tables ship with the codebase under semantic_qa/results/; mean & SD scripts are in scripts/calc_mean_sd.py.