What 30 models tell us about phrasal semantics
Headlines first, then five readouts: where models lead, where they collapse, and where in-context learning helps versus hurts. All numbers below are taken from the paper's Table 5, Figures 4–6 and the appendix benchmark tables.
Headline benchmark — selected models
Selected results across the four operation families. Rows ordered by overall semantic competence (mean of column means). All numbers are zero-shot unless noted.
| Model | IED | IEI R-L | LCC 8-cat | LCE | NCC | NCI R-L | VMWE |
|---|---|---|---|---|---|---|---|
| GPT-5 | 82.8 | 35.6 | 62.1 | 73.4 | 71.0 | 28.4 | 66.7 |
| OpenAI o3 | 81.5 | 34.1 | 60.2 | 72.0 | 69.6 | 27.7 | 65.1 |
| Claude Sonnet 4.5 | 79.0 | 33.8 | 58.7 | 71.1 | 68.2 | 27.0 | 63.4 |
| Gemini 3.1 Pro | 78.2 | 32.7 | 57.3 | 70.5 | 67.4 | 26.4 | 62.8 |
| DeepSeek-R1 | 75.6 | 31.1 | 54.0 | 68.3 | 65.0 | 25.2 | 60.5 |
| Kimi K2 Instruct | 73.4 | 29.5 | 52.2 | 66.7 | 63.1 | 24.4 | 58.6 |
| GLM-4.6 | 72.0 | 28.9 | 51.0 | 65.3 | 62.0 | 23.7 | 57.2 |
| Qwen3-235B-A22B | 71.2 | 28.1 | 49.6 | 64.5 | 61.0 | 23.0 | 56.4 |
| Gemma 3 27B IT | 63.1 | 23.4 | 42.7 | 56.3 | 52.5 | 19.6 | 48.8 |
| Llama 3 8B | 55.9 | 19.0 | 35.8 | 49.2 | 45.6 | 16.2 | 41.7 |
| Random baseline | 25.0 | — | 12.5 | — | 25.0 | — | — |
IED · NCC report MCQ accuracy. IEI · NCI report ROUGE-L on a held-out gloss. LCC is reported on the 8-category split. LCE · VMWE report exact-match span F1.
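For readers unfamiliar with the gloss metric, here is a minimal sketch of ROUGE-L in its F1 form, computed from the longest common subsequence (LCS) of candidate and reference tokens. Lowercased whitespace tokenization and the function names `lcs_length` and `rouge_l_f1` are assumptions of this sketch; the benchmark's own scorer may tokenize, stem, or weight precision and recall differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (tokenization is an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only, not benchmark data.
score = rouge_l_f1("to die", "to die suddenly")
print(f"{score:.2f}")
```

Because the score rewards shared word order rather than meaning, a correct paraphrase with different wording can still score low, which is worth keeping in mind when reading the R-L columns.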
01 · Detection beats interpretation, every time
Across all four phrase categories, surface-level discrimination scores roughly twice as high as producing a faithful gloss (for GPT-5, 82.8 accuracy on IED against 35.6 ROUGE-L on IEI). The same model that picks the right MCQ option for an idiom often cannot paraphrase it.
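The size of that gap can be read straight off the headline table. The sketch below divides each detection score by the matching interpretation score for the top and bottom rows; the helper name `detection_to_gloss_ratio` is illustrative. Since detection is MCQ accuracy and interpretation is ROUGE-L, the ratio is only an indicative comparison, not a like-for-like one.

```python
# Zero-shot scores copied from the headline table: (detection, interpretation).
# IE = (IED, IEI R-L), NC = (NCC, NCI R-L).
PAIRS = {
    "GPT-5": {"IE": (82.8, 35.6), "NC": (71.0, 28.4)},
    "Llama 3 8B": {"IE": (55.9, 19.0), "NC": (45.6, 16.2)},
}


def detection_to_gloss_ratio(detection: float, interpretation: float) -> float:
    """How many times higher the detection score is than the gloss score."""
    return detection / interpretation


for model, pairs in PAIRS.items():
    for family, (det, interp) in pairs.items():
        ratio = detection_to_gloss_ratio(det, interp)
        print(f"{model:<11} {family}: {ratio:.2f}x")
```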
02 · LCC accuracy collapses with category granularity
This is the clearest scaling effect in SemanticQA. As the lexical function taxonomy grows from binary to 16 classes, every model's accuracy drops monotonically; even closed frontier models lose more than 35 percentage points.
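When comparing accuracy across granularities, it helps to remember that the chance level moves too: uniform guessing over k classes scores 100/k percent, which matches the 12.5 random baseline reported for the 8-category split. The sketch below prints the chance level for the class counts named above and rescales one table value against chance. The chance-corrected rescaling is an assumption of this sketch, not a metric the benchmark is stated to report.

```python
def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy (percent) of uniform random guessing over num_classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return 100.0 / num_classes


def chance_corrected(accuracy: float, num_classes: int) -> float:
    """Rescale accuracy so that 0.0 is chance and 1.0 is perfect (an assumed convention)."""
    chance = chance_accuracy(num_classes)
    return (accuracy - chance) / (100.0 - chance)


for k in (2, 8, 16):
    print(f"{k:>2} classes: chance = {chance_accuracy(k):.2f}%")

# GPT-5 on the 8-category split, taken from the headline table.
print(f"GPT-5, 8-cat, chance-corrected: {chance_corrected(62.1, 8):.3f}")
```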
03 · ICL helps interpretation, hurts judgment
The effect of in-context learning is not uniform. Few-shot examples help the generative tasks (IEI / NCI / LCI), but on the multiple-choice judgment tasks (IED / NCC / CI) the best score is most often zero-shot.
[Figure: two panels, "Generative tasks · ICL helps" and "Judgment tasks · ICL hurts".]
04 · Sequential cascade error is real, and non-additive
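When extraction is followed by interpretation in a single response, errors compound beyond what independent failures would predict. If the two steps failed independently, joint accuracy would be the product of the two step accuracies; the observed joint accuracy is lower still. GPT-5 scores 81.0 on IEE and 35.6 on IEI, so independence predicts 28.8, yet the observed joint score is 21.4.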
| Cascade | Step 1 extraction acc. | Step 2 downstream acc. | Expected independent | Observed joint | Δ |
|---|---|---|---|---|---|
| IEE → IEI | 81.0 | 35.6 | 28.8 | 21.4 | −7.4 |
| LCE → LCI | 73.4 | 30.4 | 22.3 | 14.1 | −8.2 |
| NCE → NCI | 79.2 | 28.4 | 22.5 | 17.6 | −4.9 |
| IEE → IED | 81.0 | 82.8 | 67.1 | 62.0 | −5.1 |
| LCE → CI | 73.4 | 74.2 | 54.5 | 48.0 | −6.5 |
| NCE → NCC | 79.2 | 71.0 | 56.2 | 52.7 | −3.5 |
Numbers are GPT-5, zero-shot, on the SemanticQA test split. "Independent" assumes the two steps fail independently; "Δ" is observed minus that independence baseline.
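The "Expected independent" and "Δ" columns can be reproduced from the other three. The sketch below treats scores as percentages, multiplies the two step accuracies to get the independence baseline, and subtracts it from the observed joint score. The `Cascade` class and its field names are illustrative, not taken from the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cascade:
    name: str
    step1: float     # extraction accuracy, percent
    step2: float     # downstream accuracy, percent
    observed: float  # observed joint accuracy, percent

    @property
    def expected_independent(self) -> float:
        """Joint accuracy if the two steps failed independently."""
        return self.step1 * self.step2 / 100.0

    @property
    def delta(self) -> float:
        """Observed joint accuracy minus the independence baseline."""
        return self.observed - self.expected_independent


# Values copied from the cascade table (GPT-5, zero-shot).
CASCADES = [
    Cascade("IEE → IEI", 81.0, 35.6, 21.4),
    Cascade("LCE → LCI", 73.4, 30.4, 14.1),
    Cascade("NCE → NCI", 79.2, 28.4, 17.6),
    Cascade("IEE → IED", 81.0, 82.8, 62.0),
    Cascade("LCE → CI", 73.4, 74.2, 48.0),
    Cascade("NCE → NCC", 79.2, 71.0, 52.7),
]

for c in CASCADES:
    print(
        f"{c.name:<10} expected={c.expected_independent:5.1f} "
        f"observed={c.observed:5.1f} delta={c.delta:+5.1f}"
    )
```

A negative delta in every row means the second step does worse on the spans the model itself extracted than its standalone score would suggest, so the two failures are positively correlated rather than independent.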
05 · Closed leads, open closes — selectively
Closed frontier models (GPT-5, o3, Claude Sonnet 4.5, Gemini 3.1 Pro) hold the top of every column. Open-weight models close the gap on extraction-style tasks but lag on categorization and interpretation, where structured semantic knowledge matters most.
Span-level extraction
Top open models trail GPT-5 by less than 8 points on IEE / NCE / LCE. Extraction rewards string-pattern competence, which open models inherit from their training mix.
Fine-grained categorization
On 16-category LCC the gap stretches to 18+ points. Lexical-functional roles seem to require alignment data that open ecosystems still lack.
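For the columns that appear in the headline table, the closed-versus-open gap can be checked directly. The sketch below takes the best closed score and the best remaining score per column and reports the difference. Treating every row outside the four named closed models as open-weight is an assumption of this sketch, and the table covers only LCE and 8-category LCC, not IEE, NCE, or 16-category LCC.

```python
COLUMNS = ["IED", "IEI R-L", "LCC 8-cat", "LCE", "NCC", "NCI R-L", "VMWE"]

# Rows copied from the headline table (zero-shot).
SCORES = {
    "GPT-5": [82.8, 35.6, 62.1, 73.4, 71.0, 28.4, 66.7],
    "OpenAI o3": [81.5, 34.1, 60.2, 72.0, 69.6, 27.7, 65.1],
    "Claude Sonnet 4.5": [79.0, 33.8, 58.7, 71.1, 68.2, 27.0, 63.4],
    "Gemini 3.1 Pro": [78.2, 32.7, 57.3, 70.5, 67.4, 26.4, 62.8],
    "DeepSeek-R1": [75.6, 31.1, 54.0, 68.3, 65.0, 25.2, 60.5],
    "Kimi K2 Instruct": [73.4, 29.5, 52.2, 66.7, 63.1, 24.4, 58.6],
    "GLM-4.6": [72.0, 28.9, 51.0, 65.3, 62.0, 23.7, 57.2],
    "Qwen3-235B-A22B": [71.2, 28.1, 49.6, 64.5, 61.0, 23.0, 56.4],
    "Gemma 3 27B IT": [63.1, 23.4, 42.7, 56.3, 52.5, 19.6, 48.8],
    "Llama 3 8B": [55.9, 19.0, 35.8, 49.2, 45.6, 16.2, 41.7],
}

# The four models the text names as closed; the rest are assumed open-weight.
CLOSED = {"GPT-5", "OpenAI o3", "Claude Sonnet 4.5", "Gemini 3.1 Pro"}


def closed_open_gaps(scores: dict[str, list[float]]) -> dict[str, float]:
    """Best closed score minus best open score, per column."""
    gaps = {}
    for i, column in enumerate(COLUMNS):
        best_closed = max(row[i] for name, row in scores.items() if name in CLOSED)
        best_open = max(row[i] for name, row in scores.items() if name not in CLOSED)
        gaps[column] = best_closed - best_open
    return gaps


GAPS = closed_open_gaps(SCORES)
for column, gap in GAPS.items():
    print(f"{column:<10} gap = {gap:.1f}")
```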
Takeaways for benchmark builders
- Single-metric leaderboards mislead. Phrasal semantics is multi-axis; a single number averages away precisely the structure you want to measure.
- Composition is where models actually live. Sequential tasks expose failure modes that no single-step benchmark can.
- Granularity is a stress test. Scaling category count is a cheap and discriminating probe — far more so than yet another harder MCQ.
- Detection alone is not understanding. If your eval is detection-shaped, you are measuring discrimination, not semantics.
Full per-model tables ship with the codebase under semantic_qa/results/; mean & SD scripts are in scripts/calc_mean_sd.py.