Revisiting a Pain in the Neck:
A Semantic Reasoning Benchmark for Language Models
1 University of Science and Technology Beijing · 2 Beijing Institute for General Artificial Intelligence
We introduce SemanticQA, an evaluation suite designed to encourage modern language models to engage in semantic reasoning rather than surface-form pattern matching. The benchmark covers twelve carefully controlled tasks across four categories of semantically rich, syntactically constrained phrasal constructions, plus six additional sequential compositions.
Existing benchmarks treat semantic phenomena one corpus at a time. SemanticQA puts idioms, lexical collocations, noun compounds, and verbal multiword expressions in the same evaluation frame, with a uniform task taxonomy (Detection · Extraction · Categorization · Interpretation) and a shared phrase-level lens. We further compose extraction with a downstream judgment or interpretation step to surface error cascades that single-step benchmarks miss.
Why another benchmark?
Semantic phrases (idioms like "a pain in the neck", collocations like "raise an alarm", noun compounds, and verbal MWEs) sit at the boundary where distributional cues stop being enough. They are the smallest units that demand composition, world knowledge, and context together. Yet existing studies typically isolate one phenomenon, one task, and one model family.
SemanticQA asks one question across the four phrase types:
How do language models behave when evaluated on phrasal constructs across distinct but structurally constrained task operations?
The answer is revealing. Models are stronger on detection than on interpretation, degrade sharply as categorization granularity grows, and show cascading errors when extraction is followed by interpretation. These failures are invisible to any single-step evaluation.
Three design principles
- Operation-aligned semantic evaluation. One taxonomy of operations is applied uniformly to every phrase type, so cross-phenomenon comparison is finally possible. Detection, extraction, categorization, and interpretation are evaluated on the same axes.
- Minimal & controlled design. Concise prompt templates avoid instruction-induced variance. Tasks vary in structural distinction alone, so what we measure is semantic competence, not prompt ingenuity.
- Diagnostic readout & cascade sensitivity. Per-phrase, per-operation scorecards expose where models break. Sequential extract→interpret tasks surface cascade-level failures that single-step evaluation cannot detect.
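One way to picture the per-phrase, per-operation scorecard described above is a small aggregation over item-level scores. The following is a minimal sketch, not the benchmark's actual harness: the record layout, the `build_scorecard` helper, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-item records: (phrase_type, operation, score in [0, 1]).
# The values are made up for illustration; they are not results from the paper.
records = [
    ("idiom", "detection", 1.0),
    ("idiom", "detection", 0.0),
    ("idiom", "interpretation", 0.5),
    ("collocation", "extraction", 1.0),
    ("collocation", "extraction", 1.0),
]


def build_scorecard(records):
    """Average item scores into a {phrase_type: {operation: mean_score}} grid."""
    cells = defaultdict(list)
    for phrase_type, operation, score in records:
        cells[(phrase_type, operation)].append(score)

    scorecard = defaultdict(dict)
    for (phrase_type, operation), scores in cells.items():
        scorecard[phrase_type][operation] = mean(scores)
    return {phrase: dict(ops) for phrase, ops in scorecard.items()}


scorecard = build_scorecard(records)
for phrase_type, ops in scorecard.items():
    for operation, score in ops.items():
        print(f"{phrase_type:12s} {operation:15s} {score:.2f}")
```

Keeping the cells keyed by phrase type and operation means an empty cell (a task that does not exist for a phrase type) is simply absent rather than reported as zero.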
The phrase × operation matrix
Twelve standalone tasks are arranged on a 4×4 grid of phrase type against semantic operation. Six sequential tasks compose extraction with a downstream judgment or interpretation step.
Idiom: "Pain in the neck"
Non-compositional. Detection · Extraction · Interpretation.
Lexical collocation: "Raise an alarm"
Lexically restricted. Categorization · Extraction · Interpretation · Retrieval · Identification.
Noun compound: "Rocket science"
Variable compositionality. Compositionality · Extraction · Interpretation.
Verbal MWE: "Take advantage of"
Headed by a verb. Extraction · Sequential judgment.
Tasks at a glance
| Task | Abbr. | Phrase | Operation | Eval metrics |
|---|---|---|---|---|
| Idiomatic Expression Detection | IED | Idiom | Detection | MCQ Accuracy |
| Idiomatic Expression Extraction | IEE | Idiom | Extraction | Exact Match |
| Idiomatic Expression Interpretation | IEI | Idiom | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU |
| Lexical Collocation Categorization | LCC | Collocation | Categorization | Accuracy · Macro / Micro / Weighted F1 |
| Lexical Collocation Extraction | LCE | Collocation | Extraction | Exact Match |
| Lexical Collocation Interpretation | LCI | Collocation | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU |
| Collocate Retrieval | CR | Collocation | Retrieval | Exact Match |
| Collocation Identification | CI | Collocation | Identification | Accuracy |
| Noun Compound Compositionality | NCC | Noun compound | Compositionality | MCQ Accuracy |
| Noun Compound Extraction | NCE | Noun compound | Extraction | Exact Match |
| Noun Compound Interpretation | NCI | Noun compound | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU |
| Verbal MWE Extraction | VMWE | Verbal MWE | Extraction | Exact Match |
Six sequential tasks complete the suite: each pairs extraction with a judgment or interpretation step across the idiom, collocation, and noun-compound branches.
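The Exact Match and F1 columns in the table can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: the `normalize` rule (lowercasing and whitespace collapsing) is an assumption rather than the benchmark's documented normalization, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (this normalization rule is an assumption)."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def f1_scores(predictions, references):
    """Macro, micro and weighted F1 for single-label classification."""
    labels = sorted(set(references) | set(predictions))
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gold in zip(predictions, references):
        if pred == gold:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[gold] += 1  # gold label gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    support = Counter(references)
    macro = sum(per_label.values()) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    weighted = sum(per_label[lab] * support[lab] for lab in labels) / len(references)
    return {"macro": macro, "micro": micro, "weighted": weighted}


gold = ["a", "a", "b", "c"]
pred = ["a", "b", "b", "c"]
scores = f1_scores(pred, gold)
print(exact_match(["Raise  an alarm"], ["raise an alarm"]), scores)
```

Macro F1 weights every category equally, which is why it is the more sensitive readout when a fine-grained taxonomy contains rare classes; micro F1 coincides with accuracy in the single-label setting.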
Six findings worth keeping
Detection > Interpretation
Across phrase types, surface-level discrimination is much easier than producing a faithful gloss. The gap holds even when the same model handles both ends.
Categorization collapses with granularity
LCC accuracy drops monotonically as the taxonomy grows from 2 → 16 categories. Even the strongest closed models lose more than half their accuracy at 16 classes.
ICL helps interpretation, hurts judgment
Few-shot prompting, i.e. in-context learning (ICL), consistently improves the generative tasks, but for the judgment tasks IED, NCC and CI the best performance is most often zero-shot.
Cascade failure is real
In sequential extract→interpret pipelines, errors compound non-additively — especially for collocations where the lexical preference itself is brittle.
Closed ≫ open, but the gap narrows
Closed frontier models lead overall; specialised open models close the gap on extraction but lag on categorization and interpretation.
Phrasal semantics is multi-dimensional
No single metric or task captures phrasal competence. SemanticQA argues for a multi-axis readout — and provides the harness for it.
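To make the cascade finding above concrete, one simple diagnostic is to split interpretation quality by whether the preceding extraction step succeeded. The sketch below assumes a hypothetical record layout (`extraction_correct`, `interpretation_score`) and uses invented numbers; it is not the paper's analysis code.

```python
from statistics import mean

# Hypothetical pipeline outputs: whether step 1 (extraction) matched the gold
# phrase, and a [0, 1] quality score for step 2 (interpretation).
# Numbers are illustrative only; they are not results from the paper.
items = [
    {"extraction_correct": True, "interpretation_score": 0.9},
    {"extraction_correct": True, "interpretation_score": 0.7},
    {"extraction_correct": False, "interpretation_score": 0.2},
    {"extraction_correct": False, "interpretation_score": 0.0},
]


def cascade_report(items):
    """Summarize how interpretation quality depends on extraction success."""
    ok = [it["interpretation_score"] for it in items if it["extraction_correct"]]
    bad = [it["interpretation_score"] for it in items if not it["extraction_correct"]]
    return {
        "extraction_accuracy": len(ok) / len(items),
        "interp_given_correct": mean(ok) if ok else None,
        "interp_given_wrong": mean(bad) if bad else None,
        "end_to_end": mean(it["interpretation_score"] for it in items),
    }


report = cascade_report(items)
for name, value in report.items():
    print(f"{name}: {value}")
```

A large gap between the two conditional means indicates that downstream errors are driven by the extraction step, which a standalone interpretation task with gold phrases would not reveal.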
Cite this work
SemanticQA (this paper)
@article{liu2026revisiting,
title = {Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models},
author = {Liu, Yang and Li, Hongming and Qin, Melissa Xiaohui and Liu, Qiankun and Huang, Chao},
journal= {arXiv preprint arXiv:2604.16593},
year = {2026}
}
Earlier benchmark (LexBench, 2024)
@article{liu2024revisiting,
title = {Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models},
author = {Liu, Yang and Qin, Melissa Xiaohui and Li, Hongming and Huang, Chao},
journal= {arXiv preprint arXiv:2405.02861},
year = {2024}
}