Revisiting a Pain in the Neck:
A Semantic Reasoning Benchmark for Language Models

Yang Liu¹,²,*, Hongming Li¹,*, Melissa Xiaohui Qin¹, Qiankun Liu¹, Chao Huang¹,†

¹ University of Science and Technology Beijing · ² Beijing Institute for General Artificial Intelligence

Too long; Don't read

We introduce SemanticQA, an evaluation suite designed to push modern language models toward semantic reasoning rather than surface-form pattern matching. The benchmark covers twelve carefully controlled tasks across four categories of semantically rich, syntactically constrained phrasal constructions, plus six sequential compositions.

Existing benchmarks treat semantic phenomena one corpus at a time. SemanticQA puts idioms, lexical collocations, noun compounds, and verbal multiword expressions in the same evaluation frame, with a uniform task taxonomy (Detection · Extraction · Categorization · Interpretation) and a shared phrase-level lens. We further compose extraction with a downstream judgment or interpretation step to surface error cascades that single-step benchmarks miss.

Why another benchmark?

Semantic phrases — idioms like "a pain in the neck", collocations like "raise an alarm", noun compounds, and verbal MWEs — sit at the boundary where distributional cues stop being enough. They are the smallest units that demand composition, world knowledge, and context together. Yet existing studies typically isolate one phenomenon, one task, and one model family.

SemanticQA asks one question across the four phrase types:

How do language models behave when evaluated on phrasal constructs across distinct but structurally constrained task operations?

The answer turns out to be revealing. Models are stronger on detection than on interpretation, degrade rapidly as categorization granularity grows, and exhibit cascading error patterns when extraction is followed by interpretation. These cascading failures are invisible to any single-step evaluation.

Three design principles

Operation-aligned semantic evaluation

One taxonomy of operations applied uniformly to every phrase type, so cross-phenomenon comparison is finally possible. Detection, extraction, categorization and interpretation are evaluated on the same axes.

Minimal & controlled design

Concise prompt templates avoid instruction-induced variance. Tasks differ only in their structural operation, so what we measure is semantic competence, not prompt ingenuity.

Diagnostic readout & cascade sensitivity

Per-phrase, per-operation scorecards expose where models break. Sequential extract→interpret tasks let cascade-level failures surface that single-step evaluation cannot detect.

The phrase × operation matrix

Twelve standalone tasks populate a 4×4 grid of phrase type against semantic operation, so not every cell is occupied. Six sequential tasks compose extraction with a downstream judgment or interpretation step.

Two-stage sunburst: inner ring marks the four phrase categories; outer ring shows the twelve standalone tasks, colored by operation type. Sequential tasks (not pictured) chain extraction with a downstream operation.
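
As a rough illustration of the phrase × operation layout, the sketch below indexes the twelve standalone tasks by (phrase type, operation). The abbreviations, phrase types and operation labels are copied from the "Tasks at a glance" table on this page. The names `TASKS`, `build_grid` and `tasks_per_phrase` are hypothetical helpers for illustration and are not part of any released SemanticQA code.

```python
from collections import defaultdict

# (abbreviation, phrase type, operation), copied from the "Tasks at a glance" table.
TASKS = [
    ("IED", "Idiom", "Detection"),
    ("IEE", "Idiom", "Extraction"),
    ("IEI", "Idiom", "Interpretation"),
    ("LCC", "Collocation", "Categorization"),
    ("LCE", "Collocation", "Extraction"),
    ("LCI", "Collocation", "Interpretation"),
    ("CR", "Collocation", "Retrieval"),
    ("CI", "Collocation", "Identification"),
    ("NCC", "Noun compound", "Compositionality"),
    ("NCE", "Noun compound", "Extraction"),
    ("NCI", "Noun compound", "Interpretation"),
    ("VMWE", "Verbal MWE", "Extraction"),
]


def build_grid(tasks):
    """Map each (phrase type, operation) cell to the task abbreviations in it."""
    grid = defaultdict(list)
    for abbr, phrase, operation in tasks:
        grid[(phrase, operation)].append(abbr)
    return dict(grid)


def tasks_per_phrase(tasks):
    """Count how many standalone tasks each phrase type contributes."""
    counts = defaultdict(int)
    for _abbr, phrase, _operation in tasks:
        counts[phrase] += 1
    return dict(counts)


if __name__ == "__main__":
    # Print one line per occupied cell, then the per-phrase totals.
    for (phrase, operation), abbrs in build_grid(TASKS).items():
        print(f"{phrase:14s} {operation:17s} {', '.join(abbrs)}")
    print(tasks_per_phrase(TASKS))
```

Keying the grid on the pair rather than nesting dictionaries keeps empty cells implicit, which makes it easy to see which phrase types lack a given operation.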

Idioms

Pain in the neck

Non-compositional. Detection · Extraction · Interpretation.

Collocations

Raise an alarm

Lexically-restricted. Categorization · Extraction · Interpretation · Retrieval · Identification.

Noun compounds

Rocket science

Variable compositionality. Compositionality · Extraction · Interpretation.

Verbal MWEs

Take advantage of

Headed by a verb. Extraction · Sequential judgment.

Tasks at a glance

Task	Abbr.	Phrase	Operation	Eval metrics
Idiomatic Expression Detection	IED	Idiom	Detection	MCQ Accuracy
Idiomatic Expression Extraction	IEE	Idiom	Extraction	Exact Match
Idiomatic Expression Interpretation	IEI	Idiom	Interpretation	ROUGE-L · BERTScore-F1 · METEOR · BLEU
Lexical Collocation Categorization	LCC	Collocation	Categorization	Accuracy · Macro / Micro / Weighted F1
Lexical Collocation Extraction	LCE	Collocation	Extraction	Exact Match
Lexical Collocation Interpretation	LCI	Collocation	Interpretation	ROUGE-L · BERTScore-F1 · METEOR · BLEU
Collocate Retrieval	CR	Collocation	Retrieval	Exact Match
Collocation Identification	CI	Collocation	Identification	Accuracy
Noun Compound Compositionality	NCC	Noun compound	Compositionality	MCQ Accuracy
Noun Compound Extraction	NCE	Noun compound	Extraction	Exact Match
Noun Compound Interpretation	NCI	Noun compound	Interpretation	ROUGE-L · BERTScore-F1 · METEOR · BLEU
Verbal MWE Extraction	VMWE	Verbal MWE	Extraction	Exact Match

Plus six sequential tasks: extraction + judgment / interpretation across idiom, collocation and noun-compound branches.
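
To make two of the metrics in the table concrete, here is a minimal sketch of Exact Match and macro-averaged F1 in plain Python. The normalization step (lowercasing and collapsing whitespace) is an assumption made for illustration; this page does not specify how the benchmark normalizes strings, and the function names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization, not the paper's)."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    labels = sorted(set(predictions) | set(references))
    per_class = []
    for label in labels:
        tp = sum(p == label and r == label for p, r in zip(predictions, references))
        fp = sum(p == label and r != label for p, r in zip(predictions, references))
        fn = sum(p != label and r == label for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 by convention here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy inputs invented for illustration only.
    print(exact_match(["A pain in  the neck", "raise alarm"],
                      ["a pain in the neck", "raise an alarm"]))
    print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

Macro F1 weights every class equally, which is why it is reported alongside micro and weighted F1 for categorization: rare categories pull the macro score down even when overall accuracy looks healthy.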

Five findings worth keeping

Finding 01

Detection > Interpretation

Across phrase types, surface-level discrimination is much easier than producing a faithful gloss. The gap holds even when the same model handles both ends.

Finding 02

Categorization collapses with granularity

LCC accuracy drops monotonically as the taxonomy grows from 2 to 16 categories. Even the strongest closed models lose more than half their accuracy at 16 classes.

Finding 03

ICL helps interpretation, hurts judgment

Few-shot prompting consistently improves generative tasks, but for IED, NCC and CI the best performance is most often zero-shot.

Finding 04

Cascade failure is real

In sequential extract→interpret pipelines, errors compound non-additively — especially for collocations where the lexical preference itself is brittle.
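
One way to make a cascade visible is to condition the second step on the outcome of the first. The sketch below is an illustration under stated assumptions, not the paper's scoring procedure: it assumes each example has already been reduced to two booleans (extraction correct, interpretation acceptable), whereas interpretation in the benchmark is scored with graded metrics. `StepResult` and `cascade_report` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StepResult:
    extraction_correct: bool       # did step 1 recover the right phrase?
    interpretation_correct: bool   # assumed boolean judgment of step 2


def _rate(flags: List[bool]) -> Optional[float]:
    """Mean of boolean flags, or None when the group is empty."""
    return sum(flags) / len(flags) if flags else None


def cascade_report(results: List[StepResult]) -> Dict[str, Optional[float]]:
    """Split step-2 accuracy by whether step 1 succeeded, plus end-to-end accuracy."""
    after_hit = [r.interpretation_correct for r in results if r.extraction_correct]
    after_miss = [r.interpretation_correct for r in results if not r.extraction_correct]
    both = [r.extraction_correct and r.interpretation_correct for r in results]
    return {
        "extraction_acc": _rate([r.extraction_correct for r in results]),
        "interp_acc_given_correct_extraction": _rate(after_hit),
        "interp_acc_given_wrong_extraction": _rate(after_miss),
        "end_to_end_acc": _rate(both),
    }


if __name__ == "__main__":
    # Toy outcomes invented for illustration only.
    toy = [
        StepResult(True, True),
        StepResult(True, True),
        StepResult(True, False),
        StepResult(False, False),
        StepResult(False, True),
    ]
    for name, value in cascade_report(toy).items():
        print(name, value)
```

A large gap between the two conditional rates indicates that interpretation quality depends on the extraction step, which a single-step score for either task alone would not reveal.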

Finding 05

Closed ≫ open, but the gap narrows

Closed frontier models lead overall; specialized open models close the gap on extraction but lag on categorization and interpretation.

Takeaway

Phrasal semantics is multi-dimensional

No single metric or task captures phrasal competence. SemanticQA argues for a multi-axis readout — and provides the harness for it.

Cite this work

SemanticQA (this paper)

@article{liu2026revisiting,
  title  = {Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models},
  author = {Liu, Yang and Li, Hongming and Qin, Melissa Xiaohui and Liu, Qiankun and Huang, Chao},
  journal= {arXiv preprint arXiv:2604.16593},
  year   = {2026}
}

Earlier benchmark (LexBench, 2024)

@article{liu2024revisiting,
  title  = {Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models},
  author = {Liu, Yang and Qin, Melissa Xiaohui and Li, Hongming and Huang, Chao},
  journal= {arXiv preprint arXiv:2405.02861},
  year   = {2024}
}
