Revisiting a Pain in the Neck:
A Semantic Reasoning Benchmark for Language Models

Yang Liu1,2,*, Hongming Li1,*, Melissa Xiaohui Qin1, Qiankun Liu1, Chao Huang1,†

1 University of Science and Technology Beijing   ·   2 Beijing Institute for General Artificial Intelligence

Too long; Don't read

We introduce SemanticQA, an evaluation suite designed to encourage modern language models to engage in semantic reasoning rather than surface-form pattern matching. The benchmark covers twelve carefully controlled tasks across four categories of semantically rich, syntactically constrained phrasal constructions, plus six additional sequential compositions.

Existing benchmarks treat semantic phenomena one corpus at a time. SemanticQA puts idioms, lexical collocations, noun compounds, and verbal multiword expressions in the same evaluation frame, with a uniform task taxonomy (Detection · Extraction · Categorization · Interpretation) and a shared phrase-level lens. We further compose extraction with a downstream judgment or interpretation step to surface error cascades that single-step benchmarks miss.

12 standalone tasks · +6 sequential compositions · 4 phrase categories · 31 models evaluated

Why another benchmark?

Semantic phrases — idioms like "a pain in the neck", collocations like "raise an alarm", noun compounds, and verbal MWEs — sit at the boundary where distributional cues stop being enough. They are the smallest units that demand composition, world knowledge, and context together. Yet existing studies typically isolate one phenomenon, one task, and one model family.

SemanticQA asks one question across the four phrase types:

How do language models behave when evaluated on phrasal constructs across distinct but structurally constrained task operations?

The answer turns out to be revealing. Models are stronger on detection than on interpretation, collapse rapidly as categorization granularity grows, and exhibit cascading errors when extraction is followed by interpretation. Such cascading failures are invisible to any single-step evaluation.

Three design principles

  1. Operation-aligned semantic evaluation

    One taxonomy of operations applied uniformly to every phrase type, so cross-phenomenon comparison is finally possible. Detection, extraction, categorization and interpretation are evaluated on the same axes.

  2. Minimal & controlled design

    Concise prompt templates avoid instruction-induced variance. Tasks vary in structural distinction alone, so what we measure is semantic competence — not prompt ingenuity.

  3. Diagnostic readout & cascade sensitivity

    Per-phrase, per-operation scorecards expose where models break. Sequential extract→interpret tasks let cascade-level failures surface that single-step evaluation cannot detect.

The phrase × operation matrix

Twelve standalone tasks fill a 4×4 grid of phrase type against semantic operation. Six sequential tasks compose extraction with a downstream judgment or interpretation step.
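
As a rough illustration of the grid, the sketch below indexes the twelve standalone tasks by phrase type and operation. The task tuples are transcribed from the task list on this page; `TASKS`, `build_grid`, and `GRID` are illustrative names assumed here, not part of any released harness.

```python
from collections import defaultdict

# (abbreviation, phrase type, operation), transcribed from the task list on this page.
TASKS = [
    ("IED", "Idiom", "Detection"),
    ("IEE", "Idiom", "Extraction"),
    ("IEI", "Idiom", "Interpretation"),
    ("LCC", "Collocation", "Categorization"),
    ("LCE", "Collocation", "Extraction"),
    ("LCI", "Collocation", "Interpretation"),
    ("CR", "Collocation", "Retrieval"),
    ("CI", "Collocation", "Identification"),
    ("NCC", "Noun compound", "Compositionality"),
    ("NCE", "Noun compound", "Extraction"),
    ("NCI", "Noun compound", "Interpretation"),
    ("VMWE", "Verbal MWE", "Extraction"),
]


def build_grid(tasks):
    """Index tasks as grid[phrase_type][operation] -> abbreviation."""
    grid = defaultdict(dict)
    for abbr, phrase, operation in tasks:
        grid[phrase][operation] = abbr
    return {phrase: dict(ops) for phrase, ops in grid.items()}


GRID = build_grid(TASKS)

# Print one row per phrase type with the operations it covers.
for phrase, ops in GRID.items():
    print(f"{phrase}: " + ", ".join(f"{op}={abbr}" for op, abbr in ops.items()))
```

Keeping the grid as plain data makes it easy to see which phrase types lack a given operation, which is the kind of cross-phenomenon comparison the shared taxonomy is meant to support.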

[Figure: two-stage sunburst of the twelve SemanticQA standalone tasks. The inner ring marks the four phrase categories (idioms, collocations, noun compounds, verbal MWEs); the outer ring shows the twelve standalone tasks, colored by operation type (detection, extraction, interpretation, categorization). Sequential tasks (not pictured) chain extraction with a downstream operation.]

Idioms

Pain in the neck

Non-compositional. Detection · Extraction · Interpretation.

Collocations

Raise an alarm

Lexically-restricted. Categorization · Extraction · Interpretation · Retrieval · Identification.

Noun compounds

Rocket science

Variable compositionality. Compositionality · Extraction · Interpretation.

Verbal MWEs

Take advantage of

Headed by a verb. Extraction · Sequential judgment.

Tasks at a glance

Task | Abbr. | Phrase | Operation | Eval metrics
Idiomatic Expression Detection | IED | Idiom | Detection | MCQ Accuracy
Idiomatic Expression Extraction | IEE | Idiom | Extraction | Exact Match
Idiomatic Expression Interpretation | IEI | Idiom | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU
Lexical Collocation Categorization | LCC | Collocation | Categorization | Accuracy · Macro / Micro / Weighted F1
Lexical Collocation Extraction | LCE | Collocation | Extraction | Exact Match
Lexical Collocation Interpretation | LCI | Collocation | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU
Collocate Retrieval | CR | Collocation | Retrieval | Exact Match
Collocation Identification | CI | Collocation | Identification | Accuracy
Noun Compound Compositionality | NCC | Noun compound | Compositionality | MCQ Accuracy
Noun Compound Extraction | NCE | Noun compound | Extraction | Exact Match
Noun Compound Interpretation | NCI | Noun compound | Interpretation | ROUGE-L · BERTScore-F1 · METEOR · BLEU
Verbal MWE Extraction | VMWE | Verbal MWE | Extraction | Exact Match

Plus six sequential tasks: extraction + judgment / interpretation across idiom, collocation and noun-compound branches.
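
For the extraction tasks scored by Exact Match, a minimal sketch of the metric might look like the following. The normalization step (lowercasing, stripping punctuation, collapsing whitespace) is an assumption made for illustration; this page does not state how predictions are normalized, and the function names are hypothetical.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def exact_match_rate(predictions, references) -> float:
    """Fraction of examples whose prediction exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


preds = ["A pain in the neck.", "rocket"]
golds = ["a pain in the neck", "rocket science"]
print(exact_match_rate(preds, golds))
```

Under this assumed normalization a partial span such as "rocket" for "rocket science" counts as a miss, so the metric rewards recovering the whole phrase rather than a fragment of it.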

Five findings worth keeping

Finding 01

Detection > Interpretation

Across phrase types, surface-level discrimination is much easier than producing a faithful gloss. The gap holds even when the same model handles both ends.

Finding 02

Categorization collapses with granularity

LCC accuracy drops monotonically as the taxonomy grows from 2 → 16 categories. Even the strongest closed models lose more than half their accuracy at 16 classes.
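
LCC is reported with accuracy alongside macro, micro, and weighted F1, and these can diverge as the number of classes grows. The sketch below computes the three F1 variants from scratch for single-label predictions. It is an illustrative implementation under that assumption, not the benchmark's scoring code, and the labels "A" and "B" are placeholders.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Macro, micro, and support-weighted F1 for single-label classification."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    support = Counter(gold)
    return {
        "macro": sum(per_label.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_label[lab] * support[lab] for lab in labels) / len(gold),
    }


scores = f1_scores(gold=["A", "A", "A", "B"], pred=["A", "A", "B", "B"])
print({name: round(value, 3) for name, value in scores.items()})
```

In the single-label setting micro F1 equals plain accuracy, while macro F1 weights every class equally, so it is the variant most sensitive to rare categories in a fine-grained taxonomy.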

Finding 03

ICL helps interpretation, hurts judgment

Few-shot prompting consistently improves generative tasks, but for IED, NCC and CI the best performance is most often zero-shot.

Finding 04

Cascade failure is real

In sequential extract→interpret pipelines, errors compound non-additively — especially for collocations where the lexical preference itself is brittle.
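
One way to make cascades visible is to score the downstream step both end to end and conditioned on a correct extraction. The sketch below assumes each example yields two binary outcomes; binarizing an interpretation score this way is an assumption made for illustration, and `StepResult` and `cascade_report` are hypothetical names rather than the benchmark's protocol.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    extraction_ok: bool  # stage 1: was the phrase extracted correctly?
    downstream_ok: bool  # stage 2: was the judgment / interpretation correct?


def cascade_report(results):
    """Summarize a two-stage pipeline: per-stage and end-to-end accuracy."""
    n = len(results)
    if n == 0:
        raise ValueError("results must be non-empty")
    extracted = [r for r in results if r.extraction_ok]
    end_to_end = sum(r.extraction_ok and r.downstream_ok for r in results)
    conditional = (
        sum(r.downstream_ok for r in extracted) / len(extracted) if extracted else 0.0
    )
    return {
        "extraction_acc": len(extracted) / n,
        "downstream_given_extraction": conditional,
        "end_to_end_acc": end_to_end / n,
    }


results = [
    StepResult(True, True),
    StepResult(True, False),
    StepResult(False, False),
    StepResult(False, True),
]
print(cascade_report(results))
```

Comparing `end_to_end_acc` with `downstream_given_extraction` separates errors inherited from extraction from errors made by the downstream step itself, which a single-step score cannot do.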

Finding 05

Closed ≫ open, but the gap narrows

Closed frontier models lead overall; specialised open models close the gap on extraction but lag on categorization and interpretation.

Takeaway

Phrasal semantics is multi-dimensional

No single metric or task captures phrasal competence. SemanticQA argues for a multi-axis readout — and provides the harness for it.

Cite this work

SemanticQA (this paper)

@article{liu2026revisiting,
  title   = {Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models},
  author  = {Liu, Yang and Li, Hongming and Qin, Melissa Xiaohui and Liu, Qiankun and Huang, Chao},
  journal = {arXiv preprint arXiv:2604.16593},
  year    = {2026}
}

Earlier benchmark (LexBench, 2024)

@article{liu2024revisiting,
  title   = {Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models},
  author  = {Liu, Yang and Qin, Melissa Xiaohui and Li, Hongming and Huang, Chao},
  journal = {arXiv preprint arXiv:2405.02861},
  year    = {2024}
}