GRASP: Deterministic Argument Ranking in Interaction Graphs

Diganta Misra (1,2,3,4) · Antonio Orvieto (1,2,3) · Rediet Abebe (1,2,3) · Volkan Cevher (5)
(1) MPI-IS Tübingen · (2) Tüb. AI Center · (3) ELLIS Institute · (4) Uni. Tübingen · (5) LIONS, EPFL
TL;DR: LLM judges disagree sharply when asked to rank arguments holistically. GRASP replaces opaque global verdicts with local pairwise attack/defense judgments aggregated through a convergent graph propagation operator, producing rankings that are more reproducible, auditable, and structurally grounded.
Figure 1. Pairwise Pearson correlation of attack-weight matrices $W$ across six judge models (lower triangle), averaged across all 50 pool debates. Models substantially agree on relative attack patterns, providing a reproducible substrate for structural aggregation.
Figure 2. Per-debate mean Kendall $\tau$ of GRASP rankings vs. mean Pearson similarity of $W$ matrices across pool debates. Higher $W$ agreement predicts higher GRASP ranking consistency, confirming structural reproducibility translates to ranking stability.

Abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging — a common LLM-as-a-Judge practice where a model provides a global verdict on a debate — suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score.

To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack–defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency — a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.

Method

Weighted Interaction Graphs

GRASP operates on a weighted interaction graph $\mathcal{G} = (A, W, D)$, where arguments $A = \{a_1, \ldots, a_n\}$ are nodes. The attack matrix $W \in [0,1]^{n \times n}$ encodes how strongly argument $a_i$ attacks $a_j$, and the defense matrix $D \in [0,\infty)^{n \times n}$ captures the extent to which $a_k$ defends $a_j$ against its attackers.

LLM judges score every ordered pair $(a_i, a_j)$ — yielding a dense attack matrix $W$. GRASP converts $W$ into a global ranking by propagating strengths through the interaction graph.

The GRASP Update Rule

Starting from a uniform strength vector $\mathbf{s}^{(0)} = \mathbf{1}$, GRASP iterates a damped nonlinear operator that balances the weakening effect of attacks against the reinforcing effect of defense:

$$s_j^{(t)} = (1-\gamma)\,s_j^{(t-1)} + \gamma \cdot \frac{1 + \beta\sum_k D_{kj}\,s_k^{(t-1)}}{1 + \alpha\sum_i W_{ij}\,s_i^{(t-1)}}$$

Hyperparameters $\alpha, \beta \geq 0$ control attack and defense influence; $\gamma \in (0,1]$ is the damping factor. The default instantiation sets $D = W^2$ (two-hop defense), $\alpha = 1.0$, $\beta = 0.6$, $\gamma = 0.9$.
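The update rule can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `matmul` and `grasp`, the stopping tolerance `tol`, the iteration cap `max_iter`, and the three-argument toy graph are all hypothetical choices made for the example.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def grasp(W, alpha=1.0, beta=0.6, gamma=0.9, tol=1e-10, max_iter=1000):
    """Iterate the damped update; W[i][j] is how strongly a_i attacks a_j.

    tol and max_iter are assumptions of this sketch, not values from the source.
    """
    n = len(W)
    D = matmul(W, W)   # default two-hop defense, D = W^2
    s = [1.0] * n      # uniform start, s^(0) = 1
    for _ in range(max_iter):
        new = []
        for j in range(n):
            defense = sum(D[k][j] * s[k] for k in range(n))
            attack = sum(W[i][j] * s[i] for i in range(n))
            target = (1.0 + beta * defense) / (1.0 + alpha * attack)
            new.append((1.0 - gamma) * s[j] + gamma * target)
        done = max(abs(x - y) for x, y in zip(new, s)) < tol
        s = new
        if done:
            break
    return s


# Hypothetical toy chain: c attacks b, b attacks a (indices a=0, b=1, c=2).
W = [[0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.8, 0.0]]
scores = grasp(W)
# Sort argument indices by descending strength to obtain the ranking.
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
```

In this toy chain the unattacked argument c stays at the baseline, b is weakened by c, and a recovers most of its strength because c defends it through the two-hop term, so the ranking is c, a, b.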

Convergence Guarantee

Theorem 3.1. Let $G$ denote the update map defined by the rule above, so that $\mathbf{s}^{(t)} = G(\mathbf{s}^{(t-1)})$. If $\alpha \leq \tfrac{1}{4\|W\|_1}$ and $\beta \leq \tfrac{1}{4\|D\|_1}$, then $G$ is a contraction on $\mathcal{S} = \{s \in \mathbb{R}^n \mid \|s-\mathbf{1}\|_\infty \leq 1\}$ and the iteration converges to a unique fixed point $s^* \in \mathcal{S}$.
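To see what the condition asks for in practice, the sketch below computes the largest $\alpha$ and $\beta$ it permits for a given graph. It assumes that $\|\cdot\|_1$ is the induced matrix 1-norm (maximum absolute column sum); the page does not say which norm is meant, so treat that reading, the helper names, and the toy matrices as assumptions.

```python
def induced_one_norm(M):
    """Maximum absolute column sum (assumed reading of the norm in Theorem 3.1)."""
    n = len(M)
    return max(sum(abs(M[i][j]) for i in range(n)) for j in range(n))


def sufficient_bounds(W, D):
    """Largest alpha and beta allowed by the sufficient condition."""
    def bound(M):
        norm = induced_one_norm(M)
        # A zero matrix imposes no constraint on its coefficient.
        return float("inf") if norm == 0 else 1.0 / (4.0 * norm)
    return bound(W), bound(D)


# Hypothetical toy chain: c attacks b, b attacks a, with D = W^2.
W = [[0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.8, 0.0]]
D = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.64, 0.0, 0.0]]
alpha_max, beta_max = sufficient_bounds(W, D)
```

For this toy graph the bounds are 0.3125 for $\alpha$ and about 0.39 for $\beta$, both below the default values. Under the assumed norm, the theorem as stated therefore does not cover the defaults here; the condition is sufficient for convergence, not necessary.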

Interactive GRASP toy (figure placeholder). Red edges = attacks; blue dashed edges = $D=W^2$ defense.
Convergence trajectory for the toy example — rank position of each argument across GRASP iterations.

Structural Sufficiency

GRASP operationalizes structural sufficiency: argument $a$ is structurally sufficient in $\mathcal{G}$ if all attacks on $a$ are themselves counter-attacked within the graph. Formally, let $R^- \subseteq A \times A$ denote the attack relation. We write $\text{SS}(a;\mathcal{G})$ when $a$ is structurally sufficient. This notion is distinct from persuasiveness, factuality, or rhetorical appeal.
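In the unweighted setting the definition reduces to a simple set check. The sketch below is one possible reading, with the function name `structurally_sufficient` and the example relation chosen purely for illustration.

```python
def structurally_sufficient(a, attacks):
    """True if every attacker of `a` is itself attacked by some argument.

    `attacks` is a set of (attacker, target) pairs, i.e. the relation R^-.
    """
    attackers = {b for (b, target) in attacks if target == a}
    attacked = {target for (_, target) in attacks}
    return attackers <= attacked


# Hypothetical chain: c attacks b, b attacks a.
R = {("b", "a"), ("c", "b")}
ss_a = structurally_sufficient("a", R)  # b is counter-attacked by c
ss_b = structurally_sufficient("b", R)  # c is not attacked by anyone
ss_c = structurally_sufficient("c", R)  # no attackers, vacuously sufficient
```

Here a and c are structurally sufficient while b is not, which mirrors the ordering GRASP assigns to the same chain in the weighted setting.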

| Axiom | Formal statement |
| --- | --- |
| S1 · Attack Sensitivity | $\forall\, b : (b,a)\in R^- \;\Rightarrow\; \exists\, c : (c,b)\in R^-$ is required for $\text{SS}(a;\mathcal{G})$. In the weighted setting: $\neg\text{SS}(a;\mathcal{G})$ when $\sum_i W_{ia} s_i$ is large relative to available defense. |
| S2 · Defense Reinstatement | $(c,b)\in R^-$ and $(b,a)\in R^-$ $\Rightarrow$ $(b,a)$ is neutralized and does not invalidate $\text{SS}(a;\mathcal{G})$. GRASP encodes this via $D_{ca} = \sum_b W_{cb} W_{ba}$ (two-hop defense). |
| S3 · Structural Locality | $\text{SS}(a;\mathcal{G})$ depends only on $a$'s connected component in $(A, R^-)$. Arguments outside the component have zero influence on $s_a^*$. |
| S4 · Baseline Sufficiency | $\forall\, b : (b,a)\notin R^- \;\Rightarrow\; \text{SS}(a;\mathcal{G})$. In GRASP, attack-free arguments converge to $s_a^* = 1$ (the baseline), independent of $\alpha, \beta$. |
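
As a quick check of S4 under the default $D = W^2$: if $a$ has no attackers, then $W_{ia} = 0$ for every $i$, so $D_{ka} = \sum_b W_{kb} W_{ba} = 0$ for every $k$, and the update for $a$ collapses to

$$s_a^{(t)} = (1-\gamma)\,s_a^{(t-1)} + \gamma \cdot \frac{1 + \beta \cdot 0}{1 + \alpha \cdot 0} = (1-\gamma)\,s_a^{(t-1)} + \gamma.$$

Starting from $s_a^{(0)} = 1$ this gives $s_a^{(t)} = 1$ for all $t$, with no dependence on $\alpha$ or $\beta$. This sketch relies on the two-hop choice of $D$; for a different defense matrix with $D_{ka} \neq 0$ the conclusion need not hold.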

Results

We evaluate on STRUCTDEBATE — 300 debates (7,000 arguments) across 50 real-world motions, balanced by stance and semantic angle. Six judge models provide pairwise attack scores; GRASP aggregates them into global rankings.

Headline result: GRASP reaches an inter-model agreement of Kendall's $\tau \approx 0.62$, versus 0.31 for RAW judging, across 300 debates, 50 motions, 7,000 arguments, and 6 LLM judge models.

Table 1. Inter-model agreement (Kendall's $\tau$) for Pool (P) and Multi-turn (MT) settings. $\Delta\tau$ improvements over RAW are shown in parentheses. Bold = best; shaded = default GRASP variant.

| Method | $\tau$ (P) ↑ | $\tau$ (MT) ↑ | $\rho$ (P) ↑ | $\rho$ (MT) ↑ | Top-3 (P) ↑ | Top-3 (MT) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| RAW | 0.337 | 0.309 | 0.425 | 0.380 | 0.385 | 0.410 |
| RAW+SS | 0.393 | 0.064 | 0.405 | 0.038 | 0.208 | 0.251 |
| GRASP | 0.623 (+.286) | 0.626 (+.317) | 0.780 | 0.779 | 0.509 | 0.487 |
| GRASP-W | 0.623 (+.286) | 0.626 (+.317) | 0.781 | 0.779 | 0.509 | 0.486 |
| GRASP-W1 | 0.604 (+.267) | 0.607 (+.298) | 0.780 | 0.780 | 0.528 | 0.585 |
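
For readers unfamiliar with the agreement metric, the sketch below computes Kendall's $\tau$ between two rankings and averages it over all judge pairs. It assumes rankings without ties (tau-a) and assumes that inter-model agreement means the mean over unordered judge pairs; the page does not spell out either detail, and the three judge rankings are made up for illustration.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a between two rankings without ties.

    rank_x[i] and rank_y[i] are the rank positions of argument i.
    """
    n = len(rank_x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mean_pairwise_tau(rankings):
    """Average tau over all unordered pairs of judges (assumed aggregation)."""
    pairs = list(combinations(rankings, 2))
    return sum(kendall_tau(x, y) for x, y in pairs) / len(pairs)


# Hypothetical rank positions of four arguments from three judges.
judges = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
agreement = mean_pairwise_tau(judges)
```

With these made-up rankings the three pairwise values are 2/3, 2/3, and 1/3, so the mean agreement is 5/9.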
Figure 3. Per-debate mean Kendall $\tau$ vs. mean Pearson $r$ of $W$ matrices across pool debates. Higher $W$ similarity (x-axis) predicts higher GRASP ranking agreement — confirming structural reproducibility translates to ranking stability.
Figure 4. Iterations to convergence vs. W-matrix density across all 50 pool debates, per judge model. Denser matrices require fewer iterations — convergence is faster when the graph is more uniformly connected.
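
The $W$-matrix similarity reported in the figures can be approximated as follows. This is a sketch under an assumption: each matrix is flattened over its off-diagonal entries before the Pearson correlation is computed. How the diagonal is handled in the actual evaluation is not stated on this page, and the matrices below are hypothetical.

```python
from math import sqrt


def pearson_w(W1, W2):
    """Pearson correlation of two attack matrices over off-diagonal entries.

    Raises ZeroDivisionError if either matrix is constant off the diagonal.
    """
    n = len(W1)
    x = [W1[i][j] for i in range(n) for j in range(n) if i != j]
    y = [W2[i][j] for i in range(n) for j in range(n) if i != j]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical judges: judge_b rescales judge_a's scores, judge_c inverts them.
judge_a = [[0.0, 0.2, 0.9], [0.5, 0.0, 0.1], [0.7, 0.3, 0.0]]
judge_b = [[0.0, 0.2, 0.55], [0.35, 0.0, 0.15], [0.45, 0.25, 0.0]]
judge_c = [[0.0, 0.8, 0.1], [0.5, 0.0, 0.9], [0.3, 0.7, 0.0]]
r_same_pattern = pearson_w(judge_a, judge_b)
r_opposite = pearson_w(judge_a, judge_c)
```

Because Pearson correlation is invariant to affine rescaling, two judges who use the score range differently but agree on the relative attack pattern still correlate perfectly, which is the sense in which $W$ matrices can agree even when absolute scores differ.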

Case Study: GRASP vs. RAW on a Single Argument

Consider the motion "This House would break up dominant technology monopolies" (debate mt_048). The argument "Dominant technology monopolies stifle innovation by controlling access to critical platforms and data, discouraging competition through predatory practices, and reducing incentives to improve products. Breaking them up would enable smaller firms to enter the market, accelerate technological advancement, and diversify the development of new tools and services." is ranked #1 by GRASP under five of the six judges and #2 under the sixth, yet under direct RAW judging its rank ranges from 5 to 20 (median 14).

GRASP Rankings (highly consistent)
claude-haiku-4.5: #1
deepseek-v3.2: #1
gemini-3-flash: #1
llama-4-scout: #1
gpt-5.2-chat: #1
mimo-v2-flash: #2

RAW Rankings (wide disagreement)
claude-haiku-4.5: #7
deepseek-v3.2: #17
gemini-3-flash: #11
llama-4-scout: #5
gpt-5.2-chat: #20
mimo-v2-flash: #19
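
The spread in the two sets of ranks can be summarized with a few lines of Python. The rank values are copied from the case study; the helper name `spread` is an illustrative choice.

```python
from statistics import median

# Ranks of the case-study argument under each judge, as listed in the case study.
grasp_ranks = {
    "claude-haiku-4.5": 1, "deepseek-v3.2": 1, "gemini-3-flash": 1,
    "llama-4-scout": 1, "gpt-5.2-chat": 1, "mimo-v2-flash": 2,
}
raw_ranks = {
    "claude-haiku-4.5": 7, "deepseek-v3.2": 17, "gemini-3-flash": 11,
    "llama-4-scout": 5, "gpt-5.2-chat": 20, "mimo-v2-flash": 19,
}


def spread(ranks):
    """Return (best rank, worst rank, median rank) across judges."""
    values = sorted(ranks.values())
    return values[0], values[-1], median(values)


grasp_summary = spread(grasp_ranks)  # (1, 2, 1.0)
raw_summary = spread(raw_ranks)      # (5, 20, 14.0)
```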

STRUCTDEBATE Explorer

Live visualization of all 300 debates — 50 pool + 250 multi-turn — pairwise attack matrices from 6 LLM judges, and GRASP rankings with rank dynamics. No backend required.
