Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging — a common LLM-as-a-Judge practice where a model provides a global verdict on a debate — suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score.
To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack–defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency — a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
GRASP operates on a weighted interaction graph $\mathcal{G} = (A, W, D)$, where the arguments $A = \{a_1, \ldots, a_n\}$ are the nodes. Entry $W_{ij}$ of the attack matrix $W \in [0,1]^{n \times n}$ encodes how strongly argument $a_i$ attacks $a_j$, and entry $D_{kj}$ of the defense matrix $D \in [0,\infty)^{n \times n}$ captures the extent to which $a_k$ defends $a_j$ against its attackers.
LLM judges score every ordered pair $(a_i, a_j)$, yielding a dense attack matrix $W$. GRASP converts $W$ into a global ranking by propagating strengths through the interaction graph.
Starting from a uniform strength vector $\mathbf{s}^{(0)} = \mathbf{1}$, GRASP iterates a damped nonlinear operator that balances the weakening effect of attacks against the reinforcing effect of defense.
Hyperparameters $\alpha, \beta \geq 0$ control attack and defense influence; $\gamma \in (0,1]$ is the damping factor. The default instantiation sets $D = W^2$ (two-hop defense), $\alpha = 1.0$, $\beta = 0.6$, $\gamma = 0.9$.
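The iteration scheme can be illustrated with a minimal sketch. The exact update operator is not reproduced in this text, so the per-argument target `F` below is a hypothetical stand-in chosen only to be nonlinear, to weaken with attack, to recover with defense, and to leave attack-free arguments at the baseline of 1. The function names `matmul` and `propagate`, the tolerance, and the toy matrix are likewise assumptions made for illustration. What the sketch does take from the text is the uniform start, the two-hop defense $D = W^2$, the damping by $\gamma$, and the default hyperparameter values.

```python
def matmul(X, Y):
    """Plain-Python matrix product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def propagate(W, alpha=1.0, beta=0.6, gamma=0.9, tol=1e-10, max_iter=1000):
    """Damped fixed-point iteration s <- (1 - gamma) * s + gamma * F(s).

    F is an ASSUMED stand-in, not the paper's operator:
        F(s)_j = 1 / (1 + alpha * att_j / (1 + beta * def_j))
    with att_j = sum_i W[i][j] * s[i] and def_j = sum_k D[k][j] * s[k].
    """
    n = len(W)
    D = matmul(W, W)   # default two-hop defense, D = W^2
    s = [1.0] * n      # uniform start, s^(0) = 1
    for _ in range(max_iter):
        att = [sum(W[i][j] * s[i] for i in range(n)) for j in range(n)]
        dfn = [sum(D[k][j] * s[k] for k in range(n)) for j in range(n)]
        target = [1.0 / (1.0 + alpha * att[j] / (1.0 + beta * dfn[j]))
                  for j in range(n)]
        new_s = [(1.0 - gamma) * s[j] + gamma * target[j] for j in range(n)]
        if max(abs(new_s[j] - s[j]) for j in range(n)) < tol:
            return new_s
        s = new_s
    return s


# Toy chain: a0 attacks a1 (0.8), a1 attacks a2 (0.5); a0 thereby defends a2.
W = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0]]
strengths = propagate(W)
# Rank arguments from strongest to weakest.
ranking = sorted(range(len(W)), key=lambda j: -strengths[j])
print(strengths, ranking)
```

In this toy chain the unattacked argument stays at the baseline, the directly attacked argument is weakened most, and the defended argument lands in between, so the ranking is `[0, 2, 1]` under the assumed operator.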
Theorem 3.1. If $\alpha \leq \tfrac{1}{4\|W\|_1}$ and $\beta \leq \tfrac{1}{4\|D\|_1}$, then the update operator $G$ is a contraction on $\mathcal{S} = \{\mathbf{s} \in \mathbb{R}^n \mid \|\mathbf{s}-\mathbf{1}\|_\infty \leq 1\}$, and the iteration converges to a unique fixed point $\mathbf{s}^* \in \mathcal{S}$.
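To make the condition of Theorem 3.1 concrete, the sketch below computes the largest $\alpha$ and $\beta$ it admits for a given pair of matrices. It assumes that $\|\cdot\|_1$ denotes the induced matrix 1-norm (maximum absolute column sum); if an entrywise norm is intended, the bounds would differ. The helper names `induced_one_norm` and `contraction_bounds` and the toy matrices are hypothetical.

```python
def induced_one_norm(M):
    """Induced 1-norm: maximum absolute column sum (an assumed reading of ||.||_1)."""
    n = len(M)
    return max(sum(abs(M[i][j]) for i in range(n)) for j in range(n))


def contraction_bounds(W, D):
    """Largest alpha and beta satisfying the sufficient condition of Theorem 3.1."""
    w_norm = induced_one_norm(W)
    d_norm = induced_one_norm(D)
    # A zero matrix imposes no constraint on its coefficient.
    alpha_max = float("inf") if w_norm == 0 else 1.0 / (4.0 * w_norm)
    beta_max = float("inf") if d_norm == 0 else 1.0 / (4.0 * d_norm)
    return alpha_max, beta_max


# Toy attack matrix and its two-hop defense matrix D = W^2 (written out by hand).
W = [[0.0, 0.8, 0.0],
     [0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0]]
D = [[0.0, 0.0, 0.4],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
alpha_max, beta_max = contraction_bounds(W, D)
print(alpha_max, beta_max)
```

For this toy graph the bounds are roughly $\alpha \leq 0.3125$ and $\beta \leq 0.625$, so the default $\beta = 0.6$ satisfies the condition while the default $\alpha = 1.0$ does not. The theorem states a sufficient condition, so this alone does not imply that the iteration fails to converge.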
GRASP operationalizes structural sufficiency: argument $a$ is structurally sufficient in $\mathcal{G}$ if all attacks on $a$ are themselves counter-attacked within the graph. Formally, let $R^- \subseteq A \times A$ denote the attack relation. We write $\text{SS}(a;\mathcal{G})$ when $a$ is structurally sufficient. This notion is distinct from persuasiveness, factuality, or rhetorical appeal.
| Axiom | Formal statement |
|---|---|
| S1 · Attack Sensitivity | $\forall\, b : (b,a)\in R^- \;\Rightarrow\; \exists\, c : (c,b)\in R^-$ is required for $\text{SS}(a;\mathcal{G})$. In the weighted setting: $\neg\text{SS}(a;\mathcal{G})$ when $\sum_i W_{ia} s_i$ is large relative to available defense. |
| S2 · Defense Reinstatement | $(c,b)\in R^-$ and $(b,a)\in R^-$ $\Rightarrow$ $(b,a)$ is neutralized and does not invalidate $\text{SS}(a;\mathcal{G})$. GRASP encodes this via $D_{ca} = \sum_b W_{cb} W_{ba}$ (two-hop defense). |
| S3 · Structural Locality | $\text{SS}(a;\mathcal{G})$ depends only on $a$'s connected component in $(A, R^-)$. Arguments outside the component have zero influence on $s_a^*$. |
| S4 · Baseline Sufficiency | $\big(\forall\, b : (b,a)\notin R^-\big) \;\Rightarrow\; \text{SS}(a;\mathcal{G})$, i.e., an argument with no attackers is structurally sufficient. In GRASP, attack-free arguments converge to $s_a^* = 1$ (the baseline), independent of $\alpha, \beta$. |
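The unweighted reading of structural sufficiency can be checked directly on an attack relation. The sketch below covers only the Boolean definition (every attacker of $a$ is itself attacked), not the weighted scores that GRASP computes; the function names and the three-argument example are hypothetical.

```python
def attackers(x, attacks):
    """Set of arguments b such that (b, x) is in the attack relation."""
    return {b for (b, target) in attacks if target == x}


def structurally_sufficient(a, attacks):
    """True if every attacker of `a` is itself attacked by some argument.

    An argument with no attackers is trivially sufficient (axiom S4).
    """
    return all(attackers(b, attacks) for b in attackers(a, attacks))


# c attacks b, and b attacks a: c reinstates a (axiom S2).
attacks = {("c", "b"), ("b", "a")}
for arg in ("a", "b", "c"):
    print(arg, structurally_sufficient(arg, attacks))
```

Here `a` is sufficient because its only attacker `b` is counter-attacked by `c`, `b` is not sufficient because `c` is unopposed, and `c` is sufficient because nothing attacks it.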
We evaluate on STRUCTDEBATE — 300 debates (7,000 arguments) across 50 real-world motions, balanced by stance and semantic angle. Six judge models provide pairwise attack scores; GRASP aggregates them into global rankings.
Table 1. Inter-model agreement (Kendall's $\tau$) for Pool (P) and Multi-turn (MT) settings. $\Delta\tau$ improvements over RAW are shown in parentheses. Bold = best; shaded = default GRASP variant.
| Method | $\tau$ (P) ↑ | $\tau$ (MT) ↑ | $\rho$ (P) ↑ | $\rho$ (MT) ↑ | Top-3 (P) ↑ | Top-3 (MT) ↑ |
|---|---|---|---|---|---|---|
| RAW | 0.337 | 0.309 | 0.425 | 0.380 | 0.385 | 0.410 |
| RAW+SS | 0.393 | 0.064 | 0.405 | 0.038 | 0.208 | 0.251 |
| GRASP | **0.623** (+.286) | **0.626** (+.317) | 0.780 | 0.779 | 0.509 | 0.487 |
| GRASP-W∞ | **0.623** (+.286) | **0.626** (+.317) | **0.781** | 0.779 | 0.509 | 0.486 |
| GRASP-W1 | 0.604 (+.267) | 0.607 (+.298) | 0.780 | **0.780** | **0.528** | **0.585** |
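As a rough illustration of how an inter-model agreement score of this kind can be computed, the sketch below averages Kendall's $\tau$ over all pairs of judges. It assumes the tau-a variant on tie-free rank vectors and a plain mean over judge pairs; the aggregation actually used for Table 1 is not specified in this text, and the three rank vectors are invented toy data rather than results.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a between two rank vectors over the same items (no ties)."""
    n = len(rank_x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mean_pairwise_tau(rankings):
    """Average tau over all unordered pairs of judges."""
    pairs = list(combinations(rankings, 2))
    return sum(kendall_tau(x, y) for x, y in pairs) / len(pairs)


# Toy data: rank assigned to each of four arguments by three hypothetical judges.
judge_ranks = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
print(mean_pairwise_tau(judge_ranks))
```

Each pair of judges contributes one $\tau$ value in $[-1, 1]$, so a higher mean indicates that the judges order the same arguments more consistently.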
For the motion "This House would break up dominant technology monopolies" (debate mt_048), the argument "Dominant technology monopolies stifle innovation by controlling access to critical platforms and data, discouraging competition through predatory practices, and reducing incentives to improve products. Breaking them up would enable smaller firms to enter the market, accelerate technological advancement, and diversify the development of new tools and services." is ranked near-unanimously #1 by GRASP across all judges, yet its rank ranges from 1 to 20 (median ≈ 14) under direct RAW judging.
Live visualization of all 300 debates (50 pool + 250 multi-turn), with pairwise attack matrices from 6 LLM judges and GRASP rankings with rank dynamics. No backend required.