JWC

1.1 Problem: Quantization Degrades AI Reasoning—Benchmarks Miss It

When a law firm subscribes to an AI-powered legal research platform, computing costs drive the monthly invoice. Behind that invoice lies an engineering decision: run the model at full precision, or compress weights into 4-bit integers that quadruple throughput on the same hardware. Users gain speed; vendors cut costs. For vendors operating at scale, quantization is standard deployment practice.

Prior work suggests this compression is benign. Jin et al. found that 4-bit models "maintain performance comparable to their non-quantized counterparts" across MMLU, summarization, and arithmetic reasoning. Red Hat reported that quantized Llama 3.1 models recover 96–99% of baseline scores with "no discernible differences" in typical use.

But these benchmarks omit legal reasoning tasks. They ignore hallucination rates under legal-specific taxonomies. They bypass the multi-step doctrinal analysis—tracking exceptions, applying multi-factor tests, distinguishing precedent—that characterizes competent legal work.

Other work raises concern. Li et al. found up to 32% accuracy degradation on mathematical reasoning benchmarks while general text performance stayed above 95%. Liu et al. confirmed that quantization below 8-bit creates "significant accuracy risks" in reasoning-intensive models. The mechanism is understood: complex reasoning relies on rare, high-magnitude weight directions that aggressive compression clips first.

This study tests whether quantization creates a silent defect zone in legal reasoning, a regime where accuracy collapses while fluency remains stable, and whether post-training quantization (PTQ) methods produce materially different outcomes at identical bit-depths.

The central premise: legal reasoning is fragile in ways general benchmarks miss. Quantization may selectively degrade reasoning while leaving surface fluency intact—a divergence where the AI sounds competent but reasons poorly.

2.1 Research Execution as Integrated Competency

Before testing whether quantization degrades legal AI, we must define what legal AI should do. This Section frames legal work not as independent skills but as Research Execution—the integrated competency of completing a legal research task from question to answer.

Three authoritative sources converge on this construct:

MacCrate Report (ABA, 1992): Identifies legal research as a fundamental lawyering skill, emphasizing a "coherent and effective research design" that integrates issue identification, source selection, and strategic execution.

AALL Principles and Standards for Legal Research Competency (2013): Principle IV states that "a successful legal researcher applies information effectively to resolve a specific issue or need." The standard elaborates: the competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."

Shultz & Zedeck empirical study of lawyer effectiveness (2011): Their 26 "Lawyering Effectiveness Factors" are job performance measures derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?" Relevant factors include "Fact Finding," "Researching the Law," and "Integrity & Honesty."

These sources describe a workflow that succeeds or fails as a unit—not seven independent skills tested in isolation. A lawyer who finds the right case but extracts the wrong holding has failed. A lawyer who synthesizes brilliantly but cites fabricated authority has produced worthless work product.

2.2 The Seven Components of Research Execution

Research Execution comprises seven sequential operations that lawyers perform when completing a legal research task:

Component	Operation	What It Tests
Known Authority	Resolve a known citation to correct case metadata	Can the system retrieve specific authorities?
Unknown Authority	Retrieve relevant law from a fact pattern or legal issue	Can the system find applicable precedent?
Validate Authority	Determine if authority remains good law	Can the system detect overruling or adverse treatment?
Fact Extraction	Extract disposition, holding, and outcome from opinion text	Can the system identify legally relevant facts?
Distinguish Cases	Determine whether precedent applies or can be distinguished	Can the system reason about doctrinal relationships?
Synthesize Results	Integrate authorities into coherent IRAC analysis	Can the system produce competent legal work product?
Citation Integrity	Ensure all cited authorities exist and support propositions cited	Does the system meet professional responsibility standards?

These components are dependencies, not independent benchmarks. Each component's output feeds the next. The chain succeeds or fails as a unit; failure at any point propagates downstream.

2.3 Professional Responsibility as Hard Constraint

Citation Integrity occupies a special position. Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable.

This is a binary gate, not a soft metric: work product either meets professional standards or it fails. Shultz & Zedeck's Factor 21—"Integrity & Honesty"—confirms that professional responsibility is foundational to lawyer effectiveness.

2.4 What Research Execution Tests

The final synthesis validates Research Execution. AALL Principle IV's standard—applying gathered information "to resolve a specific issue or need"—is tested at synthesis. If synthesis produces a well-reasoned IRAC analysis using correct authorities, Research Execution succeeded. If it fails, error tracing reveals which upstream component broke.

Two implications follow:

Planning is implicit. Earlier frameworks treated "Research Planning" and "Strategic Stopping" as separate skills. We eliminate them as explicit test targets because synthesis validates planning. Grade the memo, not the plan.

Errors propagate. A model achieving 90% accuracy on each component will, assuming independent errors, complete only 0.9^7 ≈ 48% of full research tasks successfully. This multiplicative penalty reflects legal practice.
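The multiplicative penalty can be checked with a few lines of arithmetic. The sketch below assumes that component errors are independent and uses a hypothetical helper name, chain_success_rate; if real failures are correlated, the product is only a rough guide.

```python
import math

def chain_success_rate(step_accuracies):
    """Probability that every step succeeds, assuming independent errors."""
    return math.prod(step_accuracies)

# Seven components, each 90% accurate (the example from the text).
uniform = chain_success_rate([0.9] * 7)
print(f"{uniform:.3f}")  # 0.478
```

Passing a list of unequal accuracies shows how a single weak component caps the whole chain.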

3.1 PTQ Methods

PTQ compresses model weights after training. Multiple methods exist, each with different compression strategies and preservation priorities:

Method	Bit-Depth	Target	Core Strategy
AWQ	4-bit	Weights	Activation-aware: identifies critical 1% of weights and protects them
GPTQ	4-bit / 8-bit	Weights	Hessian-based: minimizes MSE of weight error layer-by-layer
GGUF (llama.cpp)	Variable	Weights + KV	Block-wise: flexible CPU/GPU offloading, no importance calibration
BNB-nf4	4-bit	Weights	NormalFloat: exploits normal distribution of weight values
SmoothQuant	8-bit	Weights + Activations	Migrates activation spikes into weights
KIVI	2-4 bit	KV Cache	Channel-wise compression of context memory

The critical distinction is between activation-aware methods (AWQ) that identify and protect the weights most critical to reasoning, and block-wise methods (GGUF) that compress uniformly without importance calibration.

3.2 The AWQ vs. GGUF Divergence

AWQ identifies weight channels with high activation magnitude—the parameters most critical to output fidelity—and protects them from aggressive compression. This makes AWQ more robust for reasoning tasks than uniform quantization.

GGUF (llama.cpp) compresses in block-level chunks without nuanced weight protection. It supports optional importance matrix calibration (llama-imatrix), but standard K-Quants (Q4_K_M, Q4_K_S) skip it.
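To make the contrast concrete, the following is a minimal sketch of uniform block-wise quantization, assuming symmetric round-to-nearest 4-bit integers with one scale per block. It is a simplified illustration, not the actual K-Quant format used by llama.cpp and not the AWQ algorithm; the function names and sample weights are hypothetical. It shows how a single high-magnitude weight sets the scale for its block and costs the small weights around it their resolution.

```python
def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate weight values."""
    return [c * scale for c in codes]

# A block of small weights: the scale is fine-grained, so each weight
# keeps a distinct code.
codes, _ = quantize_block([0.10, -0.21, 0.14, 0.05])
print(codes)  # [3, -7, 5, 2]

# The same block with one outlier: the outlier sets the scale and the
# small weights all collapse to zero.
codes, _ = quantize_block([0.10, -0.21, 0.14, 8.0])
print(codes)  # [0, 0, 0, 7]
```

An activation-aware method would handle the channel containing such a weight differently before quantizing, which this sketch does not attempt to reproduce.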

For legal practice, the difference matters:

Feature	AWQ (vLLM)	GGUF (llama.cpp)
Primary Hardware	High-end NVIDIA GPUs	CPUs, Apple Silicon, consumer GPUs
Operational Cost	$1-4/hour per GPU	$0.10-0.50/hour or free locally
Weight Protection	Activation-aware	Block-wise (no importance calibration)
Reasoning Preservation	Higher	Lower
Production Target	High-concurrency SaaS	Edge devices, privacy-first local tools

A user downloading a "4-bit Llama" model cannot tell whether it was quantized with AWQ or GGUF. The label is identical. The legal reasoning quality may differ substantially.

3.3 Commercial Optimizations and Supply Chain Stacking

The PTQ method chosen at the model layer is only the beginning. Commercial infrastructure applies additional optimizations that compound degradation:

NVIDIA TensorRT-LLM (W4A8 / NVFP4)

NVIDIA's Model Optimizer uses W4A8 (4-bit weights, 8-bit activations) to double throughput on Blackwell GPUs. The move to NVFP4 (4-bit weights and activations) for "extreme efficiency" triggers reasoning collapse documented in the literature. This engine powers most high-scale AI cloud providers. They optimize for tokens-per-second—lowering the model's reasoning ceiling in the process.

Microsoft PrefixQuant (Weight-Activation Optimization)

PrefixQuant compresses both weights and activations to achieve high throughput. This optimization caused a 71% collapse in high-level knowledge verification (GPQA). It allows a company to serve 4× more users on the same hardware while effectively degrading the model's validation capabilities.

Snowflake SwiftKV (KV-Cache Compression)

SwiftKV compresses context memory (KV-cache) to process massive datasets (128k tokens) on cheaper GPUs. This causes the 59% drop in retrieval accuracy Mekala et al. (2025) documented. A legal-tech vendor can claim to "review 1,000 contracts for the price of 100"—but retrieval is structurally unreliable.

3.4 The Stacking Problem

A legal AI product may encounter compression at multiple layers:

Layer	What's Happening	Who Controls It
Model weights	PTQ compression (AWQ, GPTQ, GGUF)	Model provider, vendor
Activations	W4A8, NVFP4	NVIDIA, cloud infrastructure
KV-cache	SwiftKV, KIVI	Hosting provider
Routing	Peak-hour downgrades	Everyone

Each layer compounds. Users see none of it. A lawyer using "LegalEagle 2.0" cannot tell whether the underlying model runs INT8 or INT4, whether AWQ or GGUF was used, whether TensorRT applies additional compression, or whether KV-cache is squeezed to handle long documents.

This Section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires. Some evidence comes from quantization studies; some from adjacent work on long-context degradation, reasoning under compression, or legal benchmark difficulty. The mapping establishes plausibility; Section VI provides direct empirical confirmation.

Mapping Human Skills to AI Procedural Workflows

Each skill translates into AI operations:

Skill	Human Activity	AI Workflow
S1: Research Planning	Scope, research, databases	Query decomposition, tool selection
S2: Strategic Stopping	Recognizing diminishing returns	Termination conditions, confidence thresholds
S3: Known Authority	Citation lookup	Exact retrieval from identifier
S4: Unknown Authority	Issue-based searching	Semantic retrieval, query expansion
S5: Validating Authority	Shepardizing	Treatment classification, status verification
S6: Fact Extraction	Document review	Long-context retrieval, entity extraction
S7: Distinguishing Cases	Analogical reasoning	Holding comparison, fact matching
S8: Synthesizing Results	Memo drafting	Multi-document generation
S9: Citation	Ethics compliance	Hallucination avoidance, attribution

This mapping enables precise questions: when quantization degrades "multi-hop reasoning," which lawyer skills suffer? When embedding compression causes "retrieval blindness," which workflows fail?

4.1 Summary: The Full Skill Surface Is At Risk

Skill	Mechanism Evidence	Risk Level
S3: Known Authority	Long-context degradation	High
S4: Unknown Authority	Reasoning + retrieval	High
S5: Validating Authority	Temporal reasoning	Medium
S6: Fact Extraction	Long-context retrieval	High
S7: Distinguishing Cases	Multi-step reasoning	High
S8: Synthesizing Results	Integration + accuracy	High
PR: Prof. Responsibility	Fabrication resistance	Very High

4.2 Empirical Studies on Quantization Effects on Reasoning Skills

4.2.1 Research Planning

Mechanism: Research planning requires decomposing complex queries into subtasks—the multi-hop reasoning Li et al. showed degrades up to 4× under quantization.

Study	Finding
ACBench (Dong et al., 2025)	4-bit quantization creates a critical divergence between apparent competence and actual reliability, with real-world applications exhibiting accuracy drops of 10-15%.
Liu et al. (2025)	Lower bit-width quantization introduces task-difficulty-dependent accuracy risks; the study explicitly evaluates KV-cache and activation quantization as well as weights.
IntactKV (2024)	Provides mechanism support: KV-cache quantization can be a failure point, which bears on workflow state maintenance.

4.2.2 Strategic Stopping

Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE (expected calibration error) studies show quantized models become overconfident, unable to recognize their own uncertainty.

Study	Finding
Zhong et al., 2025	Quantized LLMs are worse-calibrated than full-precision counterparts in 85% of measurements (41 of 48 test conditions). Quantization systematically produces overconfidence.
Q-Misalign (Dong, Li & Guo, 2025)	Safety alignment degrades under quantization; dormant vulnerabilities emerge post-compression.
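Expected calibration error can be computed from a model's stated confidences and whether each answer was correct. The sketch below uses the common equal-width binning definition; the function name, the bin count, and the sample data are illustrative assumptions, not values from the cited studies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical overconfident model: claims 95% confidence, right half the time.
ece = expected_calibration_error([0.95] * 4, [True, False, True, False])
print(round(ece, 2))  # 0.45
```

A well-calibrated model would score near zero; overconfidence of the kind described above shows up as a large gap in the high-confidence bins.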

4.2.3 Finding Known Authority

Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.

Study	Finding
Mekala et al. (2025)	8-bit roughly preserved; 4-bit methods produce losses up to 59%, especially for long-context inputs. Effect varies by method/model/task.
LegalBench-RAG (2024)	Legal-domain benchmark isolating retrieval quality. Legal retrieval is hard even before quantization.

4.2.4 Finding Unknown Authority

Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Liu et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks. Combined with Zheng's difficulty findings, this suggests that quantization severely impairs this skill.

Study	Finding
Li et al.	Low-bit quantization degrades complex math reasoning by up to 32.39% (avg. 11.31%), specifically in numerical computation and planning.
Liu et al., 2025	Lower bit-widths introduce significant accuracy risks; impact depends on task difficulty. Affects DeepSeek-R1, LLaMA, and Qwen.
Yazan, Verberne & Situmeang (2024)	In RAG pipelines, quantization may not impair retrieval when base LLM performs well, but smaller models show high sensitivity to context length and setup.

4.2.5 Validating Authority

Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). Outlier weight clipping destroys these fine-grained distinctions.

Study	Finding
Liu et al. (2025)	W8A8/W4A16 can be lossless; lower bit-widths introduce significant accuracy risks. Task difficulty is critical, which places authority validation in the high-risk regime.
MixKVQ (Zhang et al. 2025)	Low-bit KV-cache quantization exhibits severe degradation on complex reasoning. Fixed-precision at very low bit-widths produces large quantization errors and critical failures.
TimeBench	GPT-4 achieves only 66.4% on implicit temporal relationships. Accuracy varies from 40% to 92% depending on how temporal facts are organized. TRAVELER shows implicit temporal reasoning degrades 39% as context scales from 5 to 100 events.
arXiv .04823	Accuracy drops exceed 10% at 4-bit for reasoning; LexTime achieves only 80.8% on temporal event ordering with 4-bit models. Validating authority requires multi-hop temporal reasoning, which is weak at baseline and catastrophic under quantization.

4.2.6 Fact Extraction

Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying absent information) implicates document review reliability.

Study	Finding
Mekala et al. (2025)	Up to 59% degradation on long-context extraction tasks at 4-bit quantization. Extracting holdings requires identifying the specific legal rule announced by a court, distinguishing it from dicta, and accurately capturing its scope and limitations. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated: the quantized system generates authoritative-sounding rules that the cited case never announced.

4.2.7 Distinguishing Cases

Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences—multi-step reasoning, the capacity most vulnerable to quantization.

Study	Finding
Dahl et al. (Journal of Legal Analysis, 2024)	Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions. Combined with Li et al.'s 32.39% reasoning degradation under quantization, this indicates high unreliability. The baseline 58-88% hallucination rate was measured on unquantized models; adding 4-bit compression amplifies an already critical reliability gap.
Liu et al., 2025	Supports the claim that low-bit regimes create accuracy risks on hard reasoning tasks (the cognitive substrate for distinction).

4.2.8 Synthesizing Results

Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC reports that strong models produce highly rated analyses while hallucinating—good writing does not mean truthful authority.

Study	Finding
LegalEval-Q (Li & Wu, 2025)	Measures clarity, coherence, and terminology quality. Importantly, it reports that quantization has negligible impact on those writing-quality metrics, which supports the hypothesis that fluency is preserved while truth degrades.
Lewis et al. (2020)	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. The canonical RAG citation: it establishes retrieval plus generation as a distinct paradigm because parametric memory alone is insufficient, with provenance and updating as core motivations. It supports treating unknown-authority finding and synthesis as a retrieval-conditioned reasoning task rather than generic generation. URL: arXiv .11401.

4.2.9 Professional Responsibility

Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips outlier weights encoding rare-but-accurate associations.

Study	Finding
Q-Misalign (Dong et al., 2025)	Safety alignment is not preserved by quantization but is contingent on precision. Vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes. Combined with Dahl et al.'s finding that even unquantized models hallucinate legal information at 58-88% rates while being unable to detect their own errors, quantized legal AI systems present a dual threat to professional responsibility.
Li et al. (2024)	4-bit quantization significantly weakens fabrication resistance.
Dahl et al. (2024) Large Legal Fictions	LLMs hallucinate legal authority at alarming rates (69-88%) on verifiable legal queries.

5.1 Research Execution as Job Performance

Section II framed legal work as Research Execution—the integrated competency of completing a legal research task from question to answer. Shultz & Zedeck's empirical study confirms this framing: their 26 "Lawyering Effectiveness Factors" are job performance measures, observed in the execution of actual legal work rather than tested in isolation.

AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need." The competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."

This describes a workflow that succeeds or fails as a unit.

5.2 The Legal-7 Chain

Legal-7 (L7) operationalizes Research Execution as a seven-step dependent chain:

Step	Name	Modality	Task	Ground Truth
S1	Known Authority	RAG	Resolve known citation to correct authority	SCDB citation lookup
S2	Unknown Authority	RAG	Retrieve relevant law from fact pattern	shepards_data.csv
S3	Validate Authority	RAG	Determine if authority remains good law	scotus_overruled_db.csv
S4	Fact Extraction	RAG	Extract disposition, holding, outcome from opinion	SCDB metadata + opinion text
S5	Distinguish Cases	RAG + CB	Decide if precedent applies or can be distinguished	shepards.agree field
S6	IRAC Synthesis	RAG	Write IRAC-structured legal analysis	MEE rubric + chain grounding
S7	Citation Integrity	CB	Verify no fabricated citations in S6 output	fake_cases.csv + SCDB

The chain maps to IRAC:

Rule Phase (S1–S3): Identify, retrieve, and validate legal authority
Application Phase (S4–S5): Extract facts and apply precedent through distinction
Conclusion Phase (S6–S7): Synthesize analysis and verify citation integrity

The Issue component is implicit in the query. S6 is the capstone: it tests whether Research Execution succeeded.

5.3 Why S6 Validates the Chain

S6 runs closed-book: the model cannot return to sources. It must synthesize an IRAC memo from what it gathered in S1–S5.

This design reflects AALL Principle IV: applying gathered information to resolve an issue. If S6 produces a well-reasoned IRAC analysis using correct authorities, Research Execution worked. If S6 fails, error tracing reveals which upstream step broke:

If Step Fails...	Cascade Effect
S1 (Known Authority)	Wrong case → all downstream analysis corrupted
S2 (Unknown Authority)	Missing precedent → incomplete rule statement
S3 (Validate Authority)	Citing bad law → S6 argument fails
S4 (Fact Extraction)	Wrong facts → S5 distinction invalid
S5 (Distinguish)	Wrong application → S6 conclusion unsupported
S6 (IRAC Synthesis)	Poor reasoning → chain fails at capstone
S7 (Citation Integrity)	Fabrication detected → S6 voided, chain fails
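Error tracing of this kind amounts to walking the chain in order and reporting the earliest step that failed. A minimal sketch, assuming each step's pass/fail result is already known; the step labels follow the table above, while the function name and the result format are hypothetical.

```python
CHAIN_ORDER = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

def first_failure(step_passed):
    """Return the earliest failed step in chain order, or None if all passed.

    A step missing from the results is treated as failed.
    """
    for step in CHAIN_ORDER:
        if not step_passed.get(step, False):
            return step
    return None

# Hypothetical run: S6 failed, but the root cause is the earlier S3 failure.
results = {"S1": True, "S2": True, "S3": False, "S4": True,
           "S5": True, "S6": False, "S7": True}
print(first_failure(results))  # S3
```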

A model achieving 90% accuracy on each skill completes only 0.9^7 ≈ 48% of full chains. This multiplicative penalty reflects legal practice: one fabricated citation makes a brief worthless.

5.4 S7 as Professional Responsibility Gate

S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty."

Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable. L7 mirrors this: if S7 detects any fabricated citation in the S6 output, S6 scores zero regardless of reasoning quality.

The gate operates deterministically: citations from S6 are checked against SCDB (real cases) and fake_cases.csv (known fabrications). No LLM-as-judge evaluation required.
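A deterministic gate of this kind can be sketched as a citation extractor plus two set lookups. The example below assumes citations appear in U.S. Reports form (for example 347 U.S. 483) and stands in for SCDB and fake_cases.csv with small in-memory sets. The regular expression, function names, and sample data are illustrative assumptions, not the study's implementation; treating a citation found in neither set as invalid is likewise an assumption.

```python
import re

# Matches citations such as "347 U.S. 483" (volume, reporter, first page).
US_CITE = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")

def extract_citations(text):
    """Return normalized 'VOL U.S. PAGE' strings found in the text."""
    return [f"{vol} U.S. {page}" for vol, page in US_CITE.findall(text)]

def citation_gate(memo, real_cites, known_fakes):
    """Check every extracted citation against the real and fake sets."""
    found = extract_citations(memo)
    fabricated = [c for c in found if c in known_fakes or c not in real_cites]
    return {"citations_found": found, "all_valid": not fabricated}

real = {"347 U.S. 483", "358 U.S. 1"}   # stand-in for SCDB
fakes = {"999 U.S. 123"}                 # made-up entry standing in for fake_cases.csv

memo = ("Brown v. Board of Education, 347 U.S. 483 (1954), was followed in "
        "Cooper v. Aaron, 358 U.S. 1 (1958).")
print(citation_gate(memo, real, fakes)["all_valid"])  # True
```

Because the check is a lookup rather than a judgment, the same memo always produces the same verdict.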

5.5 S5 Dual-Modality: The Reasoning Bridge

S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning. The model must:

Understand the holding of a precedent case
Understand the facts of the current case
Determine whether the precedent applies or can be distinguished

This is not retrieval. This is reasoning about retrieved content—precisely what AALL Principle IV demands when it requires synthesizing doctrine across "cases similar, but not identical."

To isolate the reasoning component, L7 tests S5 in two modalities:

S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information. This matches real lawyer workflow—attorneys distinguish cases with the opinions open.

S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone, without copying from source material.

The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:

S5-RAG	S5-CB	Interpretation
High	High	Model reasons well
High	Low	Model copies, doesn't reason (FRD signature)
Low	Low	Model cannot perform the task

A model exhibiting FRD will show a large RAG-CB gap: it can "distinguish" cases when it has the full text to copy from, but cannot reason about the legal relationship from the holding alone. This is precisely the failure mode we hypothesize quantization induces.
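The interpretation table can be expressed as a small classifier over the two S5 accuracies. The 0.7 cutoff for "high" and the function name are hypothetical choices for illustration, since no threshold is specified here. The Low/High combination, which the table does not cover, is reported as unclassified.

```python
def frd_signature(rag_acc, cb_acc, high=0.7):
    """Return (RAG-CB gap, interpretation) for a pair of S5 accuracies."""
    gap = rag_acc - cb_acc
    rag_high = rag_acc >= high
    cb_high = cb_acc >= high
    if rag_high and cb_high:
        label = "reasons well"
    elif rag_high and not cb_high:
        label = "copies, does not reason (FRD signature)"
    elif not rag_high and not cb_high:
        label = "cannot perform the task"
    else:
        label = "unclassified"
    return gap, label

gap, label = frd_signature(0.9, 0.5)
print(f"{gap:.2f} {label}")  # 0.40 copies, does not reason (FRD signature)
```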

5.6 Grading Architecture

L7 achieves 6/7 objective grading:

Step	Grading Method	Ground Truth Source
S1	Exact match	SCDB citation
S2	MRR / Hit@k	Shepard's precedent relationships
S3	Exact match	scotus_overruled_db
S4	Exact match (disposition, party)	SCDB metadata
S5	Exact match	shepards.agree field
S6	Hybrid (50% objective, 50% LLM-as-Judge)	Chain grounding + MEE rubric
S7	Deterministic	Citation existence check

Only S6 requires rubric-based evaluation. The 50% objective component ("chain grounding") verifies that S6 correctly incorporates outputs from S1–S5—did the model use the authorities it found? The 50% subjective component applies MEE (Multistate Essay Examination) bar exam standards to assess legal reasoning quality.

This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic, and even the subjective portion is anchored to the chain's objective outputs.
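Combining the hybrid S6 score with the S7 gate described in Section 5.4 might look like the following sketch. It assumes both sub-scores are normalized to the range 0 to 1; the function and parameter names are hypothetical.

```python
def s6_score(chain_grounding, rubric_score, s7_all_valid):
    """Hybrid S6 score: 50% objective grounding, 50% rubric; voided by S7."""
    for name, value in (("chain_grounding", chain_grounding),
                        ("rubric_score", rubric_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if not s7_all_valid:
        # Any fabricated citation voids S6 regardless of reasoning quality.
        return 0.0
    return 0.5 * chain_grounding + 0.5 * rubric_score

print(s6_score(1.0, 0.5, s7_all_valid=True))   # 0.75
print(s6_score(1.0, 0.5, s7_all_valid=False))  # 0.0
```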

5.7 Task Structure: From Case to Chain

Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network. The anchor case (cited_case) provides the legal authority to be researched; the citing case establishes the doctrinal relationship to be analyzed.

Initial Scenario

A chain instance contains:

Element	Source	Example
Cited Case	scdb_sample.csv	Brown v. Board of Education, 347 U.S. 483 (1954)
Citing Case	scotus_shepards_sample.csv	Cooper v. Aaron, 358 U.S. 1 (1958)
Shepard's Signal	shepards field	"followed"
Doctrinal Agreement	agree field	True (citing case follows precedent)
Overrule Status	scotus_overruled_db.csv	None (not overruled)
Opinion Text	majority opinion field	Full text of majority opinion

The model receives this case pair and must execute the seven-step chain, with each step's output feeding subsequent steps.

Task Types by Step

Step	Task Type	Input	Expected Output
S1	Known Authority	Case name or citation	`{us_cite, case_name, term}`
S2	Unknown Authority	Legal issue from anchor case	Ranked list of citing cases
S3	Validate Authority	Citation from S1	`{is_overruled, overruling_case, year_overruled}`
S4	Fact Extraction	Opinion text	`{disposition, party_winning, holding_summary}`
S5	Distinguish (Closed-Book)	S4 holding + citing case metadata	`{agrees, reasoning}`
S5	Distinguish (RAG)	S4 holding + full citing opinion	`{agrees, reasoning}`
S6	IRAC Synthesis	All prior outputs	`{issue, rule, application, conclusion}`
S7	Citation Integrity	S6 output	`{citations_found, all_valid}`

Scoring Summary

Step	Ground Truth	Scoring Method
S1	SCDB metadata	Exact match
S2	Shepard's citing_case_us_cite	MRR, hit@10
S3	scotus_overruled_db	Binary match on is_overruled
S4	SCDB caseDisposition, partyWinning	Closed enum exact match
S5	Shepard's agree field	Binary match
S6	MEE rubric + chain grounding	Hybrid (50% objective, 50% rubric)
S7	fake_cases.csv + SCDB	Deterministic lookup
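For readers unfamiliar with the S2 ranking metrics, the sketch below shows the standard definitions of mean reciprocal rank (MRR) and hit@k. The function names and the toy inputs are hypothetical; the benchmark's own scorer may differ in how it normalizes citations before matching.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """Return 1/rank of the first relevant citation, or 0.0 if none is returned."""
    for rank, cite in enumerate(ranked, start=1):
        if cite in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int = 10) -> float:
    """Return 1.0 if any relevant citation appears in the top k results, else 0.0."""
    return 1.0 if any(cite in relevant for cite in ranked[:k]) else 0.0


def score_s2(predictions: Sequence[Sequence[str]], gold: Sequence[set[str]], k: int = 10) -> dict:
    """Average both metrics over all S2 instances."""
    if len(predictions) != len(gold) or not predictions:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    n = len(predictions)
    mrr = sum(reciprocal_rank(p, g) for p, g in zip(predictions, gold)) / n
    hits = sum(hit_at_k(p, g, k) for p, g in zip(predictions, gold)) / n
    return {"mrr": mrr, f"hit@{k}": hits}
```

MRR rewards placing a true citing case near the top of the list, while hit@10 only asks whether one appears anywhere in the first ten results, so the two can diverge when a model retrieves the right case but ranks it poorly.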

5.8 What L7 Detects That Parallel Benchmarks Cannot

The Dahl et al. benchmark tests the same underlying data but in parallel: each task is scored independently, so a model can achieve high aggregate scores while being functionally incapable of completing a single end-to-end workflow. Take a model with an 85% aggregate score. If its 15% of failures are distributed randomly across tasks, some workflows still succeed. But if failures cluster at early-chain positions, as our quantization hypothesis predicts, then independent task accuracy becomes a misleading proxy for Research Execution capability.
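To make the compounding concrete, assume purely for illustration that a chain has seven steps and that each step succeeds independently with probability 0.85. Neither figure is a measured result. Under those assumptions, the probability that an entire chain completes without error is:

```latex
P(\text{chain succeeds}) \;=\; \prod_{i=1}^{7} p_i \;=\; 0.85^{7} \;\approx\; 0.32
```

A per-task score of 85% would then correspond to roughly one in three workflows finishing cleanly, and fewer still if errors concentrate at the steps on which later steps depend.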

L7 detects three failure modes invisible to parallel evaluation:

Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.

FRD signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.

Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not. Parallel benchmarks score fabrication as one error among many; L7 treats it as disqualifying.

For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.

The following section applies L7 to test Fluency-Reasoning Divergence across quantization regimes.

[Transition sentences to be written]

6.1 Design Choice: Ecological Validity Over Experimental Purity

This study prioritizes ecological validity over experimental purity. The objective is not to isolate quantization as a laboratory variable under controlled conditions, but to measure what happens to legal reasoning under the deployment configurations users actually encounter in practice.

A secondary objective emerged from the market analysis in Section III: demonstrating that "4-bit" is not a specification. Two deployments at identical bit-depth using different PTQ methods may produce materially different legal reasoning—and users cannot distinguish them. This study tests that claim empirically.

The design involves deliberate tradeoffs. By testing multiple PTQ methods, we gain the ability to show method variance. By using officially released and community-standard quantization rather than a single controlled PTQ pipeline, we lose clean causal isolation but gain results that reflect what lawyers actually face.

We accept these tradeoffs because the policy-relevant questions are:

What are lawyers exposed to in practice?
Can users distinguish between "good 4-bit" and "bad 4-bit"?

The answer to the second question, we hypothesize, is no—and we aim to prove it.

The inability to make clean causal attributions is not a limitation of this study. It is a finding about the market.

6.2 Research Questions

RQ1. Existence. Does aggressive quantization reduce legal-reasoning accuracy more than it reduces surface fluency?

RQ2. Generalizability. Is FRD observable across multiple legal task types, or is it benchmark-specific?

RQ3. Regimes. Can we distinguish a "silent defect zone" (INT8→INT4: reasoning degraded, fluency intact) from an "obvious defect zone" (2-bit: both collapse)?

RQ4. PTQ Method Variance. At identical bit-depth, do different PTQ methods produce significantly different legal reasoning outcomes?

6.3 Hypotheses

H1: Reasoning Degradation (Silent Defect Zone)

Legal reasoning accuracy will degrade significantly between INT8 and INT4 across all three arms, with the largest drops in categories requiring multi-step inference, exception-tracking, or doctrinal distinction.

H2: Fluency Stability (FRD Signature)

Fluency metrics (perplexity, grammaticality, coherence) will remain stable across precision tiers even as reasoning accuracy degrades—confirming Fluency-Reasoning Divergence.

H3: Catastrophic Collapse (Obvious Defect Zone)

At 2-bit precision, both fluency and reasoning will degrade across all arms, producing visible failure modes distinguishable from the latent degradation at INT4.

H4: Category-Specific Vulnerability

Different Dahl task types will show differential sensitivity. Tasks 5-10 (reasoning-intensive) will degrade faster than Tasks 1-4 (factual).

H5: Architecture Generalization

Arms A and B (Qwen and Llama under high-quality PTQ) will both show FRD, demonstrating that degradation is not architecture-specific.

H6: PTQ Method Divergence

At identical bit-depth, AWQ (Arm B) will significantly outperform GGUF (Arm C), demonstrating that bit-depth alone is an insufficient specification. The B-C gap at 4-bit may exceed the 8→4 gap within a single arm.

6.4 Model and Precision Regimes

Design: Three arms testing two model families under two PTQ philosophies.

Tier	Model 1	Model 2	PTQ	Rationale
8-bit	Llama 3.1 8B	Qwen 2.5 7B	??	Ecological validity, higher performance quality
4-bit
3-bit			llama.cpp (GGUF)	Low-cost PTQ—block-wise compression, no importance calibration
2-bit

Rationale for Study Design

Arms A and B test FRD under "reasonable quality" PTQ—what a careful vendor or sophisticated deployer might ship. These represent the ceiling of what users could hope to encounter.

Arm C tests FRD under the PTQ method that economic pressures actually favor. Startups without GPU budgets, solo developers, and cost-constrained deployments default to llama.cpp because it runs on consumer hardware. This represents what many users actually encounter.

The B vs C comparison tests the same model across different PTQ techniques: same base weights, same nominal bit-depth, different compression philosophy. This directly tests whether PTQ method produces material differences at identical bit-depth (see Section III).

Why INT8 as Baseline

Full-precision models exist but are rarely what users encounter. The economic realities documented in Section III push vendors toward compressed deployments. INT8 represents the "reasonable production floor"—the precision tier a user might plausibly expect from a serious legal AI product. Testing degradation relative to INT8 answers the question users actually face: "How much worse does it get from here?"

Rationale for Benchmark Selection

Verifiable ground truth. Unlike benchmarks requiring LLM-as-judge evaluation, Dahl uses structured case metadata that can be programmatically verified.
Multi-skill coverage. The 10 task types span multiple L-10 skills, testing factual recall, reasoning, temporal judgment, and fabrication resistance.
FRD operationalization. The task gradient, from simple factual tasks (1-4) to complex reasoning tasks (5-10), allows observation of whether errors cluster in reasoning-intensive categories while fluency-dependent categories remain stable.
Cross-arm comparability. All three arms run identical queries, enabling direct comparison across architectures and PTQ methods.

H4 Prediction: Category-Specific Vulnerability

Tier	Tasks 1-4 (Factual)	Tasks 5-8 (Reasoning)	Tasks 9-10 (Hardest)
INT8	✓ Stable	✓ Stable	✓ Stable
INT4	✓ Stable	↓ Silent degradation	↓↓ Degradation
2-bit	↓ Degradation	↓↓ Collapse	↓↓↓ Collapse

The gradient is the FRD signature. Simple tasks hold. Reasoning tasks degrade silently. Hardest tasks fail first and worst.

6.5 Evaluation Architecture: HELM Integration

This study implements evaluation using Stanford CRFM's Holistic Evaluation of Language Models (HELM) framework. Seven considerations motivate this choice:

Audit trail. HELM produces SHA-256 hashed bundles containing prompts, outputs, and configuration state, enabling independent verification.
Standardized reporting. HELM's structure naturally accommodates our 9-configuration × 10-task matrix with consistent output formats.
Credibility signal. HELM is the evaluation infrastructure behind Stanford CRFM's published model assessments, signaling alignment with established best practices.
Reproducibility. The entire study can be re-executed with a single command; all configuration is declarative.
YAML-based configs. Model swapping requires only configuration file updates, not code changes.
Caching. HELM provides crash-safe resume for 50,000+ inference calls across potentially unstable quantized models.
Extensibility. Adding a future Arm D requires only a new YAML entry rather than architectural changes.

The Dahl benchmark is integrated as a custom HELM scenario (DahlHallucinationScenario) that loads task data from the RegLab repository, structures queries as HELM instances, and scores outputs using Dahl's correctness_checks.py logic with GPT-4 as judge for semantic evaluation.

6.6 Metrics

	Metric	What It Measures	Applied To
Primary Outcome	Dahl (overall)	Correctness rate across all 10 task types	All arms, all tiers
	Dahl (by task)	Correctness rate per task type	All arms, all tiers
FRD Detection Metrics	Perplexity	Model confidence / fluency proxy	Detect fluency stability
	FRD Index	(Fluency stability) − (Reasoning degradation)	Composite divergence measure
Calibration Metrics	ECE	Confidence-accuracy alignment	Detect overconfidence
	Confidence by Correctness	Mean confidence on correct vs incorrect	Identify high-confidence errors
Forensic Metrics	EBPW	Effective bits per weight	Verify claimed precision
	dtype audit	Data type of served weights	Confirm no silent override
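The two calibration metrics in the table can be sketched as below. This uses the common equal-width binning form of expected calibration error (ECE). How a per-answer confidence in [0, 1] is obtained from the model is not specified here and is treated as an input; the function names and the bin count of 10 are assumptions.

```python
from typing import Optional, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between average confidence and accuracy across bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equal in length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins: list[list[tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


def confidence_by_correctness(confidences: Sequence[float],
                              correct: Sequence[bool]) -> tuple[Optional[float], Optional[float]]:
    """Mean confidence on correct answers and on incorrect answers (None if a group is empty)."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]

    def mean(values: list[float]) -> Optional[float]:
        return sum(values) / len(values) if values else None

    return mean(right), mean(wrong)
```

A high mean confidence on incorrect answers is the pattern of interest for FRD: it would indicate errors that are delivered with the same assurance as correct answers.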

6.7 Procedure

6.7.1 Environment

Component	Specification
Hardware	RTX 5090 (32GB VRAM)
Framework	vLLM (Arms A, B), llama-cpp-python (Arm C)
Judge	GPT-4 via OpenAI API

6.7.2 Model Artifacts

Arm	Source	Artifacts
A	Alibaba official	Qwen2.5-7B-Instruct @ INT8/INT4/2-bit
B	HuggingFace	Meta-Llama-3.1-8B-Instruct-AWQ @ INT8/INT4/2-bit
C	HuggingFace	Meta-Llama-3.1-8B-Instruct-GGUF @ Q8_0/Q4_K_M/Q2_K

6.7.3 Decoding Parameters (Fixed Across All Runs)

Parameter	Value	Rationale
Temperature	0.0	Deterministic—eliminates sampling variance
Top-p	1.0	No nucleus sampling
Max tokens	512	Sufficient for Dahl responses
Seed	42	Reproducibility

6.7.4 Run Protocol

Load model at specified precision
Verify dtype/EBPW before run
Run full Dahl benchmark (all 10 task types)
Log all outputs with metadata
Score via GPT-4 judge + programmatic checks
Repeat for each arm × tier (9 configurations)
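The dtype/EBPW verification step in the protocol above can be approximated from the artifact alone. The sketch below assumes one simple definition, total stored weight bits divided by parameter count; the function names, the tolerance, and the example figures are hypothetical and are not measurements from this study.

```python
def effective_bits_per_weight(weight_bytes: int, n_params: int) -> float:
    """Total stored bits divided by the number of model parameters."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return weight_bytes * 8 / n_params


def matches_claimed_precision(weight_bytes: int, n_params: int,
                              claimed_bits: float, tolerance_bits: float = 1.0) -> bool:
    """True if the measured EBPW is within tolerance_bits of the claimed bit-depth."""
    measured = effective_bits_per_weight(weight_bytes, n_params)
    return abs(measured - claimed_bits) <= tolerance_bits


# Illustrative numbers only: a 4.5 GB weight file for a model with 8 billion parameters.
example_ebpw = effective_bits_per_weight(weight_bytes=4_500_000_000, n_params=8_000_000_000)
```

Measured EBPW usually sits somewhat above the nominal bit-depth because quantized formats also store scales, zero points, and some tensors at higher precision, which is why the check uses a tolerance rather than exact equality.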

6.7.5 Reproducibility Package

Per Legal-10 protocol:

Full run bundles (prompts, raw outputs, scores, configuration)
SHA-256 hashes for all artifacts
Signed manifests
Public append-only submission log
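A minimal sketch of the hashing portion of this package, using only the Python standard library, is shown below. The function names and the manifest layout (relative path mapped to hex digest) are assumptions; signing and the append-only log are out of scope for this sketch.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large run bundles do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: str) -> dict[str, str]:
    """Map each file's bundle-relative path to its SHA-256 hex digest."""
    root = Path(bundle_dir)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def manifest_digest(manifest: dict[str, str]) -> str:
    """Hash a canonical JSON encoding so the manifest itself has a stable identifier."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing separators before hashing matters: two manifests with identical contents should yield the same digest regardless of file-system traversal order.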

6.8 Analysis Plan

6.8.1 Within-Arm Analysis (H1, H2, H3)

For each arm independently:

Plot accuracy by precision tier (8 → 4 → 2)
Plot perplexity by precision tier
Identify silent defect zone (accuracy drops, perplexity stable)
Identify obvious defect zone (both drop)
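The zone identification above can be expressed as a simple rule relative to the INT8 baseline. The sketch below is one possible operationalization: the thresholds (a 5% relative accuracy drop and a 10% relative perplexity rise) and the function name are placeholders chosen for illustration, not values fixed by this study.

```python
def classify_zone(base_acc: float, acc: float,
                  base_ppl: float, ppl: float,
                  acc_drop_tol: float = 0.05,
                  ppl_rise_tol: float = 0.10) -> str:
    """Label a precision tier by comparing it to the baseline tier.

    Reasoning is 'degraded' if accuracy falls by more than acc_drop_tol (relative);
    fluency is 'degraded' if perplexity rises by more than ppl_rise_tol (relative).
    """
    if base_acc <= 0 or base_ppl <= 0:
        raise ValueError("baseline accuracy and perplexity must be positive")
    acc_drop = (base_acc - acc) / base_acc
    ppl_rise = (ppl - base_ppl) / base_ppl
    reasoning_degraded = acc_drop > acc_drop_tol
    fluency_degraded = ppl_rise > ppl_rise_tol
    if reasoning_degraded and fluency_degraded:
        return "obvious defect"
    if reasoning_degraded:
        return "silent defect"
    if fluency_degraded:
        return "fluency-only degradation"
    return "stable"
```

Because the labels depend on the chosen tolerances, any reported zone boundary should be accompanied by a sensitivity check over a range of threshold values.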

6.8.2 Task-Level Analysis (H4)

Heatmap: Task type × precision tier × arm
Test whether Tasks 5-10 degrade faster than Tasks 1-4
Identify which tasks show earliest/steepest decline

6.8.3 Architecture Comparison (H5)

Compare Arm A vs Arm B at matched precision
Report whether FRD appears in both architectures

6.8.4 PTQ Method Comparison (H6)

Direct comparison: Arm B vs Arm C at each precision tier
Statistical test: Is B-C gap at INT4 significant?
Compare B-C gap magnitude to within-arm 8→4 gap
If B-C gap ≥ 8→4 gap: "PTQ method matters as much as bit-depth"
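The plan above does not fix a particular statistical test. Because every arm answers identical queries, one candidate is an exact McNemar test on paired per-item correctness flags, sketched below together with the gap comparison. The choice of test, the function names, and the input format (one boolean per benchmark item) are all assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(b_correct: Sequence[bool], c_correct: Sequence[bool]) -> tuple[int, int, float]:
    """Two-sided exact McNemar test on paired outcomes.

    Returns (items only Arm B got right, items only Arm C got right, p-value).
    """
    if len(b_correct) != len(c_correct):
        raise ValueError("paired outcome lists must have equal length")
    b_only = sum(1 for b, c in zip(b_correct, c_correct) if b and not c)
    c_only = sum(1 for b, c in zip(b_correct, c_correct) if c and not b)
    n = b_only + c_only
    if n == 0:
        return b_only, c_only, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b_only, c_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b_only, c_only, min(1.0, 2.0 * tail)


def accuracy(flags: Sequence[bool]) -> float:
    """Share of items answered correctly."""
    return sum(flags) / len(flags)


def ptq_gap_vs_bitdepth_gap(b_int8: Sequence[bool], b_int4: Sequence[bool],
                            c_int4: Sequence[bool]) -> tuple[float, float, bool]:
    """Return (B-C gap at INT4, Arm B's 8-to-4 gap, whether the first is at least the second)."""
    bc_gap = accuracy(b_int4) - accuracy(c_int4)
    bit_gap = accuracy(b_int8) - accuracy(b_int4)
    return bc_gap, bit_gap, bc_gap >= bit_gap
```

Only discordant items (those on which exactly one arm is correct) carry information in this test, which suits a design where most items are answered identically by both arms.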

6.8.5 FRD Quantification

Compute FRD Index per arm/tier
Identify precision threshold where FRD is maximized

[To be completed after experiments]

8.1 FRD Confirmed

[To be completed after results]

8.2 The Governance Vacuum

The results confirm what Section III documented: the legal AI market is operating without oversight.

No disclosure requirements mandate that vendors reveal precision tiers. No standards define what "4-bit" means. No liability exposure yet connects quantization choices to malpractice. The economic incentives—from big cloud providers optimizing margins to startups avoiding GPU costs—all push toward aggressive compression.

8.3 Implications

For practitioners: Verification is structurally defeated. A lawyer cannot determine, at the point of use, whether the legal AI tool is running at INT8 or INT4, AWQ or GGUF, with or without infrastructure-layer compression. The observability gap documented in Part I is empirically confirmed.

For vendors: The results create design-defect exposure. Quantization-induced FRD is foreseeable, configuration-linked, and undisclosed. Reasonable alternative designs (higher precision, transparent labeling) exist and are feasible.

For regulators: Precision tier should be a disclosed material specification. "4-bit" is not a specification—it is a label covering materially different products.

External Benchmark Injection Points. The L7 chain architecture enables integration with existing legal AI benchmarks at specific skill positions. CaseHOLD (Zheng et al., 2021) maps to S5 closed-book mode, testing holding identification without full case context. LegalBench overruling detection maps to S3 (Validate Authority). LegalBench definition extraction maps to S4 (Fact Extraction). This injection architecture allows L7 to provide both native chained evaluation and external benchmark comparability within a single framework.

This study demonstrates that Fluency-Reasoning Divergence is empirically observable in legal AI systems: quantization degrades legal reasoning while surface fluency remains stable.

The degradation is not uniform across methods. AWQ and GGUF at identical bit-depth produce materially different legal reasoning, confirming that "4-bit" is a label, not a specification.

And the degradation is invisible to users. The opacity problem operates at every layer: model, PTQ method, infrastructure compression, routing. Users face a black box they cannot audit.

The industry is deploying compromised systems under nominal equivalence. The governance vacuum persists. The evidentiary gap is architectural.

This Article does not solve the problem. But it documents it—with enough precision that the documentation itself becomes actionable. When regulators ask "did anyone see this coming?", when courts ask "was this foreseeable?", when bar associations ask "what should lawyers have known?"—the answer is here.

Appendix A: Skill Framework Source Citations

Sources:

MacCrate Report: American Bar Association. (1992). Legal education and professional development: An educational continuum (Report of the Task Force on Law Schools and the Profession: Narrowing the Gap), 138–141.
AALL Principles: American Association of Law Libraries. (2013). Principles and standards for legal research competencies.
Shultz & Zedeck: Shultz, M. M., & Zedeck, S. (2011). Predicting lawyer effectiveness: Broadening the basis for law school admission decisions. Law & Social Inquiry, 36(3), 620-661.

External Benchmark Injection Points. The L10 chain architecture enables integration with existing legal AI benchmarks at specific skill positions. CaseHOLD (Zheng et al., 2021) maps to S7 closed-book mode, testing holding identification without full case context. LegalBench overruling detection maps to S5 (Validating Authority). LegalBench definition extraction maps to S6 (Fact Extraction). This injection architecture allows L10 to provide both native chained evaluation and external benchmark comparability within a single framework.

Skill	MacCrate	AALL	Shultz & Zedeck
S1: Research Planning	§ 3.3(a): "devising and implementing a coherent and effective research design"	Principle II, Standard 1	Factor 14: Strategic Planning
S2: Strategic Stopping	§ 3.3(a)(iii): "assessing feasibility... in terms of time and financial constraints"	Principle II, Standard 4: "Recognizes when sufficient research has been done"	Factor 13: Organizing and Managing One's Own Work
S3: Known Authority	§ 3.2: "Knowledge of the Fundamental Tools of Legal Research"	Principle II, Standard 2: "find the full text... given a legal citation"	Factor 6: Fact Finding
S4: Unknown Authority	§ 3.2(a): using secondary sources to find primary authority	Principle II, Standard 3	Factor 5: Researching the Law
S5: Validating Authority	§ 3.1: "Knowledge of the Nature of Legal Rules"	Principle III, Standard 2: "verifies that the authority is current and still good law"	Factor 5: Researching the Law
S6: Fact Extraction	§ 4.1: Factual Investigation	[Gap]	Factor 6: Fact Finding
S7: Distinguishing Cases	§ 3.3(c): "distinguishing cases on their facts and reasoning"	Principle IV, Standard 1	Factor 1: Analysis and Reasoning
S8: Synthesizing Results	§ 2.2: "synthesizing the holdings of multiple cases"	Principle IV, Standard 2	Factor 9: Writing
PR: Professional Responsibility	§ 10: "Recognizing and Resolving Ethical Dilemmas"	Principle V, Standard 1	Factor 21: Integrity & Honesty