1.1 Problem: Quantization Degrades AI Reasoning—Benchmarks Miss It
When a law firm subscribes to an AI-powered legal research platform, computing costs drive the monthly invoice. Behind that invoice lies an engineering decision: run the model at full precision, or compress weights into 4-bit integers that quadruple throughput on the same hardware. Users gain speed; vendors cut costs. For vendors operating at scale, quantization is standard deployment practice.
Prior work suggests this compression is benign. Jin et al. found that 4-bit models "maintain performance comparable to their non-quantized counterparts" across MMLU, summarization, and arithmetic reasoning. Red Hat reported that quantized Llama 3.1 models recover 96–99% of baseline scores with "no discernible differences" in typical use.
But these benchmarks omit legal reasoning tasks. They ignore hallucination rates under legal-specific taxonomies. They bypass the multi-step doctrinal analysis—tracking exceptions, applying multi-factor tests, distinguishing precedent—that characterizes competent legal work.
Other work raises concern. Li et al. found up to 32% accuracy degradation on mathematical reasoning benchmarks while general text performance stayed above 95%. Liu et al. confirmed that quantization below 8-bit creates "significant accuracy risks" in reasoning-intensive models. The mechanism is understood: complex reasoning relies on rare, high-magnitude weight directions that aggressive compression clips first.
This study tests whether quantization creates a silent defect zone in legal reasoning, a regime where accuracy collapses while fluency remains stable, and whether post-training quantization (PTQ) methods produce materially different outcomes at identical bit-depths.
The central premise: legal reasoning is fragile in ways general benchmarks miss. Quantization may selectively degrade reasoning while leaving surface fluency intact—a divergence where the AI sounds competent but reasons poorly.
2.1 Research Execution as Integrated Competency
Before testing whether quantization degrades legal AI, we must define what legal AI should do. This Section frames legal work not as independent skills but as Research Execution—the integrated competency of completing a legal research task from question to answer.
Three authoritative sources converge on this construct:
MacCrate Report (ABA, 1992): Identifies legal research as a fundamental lawyering skill, emphasizing a "coherent and effective research design" that integrates issue identification, source selection, and strategic execution.
AALL Principles and Standards for Legal Research Competency (2013): Principle IV states that "a successful legal researcher applies information effectively to resolve a specific issue or need." The standard elaborates: the competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."
Shultz & Zedeck empirical study of lawyer effectiveness (2011): Their 26 "Lawyering Effectiveness Factors" are job performance measures derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?" Relevant factors include "Fact Finding," "Researching the Law," and "Integrity & Honesty."
These sources describe a workflow that succeeds or fails as a unit—not seven independent skills tested in isolation. A lawyer who finds the right case but extracts the wrong holding has failed. A lawyer who synthesizes brilliantly but cites fabricated authority has produced worthless work product.
2.2 The Seven Components of Research Execution
Research Execution comprises seven sequential operations that lawyers perform when completing a legal research task:
| Component | Operation | What It Tests |
|---|---|---|
| Known Authority | Resolve a known citation to correct case metadata | Can the system retrieve specific authorities? |
| Unknown Authority | Retrieve relevant law from a fact pattern or legal issue | Can the system find applicable precedent? |
| Validate Authority | Determine if authority remains good law | Can the system detect overruling or adverse treatment? |
| Fact Extraction | Extract disposition, holding, and outcome from opinion text | Can the system identify legally relevant facts? |
| Distinguish Cases | Determine whether precedent applies or can be distinguished | Can the system reason about doctrinal relationships? |
| Synthesize Results | Integrate authorities into coherent IRAC analysis | Can the system produce competent legal work product? |
| Citation Integrity | Ensure all cited authorities exist and support propositions cited | Does the system meet professional responsibility standards? |
These components are dependencies, not independent benchmarks. Each component's output feeds the next. The chain succeeds or fails as a unit; failure at any point propagates downstream.
2.3 Professional Responsibility as Hard Constraint
Citation Integrity occupies a special position. Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable.
This is a binary gate, not a soft metric: work product either meets professional standards or it fails. Shultz & Zedeck's Factor 21—"Integrity & Honesty"—confirms that professional responsibility is foundational to lawyer effectiveness.
2.4 What Research Execution Tests
The final synthesis validates Research Execution. AALL Principle IV's standard—applying gathered information "to resolve a specific issue or need"—is tested at synthesis. If synthesis produces a well-reasoned IRAC analysis using correct authorities, Research Execution succeeded. If it fails, error tracing reveals which upstream component broke.
Two implications follow:
Planning is implicit. Earlier frameworks treated "Research Planning" and "Strategic Stopping" as separate skills. We eliminate them as explicit test targets because synthesis validates planning. Grade the memo, not the plan.
Errors propagate. A model achieving 90% accuracy on each component will, if errors are independent, complete only 0.9^7 ≈ 48% of full research tasks successfully. This multiplicative penalty reflects legal practice.
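The compounding arithmetic can be checked directly. The following is a minimal sketch that assumes step errors are independent; the function name `chain_success_rate` is illustrative and not part of any benchmark code.

```python
from math import prod

def chain_success_rate(step_accuracies):
    """Probability that every step succeeds, assuming independent errors."""
    return prod(step_accuracies)

# Seven components, each at 90% accuracy.
rate = chain_success_rate([0.9] * 7)
print(f"{rate:.1%}")  # 47.8%
```

Because the result is a product, a single weak component caps the whole chain: no amount of strength elsewhere can raise the rate above the weakest step's accuracy.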
3.1 PTQ Methods
PTQ compresses model weights after training. Multiple methods exist, each with different compression strategies and preservation priorities:
| Method | Bit-Depth | Target | Core Strategy |
|---|---|---|---|
| AWQ | 4-bit | Weights | Activation-aware: identifies critical 1% of weights and protects them |
| GPTQ | 4-bit / 8-bit | Weights | Hessian-based: minimizes MSE of weight error layer-by-layer |
| GGUF (llama.cpp) | Var. | Weights + KV | Block-wise: flexible CPU/GPU offloading, no importance calibration |
| BNB-nf4 | 4-bit | Weights | NormalFloat: exploits normal distribution of weight values |
| SmoothQuant | 8-bit | Weight + Activation | Migrates activation spikes into weights |
| KIVI | 2-4 bit | KV Cache | Channel-wise compression of context memory |
The critical distinction is between activation-aware methods (AWQ) that identify and protect the weights most critical to reasoning, and block-wise methods (GGUF) that compress uniformly without importance calibration.
3.2 The AWQ vs. GGUF Divergence
AWQ identifies weight channels with high activation magnitude—the parameters most critical to output fidelity—and protects them from aggressive compression. This makes AWQ more robust for reasoning tasks than uniform quantization.
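The selection step can be illustrated with a short sketch. It only shows how channels might be ranked by activation magnitude; the published AWQ method goes further and searches for per-channel scaling factors, which is not reproduced here. The function name, the 1% default, and the calibration statistics are all assumptions made for illustration.

```python
def salient_channels(mean_abs_activation, fraction=0.01):
    """Return indices of the channels with the largest mean |activation|."""
    n = len(mean_abs_activation)
    k = max(1, round(n * fraction))  # always keep at least one channel
    ranked = sorted(range(n), key=lambda i: mean_abs_activation[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical calibration statistics for 200 input channels.
stats = [0.1] * 200
stats[17] = 9.5
stats[142] = 7.2
print(salient_channels(stats))  # [17, 142]
```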
GGUF (llama.cpp) compresses in block-level chunks without nuanced weight protection. It supports optional importance matrix calibration (llama-imatrix), but standard K-Quants (Q4_K_M, Q4_K_S) skip it.
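To make the block-level idea concrete, here is a simplified sketch of symmetric round-to-nearest quantization with one scale per block. It is not the llama.cpp K-quant format, which uses a more elaborate super-block layout; the function names and the toy weights are assumptions. The example shows how one large-magnitude weight sets the block scale and pushes its small neighbors to zero.

```python
def quantize_block(block, bits=4):
    """Symmetric round-to-nearest quantization with a single scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    scale = max(abs(w) for w in block) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]

block = [0.01, -0.02, 0.03, 8.0]  # toy block with one outlier
scale, codes = quantize_block(block)
print(codes)  # [0, 0, 0, 7]
```

In this toy block the three small weights are lost entirely after dequantization, while the outlier survives. Importance-aware methods exist to avoid making that trade blindly.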
For legal practice, the difference matters:
| Feature | AWQ (vLLM) | GGUF (llama.cpp) |
|---|---|---|
| Primary Hardware | High-end NVIDIA GPUs | CPUs, Apple Silicon, consumer GPUs |
| Operational Cost | $1-4/hour per GPU | $0.10-0.50/hour or free locally |
| Weight Protection | Activation-aware | Block-wise (no importance calibration) |
| Reasoning Preservation | Higher | Lower |
| Production Target | High-concurrency SaaS | Edge devices, privacy-first local tools |
A user downloading a "4-bit Llama" model cannot tell whether it was quantized with AWQ or GGUF. The label is identical. The legal reasoning quality may differ substantially.
3.3 Commercial Optimizations and Supply Chain Stacking
The PTQ method chosen at the model layer is only the beginning. Commercial infrastructure applies additional optimizations that compound degradation:
NVIDIA TensorRT-LLM (W4A8 / NVFP4)
NVIDIA's Model Optimizer uses W4A8 (4-bit weights, 8-bit activations) to double throughput on Blackwell GPUs. The move to NVFP4 (4-bit weights and activations) for "extreme efficiency" triggers reasoning collapse documented in the literature. This engine powers most high-scale AI cloud providers. They optimize for tokens-per-second—lowering the model's reasoning ceiling in the process.
Microsoft PrefixQuant (Weight-Activation Optimization)
PrefixQuant compresses both weights and activations to achieve high throughput. This optimization caused a 71% collapse in high-level knowledge verification (GPQA). It allows a company to serve 4× more users on the same hardware while effectively degrading the model's validation capabilities.
Snowflake SwiftKV (KV-Cache Compression)
SwiftKV compresses context memory (KV-cache) to process massive datasets (128k tokens) on cheaper GPUs. This causes the 59% drop in retrieval accuracy Mekala et al. (2025) documented. A legal-tech vendor can claim to "review 1,000 contracts for the price of 100"—but retrieval is structurally unreliable.
3.4 The Stacking Problem
A legal AI product may encounter compression at multiple layers:
| Layer | What's Happening | Who Controls It |
|---|---|---|
| Model weights | PTQ compression (AWQ, GPTQ, GGUF) | Model provider, vendor |
| Activations | W4A8, NVFP4 | NVIDIA, cloud infrastructure |
| KV-cache | SwiftKV, KIVI | Hosting provider |
| Routing | Peak-hour downgrades | Everyone |
Each layer compounds. Users see none of it. A lawyer using "LegalEagle 2.0" cannot tell whether the underlying model runs INT8 or INT4, whether AWQ or GGUF was used, whether TensorRT applies additional compression, or whether KV-cache is squeezed to handle long documents.
This Section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires. Some evidence comes from quantization studies; some from adjacent work on long-context degradation, reasoning under compression, or legal benchmark difficulty. The mapping establishes plausibility; Section VI provides direct empirical confirmation.
Mapping Human Skills to AI Procedural Workflows
Each skill translates into AI operations:
| Skill | Human Activity | AI Workflow |
|---|---|---|
| S1: Research Planning | Scope, research, databases | Query decomposition, tool selection |
| S2: Strategic Stopping | Recognizing diminishing returns | Termination conditions, confidence thresholds |
| S3: Known Authority | Citation lookup | Exact retrieval from identifier |
| S4: Unknown Authority | Issue-based searching | Semantic retrieval, query expansion |
| S5: Validating Authority | Shepardizing | Treatment classification, status verification |
| S6: Fact Extraction | Document review | Long-context retrieval, entity extraction |
| S7: Distinguishing Cases | Analogical reasoning | Holding comparison, fact matching |
| S8: Synthesizing Results | Memo drafting | Multi-document generation |
| S9: Citation | Ethics compliance | Hallucination avoidance, attribution |
This mapping enables precise questions: when quantization degrades "multi-hop reasoning," which lawyer skills suffer? When embedding compression causes "retrieval blindness," which workflows fail?
4.1 Summary: The Full Skill Surface Is At Risk
| Skill | Mechanism Evidence | Risk Level |
|---|---|---|
| S3: Known Authority | Long-context degradation | High |
| S4: Unknown Authority | Reasoning + retrieval | High |
| S5: Validating Authority | Temporal reasoning | Medium |
| S6: Fact Extraction | Long-context retrieval | High |
| S7: Distinguishing Cases | Multi-step reasoning | High |
| S8: Synthesizing Results | Integration + accuracy | High |
| PR: Prof. Responsibility | Fabrication resistance | Very High |
4.2 Empirical Studies on Quantization Effects on Reasoning Skills
4.2.1 Research Planning
Mechanism: Research planning requires decomposing complex queries into subtasks—the multi-hop reasoning Li et al. showed degrades up to 4× under quantization.
| Study | Finding |
|---|---|
| ACBench (Dong et al., 2025) | 4-bit quantization creates a critical divergence between apparent competence and actual reliability, with accuracy dropping 10-15% in real-world applications. |
| Liu et al. (2025) | Lower bit-width quantization introduces task-difficulty-dependent accuracy risks. The study evaluates KV-cache and activation quantization as well as weight quantization. |
| IntactKV (2024) | Provides mechanism support: KV-cache quantization can be a failure point, which bears on workflow state maintenance. |
4.2.2 Strategic Stopping
Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE (expected calibration error) studies show quantized models become overconfident, unable to recognize their own uncertainty.
| Study | Finding |
|---|---|
| Zhong et al., 2025 | Quantized LLMs are worse-calibrated than full-precision counterparts in 85% of measurements (41 of 48 test conditions). Quantization systematically produces overconfidence. |
| Q-Misalign (Dong, Li & Guo, 2025) | Safety alignment degrades under quantization; dormant vulnerabilities emerge post-compression. |
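For readers unfamiliar with the metric, expected calibration error can be sketched as follows. This is a common equal-width-bin formulation, not necessarily the exact variant used in the cited studies; the function name, bin count, and example data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - avg_conf)
    return ece

# Hypothetical overconfident model: 95% confident, correct half the time.
ece = expected_calibration_error([0.95] * 4, [True, False, True, False])
print(round(ece, 2))  # 0.45
```

A model that cannot tell when it is unsure has no principled basis for deciding that its research is complete, which is why calibration bears on strategic stopping.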
4.2.3 Finding Known Authority
Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.
| Study | Finding |
|---|---|
| Mekala et al. (2025) | 8-bit roughly preserved; 4-bit methods produce losses up to 59%, especially for long-context inputs. Effect varies by method/model/task. |
| LegalBench-RAG (2024) | Legal-domain benchmark isolating retrieval quality. Legal retrieval is hard even before quantization. |
4.2.4 Finding Unknown Authority
Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Liu et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks. Combined with Zheng's difficulty findings, this suggests quantization severely impairs this skill.
| Study | Finding |
|---|---|
| Li et al. | Low-bit quantization degrades complex math reasoning by up to 32.39% (avg. 11.31%), specifically in numerical computation and planning. |
| Liu et al., 2025 | Lower bit-widths introduce significant accuracy risks; impact depends on task difficulty. Affects DeepSeek-R1, LLaMA, and Qwen. |
| Yazan, Verberne & Situmeang (2024) | In RAG pipelines, quantization may not impair retrieval when base LLM performs well, but smaller models show high sensitivity to context length and setup. |
4.2.5 Validating Authority
Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). Outlier weight clipping destroys these fine-grained distinctions.
| Study | Finding |
|---|---|
| Liu et al. (2025) | W8A8/W4A16 can be lossless; lower bit-widths introduce significant accuracy risks. Task difficulty is critical, which places authority validation in the high-risk regime. |
| MixKVQ (Zhang et al. 2025) | Low-bit KV-cache quantization exhibits severe degradation on complex reasoning. Fixed-precision at very low bit-widths produces large quantization errors and critical failures. |
| TimeBench | GPT-4 achieves only 66.4% on implicit temporal relationships. Accuracy varies from 40% to 92% depending on how temporal facts are organized. TRAVELER shows implicit temporal reasoning degrades 39% as context scales from 5 to 100 events. |
| arXiv.04823 | Accuracy drops exceed 10% at 4-bit for reasoning; LexTime achieves only 80.8% on temporal event ordering with 4-bit models. Validating authority requires multi-hop temporal reasoning, which is weak at baseline and catastrophic under quantization. |
4.2.6 Fact Extraction
Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying absent information) implicates document review reliability.
| Study | Finding |
|---|---|
| Mekala et al. (2025) | Up to 59% degradation on long-context extraction tasks at 4-bit quantization. Extracting holdings requires identifying the specific legal rule announced by a court, distinguishing it from dicta, and accurately capturing its scope and limitations. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated: the quantized system generates authoritative-sounding rules that the cited case never announced. |
4.2.7 Distinguishing Cases
Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences—multi-step reasoning, the capacity most vulnerable to quantization.
| Study | Finding |
|---|---|
| Dahl et al. (Journal of Legal Analysis, 2024) | Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions. Combined with Li et al.'s 32.39% reasoning degradation under quantization, this points to high unreliability. The baseline 58-88% hallucination rate is for unquantized models; adding 4-bit compression plausibly widens an already critical reliability gap. |
| Liu et al., 2025 | Low-bit regimes create accuracy risks on hard reasoning tasks, the cognitive substrate for case distinction. |
4.2.8 Synthesizing Results
Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC reports that strong models produce highly rated analyses while hallucinating—good writing does not mean truthful authority.
| Study | Finding |
|---|---|
| LegalEval-Q (Li & Wu, 2025) | Measures clarity, coherence, and terminology quality. Reports that quantization has negligible impact on these writing-quality metrics, consistent with fluency being preserved while accuracy degrades. |
| Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv.11401) | Establishes retrieval plus generation as a distinct paradigm: parametric memory alone is insufficient, and provenance and updating are core motivations. Finding unknown authority and synthesizing it is therefore a retrieval-conditioned reasoning task rather than generic generation. |
4.2.9 Professional Responsibility
Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips outlier weights encoding rare-but-accurate associations.
| Study | Finding |
|---|---|
| Q-Misalign (Dong et al., 2025) | Safety alignment is not preserved by quantization but is contingent on precision. Vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes. Combined with Dahl et al.'s finding that even unquantized models hallucinate legal information at 58-88% rates while being unable to detect their own errors, quantized legal AI systems present a dual threat to professional responsibility. |
| Li et al. (2024) | 4-bit quantization significantly weakens fabrication resistance. |
| Dahl et al. (2024) Large Legal Fictions | LLMs hallucinate legal authority at alarming rates (58-88%) on verifiable legal queries. |
4.1 Research Execution as Integrated Competency
Before asking whether quantization degrades legal AI, we must define what legal AI is supposed to do. This Section operationalizes legal work not as a collection of independent skills, but as Research Execution—the integrated professional competency of completing a legal research task from question to answer.
Three authoritative sources converge on this construct:
MacCrate Report (ABA, 1992).
Identifies legal research as a fundamental lawyering skill, emphasizing "devising and implementing a coherent and effective research design" that integrates issue identification, source selection, and strategic execution.
AALL Principles and Standards for Legal Research Competency (2013): Principle IV states that "a successful legal researcher applies information effectively to resolve a specific issue or need." The standard elaborates: the competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."
Shultz & Zedeck empirical study of lawyer effectiveness (2011): Their 26 "Lawyering Effectiveness Factors" are job performance measures—derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?" Relevant factors include "Fact Finding" (identifying relevant facts and issues), "Researching the Law" (utilizing appropriate sources and strategies), and "Integrity & Honesty" (acting with core professional values).
These sources do not describe seven independent skills tested in isolation. They describe a workflow that succeeds or fails as a unit. A lawyer who finds the right case but extracts the wrong holding has not executed research competently. A lawyer who synthesizes brilliantly but cites fabricated authority has produced worthless work product.
4.2 The Seven Components of Research Execution
Research Execution comprises seven sequential operations that lawyers perform when completing a legal research task:
| Component | Operation | What It Tests |
|---|---|---|
| Known Authority | Resolve a known citation to correct case metadata | Can the system retrieve specific authorities? |
| Unknown Authority | Retrieve relevant law from a fact pattern or legal issue | Can the system find applicable precedent? |
| Validate Authority | Determine if authority remains good law | Can the system detect overruling or adverse treatment? |
| Fact Extraction | Extract disposition, holding, and outcome from opinion text | Can the system identify legally relevant facts? |
| Distinguish Cases | Determine whether precedent applies or can be distinguished | Can the system reason about doctrinal relationships? |
| Synthesize Results | Integrate authorities into coherent IRAC analysis | Can the system produce competent legal work product? |
| Citation Integrity | Ensure all cited authorities exist and support propositions cited | Does the system meet professional responsibility standards? |
These components are not independent benchmarks—they are dependencies in a workflow. Each component's output becomes input for subsequent components. The chain succeeds or fails as a unit, and failure at any point propagates downstream.
4.3 Professional Responsibility as Hard Constraint
Citation Integrity occupies a special position. Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is not merely imperfect; it is professionally worthless and potentially sanctionable.
This is not a soft performance metric. It is a binary gate: the work product either meets professional standards or it does not. Shultz & Zedeck's Factor 21—"Integrity & Honesty: has core values and beliefs; acts with integrity and honesty"—confirms that professional responsibility is foundational to lawyer effectiveness, not an optional overlay.
4.4 What Research Execution Tests
Research Execution is validated by whether the final synthesis succeeds. AALL Principle IV's standard—applying gathered information "to resolve a specific issue or need"—is tested at the synthesis step. If the synthesis produces a well-reasoned IRAC analysis using correct authorities, Research Execution worked. If it fails, error tracing reveals which upstream component broke.
This framing has two implications:
Planning is implicit. Earlier frameworks (including earlier drafts of this study) treated "Research Planning" and "Strategic Stopping" as separate skills. We eliminate them as explicit test targets because planning is validated by whether synthesis succeeds. You don't grade the plan; you grade the memo.
Errors propagate. A model achieving 90% accuracy on each independent component will, under independence assumptions, complete only 0.9^7 ≈ 48% of full research tasks successfully. This multiplicative penalty reflects the reality of legal practice.
5.1 Research Execution as Job Performance
Section II framed legal work as Research Execution—the integrated competency of completing a legal research task from question to answer. Shultz & Zedeck's empirical study confirms this framing: their 26 "Lawyering Effectiveness Factors" are job performance measures, observed in the execution of actual legal work rather than tested in isolation.
AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need." The competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."
This describes a workflow that succeeds or fails as a unit.
5.2 The Legal-7 Chain
Legal-7 (L7) operationalizes Research Execution as a seven-step dependent chain:
| Step | Name | Modality | Task | Ground Truth |
|---|---|---|---|---|
| S1 | Known Authority | RAG | Resolve known citation to correct authority | SCDB citation lookup |
| S2 | Unknown Authority | RAG | Retrieve relevant law from fact pattern | shepards_data.csv |
| S3 | Validate Authority | RAG | Determine if authority remains good law | scotus_overruled_db.csv |
| S4 | Fact Extraction | RAG | Extract disposition, holding, outcome from opinion | SCDB metadata + opinion text |
| S5 | Distinguish Cases | RAG + CB | Decide if precedent applies or can be distinguished | shepards.agree field |
| S6 | IRAC Synthesis | CB | Write IRAC-structured legal analysis | MEE rubric + chain grounding |
| S7 | Citation Integrity | CB | Verify no fabricated citations in S6 output | fake_cases.csv + SCDB |
The chain maps to IRAC:
- Rule Phase (S1–S3): Identify, retrieve, and validate legal authority
- Application Phase (S4–S5): Extract facts and apply precedent through distinction
- Conclusion Phase (S6–S7): Synthesize analysis and verify citation integrity
The Issue component is implicit in the query. S6 is the capstone: it tests whether Research Execution succeeded.
5.3 Why S6 Validates the Chain
S6 runs closed-book: the model cannot return to sources. It must synthesize an IRAC memo from what it gathered in S1–S5.
This design reflects AALL Principle IV: applying gathered information to resolve an issue. If S6 produces a well-reasoned IRAC analysis using correct authorities, Research Execution worked. If S6 fails, error tracing reveals which upstream step broke:
| If Step Fails... | Cascade Effect |
|---|---|
| S1 (Known Authority) | Wrong case → all downstream analysis corrupted |
| S2 (Unknown Authority) | Missing precedent → incomplete rule statement |
| S3 (Validate Authority) | Citing bad law → S6 argument fails |
| S4 (Fact Extraction) | Wrong facts → S5 distinction invalid |
| S5 (Distinguish) | Wrong application → S6 conclusion unsupported |
| S6 (IRAC Synthesis) | Poor reasoning → chain fails at capstone |
| S7 (Citation Integrity) | Fabrication detected → S6 voided, chain fails |
A model achieving 90% accuracy on each skill completes only 0.9^7 ≈ 48% of full chains. This multiplicative penalty reflects legal practice: one fabricated citation makes a brief worthless.
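Error tracing in a dependent chain amounts to finding the earliest failed step. The sketch below is illustrative only: the `first_failure` helper and the result dictionary are hypothetical, and a missing step is treated as a failure by assumption.

```python
CHAIN = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]

def first_failure(step_passed):
    """Return the earliest failed step in chain order, or None if all passed."""
    for step in CHAIN:
        # A step with no recorded result is treated as failed (assumption).
        if not step_passed.get(step, False):
            return step
    return None

results = {"S1": True, "S2": True, "S3": False, "S4": True,
           "S5": True, "S6": False, "S7": True}
print(first_failure(results))  # S3
```

Here the S6 failure is attributed to S3: the memo failed because it relied on authority that was never validated, not because synthesis itself broke.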
5.4 S7 as Professional Responsibility Gate
S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty."
Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable. L7 mirrors this: if S7 detects any fabricated citation in the S6 output, S6 scores zero regardless of reasoning quality.
The gate operates deterministically: citations from S6 are checked against SCDB (real cases) and fake_cases.csv (known fabrications). No LLM-as-judge evaluation required.
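A deterministic gate of this kind might look like the sketch below. It assumes citations are in U.S. Reports form (for example "347 U.S. 483") and uses two small in-memory sets as stand-ins for SCDB and fake_cases.csv. The regex, the function name, and the choice to treat a citation found in neither set as invalid are assumptions, not the study's implementation.

```python
import re

# Matches U.S. Reports citations such as "347 U.S. 483" (assumed format).
US_CITE = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")

def citation_gate(memo_text, real_cites, fake_cites):
    """Extract citations and check each against known-real and known-fake sets."""
    found = US_CITE.findall(memo_text)
    # A citation passes only if it is known to be real and not a known fabrication.
    all_valid = all(c in real_cites and c not in fake_cites for c in found)
    return {"citations_found": found, "all_valid": all_valid}

real = {"347 U.S. 483", "358 U.S. 1"}  # stand-in for SCDB
fake = {"999 U.S. 123"}                # stand-in for fake_cases.csv
memo = ("Brown v. Board of Education, 347 U.S. 483 (1954), was followed in "
        "Cooper v. Aaron, 358 U.S. 1 (1958).")
print(citation_gate(memo, real, fake)["all_valid"])  # True
```

A memo with no citations passes this sketch vacuously, so a real gate would likely also require a minimum number of grounded citations.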
5.5 S5 Dual-Modality: The Reasoning Bridge
S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning. The model must:
- Understand the holding of a precedent case
- Understand the facts of the current case
- Determine whether the precedent applies or can be distinguished
This is not retrieval. This is reasoning about retrieved content—precisely what AALL Principle IV demands when it requires synthesizing doctrine across "cases similar, but not identical."
To isolate the reasoning component, L7 tests S5 in two modalities:
S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information. This matches real lawyer workflow—attorneys distinguish cases with the opinions open.
S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone, without copying from source material.
The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:
| S5-RAG | S5-CB | Interpretation |
|---|---|---|
| High | High | Model reasons well |
| High | Low | Model copies, doesn't reason (FRD signature) |
| Low | Low | Model cannot perform the task |
A model exhibiting FRD will show a large RAG-CB gap: it can "distinguish" cases when it has the full text to copy from, but cannot reason about the legal relationship from the holding alone. This is precisely the failure mode we hypothesize quantization induces.
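The interpretation table can be expressed as a small classifier. This is a sketch: the 0.7 cutoff separating "High" from "Low" is an arbitrary assumption, the function names are hypothetical, and the table does not cover the case where closed-book accuracy is high but RAG accuracy is low.

```python
def frd_gap(rag_acc, cb_acc):
    """RAG-minus-CB accuracy gap on S5."""
    return rag_acc - cb_acc

def interpret_s5(rag_acc, cb_acc, threshold=0.7):
    """Map S5-RAG and S5-CB accuracy onto the interpretation table."""
    rag_high = rag_acc >= threshold
    cb_high = cb_acc >= threshold
    if rag_high and cb_high:
        return "reasons well"
    if rag_high and not cb_high:
        return "FRD signature: copies, does not reason"
    if not rag_high and not cb_high:
        return "cannot perform the task"
    return "not covered by the table (RAG low, CB high)"

print(interpret_s5(0.90, 0.45))  # FRD signature: copies, does not reason
```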
5.6 Grading Architecture
Six of L7's seven steps are graded objectively:
| Step | Grading Method | Ground Truth Source |
|---|---|---|
| S1 | Exact match | SCDB citation |
| S2 | MRR / Hit@k | Shepard's precedent relationships |
| S3 | Exact match | scotus_overruled_db |
| S4 | Exact match (disposition, party) | SCDB metadata |
| S5 | Exact match | shepards.agree field |
| S6 | Hybrid (50% objective, 50% LLM-as-Judge) | Chain grounding + MEE rubric |
| S7 | Deterministic | Citation existence check |
Only S6 requires rubric-based evaluation. The 50% objective component ("chain grounding") verifies that S6 correctly incorporates outputs from S1–S5—did the model use the authorities it found? The 50% subjective component applies MEE (Multistate Essay Examination) bar exam standards to assess legal reasoning quality.
This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic, and even the subjective portion is anchored to the chain's objective outputs.
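The hybrid S6 score, together with the S7 gate, can be sketched as follows. Both sub-scores are assumed to be normalized to the range 0 to 1; the function and parameter names are hypothetical.

```python
def s6_score(chain_grounding, rubric, s7_all_valid):
    """Hybrid S6 score in [0, 1]; voided if S7 finds a fabricated citation."""
    if not s7_all_valid:
        return 0.0  # any fabricated citation zeroes the memo
    return 0.5 * chain_grounding + 0.5 * rubric

print(s6_score(1.0, 0.5, True))   # 0.75
print(s6_score(1.0, 0.5, False))  # 0.0
```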
5.7 Task Structure: From Case to Chain
Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network. The anchor case (cited_case) provides the legal authority to be researched; the citing case establishes the doctrinal relationship to be analyzed.
Initial Scenario
A chain instance contains:
| Element | Source | Example |
|---|---|---|
| Cited Case | scdb_sample.csv | Brown v. Board of Education, 347 U.S. 483 (1954) |
| Citing Case | scotus_shepards_sample.csv | Cooper v. Aaron, 358 U.S. 1 (1958) |
| Shepard's Signal | shepards field | "followed" |
| Doctrinal Agreement | agree field | True (citing case follows precedent) |
| Overrule Status | scotus_overruled_db.csv | None (not overruled) |
| Opinion Text | majority opinion field | Full text of majority opinion |
The model receives this case pair and must execute the seven-step chain, with each step's output feeding subsequent steps.
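One way to represent a chain instance in code is a small immutable record. The class and field names below are illustrative and loosely follow the table above; they are not the study's schema, and the opinion text is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChainInstance:
    cited_case: str              # anchor case to be researched
    citing_case: str             # later case that cites the anchor
    shepards_signal: str         # e.g. "followed"
    agree: bool                  # doctrinal agreement flag
    overruled_by: Optional[str]  # None if the anchor is still good law
    opinion_text: str            # majority opinion text

example = ChainInstance(
    cited_case="Brown v. Board of Education, 347 U.S. 483 (1954)",
    citing_case="Cooper v. Aaron, 358 U.S. 1 (1958)",
    shepards_signal="followed",
    agree=True,
    overruled_by=None,
    opinion_text="...",  # placeholder, not real opinion text
)
print(example.shepards_signal)  # followed
```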
Task Types by Step
| Step | Task Type | Input | Expected Output |
|---|---|---|---|
| S1 | Known Authority | Case name or citation | {us_cite, case_name, term} |
| S2 | Unknown Authority | Legal issue from anchor case | Ranked list of citing cases |
| S3 | Validate Authority | Citation from S1 | {is_overruled, overruling_case, year_overruled} |
| S4 | Fact Extraction | Opinion text | {disposition, party_winning, holding_summary} |
| S5 | Distinguish (Closed-Book) | S4 holding + citing case metadata | {agrees, reasoning} |
| S5 | Distinguish (RAG) | S4 holding + full citing opinion | {agrees, reasoning} |
| S6 | IRAC Synthesis | All prior outputs | {issue, rule, application, conclusion} |
| S7 | Citation Integrity | S6 output | {citations_found, all_valid} |
Scoring Summary
| Step | Ground Truth | Scoring Method |
|---|---|---|
| S1 | SCDB metadata | Exact match |
| S2 | Shepard's citing_case_us_cite | MRR, hit@10 |
| S3 | scotus_overruled_db | Binary match on is_overruled |
| S4 | SCDB caseDisposition, partyWinning | Closed enum exact match |
| S5 | Shepard's agree field | Binary match |
| S6 | MEE rubric + chain grounding | Hybrid (50% objective, 50% rubric) |
| S7 | fake_cases.csv + SCDB | Deterministic lookup |
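The S2 ranking metrics are standard; a minimal sketch is shown below. The function names and the (ranked list, relevant set) input layout are assumptions made for illustration.

```python
def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant item in `ranked`, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """`queries` is a list of (ranked_list, relevant_set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def hit_at_k(ranked, relevant, k=10):
    """True if any relevant item appears in the top k results."""
    return any(item in relevant for item in ranked[:k])
```

Here `relevant` would hold the citing-case citations recorded in Shepard's for the anchor case, and `ranked` the model's ordered list of candidate citations.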
5.8 What L7 Detects That Parallel Benchmarks Cannot
The Dahl et al. benchmark tests the same underlying data but in parallel: each task is independent, and a model can achieve high aggregate scores while being functionally incapable of completing a single end-to-end workflow. If the 15% failures are distributed randomly, some workflows succeed. But if failures cluster at early-chain positions—as our quantization hypothesis predicts—then independent task accuracy becomes a misleading proxy for Research Execution capability.
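To make the arithmetic concrete, the sketch below reads "the 15% failures" as a per-step failure rate across a seven-step chain with independent failures. That reading, and the second accuracy profile with failures concentrated at the first step, are illustrative assumptions rather than measured values.

```python
import math


def chain_success_rate(step_accuracies):
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_accuracies)


# Illustrative assumption: 85% accuracy at each of seven steps.
uniform = chain_success_rate([0.85] * 7)

# Same mean accuracy (0.85), but failures concentrated at the first step.
clustered = chain_success_rate([0.55] + [0.90] * 6)

print(f"uniform:   {uniform:.3f}")
print(f"clustered: {clustered:.3f}")
```

Under these assumptions only about a third of chains complete without error, even though every step looks 85% accurate when scored in isolation, and concentrating the failures early lowers that share further.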
L7 detects three failure modes invisible to parallel evaluation:
Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.
FRD signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.
Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not. Parallel benchmarks score fabrication as one error among many; L7 treats it as disqualifying.
For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.
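The hard-gate behavior of S7 can be sketched as below. The regular expression handles only simple U.S. Reports citations, and the rule that an invalid citation zeroes the S6 score is an assumption about how voiding is applied. The function names are hypothetical.

```python
import re

# Matches simple U.S. Reports citations such as "347 U.S. 483" (illustrative only).
US_CITE = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")


def citation_integrity(memo, known_cites):
    """Extract U.S. citations from the memo and check each against a known set."""
    found = US_CITE.findall(memo)
    return {
        "citations_found": found,
        "all_valid": all(cite in known_cites for cite in found),
    }


def gated_s6_score(s6_score, s7_result):
    """Void the synthesis score when any cited authority cannot be verified."""
    return s6_score if s7_result["all_valid"] else 0.0
```

Note that a memo with no citations passes this check trivially, so a real implementation would likely also require a minimum number of citations.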
The following Section applies L7 to test Fluency-Reasoning Divergence across quantization regimes.
6.1 Design Choice: Ecological Validity Over Experimental Purity
This study prioritizes ecological validity over experimental purity. The objective is not to isolate quantization as a laboratory variable under controlled conditions, but to measure what happens to legal reasoning under the deployment configurations users actually encounter.
A secondary objective emerged from the market analysis in Section III: demonstrating that "4-bit" is not a specification. Two deployments at identical bit-depth using different PTQ methods may produce materially different legal reasoning—and users cannot distinguish them. This study tests that claim empirically.
The design involves deliberate tradeoffs. By testing multiple PTQ methods, we gain the ability to show method variance. By using officially released and community-standard quantization rather than a single controlled PTQ pipeline, we lose clean causal isolation but gain results that reflect what lawyers actually face.
We accept these tradeoffs because the policy-relevant questions are:
- What are lawyers exposed to in practice?
- Can users distinguish between "good 3-bit" and "bad 3-bit"?
The answer to the second question, we hypothesize, is no—and we aim to prove it.
The inability to make clean causal attributions is not a limitation of this study. It is a finding about the market.
6.2 Research Questions
RQ1. Existence. Does aggressive quantization reduce legal-reasoning accuracy more than it reduces surface fluency?
RQ2. Generalizability. Is FRD observable across multiple legal task types, or is it benchmark-specific?
RQ3. Regimes. Can we distinguish a "silent defect zone" (INT8→INT4: reasoning degraded, fluency intact) from an "obvious defect zone" (2-bit: both collapse)?
RQ4. PTQ Method Variance. At identical bit-depth, do different PTQ methods produce significantly different legal reasoning outcomes?
6.3 Hypotheses
H1: Reasoning Degradation (Silent Defect Zone)
Legal reasoning accuracy will degrade significantly between INT8 and INT4 across all three arms, with the largest drops in categories requiring multi-step inference, exception-tracking, or doctrinal distinction.
H2: Fluency Stability (FRD Signature)
Fluency metrics (perplexity, grammaticality, coherence) will remain stable across precision tiers even as reasoning accuracy degrades—confirming Fluency-Reasoning Divergence.
H3: Catastrophic Collapse (Obvious Defect Zone)
At 2-bit precision, both fluency and reasoning will degrade across all arms, producing visible failure modes distinguishable from the latent degradation at INT4.
H4: Category-Specific Vulnerability
Different Dahl task types will show differential sensitivity. Tasks 5-10 (reasoning-intensive) will degrade faster than Tasks 1-4 (factual).
H5: Architecture Generalization
Arms A and B (Qwen and Llama under high-quality PTQ) will both show FRD, demonstrating that degradation is not architecture-specific.
H6: PTQ Method Divergence
At identical bit-depth, AWQ (Arm B) will significantly outperform GGUF (Arm C), demonstrating that bit-depth alone is an insufficient specification. The B-C gap at 4-bit may exceed the 8→4 gap within a single arm.
6.4 Model and Precision Regimes
Design: Three arms testing two model families under two PTQ philosophies.
| Tier | Model 1 | Model 2 | PTQ | Rationale |
|---|---|---|---|---|
| 8-bit | Llama 3.1 8B | Qwen 2.5 7B | ?? | Ecological validity, higher performance quality |
| 4-bit | | | | |
| 3-bit | | | llama.cpp (GGUF) | Low-cost PTQ: block-wise compression, no importance calibration |
| 2-bit | | | | |
Rationale for Study Design
Arms A and B test FRD under "reasonable quality" PTQ—what a careful vendor or sophisticated deployer might ship. These represent the ceiling of what users could hope to encounter.
Arm C tests FRD under the PTQ method that economic pressures actually favor. Startups without GPU budgets, solo developers, and cost-constrained deployments default to llama.cpp because it runs on consumer hardware. This represents what many users actually encounter.
The B vs C comparison tests the same model across different PTQ techniques: same base weights, same nominal bit-depth, different compression philosophy. This directly tests whether PTQ method produces material differences at identical bit-depth (see Section III).
Why INT8 As Baseline
Full-precision models exist but are rarely what users encounter. The economic realities documented in Section III push vendors toward compressed deployments. INT8 represents the "reasonable production floor"—the precision tier a user might plausibly expect from a serious legal AI product. Testing degradation relative to INT8 answers the question users actually face: "How much worse does it get from here?"
Rationale For Selection
- Verifiable ground truth. Unlike benchmarks requiring LLM-as-judge evaluation, Dahl uses structured case metadata that can be programmatically verified.
- Multi-skill coverage. The 10 task types span multiple L-10 skills, testing factual recall, reasoning, temporal judgment, and fabrication resistance.
- FRD operationalization. The task gradient, from simple factual tasks (1-4) to complex reasoning tasks (5-10), allows observation of whether errors cluster in reasoning-intensive categories while fluency-dependent categories remain stable.
- Cross-arm comparability. All three arms run identical queries, enabling direct comparison across architectures and PTQ methods.
H4 Prediction: Category-Specific Vulnerability
| Tier | Tasks 1-4 (Factual) | Tasks 5-8 (Reasoning) | Tasks 9-10 (Hardest) |
|---|---|---|---|
| INT8 | ✓ Stable | ✓ Stable | ✓ Stable |
| INT4 | ✓ Stable | ↓ Silent degradation | ↓↓ Degradation |
| 2-bit | ↓ Degradation | ↓↓ Collapse | ↓↓↓ Collapse |
The gradient is the FRD signature. Simple tasks hold. Reasoning tasks degrade silently. Hardest tasks fail first and worst.
6.5 Evaluation Architecture: HELM Integration
This study implements evaluation using Stanford CRFM's Holistic Evaluation of Language Models (HELM) framework. Seven considerations motivate this choice:
- Audit trail. HELM produces SHA-256 hashed bundles containing prompts, outputs, and configuration state, enabling independent verification.
- Standardized reporting. HELM's structure naturally accommodates our 9-configuration × 10-task matrix with consistent output formats.
- Credibility signal. HELM is the evaluation infrastructure behind Stanford CRFM's published model assessments, signaling alignment with established best practices.
- Reproducibility. The entire study can be re-executed with a single command; all configuration is declarative.
- YAML-based configs. Model swapping requires only configuration file updates, not code changes.
- Caching. HELM provides crash-safe resume for 50,000+ inference calls across potentially unstable quantized models.
- Extensibility. Adding a future Arm D requires only a new YAML entry rather than architectural changes.
The Dahl benchmark is integrated as a custom HELM scenario (DahlHallucinationScenario) that loads task data from the RegLab repository, structures queries as HELM instances, and scores outputs using Dahl's correctness_checks.py logic with GPT-4 as judge for semantic evaluation.
6.6 Metrics
| Category | Metric | What It Measures | Applied To / Purpose |
|---|---|---|---|
| Primary Outcome | Dahl (overall) | Correctness rate across all 10 task types | All arms, all tiers |
| | Dahl (by task) | Correctness rate per task type | All arms, all tiers |
| FRD Detection Metrics | Perplexity | Model confidence / fluency proxy | Detect fluency stability |
| | FRD Index | (Fluency stability) − (Reasoning degradation) | Composite divergence measure |
| Calibration Metrics | ECE | Confidence-accuracy alignment | Detect overconfidence |
| | Confidence by Correctness | Mean confidence on correct vs incorrect | Identify high-confidence errors |
| Forensic Metrics | EBPW | Effective bits per weight | Verify claimed precision |
| | dtype audit | Data type of served weights | Confirm no silent override |
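Expected calibration error (ECE) is conventionally computed by binning predictions by confidence and averaging the gap between confidence and accuracy within each bin. The sketch below uses equal-width bins; the bin count and the way confidences are obtained from the model are assumptions not fixed by the study design.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |confidence - accuracy| over equal-width confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece
```

A model that answers with 0.9 confidence but is right only half the time would score an ECE of 0.4, which is the overconfidence pattern this metric is meant to surface.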
6.7 Procedure
6.7.1 Environment
| Component | Specification |
|---|---|
| Hardware | RTX 5090 (32GB VRAM) |
| Framework | vLLM (Arms A, B), llama-cpp-python (Arm C) |
| Judge | GPT-4 via OpenAI API |
6.7.2 Model Artifacts
| Arm | Source | Artifacts |
|---|---|---|
| A | Alibaba official | Qwen2.5-7B-Instruct @ INT8/INT4/2-bit |
| B | HuggingFace | Meta-Llama-3.1-8B-Instruct-AWQ @ INT8/INT4/2-bit |
| C | HuggingFace | Meta-Llama-3.1-8B-Instruct-GGUF @ Q8_0/Q4_K_M/Q2_K |
6.7.3 Decoding Parameters (Fixed Across All Runs)
| Parameter | Value | Rationale |
|---|---|---|
| Temperature | 0.0 | Deterministic—eliminates sampling variance |
| Top-p | 1.0 | No nucleus sampling |
| Max tokens | 512 | Sufficient for Dahl responses |
| Seed | 42 | Reproducibility |
6.7.4 Run Protocol
- Load model at specified precision
- Verify dtype/EBPW before run
- Run full Dahl benchmark (all 10 task types)
- Log all outputs with metadata
- Score via GPT-4 judge + programmatic checks
- Repeat for each arm × tier (9 configurations)
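The EBPW verification step in the protocol above can be sketched as a ratio of stored weight bits to parameter count. The tolerance value and the function names are assumptions; quantized formats store scales and other metadata alongside the weights, so the effective figure usually sits somewhat above the nominal bit-depth.

```python
def effective_bits_per_weight(weight_bytes, n_params):
    """Total bits used to store the weights divided by the number of parameters."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return weight_bytes * 8 / n_params


def precision_matches(weight_bytes, n_params, claimed_bits, tolerance=1.0):
    """True if the effective bit-width is within `tolerance` bits of the claimed tier.

    The default tolerance is a hypothetical allowance for quantization metadata.
    """
    ebpw = effective_bits_per_weight(weight_bytes, n_params)
    return abs(ebpw - claimed_bits) <= tolerance
```

For example, a hypothetical 8-billion-parameter artifact occupying 4.5 GB works out to 4.5 bits per weight, consistent with a 4-bit claim; the same model at 8.5 GB would not be.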
6.7.5 Reproducibility Package
Per Legal-10 protocol:
- Full run bundles (prompts, raw outputs, scores, configuration)
- SHA-256 hashes for all artifacts
- Signed manifests
- Public append-only submission log
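The hashing step of the package above can be sketched with Python's standard library. The directory layout and function names are assumptions, and signing the resulting manifest is out of scope for this sketch.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large bundles need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir):
    """Map each file's path (relative to the bundle root) to its SHA-256 hex digest."""
    root = Path(bundle_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Serializing the returned dictionary with sorted keys gives a stable manifest that can itself be hashed or signed.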
6.8 Analysis Plan
6.8.1 Within-Arm Analysis (H1, H2, H3)
For each arm independently:
- Plot accuracy by precision tier (8 → 4 → 2)
- Plot perplexity by precision tier
- Identify silent defect zone (accuracy drops, perplexity stable)
- Identify obvious defect zone (accuracy drops, perplexity rises)
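The zone identification above can be sketched as a comparison of each tier against the INT8 baseline. Perplexity is computed in the usual way from token log-probabilities; the two thresholds are hypothetical placeholders, since the study does not fix cut-offs here.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def classify_tier(base_acc, base_ppl, acc, ppl, acc_drop_min=0.05, ppl_rise_max=0.10):
    """Label a tier relative to the baseline (threshold values are assumptions)."""
    reasoning_degraded = (base_acc - acc) >= acc_drop_min
    fluency_degraded = (ppl - base_ppl) / base_ppl > ppl_rise_max
    if reasoning_degraded and not fluency_degraded:
        return "silent defect zone"
    if reasoning_degraded and fluency_degraded:
        return "obvious defect zone"
    if fluency_degraded:
        return "fluency-only degradation"
    return "no defect detected"
```

With illustrative inputs, a tier that loses 15 accuracy points while perplexity moves from 6.0 to 6.2 is labeled silent, while one whose perplexity more than doubles is labeled obvious.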
6.8.2 Task-Level Analysis (H4)
- Heatmap: Task type × precision tier × arm
- Test whether Tasks 5-10 degrade faster than Tasks 1-4
- Identify which tasks show earliest/steepest decline
6.8.3 Architecture Comparison (H5)
- Compare Arm A vs Arm B at matched precision
- Report whether FRD appears in both architectures
6.8.4 PTQ Method Comparison (H6)
- Direct comparison: Arm B vs Arm C at each precision tier
- Statistical test: Is B-C gap at INT4 significant?
- Compare B-C gap magnitude to within-arm 8→4 gap
- If B-C gap ≥ 8→4 gap: "PTQ method matters as much as bit-depth"
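The analysis plan does not name a specific significance test. Because all arms answer identical queries, one reasonable option is a paired bootstrap over per-item correctness, sketched below. The choice of test, the resample count, and the function name are assumptions.

```python
import random


def paired_bootstrap_gap(correct_b, correct_c, n_resamples=2000, seed=0):
    """Observed accuracy gap (B minus C) with a 95% percentile bootstrap interval.

    `correct_b` and `correct_c` are per-item booleans for the same queries,
    in the same order.
    """
    if len(correct_b) != len(correct_c) or not correct_b:
        raise ValueError("inputs must be non-empty and aligned item by item")
    rng = random.Random(seed)
    n = len(correct_b)
    diffs = [int(b) - int(c) for b, c in zip(correct_b, correct_c)]
    observed = sum(diffs) / n
    samples = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = samples[int(0.025 * n_resamples)]
    upper = samples[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)
```

If the interval at INT4 excludes zero, the B-C gap would be treated as significant under this sketch; the observed gap can then be compared directly with the within-arm 8→4 accuracy drop.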
6.8.5 FRD Quantification
- Compute FRD Index per arm/tier
- Identify precision threshold where FRD is maximized
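The metrics table describes the FRD Index only informally, so the sketch below adopts one possible reading rather than the study's definition: relative reasoning loss minus relative fluency loss, both measured against the INT8 baseline. Under this reading the index peaks when accuracy falls while perplexity holds. All numbers are illustrative placeholders, not experimental results.

```python
def frd_index(base_acc, acc, base_ppl, ppl):
    """Relative reasoning loss minus relative fluency loss (assumed definition)."""
    reasoning_loss = (base_acc - acc) / base_acc
    fluency_loss = max(0.0, 1.0 - base_ppl / ppl)
    return reasoning_loss - fluency_loss


# Illustrative (accuracy, perplexity) pairs only; not experimental results.
tiers = {
    "INT8": (0.80, 6.0),
    "INT4": (0.60, 6.0),
    "2-bit": (0.20, 24.0),
}
base_acc, base_ppl = tiers["INT8"]
scores = {
    name: frd_index(base_acc, acc, base_ppl, ppl)
    for name, (acc, ppl) in tiers.items()
}
peak_tier = max(scores, key=scores.get)
```

With these placeholder values the index is highest at INT4 and returns to roughly zero at 2-bit, where fluency and reasoning collapse together.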
[To be completed after experiments]
8.1 FRD Confirmed
[To be completed after results]
8.2 The Governance Vacuum
The results confirm what Section III documented: the legal AI market is operating without oversight.
No disclosure requirements mandate that vendors reveal precision tiers. No standards define what "4-bit" means. No liability exposure yet connects quantization choices to malpractice. The economic incentives—from big cloud providers optimizing margins to startups avoiding GPU costs—all push toward aggressive compression.
8.3 Implications
For practitioners: Verification is structurally defeated. A lawyer using a legal AI tool cannot determine whether it is running at INT8 or INT4, AWQ or GGUF, with or without infrastructure-layer compression. The observability gap documented in Part I is empirically confirmed.
For vendors: The results create design-defect exposure. Quantization-induced FRD is foreseeable, configuration-linked, and undisclosed. Reasonable alternative designs (higher precision, transparent labeling) exist and are feasible.
For regulators: Precision tier should be a disclosed material specification. "4-bit" is not a specification—it is a label covering materially different products.
External Benchmark Injection Points. The L7 chain architecture enables integration with existing legal AI benchmarks at specific step positions. CaseHOLD (Zheng et al., 2021) maps to S5 closed-book mode, testing holding identification without full case context. LegalBench overruling detection maps to S3 (Validate Authority). LegalBench definition extraction maps to S4 (Fact Extraction). This injection architecture allows L7 to provide both native chained evaluation and external benchmark comparability within a single framework.
This study demonstrates that Fluency-Reasoning Divergence is empirically observable in legal AI systems: quantization degrades legal reasoning while surface fluency remains stable.
The degradation is not uniform across methods. AWQ and GGUF at identical bit-depth produce materially different legal reasoning, confirming that "4-bit" is a label, not a specification.
And the degradation is invisible to users. The opacity problem operates at every layer: model, PTQ method, infrastructure compression, routing. Users face a black box they cannot audit.
The industry is deploying compromised systems under nominal equivalence. The governance vacuum persists. The evidentiary gap is architectural.
This Article does not solve the problem. But it documents it—with enough precision that the documentation itself becomes actionable. When regulators ask "did anyone see this coming?", when courts ask "was this foreseeable?", when bar associations ask "what should lawyers have known?"—the answer is here.
Appendix A: Skill Framework Source Citations
Sources:
- MacCrate Report: American Bar Association. (1992). Legal education and professional development: An educational continuum (Report of the Task Force on Law Schools and the Profession: Narrowing the Gap), 138–141.
- AALL Principles: American Association of Law Libraries. (2013). Principles and standards for legal research competencies.
- Shultz & Zedeck: Shultz, M. M., & Zedeck, S. (2011). Predicting lawyer effectiveness: Broadening the basis for law school admission decisions. Law & Social Inquiry, 36(3), 620–661.
External Benchmark Injection Points. The L10 chain architecture enables integration with existing legal AI benchmarks at specific skill positions. CaseHOLD (Zheng et al., 2021) maps to S7 closed-book mode, testing holding identification without full case context. LegalBench overruling detection maps to S5 (Validating Authority). LegalBench definition extraction maps to S6 (Fact Extraction). This injection architecture allows L10 to provide both native chained evaluation and external benchmark comparability within a single framework.
| Skill | MacCrate | AALL | Shultz & Zedeck |
|---|---|---|---|
| S1: Research Planning | § 3.3(a): "devising and implementing a coherent and effective research design" | Principle II, Standard 1 | Factor 14: Strategic Planning |
| S2: Strategic Stopping | § 3.3(a)(iii): "assessing feasibility... in terms of time and financial constraints" | Principle II, Standard 4: "Recognizes when sufficient research has been done" | Factor 13: Organizing and Managing One's Own Work |
| S3: Known Authority | § 3.2: "Knowledge of the Fundamental Tools of Legal Research" | Principle II, Standard 2: "find the full text... given a legal citation" | Factor 6: Fact Finding |
| S4: Unknown Authority | § 3.2(a): using secondary sources to find primary authority | Principle II, Standard 3 | Factor 5: Researching the Law |
| S5: Validating Authority | § 3.1: "Knowledge of the Nature of Legal Rules" | Principle III, Standard 2: "verifies that the authority is current and still good law" | Factor 5: Researching the Law |
| S6: Fact Extraction | § 4.1: Factual Investigation | [Gap] | Factor 6: Fact Finding |
| S7: Distinguishing Cases | § 3.3(c): "distinguishing cases on their facts and reasoning" | Principle IV, Standard 1 | Factor 1: Analysis and Reasoning |
| S8: Synthesizing Results | § 2.2: "synthesizing the holdings of multiple cases" | Principle IV, Standard 2 | Factor 9: Writing |
| PR: Professional Responsibility | § 10: "Recognizing and Resolving Ethical Dilemmas" | Principle V, Standard 1 | Factor 21: Integrity & Honesty |