1.1 Problem: Quantization Degrades AI Reasoning—Benchmarks Miss It

When a law firm subscribes to an AI-powered legal research platform, computing costs drive the monthly invoice. Behind that invoice lies an engineering decision: run the model at full precision, or compress weights into 4-bit integers that quadruple throughput on the same hardware. Users gain speed; vendors cut costs. For vendors operating at scale, quantization is standard deployment practice.

Prior work suggests this compression is benign. Jin et al. found that 4-bit models "maintain performance comparable to their non-quantized counterparts" across MMLU, summarization, and arithmetic reasoning. Red Hat reported that quantized Llama 3.1 models recover 96–99% of baseline scores with "no discernible differences" in typical use.

But these benchmarks omit legal reasoning tasks. They ignore hallucination rates under legal-specific taxonomies. They bypass the multi-step doctrinal analysis—tracking exceptions, applying multi-factor tests, distinguishing precedent—that characterizes competent legal work.

Other work raises concern. Li et al. found up to 32% accuracy degradation on mathematical reasoning benchmarks while general text performance stayed above 95%. Liu et al. confirmed that quantization below 8-bit creates "significant accuracy risks" in reasoning-intensive models. The mechanism is understood: complex reasoning relies on rare, high-magnitude weight directions that aggressive compression clips first.

This study tests whether quantization creates a silent defect zone in legal reasoning (a regime where accuracy collapses while fluency remains stable) and whether post-training quantization (PTQ) methods produce materially different outcomes at identical bit-depths.

The central premise: legal reasoning is fragile in ways general benchmarks miss. Quantization may selectively degrade reasoning while leaving surface fluency intact—a divergence where the AI sounds competent but reasons poorly.

2.1 Research Execution as Integrated Competency

Before testing whether quantization degrades legal AI, we must define what legal AI should do. This Section frames legal work not as independent skills but as Research Execution—the integrated competency of completing a legal research task from question to answer.

Three authoritative sources converge on this construct:

MacCrate Report (ABA, 1992): Identifies legal research as a fundamental lawyering skill, emphasizing a "coherent and effective research design" that integrates issue identification, source selection, and strategic execution.

AALL Principles and Standards for Legal Research Competency (2013): Principle IV states that "a successful legal researcher applies information effectively to resolve a specific issue or need." The standard elaborates: the competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."

Shultz & Zedeck empirical study of lawyer effectiveness (2011): Their 26 "Lawyering Effectiveness Factors" are job performance measures derived from asking lawyers, judges, and clients: "If you were looking for a lawyer for an important matter, what qualities would cause you to choose that attorney?" Relevant factors include "Fact Finding," "Researching the Law," and "Integrity & Honesty."

These sources describe a workflow that succeeds or fails as a unit—not seven independent skills tested in isolation. A lawyer who finds the right case but extracts the wrong holding has failed. A lawyer who synthesizes brilliantly but cites fabricated authority has produced worthless work product.

2.2 The Seven Components of Research Execution

Research Execution comprises seven sequential operations that lawyers perform when completing a legal research task:

Component | Operation | What It Tests
Known Authority | Resolve a known citation to correct case metadata | Can the system retrieve specific authorities?
Unknown Authority | Retrieve relevant law from a fact pattern or legal issue | Can the system find applicable precedent?
Validate Authority | Determine if authority remains good law | Can the system detect overruling or adverse treatment?
Fact Extraction | Extract disposition, holding, and outcome from opinion text | Can the system identify legally relevant facts?
Distinguish Cases | Determine whether precedent applies or can be distinguished | Can the system reason about doctrinal relationships?
Synthesize Results | Integrate authorities into coherent IRAC analysis | Can the system produce competent legal work product?
Citation Integrity | Ensure all cited authorities exist and support propositions cited | Does the system meet professional responsibility standards?

These components are dependencies, not independent benchmarks. Each component's output feeds the next. The chain succeeds or fails as a unit; failure at any point propagates downstream.

2.3 Professional Responsibility as Hard Constraint

Citation Integrity occupies a special position. Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable.

This is a binary gate, not a soft metric: work product either meets professional standards or it fails. Shultz & Zedeck's Factor 21—"Integrity & Honesty"—confirms that professional responsibility is foundational to lawyer effectiveness.

2.4 What Research Execution Tests

The final synthesis validates Research Execution. AALL Principle IV's standard—applying gathered information "to resolve a specific issue or need"—is tested at synthesis. If synthesis produces a well-reasoned IRAC analysis using correct authorities, Research Execution succeeded. If it fails, error tracing reveals which upstream component broke.

Two implications follow:

Planning is implicit. Earlier frameworks treated "Research Planning" and "Strategic Stopping" as separate skills. We eliminate them as explicit test targets because synthesis validates planning. Grade the memo, not the plan.

Errors propagate. A model achieving 90% accuracy on each component will, if errors are independent, complete only 0.9^7 ≈ 48% of full research tasks successfully. This multiplicative penalty reflects legal practice.
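The multiplicative penalty can be checked with a few lines of arithmetic. The sketch below assumes that step errors are statistically independent and that every step has the same accuracy; both are simplifying assumptions, and the function name is illustrative only.

```python
def chain_success_rate(per_step_accuracy: float, steps: int = 7) -> float:
    """Probability that all steps succeed, assuming independent errors."""
    return per_step_accuracy ** steps


# Compare several per-step accuracies for a seven-step chain.
for acc in (0.99, 0.95, 0.90, 0.80):
    print(f"{acc:.2f} per step -> {chain_success_rate(acc):.1%} of chains succeed")
```

If failures are positively correlated (for example, a model that fails on a case tends to fail on every step for that case), the end-to-end rate would be higher than this estimate; the independent case is the simplest reference point.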

3.1 PTQ Methods

PTQ compresses model weights after training. Multiple methods exist, each with different compression strategies and preservation priorities:

Method | Bit-Depth | Target | Core Strategy
AWQ | 4-bit | Weights | Activation-aware: identifies critical 1% of weights and protects them
GPTQ | 4-bit / 8-bit | Weights | Hessian-based: minimizes MSE of weight error layer-by-layer
GGUF (llama.cpp) | Variable | Weights + KV | Block-wise: flexible CPU/GPU offloading, no importance calibration
BNB-nf4 | 4-bit | Weights | NormalFloat: exploits normal distribution of weight values
SmoothQuant | 8-bit | Weight + Activation | Migrates activation spikes into weights
KIVI | 2-4 bit | KV Cache | Channel-wise compression of context memory

The critical distinction is between activation-aware methods (AWQ) that identify and protect the weights most critical to reasoning, and block-wise methods (GGUF) that compress uniformly without importance calibration.

3.2 The AWQ vs. GGUF Divergence

AWQ identifies weight channels with high activation magnitude—the parameters most critical to output fidelity—and protects them from aggressive compression. This makes AWQ more robust for reasoning tasks than uniform quantization.

GGUF (llama.cpp) compresses in block-level chunks without nuanced weight protection. It supports optional importance matrix calibration (llama-imatrix), but standard K-Quants (Q4_K_M, Q4_K_S) skip it.
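To make the block-wise idea concrete, the toy sketch below quantizes a block of weights to 4 bits using a single scale derived from the block's largest absolute value. It is not the algorithm used by llama.cpp's K-Quants or by AWQ; it is a simplified round-to-nearest scheme, with made-up weight values, meant only to show why one large value in a block can erase the small values around it.

```python
def quantize_block(block, bits=4):
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit
    scale = max(abs(w) for w in block) / qmax
    if scale == 0:
        return list(block)                  # all-zero block: nothing to do
    # Quantize to an integer level, then map back to a float value.
    return [round(w / scale) * scale for w in block]


def mean_abs_error(original, quantized):
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


small = [0.02, -0.03, 0.01, 0.04]           # hypothetical small weights
with_outlier = small + [2.0]                # same block plus one large weight

err_plain = mean_abs_error(small, quantize_block(small))
err_outlier = mean_abs_error(small, quantize_block(with_outlier)[:4])

print(f"error on small weights, no outlier:   {err_plain:.4f}")
print(f"error on small weights, with outlier: {err_outlier:.4f}")
```

With the outlier present, the shared scale becomes so coarse that every small weight rounds to zero. Importance-aware methods are designed to reduce this kind of loss for the weights that matter most; how well they do so for legal reasoning is the empirical question this study asks.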

For legal practice, the difference matters:

Feature | AWQ (vLLM) | GGUF (llama.cpp)
Primary Hardware | High-end NVIDIA GPUs | CPUs, Apple Silicon, consumer GPUs
Operational Cost | $1-4/hour per GPU | $0.10-0.50/hour or free locally
Weight Protection | Activation-aware | Block-wise (no importance calibration)
Reasoning Preservation | Higher | Lower
Production Target | High-concurrency SaaS | Edge devices, privacy-first local tools

A user downloading a "4-bit Llama" model cannot tell whether it was quantized with AWQ or GGUF. The label is identical. The legal reasoning quality may differ substantially.

3.3 Commercial Optimizations and Supply Chain Stacking

The PTQ method chosen at the model layer is only the beginning. Commercial infrastructure applies additional optimizations that compound degradation:

NVIDIA TensorRT-LLM (W4A8 / NVFP4)

NVIDIA's Model Optimizer uses W4A8 (4-bit weights, 8-bit activations) to double throughput on Blackwell GPUs. The move to NVFP4 (4-bit weights and activations) for "extreme efficiency" triggers reasoning collapse documented in the literature. This engine powers most high-scale AI cloud providers. They optimize for tokens-per-second—lowering the model's reasoning ceiling in the process.

Microsoft PrefixQuant (Weight-Activation Optimization)

PrefixQuant compresses both weights and activations to achieve high throughput. This optimization caused a 71% collapse in high-level knowledge verification (GPQA). It allows a company to serve 4× more users on the same hardware while effectively degrading the model's validation capabilities.

Snowflake SwiftKV (KV-Cache Compression)

SwiftKV compresses context memory (KV-cache) to process massive datasets (128k tokens) on cheaper GPUs. This causes the 59% drop in retrieval accuracy Mekala et al. (2025) documented. A legal-tech vendor can claim to "review 1,000 contracts for the price of 100"—but retrieval is structurally unreliable.

3.4 The Stacking Problem

A legal AI product may encounter compression at multiple layers:

Layer | What's Happening | Who Controls It
Model weights | PTQ compression (AWQ, GPTQ, GGUF) | Model provider, vendor
Activations | W4A8, NVFP4 | NVIDIA, cloud infrastructure
KV-cache | SwiftKV, KIVI | Hosting provider
Routing | Peak-hour downgrades | Everyone

Each layer compounds. Users see none of it. A lawyer using "LegalEagle 2.0" cannot tell whether the underlying model runs INT8 or INT4, whether AWQ or GGUF was used, whether TensorRT applies additional compression, or whether KV-cache is squeezed to handle long documents.

This Section maps existing evidence to each skill, establishing that quantization plausibly degrades the underlying capabilities each skill requires. Some evidence comes from quantization studies; some from adjacent work on long-context degradation, reasoning under compression, or legal benchmark difficulty. The mapping establishes plausibility; Section VI provides direct empirical confirmation.

Mapping Human Skills to AI Procedural Workflows

Each skill translates into AI operations:

Skill | Human Activity | AI Workflow
S1: Research Planning | Scope, research, databases | Query decomposition, tool selection
S2: Strategic Stopping | Recognizing diminishing returns | Termination conditions, confidence thresholds
S3: Known Authority | Citation lookup | Exact retrieval from identifier
S4: Unknown Authority | Issue-based searching | Semantic retrieval, query expansion
S5: Validating Authority | Shepardizing | Treatment classification, status verification
S6: Fact Extraction | Document review | Long-context retrieval, entity extraction
S7: Distinguishing Cases | Analogical reasoning | Holding comparison, fact matching
S8: Synthesizing Results | Memo drafting | Multi-document generation
S9: Citation | Ethics compliance | Hallucination avoidance, attribution

This mapping enables precise questions: when quantization degrades "multi-hop reasoning," which lawyer skills suffer? When embedding compression causes "retrieval blindness," which workflows fail?

4.1 Summary: The Full Skill Surface Is At Risk

Skill | Mechanism Evidence | Risk Level
S3: Known Authority | Long-context degradation | High
S4: Unknown Authority | Reasoning + retrieval | High
S5: Validating Authority | Temporal reasoning | Medium
S6: Fact Extraction | Long-context retrieval | High
S7: Distinguishing Cases | Multi-step reasoning | High
S8: Synthesizing Results | Integration + accuracy | High
PR: Prof. Responsibility | Fabrication resistance | Very High

4.2 Empirical Studies on Quantization Effects on Reasoning Skills

4.2.1 Research Planning

Mechanism: Research planning requires decomposing complex queries into subtasks—the multi-hop reasoning Li et al. showed degrades up to 4× under quantization.

Study | Finding
ACBench (Dong et al., 2025) | Reveals that 4-bit quantization creates a critical divergence between apparent competence and actual reliability, with real-world applications exhibiting accuracy drops of 10-15%.
Liu et al. (2025) | Shows that lower bit-width quantization introduces task-difficulty-dependent accuracy risks; explicitly evaluates KV-cache and activation quantization as well as weights.
IntactKV (2024) | Provides mechanism support that KV-cache quantization can be a failure point, which bears on a model's ability to maintain workflow state.

4.2.2 Strategic Stopping

Mechanism: Strategic stopping requires calibrated confidence—knowing when you have enough. ECE (expected calibration error) studies show quantized models become overconfident, unable to recognize their own uncertainty.
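For readers unfamiliar with ECE, the sketch below shows one common way to compute it: predictions are grouped into equal-width confidence bins, and the gap between average confidence and observed accuracy is averaged across bins, weighted by bin size. The binning convention and the sample values are assumptions for illustration and are not taken from the cited studies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical overconfident model: 95% stated confidence, 50% actually correct.
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, True])
# Hypothetical calibrated model: 50% stated confidence, 50% actually correct.
calibrated = expected_calibration_error([0.5] * 4, [True, True, False, False])
print(f"overconfident ECE: {overconfident:.2f}, calibrated ECE: {calibrated:.2f}")
```

A model that cannot tell when it is likely wrong has no principled basis for deciding that its research is complete, which is why rising ECE bears on strategic stopping.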

Study | Finding
Zhong et al. (2025) | Quantized LLMs are worse-calibrated than full-precision counterparts in 85% of measurements (41 of 48 test conditions). Quantization systematically produces overconfidence.
Q-Misalign (Dong, Li & Guo, 2025) | Safety alignment degrades under quantization; dormant vulnerabilities emerge post-compression.

4.2.3 Finding Known Authority

Mechanism: Known authority retrieval requires precise matching across long contexts. Embedding quantization collapses semantic distances; generator quantization corrupts attention to specific passages.

Study | Finding
Mekala et al. (2025) | 8-bit roughly preserved; 4-bit methods produce losses up to 59%, especially for long-context inputs. Effect varies by method/model/task.
LegalBench-RAG (2024) | Legal-domain benchmark isolating retrieval quality. Legal retrieval is hard even before quantization.

4.2.4 Finding Unknown Authority

Mechanism: Finding unknown authority requires decomposing fact patterns into legal issues. Liu et al. show quantization degrades multi-hop reasoning by up to 4× on complex tasks. Combined with Zheng's difficulty findings, this suggests that quantization severely impairs this skill.

Study | Finding
Li et al. | Low-bit quantization degrades complex math reasoning by up to 32.39% (avg. 11.31%), specifically in numerical computation and planning.
Liu et al. (2025) | Lower bit-widths introduce significant accuracy risks; impact depends on task difficulty. Affects DeepSeek-R1, LLaMA, and Qwen.
Yazan, Verberne & Situmeang (2024) | In RAG pipelines, quantization may not impair retrieval when the base LLM performs well, but smaller models show high sensitivity to context length and setup.

4.2.5 Validating Authority

Mechanism: Validation requires temporal reasoning (when was this overruled?) and status classification (still good law?). Outlier weight clipping destroys these fine-grained distinctions.

Study | Finding
Liu et al. (2025) | W8A8/W4A16 can be lossless; lower bit-widths introduce significant accuracy risks. Task difficulty is critical, which places authority validation in the high-risk regime.
MixKVQ (Zhang et al., 2025) | Low-bit KV-cache quantization exhibits severe degradation on complex reasoning. Fixed-precision at very low bit-widths produces large quantization errors and critical failures.
TimeBench | GPT-4 achieves only 66.4% on implicit temporal relationships. Accuracy varies from 40% to 92% depending on how temporal facts are organized. TRAVELER shows implicit temporal reasoning degrades 39% as context scales from 5 to 100 events.
arXiv .04823 | Accuracy drops exceed 10% at 4-bit for reasoning; LexTime achieves only 80.8% on temporal event ordering with 4-bit models. Validating authority requires multi-hop temporal reasoning, which is weak at baseline and catastrophic under quantization.

4.2.6 Fact Extraction

Mechanism: Fact extraction from contracts is long-context retrieval. The 59% accuracy collapse on NIAH-none (correctly identifying absent information) implicates document review reliability.

Study | Finding
Mekala et al. (2025) | Up to 59% degradation on long-context extraction tasks at 4-bit quantization. Extracting holdings requires identifying the specific legal rule announced by a court, distinguishing it from dicta, and accurately capturing its scope and limitations. S6's failure mode is particularly insidious because extracted "holdings" may be linguistically plausible while being substantively fabricated: the quantized system generates authoritative-sounding rules that the cited case never announced.

4.2.7 Distinguishing Cases

Mechanism: Case distinction requires tracking multiple factors simultaneously and identifying material differences—multi-step reasoning, the capacity most vulnerable to quantization.

Study | Finding
Dahl et al. (Journal of Legal Analysis, 2024) | Models "cannot reliably detect when they are hallucinating" and fail to correct users' incorrect legal assumptions. Combined with Li et al.'s 32.39% reasoning degradation under quantization, this suggests high unreliability. The baseline 58-88% hallucination rate is for unquantized models; adding 4-bit compression plausibly amplifies an already critical reliability gap.
Liu et al. (2025) | Supports that low-bit regimes create accuracy risks on hard reasoning tasks (the cognitive substrate for distinction).

4.2.8 Synthesizing Results

Mechanism: Synthesis requires integrating multiple sources while maintaining coherence. CLERC reports that strong models produce highly rated analyses while hallucinating—good writing does not mean truthful authority.

Study | Finding
LegalEval-Q (Li & Wu, 2025) | Measures clarity, coherence, and terminology quality. Importantly, it reports that quantization has negligible impact on those writing-quality metrics, which is consistent with fluency being preserved while accuracy degrades.
Lewis et al. (2020) | "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." The canonical RAG citation. It establishes retrieval plus generation as a distinct paradigm because parametric memory alone is insufficient, with provenance and updating as core motivations. This supports treating unknown-authority finding and synthesis as retrieval-conditioned reasoning tasks rather than generic generation. URL: arXiv .11401.

4.2.9 Professional Responsibility

Mechanism: Citation integrity and fabrication resistance depend on precise parametric memory. Quantization clips outlier weights encoding rare-but-accurate associations.

Study | Finding
Q-Misalign (Dong et al., 2025) | Safety alignment is not preserved by quantization but is contingent on precision: vulnerabilities can remain dormant, making pre-deployment safety audits unreliable for detecting post-quantization failure modes. Combined with Dahl et al.'s finding that even unquantized models hallucinate legal information at 58-88% rates while being unable to detect their own errors, quantized legal AI systems present a dual threat to professional responsibility.
Li et al. (2024) | 4-bit quantization significantly weakens fabrication resistance.
Dahl et al. (2024), Large Legal Fictions | LLMs hallucinate legal authority at alarming rates (69-88%) on verifiable legal queries.

5.1 Research Execution as Job Performance

Section II framed legal work as Research Execution—the integrated competency of completing a legal research task from question to answer. Shultz & Zedeck's empirical study confirms this framing: their 26 "Lawyering Effectiveness Factors" are job performance measures, observed in the execution of actual legal work rather than tested in isolation.

AALL Principle IV operationalizes this directly: "A successful legal researcher applies information effectively to resolve a specific issue or need." The competent researcher "synthesizes legal doctrine by examining cases similar, but not identical, to cases that are the current focus of research."

This describes a workflow that succeeds or fails as a unit.

Legal-7 (L7) operationalizes Research Execution as a seven-step dependent chain:

Step | Name | Modality | Task | Ground Truth
S1 | Known Authority | RAG | Resolve known citation to correct authority | SCDB citation lookup
S2 | Unknown Authority | RAG | Retrieve relevant law from fact pattern | shepards_data.csv
S3 | Validate Authority | RAG | Determine if authority remains good law | scotus_overruled_db.csv
S4 | Fact Extraction | RAG | Extract disposition, holding, outcome from opinion | SCDB metadata + opinion text
S5 | Distinguish Cases | RAG + CB | Decide if precedent applies or can be distinguished | shepards.agree field
S6 | IRAC Synthesis | RAG | Write IRAC-structured legal analysis | MEE rubric + chain grounding
S7 | Citation Integrity | CB | Verify no fabricated citations in S6 output | fake_cases.csv + SCDB

The chain maps to IRAC:

  • Rule Phase (S1–S3): Identify, retrieve, and validate legal authority
  • Application Phase (S4–S5): Extract facts and apply precedent through distinction
  • Conclusion Phase (S6–S7): Synthesize analysis and verify citation integrity

The Issue component is implicit in the query. S6 is the capstone: it tests whether Research Execution succeeded.

5.3 Why S6 Validates the Chain

S6 runs closed-book: the model cannot return to sources. It must synthesize an IRAC memo from what it gathered in S1–S5.

This design reflects AALL Principle IV: applying gathered information to resolve an issue. If S6 produces a well-reasoned IRAC analysis using correct authorities, Research Execution worked. If S6 fails, error tracing reveals which upstream step broke:

If Step Fails... | Cascade Effect
S1 (Known Authority) | Wrong case → all downstream analysis corrupted
S2 (Unknown Authority) | Missing precedent → incomplete rule statement
S3 (Validate Authority) | Citing bad law → S6 argument fails
S4 (Fact Extraction) | Wrong facts → S5 distinction invalid
S5 (Distinguish) | Wrong application → S6 conclusion unsupported
S6 (IRAC Synthesis) | Poor reasoning → chain fails at capstone
S7 (Citation Integrity) | Fabrication detected → S6 voided, chain fails

A model achieving 90% accuracy on each step completes, if errors are independent, only 0.9^7 ≈ 48% of full chains. This multiplicative penalty reflects legal practice: one fabricated citation makes a brief worthless.

5.4 S7 as Professional Responsibility Gate

S7 operationalizes Shultz & Zedeck's Factor 21: "Integrity & Honesty."

Under Model Rule 3.3(a)(1), attorneys may not "make a false statement of fact or law to a tribunal." A brief citing fabricated cases is worthless and potentially sanctionable. L7 mirrors this: if S7 detects any fabricated citation in the S6 output, S6 scores zero regardless of reasoning quality.

The gate operates deterministically: citations from S6 are checked against SCDB (real cases) and fake_cases.csv (known fabrications). No LLM-as-judge evaluation required.
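A minimal sketch of such a gate follows. It assumes citations have already been extracted from the S6 output as normalized strings, and it treats a citation found in neither list as invalid; that last choice is an assumption, since the text does not say how unknown citations are handled. The function names and the fabricated example citation are hypothetical.

```python
def s7_gate(s6_citations, scdb_cites, fake_cites):
    """Flag any citation that is a known fabrication or is absent from SCDB."""
    fabricated = [
        cite for cite in s6_citations
        if cite in fake_cites or cite not in scdb_cites  # assumption: unknown = invalid
    ]
    return {
        "citations_found": list(s6_citations),
        "all_valid": not fabricated,
        "fabricated": fabricated,
    }


def gated_s6_score(raw_s6_score, gate_result):
    """S6 scores zero if the gate found any fabricated citation."""
    return raw_s6_score if gate_result["all_valid"] else 0.0


scdb = {"347 U.S. 483", "358 U.S. 1"}   # stand-in for the SCDB lookup
fakes = {"999 U.S. 999"}                # stand-in for fake_cases.csv

clean = s7_gate(["347 U.S. 483", "358 U.S. 1"], scdb, fakes)
tainted = s7_gate(["347 U.S. 483", "999 U.S. 999"], scdb, fakes)
print(gated_s6_score(0.9, clean), gated_s6_score(0.9, tainted))
```

Because the check is a set lookup, two runs over the same S6 output always agree, which is the property that makes the gate deterministic.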

5.5 S5 Dual-Modality: The Reasoning Bridge

S5 (Distinguish Cases) occupies a unique position in the chain. It is the point where retrieval must transform into reasoning. The model must:

  • Understand the holding of a precedent case
  • Understand the facts of the current case
  • Determine whether the precedent applies or can be distinguished

This is not retrieval. This is reasoning about retrieved content—precisely what AALL Principle IV demands when it requires synthesizing doctrine across "cases similar, but not identical."

To isolate the reasoning component, L7 tests S5 in two modalities:

S5-RAG (Primary): Both case texts available. Tests whether the model can distinguish cases with full information. This matches real lawyer workflow—attorneys distinguish cases with the opinions open.

S5-CB (Diagnostic): Only the S4-extracted holding available; no citing case text. Tests whether the model can reason from the rule alone, without copying from source material.

The gap between S5-RAG and S5-CB is the Fluency-Reasoning Divergence measurement:

S5-RAG | S5-CB | Interpretation
High | High | Model reasons well
High | Low | Model copies, doesn't reason (FRD signature)
Low | Low | Model cannot perform the task

A model exhibiting FRD will show a large RAG-CB gap: it can "distinguish" cases when it has the full text to copy from, but cannot reason about the legal relationship from the holding alone. This is precisely the failure mode we hypothesize quantization induces.
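The interpretation table can be expressed as a small classifier over the two S5 accuracies. The 0.7 cut-off between "high" and "low" is a placeholder chosen for illustration; the text does not specify a threshold, and the low-RAG/high-CB combination is not covered by the table.

```python
def frd_gap(rag_accuracy, cb_accuracy):
    """RAG minus closed-book accuracy on S5; larger means more reliance on source text."""
    return rag_accuracy - cb_accuracy


def interpret_s5(rag_accuracy, cb_accuracy, threshold=0.7):
    """Map the two S5 accuracies to a row of the interpretation table."""
    rag_high = rag_accuracy >= threshold   # threshold is an illustrative assumption
    cb_high = cb_accuracy >= threshold
    if rag_high and cb_high:
        return "reasons well"
    if rag_high and not cb_high:
        return "copies, does not reason (FRD signature)"
    if not rag_high and not cb_high:
        return "cannot perform the task"
    return "not covered by the table"


print(interpret_s5(0.9, 0.4), frd_gap(0.9, 0.4))
```

In practice the gap would be compared across quantization settings for the same base model, so that a widening gap can be attributed to compression rather than to the model family.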

5.6 Grading Architecture

Six of L7's seven steps are graded objectively:

Step | Grading Method | Ground Truth Source
S1 | Exact match | SCDB citation
S2 | MRR / Hit@k | Shepard's precedent relationships
S3 | Exact match | scotus_overruled_db
S4 | Exact match (disposition, party) | SCDB metadata
S5 | Exact match | shepards.agree field
S6 | Hybrid (50% objective, 50% LLM-as-Judge) | Chain grounding + MEE rubric
S7 | Deterministic | Citation existence check

Only S6 requires rubric-based evaluation. The 50% objective component ("chain grounding") verifies that S6 correctly incorporates outputs from S1–S5—did the model use the authorities it found? The 50% subjective component applies MEE (Multistate Essay Examination) bar exam standards to assess legal reasoning quality.
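One way to picture the hybrid score is sketched below. The chain-grounding measure here, the fraction of upstream citations that appear verbatim in the memo, is a simplification assumed for illustration; the actual grounding check may be richer. The rubric score is taken as a number between 0 and 1 supplied by the judge.

```python
def chain_grounding(memo_text, upstream_citations):
    """Fraction of upstream (S1-S5) citations that appear in the S6 memo.

    Uses naive substring matching, which is an illustrative shortcut.
    """
    if not upstream_citations:
        return 0.0
    used = sum(1 for cite in upstream_citations if cite in memo_text)
    return used / len(upstream_citations)


def s6_hybrid_score(grounding, rubric_score):
    """Equal-weight blend of the objective and rubric-based components."""
    return 0.5 * grounding + 0.5 * rubric_score


memo = "Under Brown v. Board of Education, 347 U.S. 483 (1954), the rule is ..."
grounding = chain_grounding(memo, ["347 U.S. 483", "358 U.S. 1"])
score = s6_hybrid_score(grounding, rubric_score=0.8)
print(grounding, score)
```

Anchoring half of the score to the chain's own outputs means a fluent memo that ignores the authorities gathered upstream cannot score above 50%, whatever the judge model thinks of its prose.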

This architecture minimizes LLM-as-judge circularity: most of the benchmark is deterministic, and even the subjective portion is anchored to the chain's objective outputs.

5.7 Task Structure: From Case to Chain

Each L7 chain instance begins with a Supreme Court case pair drawn from the Shepard's citation network. The anchor case (cited_case) provides the legal authority to be researched; the citing case establishes the doctrinal relationship to be analyzed.

Initial Scenario

A chain instance contains:

Element | Source | Example
Cited Case | scdb_sample.csv | Brown v. Board of Education, 347 U.S. 483 (1954)
Citing Case | scotus_shepards_sample.csv | Cooper v. Aaron, 358 U.S. 1 (1958)
Shepard's Signal | shepards field | "followed"
Doctrinal Agreement | agree field | True (citing case follows precedent)
Overrule Status | scotus_overruled_db.csv | None (not overruled)
Opinion Text | majority opinion field | Full text of majority opinion

The model receives this case pair and must execute the seven-step chain, with each step's output feeding subsequent steps.

Task Types by Step

Step | Task Type | Input | Expected Output
S1 | Known Authority | Case name or citation | {us_cite, case_name, term}
S2 | Unknown Authority | Legal issue from anchor case | Ranked list of citing cases
S3 | Validate Authority | Citation from S1 | {is_overruled, overruling_case, year_overruled}
S4 | Fact Extraction | Opinion text | {disposition, party_winning, holding_summary}
S5 | Distinguish (Closed-Book) | S4 holding + citing case metadata | {agrees, reasoning}
S5 | Distinguish (RAG) | S4 holding + full citing opinion | {agrees, reasoning}
S6 | IRAC Synthesis | All prior outputs | {issue, rule, application, conclusion}
S7 | Citation Integrity | S6 output | {citations_found, all_valid}

Scoring Summary

Step | Ground Truth | Scoring Method
S1 | SCDB metadata | Exact match
S2 | Shepard's citing_case_us_cite | MRR, hit@10
S3 | scotus_overruled_db | Binary match on is_overruled
S4 | SCDB caseDisposition, partyWinning | Closed enum exact match
S5 | Shepard's agree field | Binary match
S6 | MEE rubric + chain grounding | Hybrid (50% objective, 50% rubric)
S7 | fake_cases.csv + SCDB | Deterministic lookup
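The S2 retrieval metrics are standard. As a reference point, the sketch below computes reciprocal rank, mean reciprocal rank (MRR), and hit@k over ranked lists of citations; the sample ranking is invented for illustration and is not drawn from the benchmark data.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def hit_at_k(ranked, relevant, k=10):
    """True if any relevant item appears in the top k results."""
    return any(item in relevant for item in ranked[:k])


ranked = ["100 U.S. 1", "358 U.S. 1", "200 U.S. 2"]   # hypothetical model output
relevant = {"358 U.S. 1"}                              # ground-truth citing case

print(reciprocal_rank(ranked, relevant), hit_at_k(ranked, relevant, k=1))
```

MRR rewards placing the correct citing case near the top of the list, while hit@10 only asks whether it appears at all within the first ten results, so the two together separate ranking quality from basic recall.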

5.8 What L7 Detects That Parallel Benchmarks Cannot

The Dahl et al. benchmark tests the same underlying data but in parallel: each task is independent, and a model can achieve a high aggregate score (say, 85%) while being functionally incapable of completing a single end-to-end workflow. If the remaining 15% of failures are distributed randomly, some workflows succeed. But if failures cluster at early-chain positions, as our quantization hypothesis predicts, then independent task accuracy becomes a misleading proxy for Research Execution capability.

L7 detects three failure modes invisible to parallel evaluation:

Cascade failures. A model that hallucinates at S1 corrupts all downstream steps. Parallel scoring treats S1 as one task among many; L7 propagates the error through the chain.

FRD signature. The S5 RAG-CB gap directly measures whether the model is reasoning or copying. No parallel benchmark isolates this.

Professional responsibility failures. S7 voiding enforces the binary reality of citation integrity—a synthesized memo is either citable or it is not. Parallel benchmarks score fabrication as one error among many; L7 treats it as disqualifying.
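
The second and third failure modes can be sketched in a few lines. This is one possible reading, not the scoring code used in the study: it assumes the S5 gap is the difference in binary accuracy between the RAG and closed-book runs on the same case pairs, and that S7 voiding sets the memo score to zero when any citation fails validation. Both function names are hypothetical.

```python
def s5_rag_cb_gap(cb_correct, rag_correct):
    """RAG accuracy minus closed-book accuracy on the same S5 items."""
    if not cb_correct or len(cb_correct) != len(rag_correct):
        raise ValueError("both runs must cover the same non-empty item set")
    cb_accuracy = sum(1 for ok in cb_correct if ok) / len(cb_correct)
    rag_accuracy = sum(1 for ok in rag_correct if ok) / len(rag_correct)
    return rag_accuracy - cb_accuracy


def gated_memo_score(s6_score, s7_all_valid):
    """Hard gate: a memo with any invalid citation scores zero (assumed rule)."""
    return s6_score if s7_all_valid else 0.0
```

A large positive gap would suggest the model depends on the supplied opinion text rather than on its own doctrinal reasoning, which is the pattern the FRD signature is meant to expose.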

For quantization testing, these properties are essential. We hypothesize that compression degrades reasoning while preserving fluency. L7's chained architecture, dual-modality S5, and hard-gate S7 are designed to make this degradation visible.

The following Section applies L7 to test Fluency-Reasoning Divergence across quantization regimes.

6.1 Design Choice: Ecological Validity Over Experimental Purity

This study prioritizes ecological validity over experimental purity. The objective is not to isolate quantization as a laboratory variable under controlled conditions, but to measure what happens to legal reasoning under the deployment configurations users actually encounter.

A secondary objective emerged from the market analysis in Section III: demonstrating that "4-bit" is not a specification. Two deployments at identical bit-depth using different PTQ methods may produce materially different legal reasoning—and users cannot distinguish them. This study tests that claim empirically.

The design involves deliberate tradeoffs. By testing multiple PTQ methods, we gain the ability to show method variance. By using officially released and community-standard quantization rather than a single controlled PTQ pipeline, we lose clean causal isolation but gain results that reflect what lawyers actually face.

We accept these tradeoffs because the policy-relevant questions are:

  1. What are lawyers exposed to in practice?

  2. Can users distinguish between "good 4-bit" and "bad 4-bit"?

The answer to the second question, we hypothesize, is no—and we aim to prove it.

The inability to make clean causal attributions is not a limitation of this study. It is a finding about the market.

6.2 Research Questions

RQ1. Existence. Does aggressive quantization reduce legal-reasoning accuracy more than it reduces surface fluency?

RQ2. Generalizability. Is FRD observable across multiple legal task types, or is it benchmark-specific?

RQ3. Regimes. Can we distinguish a "silent defect zone" (INT8→INT4: reasoning degraded, fluency intact) from an "obvious defect zone" (2-bit: both collapse)?

RQ4. PTQ Method Variance. At identical bit-depth, do different PTQ methods produce significantly different legal reasoning outcomes?

6.3 Hypotheses

H1: Reasoning Degradation (Silent Defect Zone)

Legal reasoning accuracy will degrade significantly between INT8 and INT4 across all three arms, with the largest drops in categories requiring multi-step inference, exception-tracking, or doctrinal distinction.

H2: Fluency Stability (FRD Signature)

Fluency metrics (perplexity, grammaticality, coherence) will remain stable across precision tiers even as reasoning accuracy degrades—confirming Fluency-Reasoning Divergence.

H3: Catastrophic Collapse (Obvious Defect Zone)

At 2-bit precision, both fluency and reasoning will degrade across all arms, producing visible failure modes distinguishable from the latent degradation at INT4.

H4: Category-Specific Vulnerability

Different Dahl task types will show differential sensitivity. Tasks 5-10 (reasoning-intensive) will degrade faster than Tasks 1-4 (factual).

H5: Architecture Generalization

Arms A and B (Qwen and Llama under high-quality PTQ) will both show FRD, demonstrating that degradation is not architecture-specific.

H6: PTQ Method Divergence

At identical bit-depth, AWQ (Arm B) will significantly outperform GGUF (Arm C), demonstrating that bit-depth alone is an insufficient specification. The B-C gap at 4-bit may exceed the 8→4 gap within a single arm.

6.4 Model and Precision Regimes

Design: Three arms testing two model families under two PTQ philosophies.

Tier | Model 1 | Model 2 | PTQ Rationale
8-bit | Llama 3.1 8B | Qwen 2.5 7B | Ecological validity, higher performance quality
4-bit | | |
3-bit | llama.cpp (GGUF) | | Low-cost PTQ: block-wise compression, no importance calibration
2-bit | | |

Rationale for Study Design

Arms A and B test FRD under "reasonable quality" PTQ—what a careful vendor or sophisticated deployer might ship. These represent the ceiling of what users could hope to encounter.

Arm C tests FRD under the PTQ method that economic pressures actually favor. Startups without GPU budgets, solo developers, and cost-constrained deployments default to llama.cpp because it runs on consumer hardware. This represents what many users actually encounter.

The B vs C comparison tests the same model across different PTQ techniques: same base weights, same nominal bit-depth, different compression philosophy. This directly tests whether PTQ method produces material differences at identical bit-depth (see Section III).

Why INT8 As Baseline

Full-precision models exist but are rarely what users encounter. The economic realities documented in Section III push vendors toward compressed deployments. INT8 represents the "reasonable production floor"—the precision tier a user might plausibly expect from a serious legal AI product. Testing degradation relative to INT8 answers the question users actually face: "How much worse does it get from here?"

Rationale for Selecting the Dahl Benchmark

  1. Verifiable ground truth. Unlike benchmarks requiring LLM-as-judge evaluation, Dahl uses structured case metadata that can be programmatically verified.

  2. Multi-skill coverage. The 10 task types span multiple L-10 skills, testing factual recall, reasoning, temporal judgment, and fabrication resistance.

  3. FRD operationalization. The task gradient, from simple factual tasks (1-4) to complex reasoning tasks (5-10), allows observation of whether errors cluster in reasoning-intensive categories while fluency-dependent categories remain stable.

  4. Cross-arm comparability. All three arms run identical queries, enabling direct comparison across architectures and PTQ methods.

H4 Prediction: Category-Specific Vulnerability

Tier | Tasks 1-4 (Factual) | Tasks 5-8 (Reasoning) | Tasks 9-10 (Hardest)
INT8 | ✓ Stable | ✓ Stable | ✓ Stable
INT4 | ✓ Stable | ↓ Silent degradation | ↓↓ Degradation
2-bit | ↓ Degradation | ↓↓ Collapse | ↓↓↓ Collapse

The gradient is the FRD signature. Simple tasks hold. Reasoning tasks degrade silently. Hardest tasks fail first and worst.

6.5 Evaluation Architecture: HELM Integration

This study implements evaluation using Stanford CRFM's Holistic Evaluation of Language Models (HELM) framework. Seven considerations motivate this choice:

  1. Audit trail. HELM produces SHA-256 hashed bundles containing prompts, outputs, and configuration state, enabling independent verification.

  2. Standardized reporting. HELM's structure naturally accommodates our 9-configuration × 10-task matrix with consistent output formats.

  3. Credibility signal. HELM is the evaluation infrastructure behind Stanford CRFM's published model assessments, signaling alignment with established best practices.

  4. Reproducibility. The entire study can be re-executed with a single command; all configuration is declarative.

  5. YAML-based configs. Model swapping requires only configuration file updates, not code changes.

  6. Caching. HELM provides crash-safe resume for 50,000+ inference calls across potentially unstable quantized models.

  7. Extensibility. Adding a future Arm D requires only a new YAML entry rather than architectural changes.

The Dahl benchmark is integrated as a custom HELM scenario (DahlHallucinationScenario) that loads task data from the RegLab repository, structures queries as HELM instances, and scores outputs using Dahl's correctness_checks.py logic with GPT-4 as judge for semantic evaluation.

6.6 Metrics

Category | Metric | What It Measures | Applied To / Purpose
Primary Outcome | Dahl (overall) | Correctness rate across all 10 task types | All arms, all tiers
Primary Outcome | Dahl (by task) | Correctness rate per task type | All arms, all tiers
FRD Detection Metrics | Perplexity | Model confidence / fluency proxy | Detect fluency stability
FRD Detection Metrics | FRD Index | (Fluency stability) − (Reasoning degradation) | Composite divergence measure
Calibration Metrics | ECE | Confidence-accuracy alignment | Detect overconfidence
Calibration Metrics | Confidence by Correctness | Mean confidence on correct vs incorrect | Identify high-confidence errors
Forensic Metrics | EBPW | Effective bits per weight | Verify claimed precision
Forensic Metrics | dtype audit | Data type of served weights | Confirm no silent override
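
For readers unfamiliar with the calibration metrics, the following sketch computes expected calibration error (ECE) with equal-width confidence bins, along with mean confidence split by correctness. It assumes each response comes with a confidence value in [0, 1]; how that confidence is obtained (for example, from token probabilities) is not specified here, and the bin count of 10 is an illustrative default.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |mean confidence - accuracy| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and aligned")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def confidence_by_correctness(confidences, correct):
    """Mean confidence on correct and on incorrect answers (None if a group is empty)."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return {"correct": mean(right), "incorrect": mean(wrong)}
```

High mean confidence on incorrect answers is the pattern of interest for FRD: the model is wrong but gives no signal that it is.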

6.7 Procedure

6.7.1 Environment

Component | Specification
Hardware | RTX 5090 (32 GB VRAM)
Framework | vLLM (Arms A, B), llama-cpp-python (Arm C)
Judge | GPT-4 via OpenAI API

6.7.2 Model Artifacts

Arm | Source | Artifacts
A | Alibaba official | Qwen2.5-7B-Instruct @ INT8/INT4/2-bit
B | HuggingFace | Meta-Llama-3.1-8B-Instruct-AWQ @ INT8/INT4/2-bit
C | HuggingFace | Meta-Llama-3.1-8B-Instruct-GGUF @ Q8_0/Q4_K_M/Q2_K

6.7.3 Decoding Parameters (Fixed Across All Runs)

Parameter | Value | Rationale
Temperature | 0.0 | Deterministic; eliminates sampling variance
Top-p | 1.0 | No nucleus sampling
Max tokens | 512 | Sufficient for Dahl responses
Seed | 42 | Reproducibility
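
One way to keep these parameters fixed across both inference backends is to define them once as an immutable object and pass the same values to every run. The sketch below is illustrative; the class and function names are hypothetical, and whether each backend accepts these exact keyword names should be checked against its documentation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding parameters held constant across all arms and tiers."""
    temperature: float = 0.0  # greedy decoding, no sampling variance
    top_p: float = 1.0        # nucleus sampling disabled
    max_tokens: int = 512     # response length cap
    seed: int = 42            # fixed seed for reproducibility


DECODING = DecodingConfig()


def request_kwargs(config=DECODING):
    """Return the parameters as a plain dict to pass to an inference call."""
    return asdict(config)
```

Freezing the dataclass means an accidental per-run override raises an error instead of silently changing the experimental conditions.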

6.7.4 Run Protocol

  1. Load model at specified precision

  2. Verify dtype/EBPW before run

  3. Run full Dahl benchmark (all 10 task types)

  4. Log all outputs with metadata

  5. Score via GPT-4 judge + programmatic checks

  6. Repeat for each arm × tier (9 configurations)
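
The run matrix and the pre-run precision check can be sketched as follows. The sketch assumes EBPW is computed as the size of the served weights in bits divided by the parameter count, and that a tolerance of one bit is acceptable because quantized formats store scales and other metadata alongside the weights; both the formula and the tolerance are illustrative assumptions, not the study's definitions.

```python
from itertools import product

ARMS = ("A", "B", "C")
TIERS = ("8-bit", "4-bit", "2-bit")


def run_matrix():
    """Enumerate every arm x tier configuration (9 in total)."""
    return list(product(ARMS, TIERS))


def effective_bits_per_weight(weight_bytes, n_params):
    """Assumed EBPW definition: total weight storage in bits per parameter."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return weight_bytes * 8 / n_params


def precision_matches(claimed_bits, ebpw, tolerance=1.0):
    """Flag artifacts whose measured EBPW strays from the claimed bit-depth."""
    return abs(ebpw - claimed_bits) <= tolerance
```

A configuration that fails the check would be excluded or relabeled before step 3, so that reported tiers reflect what was actually served.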

6.7.5 Reproducibility Package

Per Legal-10 protocol:

  • Full run bundles (prompts, raw outputs, scores, configuration)

  • SHA-256 hashes for all artifacts

  • Signed manifests

  • Public append-only submission log
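
The hashing step of the package can be illustrated with the standard library alone. The sketch below walks a run-bundle directory and records a SHA-256 digest per file; the directory layout and function names are hypothetical, and manifest signing and the append-only log are outside its scope.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model outputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir):
    """Map each file's path (relative to the bundle root) to its SHA-256 digest."""
    root = Path(bundle_dir)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

A verifier can rebuild the manifest from a downloaded bundle and compare it with the published one; any mismatch identifies the altered file by path.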

6.8 Analysis Plan

6.8.1 Within-Arm Analysis (H1, H2, H3)

For each arm independently:

  • Plot accuracy by precision tier (8 → 4 → 2)

  • Plot perplexity by precision tier

  • Identify silent defect zone (accuracy drops, perplexity stable)

  • Identify obvious defect zone (both drop)
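
The zone identification above can be expressed as a simple decision rule. The thresholds in this sketch (a 5-point accuracy drop and a 10% relative perplexity increase) are placeholders chosen for illustration, not values fixed by the study. Note that fluency degradation appears as a rise in perplexity, not a drop.

```python
def classify_zone(acc_base, acc_tier, ppl_base, ppl_tier,
                  acc_drop_threshold=0.05, ppl_rise_threshold=0.10):
    """Label a precision tier relative to the baseline tier of the same arm."""
    acc_drop = acc_base - acc_tier               # absolute accuracy loss
    ppl_rise = (ppl_tier - ppl_base) / ppl_base  # relative perplexity increase
    reasoning_degraded = acc_drop >= acc_drop_threshold
    fluency_degraded = ppl_rise >= ppl_rise_threshold
    if reasoning_degraded and fluency_degraded:
        return "obvious defect"
    if reasoning_degraded:
        return "silent defect"
    if fluency_degraded:
        return "fluency-only degradation"
    return "stable"
```

The "silent defect" label corresponds to H1 and H2 holding together at a given tier, while "obvious defect" corresponds to H3.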

6.8.2 Task-Level Analysis (H4)

  • Heatmap: Task type × precision tier × arm

  • Test whether Tasks 5-10 degrade faster than Tasks 1-4

  • Identify which tasks show earliest/steepest decline
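
The H4 comparison reduces to contrasting the mean accuracy drop in the two task groups. The sketch below assumes per-task accuracies are stored in dictionaries keyed by task number (1 through 10) for a higher and a lower precision tier; the function names are hypothetical, and a significance test would still be needed on top of this point estimate.

```python
def mean_drop(acc_high, acc_low, tasks):
    """Average accuracy lost between two tiers over the given task numbers."""
    tasks = list(tasks)
    return sum(acc_high[t] - acc_low[t] for t in tasks) / len(tasks)


def differential_sensitivity(acc_high, acc_low,
                             factual=range(1, 5), reasoning=range(5, 11)):
    """Positive values mean Tasks 5-10 degrade more than Tasks 1-4."""
    return mean_drop(acc_high, acc_low, reasoning) - mean_drop(acc_high, acc_low, factual)
```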

6.8.3 Architecture Comparison (H5)

  • Compare Arm A vs Arm B at matched precision

  • Report whether FRD appears in both architectures

6.8.4 PTQ Method Comparison (H6)

  • Direct comparison: Arm B vs Arm C at each precision tier

  • Statistical test: Is B-C gap at INT4 significant?

  • Compare B-C gap magnitude to within-arm 8→4 gap

  • If B-C gap ≥ 8→4 gap: "PTQ method matters as much as bit-depth"
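
The H6 comparison can be sketched with a two-proportion z-test and a direct gap comparison. The choice of test is an assumption made for illustration: because all arms answer identical queries, a paired test such as McNemar's may be more appropriate, and the final choice is left open here.

```python
import math


def two_proportion_z(correct_1, n_1, correct_2, n_2):
    """Two-sided z-test for a difference between two accuracy rates."""
    p_1, p_2 = correct_1 / n_1, correct_2 / n_2
    pooled = (correct_1 + correct_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    if std_err == 0:
        return 0.0, 1.0
    z = (p_1 - p_2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


def method_matters_as_much(acc_b_int4, acc_c_int4, acc_b_int8):
    """True if the B-C gap at INT4 is at least the 8-to-4 gap within Arm B."""
    method_gap = acc_b_int4 - acc_c_int4
    depth_gap = acc_b_int8 - acc_b_int4
    return method_gap >= depth_gap
```

If the second function returns True alongside a significant test result, the data would support the statement that PTQ method matters as much as bit-depth.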

6.8.5 FRD Quantification

  • Compute FRD Index per arm/tier

  • Identify precision threshold where FRD is maximized

[To be completed after experiments]

8.1 FRD Confirmed

[To be completed after results]

8.2 The Governance Vacuum

The results confirm what Section III documented: the legal AI market is operating without oversight.

No disclosure requirements mandate that vendors reveal precision tiers. No standards define what "4-bit" means. No liability exposure yet connects quantization choices to malpractice. The economic incentives—from big cloud providers optimizing margins to startups avoiding GPU costs—all push toward aggressive compression.

8.3 Implications

For practitioners: Verification is structurally defeated. A lawyer cannot determine, upon use, whether the legal AI tool is running at INT8 or INT4, AWQ or GGUF, with or without infrastructure-layer compression. The observability gap documented in Part I is empirically confirmed.

For vendors: The results create design-defect exposure. Quantization-induced FRD is foreseeable, configuration-linked, and undisclosed. Reasonable alternative designs (higher precision, transparent labeling) exist and are feasible.

For regulators: Precision tier should be a disclosed material specification. "4-bit" is not a specification—it is a label covering materially different products.

This study demonstrates that Fluency-Reasoning Divergence is empirically observable in legal AI systems: quantization degrades legal reasoning while surface fluency remains stable.

The degradation is not uniform across methods. AWQ and GGUF at identical bit-depth produce materially different legal reasoning, confirming that "4-bit" is a label, not a specification.

And the degradation is invisible to users. The opacity problem operates at every layer: model, PTQ method, infrastructure compression, routing. Users face a black box they cannot audit.

The industry is deploying compromised systems under nominal equivalence. The governance vacuum persists. The evidentiary gap is architectural.

This Article does not solve the problem. But it documents it—with enough precision that the documentation itself becomes actionable. When regulators ask "did anyone see this coming?", when courts ask "was this foreseeable?", when bar associations ask "what should lawyers have known?"—the answer is here.

Appendix A: Skill Framework Source Citations

Sources:

  • MacCrate Report: American Bar Association. (1992). Legal education and professional development: An educational continuum (Report of the Task Force on Law Schools and the Profession: Narrowing the Gap), 138–141.

  • AALL Principles: American Association of Law Libraries. (2013). Principles and standards for legal research competencies.

  • Shultz & Zedeck: Shultz, M. M., & Zedeck, S. (2011). Predicting lawyer effectiveness: Broadening the basis for law school admission decisions. Law & Social Inquiry, 36(3), 620-661.

External Benchmark Injection Points. The L10 chain architecture enables integration with existing legal AI benchmarks at specific skill positions. CaseHOLD (Zheng et al., 2021) maps to S7 closed-book mode, testing holding identification without full case context. LegalBench overruling detection maps to S5 (Validating Authority). LegalBench definition extraction maps to S6 (Fact Extraction). This injection architecture allows L10 to provide both native chained evaluation and external benchmark comparability within a single framework.

Skill | MacCrate | AALL | Shultz & Zedeck
S1: Research Planning | § 3.3(a): "devising and implementing a coherent and effective research design" | Principle II, Standard 1 | Factor 14: Strategic Planning
S2: Strategic Stopping | § 3.3(a)(iii): "assessing feasibility... in terms of time and financial constraints" | Principle II, Standard 4: "Recognizes when sufficient research has been done" | Factor 13: Organizing and Managing One's Own Work
S3: Known Authority | § 3.2: "Knowledge of the Fundamental Tools of Legal Research" | Principle II, Standard 2: "find the full text... given a legal citation" | Factor 6: Fact Finding
S4: Unknown Authority | § 3.2(a): using secondary sources to find primary authority | Principle II, Standard 3 | Factor 7: Researching the Law
S5: Validating Authority | § 3.1: "Knowledge of the Nature of Legal Rules" | Principle III, Standard 2: "verifies that the authority is current and still good law" | Factor 5: Researching the Law
S6: Fact Extraction | § 4.1: Factual Investigation | [Gap] | Factor 6: Fact Finding
S7: Distinguishing Cases | § 3.3(c): "distinguishing cases on their facts and reasoning" | Principle IV, Standard 1 | Factor 1: Analysis and Reasoning
S8: Synthesizing Results | § 2.2: "synthesizing the holdings of multiple cases" | Principle IV, Standard 2 | Factor 9: Writing
PR: Professional Responsibility | § 10: "Recognizing and Resolving Ethical Dilemmas" | Principle V, Standard 1 | Factor 21: Integrity & Honesty