
When a law firm subscribes to an AI-powered legal research platform, the invoice reflects compute costs. Behind that invoice lies an engineering decision: serve the model at full precision (16-bit floating-point weights preserving the numerical fidelity of training) or compress those weights into 8-bit or 4-bit integers that cut weight memory to roughly one half or one quarter and can raise throughput substantially. The economics compel compression. On the same hardware, a 4-bit model can serve several times as many concurrent users, up to roughly four times when weight memory is the binding constraint. For vendors at scale, quantization is standard deployment practice.
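The memory side of that decision is simple arithmetic. The sketch below is illustrative only: it assumes an 8-billion-parameter model and counts weight storage alone, ignoring the KV cache, activations, and the per-group scale metadata that real quantized formats carry.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 8e9  # assumed parameter count for an "8B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Because other memory consumers do not shrink with the weights, the gain in concurrent users is usually smaller than the ratio of bit-widths suggests.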

Quantization instantiates Act I's T-Arm fidelity parameter (bit-precision) under nominal equivalence (same model label), creating an Observability Gap for downstream legal users who cannot verify which precision tier generated their output.

This study asks: what happens to legal reasoning when that compression knob is turned?

Prior work suggests quantization rarely degrades language models on standard benchmarks. Jin et al.'s comprehensive evaluation of quantized instruction-tuned models found that 4-bit variants "maintain performance comparable to their non-quantized counterparts" across tasks spanning MMLU, summarization, truthfulness, and arithmetic reasoning. A Red Hat study of quantized Llama 3.1 models reported that 8-bit and 4-bit deployments recover 96–99% of baseline leaderboard scores, with "no discernible differences" from full-precision outputs in typical use. Perplexity—the standard measure of linguistic fluency—remains stable through aggressive compression, spiking only at extreme 2-bit levels.

These findings are reassuring for general-purpose applications but incomplete. The underlying benchmarks exclude legal reasoning tasks, omit legal-specific hallucination taxonomies, and never stress the multi-step doctrinal analysis (tracking exceptions, applying multi-factor tests, distinguishing precedent) that competent legal work demands. Legal reasoning may be fragile in ways general benchmarks miss, and quantization may selectively degrade this fragile capacity while leaving surface fluency intact.

The companion theoretical paper terms this phenomenon—surface-level linguistic competence preserved even as underlying reasoning deteriorates—Fluency–Reasoning Divergence (FRD). This study tests whether FRD appears in a controlled legal-domain setting, and at what precision thresholds.

1.2 Gap in Existing Evidence

The quantization-and-reasoning literature has advanced rapidly. Li et al. demonstrated that standard post-training quantization methods introduce up to 32.39% accuracy degradation on mathematical reasoning benchmarks while preserving greater than 95% of baseline performance on general text tasks. Liu et al. confirmed that quantization below 8-bit weights creates "significant accuracy risks" specifically in reasoning-intensive models. Dettmers et al.'s work on activation outliers showed that complex reasoning relies on rare, high-magnitude weight directions that are clipped first under aggressive compression.

These findings suggest a mechanism: quantization selectively damages neural circuits for multi-step reasoning while leaving the redundant patterns underlying fluent text generation intact. But none of these studies use legal tasks, legal hallucination taxonomies, or law-specific benchmarks. The gap matters because legal reasoning has distinctive properties:

Doctrinal precision matters. The difference between "gross negligence" and "recklessness" can determine whether qualified immunity applies, so a model that confuses the two reaches the wrong answer.

Exceptions are load-bearing. Legal rules bristle with carve-outs; tracking them is precisely what multi-step reasoning requires.

Confidence without calibration is dangerous. A model that sounds authoritative while misstating holdings creates greater professional risk than one that hedges or declines.

Whether quantization-induced reasoning decay manifests in legal tasks—masked by preserved fluency—remains an open empirical question. This study addresses it.

1.3 Research Questions

Four empirical research questions organize the study:

RQ1 (Existence). Does aggressive post-training quantization reduce legal-reasoning accuracy more than it reduces surface fluency?

RQ2 (Magnitude). Is the effect size of quantization on legal correctness substantially larger than its effect size on perplexity?

RQ3 (Regimes). Can we empirically distinguish a "silent defect zone" (INT8–INT4: reasoning degraded, fluency intact) from an "obvious defect zone" (2-bit: both collapse)?

RQ4 (Robustness). Do Llama 3.1 and Qwen 2.5 show similar degradation patterns—a general phenomenon, not an architecture-specific quirk?

2.1 Quantization and Reasoning Decay in Computer Science

2.1.1 The Standard Finding: Moderate Quantization Preserves General Performance

The dominant finding: moderate compression—down to 4-bit weights—preserves model performance on standard benchmarks with minimal degradation. This finding has shaped industry practice and vendor confidence.

Jin et al.'s comprehensive evaluation of quantized instruction-tuned Qwen models across MMLU, C-EVAL, CNN/DailyMail, XSum, TruthfulQA, BBQ, GSM8K, SNLI, and FollowBench found that 4-bit quantized models "maintain performance comparable to their non-quantized counterparts." Critically, perplexity "serves as a reliable performance indicator for quantized LLMs across the majority of the benchmarks"—if a model sounds fluent, it probably performs well. Only at extreme 2-bit settings did they observe orders-of-magnitude increases in perplexity and sharply degraded benchmark scores.

A Red Hat industry-scale evaluation of quantized Llama 3.1 models (8B/70B/405B) across OpenLLM Leaderboard, Arena-Hard-Auto, HumanEval, and text-similarity metrics reinforced this picture. All quantized schemes recovered 96–99% of baseline leaderboard scores. Text similarity metrics (ROUGE-1/L, BERTScore, semantic textual similarity) showed strong preservation of word choice, structure, and semantic content. The authors concluded that 8-bit and 4-bit quantized models show "no discernible differences" from their full-precision counterparts in typical use.

These findings explain aggressive vendor quantization: standard benchmarks say it is safe.

2.1.2 The Exception: Reasoning Tasks Show Disproportionate Degradation

A parallel research line tells a different story for multi-step reasoning tasks.

Li et al.'s study of mathematical reasoning under quantization found that standard post-training quantization methods like AWQ and GPTQ introduce up to 32.39% accuracy degradation on the MATH benchmark—while preserving greater than 95% of baseline performance on general text tasks. Degradation concentrated in numerical computation and reasoning planning—exactly the operations requiring precise, multi-step inference.

Liu et al. confirmed this pattern, demonstrating that quantization below 8-bit weights or 16-bit activations creates "significant accuracy risks" specifically in reasoning models. Their data identified INT8 as a danger zone for complex logic and INT4 as functionally unsafe for tasks requiring extended chains of inference.

The pattern: selective vulnerability. Quantization damages reasoning while sparing fluency. A model can lose 30% of its multi-step problem-solving capacity while retaining 95%+ of its capacity for grammatical, confident-sounding text—precisely what the companion paper terms Fluency–Reasoning Divergence.

2.1.3 Mechanistic Evidence: Outliers and Fragile Circuits

This selective damage reflects transformer architecture.

Dettmers et al.'s work on activation outliers revealed that complex reasoning relies on rare, high-magnitude weight directions: specific neural pathways with unusually large values that track long logical chains, handle edge cases, and maintain exceptions across multi-step derivations. These "outlier" weights encode higher-order patterns—nuanced distinctions separating competent legal analysis from superficial pattern matching.

Aggressive quantization clips these outliers first. When weights stored in FP16, a format with at most 65,536 distinct bit patterns, are forced into 16 discrete levels (INT4), the extreme values (the ones encoding nuanced distinctions) are either clipped toward the representable range or stretch the quantization scale so far that many smaller weights round into the same bucket as their neighbors. The model retains common patterns and frequent associations but loses rare, critical distinctions.
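A toy example makes the two failure modes concrete. The sketch below uses plain round-to-nearest symmetric quantization on made-up weights; it is not AWQ, GPTQ, or llama.cpp's k-quant scheme, and the `clip` parameter is a hypothetical knob added for illustration.

```python
def quantize_symmetric(weights, bits, clip=None):
    """Round-to-nearest symmetric quantization; returns dequantized values.

    clip: optional magnitude cap used to set the scale (hypothetical knob).
    Without it, the largest absolute weight sets the scale (absmax scaling).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    limit = clip if clip is not None else max(abs(w) for w in weights)
    scale = limit / qmax
    dequantized = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))  # saturate anything beyond the limit
        dequantized.append(q * scale)
    return dequantized


# Toy values: five ordinary weights and one high-magnitude outlier.
weights = [0.02, -0.03, 0.05, 0.04, -0.01, 6.0]

absmax_int4 = quantize_symmetric(weights, bits=4)
clipped_int4 = quantize_symmetric(weights, bits=4, clip=0.07)

print("absmax :", absmax_int4)
print("clipped:", clipped_int4)
```

With absmax scaling the outlier survives but every ordinary weight lands in the zero bucket; with clipping the ordinary weights keep their distinctions but the outlier saturates at the cap. Production methods try to avoid both outcomes, which is one reason the choice of PTQ method matters.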

Kumar et al. extended this finding by demonstrating that quantization negatively impacts the learning process of reasoning models more than general language models. This suggests that reasoning is a higher-order function mechanistically distinct from text generation, requiring higher precision to preserve.

The damage is not uniform. Attention layers—performing information routing, deciding what connects to what—are precision-sensitive and degrade under compression. Feed-forward layers—applying learned patterns to whatever attention surfaces—prove far more robust. For legal practitioners: imagine a law firm where budget cuts eliminate senior partners while leaving associates fully staffed. The work product looks professional, but strategic judgment has been quietly hollowed out.

2.1.4 The Fluency-Preservation Paradox

These findings create a paradox for legal AI governance.

Standard industry evaluations—the kind vendors cite when claiming reliability—focus on perplexity and general benchmarks. By those measures, quantized models look safe. But reasoning-specific literature documents substantial degradation on exactly the tasks legal practice demands: multi-step inference, exception tracking, precise doctrinal application.

The metrics vendors use to validate quantization (perplexity, general accuracy) may miss failures that matter most in legal contexts. A quantized model can pass industry tests while failing the bar exam.

This study addresses that paradox empirically by testing whether the reasoning-degradation pattern documented in mathematical and logical tasks also appears in legal reasoning while fluency metrics remain reassuring.

2.2 RAG Blindness and Generative Distortion

Quantization can corrupt legal AI systems at two stages: retrieval and generation. This study focuses on generation, but the distinction matters for interpreting results.

Retrieval Degradation (RAG Blindness). In retrieval-augmented generation systems, aggressive quantization of embedding models can cause relevant documents to fail to surface. Jeong demonstrated that 4-bit quantization in vector embeddings causes measurable degradation in retrieval accuracy, with particularly severe impact on semantic similarity matching for nuanced queries. For legal applications: the system fails to retrieve controlling precedent even when it exists in the corpus—not because the case is absent, but because the compressed embedding space cannot distinguish legally adjacent concepts.

Generative Distortion. Separately, quantization of the generator's weights can cause misapplication of correctly retrieved documents. Yazan et al. demonstrated that quantized generators fail specifically on long-context and complex synthesis tasks: the model might cite the right case but misstate the holding, omit a critical exception, or invert the burden of proof.

This study isolates generative distortion by design. In the Doctrinal Arm (Stanford RegLab), no retrieval pipeline exists—the model answers directly from parameters. In the Issue-Spotting Arm (LegalBench), retrieval components stay constant while generator precision varies. When degradation appears, it traces to reasoning capacity, not missing data or misconfigured search.

Future work should test embedding quantization and full-stack RAG degradation. Here, the contribution: establishing that the generative layer alone—the model's "legal brain"—exhibits Reasoning Decay under quantization.

The Stanford RegLab Legal Hallucinations Benchmark

The Doctrinal Arm uses the Stanford RegLab "legal hallucinations" dataset (Dahl, Magesh, Suzgun, and Ho), which provides:

  • A stratified sample of U.S. federal cases spanning the Supreme Court, Courts of Appeals, and District Courts.

  • A fourteen-category taxonomy distinguishing correct answers from specific hallucination types: answer hallucination, source-rule hallucination, case hallucination, citation hallucination, and finer-grained categories capturing misstatements of holdings, incorrect procedural postures, fabricated authorities, and other failures.

  • Verifiable ground-truth answers derived from structured case metadata, enabling automated scoring.

Tasks mimic realistic legal uses: case explanations, holding summaries, author and court identification, doctrinal applications to hypotheticals. Critically, the benchmark distinguishes error kinds—not just "right" versus "wrong," but fabricated versus mischaracterized versus incomplete.

Prior work found general-purpose LLMs hallucinate on legal queries 58–88% of the time. This study asks a different question: not "do LLMs hallucinate on legal tasks?" but "do hallucination rate and type change systematically as bit-precision decreases?"

LegalBench Issue-Spotting Tasks

The Issue-Spotting Arm draws on LegalBench, a collaboratively constructed benchmark of 162 legal reasoning tasks covering six reasoning types. Developed through an interdisciplinary process involving legal professionals, LegalBench has become a standard evaluation suite for legal LLM capabilities.

From LegalBench, I select issue-spotting and doctrinal reasoning tasks stressing multi-factor tests and exception recognition—operations most likely to require the fragile high-precision circuits quantization damages. These tasks require models to identify whether facts implicate particular doctrines, areas of law, or parties.

I supplement these with bar-exam-style fact patterns as issue-spotting stimuli (not multiple-choice evaluation). For each pattern, a rubric specifies core issues and doctrines a minimally competent lawyer should identify. Models receive the prompt: "identify the legal issues raised by the following facts and state the questions you would research." Responses score on an ordinal rubric:

  • Fully captures core issues: All major doctrines and claims identified.

  • Partially captures: Some issues identified, others missed.

  • Misses core issues: Fails to identify central legal questions.

  • Introduces irrelevant or nonsensical issues: Hallucinates doctrines not implicated by facts.

This design enables side-by-side comparison across precision levels. When a high-precision model correctly identifies First Amendment retaliation issues while a low-precision variant confidently calls it a breach-of-contract dispute, the divergence is immediately visible—and legally consequential.

Benchmark Selection Rationale

This study does not exhaust every legal benchmark. The design reflects a constrained optimization: to make credible claims about quantization, I must (1) observe models whose weights and precision I fully control, (2) evaluate tasks demanding genuine legal reasoning rather than shallow pattern matching, and (3) use labels capturing hallucinations and partial failures in ways mapping onto legal risk.

The intersection of these requirements is narrow. Open-weights models with well-understood quantization behavior exist; legal benchmarks with rich hallucination annotations exist; but the overlap is limited to essentially the configuration adopted here.

I therefore exclude retrieval-focused benchmarks (LegalBench-RAG) and proprietary RAG tools (Lexis+ AI, Westlaw AI-Assisted Research, CoCounsel). Those benchmarks evaluate retriever accuracy or commercial hallucination rates—valuable questions, but ones introducing moving parts that blur the causal link between bit-width and legal reasoning.

The contribution is not a comprehensive leaderboard but the cleanest possible case: a reviewer can see, step by step, how reducing precision in otherwise identical models degrades doctrinal answering and issue-spotting.

Other benchmarks—AusLaw (Australian citation accuracy), Zheng et al.'s retrieval-oriented legal QA, LawBench (Chinese legal reasoning)—are candidates for extension. A follow-on study varying both embedding and generator precision could test whether retrieval and generative degradation interact multiplicatively.

Justification for HELM Framework

This study uses Stanford CRFM's HELM framework rather than custom scripts. Seven considerations:

  1. Audit trails: Each run generates SHA-256 hashed bundles (prompts, outputs, configuration), enabling independent verification.
  2. Standardized reporting: HELM accommodates our 9-configuration × 10-task matrix with consistent output formats.
  3. Methodological credibility: HELM powers Stanford CRFM's published assessments, signaling alignment with established best practices.
  4. Reproducibility: The entire study re-executes with a single command; configuration is declarative, not embedded in code.
  5. Model swapping: YAML-based configuration allows transitioning from Arm C (GGUF) to Arm B (AWQ) by updating one file.
  6. Robustness: Caching and crash-safe resume handle 50,000+ inference calls across potentially unstable quantized models.
  7. Extensibility: Adding Arm D requires only a new YAML entry, not architectural changes.

These properties ensure auditable, reproducible, extensible results—requirements aligned with Legal-10's design principles.

Operationalization

Reasoning Decay: the empirical pattern in which reasoning metrics (hallucination rates, correctness scores, issue-spotting rubric scores) systematically deteriorate as precision decreases, while fluency metrics (perplexity, grammaticality ratings) remain stable. Evidence supporting H1–H2 and H4–H5 would directly observe Reasoning Decay. Evidence supporting H7 would identify the precision regime where this decay is most dangerous—masked by preserved fluency.

The parallel-arms design tests these hypotheses. Falsification occurs if: (a) legal correctness stays stable across precision levels (contradicting H1); (b) perplexity degrades proportionally with correctness (contradicting H2, H4); or (c) the two model families diverge sharply (weakening H6).

5.1 Design Choice: Ecological Validity Over Experimental Purity

This study prioritizes ecological validity over experimental purity. The objective: measure what happens to legal reasoning under deployment configurations users actually encounter.

A secondary objective from the market analysis: demonstrating that "4-bit" is not a specification. Two deployments at identical bit-depth using different PTQ methods may produce materially different legal reasoning—and users cannot distinguish them. This study tests that claim.

The design involves deliberate tradeoffs. Testing multiple PTQ methods lets us show method variance. Using officially released and community-standard quantizations rather than a single controlled PTQ pipeline sacrifices clean causal isolation but yields results reflecting what lawyers actually face.

We accept these tradeoffs because the policy-relevant questions are:

  1. What do lawyers encounter in practice?
  2. Can users distinguish "good 4-bit" from "bad 4-bit"?

We hypothesize that the answer to the second is no, and the design below tests that hypothesis.


5.2 Model and Precision Regimes

Design: Three arms testing two model families under two PTQ philosophies.

| Arm | Model | PTQ Method | Precision Tiers | Rationale |
| --- | --- | --- | --- | --- |
| A | Qwen 2.5 7B-Instruct | Alibaba official | INT8 / INT4 / 2-bit | Ecological validity: vendor-shipped artifacts |
| B | Llama 3.1 8B-Instruct | AWQ | INT8 / INT4 / 2-bit | High-quality PTQ: activation-aware, protects salient weights |
| C | Llama 3.1 8B-Instruct | llama.cpp (GGUF) | Q8_0 / Q4_K_M / Q2_K | Low-cost PTQ: block-wise compression, no importance calibration |

Why three arms:

Arms A and B test FRD under "reasonable quality" PTQ—what a careful vendor might ship. These represent the ceiling users could hope to encounter.

Arm C tests FRD under the PTQ method economic pressures favor. Startups without GPU budgets, solo developers, and cost-constrained deployments default to llama.cpp because it runs on consumer hardware. This represents what many users actually encounter.

The B vs C comparison is the key methodological contribution. Same architecture. Same base weights. Same nominal bit-depth. Different PTQ philosophy. If legal reasoning diverges significantly between B and C at INT4, then "4-bit Llama" is a label covering materially different products—and users cannot know which "4-bit" they receive.

Precision ladder: 8 → 4 → 2

| Tier | AWQ / Official | GGUF Equivalent | Role |
| --- | --- | --- | --- |
| 8-bit | INT8 | Q8_0 | Baseline: reasonable production floor |
| 4-bit | INT4 | Q4_K_M | Test: hypothesized silent defect zone |
| 2-bit | 2-bit | Q2_K | Test: hypothesized obvious defect zone |

Note on GGUF variants: llama.cpp offers multiple 4-bit options (Q4_0, Q4_1, Q4_K_S, Q4_K_M). We select Q4_K_M as representative of "reasonable llama.cpp deployment": it is the medium variant of the k-quant family, balancing size and accuracy. Q2_K is the 2-bit option in the same k-quant family.
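The three arms and the precision ladder can be written down as a small declarative structure. This is a sketch for bookkeeping only; the dictionary layout and helper names are hypothetical and are not the HELM configuration format.

```python
from itertools import product

# Tier labels follow the design tables; the dictionary layout is illustrative.
ARMS = {
    "A": {"model": "Qwen 2.5 7B-Instruct", "ptq": "Alibaba official",
          "tiers": {8: "INT8", 4: "INT4", 2: "2-bit"}},
    "B": {"model": "Llama 3.1 8B-Instruct", "ptq": "AWQ",
          "tiers": {8: "INT8", 4: "INT4", 2: "2-bit"}},
    "C": {"model": "Llama 3.1 8B-Instruct", "ptq": "llama.cpp (GGUF)",
          "tiers": {8: "Q8_0", 4: "Q4_K_M", 2: "Q2_K"}},
}
LADDER = (8, 4, 2)


def configurations():
    """Yield one record per arm x tier combination."""
    for arm, bits in product(ARMS, LADDER):
        spec = ARMS[arm]
        yield {"arm": arm, "model": spec["model"], "ptq": spec["ptq"],
               "bits": bits, "label": spec["tiers"][bits]}


def matched_pair(configs, arm_x, arm_y, bits):
    """Return the two configurations that share a nominal bit-depth."""
    by_arm = {c["arm"]: c for c in configs if c["bits"] == bits}
    return by_arm[arm_x], by_arm[arm_y]


configs = list(configurations())
b4, c4 = matched_pair(configs, "B", "C", 4)
print(len(configs), "configurations")
print(b4["label"], "vs", c4["label"], "on", b4["model"])
```

Keeping the matrix declarative makes the B versus C comparison explicit: the pair differs only in the `ptq` and `label` fields.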


5.3 Task Battery

The empirical demonstration anchors the broader thesis without overclaiming coverage. Sections II and IV establish that quantization plausibly affects every lawyer skill circuit. This section shows FRD concretely manifesting in legally meaningful tasks across multiple model/PTQ configurations.

Primary Instrument: Dahl et al. (2024) Legal Hallucinations

The full Dahl benchmark evaluates legal hallucination across 14 categories of verifiable legal queries.

Rationale:

  1. Verifiable ground truth. Dahl uses structured case metadata that can be programmatically verified. The cited case either exists with the stated attributes or it does not. No LLM-as-judge required for core evaluation.

  2. Multi-skill coverage. The benchmark touches:

  • S7 (Distinguishing Cases): Doctrinal agreement, holding comparison

  • S8 (Synthesize Results): Case explanation, holding identification

  • Professional Responsibility: Citation existence, attribution accuracy

  3. FRD operationalization. The 14-category taxonomy allows observation of whether errors cluster in reasoning-intensive categories while fluency-dependent categories remain stable, which is the FRD signature.

  4. Cross-arm comparability. All three arms run identical Dahl queries, enabling direct comparison of:

  • Qwen vs Llama at matched bit-depth

  • AWQ-Llama vs GGUF-Llama at matched bit-depth

  • Degradation curves across the 8→4→2 ladder

Secondary Instrument: LegalBench Rule-Conclusion Subset

A narrow slice of LegalBench testing issue-spotting and rule application.

Rationale:

  1. Different skill surface. Tests S7 (Distinguishing) from a different angle — identifying legally significant factual differences rather than hallucination detection.

  2. Generalizability check. If FRD appears on both Dahl and LegalBench across all three arms, the pattern is not benchmark-specific or model-specific.

  3. Lightweight addition. The subset is small enough to run across all nine configurations (3 arms × 3 precision tiers) without ballooning compute.


5.4 Hypotheses

  • H1: Reasoning Degradation (Silent Defect Zone). Legal reasoning accuracy will degrade significantly between INT8 and INT4 across all three arms, with the largest drops in categories requiring multi-step inference, exception-tracking, or doctrinal distinction.

  • H2: Fluency Stability (FRD Signature). Fluency metrics (perplexity, grammaticality, coherence) will remain stable across precision tiers even as reasoning accuracy degrades, confirming Fluency–Reasoning Divergence. This pattern will appear in all three arms.

  • H3: Catastrophic Collapse (Obvious Defect Zone). At 2-bit precision, both fluency and reasoning will degrade across all arms, producing visible failure modes distinguishable from the latent degradation at INT4.

  • H4: Category-Specific Vulnerability. Different Dahl hallucination categories will show differential sensitivity to quantization. Categories involving fine-grained legal distinctions (holding vs. dicta, majority vs. dissent, procedural posture) will degrade faster than categories involving coarse factual retrieval.

  • H5: PTQ Method Divergence. At identical bit-depth (INT4 / Q4_K_M), AWQ (Arm B) will significantly outperform llama.cpp/GGUF (Arm C) on legal reasoning accuracy, demonstrating that bit-depth alone is an insufficient specification. The performance gap between B and C at 4-bit may exceed the gap between 8-bit and 4-bit within a single arm.

  • H6: Cross-Architecture Comparison (Exploratory). Qwen (Arm A) and Llama-AWQ (Arm B), both representing high-quality PTQ, may show different degradation profiles reflecting architectural differences. This comparison is exploratory because we cannot fully isolate architecture from PTQ implementation differences between Alibaba's process and community AWQ.


5.5 Metrics

Primary Outcome Metrics

| Metric | What It Measures | Applied To |
| --- | --- | --- |
| Dahl Accuracy (overall) | Correctness rate across all 14 categories | All arms, all tiers |
| Dahl Accuracy (by category) | Correctness rate per hallucination type | All arms, all tiers |
| LegalBench Accuracy | Rule-conclusion task correctness | All arms, all tiers |

FRD Detection Metrics

| Metric | What It Measures | Purpose |
| --- | --- | --- |
| Perplexity | Model confidence / fluency proxy | Detect fluency stability while reasoning degrades |
| Response Coherence | Grammaticality, structure, readability | Confirm surface quality preservation |
| FRD Index | (Fluency stability) − (Reasoning degradation) | Composite measure of divergence magnitude |
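The FRD Index is stated only schematically, so any implementation has to pick a normalization. The sketch below assumes one possible reading: both terms are expressed as retention relative to the 8-bit baseline, so the index grows when accuracy falls while perplexity holds. The function names and the sample numbers are hypothetical placeholders, not study results.

```python
def retention(baseline: float, value: float, higher_is_better: bool = True) -> float:
    """Fraction of baseline quality retained, capped at 1.0."""
    ratio = value / baseline if higher_is_better else baseline / value
    return min(ratio, 1.0)


def frd_index(base_acc: float, tier_acc: float,
              base_ppl: float, tier_ppl: float) -> float:
    """Assumed reading: fluency retention minus reasoning retention.

    Near 0 when both hold or both collapse together; positive when
    accuracy falls faster than fluency.
    """
    fluency = retention(base_ppl, tier_ppl, higher_is_better=False)
    reasoning = retention(base_acc, tier_acc)
    return fluency - reasoning


# Made-up placeholder values, for illustration only.
silent = frd_index(base_acc=0.60, tier_acc=0.45, base_ppl=6.0, tier_ppl=6.0)
collapse = frd_index(base_acc=0.60, tier_acc=0.15, base_ppl=6.0, tier_ppl=24.0)
print(f"silent-zone example: {silent:.2f}, collapse example: {collapse:.2f}")
```

Under this reading a tier where both metrics collapse scores near zero, which matches the intended contrast between the silent and obvious defect zones. Other normalizations (for example log-perplexity differences) are equally defensible and would need to be fixed before data collection.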

Calibration Metrics

| Metric | What It Measures | Purpose |
| --- | --- | --- |
| Expected Calibration Error (ECE) | Alignment between confidence and accuracy | Detect overconfidence: does the model "know" it is wrong? |
| Confidence by Correctness | Mean confidence on correct vs incorrect outputs | Identify whether errors are high-confidence (dangerous) or low-confidence (detectable) |
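Both calibration metrics reduce to a few lines once each response carries a confidence score. The sketch below uses the common equal-width-bin form of ECE; how confidence is extracted (for example, from token probabilities) is not specified here and is left as an assumption, as is the choice of ten bins.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - mean_conf)
    return ece


def confidence_by_correctness(confidences, correct):
    """Mean confidence on correct answers and on incorrect answers."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(right), mean(wrong)


# Toy inputs for illustration only.
confs = [0.95, 0.85, 0.95, 0.55]
labels = [True, True, False, False]
print(expected_calibration_error(confs, labels))
print(confidence_by_correctness(confs, labels))
```

A small gap between the two means returned by `confidence_by_correctness` indicates that errors are delivered with nearly the same confidence as correct answers, the dangerous case described in the table.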

Forensic / Reproducibility Metrics

| Metric | What It Measures | Purpose |
| --- | --- | --- |
| EBPW (Effective Bits Per Weight) | Actual compression achieved | Verify claimed precision matches reality |
| dtype audit | Data type of served weights | Confirm no silent precision override |
| Response length distribution | Token count by arm/tier | Control for verbosity changes under compression |
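EBPW can be approximated from quantities that are easy to observe. The sketch below assumes the simplest estimate, artifact size in bits divided by parameter count; the tolerance is a hypothetical threshold. Real artifacts include scales, zero-points, metadata, and sometimes higher-precision tensors, so the estimate normally sits somewhat above the nominal bit-width.

```python
def effective_bits_per_weight(file_size_bytes: int, n_params: int) -> float:
    """Estimate EBPW as total artifact bits divided by parameter count."""
    return file_size_bytes * 8 / n_params


def precision_mismatch(ebpw: float, nominal_bits: float,
                       tolerance_bits: float = 1.0) -> bool:
    """Flag artifacts whose estimated EBPW is far from the advertised tier.

    tolerance_bits is an assumed threshold, not a value from the study.
    """
    return abs(ebpw - nominal_bits) > tolerance_bits


# Placeholder sizes for illustration only.
ebpw = effective_bits_per_weight(file_size_bytes=4_000_000_000,
                                 n_params=8_000_000_000)
print(ebpw, precision_mismatch(ebpw, nominal_bits=4))
```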

5.6 Procedure

5.6.1 Environment

| Component | Specification |
| --- | --- |
| Hardware | [GPU type, VRAM, count] |
| Cloud/Local | [Provider or local spec] |
| Framework | [vLLM / llama.cpp / HuggingFace; specify per arm] |

5.6.2 Model Artifacts

| Arm | Source | Artifact Identifiers |
| --- | --- | --- |
| A (Qwen Official) | Hugging Face / Alibaba | [exact model card names] |
| B (Llama AWQ) | Hugging Face | [exact model card names] |
| C (Llama GGUF) | Hugging Face / llama.cpp | [exact GGUF filenames] |

5.6.3 Prompt Protocol

  • Identical prompts across all nine configurations (3 arms × 3 tiers)

  • Dahl benchmark published prompt format

  • LegalBench published prompt format

  • No arm-specific prompt tuning

5.6.4 Decoding Parameters

Fixed across all runs:

| Parameter | Value | Rationale |
| --- | --- | --- |
| Temperature | [X] | [why] |
| Top-p | [X] | [why] |
| Max tokens | [X] | [why] |
| Seed | [X] | Reproducibility |

5.6.5 Run Protocol

| Step | Description |
| --- | --- |
| 1 | Load model at specified precision |
| 2 | Verify dtype/EBPW before run |
| 3 | Run full Dahl benchmark |
| 4 | Run LegalBench subset |
| 5 | Log all outputs with metadata |
| 6 | Compute metrics |
| 7 | Repeat for each arm × tier combination |

5.6.6 Runs per Configuration

[Single run / N runs averaged — specify]

Rationale: [If single: deterministic decoding eliminates variance. If multiple: captures sampling variance under non-zero temperature.]

5.6.7 Evaluation Protocol

  • Dahl: Automated scoring against ground truth metadata

  • LegalBench: Automated scoring per published rubric

  • Borderline cases: Manual review with [N] raters, inter-rater reliability reported

5.6.8 Reproducibility Package

Per Legal-10 protocol:

  • Full run bundles (prompts, raw outputs, scores, configuration)

  • SHA-256 hashes for all artifacts

  • Signed manifests

  • Public append-only submission log
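The hashing step of such a package can be sketched with the standard library. The manifest layout below is an assumption made for illustration, not the Legal-10 format, and signing and the append-only log are out of scope for the sketch.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict:
    """Hash each artifact, then hash the canonical JSON of those hashes.

    The field names ("artifacts", "manifest_sha256") are hypothetical.
    """
    entries = {name: sha256_hex(blob) for name, blob in sorted(artifacts.items())}
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return {"artifacts": entries,
            "manifest_sha256": sha256_hex(canonical.encode("utf-8"))}


# Placeholder contents standing in for real run bundles.
bundle = {
    "prompts.jsonl": b'{"id": 1, "prompt": "..."}\n',
    "outputs.jsonl": b'{"id": 1, "output": "..."}\n',
    "config.yaml": b"arm: C\ntier: Q4_K_M\n",
}
manifest = build_manifest(bundle)
print(json.dumps(manifest, indent=2))
```

Hashing a canonical serialization of the per-file digests means a verifier can recompute one value to confirm that nothing in the bundle changed.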


5.7 Analysis Plan

5.7.1 Within-Arm Analysis (H1, H2, H3)

For each arm independently:

  • Plot accuracy by precision tier (8 → 4 → 2)

  • Plot perplexity by precision tier

  • Identify "silent defect zone" (accuracy drops, perplexity stable)

  • Identify "obvious defect zone" (accuracy drops and perplexity rises)
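The zone labels above amount to a two-by-two classification of each tier against the 8-bit baseline. The sketch below is one way to make that rule explicit; both thresholds are hypothetical and would need to be fixed before the data are examined.

```python
def classify_tier(acc_drop: float, ppl_increase: float,
                  acc_threshold: float = 0.05,
                  ppl_threshold: float = 0.10) -> str:
    """Label a precision tier relative to the baseline.

    acc_drop: relative accuracy loss versus baseline (0.20 means 20% lower).
    ppl_increase: relative perplexity increase versus baseline.
    Thresholds are assumed values, not taken from the study.
    """
    reasoning_degraded = acc_drop >= acc_threshold
    fluency_degraded = ppl_increase >= ppl_threshold
    if reasoning_degraded and fluency_degraded:
        return "obvious defect zone"
    if reasoning_degraded:
        return "silent defect zone"
    if fluency_degraded:
        return "fluency-only degradation"
    return "no detected defect"


# Illustrative inputs only.
print(classify_tier(acc_drop=0.25, ppl_increase=0.02))
print(classify_tier(acc_drop=0.70, ppl_increase=3.00))
```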

5.7.2 Category-Level Analysis (H4)

  • Heatmap: Dahl category × precision tier × arm

  • Identify which categories degrade first/fastest

  • Test whether reasoning-intensive categories show steeper decline

5.7.3 PTQ Method Comparison (H5)

  • Direct comparison: Arm B vs Arm C at each precision tier

  • Statistical test: Is B-C gap at INT4 significant?

  • Compare B-C gap magnitude to within-arm 8→4 gap

  • If B-C gap ≥ 8→4 gap: "PTQ method matters as much as bit-depth"
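Because Arms B and C answer the same queries, the comparison is paired. The plan does not name a statistical test; the sketch below assumes an exact McNemar test on per-item correctness as one reasonable choice, alongside the gap comparison from the last bullet. All function names are hypothetical.

```python
from math import comb


def mcnemar_exact(b_correct, c_correct) -> float:
    """Two-sided exact McNemar p-value for paired per-item correctness."""
    only_b = sum(1 for b, c in zip(b_correct, c_correct) if b and not c)
    only_c = sum(1 for b, c in zip(b_correct, c_correct) if c and not b)
    n = only_b + only_c  # discordant pairs
    if n == 0:
        return 1.0
    k = min(only_b, only_c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def method_gap_vs_bit_gap(acc_b_4bit: float, acc_c_4bit: float,
                          acc_8bit: float, acc_4bit: float):
    """Compare the B-C gap at 4-bit with a within-arm 8-to-4 gap."""
    method_gap = acc_b_4bit - acc_c_4bit
    bit_gap = acc_8bit - acc_4bit
    return method_gap, bit_gap, method_gap >= bit_gap


# Toy paired outcomes for illustration only.
arm_b = [True] * 8
arm_c = [False] * 5 + [True] * 3
print(mcnemar_exact(arm_b, arm_c))
```

With multiple categories and tiers, a correction for multiple comparisons would also be needed; that choice is left open here.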

5.7.4 Cross-Architecture Comparison (H6, Exploratory)

  • Compare Arm A vs Arm B at matched precision

  • Report with caveat: different PTQ implementations

  • Note architectural differences if degradation profiles diverge

5.7.5 FRD Quantification

  • Compute FRD Index per arm/tier

  • Report magnitude of divergence

  • Identify the precision threshold where FRD is maximized (reasoning collapsed, fluency preserved)


5.8 The Opacity Finding

Regardless of quantitative results, this design surfaces a structural finding:

Users cannot distinguish "good 4-bit" from "bad 4-bit."

If H5 is confirmed—if AWQ-Llama significantly outperforms GGUF-Llama at identical bit-depth—then "4-bit Llama" is meaningless for quality inference. A user downloading a "4-bit Llama" legal AI tool cannot know whether they receive Arm B or Arm C quality.

This is opacity at the artifact level, not just deployment. Even users who know they want "4-bit" cannot verify what "4-bit" they get.

The evidentiary gap is architectural—existing before the model is served, before the API is called, before the user types a query. The defect is baked into market structure.


What's still needed:

  • Hardware specs

  • Exact model card identifiers for all nine configurations

  • Decoding parameter values

  • Single vs multiple runs decision

  • Number of raters for borderline cases


[To be completed after experiments]


Results confirm the T-Arm signature failure: Fluency-Reasoning Divergence emerges in the INT8–INT4 silent defect regime (H7), where nominal equivalence masks fidelity degradation. The Observability Gap (Phase 2 of SDF diagnostic) explains why lawyers cannot verify precision tier at reliance.

Holding everything constant except quantization—same dataset, same prompts, same evaluator, same taxonomy—observed Reasoning Decay traces causally to bit-precision reduction. We ran Dahl's pipeline on quantized models. Nothing else changed.

[Additional discussion to be completed after results]