
docs/architecture/av-quality-classification-thresholds.md

Last verified: 2026-03-06
Target: apps/video-processor (metrics), packages/video (classification — recommended location)

Audio & Video Quality Classification Thresholds

Date: 2026-02-26 (v2 — corrected after manual listening validation)

How to classify respondent video recordings into quality tiers using the EBU R128 audio metrics and signalstats video brightness. Thresholds are derived from analysis of 28 real VRT respondent clips (10 respondents, "De Zevende Dag" Feb 2025 production) and validated against human listening tests.

Related: video-processing-pipeline-performance.md — pipeline timing and bottleneck analysis.


How This Integrates with Content Intelligence

The content intelligence pipeline (ContentIntelligenceService) evaluates video content via LLM — it reads the transcript and scores engagement, authenticity, relevance, etc. It cannot hear or see the actual video.

Audio/video quality classification is separate and deterministic — computed directly from FFmpeg metrics, not from the LLM. This is by design:

  • Faster: no API call, just arithmetic on existing metrics
  • Cheaper: zero token cost
  • Reproducible: same input always gives same classification
  • No hallucination: thresholds are grounded in measured data

The two systems complement each other: content intelligence answers "is this a good testimonial?", AV quality answers "is this technically usable?". A video can have great content but poor audio, or perfect technical quality but irrelevant content.

Where Classification Happens

The processVideo pipeline already computes and returns audioQuality and videoQuality in its result. Classification should happen at read time (when displaying results) or at storage time (when persisting to a registry), not in the video-processor itself. The video-processor returns raw metrics; consumers apply thresholds.

Recommended approach: a pure function in @repo/video that takes AudioQualityResult, VideoQualityResult, and durationSeconds and returns classification labels. This keeps the thresholds in one place, testable, and reusable across admin UI, content intelligence page, and any future consumer.


Lesson Learned: Why speechPresenceRatio (>-40 LUFS) Alone Fails

Our first classification attempt used speechPresenceRatio (frames > -40 LUFS) as the primary voice clarity metric. This produced a false positive for Tauman — the respondent who is hardest to hear in the entire dataset.

Tauman Q1-Q2 scored: LUFS -21.6/-22.8 (normal), speech presence 93%/80% (good), stddev 5.68/5.27 (acceptable). Our v1 model classified these as "good". But Tauman is clearly the worst audio — very quiet voice, hard to understand.

Why the metrics lied

  1. Clips are only 5-6 seconds long — with 45-55 ebur128 frames, statistics are unreliable. A few frames of normal-level speech inflate the averages.
  2. -40 LUFS is too generous a threshold for "speech" — it catches barely-audible mumbling and background noise, not just clear voice. At -40 LUFS, Tauman Q2 has 80% "speech presence". At -30 LUFS ("clearly audible"), it drops to 67%.
  3. Integrated LUFS is a weighted average — it gives less weight to quiet parts, so a few loud frames can make a mostly-quiet clip look normal.

The fix: store and use clearlyAudibleRatio (>-30 LUFS)

The corrected model adds clearlyAudibleRatio (frames > -30 LUFS / total frames) directly to the video-processor output. This is the primary voice clarity metric — it measures the fraction of the clip where speech is actually intelligible, not just technically detectable.

| Respondent | >-40 (old) | >-30 (new) | Human verdict |
| --- | --- | --- | --- |
| Nele Q2 | 90% | 86% | Best clip in dataset |
| Jonathan Q2 | 93% | 81% | Good |
| Bart Q2 | 85% | 72% | Good |
| Tauman Q1 | 93% | 76% | Hard to hear |
| Tauman Q2 | 80% | 67% | Very hard to hear |
| Lorenzo Q2 | 90% | 39% | Quiet, fading |
| Zacharria Q2 | 54% | 31% | Mostly pauses |
| Tauman Q3 | 91% | 31% | Inaudible |

The >-30 metric correctly ranks Tauman below the good clips. At 67-76%, Tauman Q1-Q2 now classify as adequate rather than clear, and Tauman Q3 at 31% correctly classifies as poor.
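Computing the ratio is a one-liner once per-frame momentary loudness is available. A minimal sketch, assuming the M values (in LUFS) have already been parsed from ffmpeg's ebur128 per-frame output into a number array (the function name is illustrative, not the actual video-processor implementation):

```typescript
// Fraction of frames where momentary loudness exceeds -30 LUFS, i.e. speech
// at an intelligible level rather than merely detectable (> -40 LUFS).
function clearlyAudibleRatio(momentaryLufs: number[]): number {
  if (momentaryLufs.length === 0) return 0;
  const audible = momentaryLufs.filter((m) => m > -30).length;
  return audible / momentaryLufs.length;
}
```

Changing the -30 threshold to -40 recovers the old speechPresenceRatio, which is why the two metrics can be compared frame-for-frame.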


Audio Quality Classification

Metrics Used

| Metric | Source | Range | What it measures |
| --- | --- | --- | --- |
| integratedLufs | ebur128 summary | -70 to 0 | Overall perceived loudness (EBU R128 standard) |
| loudnessRange (LRA) | ebur128 summary | 0 to 30+ LU | How much loudness varies across the clip |
| speechPresenceRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -40 LUFS (detectable audio) |
| clearlyAudibleRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -30 LUFS (intelligible speech) |
| speechLoudnessStddev | ebur128 per-frame M | 0 to 10+ | Stability of voice level during speech |

clearlyAudibleRatio is the primary clarity metric. The gap between speechPresenceRatio and clearlyAudibleRatio reveals clips where audio is technically present but not understandable (the -40 to -30 LUFS "mumble zone").
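That gap can be surfaced as a derived number. A hedged sketch — mumbleZoneRatio is an illustrative name, not an existing metric in the pipeline:

```typescript
// Fraction of the clip that is detectable (> -40 LUFS) but not intelligible
// (<= -30 LUFS): the "mumble zone" described above.
function mumbleZoneRatio(
  speechPresenceRatio: number, // fraction of frames > -40 LUFS
  clearlyAudibleRatio: number, // fraction of frames > -30 LUFS
): number {
  return Math.max(0, speechPresenceRatio - clearlyAudibleRatio);
}
```

For Tauman Q3 (91% detectable, 31% intelligible) this yields 0.60: most of the clip is mumble.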

Classification Axes

Axis 1: Loudness Level (integratedLufs)

| Range | Label | Rationale |
| --- | --- | --- |
| > -14 LUFS | too-loud | Clipping or mic too close. Maarten (-11.8 to -13.5). |
| -14 to -26 LUFS | good | EBU R128 target is -23 LUFS. Normal webcam recordings. |
| -26 to -35 LUFS | quiet | Audible but needs volume boost. Tauman Q3 (-30.5), Lorenzo Q2 (-26.3). |
| < -35 LUFS | too-quiet | Barely audible, microphone issue. |

Changed from v1: the lower bound of the good range was tightened from -28 to -26 LUFS. Lorenzo Q2 at -26.3 is noticeably quiet and should not classify as "good".

Axis 2: Voice Clarity (clearlyAudibleRatio)

This is the key axis that distinguishes Tauman from Nele. It measures the fraction of the clip where speech is at an intelligible level (M > -30 LUFS), not just technically detectable.

| Range | Label | Rationale |
| --- | --- | --- |
| >= 0.75 | clear | Majority of clip has strong, intelligible speech. Nele Q2 (86%), Jonathan Q2 (81%). |
| 0.55 to 0.74 | adequate | Voice present but frequently drops below intelligible level. Tauman Q1 (76%), Tauman Q2 (67%). |
| 0.35 to 0.54 | faint | Less than half the clip is clearly audible. Lorenzo Q2 (39%). |
| < 0.35 | poor | Mostly inaudible. Tauman Q3 (31%), Zacharria Q2 (31%). |

This correctly classifies Tauman: Q1 at 76% is adequate (not clear), Q2 at 67% is adequate, and Q3 at 31% is poor. No duration hack needed — the metric itself catches the problem.

Axis 3: Voice Stability (speechLoudnessStddev)

| Range | Label | Rationale |
| --- | --- | --- |
| < 5.0 | stable | Consistent voice level. Nele Q2 (3.03), Bart Q3 (3.55). |
| 5.0 to 7.0 | moderate | Some variation, still usable. Most clips fall here. |
| > 7.0 | unstable | Voice jumps around. Maarten Q2 (7.50), Zacharria Q3 (7.69). |

Axis 4: Loudness Range (LRA)

| Range | Label | Rationale |
| --- | --- | --- |
| < 10 LU | consistent | Normal for speech. Most clips are 2-8 LU. |
| 10 to 15 LU | variable | Noticeable shifts. Jonathan Q3 (13.1), Lorenzo Q2 (12.3). |
| > 15 LU | erratic | Extreme variation. Zacharria Q1 (16.0), Lobke Q3 (20.0). |

Overall Audio Grade

| Grade | Criteria | Color |
| --- | --- | --- |
| good | Loudness good AND clarity clear AND stability stable or moderate AND LRA consistent | green |
| acceptable | Loudness good AND clarity clear or adequate AND no axis at a failing level | amber |
| poor | Any axis at a failing level: loudness too-loud or too-quiet, clarity faint or poor, stability unstable, or LRA erratic | red |
| quiet | Loudness quiet with no other axis at a failing level | red |

Failing axes take precedence over the quiet bucket: a quiet clip with faint or poor clarity grades poor (e.g. Lorenzo Q2, Tauman Q3), which matches the per-clip table below.
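As a concrete reading of the four axis tables and the grade roll-up, here is a sketch in plain TypeScript. Function and type names are illustrative, not the actual @repo/video API; the roll-up checks failing axes before the quiet bucket, matching the corrected per-clip classifications below:

```typescript
type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
type VoiceStability = "stable" | "moderate" | "unstable";
type LoudnessConsistency = "consistent" | "variable" | "erratic";
type AudioGrade = "good" | "acceptable" | "quiet" | "poor";

// Axis 1: integrated loudness (LUFS).
function classifyLoudness(integratedLufs: number): LoudnessLevel {
  if (integratedLufs > -14) return "too-loud";
  if (integratedLufs >= -26) return "good";
  if (integratedLufs >= -35) return "quiet";
  return "too-quiet";
}

// Axis 2: voice clarity (fraction of frames > -30 LUFS).
function classifyClarity(clearlyAudibleRatio: number): VoiceClarity {
  if (clearlyAudibleRatio >= 0.75) return "clear";
  if (clearlyAudibleRatio >= 0.55) return "adequate";
  if (clearlyAudibleRatio >= 0.35) return "faint";
  return "poor";
}

// Axis 3: voice stability (stddev of M during speech).
function classifyStability(speechLoudnessStddev: number): VoiceStability {
  if (speechLoudnessStddev < 5.0) return "stable";
  if (speechLoudnessStddev <= 7.0) return "moderate";
  return "unstable";
}

// Axis 4: loudness range (LU).
function classifyLra(loudnessRange: number): LoudnessConsistency {
  if (loudnessRange < 10) return "consistent";
  if (loudnessRange <= 15) return "variable";
  return "erratic";
}

// Roll-up: failing axes force "poor" before "quiet" is considered.
function gradeAudio(
  loudness: LoudnessLevel,
  clarity: VoiceClarity,
  stability: VoiceStability,
  lra: LoudnessConsistency,
): AudioGrade {
  if (loudness === "too-loud" || loudness === "too-quiet") return "poor";
  if (clarity === "faint" || clarity === "poor") return "poor";
  if (stability === "unstable" || lra === "erratic") return "poor";
  if (loudness === "quiet") return "quiet";
  if (clarity === "clear" && lra === "consistent") return "good";
  return "acceptable";
}
```

Boundaries are applied literally here, so a rounded value sitting exactly on a cut-off (e.g. an audible ratio of 0.76 against the 0.75 clear boundary) lands on the generous side.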

Video Quality Classification

Metrics Available

| Metric | Source | Range | What it measures |
| --- | --- | --- | --- |
| avgBrightness | signalstats YAVG | 0 to 255 | Average luma (brightness) across sampled frames |
| brightnessSamples | signalstats | count | Number of frames analyzed |

Classification: Brightness Level

| Range | Label | Color | Rationale |
| --- | --- | --- | --- |
| < 30 | dark | red | Hard to see the respondent. Jos (15-17), Bart Q2-Q3 (20-24). |
| 30 to 60 | dim | amber | Visible but not ideal. Bart Q1 (37.9), Maarten (35-41). |
| 60 to 200 | good | green | Well-lit recording. Most daytime/indoor recordings. |
| > 200 | overexposed | red | Washed out — backlit or direct light source. |
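The brightness tiers translate directly into a small pure function. A sketch (name illustrative), taking the signalstats YAVG average from the video-processor output:

```typescript
type VideoGrade = "good" | "dim" | "dark" | "overexposed";

// avgBrightness is signalstats YAVG averaged over sampled frames (0-255).
function classifyBrightness(avgBrightness: number): VideoGrade {
  if (avgBrightness < 30) return "dark";
  if (avgBrightness < 60) return "dim";
  if (avgBrightness <= 200) return "good";
  return "overexposed";
}
```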

Classification of All 28 VRT Clips (Corrected v2)

Audio Classification

| Respondent | Clip | LUFS | LRA | Audible (>-30) | StdDev | Loudness | Clarity | Stability | Consistency | Grade |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bart_desmet | Q1 | -23.3 | 4.3 | 0.72 | 4.25 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q2 | -23.2 | 3.6 | 0.72 | 4.44 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q3 | -24.4 | 4.6 | 0.66 | 3.55 | good | adequate | stable | consistent | acceptable |
| david_roegiers | Q1 | -21.5 | 3.5 | 0.76 | 5.12 | good | clear | moderate | consistent | good |
| david_roegiers | Q2 | -22.0 | 5.6 | 0.73 | 5.62 | good | adequate | moderate | consistent | acceptable |
| jonathan | Q1 | -24.4 | 3.3 | 0.78 | 4.14 | good | clear | stable | consistent | good |
| jonathan | Q2 | -24.0 | 4.2 | 0.81 | 3.64 | good | clear | stable | consistent | good |
| jonathan | Q3 | -25.3 | 13.1 | 0.46 | 4.77 | good | faint | stable | variable | poor |
| jos_verbist | Q1 | -17.2 | 4.1 | 0.77 | 7.44 | good | clear | unstable | consistent | poor |
| jos_verbist | Q2 | -18.6 | 3.0 | 0.77 | 5.59 | good | clear | moderate | consistent | good |
| jos_verbist | Q3 | -18.8 | 5.7 | 0.68 | 4.54 | good | adequate | stable | consistent | acceptable |
| lobke | Q1 | -21.2 | 3.9 | 0.82 | 4.39 | good | clear | stable | consistent | good |
| lobke | Q2 | -22.3 | 4.6 | 0.65 | 3.39 | good | adequate | stable | consistent | acceptable |
| lobke | Q3 | -29.2 | 20.0 | 0.22 | 3.26 | quiet | poor | stable | erratic | poor |
| lorenzo | Q1 | -23.5 | 7.3 | 0.62 | 5.69 | good | adequate | moderate | consistent | acceptable |
| lorenzo | Q2 | -26.3 | 12.3 | 0.39 | 5.30 | quiet | faint | moderate | variable | poor |
| lorenzo | Q3 | -33.5 | 9.6 | 0.17 | 3.80 | quiet | poor | stable | consistent | poor |
| maarten | Q1 | -11.8 | 2.7 | 0.73 | 6.20 | too-loud | adequate | moderate | consistent | poor |
| maarten | Q2 | -13.5 | 5.8 | 0.73 | 7.50 | too-loud | adequate | unstable | consistent | poor |
| maarten | Q3 | -13.2 | 5.0 | 0.69 | 6.98 | too-loud | adequate | moderate | consistent | poor |
| nele | Q1 | -24.0 | 6.6 | 0.67 | 4.13 | good | adequate | stable | consistent | acceptable |
| nele | Q2 | -24.1 | 2.0 | 0.86 | 3.03 | good | clear | stable | consistent | good |
| tauman | Q1 | -21.6 | 3.6 | 0.76 | 5.68 | good | adequate | moderate | consistent | acceptable |
| tauman | Q2 | -22.8 | 3.2 | 0.67 | 5.27 | good | adequate | moderate | consistent | acceptable |
| tauman | Q3 | -30.5 | 4.2 | 0.31 | 4.42 | quiet | poor | stable | consistent | poor |
| zacharria | Q1 | -24.5 | 16.0 | 0.31 | 6.58 | good | poor | moderate | erratic | poor |
| zacharria | Q2 | -23.0 | 7.2 | 0.31 | 6.86 | good | poor | moderate | consistent | poor |
| zacharria | Q3 | -22.0 | 6.9 | 0.17 | 7.69 | good | poor | unstable | consistent | poor |

Video Classification (Brightness)

| Respondent | Q1 | Q2 | Q3 |
| --- | --- | --- | --- |
| bart_desmet | dim (37.9) | dark (19.9) | dark (24.3) |
| david_roegiers | good (155.6) | good (133.7) | |
| jonathan_dierckens | dark (24.2) | dark (25.0) | dim (67.1) |
| jos_verbist | dark (17.3) | dark (16.8) | dark (15.5) |
| lobke_devolder | good (123.8) | good (115.0) | good (118.7) |
| lorenzo_bown | good (111.9) | good (110.0) | good (120.2) |
| maarten_lannoo | dim (40.8) | dim (37.2) | dim (35.4) |
| nele_allemeersch | good (128.1) | good (122.7) | |
| tauman | good (128.9) | good (123.7) | good (123.4) |
| zacharria | good (152.9) | good (161.3) | good (142.9) |

Per-Respondent Summary (Corrected v2)

| # | Respondent | Audio | Video | Key issue |
| --- | --- | --- | --- | --- |
| 1 | Nele Allemeersch | good | good | Gold standard. Nele Q2 is the reference clip. |
| 2 | David Roegiers | good | good | Solid on all axes. |
| 3 | Bart Desmet | good | dark | Excellent audio, recorded in a dark room. |
| 4 | Jonathan Dierckens | acceptable | dark | Q3 trails off. Dark on Q1-Q2. |
| 5 | Jos Verbist | acceptable | dark | Q1 unstable voice. Darkest respondent (15-17). |
| 6 | Lobke Devolder | acceptable | good | Q3 collapses (3.3s, quiet, erratic). |
| 7 | Lorenzo Bown | acceptable | good | Progressive fade-out across questions. |
| 8 | Tauman | poor | good | Clips too short (5-6s). Q3 very quiet (-30.5 LUFS). Hardest to hear. |
| 9 | Zacharria | poor | good | Low speech presence (43-54%). Long pauses, hesitant. |
| 10 | Maarten Lannoo | poor | dim | Too loud on all clips (> -14 LUFS). Needs normalization. |

Detailed M-Value Distribution (Tauman vs. Nele)

To understand why Tauman is the worst audio despite normal-looking summary metrics, compare the per-frame momentary loudness distributions:

Nele Q2 (gold standard) — 18.0s, 176 frames

  <-40  :  17 ░░░░░
-40 -35 :   3 ▒
-35 -30 :   5 ▓
-30 -25 :  63 █████████████████████
-25 -20 :  86 ██████████████████████████████    ← bulk of frames here
-20 -15 :   2
  >-15  :   0

>-30 LUFS (clearly audible): 86%
>-25 LUFS (strong voice):    50%

Tauman Q2 — 4.9s, 45 frames

  <-40  :   9 ░░░░░░░░░
-40 -35 :   5 ▒▒▒▒▒
-35 -30 :   1 ▓                               ← bimodal: loud bursts + silence
-30 -25 :   7 ███████
-25 -20 :  20 ████████████████████
-20 -15 :   3 ███
  >-15  :   0

>-30 LUFS (clearly audible): 67%
>-25 LUFS (strong voice):    51%

Tauman Q3 — 12.0s, 116 frames (most representative)

  <-40  :  10 ░░░░░░░░
-40 -35 :  35 ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
-35 -30 :  35 ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ← majority of frames are -30 to -40
-30 -25 :  32 ███████████████████████████
-25 -20 :   4 ███
-20 -15 :   0
  >-15  :   0

>-30 LUFS (clearly audible): 31%     ← only 1 in 3 frames is audible
>-25 LUFS (strong voice):     3%     ← almost no strong speech

Nele's distribution is concentrated in -25 to -20 LUFS (strong, clear voice). Tauman Q3's distribution is concentrated in -40 to -30 LUFS (barely audible mumbling). The integrated LUFS (-30.5 vs -24.1) does reflect this, but the speechPresenceRatio (>-40) at 91% is misleading — almost all of Tauman Q3's "speech" frames are between -40 and -30, which is the "I can technically detect audio but can't understand words" zone.
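The histograms above can be reproduced with a simple binning pass over the per-frame M values. A sketch (the bin labels and function name are illustrative):

```typescript
// Count per-frame momentary loudness (M, LUFS) values into the 5 LU buckets
// used by the histograms above.
function binMomentaryLoudness(momentaryLufs: number[]): Record<string, number> {
  const bins: Record<string, number> = {
    "<-40": 0, "-40..-35": 0, "-35..-30": 0,
    "-30..-25": 0, "-25..-20": 0, "-20..-15": 0, ">-15": 0,
  };
  for (const m of momentaryLufs) {
    if (m < -40) bins["<-40"]++;
    else if (m < -35) bins["-40..-35"]++;
    else if (m < -30) bins["-35..-30"]++;
    else if (m < -25) bins["-30..-25"]++;
    else if (m < -20) bins["-25..-20"]++;
    else if (m < -15) bins["-20..-15"]++;
    else bins[">-15"]++;
  }
  return bins;
}
```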


Implementation Recommendation

// packages/video/src/server/av-quality-classification.ts

export type AudioGrade = "good" | "acceptable" | "quiet" | "poor";
export type VideoGrade = "good" | "dim" | "dark" | "overexposed";

export type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
export type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
export type VoiceStability = "stable" | "moderate" | "unstable";
export type LoudnessConsistency = "consistent" | "variable" | "erratic";

export interface AVQualityClassification {
  audio: {
    grade: AudioGrade;
    loudness: LoudnessLevel;
    clarity: VoiceClarity;
    stability: VoiceStability;
    lra: LoudnessConsistency;
    tooShort: boolean;
  };
  video: {
    grade: VideoGrade;
  };
}

export function classifyAVQuality(
  audio: AudioQualityResult,
  video: VideoQualityResult,
  durationSeconds: number
): AVQualityClassification;

The classification function should be a pure function with no dependencies — just takes AudioQualityResult + VideoQualityResult + durationSeconds and returns AVQualityClassification. This makes it trivially testable and usable in any context (server action, API route, batch script).


Future Refinements

These thresholds are calibrated against 28 clips from one VRT production (webcam recordings, self-recorded by respondents). They should be validated against:

  1. Professional studio recordings — may need a stricter "good" range
  2. Mobile recordings — phones have different microphone characteristics
  3. Multi-speaker scenarios — not currently handled
  4. More respondents — 10 is a small sample; some thresholds may shift with more data

As more production data flows through the pipeline, the thresholds can be refined. The classification function should be easy to update — it's just constants.

Computing and storing the >-30 LUFS "clearly audible" ratio directly in the video-processor, rather than approximating it from the existing metrics, is already implemented: clearlyAudibleRatio is computed and returned by detectAudioQualityFromFile() in apps/video-processor/src/operations/audio-quality.ts (alongside speechPresenceRatio), and the AudioQualityResult type includes this field.

The remaining next step is to implement the classifyAVQuality() function in @repo/video and connect it to admin UI display and content intelligence consumers.