Last verified: 2026-03-06
Target: apps/video-processor (metrics), packages/video (classification — recommended location)
Audio & Video Quality Classification Thresholds
Date: 2026-02-26 (v2 — corrected after manual listening validation)
How to classify respondent video recordings into quality tiers using the EBU R128 audio metrics and signalstats video brightness. Thresholds are derived from analysis of 28 real VRT respondent clips (10 respondents, "De Zevende Dag" Feb 2025 production) and validated against human listening tests.
Related: video-processing-pipeline-performance.md — pipeline timing and bottleneck analysis.
How This Integrates with Content Intelligence
The content intelligence pipeline (ContentIntelligenceService) evaluates video content via LLM — it reads the transcript and scores engagement, authenticity, relevance, etc. It cannot hear or see the actual video.
Audio/video quality classification is separate and deterministic — computed directly from FFmpeg metrics, not from the LLM. This is by design:
- Faster: no API call, just arithmetic on existing metrics
- Cheaper: zero token cost
- Reproducible: same input always gives same classification
- No hallucination: thresholds are grounded in measured data
The two systems complement each other: content intelligence answers "is this a good testimonial?", AV quality answers "is this technically usable?". A video can have great content but poor audio, or perfect technical quality but irrelevant content.
Where Classification Happens
The processVideo pipeline already computes and returns audioQuality and videoQuality in its result. Classification should happen at read time (when displaying results) or at storage time (when persisting to a registry), not in the video-processor itself. The video-processor returns raw metrics; consumers apply thresholds.
Recommended approach: a pure function in @repo/video that takes AudioQualityResult, VideoQualityResult, and durationSeconds and returns classification labels. This keeps the thresholds in one place, testable, and reusable across admin UI, content intelligence page, and any future consumer.
Lesson Learned: Why speechPresenceRatio (>-40 LUFS) Alone Fails
Our first classification attempt used speechPresenceRatio (frames > -40 LUFS) as the primary voice clarity metric. This produced a false positive for Tauman — the respondent who is hardest to hear in the entire dataset.
Tauman Q1-Q2 scored: LUFS -21.6/-22.8 (normal), speech presence 93%/80% (good), stddev 5.68/5.27 (acceptable). Our v1 model classified these as "good". But Tauman is clearly the worst audio — very quiet voice, hard to understand.
Why the metrics lied
- Clips are only 5-6 seconds long — with 45-55 ebur128 frames, statistics are unreliable. A few frames of normal-level speech inflate the averages.
- -40 LUFS is too generous a threshold for "speech" — it catches barely-audible mumbling and background noise, not just clear voice. At -40 LUFS, Tauman Q2 has 80% "speech presence". At -30 LUFS ("clearly audible"), it drops to 67%.
- Integrated LUFS is a weighted average — it gives less weight to quiet parts, so a few loud frames can make a mostly-quiet clip look normal.
The fix: store and use clearlyAudibleRatio (>-30 LUFS)
The corrected model adds clearlyAudibleRatio (frames > -30 LUFS / total frames) directly to the video-processor output. This is the primary voice clarity metric — it measures the fraction of the clip where speech is actually intelligible, not just technically detectable.
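As a sketch, both ratios are simple arithmetic over the parsed per-frame momentary (M) loudness values. The function and parameter names below are illustrative, not the actual video-processor API:

```typescript
// Illustrative sketch: compute both audibility ratios from per-frame momentary
// (M) loudness values, one number per ebur128 frame. Names are hypothetical.
function audibilityRatios(momentaryLufs: number[]): {
  speechPresenceRatio: number; // fraction of frames with M > -40 LUFS (detectable)
  clearlyAudibleRatio: number; // fraction of frames with M > -30 LUFS (intelligible)
} {
  const total = momentaryLufs.length;
  if (total === 0) return { speechPresenceRatio: 0, clearlyAudibleRatio: 0 };
  const detectable = momentaryLufs.filter((m) => m > -40).length;
  const audible = momentaryLufs.filter((m) => m > -30).length;
  return {
    speechPresenceRatio: detectable / total,
    clearlyAudibleRatio: audible / total,
  };
}
```

A large gap between the two ratios is itself a signal: those frames sit in the -40 to -30 LUFS band where audio is detectable but not understandable.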
| Respondent | >-40 (old) | >-30 (new) | Human verdict |
|---|---|---|---|
| Nele Q2 | 90% | 86% | Best clip in dataset |
| Jonathan Q2 | 93% | 81% | Good |
| Bart Q2 | 85% | 72% | Good |
| Tauman Q1 | 93% | 76% | Hard to hear |
| Tauman Q2 | 80% | 67% | Very hard to hear |
| Lorenzo Q2 | 90% | 39% | Quiet, fading |
| Zacharria Q2 | 54% | 31% | Mostly pauses |
| Tauman Q3 | 91% | 31% | Inaudible |
The >-30 metric correctly ranks Tauman below the good clips. At 67-76%, Tauman Q1-Q2 now classify as adequate rather than clear, and Tauman Q3 at 31% correctly classifies as poor.
Audio Quality Classification
Metrics Used
| Metric | Source | Range | What it measures |
|---|---|---|---|
| integratedLufs | ebur128 summary | -70 to 0 LUFS | Overall perceived loudness (EBU R128 standard) |
| loudnessRange (LRA) | ebur128 summary | 0 to 30+ LU | How much loudness varies across the clip |
| speechPresenceRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -40 LUFS (detectable audio) |
| clearlyAudibleRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -30 LUFS (intelligible speech) |
| speechLoudnessStddev | ebur128 per-frame M | 0 to 10+ | Stability of voice level during speech |
clearlyAudibleRatio is the primary clarity metric. The gap between speechPresenceRatio and clearlyAudibleRatio reveals clips where audio is technically present but not understandable (the -40 to -30 LUFS "mumble zone").
Classification Axes
Axis 1: Loudness Level (integratedLufs)
| Range | Label | Rationale |
|---|---|---|
| > -14 LUFS | too-loud | Clipping or mic too close. Maarten (-11.8 to -13.5). |
| -14 to -26 LUFS | good | EBU R128 target is -23 LUFS. Normal webcam recordings. |
| -26 to -35 LUFS | quiet | Audible but needs volume boost. Tauman Q3 (-30.5), Lorenzo Q2 (-26.3). |
| < -35 LUFS | too-quiet | Barely audible, microphone issue. |
Changed from v1: the lower bound of the "good" range was tightened from -28 LUFS to -26 LUFS. Lorenzo Q2 at -26.3 is noticeably quiet and should not classify as "good".
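The bands above map directly to a threshold function. This is a sketch of the stated thresholds, not code from the repo:

```typescript
type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";

// Sketch of Axis 1: thresholds copied from the loudness table above.
function classifyLoudness(integratedLufs: number): LoudnessLevel {
  if (integratedLufs > -14) return "too-loud"; // clipping / mic too close
  if (integratedLufs >= -26) return "good"; // EBU R128 target is -23 LUFS
  if (integratedLufs >= -35) return "quiet"; // audible, needs volume boost
  return "too-quiet"; // microphone issue
}
```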
Axis 2: Voice Clarity (clearlyAudibleRatio)
This is the key axis that distinguishes Tauman from Nele. It measures the fraction of the clip where speech is at an intelligible level (M > -30 LUFS), not just technically detectable.
| Range | Label | Rationale |
|---|---|---|
| >= 0.75 | clear | Majority of clip has strong, intelligible speech. Nele Q2 (86%), Jonathan Q2 (81%). |
| 0.55 to 0.74 | adequate | Voice present but frequently drops below intelligible level. Tauman Q1 (76%), Tauman Q2 (67%). |
| 0.35 to 0.54 | faint | Less than half the clip is clearly audible. Lorenzo Q2 (39%). |
| < 0.35 | poor | Mostly inaudible. Tauman Q3 (31%), Zacharria Q2 (31%). |
This correctly classifies Tauman: Q1 at 76% is adequate (not clear), Q2 at 67% is adequate, and Q3 at 31% is poor. No duration hack needed — the metric itself catches the problem.
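The same pattern applies to the clarity axis; a sketch with the thresholds from the table above (not repo code):

```typescript
type VoiceClarity = "clear" | "adequate" | "faint" | "poor";

// Sketch of Axis 2: voice clarity from the clearlyAudibleRatio (> -30 LUFS).
function classifyClarity(clearlyAudibleRatio: number): VoiceClarity {
  if (clearlyAudibleRatio >= 0.75) return "clear";
  if (clearlyAudibleRatio >= 0.55) return "adequate";
  if (clearlyAudibleRatio >= 0.35) return "faint";
  return "poor";
}
```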
Axis 3: Voice Stability (speechLoudnessStddev)
| Range | Label | Rationale |
|---|---|---|
| < 5.0 | stable | Consistent voice level. Nele Q2 (3.03), Bart Q3 (3.55). |
| 5.0 to 7.0 | moderate | Some variation, still usable. Most clips fall here. |
| > 7.0 | unstable | Voice jumps around. Maarten Q2 (7.50), Zacharria Q3 (7.69). |
Axis 4: Loudness Range (LRA)
| Range | Label | Rationale |
|---|---|---|
| < 10 LU | consistent | Normal for speech. Most clips are 2-8 LU. |
| 10 to 15 LU | variable | Noticeable shifts. Jonathan Q3 (13.1), Lorenzo Q2 (12.3). |
| > 15 LU | erratic | Extreme variation. Zacharria Q1 (16.0), Lobke Q3 (20.0). |
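Axes 3 and 4 follow the same shape; a sketch with the thresholds from the two tables above:

```typescript
type VoiceStability = "stable" | "moderate" | "unstable";
type LoudnessConsistency = "consistent" | "variable" | "erratic";

// Axis 3: stability of voice level during speech (stddev of per-frame M).
function classifyStability(speechLoudnessStddev: number): VoiceStability {
  if (speechLoudnessStddev < 5.0) return "stable";
  if (speechLoudnessStddev <= 7.0) return "moderate";
  return "unstable";
}

// Axis 4: loudness range (LRA) from the ebur128 summary, in LU.
function classifyLra(loudnessRange: number): LoudnessConsistency {
  if (loudnessRange < 10) return "consistent";
  if (loudnessRange <= 15) return "variable";
  return "erratic";
}
```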
Overall Audio Grade
| Grade | Criteria | Color |
|---|---|---|
| good | Loudness good AND clarity clear AND stability stable or moderate AND LRA consistent | green |
| acceptable | Loudness good AND clarity clear or adequate AND no axis at worst level | amber |
| poor | Any axis at worst level (too-loud, too-quiet, poor clarity, unstable, erratic) | red |
| quiet | Loudness quiet, regardless of other axes | red |
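One way to combine the axes into the overall grade is sketched below. Two details here are assumptions, not stated rules: "poor" is checked before "quiet" (the clip table grades quiet clips that also fail another axis as poor), and faint clarity is treated as grade-poor (which matches rows like Jonathan Q3 and Lorenzo Q2 in the clip table):

```typescript
type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
type VoiceStability = "stable" | "moderate" | "unstable";
type LoudnessConsistency = "consistent" | "variable" | "erratic";
type AudioGrade = "good" | "acceptable" | "quiet" | "poor";

// Sketch of the overall grade rules; precedence is an assumption (see above).
function gradeAudio(
  loudness: LoudnessLevel,
  clarity: VoiceClarity,
  stability: VoiceStability,
  lra: LoudnessConsistency
): AudioGrade {
  const anyWorst =
    loudness === "too-loud" ||
    loudness === "too-quiet" ||
    clarity === "poor" ||
    clarity === "faint" || // assumption: faint clarity also grades as poor
    stability === "unstable" ||
    lra === "erratic";
  if (anyWorst) return "poor";
  if (loudness === "quiet") return "quiet";
  if (clarity === "clear" && lra === "consistent") return "good";
  return "acceptable";
}
```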
Video Quality Classification
Metrics Available
| Metric | Source | Range | What it measures |
|---|---|---|---|
avgBrightness | signalstats YAVG | 0 to 255 | Average luma (brightness) across sampled frames |
brightnessSamples | signalstats | count | Number of frames analyzed |
Classification: Brightness Level
| Range | Label | Color | Rationale |
|---|---|---|---|
| < 30 | dark | red | Hard to see the respondent. Jos (15-17), Bart Q2-Q3 (20-24). |
| 30 to 60 | dim | amber | Visible but not ideal. Bart Q1 (37.9), Maarten (35-41). |
| 60 to 200 | good | green | Well-lit recording. Most daytime/indoor recordings. |
| > 200 | overexposed | red | Washed out — backlit or direct light source. |
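The brightness bands translate to a one-liner classifier; again a sketch of the table above, not repo code:

```typescript
type VideoGrade = "good" | "dim" | "dark" | "overexposed";

// Sketch: classify signalstats YAVG (0-255 average luma) into the bands above.
function classifyBrightness(avgBrightness: number): VideoGrade {
  if (avgBrightness < 30) return "dark";
  if (avgBrightness < 60) return "dim";
  if (avgBrightness <= 200) return "good";
  return "overexposed";
}
```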
Classification of All 28 VRT Clips (Corrected v2)
Audio Classification
| Respondent | Clip | LUFS | LRA (LU) | Audible% | StdDev | Loudness | Clarity | Stability | Consistency | Grade |
|---|---|---|---|---|---|---|---|---|---|---|
| bart_desmet | Q1 | -23.3 | 4.3 | 0.72 | 4.25 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q2 | -23.2 | 3.6 | 0.72 | 4.44 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q3 | -24.4 | 4.6 | 0.66 | 3.55 | good | adequate | stable | consistent | acceptable |
| david_roegiers | Q1 | -21.5 | 3.5 | 0.76 | 5.12 | good | clear | moderate | consistent | good |
| david_roegiers | Q2 | -22.0 | 5.6 | 0.73 | 5.62 | good | adequate | moderate | consistent | acceptable |
| jonathan | Q1 | -24.4 | 3.3 | 0.78 | 4.14 | good | clear | stable | consistent | good |
| jonathan | Q2 | -24.0 | 4.2 | 0.81 | 3.64 | good | clear | stable | consistent | good |
| jonathan | Q3 | -25.3 | 13.1 | 0.46 | 4.77 | good | faint | stable | variable | poor |
| jos_verbist | Q1 | -17.2 | 4.1 | 0.77 | 7.44 | good | clear | unstable | consistent | poor |
| jos_verbist | Q2 | -18.6 | 3.0 | 0.77 | 5.59 | good | clear | moderate | consistent | good |
| jos_verbist | Q3 | -18.8 | 5.7 | 0.68 | 4.54 | good | adequate | stable | consistent | acceptable |
| lobke | Q1 | -21.2 | 3.9 | 0.82 | 4.39 | good | clear | stable | consistent | good |
| lobke | Q2 | -22.3 | 4.6 | 0.65 | 3.39 | good | adequate | stable | consistent | acceptable |
| lobke | Q3 | -29.2 | 20.0 | 0.22 | 3.26 | quiet | poor | stable | erratic | poor |
| lorenzo | Q1 | -23.5 | 7.3 | 0.62 | 5.69 | good | adequate | moderate | consistent | acceptable |
| lorenzo | Q2 | -26.3 | 12.3 | 0.39 | 5.30 | quiet | faint | moderate | variable | poor |
| lorenzo | Q3 | -33.5 | 9.6 | 0.17 | 3.80 | quiet | poor | stable | consistent | poor |
| maarten | Q1 | -11.8 | 2.7 | 0.73 | 6.20 | too-loud | adequate | moderate | consistent | poor |
| maarten | Q2 | -13.5 | 5.8 | 0.73 | 7.50 | too-loud | adequate | unstable | consistent | poor |
| maarten | Q3 | -13.2 | 5.0 | 0.69 | 6.98 | too-loud | adequate | moderate | consistent | poor |
| nele | Q1 | -24.0 | 6.6 | 0.67 | 4.13 | good | adequate | stable | consistent | acceptable |
| nele | Q2 | -24.1 | 2.0 | 0.86 | 3.03 | good | clear | stable | consistent | good |
| tauman | Q1 | -21.6 | 3.6 | 0.76 | 5.68 | good | adequate | moderate | consistent | acceptable |
| tauman | Q2 | -22.8 | 3.2 | 0.67 | 5.27 | good | adequate | moderate | consistent | acceptable |
| tauman | Q3 | -30.5 | 4.2 | 0.31 | 4.42 | quiet | poor | stable | consistent | poor |
| zacharria | Q1 | -24.5 | 16.0 | 0.31 | 6.58 | good | poor | moderate | erratic | poor |
| zacharria | Q2 | -23.0 | 7.2 | 0.31 | 6.86 | good | poor | moderate | consistent | poor |
| zacharria | Q3 | -22.0 | 6.9 | 0.17 | 7.69 | good | poor | unstable | consistent | poor |
Video Classification (Brightness)
| Respondent | Q1 | Q2 | Q3 |
|---|---|---|---|
| bart_desmet | dim (37.9) | dark (19.9) | dark (24.3) |
| david_roegiers | good (155.6) | good (133.7) | — |
| jonathan_dierckens | dark (24.2) | dark (25.0) | dim (67.1) |
| jos_verbist | dark (17.3) | dark (16.8) | dark (15.5) |
| lobke_devolder | good (123.8) | good (115.0) | good (118.7) |
| lorenzo_bown | good (111.9) | good (110.0) | good (120.2) |
| maarten_lannoo | dim (40.8) | dim (37.2) | dim (35.4) |
| nele_allemeersch | good (128.1) | good (122.7) | — |
| tauman | good (128.9) | good (123.7) | good (123.4) |
| zacharria | good (152.9) | good (161.3) | good (142.9) |
Per-Respondent Summary (Corrected v2)
| # | Respondent | Audio | Video | Key issue |
|---|---|---|---|---|
| 1 | Nele Allemeersch | good | good | Gold standard. Nele Q2 is the reference clip. |
| 2 | David Roegiers | good | good | Solid on all axes. |
| 3 | Bart Desmet | good | dark | Excellent audio, recorded in a dark room. |
| 4 | Jonathan Dierckens | acceptable | dark | Q3 trails off. Dark on Q1-Q2. |
| 5 | Jos Verbist | acceptable | dark | Q1 unstable voice. Darkest respondent (15-17). |
| 6 | Lobke Devolder | acceptable | good | Q3 collapses (3.3s, quiet, erratic). |
| 7 | Lorenzo Bown | acceptable | good | Progressive fade-out across questions. |
| 8 | Tauman | poor | good | Q1-Q2 clips short (5-6s); Q3 very quiet (-30.5 LUFS). Hardest respondent to hear. |
| 9 | Zacharria | poor | good | Clearly audible only 17-31% of frames. Long pauses, hesitant. |
| 10 | Maarten Lannoo | poor | dim | Too loud on all clips (> -14 LUFS). Needs normalization. |
Detailed M-Value Distribution (Tauman vs. Nele)
To understand why Tauman is the worst audio despite normal-looking summary metrics, compare the per-frame momentary loudness distributions:
Nele Q2 (gold standard) — 18.0s, 176 frames
```
<-40      :  17 ░░░░░
-40 to -35:   3 ▒
-35 to -30:   5 ▓
-30 to -25:  63 █████████████████████
-25 to -20:  86 ██████████████████████████████ ← bulk of frames here
-20 to -15:   2
>-15      :   0
```
>-30 LUFS (clearly audible): 86%
>-25 LUFS (strong voice): 50%
Tauman Q2 — 4.9s, 45 frames
```
<-40      :   9 ░░░░░░░░░
-40 to -35:   5 ▒▒▒▒▒
-35 to -30:   1 ▓ ← bimodal: loud bursts + silence
-30 to -25:   7 ███████
-25 to -20:  20 ████████████████████
-20 to -15:   3 ███
>-15      :   0
```
>-30 LUFS (clearly audible): 67%
>-25 LUFS (strong voice): 51%
Tauman Q3 — 12.0s, 116 frames (most representative)
```
<-40      :  10 ░░░░░░░░
-40 to -35:  35 ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
-35 to -30:  35 ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ← majority of frames are -30 to -40
-30 to -25:  32 ███████████████████████████
-25 to -20:   4 ███
-20 to -15:   0
>-15      :   0
```
>-30 LUFS (clearly audible): 31% ← only 1 in 3 frames is audible
>-25 LUFS (strong voice): 3% ← almost no strong speech
Nele's distribution is concentrated in -25 to -20 LUFS (strong, clear voice). Tauman Q3's distribution is concentrated in -40 to -30 LUFS (barely audible mumbling). The integrated LUFS (-30.5 vs -24.1) does reflect this, but the speechPresenceRatio (>-40) at 91% is misleading — almost all of Tauman Q3's "speech" frames are between -40 and -30, which is the "I can technically detect audio but can't understand words" zone.
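For reference, the histograms above can be reproduced by bucketing the parsed M values into 5 LU bins; the function name below is illustrative, and the bin edges match the charts:

```typescript
// Bucket per-frame momentary loudness values into the 5 LU bins shown above.
function lufsHistogram(frames: number[]): Record<string, number> {
  const bins: Record<string, number> = {
    "<-40": 0, "-40 to -35": 0, "-35 to -30": 0,
    "-30 to -25": 0, "-25 to -20": 0, "-20 to -15": 0, ">-15": 0,
  };
  for (const m of frames) {
    if (m < -40) bins["<-40"]++;
    else if (m < -35) bins["-40 to -35"]++;
    else if (m < -30) bins["-35 to -30"]++;
    else if (m < -25) bins["-30 to -25"]++;
    else if (m < -20) bins["-25 to -20"]++;
    else if (m < -15) bins["-20 to -15"]++;
    else bins[">-15"]++;
  }
  return bins;
}
```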
Implementation Recommendation
```typescript
// packages/video/src/server/av-quality-classification.ts

export type AudioGrade = "good" | "acceptable" | "quiet" | "poor";
export type VideoGrade = "good" | "dim" | "dark" | "overexposed";

export type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
export type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
export type VoiceStability = "stable" | "moderate" | "unstable";
export type LoudnessConsistency = "consistent" | "variable" | "erratic";

export interface AVQualityClassification {
  audio: {
    grade: AudioGrade;
    loudness: LoudnessLevel;
    clarity: VoiceClarity;
    stability: VoiceStability;
    lra: LoudnessConsistency;
    tooShort: boolean;
  };
  video: {
    grade: VideoGrade;
  };
}

export function classifyAVQuality(
  audio: AudioQualityResult,
  video: VideoQualityResult,
  durationSeconds: number
): AVQualityClassification;
```
The classification function should be a pure function with no dependencies — just takes AudioQualityResult + VideoQualityResult + durationSeconds and returns AVQualityClassification. This makes it trivially testable and usable in any context (server action, API route, batch script).
Future Refinements
These thresholds are calibrated against 28 clips from one VRT production (webcam recordings, self-recorded by respondents). They should be validated against:
- Professional studio recordings — may need a stricter "good" range
- Mobile recordings — phones have different microphone characteristics
- Multi-speaker scenarios — not currently handled
- More respondents — 10 is a small sample; some thresholds may shift with more data
As more production data flows through the pipeline, the thresholds can be refined. The classification function should be easy to update — it's just constants.
The >-30 LUFS "clearly audible" ratio is already computed and stored directly in the video-processor: clearlyAudibleRatio is returned by detectAudioQualityFromFile() in apps/video-processor/src/operations/audio-quality.ts (alongside speechPresenceRatio), and the AudioQualityResult type includes this field.
The remaining next step is to implement the classifyAVQuality() function in @repo/video and connect it to admin UI display and content intelligence consumers.