
docs/architecture/av-quality-classification-thresholds.md

Last verified: 2026-03-06
Target: apps/video-processor (metrics), packages/video (classification — recommended location)

Audio & Video Quality Classification Thresholds

Date: 2026-02-26 (v2 — corrected after manual listening validation)

How to classify respondent video recordings into quality tiers using the EBU R128 audio metrics and signalstats video brightness. Thresholds are derived from analysis of 28 real VRT respondent clips (10 respondents, "De Zevende Dag" Feb 2025 production) and validated against human listening tests.

Related: video-processing-pipeline-performance.md — pipeline timing and bottleneck analysis.


How This Integrates with Content Intelligence

The content intelligence pipeline (ContentIntelligenceService) evaluates video content via LLM — it reads the transcript and scores engagement, authenticity, relevance, etc. It cannot hear or see the actual video.

Audio/video quality classification is separate and deterministic — computed directly from FFmpeg metrics, not from the LLM. This is by design:

  • Faster: no API call, just arithmetic on existing metrics
  • Cheaper: zero token cost
  • Reproducible: same input always gives same classification
  • No hallucination: thresholds are grounded in measured data

The two systems complement each other: content intelligence answers "is this a good testimonial?", AV quality answers "is this technically usable?". A video can have great content but poor audio, or perfect technical quality but irrelevant content.

Where Classification Happens

The processVideo pipeline already computes and returns audioQuality and videoQuality in its result. Classification should happen at read time (when displaying results) or at storage time (when persisting to a registry), not in the video-processor itself. The video-processor returns raw metrics; consumers apply thresholds.

Recommended approach: a pure function in @repo/video that takes AudioQualityResult, VideoQualityResult, and durationSeconds and returns classification labels. This keeps the thresholds in one place, testable, and reusable across admin UI, content intelligence page, and any future consumer.


Lesson Learned: Why speechPresenceRatio (>-40 LUFS) Alone Fails

Our first classification attempt used speechPresenceRatio (frames > -40 LUFS) as the primary voice clarity metric. This produced a false positive for Tauman — the respondent who is hardest to hear in the entire dataset.

Tauman Q1-Q2 scored: LUFS -21.6/-22.8 (normal), speech presence 93%/80% (good), stddev 5.68/5.27 (acceptable). Our v1 model classified these as "good". But Tauman is clearly the worst audio — very quiet voice, hard to understand.

Why the metrics lied

  1. Clips are only 5-6 seconds long — with 45-55 ebur128 frames, statistics are unreliable. A few frames of normal-level speech inflate the averages.
  2. -40 LUFS is too generous a threshold for "speech" — it catches barely-audible mumbling and background noise, not just clear voice. At -40 LUFS, Tauman Q2 has 80% "speech presence". At -30 LUFS ("clearly audible"), it drops to 67%.
  3. Integrated LUFS is a weighted average — it gives less weight to quiet parts, so a few loud frames can make a mostly-quiet clip look normal.

The fix: store and use clearlyAudibleRatio (>-30 LUFS)

The corrected model adds clearlyAudibleRatio (frames > -30 LUFS / total frames) directly to the video-processor output. This is the primary voice clarity metric — it measures the fraction of the clip where speech is actually intelligible, not just technically detectable.

| Respondent | >-40 (old) | >-30 (new) | Human verdict |
| --- | --- | --- | --- |
| Nele Q2 | 90% | 86% | Best clip in dataset |
| Jonathan Q2 | 93% | 81% | Good |
| Bart Q2 | 85% | 72% | Good |
| Tauman Q1 | 93% | 76% | Hard to hear |
| Tauman Q2 | 80% | 67% | Very hard to hear |
| Lorenzo Q2 | 90% | 39% | Quiet, fading |
| Zacharria Q2 | 54% | 31% | Mostly pauses |
| Tauman Q3 | 91% | 31% | Inaudible |

The >-30 metric correctly ranks Tauman below the good clips. At 67-76%, Tauman Q1-Q2 now classify as adequate rather than clear, and Tauman Q3 at 31% correctly classifies as poor.
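Computing the ratio is a one-liner once per-frame momentary loudness is available. A minimal sketch, assuming the M values (in LUFS) have already been parsed from ffmpeg's ebur128 per-frame output into a number array (the function name is illustrative, not the actual video-processor implementation):

```typescript
// Fraction of frames where momentary loudness exceeds -30 LUFS, i.e. speech
// at an intelligible level rather than merely detectable (> -40 LUFS).
function clearlyAudibleRatio(momentaryLufs: number[]): number {
  if (momentaryLufs.length === 0) return 0;
  const audible = momentaryLufs.filter((m) => m > -30).length;
  return audible / momentaryLufs.length;
}
```

Changing the -30 threshold to -40 recovers the old speechPresenceRatio, which is why the two metrics can be compared frame-for-frame.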


Audio Quality Classification

Metrics Used

| Metric | Source | Range | What it measures |
| --- | --- | --- | --- |
| integratedLufs | ebur128 summary | -70 to 0 | Overall perceived loudness (EBU R128 standard) |
| loudnessRange (LRA) | ebur128 summary | 0 to 30+ LU | How much loudness varies across the clip |
| speechPresenceRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -40 LUFS (detectable audio) |
| clearlyAudibleRatio | ebur128 per-frame M | 0.0 to 1.0 | Fraction of clip where M > -30 LUFS (intelligible speech) |
| speechLoudnessStddev | ebur128 per-frame M | 0 to 10+ | Stability of voice level during speech |

clearlyAudibleRatio is the primary clarity metric. The gap between speechPresenceRatio and clearlyAudibleRatio reveals clips where audio is technically present but not understandable (the -40 to -30 LUFS "mumble zone").
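That gap can be surfaced as a derived number. A hedged sketch — mumbleZoneRatio is an illustrative name, not an existing metric in the pipeline:

```typescript
// Fraction of the clip that is detectable (> -40 LUFS) but not intelligible
// (<= -30 LUFS): the "mumble zone" described above.
function mumbleZoneRatio(
  speechPresenceRatio: number, // fraction of frames > -40 LUFS
  clearlyAudibleRatio: number, // fraction of frames > -30 LUFS
): number {
  return Math.max(0, speechPresenceRatio - clearlyAudibleRatio);
}
```

For Tauman Q3 (91% detectable, 31% intelligible) this yields 0.60: most of the clip is mumble.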

Classification Axes

Axis 1: Loudness Level (integratedLufs)

| Range | Label | Rationale |
| --- | --- | --- |
| > -14 LUFS | too-loud | Clipping or mic too close. Maarten (-11.8 to -13.5). |
| -14 to -26 LUFS | good | EBU R128 target is -23 LUFS. Normal webcam recordings. |
| -26 to -35 LUFS | quiet | Audible but needs volume boost. Tauman Q3 (-30.5), Lorenzo Q2 (-26.3). |
| < -35 LUFS | too-quiet | Barely audible, microphone issue. |

Changed from v1: the lower bound of the good range was tightened from -28 to -26 LUFS. Lorenzo Q2 at -26.3 is noticeably quiet and should not classify as "good".

Axis 2: Voice Clarity (clearlyAudibleRatio)

This is the key axis that distinguishes Tauman from Nele. It measures the fraction of the clip where speech is at an intelligible level (M > -30 LUFS), not just technically detectable.

| Range | Label | Rationale |
| --- | --- | --- |
| >= 0.75 | clear | Majority of clip has strong, intelligible speech. Nele Q2 (86%), Jonathan Q2 (81%). |
| 0.55 to 0.74 | adequate | Voice present but frequently drops below intelligible level. Tauman Q1 (76%), Tauman Q2 (67%). |
| 0.35 to 0.54 | faint | Less than half the clip is clearly audible. Lorenzo Q2 (39%). |
| < 0.35 | poor | Mostly inaudible. Tauman Q3 (31%), Zacharria Q2 (31%). |

This correctly classifies Tauman: Q1 at 76% is adequate (not clear), Q2 at 67% is adequate, and Q3 at 31% is poor. No duration hack needed — the metric itself catches the problem.

Axis 3: Voice Stability (speechLoudnessStddev)

| Range | Label | Rationale |
| --- | --- | --- |
| < 5.0 | stable | Consistent voice level. Nele Q2 (3.03), Bart Q3 (3.55). |
| 5.0 to 7.0 | moderate | Some variation, still usable. Most clips fall here. |
| > 7.0 | unstable | Voice jumps around. Maarten Q2 (7.50), Zacharria Q3 (7.69). |

Axis 4: Loudness Range (LRA)

| Range | Label | Rationale |
| --- | --- | --- |
| < 10 LU | consistent | Normal for speech. Most clips are 2-8 LU. |
| 10 to 15 LU | variable | Noticeable shifts. Jonathan Q3 (13.1), Lorenzo Q2 (12.3). |
| > 15 LU | erratic | Extreme variation. Zacharria Q1 (16.0), Lobke Q3 (20.0). |

Overall Audio Grade

| Grade | Criteria | Color |
| --- | --- | --- |
| good | Loudness good AND clarity clear AND stability stable or moderate AND LRA consistent | green |
| acceptable | Loudness good AND clarity clear or adequate AND no axis at a failing level | amber |
| poor | Any axis at a failing level: loudness too-loud or too-quiet, clarity faint or poor, stability unstable, or LRA erratic | red |
| quiet | Loudness quiet with no other axis at a failing level | red |

Failing axes take precedence over the quiet bucket: a quiet clip with faint or poor clarity grades poor (e.g. Lorenzo Q2, Tauman Q3), which matches the per-clip table below.
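As a concrete reading of the four axis tables and the grade roll-up, here is a sketch in plain TypeScript. Function and type names are illustrative, not the actual @repo/video API; the roll-up checks failing axes before the quiet bucket, matching the corrected per-clip classifications below:

```typescript
type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
type VoiceStability = "stable" | "moderate" | "unstable";
type LoudnessConsistency = "consistent" | "variable" | "erratic";
type AudioGrade = "good" | "acceptable" | "quiet" | "poor";

// Axis 1: integrated loudness (LUFS).
function classifyLoudness(integratedLufs: number): LoudnessLevel {
  if (integratedLufs > -14) return "too-loud";
  if (integratedLufs >= -26) return "good";
  if (integratedLufs >= -35) return "quiet";
  return "too-quiet";
}

// Axis 2: voice clarity (fraction of frames > -30 LUFS).
function classifyClarity(clearlyAudibleRatio: number): VoiceClarity {
  if (clearlyAudibleRatio >= 0.75) return "clear";
  if (clearlyAudibleRatio >= 0.55) return "adequate";
  if (clearlyAudibleRatio >= 0.35) return "faint";
  return "poor";
}

// Axis 3: voice stability (stddev of M during speech).
function classifyStability(speechLoudnessStddev: number): VoiceStability {
  if (speechLoudnessStddev < 5.0) return "stable";
  if (speechLoudnessStddev <= 7.0) return "moderate";
  return "unstable";
}

// Axis 4: loudness range (LU).
function classifyLra(loudnessRange: number): LoudnessConsistency {
  if (loudnessRange < 10) return "consistent";
  if (loudnessRange <= 15) return "variable";
  return "erratic";
}

// Roll-up: failing axes force "poor" before "quiet" is considered.
function gradeAudio(
  loudness: LoudnessLevel,
  clarity: VoiceClarity,
  stability: VoiceStability,
  lra: LoudnessConsistency,
): AudioGrade {
  if (loudness === "too-loud" || loudness === "too-quiet") return "poor";
  if (clarity === "faint" || clarity === "poor") return "poor";
  if (stability === "unstable" || lra === "erratic") return "poor";
  if (loudness === "quiet") return "quiet";
  if (clarity === "clear" && lra === "consistent") return "good";
  return "acceptable";
}
```

Boundaries are applied literally here, so a rounded value sitting exactly on a cut-off (e.g. an audible ratio of 0.76 against the 0.75 clear boundary) lands on the generous side.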

Video Quality Classification

Metrics Available

| Metric | Source | Range | What it measures |
| --- | --- | --- | --- |
| avgBrightness | signalstats YAVG | 0 to 255 | Average luma (brightness) across sampled frames |
| brightnessSamples | signalstats | count | Number of frames analyzed |

Classification: Brightness Level

| Range | Label | Color | Rationale |
| --- | --- | --- | --- |
| < 30 | dark | red | Hard to see the respondent. Jos (15-17), Bart Q2-Q3 (20-24). |
| 30 to 60 | dim | amber | Visible but not ideal. Bart Q1 (37.9), Maarten (35-41). |
| 60 to 200 | good | green | Well-lit recording. Most daytime/indoor recordings. |
| > 200 | overexposed | red | Washed out — backlit or direct light source. |
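The brightness tiers translate directly into a small pure function. A sketch (name illustrative), taking the signalstats YAVG average from the video-processor output:

```typescript
type VideoGrade = "good" | "dim" | "dark" | "overexposed";

// avgBrightness is signalstats YAVG averaged over sampled frames (0-255).
function classifyBrightness(avgBrightness: number): VideoGrade {
  if (avgBrightness < 30) return "dark";
  if (avgBrightness < 60) return "dim";
  if (avgBrightness <= 200) return "good";
  return "overexposed";
}
```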

Classification of All 28 VRT Clips (Corrected v2)

Audio Classification

| Respondent | Clip | LUFS | LRA | Audible (>-30) | StdDev | Loudness | Clarity | Stability | Consistency | Grade |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bart_desmet | Q1 | -23.3 | 4.3 | 0.72 | 4.25 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q2 | -23.2 | 3.6 | 0.72 | 4.44 | good | adequate | stable | consistent | acceptable |
| bart_desmet | Q3 | -24.4 | 4.6 | 0.66 | 3.55 | good | adequate | stable | consistent | acceptable |
| david_roegiers | Q1 | -21.5 | 3.5 | 0.76 | 5.12 | good | clear | moderate | consistent | good |
| david_roegiers | Q2 | -22.0 | 5.6 | 0.73 | 5.62 | good | adequate | moderate | consistent | acceptable |
| jonathan | Q1 | -24.4 | 3.3 | 0.78 | 4.14 | good | clear | stable | consistent | good |
| jonathan | Q2 | -24.0 | 4.2 | 0.81 | 3.64 | good | clear | stable | consistent | good |
| jonathan | Q3 | -25.3 | 13.1 | 0.46 | 4.77 | good | faint | stable | variable | poor |
| jos_verbist | Q1 | -17.2 | 4.1 | 0.77 | 7.44 | good | clear | unstable | consistent | poor |
| jos_verbist | Q2 | -18.6 | 3.0 | 0.77 | 5.59 | good | clear | moderate | consistent | good |
| jos_verbist | Q3 | -18.8 | 5.7 | 0.68 | 4.54 | good | adequate | stable | consistent | acceptable |
| lobke | Q1 | -21.2 | 3.9 | 0.82 | 4.39 | good | clear | stable | consistent | good |
| lobke | Q2 | -22.3 | 4.6 | 0.65 | 3.39 | good | adequate | stable | consistent | acceptable |
| lobke | Q3 | -29.2 | 20.0 | 0.22 | 3.26 | quiet | poor | stable | erratic | poor |
| lorenzo | Q1 | -23.5 | 7.3 | 0.62 | 5.69 | good | adequate | moderate | consistent | acceptable |
| lorenzo | Q2 | -26.3 | 12.3 | 0.39 | 5.30 | quiet | faint | moderate | variable | poor |
| lorenzo | Q3 | -33.5 | 9.6 | 0.17 | 3.80 | quiet | poor | stable | consistent | poor |
| maarten | Q1 | -11.8 | 2.7 | 0.73 | 6.20 | too-loud | adequate | moderate | consistent | poor |
| maarten | Q2 | -13.5 | 5.8 | 0.73 | 7.50 | too-loud | adequate | unstable | consistent | poor |
| maarten | Q3 | -13.2 | 5.0 | 0.69 | 6.98 | too-loud | adequate | moderate | consistent | poor |
| nele | Q1 | -24.0 | 6.6 | 0.67 | 4.13 | good | adequate | stable | consistent | acceptable |
| nele | Q2 | -24.1 | 2.0 | 0.86 | 3.03 | good | clear | stable | consistent | good |
| tauman | Q1 | -21.6 | 3.6 | 0.76 | 5.68 | good | adequate | moderate | consistent | acceptable |
| tauman | Q2 | -22.8 | 3.2 | 0.67 | 5.27 | good | adequate | moderate | consistent | acceptable |
| tauman | Q3 | -30.5 | 4.2 | 0.31 | 4.42 | quiet | poor | stable | consistent | poor |
| zacharria | Q1 | -24.5 | 16.0 | 0.31 | 6.58 | good | poor | moderate | erratic | poor |
| zacharria | Q2 | -23.0 | 7.2 | 0.31 | 6.86 | good | poor | moderate | consistent | poor |
| zacharria | Q3 | -22.0 | 6.9 | 0.17 | 7.69 | good | poor | unstable | consistent | poor |

Video Classification (Brightness)

| Respondent | Q1 | Q2 | Q3 |
| --- | --- | --- | --- |
| bart_desmet | dim (37.9) | dark (19.9) | dark (24.3) |
| david_roegiers | good (155.6) | good (133.7) | |
| jonathan_dierckens | dark (24.2) | dark (25.0) | dim (67.1) |
| jos_verbist | dark (17.3) | dark (16.8) | dark (15.5) |
| lobke_devolder | good (123.8) | good (115.0) | good (118.7) |
| lorenzo_bown | good (111.9) | good (110.0) | good (120.2) |
| maarten_lannoo | dim (40.8) | dim (37.2) | dim (35.4) |
| nele_allemeersch | good (128.1) | good (122.7) | |
| tauman | good (128.9) | good (123.7) | good (123.4) |
| zacharria | good (152.9) | good (161.3) | good (142.9) |

Per-Respondent Summary (Corrected v2)

| # | Respondent | Audio | Video | Key issue |
| --- | --- | --- | --- | --- |
| 1 | Nele Allemeersch | good | good | Gold standard. Nele Q2 is the reference clip. |
| 2 | David Roegiers | good | good | Solid on all axes. |
| 3 | Bart Desmet | good | dark | Excellent audio, recorded in a dark room. |
| 4 | Jonathan Dierckens | acceptable | dark | Q3 trails off. Dark on Q1-Q2. |
| 5 | Jos Verbist | acceptable | dark | Q1 unstable voice. Darkest respondent (15-17). |
| 6 | Lobke Devolder | acceptable | good | Q3 collapses (3.3s, quiet, erratic). |
| 7 | Lorenzo Bown | acceptable | good | Progressive fade-out across questions. |
| 8 | Tauman | poor | good | Clips too short (5-6s). Q3 very quiet (-30.5 LUFS). Hardest to hear. |
| 9 | Zacharria | poor | good | Low speech presence (43-54%). Long pauses, hesitant. |
| 10 | Maarten Lannoo | poor | dim | Too loud on all clips (> -14 LUFS). Needs normalization. |

Detailed M-Value Distribution (Tauman vs. Nele)

To understand why Tauman is the worst audio despite normal-looking summary metrics, compare the per-frame momentary loudness distributions:

Nele Q2 (gold standard) — 18.0s, 176 frames

  <-40  :  17 ░░░░░
-40 -35 :   3 ▒
-35 -30 :   5 ▓
-30 -25 :  63 █████████████████████
-25 -20 :  86 ██████████████████████████████    ← bulk of frames here
-20 -15 :   2
  >-15  :   0

>-30 LUFS (clearly audible): 86%
>-25 LUFS (strong voice):    50%

Tauman Q2 — 4.9s, 45 frames

  <-40  :   9 ░░░░░░░░░
-40 -35 :   5 ▒▒▒▒▒
-35 -30 :   1 ▓                               ← bimodal: loud bursts + silence
-30 -25 :   7 ███████
-25 -20 :  20 ████████████████████
-20 -15 :   3 ███
  >-15  :   0

>-30 LUFS (clearly audible): 67%
>-25 LUFS (strong voice):    51%

Tauman Q3 — 12.0s, 116 frames (most representative)

  <-40  :  10 ░░░░░░░░
-40 -35 :  35 ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
-35 -30 :  35 ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ← majority of frames are -30 to -40
-30 -25 :  32 ███████████████████████████
-25 -20 :   4 ███
-20 -15 :   0
  >-15  :   0

>-30 LUFS (clearly audible): 31%     ← only 1 in 3 frames is audible
>-25 LUFS (strong voice):     3%     ← almost no strong speech

Nele's distribution is concentrated in -25 to -20 LUFS (strong, clear voice). Tauman Q3's distribution is concentrated in -40 to -30 LUFS (barely audible mumbling). The integrated LUFS (-30.5 vs -24.1) does reflect this, but the speechPresenceRatio (>-40) at 91% is misleading — almost all of Tauman Q3's "speech" frames are between -40 and -30, which is the "I can technically detect audio but can't understand words" zone.
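The histograms above can be reproduced with a simple binning pass over the per-frame M values. A sketch (the bin labels and function name are illustrative):

```typescript
// Count per-frame momentary loudness (M, LUFS) values into the 5 LU buckets
// used by the histograms above.
function binMomentaryLoudness(momentaryLufs: number[]): Record<string, number> {
  const bins: Record<string, number> = {
    "<-40": 0, "-40..-35": 0, "-35..-30": 0,
    "-30..-25": 0, "-25..-20": 0, "-20..-15": 0, ">-15": 0,
  };
  for (const m of momentaryLufs) {
    if (m < -40) bins["<-40"]++;
    else if (m < -35) bins["-40..-35"]++;
    else if (m < -30) bins["-35..-30"]++;
    else if (m < -25) bins["-30..-25"]++;
    else if (m < -20) bins["-25..-20"]++;
    else if (m < -15) bins["-20..-15"]++;
    else bins[">-15"]++;
  }
  return bins;
}
```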


Implementation Recommendation

// packages/video/src/server/av-quality-classification.ts

export type AudioGrade = "good" | "acceptable" | "quiet" | "poor";
export type VideoGrade = "good" | "dim" | "dark" | "overexposed";

export type LoudnessLevel = "too-loud" | "good" | "quiet" | "too-quiet";
export type VoiceClarity = "clear" | "adequate" | "faint" | "poor";
export type VoiceStability = "stable" | "moderate" | "unstable";
export type LoudnessConsistency = "consistent" | "variable" | "erratic";

export interface AVQualityClassification {
  audio: {
    grade: AudioGrade;
    loudness: LoudnessLevel;
    clarity: VoiceClarity;
    stability: VoiceStability;
    lra: LoudnessConsistency;
    tooShort: boolean;
  };
  video: {
    grade: VideoGrade;
  };
}

export function classifyAVQuality(
  audio: AudioQualityResult,
  video: VideoQualityResult,
  durationSeconds: number
): AVQualityClassification;

The classification function should be a pure function with no dependencies — just takes AudioQualityResult + VideoQualityResult + durationSeconds and returns AVQualityClassification. This makes it trivially testable and usable in any context (server action, API route, batch script).


Future Refinements

These thresholds are calibrated against 28 clips from one VRT production (webcam recordings, self-recorded by respondents). They should be validated against:

  1. Professional studio recordings — may need a stricter "good" range
  2. Mobile recordings — phones have different microphone characteristics
  3. Multi-speaker scenarios — not currently handled
  4. More respondents — 10 is a small sample; some thresholds may shift with more data

As more production data flows through the pipeline, the thresholds can be refined. The classification function should be easy to update — it's just constants.

Computing and storing the >-30 LUFS "clearly audible" ratio directly in the video-processor, rather than approximating it from the existing metrics, is already implemented: clearlyAudibleRatio is computed and returned by detectAudioQualityFromFile() in apps/video-processor/src/operations/audio-quality.ts (alongside speechPresenceRatio), and the AudioQualityResult type includes this field.

The remaining next step is to implement the classifyAVQuality() function in @repo/video and connect it to admin UI display and content intelligence consumers.