Four panels rendered for every audio file. Waveform and spectrogram are industry standard; band energy and RMS derivative are our additions for LED mapping.
Raw audio samples (white) with optional RMS energy overlay (yellow, toggle with E). RMS is the root-mean-square of the waveform in each frame — smoothed loudness over time, scaled to match waveform amplitude.
Origin: Waveform display dates to oscilloscopes in the 1940s. RMS as a power measure dates to 19th-century electrical engineering; standard in audio since VU meters in the 1930s. Every DAW has both.
Waveform shows transient attacks, silence, and macro structure. RMS reveals energy trends the raw waveform hides — our research found that derivatives of RMS matter more than absolute values (the climax brightens 58x faster than the build, despite identical static RMS).
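Frame-wise RMS is a few lines of numpy. A minimal sketch; the frame and hop sizes here are illustrative, not the viewer's actual settings:

```python
import numpy as np

def frame_rms(samples, frame_len=2048, hop=512):
    """Per-frame RMS energy: sqrt(mean(x^2)) over a sliding window."""
    n_frames = 1 + max(0, len(samples) - frame_len) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = samples[i * hop : i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

# A full-scale sine has RMS ~ 1/sqrt(2), regardless of frequency
t = np.arange(44100) / 44100
sine = np.sin(2 * np.pi * 440 * t)
print(round(float(frame_rms(sine).mean()), 3))  # ~ 0.707
```

`librosa.feature.rms` computes the same quantity vectorized; the loop above just makes the definition explicit.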
Short-time Fourier Transform (STFT) converted to mel scale and displayed as a heatmap. Time on x-axis, frequency on y-axis (low=bottom, high=top), color=loudness.
Origin: The mel scale comes from Stevens, Volkmann & Newman (1937) — psychoacoustic research showing humans perceive pitch logarithmically (200Hz→400Hz sounds the same as 400Hz→800Hz). The spectrogram (STFT) dates to Gabor (1946). Mel spectrograms became standard input for audio ML in the 1980s.
You can see bass hits (bright blobs at bottom), vocals (middle bands), hi-hats (top). Harmonic content = horizontal lines. Percussive content = vertical lines — this is why HPSS works (median filtering by orientation).
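The mel mapping itself is one formula. A sketch of the HTK-style variant (whether the viewer uses this exact variant or the Slaney one is an assumption):

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: roughly linear below ~1 kHz, log-like above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Equal 1 kHz steps span progressively fewer mels as frequency rises:
# the scale compresses high frequencies, mirroring pitch perception.
print(np.diff(hz_to_mel([1000, 2000, 3000, 4000])).round(1))
```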
The mel spectrogram collapsed into 5 bands — Sub-bass (20–80Hz), Bass (80–250Hz), Mids (250–2kHz), High-mids (2–6kHz), Treble (6–8kHz) — each plotted as a line over time.
Origin: Multi-band meters from mixing engineering. Band boundaries follow critical band theory (Fletcher, 1940s) and PA crossover points. “Bass energy over time” is the foundation of almost every audio-reactive LED system (WLED-SR’s entire beat detection = threshold on the bass bin).
Shows which frequency range dominates at each moment. A bass drop = Sub-bass/Bass spike. A cymbal crash = treble spike.
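Collapsing a spectrum into these 5 bands is just masked sums. A sketch using the band edges from the text, applied to a raw FFT power spectrum for simplicity (the panel collapses mel bins, but the idea is the same):

```python
import numpy as np

BANDS = {"sub_bass": (20, 80), "bass": (80, 250), "mids": (250, 2000),
         "high_mids": (2000, 6000), "treble": (6000, 8000)}

def band_energies(power_spec, freqs):
    """Sum one frame's power spectrum into the 5 named bands."""
    return {name: float(power_spec[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in BANDS.items()}

# Synthetic frame: all energy near 100 Hz should land in "bass"
freqs = np.fft.rfftfreq(2048, d=1 / 44100)
spec = np.zeros_like(freqs)
spec[np.argmin(np.abs(freqs - 100))] = 1.0
print(band_energies(spec, freqs)["bass"])  # 1.0
```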
Rate-of-change of loudness (dRMS/dt). Red = getting louder, blue = getting quieter. Our most validated finding: a build and its climax can have identical RMS, but the climax brightens 58x faster.
The signal that distinguishes builds from drops. Derivatives > absolutes.
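The panel reduces to a finite difference of the RMS curve. A sketch (the 0.1 s hop is illustrative):

```python
import numpy as np

def rms_derivative(rms, hop_seconds):
    """Finite-difference dRMS/dt; positive = getting louder."""
    return np.gradient(rms, hop_seconds)

# A linear loudness ramp has a constant positive derivative
rms = np.linspace(0.1, 0.6, 6)        # 6 frames, hop = 0.1 s
d = rms_derivative(rms, 0.1)
print(d)  # [1. 1. 1. 1. 1. 1.]
```

`np.gradient` uses central differences in the interior, which smooths single-frame jitter slightly; a real-time version would use a trailing difference instead.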
Two novelty curves showing where the “character of the music” changes, computed via Foote’s (2000) checkerboard kernel on a self-similarity matrix. Peaks = section boundaries.
MFCC novelty (coral) detects timbral changes — when the texture, instrument mix, or sonic character shifts. Chroma novelty (teal) detects harmonic changes — key changes, chord progressions, new melodic content.
Origin: Foote, “Automatic Audio Segmentation Using a Measure of Audio Novelty” (2000). Computes a self-similarity matrix from feature vectors, then slides a Gaussian-tapered checkerboard kernel along the diagonal. When past frames are similar to each other but dissimilar to future frames, the kernel produces a peak.
The most general section-boundary detector. Doesn’t label what changed — just that something did. Two feature sets let you see if the change is timbral, harmonic, or both.
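The checkerboard-kernel idea fits in a short function. A minimal sketch of Foote's method on cosine self-similarity; kernel size and taper width are illustrative:

```python
import numpy as np

def foote_novelty(features, kernel_size=8):
    """Novelty curve: slide a Gaussian-tapered checkerboard kernel along
    the diagonal of the feature self-similarity matrix (Foote, 2000)."""
    # Self-similarity matrix via cosine similarity (features: dims x frames)
    norm = features / (np.linalg.norm(features, axis=0, keepdims=True) + 1e-9)
    ssm = norm.T @ norm
    # Checkerboard: +1 in past-past / future-future quadrants, -1 across
    half = kernel_size // 2
    idx = np.arange(kernel_size) - half + 0.5
    sign = np.sign(np.outer(idx, idx))
    taper = np.exp(-0.5 * np.add.outer(idx**2, idx**2) / half**2)
    kernel = sign * taper
    novelty = np.zeros(ssm.shape[0])
    for t in range(half, ssm.shape[0] - half):
        novelty[t] = np.sum(kernel * ssm[t - half:t + half, t - half:t + half])
    return novelty

# Two homogeneous sections: novelty peaks exactly at the boundary (frame 20)
feats = np.hstack([np.tile([[1], [0]], 20), np.tile([[0], [1]], 20)])
print(int(np.argmax(foote_novelty(feats))))  # 20
```

Inside a homogeneous section the positive and negative quadrants cancel, so the curve sits near zero; only a past-vs-future contrast produces a peak.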
Per-band z-scores showing when each frequency band’s energy deviates from its own 60-second running average. The white “Max” curve shows the most dramatic deviation in any band at each moment.
Applies our finding #2 (airiness = deviation from local context) at the section level. A bass drop, a high-frequency screech, or any sudden spectral shift will spike the relevant band’s z-score — without needing to know what kind of change it is.
Complements Foote’s novelty: Foote detects that a boundary occurred, band deviation shows which frequency range changed and by how much.
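The per-band deviation is a running z-score. A sketch using a trailing window that includes the current frame (window length in frames stands in for the 60-second context):

```python
import numpy as np

def running_zscore(x, window):
    """Deviation of each frame from its trailing running average,
    in units of the trailing running std."""
    z = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        ctx = x[max(0, t - window):t + 1]
        std = ctx.std()
        z[t] = (x[t] - ctx.mean()) / std if std > 0 else 0.0
    return z

# Flat band energy with one sudden spike: the spike dominates the z-score
energy = np.ones(100); energy[80] = 10.0
energy += 0.01 * np.sin(np.arange(100))   # tiny jitter so std > 0
z = running_zscore(energy, window=50)
print(int(np.argmax(z)))  # 80
```

On-device the mean and variance would come from an exponential moving average rather than a true sliding window, which is O(1) per frame.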
Same four analysis panels, plus a swim-lane overlay for your tap annotations.
Your own tap data overlaid on the analysis — beat taps, section changes, airy moments, flourishes. Whatever layers exist in the .annotations.yaml file. Press T to tap while playing.
Origin: Custom to this project. Our “test set” for evaluating audio features against human perception.
Note: tap annotations exhibit tactus ambiguity — listeners lock onto different metrical layers (kick, snare, off-beat) per song, so taps may be phase-shifted from the “metric beat” by 100–250ms (Martens 2011, London 2004). LEDs could exploit this: by flashing a specific layer, we may be able to entrain the audience’s tactus rather than follow it.
Four source separation algorithms, from deep learning to dictionary-based. Each decomposes audio into constituent parts. Use number keys to solo/mute stems where audio playback is available.
htdemucs (Meta, 2022): deep learning 4-stem separation into drums, bass, vocals, and other. Mel spectrograms per stem with individual audio playback. ~25 seconds CPU for a 50-second track.
Origin: Hybrid Transformer Demucs (Défossez, 2023). State-of-the-art offline source separation. Too slow for real-time on ESP32, but useful as ground truth for evaluating lighter methods.
Reference-quality separation. Use as ground truth, not for real-time.
Harmonic-Percussive Source Separation: median filtering on the spectrogram — horizontal streaks (harmonic) vs vertical streaks (percussive). No ML, frame-by-frame, trivially real-time on ESP32.
Origin: Fitzgerald (2010). Exploits the visual structure of spectrograms — harmonic content forms horizontal lines, percussive content forms vertical lines.
Two stems with audio playback. A coarse but fast decomposition: drums and transients land in percussive, sustained notes and chords in harmonic.
The simplest viable real-time decomposition. Already ESP32-feasible.
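The whole algorithm is two median filters and a soft mask. A minimal sketch on a magnitude spectrogram (the filter length of 17 bins is illustrative; librosa's `decompose.hpss` defaults differ):

```python
import numpy as np

def hpss_masks(S, k=17):
    """Fitzgerald-style HPSS: median filter the magnitude spectrogram S
    (freq x time) along time (harmonic) and along frequency (percussive),
    then build soft Wiener-like masks."""
    pad = k // 2
    # Median along time: horizontal streaks survive -> harmonic estimate
    St = np.pad(S, ((0, 0), (pad, pad)), mode="edge")
    H = np.median(np.stack([St[:, i:i + S.shape[1]] for i in range(k)]), axis=0)
    # Median along frequency: vertical streaks survive -> percussive estimate
    Sf = np.pad(S, ((pad, pad), (0, 0)), mode="edge")
    P = np.median(np.stack([Sf[i:i + S.shape[0], :] for i in range(k)]), axis=0)
    mask_h = H**2 / (H**2 + P**2 + 1e-12)
    return mask_h, 1.0 - mask_h

# A horizontal line (sustained tone) is claimed by the harmonic mask
S = np.zeros((64, 64)); S[30, :] = 1.0
mh, mp = hpss_masks(S)
print(mh[30, 32] > 0.9)  # True
```

An ESP32 version would keep a ring buffer of the last k frames and emit masks one column at a time.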
REPET (REPeating Pattern Extraction Technique) separates audio into repeating (background) and non-repeating (foreground) layers by detecting cyclic patterns in the spectrogram. No ML — just autocorrelation + median filtering + soft masking. ESP32-feasible.
Panels: beat spectrum (with detected period), soft mask, and spectrograms of each separated layer. Use 1/2 keys to solo/mute layers.
Based on Rafii & Pardo 2012. Tests whether pattern repetition alone can usefully decompose music for LED mapping.
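The core of REPET's period detection is the beat spectrum. A simplified 1-D sketch, autocorrelating per-frame energy instead of the full spectrogram REPET actually uses:

```python
import numpy as np

def beat_spectrum(frame_energy):
    """Autocorrelation of per-frame energy: peaks at lags where the
    signal repeats (the repeating period REPET looks for)."""
    x = frame_energy - frame_energy.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    return ac / (ac[0] + 1e-12)            # normalize so lag 0 -> 1

# Energy repeating every 8 frames: strongest non-trivial peak at lag 8
pattern = np.array([1, 0, 0, 0, 0.5, 0, 0, 0])
energy = np.tile(pattern, 16).astype(float)
bs = beat_spectrum(energy)
print(int(np.argmax(bs[2:20]) + 2))  # 8
```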
Online Supervised NMF: pre-trained spectral dictionaries (10 components per source from 8 demucs-separated tracks) decompose each audio frame into drums/bass/vocals/other activations. 0.07ms/frame — ESP32-feasible.
Top panel: per-source activation curves (normalized). Lower panels: Wiener-masked spectrograms per source. No stem audio toggle (NMF produces energy estimates, not separated audio).
The most promising approach for real-time LED source attribution on ESP32. Dictionary: 64 mel bins × 40 components = 10KB.
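With the dictionary fixed, only the activations are solved per frame, which is what makes this cheap. A toy sketch using multiplicative updates on a hypothetical two-component dictionary (the real dictionary has 64 mel bins x 40 components):

```python
import numpy as np

def nmf_activations(frame, W, n_iter=50):
    """Project one magnitude frame onto a fixed dictionary W (bins x comps)
    with multiplicative updates; only the activations h are learned."""
    h = np.full(W.shape[1], 0.1)
    for _ in range(n_iter):
        h *= (W.T @ frame) / (W.T @ (W @ h) + 1e-12)
    return h

# Toy dictionary: a "bass-like" and a "treble-like" spectral template
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
frame = np.array([2.0, 2.0, 0.0, 0.0])     # a pure "bass" frame
h = nmf_activations(frame, W)
print(h.round(2))  # [2. 0.]
```

Summing activations per source group (10 components each for drums/bass/vocals/other) yields the per-source curves in the top panel.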
Experimental features we’re evaluating for LED mapping potential. Not yet proven useful on their own, but may become inputs for derived features.
How noise-like vs tonal each frame is (0 = pure tone, 1 = white noise). Could indicate texture changes between sections.
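Flatness is the geometric mean over the arithmetic mean of the power spectrum. A minimal sketch:

```python
import numpy as np

def spectral_flatness(power_spec, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum:
    1.0 for flat (noise-like) spectra, near 0 for a single tone."""
    p = power_spec + eps
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))

flat = np.ones(512)                  # white-noise-like: flatness 1.0
tone = np.zeros(512); tone[40] = 1   # single partial: flatness ~ 0
print(round(spectral_flatness(flat), 2), spectral_flatness(tone) < 0.01)
```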
Pitch class energy over time — which notes (C, C#, D, …) are present in each frame. Could detect key changes or harmonic shifts.
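A chroma frame is FFT-bin energy folded into 12 pitch classes. A rough sketch using nearest-semitone binning (real chroma implementations spread energy across neighboring notes; this one does not):

```python
import numpy as np

def chroma_frame(power_spec, freqs, fmin=27.5):
    """Fold FFT-bin energy into 12 pitch classes (C=0 ... B=11)."""
    chroma = np.zeros(12)
    valid = freqs >= fmin
    # MIDI-style note numbers; 69 = A4 = 440 Hz, pitch class A = 9
    notes = np.round(69 + 12 * np.log2(freqs[valid] / 440.0)).astype(int)
    np.add.at(chroma, notes % 12, power_spec[valid])
    return chroma

# All energy at 440 Hz lands in pitch class A (index 9)
freqs = np.fft.rfftfreq(4096, d=1 / 44100)
spec = np.zeros_like(freqs); spec[np.argmin(np.abs(freqs - 440))] = 1.0
print(int(np.argmax(chroma_frame(spec, freqs))))  # 9
```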
Peak-to-valley difference per frequency band. High contrast = clear tonal content. Low contrast = noise or dense mix.
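For one band, the contrast is the gap between the strongest and weakest bins in dB. A sketch (the 20% peak/valley fraction is illustrative, borrowed from common spectral-contrast implementations):

```python
import numpy as np

def band_contrast(power_spec, frac=0.2, eps=1e-10):
    """Peak-to-valley contrast within one band, in dB: mean of the top
    frac of bins minus mean of the bottom frac of bins."""
    s = np.sort(power_spec)
    k = max(1, int(frac * len(s)))
    peak = np.mean(s[-k:]) + eps
    valley = np.mean(s[:k]) + eps
    return 10.0 * np.log10(peak / valley)

tonal = np.full(64, 1e-6); tonal[10] = 1.0   # one strong partial: high contrast
noisy = np.ones(64)                           # flat band: zero contrast
print(band_contrast(tonal) > 30, round(band_contrast(noisy), 1))  # True 0.0
```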
How often the waveform crosses zero per frame. High ZCR = percussive or noisy. Low ZCR = smooth, tonal.
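ZCR is the cheapest feature here, just sign comparisons, so it is trivially ESP32-feasible. A sketch:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

sr = 8000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100 * t)                      # smooth tone: low ZCR
noise = np.random.default_rng(0).standard_normal(sr)   # noisy: ZCR near 0.5
print(zero_crossing_rate(low) < zero_crossing_rate(noise))  # True
```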
Spectral flux — how much the spectrum changes between adjacent frames. Peaks = “something new happened.”
Measures something real (spectral novelty) but raw values don’t map to perceived beats — F1=0.435 on Harmonix, only 48.5% of user taps align. Potential as a derived feature (e.g. deviation from local average could signal section changes).
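Flux is a half-wave-rectified frame-to-frame difference. A sketch using the usual onset-strength convention where only energy increases count:

```python
import numpy as np

def spectral_flux(S):
    """Half-wave-rectified spectral change between adjacent frames of a
    magnitude spectrogram S (freq x time)."""
    diff = np.diff(S, axis=1)
    return np.sum(np.maximum(diff, 0.0), axis=0)

# A new partial appearing at frame 5 produces a flux peak at that step
S = np.zeros((32, 10)); S[8, 5:] = 1.0
flux = spectral_flux(S)
print(int(np.argmax(flux)))  # 4: the step from frame 4 into frame 5
```

The derived feature mentioned above would then be a running z-score of this curve rather than the raw values.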
Currently only in the local matplotlib viewer, not yet ported to the web.
Beat tracking via librosa.beat.beat_track — estimates tempo then snaps onset peaks to a grid.
Doubles tempo on syncopated rock (161.5 vs ~83 BPM on Tool’s Opiate). Built on top of onset strength, which is itself a weak beat discriminator. Best F1=0.500 on dense rock. Not reliable enough to drive LED effects.
Requires BlackHole 2ch for system audio capture. Recording will not work without it.
Tip: To hear audio while recording, open Audio MIDI Setup, click + → Create Multi-Output Device, check both your speakers and BlackHole, then set that as your system output.