These panels are always rendered. They represent the core audio properties that directly drive LED mapping and musical understanding.
Raw audio samples (white) with optional RMS energy overlay (yellow, toggle with E). RMS is the root-mean-square of the waveform in each frame — smoothed loudness over time, scaled to match waveform amplitude.
Origin: Waveform display dates to oscilloscopes in the 1940s. RMS as a power measure dates to 19th-century electrical engineering; standard in audio since VU meters in the 1930s. Every DAW has both.
Waveform shows transient attacks, silence, macro structure. RMS reveals energy trends the raw waveform hides — our research found that derivatives of RMS matter more than absolute values (climax brightens 58x faster than build, despite identical static RMS).
Non-negotiable. RMS overlay hidden by default to reduce visual clutter — enable when analyzing energy trajectories.
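A minimal sketch of how the overlay could be computed with librosa; the frame/hop sizes and peak-normalized scaling here are assumptions, not necessarily what the panel uses:

```python
import numpy as np
import librosa

# Frame-wise RMS over the waveform (frame/hop sizes are assumptions).
y, sr = librosa.load("song.wav", sr=22050, mono=True)
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]

# Scale RMS to the waveform's peak so both curves share one y-axis.
rms_scaled = rms / (rms.max() + 1e-9) * np.abs(y).max()
rms_times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=512)
```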
Short-time Fourier Transform (STFT) converted to mel scale and displayed as a heatmap. Time on x-axis, frequency on y-axis (low=bottom, high=top), color=loudness.
Origin: The mel scale comes from Stevens, Volkmann & Newman (1937) — psychoacoustic research showing humans perceive pitch logarithmically (200Hz→400Hz sounds the same as 400Hz→800Hz). The spectrogram (STFT) dates to Gabor (1946). Mel spectrograms became standard input for audio ML in the 1980s.
You can see bass hits (bright blobs at bottom), vocals (middle bands), hi-hats (top). Harmonic content = horizontal lines. Percussive content = vertical lines — this is why HPSS works (median filtering by orientation).
The single most informative audio visualization. Industry standard.
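For reference, a minimal librosa version of this panel; n_fft, hop length, and the 64-band/8kHz range are assumptions (the band count borrowed from the NMF dictionary size mentioned below):

```python
import numpy as np
import librosa

# STFT -> mel scale -> dB heatmap; parameter choices are assumptions.
y, sr = librosa.load("song.wav", sr=22050)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, n_mels=64, fmax=8000)
S_db = librosa.power_to_db(S, ref=np.max)   # color axis: loudness in dB
```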
The mel spectrogram collapsed into 5 bands — Sub-bass (20–80Hz), Bass (80–250Hz), Mids (250–2kHz), High-mids (2–6kHz), Treble (6–8kHz) — each plotted as a line over time.
Origin: Multi-band meters from mixing engineering. Band boundaries follow critical band theory (Fletcher, 1940s) and PA crossover points. “Bass energy over time” is the foundation of almost every audio-reactive LED system (WLED-SR’s entire beat detection = threshold on the bass bin).
Shows which frequency range dominates at each moment. A bass drop = Sub-bass/Bass spike. A cymbal crash = treble spike.
Standard in audio-reactive systems. Useful reference for understanding frequency content.
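One way to collapse the mel spectrogram into these five bands; the band edges come from the list above, everything else is an assumption:

```python
import numpy as np
import librosa

# Sum mel bins into the five bands listed above.
BANDS = {"Sub-bass": (20, 80), "Bass": (80, 250), "Mids": (250, 2000),
         "High-mids": (2000, 6000), "Treble": (6000, 8000)}

y, sr = librosa.load("song.wav", sr=22050)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, fmax=8000)
mel_freqs = librosa.mel_frequencies(n_mels=64, fmax=8000)

band_curves = {
    name: S[(mel_freqs >= lo) & (mel_freqs < hi)].sum(axis=0)
    for name, (lo, hi) in BANDS.items()
}   # each value is one energy-over-time line in the panel
```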
Your own tap data overlaid on the analysis — beat taps, section changes, airy moments, flourishes. Whatever layers exist in the .annotations.yaml file.
Origin: Custom to this project. Our “test set” for evaluating audio features against human perception.
Note: tap annotations exhibit tactus ambiguity — listeners lock onto different metrical layers (kick, snare, off-beat) per song, so taps may be phase-shifted from the “metric beat” by 100–250ms (Martens 2011, London 2004). LEDs could exploit this: by flashing a specific layer, we may be able to entrain the audience’s tactus rather than follow it.
Essential for research. Only shown when annotation data exists.
Real audio properties, hidden by default. Not directly useful as raw indicators for LED mapping, but promising as inputs for derived features — running averages, deviation from context, rate-of-change, etc.
Spectral flux — how much the spectrum changes between adjacent frames. Peaks = “something new happened.” Toggle with O.
Measures something real (spectral novelty) but raw values don’t map to perceived beats — F1=0.435 on Harmonix, only 48.5% of user taps align. Potential as a derived feature (e.g. deviation from local average could signal section changes).
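A sketch of the flux curve plus the derived deviation-from-local-average idea; the smoothing window length is an assumption:

```python
import numpy as np
import librosa

# Spectral flux via librosa's onset envelope, plus deviation from a running mean.
y, sr = librosa.load("song.wav", sr=22050)
flux = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)

win = 200                                     # ~4.6 s of context at 512-sample hops
local_mean = np.convolve(flux, np.ones(win) / win, mode="same")
deviation = flux - local_mean                 # sustained positive runs: candidate section changes
```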
The “center of mass” of the spectrum: the magnitude-weighted mean frequency, so it rises when high-frequency energy dominates. Often called “brightness.” Toggle with C.
A standard timbral descriptor (Grey, 1977). Raw centroid isn’t directly useful for LED mapping, but derived features (running average, deviation = “airiness”) could detect timbral shifts between sections.
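The same pattern as a sketch, with the running average and deviation as candidate derived features; the window length is an assumption:

```python
import numpy as np
import librosa

# Spectral centroid plus a running average and an "airiness"-style deviation.
y, sr = librosa.load("song.wav", sr=22050)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=512)[0]

win = 200
running_avg = np.convolve(centroid, np.ones(win) / win, mode="same")
airiness = centroid - running_avg             # positive when brighter than recent context
```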
Beat tracking via librosa.beat.beat_track: estimates a global tempo from the onset envelope, then picks beat times that line up with onset peaks while staying consistent with that tempo. Toggle with B.
Why second class: Doubles tempo on syncopated rock (161.5 vs ~83 BPM on Tool’s Opiate). Built on top of onset strength, which is itself a weak beat discriminator. Best F1=0.500 on dense rock.
Useful as a sanity check. Not reliable enough to drive LED effects directly.
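The call itself, for anyone reproducing the numbers above; the hop length is an assumption:

```python
import librosa

# Tempo estimate plus beat frames; beware octave errors (double/half tempo).
y, sr = librosa.load("song.wav", sr=22050)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=512)
beat_times = librosa.frames_to_time(beat_frames, sr=sr, hop_length=512)
```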
Rate-of-change of loudness (dRMS/dt). Red = getting louder, blue = getting quieter. Our most validated finding: a build and its climax can have identical RMS, but the climax brightens 58x faster.
Now on the Analysis panel. The signal that distinguishes builds from drops.
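A sketch of the derivative; the smoothing width is an assumption:

```python
import numpy as np
import librosa

# Frame-rate gradient of RMS, lightly smoothed.
y, sr = librosa.load("song.wav", sr=22050)
hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)[0]

frame_rate = sr / hop                                   # frames per second
drms = np.gradient(rms) * frame_rate                    # change in loudness per second
drms = np.convolve(drms, np.ones(5) / 5, mode="same")   # light smoothing
# drms > 0: getting louder (red); drms < 0: getting quieter (blue)
```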
Online Supervised NMF: pre-trained spectral dictionaries (10 components per source from 8 demucs-separated tracks) decompose each audio frame into drums/bass/vocals/other activations. 0.07ms/frame — ESP32-feasible.
Top panel: per-source activation curves (normalized). Lower panels: Wiener-masked spectrograms per source. No stem audio toggle (NMF produces energy estimates, not separated audio).
The most promising approach for real-time LED source attribution on ESP32. Dictionary: 64 mel bins × 40 components = 10KB.
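A hedged sketch of the per-frame step, assuming the dictionary is stored column-wise as drums/bass/vocals/other blocks of 10 components each and solving activations with non-negative least squares; the file name and solver choice are illustrative, not the tool's actual code:

```python
import numpy as np
from scipy.optimize import nnls

# Pre-trained dictionary: 64 mel bins x 40 components (10 per source).
# File name and column ordering are assumptions for illustration.
W = np.load("nmf_dictionary.npy")             # shape (64, 40)
SOURCES = ["drums", "bass", "vocals", "other"]

def frame_activations(mel_frame):
    """Decompose one 64-bin mel frame into per-source energy estimates."""
    h, _ = nnls(W, mel_frame)                 # non-negative activations, shape (40,)
    per_source = h.reshape(4, 10).sum(axis=1)
    return dict(zip(SOURCES, per_source))
```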
REPET (REPeating Pattern Extraction Technique) separates audio into repeating (background) and non-repeating (foreground) layers by detecting cyclic patterns in the spectrogram. No ML — just autocorrelation + median filtering + soft masking. ESP32-feasible.
Panels: beat spectrum (with detected period), soft mask, and spectrograms of each separated layer. Use 1/2 keys to solo/mute layers.
Based on Rafii & Pardo 2012. Tests whether pattern repetition alone can usefully decompose music for LED mapping.
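A condensed sketch of the pipeline (beat spectrum, periodic median model, soft mask); the crude period picker and all parameters are simplifications of Rafii & Pardo's method, not the tool's exact implementation:

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050)
V = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
n_freq, n_frames = V.shape

# Beat spectrum: autocorrelation of the power spectrogram, averaged over frequency.
P = V ** 2
lags = range(1, n_frames // 3)
beat_spec = np.array([np.mean(P[:, lag:] * P[:, :n_frames - lag]) for lag in lags])
period = int(np.argmax(beat_spec[10:])) + 11     # skip very short lags; crude period pick

# Repeating model: median over frames spaced one period apart.
n = (n_frames // period) * period
segs = V[:, :n].reshape(n_freq, n // period, period)
model = np.median(segs, axis=1)                  # (n_freq, period)
repeating = np.tile(model, (1, n // period))     # back to (n_freq, n)

# Soft mask: the repeating layer can never exceed the mixture.
mask = np.minimum(repeating, V[:, :n]) / (V[:, :n] + 1e-9)
background = mask * V[:, :n]         # repeating layer
foreground = (1 - mask) * V[:, :n]   # non-repeating layer
```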
Four experimental features: spectral flatness, chromagram, spectral contrast, and zero crossing rate. Use this tab to evaluate whether these are useful indicators for LED mapping.
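The four features as plain librosa calls, using default parameters (the panel's exact settings may differ):

```python
import librosa

y, sr = librosa.load("song.wav", sr=22050)
flatness = librosa.feature.spectral_flatness(y=y)[0]       # ~1 = noise-like, ~0 = tonal
chroma   = librosa.feature.chroma_stft(y=y, sr=sr)         # 12 pitch classes x time
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # per-band peak-to-valley contrast
zcr      = librosa.feature.zero_crossing_rate(y)[0]        # crude noisiness / percussiveness
```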