Audio Explorer


Analysis

Four panels are rendered for every audio file. Waveform and spectrogram are industry standard; band energy and RMS derivative are our additions for LED mapping.

Waveform + RMS overlay

Raw audio samples (white) with optional RMS energy overlay (yellow, toggle with E). RMS is the root-mean-square of the waveform in each frame — smoothed loudness over time, scaled to match waveform amplitude.

Origin: Waveform display dates to oscilloscopes in the 1940s. RMS as a power measure dates to 19th-century electrical engineering; standard in audio since VU meters in the 1930s. Every DAW has both.

Waveform shows transient attacks, silence, macro structure. RMS reveals energy trends the raw waveform hides — our research found that derivatives of RMS matter more than absolute values (climax brightens 58x faster than build, despite identical static RMS).
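The per-frame computation can be sketched in plain numpy — the frame and hop sizes here are illustrative assumptions, not necessarily the app's settings:

```python
import numpy as np

def frame_rms(x, frame=1024, hop=256):
    # Root-mean-square per frame: sqrt(mean(sample^2)) over each window.
    n = 1 + (len(x) - frame) // hop
    return np.array([np.sqrt(np.mean(x[i * hop : i * hop + frame] ** 2))
                     for i in range(n)])
```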

Mel Spectrogram

Short-time Fourier Transform (STFT) converted to mel scale and displayed as a heatmap. Time on x-axis, frequency on y-axis (low=bottom, high=top), color=loudness.

Origin: The mel scale comes from Stevens, Volkmann & Newman (1937) — psychoacoustic research showing humans perceive pitch logarithmically (200Hz→400Hz sounds the same as 400Hz→800Hz). The spectrogram (STFT) dates to Gabor (1946). Mel spectrograms became standard input for audio ML in the 1980s.

You can see bass hits (bright blobs at bottom), vocals (middle bands), hi-hats (top). Harmonic content = horizontal lines. Percussive content = vertical lines — this is why HPSS works (median filtering by orientation).
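The whole pipeline fits in a short numpy sketch — the HTK-style mel formula, Hann window, and parameter values below are illustrative assumptions, not the app's exact configuration:

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel formula: logarithmic above ~700 Hz, near-linear below.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=64, fmin=20.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def mel_spectrogram(x, sr, n_fft=1024, hop=256, n_mels=64):
    # Frame, window, FFT -> power spectrum, then pool into mel bands.
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
```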

Band Energy

The mel spectrogram collapsed into 5 bands — Sub-bass (20–80Hz), Bass (80–250Hz), Mids (250–2kHz), High-mids (2–6kHz), Treble (6–8kHz) — each plotted as a line over time.

Origin: Multi-band meters from mixing engineering. Band boundaries follow critical band theory (Fletcher, 1940s) and PA crossover points. “Bass energy over time” is the foundation of almost every audio-reactive LED system (WLED-SR’s entire beat detection = threshold on the bass bin).

Shows which frequency range dominates at each moment. A bass drop = Sub-bass/Bass spike. A cymbal crash = treble spike.
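A minimal sketch of the collapse, using the band edges listed above (FFT parameters are illustrative assumptions):

```python
import numpy as np

# Band edges in Hz, as listed above.
BANDS = {"Sub-bass": (20, 80), "Bass": (80, 250), "Mids": (250, 2000),
         "High-mids": (2000, 6000), "Treble": (6000, 8000)}

def band_energy(x, sr, n_fft=1024, hop=256):
    # STFT power per frame, summed within each band's frequency range.
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    return {name: power[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
            for name, (lo, hi) in BANDS.items()}
```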

RMS Derivative Custom

Rate-of-change of loudness (dRMS/dt). Red = getting louder, blue = getting quieter. Our most validated finding: a build and its climax can have identical RMS, but the climax brightens 58x faster.

The signal that distinguishes builds from drops. Derivatives > absolutes.
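As a sketch, the derivative is just the RMS curve differentiated with respect to time (hop and sample rate here are illustrative defaults):

```python
import numpy as np

def rms_derivative(rms, hop=256, sr=16000):
    # dRMS/dt in RMS units per second; positive = getting louder (red),
    # negative = getting quieter (blue).
    return np.gradient(rms, hop / sr)
```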

Foote’s Checkerboard Novelty Custom

Two novelty curves showing where the “character of the music” changes, computed via Foote’s (2000) checkerboard kernel on a self-similarity matrix. Peaks = section boundaries.

MFCC novelty (coral) detects timbral changes — when the texture, instrument mix, or sonic character shifts. Chroma novelty (teal) detects harmonic changes — key changes, chord progressions, new melodic content.

Origin: Foote, “Automatic Audio Segmentation Using a Measure of Audio Novelty” (2000). Computes a self-similarity matrix from feature vectors, then slides a Gaussian-tapered checkerboard kernel along the diagonal. When past frames are similar to each other but dissimilar to future frames, the kernel produces a peak.

The most general section-boundary detector. Doesn’t label what changed — just that something did. Two feature sets let you see if the change is timbral, harmonic, or both.
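The mechanism can be sketched as follows — cosine self-similarity plus a Gaussian-tapered checkerboard kernel; the kernel size and taper width are illustrative choices, not Foote's exact parameters:

```python
import numpy as np

def checkerboard_kernel(size):
    # Gaussian-tapered checkerboard: +1 on the two diagonal quadrants,
    # -1 on the two off-diagonal quadrants.
    half = size // 2
    g = np.exp(-0.5 * (np.arange(size) - half + 0.5) ** 2 / (size / 4) ** 2)
    sign = np.ones((size, size))
    sign[:half, half:] = -1
    sign[half:, :half] = -1
    return sign * np.outer(g, g)

def foote_novelty(features, kernel_size=16):
    # features: (n_frames, n_dims). Cosine self-similarity matrix, then
    # slide the checkerboard kernel along the diagonal.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    ssm = f @ f.T
    k = checkerboard_kernel(kernel_size)
    half = kernel_size // 2
    nov = np.zeros(ssm.shape[0])
    for t in range(half, ssm.shape[0] - half):
        nov[t] = np.sum(ssm[t - half:t + half, t - half:t + half] * k)
    return nov
```

Run with MFCC frames you get the timbral curve; with chroma frames, the harmonic curve.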

Band Deviation from Long-Term Context Custom

Per-band z-scores showing when each frequency band’s energy deviates from its own 60-second running average. The white “Max” curve shows the most dramatic deviation in any band at each moment.

Applies our finding #2 (airiness = deviation from local context) at the section level. A bass drop, a high-frequency screech, or any sudden spectral shift will spike the relevant band’s z-score — without needing to know what kind of change it is.

Complements Foote’s novelty: Foote detects that a boundary occurred, band deviation shows which frequency range changed and by how much.
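A sketch of the per-band computation (a trailing window; 60 s of context would be 60 · sr / hop frames — the exact windowing is an assumption):

```python
import numpy as np

def running_zscore(x, win):
    # z-score of each frame against a trailing window of `win` frames.
    z = np.zeros(len(x))
    for t in range(len(x)):
        ctx = x[max(0, t - win):t + 1]
        z[t] = (x[t] - ctx.mean()) / (ctx.std() + 1e-9)
    return z
```

Applied to each band's energy curve separately; the white "Max" curve then plausibly corresponds to the largest-magnitude z across bands at each frame.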

Annotations

Same four analysis panels, plus a swim-lane overlay for your tap annotations.

Tap Annotations Custom

Your own tap data overlaid on the analysis — beat taps, section changes, airy moments, flourishes. Whatever layers exist in the .annotations.yaml file. Press T to tap while playing.

Origin: Custom to this project. Our “test set” for evaluating audio features against human perception.

Note: tap annotations exhibit tactus ambiguity — listeners lock onto different metrical layers (kick, snare, off-beat) per song, so taps may be phase-shifted from the “metric beat” by 100–250ms (Martens 2011, London 2004). LEDs could exploit this: by flashing a specific layer, we may be able to entrain the audience’s tactus rather than follow it.


Decomposition

Four source separation algorithms, from deep learning to dictionary-based. Each decomposes audio into constituent parts. Use number keys to solo/mute stems where audio playback is available.

Decomposition › Demucs Common

htdemucs (Meta, 2022): deep-learning 4-stem separation into drums, bass, vocals, and other. Mel spectrograms per stem with individual audio playback. ~25 seconds of CPU time for a 50-second track.

Origin: Hybrid Transformer Demucs (Défossez, 2023). State-of-the-art offline source separation. Too slow for real-time on ESP32, but useful as ground truth for evaluating lighter methods.

Reference-quality separation. Use as ground truth, not for real-time.

Decomposition › HPSS

Harmonic-Percussive Source Separation: median filtering on the spectrogram — horizontal streaks (harmonic) vs vertical streaks (percussive). No ML, frame-by-frame, trivially real-time on ESP32.

Origin: Fitzgerald (2010). Exploits the visual structure of spectrograms — harmonic content forms horizontal lines, percussive content forms vertical lines.

Two stems with audio playback. A coarse but fast decomposition: drums and transients land in percussive, sustained notes and chords in harmonic.

The simplest viable real-time decomposition. Already ESP32-feasible.
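The core idea fits in a few lines of numpy — median filters in each direction, then soft masks. The kernel size and the Wiener-style squared masks are illustrative assumptions (the original formulation also admits hard masks):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(a, size, axis):
    # Median filter along one axis via sliding windows, edge-padded so the
    # output matches the input shape.
    pad = [(0, 0)] * a.ndim
    pad[axis] = (size // 2, size - size // 2 - 1)
    windows = sliding_window_view(np.pad(a, pad, mode="edge"), size, axis=axis)
    return np.median(windows, axis=-1)

def hpss_masks(S, kernel=17):
    # S: magnitude spectrogram (freq, time).
    # Harmonic content is smooth in time (median across the time axis);
    # percussive content is smooth in frequency (median across freq).
    H = median_filter_1d(S, kernel, axis=1)
    P = median_filter_1d(S, kernel, axis=0)
    mask_h = H ** 2 / (H ** 2 + P ** 2 + 1e-12)  # soft Wiener-style mask
    return mask_h * S, (1 - mask_h) * S
```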

Decomposition › REPET

REPET (REPeating Pattern Extraction Technique) separates audio into repeating (background) and non-repeating (foreground) layers by detecting cyclic patterns in the spectrogram. No ML — just autocorrelation + median filtering + soft masking. ESP32-feasible.

Panels: beat spectrum (with detected period), soft mask, and spectrograms of each separated layer. Use 1/2 keys to solo/mute layers.

Based on Rafii & Pardo 2012. Tests whether pattern repetition alone can usefully decompose music for LED mapping.
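The beat spectrum step can be sketched as follows — autocorrelate each frequency row of the power spectrogram over time and average; a minimal illustration, not the full REPET masking pipeline:

```python
import numpy as np

def beat_spectrum(power_spec):
    # power_spec: (freq, time). Autocorrelation of each frequency row over
    # time, summed across frequency; peaks mark the repeating period.
    X = power_spec - power_spec.mean(axis=1, keepdims=True)
    n = X.shape[1]
    acf = np.zeros(n)
    for row in X:
        acf += np.correlate(row, row, mode="full")[n - 1:]  # lags 0..n-1
    return acf / (acf[0] + 1e-12)

def repeating_period(acf, min_lag=2):
    # Strongest lag beyond a minimum, taken as the repeating period.
    return min_lag + int(np.argmax(acf[min_lag:]))
```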

Decomposition › NMF

Online Supervised NMF: pre-trained spectral dictionaries (10 components per source from 8 demucs-separated tracks) decompose each audio frame into drums/bass/vocals/other activations. 0.07ms/frame — ESP32-feasible.

Top panel: per-source activation curves (normalized). Lower panels: Wiener-masked spectrograms per source. No stem audio toggle (NMF produces energy estimates, not separated audio).

The most promising approach for real-time LED source attribution on ESP32. Dictionary: 64 mel bins × 40 components = 10KB.
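The per-frame solve can be sketched as multiplicative KL updates with the dictionary held fixed — a toy illustration; the app's trained dictionaries, component counts, and update rule details are not reproduced here:

```python
import numpy as np

def nmf_activations(v, W, n_iter=100):
    # v: (n_bins,) non-negative frame; W: (n_bins, n_comp) fixed dictionary.
    # Multiplicative updates minimizing KL divergence, W held fixed.
    h = np.full(W.shape[1], 0.1)
    col_sums = W.sum(axis=0) + 1e-12
    for _ in range(n_iter):
        wh = W @ h + 1e-12
        h *= (W.T @ (v / wh)) / col_sums
    return h
```

Per the setup described above, a source's activation curve would then be the sum of its components' activations at each frame.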

Lab

Experimental features we’re evaluating for LED mapping potential. Not yet proven useful on their own, but may become inputs for derived features.

Spectral Flatness

How noise-like vs tonal each frame is (0 = pure tone, 1 = white noise). Could indicate texture changes between sections.

Chromagram

Pitch class energy over time — which notes (C, C#, D, …) are present in each frame. Could detect key changes or harmonic shifts.

Spectral Contrast

Peak-to-valley difference per frequency band. High contrast = clear tonal content. Low contrast = noise or dense mix.

Zero Crossing Rate

How often the waveform crosses zero per frame. High ZCR = percussive or noisy. Low ZCR = smooth, tonal.
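Two of these features have one-line numpy definitions — sketches of the standard formulas, not the app's exact implementations:

```python
import numpy as np

def spectral_flatness(power):
    # Geometric mean over arithmetic mean of the power spectrum:
    # ~0 for a pure tone, ~1 for white noise.
    p = power + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose sign differs.
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
```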

Onset Strength Experimental

Spectral flux — how much the spectrum changes between adjacent frames. Peaks = “something new happened.”

Measures something real (spectral novelty) but raw values don’t map to perceived beats — F1=0.435 on Harmonix, only 48.5% of user taps align. Potential as a derived feature (e.g. deviation from local average could signal section changes).

Currently only in the local matplotlib viewer, not yet ported to the web.
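Spectral flux itself is a short numpy sketch — half-wave-rectified frame-to-frame difference, summed over frequency:

```python
import numpy as np

def spectral_flux(S):
    # S: magnitude spectrogram (freq, time). Half-wave-rectified increase
    # from one frame to the next, summed over frequency; a leading zero
    # keeps the curve aligned with the frames.
    d = np.diff(S, axis=1)
    return np.concatenate([[0.0], np.maximum(d, 0.0).sum(axis=0)])
```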

Librosa Beats Deprecated

Beat tracking via librosa.beat.beat_track — estimates tempo then snaps onset peaks to a grid.

Doubles tempo on syncopated rock (161.5 vs ~83 BPM on Tool’s Opiate). Built on top of onset strength, which is itself a weak beat discriminator. Best F1=0.500 on dense rock. Not reliable enough to drive LED effects.

Record

Requires BlackHole 2ch for system audio capture. Recording will not work without it.

Setup: BlackHole (macOS)
  1. Install BlackHole 2ch from existential.audio/blackhole
  2. In System Settings → Sound → Output, select BlackHole 2ch
  3. Restart this server — BlackHole will be auto-detected

Tip: To hear audio while recording, open Audio MIDI Setup, click +Create Multi-Output Device, check both your speakers and BlackHole, then set that as your system output.


Your Files

Space play/pause   ±5s   Click panel to seek
Drop audio file to upload