Computer Vision to Sound in the Browser

Web-based solutions are useful when we need to run a project on many different devices without installing or configuring anything: they are accessible and easy to distribute. This exercise demonstrates how computer vision data can be mapped directly to sound synthesis in the browser. The system runs entirely in JavaScript, using:

  • MediaPipe Hands for real-time hand tracking (via webcam)

  • WebAudio (via Tone.js) for browser-based audio synthesis

No installation is required: everything runs locally in the browser once camera access is granted.

Browser Notes

Works best in Chrome / Chromium (desktop). Serve the page over HTTPS (or from localhost, which browsers treat as a secure context) and click Start once to unlock audio.

  • Chrome / Edge (desktop) – full WebAudio + camera support; best iframe stability.
  • Firefox – reliable; may need an extra click in iframes.
  • Safari 16+ – HTTPS only; user gesture required; limited iframe camera access.
  • Mobile (Android / iOS) – slower tracking, rotation quirks, stricter permissions.
  • Older browsers (pre-2022) – may lack WebAudio or MediaPipe WASM support.
  • Privacy browsers (Brave / Vivaldi) – may silently block camera or WASM threads.
  • Incognito modes – often prevent persistent camera access.

Tip: Refresh after granting permission; use stable Wi-Fi; avoid nested iframes lacking allow="camera; microphone; autoplay".
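
If you embed the page in another site yourself, a minimal embed that grants those permissions could look like this (the src path is only a placeholder for wherever you host the file):

    <iframe src="https://example.org/hand-synth.html"
            allow="camera; microphone; autoplay"
            width="800" height="600"></iframe>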


The Details

We can download and run the code above on our local machine. This allows us to edit and adjust it as needed.

  • Save this HTML file by right-clicking the link and selecting Save As.

  • Open the downloaded file with a browser to make sure it is running smoothly on your machine.
    • An internet connection is still required, since the MediaPipe and Tone.js libraries are loaded over the network; WebAudio itself is built into the browser.

  • Open the HTML file in your editor of choice to explore and modify the sections described below.

Signal Flow

The program can be divided into three conceptual blocks:

  1. Vision tracking
    • The webcam feed is analyzed by the MediaPipe Hands model.
    • The position of the index fingertip is extracted in normalized coordinates (x and y between 0 and 1).

  2. Mapping
    • The X position (horizontal) controls pitch in MIDI note space.
    • The Y position (vertical) controls the filter cutoff frequency.
    • These values are scaled by the function mapXY() (sketched below).

  3. Sound synthesis
    • A single square-wave oscillator feeds a low-pass filter.
    • The filter’s cutoff and the oscillator’s frequency are updated continuously from the hand position.
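
To make the mapping stage concrete, here is a minimal sketch of what mapXY() could look like, written only from the description above. It assumes the constants (PITCH_MIN_MIDI, PITCH_RANGE, CUTOFF_MIN_HZ, CUTOFF_MAX_HZ) and helpers (clamp01(), midiToHz()) described in the next section; the actual function in the file may differ, for example in the direction of the Y axis or the shape of the return value.

    // Sketch of the mapping stage – not the file’s exact code.
    function mapXY(iTip) {
      const x = clamp01(iTip.x);                      // horizontal position, 0..1
      const y = clamp01(iTip.y);                      // vertical position, 0..1 (0 = top of frame)
      const midi = PITCH_MIN_MIDI + x * PITCH_RANGE;  // X -> pitch in MIDI note space
      // (1 - y) so that raising the hand opens the filter – an assumption;
      // the file may map Y the other way around.
      const cutoff = CUTOFF_MIN_HZ + (1 - y) * (CUTOFF_MAX_HZ - CUTOFF_MIN_HZ);
      return { freq: midiToHz(midi), cutoff };
    }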


Code Structure

The HTML file is organized into clear sections with comments:

  • Section A – Constants: adjustable numbers that define the pitch and filter ranges. These are safe for students to modify.

  • Section B – Utilities: helper functions such as midiToHz() and clamp01().

  • Section C – AudioEngine: a minimal subtractive synthesizer implemented with Tone.js (see the sketch after this list).

  • Section D – VisionTracker: a webcam interface and manual processing loop that stays stable inside iframes.

  • Section E – Application Glue: connects camera → mapping → sound, and updates the UI.
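
As a rough idea of what Section C contains: a subtractive voice like the one described (square-wave oscillator into a low-pass filter) can be built with Tone.js roughly as follows. This is a sketch based on the description above, not the file’s actual AudioEngine class.

    // Square oscillator -> low-pass filter -> output (sketch, not the file’s code)
    const filter = new Tone.Filter(800, "lowpass").toDestination();
    const osc = new Tone.Oscillator({ type: "square", frequency: 220 }).connect(filter);

    async function startAudio() {
      await Tone.start();   // WebAudio must be unlocked by a user gesture (the Start button)
      osc.start();
    }

    function update(freq, cutoff) {
      // Short ramps instead of hard jumps avoid clicks while the hand moves.
      osc.frequency.rampTo(freq, 0.05);
      filter.frequency.rampTo(cutoff, 0.05);
    }

The update() helper here exists only for the sketches in this text; in the real file the equivalent updates happen inside the AudioEngine.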

Within Section E, the function mapXY(iTip) is where students should experiment with the mapping between motion and sound.
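
For orientation, the glue between tracking and sound might look roughly like the following. The names hands, video, and lm are placeholders for whatever Section D actually exposes, and update() refers to the audio sketch above.

    // MediaPipe delivers results through a callback (Section E glue, sketched):
    hands.onResults((results) => {
      const lm = results.multiHandLandmarks && results.multiHandLandmarks[0];
      if (!lm) return;                        // no hand detected in this frame
      const iTip = lm[8];                     // landmark 8 = index fingertip
      const { freq, cutoff } = mapXY(iTip);   // mapping (Section E)
      update(freq, cutoff);                   // drive the synth (Section C)
    });

    // Manual processing loop (Section D) – run the model once per animation frame:
    async function loop() {
      await hands.send({ image: video });
      requestAnimationFrame(loop);
    }
    requestAnimationFrame(loop);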


Exercises

  1. Experiment with the mapping. Open the file and scroll to the section marked:

    === STUDENT SECTION: modify the mapping below ===

    Try these small edits:

    • Invert the horizontal control:

      const x = 1 - clamp01(iTip.x);
    • Quantize the pitch:

      const midi = Math.round(PITCH_MIN_MIDI + x * PITCH_RANGE);
    • Restrict the filter to a narrow band by lowering CUTOFF_MAX_HZ or raising CUTOFF_MIN_HZ.

  2. Add a new control parameter (optional). Use another landmark pair, such as the thumb tip (index 4) together with the index fingertip, to compute a distance and map it to the filter resonance Q (a sketch follows after these exercises).

  3. Reflect. How does the choice of mapping affect the expressive quality of the interaction? What gestures feel most “instrumental”?
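
For exercise 2, one possible starting point is sketched below. The helper dist2d(), the 0.3 normalization constant, and the resulting Q range of roughly 1–11 are all made up for illustration; filter stands for the low-pass filter in the audio engine.

    // Distance between two normalized landmarks (illustrative helper):
    function dist2d(a, b) {
      return Math.hypot(a.x - b.x, a.y - b.y);
    }

    // Inside the results callback, after iTip has been read:
    const thumbTip = lm[4];                               // landmark 4 = thumb tip
    const pinch = clamp01(dist2d(iTip, thumbTip) / 0.3);  // ~0.3 ≈ fully open pinch (a guess)
    filter.Q.rampTo(1 + pinch * 10, 0.05);                // map pinch distance to resonance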