getDisplayMedia Let Me Down — So I Built a Desktop App to Stream My System Audio
By Rana Faraz
TLDR: Capturing system audio directly in the browser is nearly impossible due to browser limitations, missing APIs, and OS-level restrictions, especially on macOS. I needed to transcribe meetings without joining them, but getDisplayMedia() and virtual audio devices proved unreliable or too complex for users.
To solve this, I built a companion Electron app that uses WebRTC to stream system audio to the website. A local WebSocket signaling server automates the connection setup between the app and the site. The result: a seamless, user-friendly way to capture device audio with minimal setup.
The Problem: Why Capturing Device Audio is So Hard on the Web
Capturing system audio — such as audio from Zoom, Spotify, or Teams — is tricky, and making it work consistently across platforms is even more challenging. There are good reasons for this:
- Security & Privacy: Different browsers handle privacy differently. For example, Firefox won’t let you share any audio at all, regardless of the OS. On the other hand, Chrome allows you to share tab audio on macOS and system audio on Windows.
- Lack of API Support: JavaScript does not offer a direct API to access the output audio stream. Instead, we rely on getDisplayMedia(), which comes with limited browser support and inconsistent behavior.
- OS-Level Constraints: Operating systems like macOS restrict applications — including browsers — from accessing system audio by default.
- Inconsistent Support Across Platforms: Even when workarounds exist (like tab capture on Chrome), they’re unreliable across different browsers and OSes, making a consistent cross-platform experience nearly impossible.
My Use Case: Transcribing Meetings with System Audio
I wanted a way to transcribe a user's meeting without injecting a bot into the call, as platforms like Read.ai do. My initial idea was to use the getDisplayMedia() API to record the user's screen along with its audio:
```js
const stream = await navigator.mediaDevices.getDisplayMedia({
  audio: {
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
    sampleRate: 48000,
  },
  video: true, // Most browsers require video to be true to capture screen with audio
});
```
I planned to transcribe the audio from this stream. But I quickly realized this wasn’t a viable solution due to browser and OS restrictions. For example:
- Chrome allows access to tab or system audio on Windows, but only tab audio on macOS.
- Firefox doesn’t allow access to any tab or system audio, regardless of OS.
- Safari behaves similarly to Firefox — no access to tab or system audio at all.
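Worse, the failure is often silent: the screen share succeeds and you simply receive a stream with no audio track. A minimal runtime check (my sketch, using nothing beyond the standard getDisplayMedia() API) looks like this:

```js
// Request screen + audio, then verify the browser actually delivered an
// audio track before handing the stream off to transcription.
async function captureWithAudioCheck() {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    audio: true,
    video: true, // most browsers require video to capture anything at all
  });

  if (stream.getAudioTracks().length === 0) {
    // Firefox, Safari, and full-screen shares in Chrome on macOS land here:
    // the user shared a screen, but no audio came with it.
    stream.getTracks().forEach((track) => track.stop());
    throw new Error("No tab/system audio available on this browser + OS");
  }
  return stream;
}
```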
Alternative Approach: Virtual Audio Devices
Due to these limitations, I experimented with virtual audio devices. The idea was to guide users to install and configure a tool like BlackHole, allowing us to route system audio to an input device we could capture.
However, this approach introduced a lot of complexity for users:
- They would have to manually route their audio through the virtual device.
- They'd need to create a separate virtual device for each audio output they use.
- They’d have to manually select the correct input/output devices on the website.
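To see why this is fragile, consider what the website's side of that setup would look like. In this sketch (mine, assuming a BlackHole-style device that can only be identified by its label), note that device labels are only populated after the user has already granted microphone permission, which is yet another manual step:

```js
// Hunt for the virtual input device by label, then capture from it.
async function findVirtualInput() {
  await navigator.mediaDevices.getUserMedia({ audio: true }); // unlock device labels
  const devices = await navigator.mediaDevices.enumerateDevices();
  const virtual = devices.find(
    (d) => d.kind === "audioinput" && /blackhole/i.test(d.label)
  );
  if (!virtual) throw new Error("Virtual audio device not found or not configured");
  return navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: virtual.deviceId } },
  });
}
```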
While technically feasible, the manual setup and high fragility made it an unscalable solution. That’s when I had the idea to build a companion app.
A New Angle: The Companion App
Since we already wanted to port the product to a desktop app (using Electron), I came up with the idea to create a lightweight companion app that streams system audio to our website.
I built an Electron app with a native Swift module to handle permissions and capture system audio on macOS.
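As a rough illustration only (the module's real interface isn't shown here; `systemAudio`, `start`, and `stop` are hypothetical names), the main-process glue might look like this:

```js
// Electron main process: bridge the (hypothetical) native capture module
// to the renderer, which owns the WebRTC connection.
const path = require("path");
const { app, BrowserWindow, ipcMain } = require("electron");
const systemAudio = require("./build/Release/system_audio.node"); // hypothetical native addon

let win;
app.whenReady().then(() => {
  win = new BrowserWindow({
    webPreferences: { preload: path.join(__dirname, "preload.js") },
  });
  win.loadFile("index.html");
});

// The renderer asks to start/stop capture; PCM chunks flow back over IPC.
ipcMain.handle("audio:start", () => {
  systemAudio.start((pcmChunk) => win.webContents.send("audio:chunk", pcmChunk));
});
ipcMain.handle("audio:stop", () => systemAudio.stop());
```

The next challenge was: how do we send this audio stream to the web app?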
I decided to go with WebRTC as the transport layer. But setting up a peer-to-peer WebRTC connection isn’t trivial. Here’s a breakdown of the basic WebRTC connection flow:
[Figure: Basic WebRTC connection]
In this setup, the website and the companion app act as two peers. They need to exchange metadata (like SDP and ICE candidates) in order to establish a secure connection. But this exchange requires a signaling channel — a way for both peers to "meet" and exchange this information without user input.
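As a concrete (and simplified) sketch, here is the website's half of that exchange. `signal` and `onSignal` are hypothetical helpers for whatever channel delivers messages to the other peer (in our case, the local WebSocket server described next):

```js
// Website side: offer to receive audio, and wire up the SDP/ICE exchange.
async function connectAsReceiver(signal, onSignal) {
  const pc = new RTCPeerConnection({ iceServers: [] }); // same machine: host candidates suffice

  pc.onicecandidate = ({ candidate }) => {
    if (candidate) signal({ type: "ice", candidate });
  };
  pc.ontrack = ({ streams: [stream] }) => {
    handleRemoteAudio(stream); // hypothetical consumer: playback or transcription
  };

  pc.addTransceiver("audio", { direction: "recvonly" });
  await pc.setLocalDescription(await pc.createOffer());
  signal({ type: "offer", sdp: pc.localDescription });

  onSignal(async (msg) => {
    if (msg.type === "answer") await pc.setRemoteDescription(msg.sdp);
    if (msg.type === "ice") await pc.addIceCandidate(msg.candidate);
  });
  return pc;
}
```

The companion app mirrors this on its side: it applies the offer with setRemoteDescription(), attaches the captured audio track with addTrack(), and replies with an answer.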
The Signaling Server
To automate signaling, I introduced a WebSocket-based signaling server, running locally on the user’s machine. This server acts as the intermediary between the website and the companion app.
Here’s how the architecture works:
[Figure: WebRTC connection with signaling server]
In this setup:
- Both the companion app and the website connect to the local WebSocket server.
- The website sends a request to establish a WebRTC connection.
- Both peers exchange metadata (SDP, ICE candidates) through the WebSocket (see the relay sketch after these steps).
- The WebRTC connection is established.
- The website signals the app to start streaming.
- The companion app captures system audio and streams it directly to the browser.
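The relay itself can stay tiny. Here's a minimal sketch (mine, using Node's `ws` package and a hypothetical port), assuming exactly two clients, the website and the companion app, and blindly forwarding every message to the other peer:

```js
// Minimal local signaling relay: forward every message to the other peer.
const { WebSocketServer, WebSocket } = require("ws");

const wss = new WebSocketServer({ host: "127.0.0.1", port: 8765 }); // hypothetical local port

wss.on("connection", (socket) => {
  socket.on("message", (data) => {
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data.toString()); // relay SDP offers/answers and ICE candidates
      }
    }
  });
});
```

Since both peers and the relay live on the same machine, the server binds to localhost and the WebRTC connection needs no STUN or TURN servers.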
Benefits and Final Thoughts
By leveraging a custom companion app and WebRTC, we've sidestepped the browser's strict limitations on system audio access. Because the native module requests the proper capture permissions, the solution also works within OS-level restrictions rather than fighting them, and it eliminates the need for complex user configuration.
Just install the app — and everything works seamlessly under the hood.
This approach is a great example of how creative thinking combined with the right technologies can overcome gaps in the web platform. The web may have its walls, but with the right tools, we can still build doors.