How IceCubes Handles Speaker Diarization Differently: Vendor UI vs Voice Fingerprinting
Speaker diarization is the process of figuring out "who said what" in a conversation. It sounds simple, but it is one of the hardest problems in speech technology. Most meeting transcription tools tackle it by analyzing audio characteristics: pitch, tone, cadence, and other vocal features that distinguish one speaker from another. This is called voice fingerprinting or speaker embedding.
IceCubes takes a completely different approach. Instead of analyzing audio to guess who is speaking, it reads the speaker's name directly from the meeting platform's UI. When Google Meet shows "Sarah Chen" as the active speaker, that is exactly what appears in the transcript.
The difference is not cosmetic. It affects every downstream use of the transcript.
How Voice Fingerprinting Works (and Where It Fails)
Voice fingerprinting works by creating a mathematical representation of each speaker's voice and clustering transcript segments by similarity. The system groups segments that sound alike and labels them "Speaker 1," "Speaker 2," and so on. Some tools then try to match these clusters to known participants using enrollment data or meeting metadata.
This approach has well-documented failure modes:
- Similar voices get merged. Two people with similar vocal characteristics (same gender, similar age, similar accent) often get assigned to the same speaker cluster. In a meeting with three male colleagues of similar age, diarization errors spike.
- One speaker gets split into two. If someone's tone changes significantly (they start presenting after a casual conversation, or they get frustrated), the model may create a new speaker cluster for the same person.
- Phone and room audio breaks everything. When a participant dials in on a phone, or multiple people share a conference room microphone, the audio quality shift confuses the model. Conference room participants are particularly problematic because the microphone distance and room acoustics vary as different people speak.
- Short utterances are unreliable. Brief responses like "Yes," "Agreed," or "Go ahead" don't contain enough audio information for reliable speaker assignment.
- Labels are anonymous. Even when clustering works perfectly, the output is "Speaker 1," "Speaker 2," "Speaker 3." Mapping these labels to real names requires additional logic, enrollment data, or manual correction.
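The cluster-by-similarity step described above can be sketched in a few lines. This is an illustrative toy, not any production diarizer: real voice embeddings have hundreds of dimensions, and the 2-D vectors, 0.75 threshold, and greedy first-match assignment here are assumptions made for the example.

```javascript
// Toy sketch of embedding-based diarization: each transcript segment
// carries a voice embedding, and segments are greedily grouped by cosine
// similarity. Threshold and embeddings are illustrative, not real values.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Assign each segment to the first existing cluster that is similar
// enough, otherwise open a new anonymous "Speaker N" cluster.
function clusterSpeakers(segments, threshold = 0.75) {
  const centroids = [];
  return segments.map(({ embedding }) => {
    for (let i = 0; i < centroids.length; i++) {
      if (cosine(embedding, centroids[i]) >= threshold) {
        return `Speaker ${i + 1}`;
      }
    }
    centroids.push(embedding);
    return `Speaker ${centroids.length}`;
  });
}
```

Even in the best case, the output labels are "Speaker 1" and "Speaker 2": the anonymous-label problem from the list above. And the two failure modes fall straight out of the threshold: two voices whose embeddings land inside it get merged; one voice whose embedding drifts outside it gets split.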
How IceCubes Reads Speaker Names from the Platform
IceCubes is a browser extension that runs inside the same browser tab as your meeting. Because it sits alongside Google Meet, Zoom, or Teams, it has access to the meeting platform's own UI elements, including the active speaker indicator.
Here is what happens during a call:
1. The meeting platform's own captioning service generates captions in real time. Google Meet, Zoom, and Teams all have built-in caption engines that run server-side.
2. The platform's UI displays these captions along with the speaker's display name. When "David Park" is talking, the platform's caption overlay shows his name.
3. IceCubes reads both the caption text and the associated speaker name from the DOM (the page structure of the meeting tab).
4. Each transcript segment is tagged with the speaker's real display name as shown in the meeting.
There is no voice analysis. No clustering algorithm. No speaker embedding model. The speaker attribution comes directly from the source that already knows who is talking: the meeting platform itself.
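In outline, that DOM-reading loop looks something like the sketch below. This is a hedged illustration, not IceCubes' actual code: the CSS selectors (`.caption-entry`, `.speaker-name`, `.caption-text`) and the `appendSegment`/`watchCaptions` helpers are invented for the example.

```javascript
// Illustrative sketch of scraping captions from a meeting tab's DOM.
// The selector strings below are invented placeholders, not the real
// Google Meet / Zoom / Teams class names.

// Captions repaint incrementally, so a consecutive update from the same
// speaker replaces the previous (shorter) text instead of appending a
// duplicate segment.
function appendSegment(transcript, next) {
  const last = transcript[transcript.length - 1];
  if (last && last.speaker === next.speaker) {
    last.text = next.text; // same speaker: keep the fuller caption
  } else {
    transcript.push(next);
  }
}

// Watch the caption overlay and tag each caption with the display name
// the platform already renders next to it.
function watchCaptions(captionRoot, transcript) {
  const observer = new MutationObserver(() => {
    for (const entry of captionRoot.querySelectorAll(".caption-entry")) {
      const speaker = entry.querySelector(".speaker-name")?.textContent?.trim();
      const text = entry.querySelector(".caption-text")?.textContent?.trim();
      if (speaker && text) appendSegment(transcript, { speaker, text });
    }
  });
  observer.observe(captionRoot, { childList: true, subtree: true, characterData: true });
  return observer;
}
```

The key point the sketch makes concrete: the speaker name is read from the page, never inferred from audio.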
Why This Matters for Downstream Use
Speaker accuracy is not just about a clean-looking transcript. It determines whether you can trust every feature built on top of the transcript.
AI Summaries and Action Items
When AI generates a summary, it needs to know who said what. "The prospect expressed budget concerns" is less useful than "David Park, VP of Engineering at Acme Corp, said their budget for this quarter is locked." Misattributed speaker labels produce summaries that attribute statements to the wrong person, which can be worse than no summary at all.
MEDDIC and BANT Extraction
Sales qualification frameworks depend on understanding who holds authority, who expressed needs, and who raised concerns. If the transcript misattributes the CFO's budget statement to a junior team member, the extracted MEDDIC data is misleading.
CRM Sync
When meeting insights flow into HubSpot or Salesforce, speaker attribution determines which contact record gets associated with which statements. Wrong speaker labels create noise in your CRM.
Cross-Meeting Analysis
Searching across meetings for "what has this prospect said about pricing" requires accurate speaker attribution in every transcript. One misattributed segment in a series of calls can produce false results.
Platform-Specific Implementation
Each meeting platform displays speaker information differently, and IceCubes handles each one:
| Platform | How IceCubes reads speaker names |
|---|---|
| Google Meet | Reads speaker names from the native caption overlay, which shows the speaker's Google account display name |
| Zoom | Reads from Zoom's caption UI, matching display names as shown in the participant list |
| Microsoft Teams | Reads from Teams' caption display, using the speaker's Teams/Entra ID display name |
The speaker names match exactly what participants see during the meeting. If someone joins as "Conference Room 5B," that is what appears in the transcript. But that is also exactly what a human notetaker would write down, and correcting one mislabeled room after the fact is far easier than untangling a voice fingerprinting model's misattributed clusters.
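Because each platform structures its caption overlay differently, one way to organize the per-platform handling is a selector map keyed by hostname. This is a sketch under stated assumptions, not IceCubes' implementation: every selector string below is a made-up placeholder.

```javascript
// Hypothetical per-platform configuration: each platform renders captions
// differently, so the scraper is parameterized by a set of selectors.
// None of these selector strings are real; they are placeholders.
const PLATFORMS = {
  "meet.google.com":     { entry: ".captions-entry", name: ".speaker", text: ".caption" },
  "zoom.us":             { entry: ".zoom-caption",   name: ".cc-name", text: ".cc-text" },
  "teams.microsoft.com": { entry: ".ui-caption",     name: ".cc-name", text: ".cc-body" },
};

// Pick the selector set for the current tab's hostname, or null if the
// page is not a supported meeting platform.
function selectorsFor(hostname) {
  const key = Object.keys(PLATFORMS).find((h) => hostname.endsWith(h));
  return key ? PLATFORMS[key] : null;
}
```

Isolating the platform differences in one table keeps the caption-reading logic itself identical across Meet, Zoom, and Teams.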
The Accuracy Gap in Practice
Consider a typical sales call with four participants: two from the selling team and two from the prospect's organization. A voice fingerprinting system needs to:
- Detect four distinct speakers from overlapping audio
- Correctly cluster every utterance to the right speaker
- Map anonymous labels to real names
Each step introduces error. Published benchmarks for speaker diarization systems show error rates between 5% and 20% on real-world meeting audio, with higher error rates when speakers have similar voices or when audio quality varies.
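As a back-of-envelope illustration of how those per-step errors compound (the 95% per-step accuracy is an assumption at the optimistic end of the range quoted above):

```javascript
// Three sequential steps, each assumed (optimistically) 95% accurate:
// detect speakers, cluster utterances, map clusters to names.
const stepAccuracy = [0.95, 0.95, 0.95];

// A segment is attributed correctly only if every step succeeds, so the
// overall accuracy is the product of the per-step accuracies.
const overall = stepAccuracy.reduce((acc, p) => acc * p, 1);

console.log(overall.toFixed(3)); // 0.857, i.e. roughly 1 in 7 segments misattributed
```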
IceCubes skips all three steps. The platform already knows who is speaking and shows their name. IceCubes reads that name. The error rate for speaker attribution is essentially the error rate of the meeting platform's own speaker identification, which is extremely low because the platform is working with authenticated user sessions and direct audio streams.
When Does Voice Fingerprinting Make Sense?
Voice fingerprinting is necessary when you do not have access to the meeting platform's speaker indicators. Pre-recorded audio files, phone calls without caller ID, and in-person meetings recorded on a single device all require audio-based speaker identification because there is no platform UI to read from.
For live meetings on Google Meet, Zoom, or Teams, though, the meeting platform already solved the speaker identification problem. Running a separate voice fingerprinting model on the same audio is redundant work that introduces errors the platform itself does not make.
Try It Yourself
Install IceCubes on Chrome or Edge, join your next meeting, and look at the transcript. Every line will show the speaker's real name as displayed in the meeting. No enrollment. No training. No "Speaker 1."