All input is normalized to a consistent format before Whisper runs. This is where quality is locked in — Whisper gets clean, standardized input regardless of how the original was recorded.
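In practice, normalization for Whisper typically means extracting the audio track and resampling it to 16 kHz mono PCM, the input Whisper models expect. A minimal sketch of that step (a hypothetical helper, not Signal Loom's actual pipeline) that builds the ffmpeg command:

```python
# Hypothetical normalization helper: build an ffmpeg command that converts
# any audio/video input into the 16 kHz mono WAV Whisper models expect.

def ffmpeg_normalize_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg invocation that extracts and normalizes audio from src."""
    return [
        "ffmpeg",
        "-i", src,           # any container/codec ffmpeg can read
        "-vn",               # drop any video stream, keep audio only
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # resample to 16 kHz
        "-c:a", "pcm_s16le", # 16-bit PCM WAV
        dst,
    ]

cmd = ffmpeg_normalize_cmd("talk.mp4", "talk.wav")
# Execute with: subprocess.run(cmd, check=True)
```

Running this once up front is what lets the rest of the pipeline treat every input identically, whether it started life as a phone recording or a screen capture.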
mlx-community/whisper-large-v3-turbo · Local inference
Whisper runs entirely on your hardware via Apple's MLX framework. No cloud. No API calls. No per-minute costs. The audio never leaves your infrastructure. Privacy by architecture.
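A minimal local-transcription sketch, assuming the `mlx_whisper` Python package is installed on Apple Silicon (the exact integration inside Signal Loom may differ):

```python
def transcribe_local(audio_path: str) -> dict:
    """Run Whisper entirely on-device via MLX; no network calls are made
    after the model weights have been cached locally."""
    # Lazy import: requires Apple Silicon and `pip install mlx-whisper`.
    import mlx_whisper

    return mlx_whisper.transcribe(
        audio_path,
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    )

# result = transcribe_local("talk.wav")
# result["language"] and result["segments"] hold the detected language
# and the timed segments shown in the console output below.
```

Because inference is a local function call rather than an API request, cost scales with your hardware, not with minutes of audio.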
Console output
Detected language: English
[00:00.480 --> 00:05.640]
Hello, this is Travis Brady with AIM-T Pulse...
[00:06.460 --> 00:11.160]
More extemporaneous narrative through the slides today.
[00:12.980 --> 00:14.900]
Excited to tell you about our company.
[00:16.340 --> 00:16.880]
Slide two.
Segments: 4 · Chars: 186 · Time: 4.28s
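The bracketed console lines above follow a simple `[mm:ss.mmm --> mm:ss.mmm]` pattern over each segment's start and end seconds. A small formatter (a sketch, not the project's actual logging code) reproduces it:

```python
def ts(seconds: float) -> str:
    """Format seconds as mm:ss.mmm, matching the console output above."""
    minutes = int(seconds // 60)
    return f"{minutes:02d}:{seconds % 60:06.3f}"

def console_line(start: float, end: float) -> str:
    """Render one segment's time range as a bracketed console marker."""
    return f"[{ts(start)} --> {ts(end)}]"

print(console_line(0.48, 5.64))  # [00:00.480 --> 00:05.640]
```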
Structured Output — Four Formats, One Pass
JSON · SRT · VTT · TXT
Every run produces all four output formats simultaneously. JSON for AI systems, SRT and VTT for subtitles and captions, TXT for simple archival. Timestamps on every word in every format.
JSON
SRT
VTT
TXT
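To make the subtitle formats concrete: SRT and VTT are near-identical cue lists that differ mainly in header and millisecond separator. A sketch of writers for both (assuming segments as `start`/`end`/`text` dicts; field names are illustrative, not the project's actual schema):

```python
def _stamp(seconds: float, ms_sep: str) -> str:
    """HH:MM:SS + milliseconds; SRT separates with ',', WebVTT with '.'."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{ms_sep}{ms:03d}"

def to_srt(segments) -> str:
    """SRT: numbered cues, comma before the milliseconds."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{_stamp(seg['start'], ',')} --> {_stamp(seg['end'], ',')}\n{seg['text']}\n"
        )
    return "\n".join(cues)

def to_vtt(segments) -> str:
    """WebVTT: a WEBVTT header, no cue numbers required, dot separator."""
    cues = [
        f"{_stamp(seg['start'], '.')} --> {_stamp(seg['end'], '.')}\n{seg['text']}\n"
        for seg in segments
    ]
    return "WEBVTT\n\n" + "\n".join(cues)
```

TXT is just the concatenated segment text, and JSON is the structured form shown next, so one pass over the segment list is enough to emit all four.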
Structured JSON output
{
"title": "93e2e7cb...test2",
"source_kind": "local_av",
"media_kind": "audio",
"language": "en",
"model": "mlx-community/whisper-large-v3-turbo",
"duration_seconds": 20.0,
"segments": [
{
"segment_id": "S1",
"start_seconds": 0.48,
"end_seconds": 5.64,
"start_time": "00:00:00",
"end_time": "00:00:06",
"text": "Hello, this is Travis Brady with AIM-T Pulse and AIM Elemental Health Solutions."
},
{
"segment_id": "S2",
"start_seconds": 6.46,
"end_seconds": 11.16,
"text": "More extemporaneous narrative through the slides today."
}
]
}
This is the difference that matters.
A transcript is for humans. Structured, timestamped JSON is for machines.
Every word has a timestamp. Every segment has a time boundary. Every output carries the metadata that makes it useful to AI systems — not just readable by people. That's Signal Loom.
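For instance, a downstream system can parse the JSON above and seek straight to the segment covering any moment in the recording. A sketch, assuming the field names shown in the example:

```python
import json

def segment_at(transcript: dict, t: float):
    """Return the segment whose time boundary contains t seconds, if any."""
    for seg in transcript["segments"]:
        if seg["start_seconds"] <= t <= seg["end_seconds"]:
            return seg
    return None

# Abbreviated form of the structured output shown above.
doc = json.loads("""{
  "segments": [
    {"segment_id": "S1", "start_seconds": 0.48, "end_seconds": 5.64,
     "text": "Hello, this is Travis Brady with AIM-T Pulse..."},
    {"segment_id": "S2", "start_seconds": 6.46, "end_seconds": 11.16,
     "text": "More extemporaneous narrative through the slides today."}
  ]
}""")

hit = segment_at(doc, 8.0)
print(hit["segment_id"])  # S2
```

No transcript parsing, no guesswork: the machine-readable boundaries answer "what was said at second 8?" directly.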