05: Captions & Tracks - Audio & Video Tutorials

The `<track>` element

The <track> element is placed inside a <video> (or <audio>) element alongside the <source> entries. It links an external text track file — typically a WebVTT file — to the media. The browser reads that file and renders the timed text (captions, subtitles, chapter markers, etc.) overlaid on the video at the correct moments.

The demo below plays the same sample video from earlier tutorials, this time with English captions attached. Click the CC button in the player controls to toggle them on and off.
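A minimal sketch of markup for a player like this one, assuming placeholder video file names (`sample.webm`, `sample.mp4`) and the `captions.vtt` file discussed later in this tutorial:

```html
<!-- Sketch only: sample.webm and sample.mp4 are assumed file names -->
<video controls width="640">
  <source src="sample.webm" type="video/webm">
  <source src="sample.mp4" type="video/mp4">
  <!-- The track sits alongside the <source> entries -->
  <track kind="captions" src="captions.vtt" srclang="en" label="English">
  Your browser does not support the video element.
</video>
```

Because this sketch omits the `default` attribute, the captions would typically stay off until the viewer enables them from the CC button.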

`<track>` attributes

The <track> element has five attributes worth knowing:

Attribute	Value(s)	What it does
`kind`	`captions`, `subtitles`, `descriptions`, `chapters`, `metadata`	Declares what the track is for. The browser uses this to present the right UI (e.g. a CC button for captions).
`src`	URL	Path to the WebVTT text file. Must be same-origin; to load it from another origin, the media element needs a `crossorigin` attribute and the file must be served with suitable CORS headers.
`srclang`	BCP 47 language tag (e.g. `en`, `fr`, `es`)	The language of the track. Required when `kind="subtitles"`.
`label`	Plain text string	A human-readable name shown in the browser's track selection menu (e.g. "English", "Français").
`default`	Boolean	Marks this track as the one to enable automatically. At most one track per media element should carry `default`.

You can attach multiple <track> elements to the same video — one per language or per track kind. The browser will list them in its captions menu so the user can switch between them.
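As an illustration, a video with one captions track and two subtitle tracks might be marked up like this. Every file name here is a hypothetical placeholder, and the labels are only examples of what a track menu could show:

```html
<!-- Sketch only: every file name below is a hypothetical placeholder -->
<video controls>
  <source src="sample.mp4" type="video/mp4">
  <!-- Enabled automatically because of the default attribute -->
  <track kind="captions"  src="captions-en.vtt"  srclang="en" label="English (captions)" default>
  <!-- Translations the viewer can pick from the track menu -->
  <track kind="subtitles" src="subtitles-fr.vtt" srclang="fr" label="Français">
  <track kind="subtitles" src="subtitles-es.vtt" srclang="es" label="Español">
</video>
```

Giving each track a distinct `label` keeps the menu entries distinguishable when two tracks share a language.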

WebVTT basics

WebVTT (Web Video Text Tracks) is the standard text track format for the web. A .vtt file is plain text and has a very simple structure: a header line followed by one or more timed cues.

File structure

The file must begin with the string WEBVTT on the very first line (optionally followed by a space and a description).
A blank line separates the header from the first cue, and separates cues from each other.
Each cue has a timestamp line in the format start --> end using the pattern HH:MM:SS.mmm or MM:SS.mmm.
The cue text follows immediately on the line(s) after the timestamp.
Cues may optionally have an identifier (a label before the timestamp line).

Here is the actual captions.vtt file used in the demo above:

    WEBVTT

    00:00.000 --> 00:02.000
    A single-track road winds through the Scottish Highlands.

    00:02.000 --> 00:04.000
    Mist drifts across the steep green cliffs.

    00:04.000 --> 00:06.000
    The view slowly zooms in on the valley.

The timestamps use millisecond precision (.000): the three-digit fractional part is required even if you have no sub-second timing. Cues may overlap in time, which is useful when one caption needs to remain on screen while the next one appears. What a well-formed file does require is that cues are listed in order of start time: each cue's start time must be greater than or equal to the start time of the previous cue, and each cue's end time must be later than its own start time.

Styling cues with CSS

Browser default caption styling varies. You can target rendered cue text with the CSS pseudo-element ::cue.

Only a limited subset of CSS properties is allowed inside ::cue: primarily color, background, font, and a few text properties such as text-shadow and text-decoration. Layout properties such as position and display are ignored.
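A minimal sketch of a ::cue rule, written as a `<style>` block that could sit in the page's `<head>`. The specific colors and sizes are arbitrary choices for illustration, and how faithfully each property is applied differs between browsers:

```html
<style>
  /* Applies to cue text rendered over any <video> on the page */
  video::cue {
    color: #fff;
    background-color: rgba(0, 0, 0, 0.8);
    font-family: system-ui, sans-serif;
    font-size: 1.1em;
    text-shadow: 1px 1px 2px #000;
  }
</style>
```

A semi-opaque background behind light text is one common way to keep captions legible over both bright and dark scenes.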

Why captions matter

Who benefits

Captions are commonly associated with deaf and hard-of-hearing viewers, but the audience is far broader:

Deaf and hard-of-hearing users: captions are their primary access to spoken content in video.
Sound-off contexts: a large share of video on social platforms is watched with the sound off, whether because the viewer is in a public space, in a quiet office, or simply has the device muted. Captions are the only way spoken content reaches that audience.
Non-native speakers: reading captions while listening improves comprehension when the spoken language is not the viewer's first.
Noisy environments: a gym, a coffee shop, a crowded commute. These are situations where audio is hard to hear even with normal hearing.
Search engines and indexing: search engine crawlers cannot hear audio, but they can index text, so publishing caption or transcript text can make video content more discoverable.

Captions vs. subtitles

The terms are often used interchangeably, but they have distinct meanings in the context of <track kind="...">:

Captions (kind="captions") — designed for viewers who cannot hear the audio at all. In addition to dialogue, captions include non-speech audio information: [music playing], [door slams], [phone rings]. These contextual cues are part of the content.
Subtitles (kind="subtitles") — designed for viewers who can hear but need the dialogue in a different language (or as a reading aid). Subtitles typically do not include non-speech audio descriptions.

For English-language content aimed at English-speaking users who may not be able to hear, use kind="captions". For translated text aimed at users who speak a different language, use kind="subtitles".

Audio descriptions

A third track kind, kind="descriptions", provides text descriptions of on-screen visual content for blind or low-vision users. These tracks are intended to be conveyed non-visually, for example spoken by a screen reader or other assistive technology, though browser support is less consistent than for captions. While less common than captions, they should be considered for videos where important information is conveyed purely visually (a diagram being drawn, a product being shown, a chart appearing on screen).

For the broader accessibility picture — how alt text, captions, and descriptive content work together across media types — see Images Tutorial 12: Accessibility and Alt Text.