The <track> element
The <track> element is placed inside a <video>
(or <audio>) element alongside the <source>
entries. It links an external text track file — typically a WebVTT
file — to the media. The browser reads that file and renders the timed text
(captions, subtitles, chapter markers, etc.) overlaid on the video at the correct
moments.
The demo below plays the same sample video from earlier tutorials, this time with English captions attached. Click the CC button in the player controls to toggle them on and off.
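A player like this could be marked up roughly as follows. This is a minimal sketch, not the demo's actual source: sample.mp4 is a placeholder file name, and the default attribute is an assumption added so the captions start enabled.

```html
<!-- Minimal sketch: sample.mp4 is a placeholder, not the demo's real path -->
<video controls width="640">
  <source src="sample.mp4" type="video/mp4">
  <!-- English captions; "default" (assumed here) turns them on automatically -->
  <track kind="captions" src="captions.vtt" srclang="en" label="English" default>
</video>
```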
<track> attributes
The <track> element has five attributes worth knowing:
| Attribute | Value(s) | What it does |
|---|---|---|
| kind | captions, subtitles, descriptions, chapters, metadata | Declares what the track is for. The browser uses this to present the right UI (e.g. a CC button for captions). |
| src | URL | Path to the WebVTT text file. Must be same-origin, or served with CORS headers and loaded by a media element that sets the crossorigin attribute. |
| srclang | BCP 47 language tag (e.g. en, fr, es) | The language of the track. Required when kind="subtitles". |
| label | Plain text string | A human-readable name shown in the browser's track selection menu (e.g. "English", "Français"). |
| default | Boolean | Marks this track as the one to enable automatically. At most one track per media element should carry default. |
You can attach multiple <track> elements to the same video — one
per language or per track kind. The browser will list them in its captions menu so
the user can switch between them.
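As an illustration, a video offering English captions plus French and Spanish subtitles might be marked up like this. Every file name below is a hypothetical placeholder, and the choice of languages is arbitrary.

```html
<!-- Sketch only: all file names are placeholders -->
<video controls width="640">
  <source src="sample.mp4" type="video/mp4">
  <!-- Only one track carries "default", so English captions start enabled -->
  <track kind="captions"  src="captions-en.vtt"  srclang="en" label="English" default>
  <track kind="subtitles" src="subtitles-fr.vtt" srclang="fr" label="Français">
  <track kind="subtitles" src="subtitles-es.vtt" srclang="es" label="Español">
</video>
```

Each label value is the text a viewer would see in the track menu, so it is usually written in the language of the track itself.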
WebVTT basics
WebVTT (Web Video Text Tracks) is the standard text track format
for the web. A .vtt file is plain text and has a very simple structure:
a header line followed by one or more timed cues.
File structure
- The file must begin with the string WEBVTT on the very first line (optionally followed by a space and a description).
- A blank line separates the header from the first cue, and separates cues from each other.
- Each cue has a timestamp line in the format start --> end, using the pattern HH:MM:SS.mmm or MM:SS.mmm.
- The cue text follows immediately on the line(s) after the timestamp.
- Cues may optionally have an identifier (a label on the line before the timestamp line).
Here is the actual captions.vtt file used in the demo above:
The timestamps use millisecond precision (.000), which is required even
if you have no sub-second timing. Cues may overlap in time, which is useful when
a caption needs to remain on screen while the next one appears. In a well-formed
file, each cue's end time must be later than its own start time, and each cue's
start time must be greater than or equal to the start times of all earlier cues.
Styling cues with CSS
Browser default caption styling varies. You can target rendered cue text with the
CSS pseudo-element ::cue:
Only a limited subset of CSS properties applies inside ::cue,
primarily color, background, font, and text-decoration properties. Layout
properties such as position and display are ignored.
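A minimal sketch of such a rule is shown below. The colours and font values are arbitrary choices for illustration, and how faithfully they are applied may differ between browsers.

```html
<style>
  /* Styles the rendered cue text of any video on the page */
  video::cue {
    color: #ffffff;
    background-color: rgba(0, 0, 0, 0.8); /* dark backing for contrast */
    font-family: sans-serif;
    font-size: 1.1em;
  }
</style>
```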
Why captions matter
Accessibility rule: Any video that carries meaningful audio content should have captions. This is not a nice-to-have: WCAG 2.1 Success Criterion 1.2.2, Captions (Prerecorded), requires them at Level A for pre-recorded audio content in synchronised media.
Who benefits
Captions are commonly associated with deaf and hard-of-hearing viewers, but the audience is far broader:
- Deaf and hard-of-hearing users: captions are their primary access to spoken content in video.
- Sound-off contexts: much of the video on social platforms is watched with the sound off, whether because the viewer is in a public space, a quiet office, or simply has the device muted. Captions are the only way your message reaches that audience.
- Non-native speakers: reading captions while listening improves comprehension when the spoken language is not the viewer's first.
- Noisy environments: a gym, a coffee shop, a crowded commute, all situations where audio is hard to hear even with normal hearing.
- Search engines and indexing: search engine crawlers cannot hear audio, but caption and transcript text gives them something to index, which can make captioned video more discoverable.
Captions vs. subtitles
The terms are often used interchangeably, but they have distinct meanings in the
context of <track kind="...">:
- Captions (kind="captions"): designed for viewers who cannot hear the audio at all. In addition to dialogue, captions include non-speech audio information: [music playing], [door slams], [phone rings]. These contextual cues are part of the content.
- Subtitles (kind="subtitles"): designed for viewers who can hear but need the dialogue in a different language (or as a reading aid). Subtitles typically do not include non-speech audio descriptions.
For English-language content aimed at English-speaking users who may not be able to
hear, use kind="captions". For translated text aimed at users who
speak a different language, use kind="subtitles".
Audio descriptions
A third track kind, kind="descriptions", provides narration of
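on-screen visual content for blind or low-vision users. These tracks are meant to be
spoken aloud, for example by a screen reader or speech synthesis, although native
browser support remains limited. While less common than captions, they should be considered for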
videos where important information is conveyed purely visually (a diagram being
drawn, a product being shown, a chart appearing on screen).
For the broader accessibility picture — how alt text, captions, and descriptive content work together across media types — see Images Tutorial 12: Accessibility and Alt Text.