rfc9628v4.txt | rfc9628.txt | |||
---|---|---|---|---|
Internet Engineering Task Force (IETF) J. Uberti | Internet Engineering Task Force (IETF) J. Uberti | |||
Request for Comments: 9628 S. Holmer | Request for Comments: 9628 OpenAI | |||
Category: Standards Track M. Flodman | Category: Standards Track S. Holmer | |||
ISSN: 2070-1721 D. Hong | ISSN: 2070-1721 M. Flodman | |||
| | D. Hong | |||
J. Lennox | J. Lennox | |||
8x8 / Jitsi | 8x8 / Jitsi | |||
February 2025 | February 2025 | |||
RTP Payload Format for VP9 Video | RTP Payload Format for VP9 Video | |||
Abstract | Abstract | |||
This specification describes an RTP payload format for the VP9 video | This specification describes an RTP payload format for the VP9 video | |||
skipping to change at line 61 | skipping to change at line 62 | |||
1. Introduction | 1. Introduction | |||
2. Conventions | 2. Conventions | |||
3. Media Format Description | 3. Media Format Description | |||
4. Payload Format | 4. Payload Format | |||
4.1. RTP Header Usage | 4.1. RTP Header Usage | |||
4.2. VP9 Payload Descriptor | 4.2. VP9 Payload Descriptor | |||
4.2.1. Scalability Structure (SS) | 4.2.1. Scalability Structure (SS) | |||
4.3. Frame Fragmentation | 4.3. Frame Fragmentation | |||
4.4. Scalable Encoding Considerations | 4.4. Scalable Encoding Considerations | |||
4.5. Examples of VP9 RTP Stream | 4.5. Example of a VP9 RTP Stream | |||
4.5.1. Reference Picture Use for Scalable Structure | 4.5.1. Reference Picture Use for Scalable Structure | |||
5. Feedback Messages and Header Extensions | 5. Feedback Messages and Header Extensions | |||
5.1. Reference Picture Selection Indication (RPSI) | 5.1. Reference Picture Selection Indication (RPSI) | |||
5.2. Full Intra Request (FIR) | 5.2. Full Intra Request (FIR) | |||
5.3. Layer Refresh Request (LRR) | 5.3. Layer Refresh Request (LRR) | |||
6. Payload Format Parameters | 6. Payload Format Parameters | |||
6.1. SDP Parameters | 6.1. SDP Parameters | |||
6.1.1. Mapping of Media Subtype Parameters to SDP | 6.1.1. Mapping of Media Subtype Parameters to SDP | |||
6.1.2. Offer/Answer Considerations | 6.1.2. Offer/Answer Considerations | |||
7. Media Type Definition | 7. Media Type Definition | |||
skipping to change at line 132 | skipping to change at line 133 | |||
allow a frame to be encoded at the same resolution but at different | allow a frame to be encoded at the same resolution but at different | |||
qualities (and, thus, with different amounts of coding error). VP9 | qualities (and, thus, with different amounts of coding error). VP9 | |||
supports quality layers as spatial layers without any resolution | supports quality layers as spatial layers without any resolution | |||
changes; hereinafter, the term "spatial layer" is used to represent | changes; hereinafter, the term "spatial layer" is used to represent | |||
both spatial and quality layers. | both spatial and quality layers. | |||
This payload format specification defines how such temporal and | This payload format specification defines how such temporal and | |||
spatial scalability layers can be described and communicated. | spatial scalability layers can be described and communicated. | |||
Temporal and spatial scalability layers are associated with non- | Temporal and spatial scalability layers are associated with non- | |||
negative integer IDs. The lowest layer of either type has an ID of 0 | negative integer IDs. The lowest layer of either type has an ID of | |||
and is sometimes referred to as the "base" temporal or spatial layer. | zero and is sometimes referred to as the "base" temporal or spatial | |||
| | layer. | |||
Layers are designed, and MUST be encoded, such that if any layer, and | Layers are designed, and MUST be encoded, such that if any layer, and | |||
all higher layers, are removed from the bitstream along either the | all higher layers, are removed from the bitstream along either the | |||
spatial or temporal dimension, the remaining bitstream is still | spatial or temporal dimension, the remaining bitstream is still | |||
correctly decodable. | correctly decodable. | |||
For terminology, this document uses the term "frame" to refer to a | For terminology, this document uses the term "frame" to refer to a | |||
single encoded VP9 frame for a particular resolution and/or quality, | single encoded VP9 frame for a particular resolution and/or quality, | |||
and "picture" to refer to all the representations (frames) at a | and "picture" to refer to all the representations (frames) at a | |||
single instant in time. Thus, a picture consists of one or more | single instant in time. Thus, a picture consists of one or more | |||
skipping to change at line 167 | skipping to change at line 169 | |||
Given the above simplifications for inter-layer and inter-picture | Given the above simplifications for inter-layer and inter-picture | |||
dependencies, a flag (the D bit described below) is used to indicate | dependencies, a flag (the D bit described below) is used to indicate | |||
whether a spatial-layer SID frame depends on the spatial-layer SID-1 | whether a spatial-layer SID frame depends on the spatial-layer SID-1 | |||
frame. Given the D bit, a receiver only needs to additionally know | frame. Given the D bit, a receiver only needs to additionally know | |||
the inter-picture dependency structure for a given spatial-layer | the inter-picture dependency structure for a given spatial-layer | |||
frame in order to determine its decodability. Two modes of | frame in order to determine its decodability. Two modes of | |||
describing the inter-picture dependency structure are possible: | describing the inter-picture dependency structure are possible: | |||
"flexible mode" and "non-flexible mode". An encoder can only switch | "flexible mode" and "non-flexible mode". An encoder can only switch | |||
between the two on the first packet of a keyframe with a temporal- | between the two on the first packet of a keyframe with a temporal- | |||
layer ID equal to 0. | layer ID equal to zero. | |||
In flexible mode, each packet can contain up to three reference | In flexible mode, each packet can contain up to three reference | |||
indices, which identify all frames referenced by the frame | indices, which identify all frames referenced by the frame | |||
transmitted in the current packet for inter-picture prediction. This | transmitted in the current packet for inter-picture prediction. This | |||
(along with the D bit) enables a receiver to identify if a frame is | (along with the D bit) enables a receiver to identify if a frame is | |||
decodable or not and helps it understand the temporal-layer | decodable or not and helps it understand the temporal-layer | |||
structure. Since this is signaled in each packet, it makes it | structure. Since this is signaled in each packet, it makes it | |||
possible to have very flexible temporal-layer hierarchies and | possible to have very flexible temporal-layer hierarchies and | |||
scalability structures, which are changing dynamically. | scalability structures, which are changing dynamically. | |||
skipping to change at line 191 | skipping to change at line 193 | |||
inter-picture dependencies (the reference indices) of the PG MUST be | inter-picture dependencies (the reference indices) of the PG MUST be | |||
pre-specified as part of the Scalability Structure (SS) data. Each | pre-specified as part of the Scalability Structure (SS) data. Each | |||
packet has an index to refer to one of the described pictures in the | packet has an index to refer to one of the described pictures in the | |||
PG from which the pictures referenced by the picture transmitted in | PG from which the pictures referenced by the picture transmitted in | |||
the current packet for inter-picture prediction can be identified. | the current packet for inter-picture prediction can be identified. | |||
| Note: A "Picture Group" or "PG", as used in this document, is | | Note: A "Picture Group" or "PG", as used in this document, is | |||
| not the same thing as the term "Group of Pictures" as it is | | not the same thing as the term "Group of Pictures" as it is | |||
| commonly used in video coding, i.e., to mean an independently | | commonly used in video coding, i.e., to mean an independently | |||
| decodable run of pictures beginning with a keyframe. | | decodable run of pictures beginning with a keyframe. | |||
| | ||||
| The SS data can also be used to specify the resolution of each | The SS data can also be used to specify the resolution of each | |||
| spatial layer present in the VP9 stream for both flexible and | spatial layer present in the VP9 stream for both flexible and non- | |||
| non-flexible modes. | flexible modes. | |||
4. Payload Format | 4. Payload Format | |||
This section describes how the encoded VP9 bitstream is encapsulated | This section describes how the encoded VP9 bitstream is encapsulated | |||
in RTP. To handle network losses, usage of RTP/AVPF [RFC4585] is | in RTP. To handle network losses, usage of RTP/AVPF [RFC4585] is | |||
RECOMMENDED. All integer fields in this specification are encoded as | RECOMMENDED. All integer fields in this specification are encoded as | |||
unsigned integers in network octet order. | unsigned integers in network octet order. | |||
4.1. RTP Header Usage | 4.1. RTP Header Usage | |||
skipping to change at line 240 | skipping to change at line 242 | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
Figure 1: General RTP Payload Format for VP9 | Figure 1: General RTP Payload Format for VP9 | |||
See Section 4.2 for more information on the VP9 payload descriptor; | See Section 4.2 for more information on the VP9 payload descriptor; | |||
the VP9 payload is described in [VP9-BITSTREAM]. OPTIONAL RTP | the VP9 payload is described in [VP9-BITSTREAM]. OPTIONAL RTP | |||
padding MUST NOT be included unless the P bit is set. | padding MUST NOT be included unless the P bit is set. | |||
Marker bit (M): This bit MUST be set to one for the final packet of | Marker bit (M): This bit MUST be set to one for the final packet of | |||
the highest spatial-layer frame (the final packet of the picture); | the highest spatial-layer frame (the final packet of the picture); | |||
otherwise, it is 0. Unless spatial scalability is in use for this | otherwise, it is zero. Unless spatial scalability is in use for | |||
picture, this bit will have the same value as the E bit described | this picture, this bit will have the same value as the E bit | |||
in Section 4.2. Note this bit MUST be set to one for the target | described in Section 4.2. Note this bit MUST be set to one for | |||
spatial-layer frame if a stream is being rewritten to remove | the target spatial-layer frame if a stream is being rewritten to | |||
higher spatial layers. | remove higher spatial layers. | |||
Payload Type (PT): In line with the policy in Section 3 of | Payload Type (PT): In line with the policy in Section 3 of | |||
[RFC3551], applications using the VP9 RTP payload profile MUST | [RFC3551], applications using the VP9 RTP payload profile MUST | |||
assign a dynamic payload type number to be used in each RTP | assign a dynamic payload type number to be used in each RTP | |||
session and provide a mechanism to indicate the mapping. See | session and provide a mechanism to indicate the mapping. See | |||
Section 6.1 for the mechanism to be used with the Session | Section 6.1 for the mechanism to be used with the Session | |||
Description Protocol (SDP) [RFC8866]. | Description Protocol (SDP) [RFC8866]. | |||
Timestamp: The RTP timestamp [RFC3550] indicates the time when the | Timestamp: The RTP timestamp [RFC3550] indicates the time when the | |||
input frame was sampled, at a clock rate of 90 kHz. If the input | input frame was sampled, at a clock rate of 90 kHz. If the input | |||
picture is encoded with multiple-layer frames, all of the frames | picture is encoded with multiple frames, all of the frames of the | |||
of the picture MUST have the same timestamp. | picture MUST have the same timestamp. | |||
If a frame has the VP9 show_frame field set to zero (i.e., it is | If a frame has the VP9 show_frame field set to zero (i.e., it is | |||
meant only to populate a reference buffer without being output), | meant only to populate a reference buffer without being output), | |||
its timestamp MAY alternatively be set to be the same as the | its timestamp MAY alternatively be set to be the same as the | |||
subsequent frame with show_frame equal to 1. (This will be | subsequent frame with show_frame equal to one. (This will be | |||
convenient for playing out pre-encoded content packaged with VP9 | convenient for playing out pre-encoded content packaged with VP9 | |||
"superframes", which typically bundle show_frame==0 frames with a | "superframes", which typically bundle show_frame==0 frames with a | |||
subsequent show_frame==1 frame.) Every frame with show_frame==1, | subsequent show_frame==1 frame.) Every picture containing a frame | |||
however, MUST have a unique timestamp modulo the 2^32 wrap of the | with show_frame==1, however, MUST have a unique timestamp modulo | |||
field. | the 2^32 wrap of the field. | |||
The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number, | The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number, | |||
SSRC, and CSRC identifiers) are used as specified in Section 5.1 of | SSRC, and CSRC identifiers) are used as specified in Section 5.1 of | |||
[RFC3550]. | [RFC3550]. | |||
4.2. VP9 Payload Descriptor | 4.2. VP9 Payload Descriptor | |||
In flexible mode (with the F bit below set to one), the first octets | In flexible mode (with the F bit below set to one), the first octets | |||
after the RTP header are the VP9 payload descriptor, with the | after the RTP header are the VP9 payload descriptor, with the | |||
following structure. | following structure. | |||
skipping to change at line 316 | skipping to change at line 318 | |||
M: | EXTENDED PID | (RECOMMENDED) | M: | EXTENDED PID | (RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
L: | TID |U| SID |D| (Conditionally RECOMMENDED) | L: | TID |U| SID |D| (Conditionally RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
| TL0PICIDX | (Conditionally REQUIRED) | | TL0PICIDX | (Conditionally REQUIRED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
V: | SS | | V: | SS | | |||
| .. | | | .. | | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
Figure 3: Non-flexible Mode Format for VP9 Payload Descriptor | Figure 3: Non-Flexible Mode Format for VP9 Payload Descriptor | |||
| | Except as noted, the following field descriptions apply to the | |||
| | payload descriptor formats in both Figures 2 and 3. | |||
I: Picture ID (PID) present. When set to one, the OPTIONAL PID MUST | I: Picture ID (PID) present. When set to one, the OPTIONAL PID MUST | |||
be present after the mandatory first octet and specified as below. | be present after the mandatory first octet and specified as below. | |||
Otherwise, PID MUST NOT be present. If the V bit was set in the | Otherwise, PID MUST NOT be present. If the V bit was set in the | |||
stream's most recent start of a keyframe (i.e., the SS field was | stream's most recent start of a keyframe (i.e., the SS field was | |||
present) and the F bit is set to zero (i.e., non-flexible | present) and the F bit is set to zero (i.e., non-flexible | |||
scalability mode is in use), then this bit MUST be set on every | scalability mode is in use), then this bit MUST be set on every | |||
packet. | packet. | |||
P: Inter-picture predicted frame. When set to zero, the frame does | P: Inter-picture predicted frame. When set to zero, the frame does | |||
skipping to change at line 357 | skipping to change at line 362 | |||
mandatory first octet, the PID, and layer indices (if present) are | mandatory first octet, the PID, and layer indices (if present) are | |||
as described by "reference indices" below. This bit MUST only be | as described by "reference indices" below. This bit MUST only be | |||
set to one if the I bit is also set to one; if the I bit is set to | set to one if the I bit is also set to one; if the I bit is set to | |||
zero, then this bit MUST also be set to zero and ignored by | zero, then this bit MUST also be set to zero and ignored by | |||
receivers. (Flexible mode's reference indices are defined as | receivers. (Flexible mode's reference indices are defined as | |||
offsets from the Picture ID field, so they would have no meaning | offsets from the Picture ID field, so they would have no meaning | |||
if I were not set.) The value of the F bit MUST only change on | if I were not set.) The value of the F bit MUST only change on | |||
the first packet of a key picture. A "key picture" is a picture | the first packet of a key picture. A "key picture" is a picture | |||
whose base spatial-layer frame is a keyframe, and thus one which | whose base spatial-layer frame is a keyframe, and thus one which | |||
completely resets the encoder state. This packet will have its P | completely resets the encoder state. This packet will have its P | |||
bit equal to 0, SID or L bit (described below) equal to 0, and B | bit equal to zero, SID or L bit (described below) equal to zero, | |||
bit (described below) equal to 1. | and B bit (described below) equal to one. | |||
B: Start of Frame. This bit MUST be set to one if the first payload | B: Start of Frame. This bit MUST be set to one if the first payload | |||
octet of the RTP packet is the beginning of a new VP9 frame; | octet of the RTP packet is the beginning of a new VP9 frame; | |||
otherwise, it MUST NOT be 1. Note that this frame might not be | otherwise, it MUST NOT be one. Note that this frame might not be | |||
the first frame of a picture. | the first frame of a picture. | |||
E: End of Frame. This bit MUST be set to one for the final RTP | E: End of Frame. This bit MUST be set to one for the final RTP | |||
packet of a VP9 frame; otherwise, it is 0. This enables a decoder | packet of a VP9 frame; otherwise, it is zero. This enables a | |||
to finish decoding the frame, where it otherwise may need to wait | decoder to finish decoding the frame, where it otherwise may need | |||
for the next packet to explicitly know that the frame is complete. | to wait for the next packet to explicitly know that the frame is | |||
Note that, if spatial scalability is in use, more frames from the | complete. Note that, if spatial scalability is in use, more | |||
same picture may follow; see the description of the B bit above. | frames from the same picture may follow; see the description of | |||
| | the B bit above. | |||
V: Scalability Structure (SS) data present. When set to one, the | V: Scalability Structure (SS) data present. When set to one, the | |||
OPTIONAL SS data MUST be present in the payload descriptor. | OPTIONAL SS data MUST be present in the payload descriptor. | |||
Otherwise, the SS data MUST NOT be present. | Otherwise, the SS data MUST NOT be present. | |||
Z: Not a reference frame for upper spatial layers. If set to one, | Z: Not a reference frame for upper spatial layers. If set to one, | |||
indicates that frames with higher spatial layers SID+1 and greater | indicates that frames with higher spatial layers SID+1 and greater | |||
of the current and following pictures do not depend on the current | of the current and following pictures do not depend on the current | |||
spatial-layer SID frame. This enables a decoder that is targeting | spatial-layer SID frame. This enables a decoder that is targeting | |||
a higher spatial layer to know that it can safely discard this | a higher spatial layer to know that it can safely discard this | |||
skipping to change at line 394 | skipping to change at line 400 | |||
The mandatory first octet is followed by the extension data fields | The mandatory first octet is followed by the extension data fields | |||
that are enabled: | that are enabled: | |||
M: The most significant bit of the first octet is an extension flag. | M: The most significant bit of the first octet is an extension flag. | |||
The field MUST be present if the I bit is equal to one. If M is | The field MUST be present if the I bit is equal to one. If M is | |||
set, the PID field MUST contain 15 bits; otherwise, it MUST | set, the PID field MUST contain 15 bits; otherwise, it MUST | |||
contain 7 bits. See PID below. | contain 7 bits. See PID below. | |||
Picture ID (PID): Picture ID represented in 7 or 15 bits, depending | Picture ID (PID): Picture ID represented in 7 or 15 bits, depending | |||
on the M bit. This is a running index of the pictures, where the | on the M bit. This is a running index of the pictures, where the | |||
sender increments the value by 1 for each picture it sends. | sender increments the value by one for each picture it sends. | |||
(Note, however, that because a middlebox can discard pictures | (Note, however, that because a middlebox can discard pictures | |||
where permitted by the SS, Picture IDs as received by a receiver | where permitted by the SS, Picture IDs as received by a receiver | |||
might not be contiguous.) This field MUST be present if the I bit | might not be contiguous.) This field MUST be present if the I bit | |||
is equal to one. If M is set to zero, 7 bits carry the PID; else, | is equal to one. If M is set to zero, 7 bits carry the PID; else, | |||
if M is set to one, 15 bits carry the PID in network byte order. | if M is set to one, 15 bits carry the PID in network byte order. | |||
The sender may choose between a 7- or 15-bit index. The PID | The sender may choose between a 7- or 15-bit index. The PID | |||
SHOULD start on a random number and MUST wrap after reaching the | SHOULD start on a random number and MUST wrap after reaching the | |||
maximum ID (0x7f or 0x7fff depending on the index size chosen). | maximum ID (0x7f or 0x7fff depending on the index size chosen). | |||
The receiver MUST NOT assume that the number of bits in the PID | The receiver MUST NOT assume that the number of bits in the PID | |||
stays the same through the session. If this field transitions | stays the same through the session. If this field transitions | |||
from 7 bits to 15 bits, the value is zero-extended (i.e., the | from 7 bits to 15 bits, the value is zero-extended (i.e., the | |||
value after 0x6e is 0x006f); if the field transitions from 15 bits | value after 0x6e is 0x006f); if the field transitions from 15 bits | |||
to 7 bits, it is truncated (i.e., the value after 0x1bbe is 0xbf). | to 7 bits, it is truncated (i.e., the value after 0x1bbe is 0x3f). | |||
In the non-flexible mode (when the F bit is set to zero), this PID | In the non-flexible mode (when the F bit is set to zero), this PID | |||
is used as an index to the PG specified in the SS data below. In | is used as an index to the PG specified in the SS data below. In | |||
this mode, the PID of the keyframe corresponds to the first | this mode, the PID of the keyframe corresponds to the first | |||
specified frame in the PG. Then subsequent PIDs are mapped to | specified frame in the PG. Then subsequent PIDs are mapped to | |||
subsequently specified frames in the PG (modulo N_G, specified in | subsequently specified frames in the PG (modulo N_G, specified in | |||
the SS data below), respectively. | the SS data below), respectively. | |||
All frames of the same picture MUST have the same PID value. | All frames of the same picture MUST have the same PID value. | |||
Frames (and their corresponding pictures) with the VP9 show_frame | Frames (and their corresponding pictures) with the VP9 show_frame | |||
field equal to 0 MUST have distinct PID values from subsequent | field equal to zero MUST have distinct PID values from subsequent | |||
pictures with show_frame equal to 1. Thus, a picture (as defined | pictures with show_frame equal to one. Thus, a picture (as | |||
in this specification) is different than a VP9 superframe. | defined in this specification) is different than a VP9 superframe. | |||
All frames of the same picture MUST have the same value for | All frames of the same picture MUST have the same value for | |||
show_frame. | show_frame. | |||
Layer indices: This field is optional but RECOMMENDED whenever | Layer indices: This field is optional but RECOMMENDED whenever | |||
encoding with layers. For both flexible and non-flexible modes, | encoding with layers. For both flexible and non-flexible modes, | |||
one octet is used to specify a layer frame's Temporal-layer ID | one octet is used to specify a layer frame's Temporal-layer ID | |||
(TID) and Spatial-layer ID (SID) as shown both in Figure 2 and | (TID) and Spatial-layer ID (SID) as shown both in Figures 2 and 3. | |||
Figure 3. Additionally, a bit (U) is used to indicate that the | Additionally, a bit (U) is used to indicate that the current frame | |||
current frame is a "switching up point" frame. Another bit (D) is | is a "switching up point" frame. Another bit (D) is used to | |||
used to indicate whether inter-layer prediction is used for the | indicate whether inter-layer prediction is used for the current | |||
current frame. | frame. | |||
In the non-flexible mode (when the F bit is set to zero), another | In the non-flexible mode (when the F bit is set to zero), another | |||
octet is used to represent Temporal Layer 0 Picture Index (8 bits) | octet is used to represent the Temporal Layer 0 Picture Index (8 | |||
(TL0PICIDX), as depicted in Figure 3. The TL0PICIDX is present so | bits) (TL0PICIDX), as depicted in Figure 3. The TL0PICIDX is | |||
that all minimally required frames (the base temporal-layer | present so that all minimally required frames (the base temporal- | |||
frames) can be tracked. | layer frames) can be tracked. | |||
The TID and SID fields indicate the temporal and spatial layers | The TID and SID fields indicate the temporal and spatial layers | |||
and can help middleboxes and endpoints quickly identify which | and can help middleboxes and endpoints quickly identify which | |||
layer a packet belongs to. | layer a packet belongs to. | |||
TID: The temporal-layer ID of the current frame. In the case of | TID: The temporal-layer ID of the current frame. In the case of | |||
non-flexible mode, if a PID is mapped to a picture in a | non-flexible mode, if a PID is mapped to a picture in a | |||
specified PG, then the value of the TID MUST match the | specified PG, then the value of the TID MUST match the | |||
corresponding TID value of the mapped picture in the PG. | corresponding TID value of the mapped picture in the PG. | |||
U: Switching up point. When this bit is set to one, if the | U: Switching up point. When this bit is set to one, if the | |||
current picture has a temporal-layer ID equal to value T, then | current picture has a temporal-layer ID equal to value T, then | |||
subsequent pictures with temporal-layer ID values higher than T | subsequent pictures with temporal-layer ID values higher than T | |||
will not depend on any picture before the current picture (in | will not depend on any picture before the current picture (in | |||
coding order) with a temporal-layer ID value greater than T. | decode order) with a temporal-layer ID value greater than T. | |||
SID: The spatial-layer ID of the current frame. Note that frames | SID: The spatial-layer ID of the current frame. Note that frames | |||
with spatial-layer SID > 0 may be dependent on decoded spatial- | with spatial-layer SID > 0 may be dependent on decoded spatial- | |||
layer SID-1 frame within the same picture. Different frames of | layer SID-1 frame within the same picture. Different frames of | |||
the same picture MUST have distinct spatial-layer IDs, and | the same picture MUST have distinct spatial-layer IDs, and | |||
frames' spatial layers MUST appear in increasing order within | frames' spatial layers MUST appear in increasing order within | |||
the frame. | the frame. | |||
D: Inter-layer dependency is used. D MUST be set to one if and | D: Inter-layer dependency is used. D MUST be set to one if and | |||
only if the current spatial-layer SID frame depends on spatial- | only if the current spatial-layer SID frame depends on spatial- | |||
layer SID-1 frame of the same picture; otherwise, it MUST be | layer SID-1 frame of the same picture; otherwise, it MUST be | |||
set to zero. For the base-layer frame (with SID equal to 0), | set to zero. For the base-layer frame (with SID equal to | |||
the D bit MUST be set to zero. | zero), the D bit MUST be set to zero. | |||
TL0PICIDX: Temporal Layer 0 Picture Index (8 bits). TL0PICIDX is | TL0PICIDX: Temporal Layer 0 Picture Index (8 bits). TL0PICIDX is | |||
only present in the non-flexible mode (F = 0). This is a | only present in the non-flexible mode (F = 0). This is a | |||
running index for the temporal base-layer pictures, i.e., the | running index for the temporal base-layer pictures, i.e., the | |||
pictures with a TID set to zero. If the TID is larger than 0, | pictures with a TID set to zero. If the TID is larger than | |||
TL0PICIDX indicates which temporal base-layer picture the | zero, TL0PICIDX indicates which temporal base-layer picture the | |||
current picture depends on. TL0PICIDX MUST be incremented by 1 | current picture depends on. TL0PICIDX MUST be incremented by | |||
when the TID is equal to 0. The index SHOULD start on a random | one when the TID is equal to zero. The index SHOULD start on a | |||
number and MUST restart at 0 after reaching the maximum number | random number and MUST restart at zero after reaching the | |||
255. | maximum number 255. | |||
Reference indices: When P and F are both set to one, indicating a | Reference indices: When P and F are both set to one, indicating a | |||
non-keyframe in flexible mode, then at least one reference index | non-keyframe in flexible mode, then at least one reference index | |||
MUST be specified as below. Additional reference indices (a total | MUST be specified as below. Additional reference indices (a total | |||
of up to three reference indices are allowed) may be specified | of up to three reference indices are allowed) may be specified | |||
using the N bit below. When either P or F is set to zero, then no | using the N bit below. When either P or F is set to zero, then no | |||
reference index is specified. | reference index is specified. | |||
P_DIFF: The reference index (in 7 bits) specified as the relative | P_DIFF: The reference index (in 7 bits) specified as the relative | |||
PID from the current picture. For example, when P_DIFF=3 on a | PID from the current picture. For example, when P_DIFF=3 on a | |||
packet containing the picture with PID 112 means that the | packet containing the picture with PID 112 means that the | |||
picture refers back to the picture with PID 109. This | picture refers back to the picture with PID 109. This | |||
calculation is done modulo the size of the PID field, i.e., | calculation is done modulo the size of the PID field, i.e., | |||
either 7 or 15 bits. A P_DIFF value of 0 is invalid. | either 7 or 15 bits. A P_DIFF value of zero is invalid. | |||
N: 1 if there is additional P_DIFF following the current P_DIFF. | N: 1 if there is additional P_DIFF following the current P_DIFF. | |||
4.2.1. Scalability Structure (SS) | 4.2.1. Scalability Structure (SS) | |||
The SS data describes the resolution of each frame within a picture | The SS data describes the resolution of each frame within a picture | |||
as well as the inter-picture dependencies for a PG. If the VP9 | as well as the inter-picture dependencies for a PG. If the VP9 | |||
payload descriptor's V bit is set, the SS data is present in the | payload descriptor's V bit is set, the SS data is present in the | |||
position indicated in Figures 2 and 3. | position indicated in Figures 2 and 3. | |||
skipping to change at line 536 ¶ | skipping to change at line 542 ¶ | |||
one, the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be | one, the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be | |||
present for each layer frame. Otherwise, the resolution MUST NOT | present for each layer frame. Otherwise, the resolution MUST NOT | |||
be present. | be present. | |||
G: The PG description present flag. | G: The PG description present flag. | |||
-: A bit reserved for future use. It MUST be set to zero and MUST | -: A bit reserved for future use. It MUST be set to zero and MUST | |||
be ignored by the receiver. | be ignored by the receiver. | |||
N_G: N_G indicates the number of pictures in a PG. If N_G is | N_G: N_G indicates the number of pictures in a PG. If N_G is | |||
greater than 0, then the SS data allows the inter-picture | greater than zero, then the SS data allows the inter-picture | |||
dependency structure of the VP9 stream to be pre-declared, rather | dependency structure of the VP9 stream to be pre-declared, rather | |||
than indicating it on the fly with every packet. If N_G is | than indicating it on the fly with every packet. If N_G is | |||
greater than 0, then for N_G pictures in the PG, each picture's | greater than zero, then for N_G pictures in the PG, each picture's | |||
Temporal-layer ID (TID), switch up point (U), and reference | Temporal-layer ID (TID), switch up point (U), and reference | |||
indices (P_DIFFs) are specified. | indices (P_DIFFs) are specified. | |||
The first picture specified in the PG MUST have a TID set to zero. | The first picture specified in the PG MUST have a TID set to zero. | |||
G set to zero or N_G set to zero indicates that either there is | G set to zero or N_G set to zero indicates that either there is | |||
only one temporal layer (for non-flexible mode) or no fixed inter- | only one temporal layer (for non-flexible mode) or no fixed inter- | |||
picture dependency information is present (for flexible mode) | picture dependency information is present (for flexible mode) | |||
going forward in the bitstream. | going forward in the bitstream. | |||
skipping to change at line 561 ¶ | skipping to change at line 567 ¶ | |||
picture dependency structure. However, the frame rate of each | picture dependency structure. However, the frame rate of each | |||
spatial layer can be different from each other; this can be | spatial layer can be different from each other; this can be | |||
described with the use of the D bit described above. The | described with the use of the D bit described above. The | |||
specified dependency structure in the SS data MUST be for the | specified dependency structure in the SS data MUST be for the | |||
highest frame rate layer. | highest frame rate layer. | |||
R: The number of P_DIFF fields that are present. | R: The number of P_DIFF fields that are present. | |||
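As a rough illustration of how a receiver might read these fields, the sketch below parses SS data from a byte buffer. The bit positions (N_S in the top three bits of the first octet, then Y and G; TID, U, and R packed into one octet per picture; N_S + 1 spatial layers) are assumptions taken from the SS figure of this section, which is not reproduced in this excerpt. All names (`parse_ss`, `PgPicture`, `ScalabilityStructure`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PgPicture:
    tid: int            # temporal-layer ID
    switch_up: bool     # U bit
    p_diffs: List[int]  # R reference indices


@dataclass
class ScalabilityStructure:
    num_spatial_layers: int
    resolutions: List[Tuple[int, int]]  # (width, height); empty when Y is zero
    pictures: List[PgPicture]           # empty when G or N_G is zero


def parse_ss(buf: bytes, offset: int = 0) -> Tuple[ScalabilityStructure, int]:
    """Parse SS data starting at `offset`; return it with the new offset."""

    def take(count: int) -> bytes:
        nonlocal offset
        if offset + count > len(buf):
            # Never read past the packet: malformed input must not overrun.
            raise ValueError("truncated SS data")
        chunk = buf[offset:offset + count]
        offset += count
        return chunk

    head = take(1)[0]
    n_s = head >> 5               # assumed layout: N_S(3) Y(1) G(1) reserved(3)
    y_flag = bool(head & 0x10)
    g_flag = bool(head & 0x08)    # reserved low bits are ignored

    resolutions = []
    if y_flag:
        for _ in range(n_s + 1):
            width = int.from_bytes(take(2), "big")
            height = int.from_bytes(take(2), "big")
            resolutions.append((width, height))

    pictures = []
    if g_flag:
        n_g = take(1)[0]
        for index in range(n_g):
            octet = take(1)[0]    # assumed layout: TID(3) U(1) R(2) reserved(2)
            tid = octet >> 5
            if index == 0 and tid != 0:
                raise ValueError("first picture in the PG must have TID zero")
            p_diffs = list(take((octet >> 2) & 0x03))
            if 0 in p_diffs:
                raise ValueError("a P_DIFF value of zero is invalid")
            pictures.append(PgPicture(tid, bool(octet & 0x10), p_diffs))

    return ScalabilityStructure(n_s + 1, resolutions, pictures), offset
```

The `take` helper bounds-checks every read, in line with the requirement that decoders stay robust against malformed payloads.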
In a scalable stream sent with a fixed pattern, the SS data SHOULD be | In a scalable stream sent with a fixed pattern, the SS data SHOULD be | |||
included in the first packet of every key frame. This is a packet | included in the first packet of every key frame. This is a packet | |||
with the P bit equal to 0, SID or L bit equal to 0, and B bit equal | with the P bit equal to zero, SID or L bit equal to zero, and B bit | |||
to 1. The SS data MUST only be changed on the picture that | equal to one. The SS data MUST only be changed on the picture that | |||
corresponds to the first picture specified in the previous SS data's | corresponds to the first picture specified in the previous SS data's | |||
PG (if the previous SS data's N_G was greater than 0). | PG (if the previous SS data's N_G was greater than zero). | |||
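The packet test in the paragraph above can be written as a small predicate. This sketch reads "SID or L bit equal to zero" as "either the L bit is zero (no layer indices present) or the SID is zero", which is an interpretation rather than something the text spells out. The function name and argument names are hypothetical, and the fields are assumed to have been parsed from the payload descriptor already.

```python
from typing import Optional


def is_first_packet_of_key_frame(p_bit: int, b_bit: int, l_bit: int,
                                 sid: Optional[int]) -> bool:
    """Return True for the packet where SS data is expected to appear.

    `sid` is None when the L bit is zero, because no layer indices
    are carried in that case.
    """
    base_layer = l_bit == 0 or sid == 0
    return p_bit == 0 and b_bit == 1 and base_layer


# A packet with P=0, B=1, L=1, SID=0 qualifies; the same packet with SID=1 does not.
print(is_first_packet_of_key_frame(0, 1, 1, 0))
print(is_first_packet_of_key_frame(0, 1, 1, 1))
```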
4.3. Frame Fragmentation | 4.3. Frame Fragmentation | |||
VP9 frames are fragmented into packets in RTP sequence number order: | VP9 frames are fragmented into packets in RTP sequence number order: | |||
beginning with a packet with the B bit set and ending with a packet | beginning with a packet with the B bit set and ending with a packet | |||
with the E bit set. There is no mechanism for finer-grained access | with the E bit set. There is no mechanism for finer-grained access | |||
to parts of a VP9 frame. | to parts of a VP9 frame. | |||
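A receiver-side sketch of this rule follows: frames are rebuilt by concatenating payloads from a packet with the B bit set through the next packet with the E bit set. The packet tuple shape and the function name are hypothetical. The sketch assumes packets arrive already ordered by RTP sequence number with the payload descriptor stripped, and it drops any frame that has a sequence gap.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical packet shape: (sequence number, B bit, E bit, frame payload bytes)
Packet = Tuple[int, bool, bool, bytes]


def reassemble_frames(packets: Iterable[Packet]) -> List[bytes]:
    """Rebuild complete VP9 frames from packets in sequence-number order."""
    frames: List[bytes] = []
    parts: Optional[List[bytes]] = None   # None means no frame in progress
    prev_seq: Optional[int] = None

    for seq, b_bit, e_bit, payload in packets:
        # RTP sequence numbers are 16 bits wide and wrap around.
        contiguous = prev_seq is not None and seq == (prev_seq + 1) & 0xFFFF
        prev_seq = seq

        if b_bit:
            parts = [payload]             # start of a new frame
        elif parts is not None and contiguous:
            parts.append(payload)         # continuation of the current frame
        else:
            parts = None                  # gap or missing start: drop the frame
            continue

        if e_bit:
            frames.append(b"".join(parts))
            parts = None

    return frames
```

Because there is no finer-grained access to parts of a frame, the sketch discards a partially received frame instead of handing it to the decoder.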
4.4. Scalable Encoding Considerations | 4.4. Scalable Encoding Considerations | |||
skipping to change at line 598 ¶ | skipping to change at line 604 ¶ | |||
For spatially scalable streams, this means that | For spatially scalable streams, this means that | |||
"error_resilient_mode" needs to be turned on for the base spatial | "error_resilient_mode" needs to be turned on for the base spatial | |||
layer; however, it can be turned off for higher spatial layers, | layer; however, it can be turned off for higher spatial layers, | |||
assuming they are sent with inter-layer dependency (i.e., with the D | assuming they are sent with inter-layer dependency (i.e., with the D | |||
bit set). For streams that are only temporally scalable without | bit set). For streams that are only temporally scalable without | |||
spatial scalability, "error_resilient_mode" can additionally be | spatial scalability, "error_resilient_mode" can additionally be | |||
turned off for any picture that immediately follows a temporal-layer | turned off for any picture that immediately follows a temporal-layer | |||
0 frame. | 0 frame. | |||
4.5. Examples of VP9 RTP Stream | 4.5. Example of a VP9 RTP Stream | |||
4.5.1. Reference Picture Use for Scalable Structure | 4.5.1. Reference Picture Use for Scalable Structure | |||
As discussed in Section 3, the VP9 codec can maintain up to eight | As discussed in Section 3, the VP9 codec can maintain up to eight | |||
reference frames, of which up to three can be referenced or updated | reference frames, of which up to three can be referenced or updated | |||
by any new frame. This section illustrates one way that a scalable | by any new frame. This section illustrates one way that a scalable | |||
structure (with three spatial layers and three temporal layers) can | structure (with three spatial layers and three temporal layers) can | |||
be constructed using these reference frames. | be constructed using these reference frames. | |||
+==========+=========+============+=========+ | +==========+=========+============+=========+ | |||
skipping to change at line 913 ¶ | skipping to change at line 919 ¶ | |||
are subject to the security considerations discussed in the RTP | are subject to the security considerations discussed in the RTP | |||
specification [RFC3550], and in any applicable RTP profile such as | specification [RFC3550], and in any applicable RTP profile such as | |||
RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/ | RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/ | |||
SAVPF [RFC5124]. However, as "Securing the RTP Framework: Why RTP | SAVPF [RFC5124]. However, as "Securing the RTP Framework: Why RTP | |||
Does Not Mandate a Single Media Security Solution" [RFC7202] | Does Not Mandate a Single Media Security Solution" [RFC7202] | |||
discusses, it is not an RTP payload format's responsibility to | discusses, it is not an RTP payload format's responsibility to | |||
discuss or mandate what solutions are used to meet the basic security | discuss or mandate what solutions are used to meet the basic security | |||
goals like confidentiality, integrity, and source authenticity for | goals like confidentiality, integrity, and source authenticity for | |||
RTP in general. This responsibility lies with anyone using RTP in an | RTP in general. This responsibility lies with anyone using RTP in an | |||
application. They can find guidance on available security mechanisms | application. They can find guidance on available security mechanisms | |||
in "Options for Securing RTP Sessions [RFC7201]. Applications SHOULD | in "Options for Securing RTP Sessions" [RFC7201]. Applications | |||
use one or more appropriate strong security mechanisms. | SHOULD use one or more appropriate strong security mechanisms. | |||
Implementations of this RTP payload format need to take appropriate | Implementations of this RTP payload format need to take appropriate | |||
security considerations into account. It is extremely important for | security considerations into account. It is extremely important for | |||
the decoder to be robust against malicious or malformed payloads and | the decoder to be robust against malicious or malformed payloads and | |||
ensure that they do not cause the decoder to overrun its allocated | ensure that they do not cause the decoder to overrun its allocated | |||
memory or otherwise misbehave. An overrun in allocated memory could | memory or otherwise misbehave. An overrun in allocated memory could | |||
lead to arbitrary code execution by an attacker. The same applies to | lead to arbitrary code execution by an attacker. The same applies to | |||
the encoder, even though problems in encoders are (typically) rarer. | the encoder, even though problems in encoders are (typically) rarer. | |||
This RTP payload format and its media decoder do not exhibit any | This RTP payload format and its media decoder do not exhibit any | |||
skipping to change at line 948 ¶ | skipping to change at line 954 ¶ | |||
non-reference frames and discard them in order to reduce network | non-reference frames and discard them in order to reduce network | |||
congestion. Note that discarding of non-reference frames cannot be | congestion. Note that discarding of non-reference frames cannot be | |||
done if the stream is encrypted (because the non-reference marker is | done if the stream is encrypted (because the non-reference marker is | |||
encrypted). | encrypted). | |||
10. IANA Considerations | 10. IANA Considerations | |||
IANA has registered the media type registration "video/vp9" as | IANA has registered the media type registration "video/vp9" as | |||
specified in Section 7. The media type has also been added to the | specified in Section 7. The media type has also been added to the | |||
"RTP Payload Format Media Types" <https://www.iana.org/assignments/ | "RTP Payload Format Media Types" <https://www.iana.org/assignments/ | |||
rtp-parameters> subregistry of the "Real-Time Transport Protocol | rtp-parameters> registry of the "Real-Time Transport Protocol (RTP) | |||
(RTP) Parameters" registry as follows. | Parameters" registry group as follows. | |||
Media Type: video | Media Type: video | |||
Subtype: VP9 | Subtype: VP9 | |||
Clock Rate (Hz): 90000 | Clock Rate (Hz): 90000 | |||
Reference: RFC 9628 | Reference: RFC 9628 | |||
11. References | 11. References | |||
11.1. Normative References | 11.1. Normative References | |||
skipping to change at line 1062 ¶ | skipping to change at line 1068 ¶ | |||
Acknowledgments | Acknowledgments | |||
Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo, | Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo, | |||
Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert | Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert | |||
commented on the development of this document and provided helpful | commented on the development of this document and provided helpful | |||
feedback. | feedback. | |||
Authors' Addresses | Authors' Addresses | |||
Justin Uberti | Justin Uberti | |||
Google, Inc. | OpenAI | |||
747 6th Street South | 747 6th Street South | |||
Kirkland, WA 98033 | Kirkland, WA 98033 | |||
United States of America | United States of America | |||
Email: justin@uberti.name | Email: justin@uberti.name | |||
Stefan Holmer | Stefan Holmer | |||
Google, Inc. | Google, Inc. | |||
Kungsbron 2 | Kungsbron 2 | |||
SE-111 22 Stockholm | SE-111 22 Stockholm | |||
Sweden | Sweden | |||
skipping to change at line 1084 ¶ | skipping to change at line 1090 ¶ | |||
Magnus Flodman | Magnus Flodman | |||
Google, Inc. | Google, Inc. | |||
Kungsbron 2 | Kungsbron 2 | |||
SE-111 22 Stockholm | SE-111 22 Stockholm | |||
Sweden | Sweden | |||
Email: mflodman@google.com | Email: mflodman@google.com | |||
Danny Hong | Danny Hong | |||
Google, Inc. | Google, Inc. | |||
1585 Charleston Road | 315 Hudson St. | |||
Mountain View, CA 94043 | New York, NY 10013 | |||
United States of America | United States of America | |||
Email: dannyhong@google.com | Email: dannyhong@google.com | |||
Jonathan Lennox | Jonathan Lennox | |||
8x8, Inc. / Jitsi | 8x8, Inc. / Jitsi | |||
Jersey City, NJ 07302 | Jersey City, NJ 07302 | |||
United States of America | United States of America | |||
Email: jonathan.lennox@8x8.com | Email: jonathan.lennox@8x8.com | |||
End of changes. 31 change blocks. | ||||
68 lines changed or deleted | 74 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. |