Diff: rfc9628v4.txt - rfc9628.txt

	rfc9628v4.txt		rfc9628.txt

	Internet Engineering Task Force (IETF) J. Uberti		Internet Engineering Task Force (IETF) J. Uberti

	Request for Comments: 9628 S. Holmer		Request for Comments: 9628 OpenAI
	Category: Standards Track M. Flodman		Category: Standards Track S. Holmer
	ISSN: 2070-1721 D. Hong		ISSN: 2070-1721 M. Flodman
			D. Hong
	Google		Google
	J. Lennox		J. Lennox
	8x8 / Jitsi		8x8 / Jitsi
	February 2025		February 2025

	RTP Payload Format for VP9 Video		RTP Payload Format for VP9 Video

	Abstract		Abstract

	This specification describes an RTP payload format for the VP9 video		This specification describes an RTP payload format for the VP9 video

	skipping to change at line 61 ¶		skipping to change at line 62 ¶

	1. Introduction		1. Introduction
	2. Conventions		2. Conventions
	3. Media Format Description		3. Media Format Description
	4. Payload Format		4. Payload Format
	4.1. RTP Header Usage		4.1. RTP Header Usage
	4.2. VP9 Payload Descriptor		4.2. VP9 Payload Descriptor
	4.2.1. Scalability Structure (SS)		4.2.1. Scalability Structure (SS)
	4.3. Frame Fragmentation		4.3. Frame Fragmentation
	4.4. Scalable Encoding Considerations		4.4. Scalable Encoding Considerations

	4.5. Examples of VP9 RTP Stream		4.5. Example of a VP9 RTP Stream
	4.5.1. Reference Picture Use for Scalable Structure		4.5.1. Reference Picture Use for Scalable Structure
	5. Feedback Messages and Header Extensions		5. Feedback Messages and Header Extensions
	5.1. Reference Picture Selection Indication (RPSI)		5.1. Reference Picture Selection Indication (RPSI)
	5.2. Full Intra Request (FIR)		5.2. Full Intra Request (FIR)
	5.3. Layer Refresh Request (LRR)		5.3. Layer Refresh Request (LRR)
	6. Payload Format Parameters		6. Payload Format Parameters
	6.1. SDP Parameters		6.1. SDP Parameters
	6.1.1. Mapping of Media Subtype Parameters to SDP		6.1.1. Mapping of Media Subtype Parameters to SDP
	6.1.2. Offer/Answer Considerations		6.1.2. Offer/Answer Considerations
	7. Media Type Definition		7. Media Type Definition

	skipping to change at line 132 ¶		skipping to change at line 133 ¶
	allow a frame to be encoded at the same resolution but at different		allow a frame to be encoded at the same resolution but at different
	qualities (and, thus, with different amounts of coding error). VP9		qualities (and, thus, with different amounts of coding error). VP9
	supports quality layers as spatial layers without any resolution		supports quality layers as spatial layers without any resolution
	changes; hereinafter, the term "spatial layer" is used to represent		changes; hereinafter, the term "spatial layer" is used to represent
	both spatial and quality layers.		both spatial and quality layers.

	This payload format specification defines how such temporal and		This payload format specification defines how such temporal and
	spatial scalability layers can be described and communicated.		spatial scalability layers can be described and communicated.

	Temporal and spatial scalability layers are associated with non-		Temporal and spatial scalability layers are associated with non-

	negative integer IDs. The lowest layer of either type has an ID of 0		negative integer IDs. The lowest layer of either type has an ID of
	and is sometimes referred to as the "base" temporal or spatial layer.		zero and is sometimes referred to as the "base" temporal or spatial
			layer.

	Layers are designed, and MUST be encoded, such that if any layer, and		Layers are designed, and MUST be encoded, such that if any layer, and
	all higher layers, are removed from the bitstream along either the		all higher layers, are removed from the bitstream along either the
	spatial or temporal dimension, the remaining bitstream is still		spatial or temporal dimension, the remaining bitstream is still
	correctly decodable.		correctly decodable.

	For terminology, this document uses the term "frame" to refer to a		For terminology, this document uses the term "frame" to refer to a
	single encoded VP9 frame for a particular resolution and/or quality,		single encoded VP9 frame for a particular resolution and/or quality,
	and "picture" to refer to all the representations (frames) at a		and "picture" to refer to all the representations (frames) at a
	single instant in time. Thus, a picture consists of one or more		single instant in time. Thus, a picture consists of one or more

	skipping to change at line 167 ¶		skipping to change at line 169 ¶

	Given the above simplifications for inter-layer and inter-picture		Given the above simplifications for inter-layer and inter-picture
	dependencies, a flag (the D bit described below) is used to indicate		dependencies, a flag (the D bit described below) is used to indicate
	whether a spatial-layer SID frame depends on the spatial-layer SID-1		whether a spatial-layer SID frame depends on the spatial-layer SID-1
	frame. Given the D bit, a receiver only needs to additionally know		frame. Given the D bit, a receiver only needs to additionally know
	the inter-picture dependency structure for a given spatial-layer		the inter-picture dependency structure for a given spatial-layer
	frame in order to determine its decodability. Two modes of		frame in order to determine its decodability. Two modes of
	describing the inter-picture dependency structure are possible:		describing the inter-picture dependency structure are possible:
	"flexible mode" and "non-flexible mode". An encoder can only switch		"flexible mode" and "non-flexible mode". An encoder can only switch
	between the two on the first packet of a keyframe with a temporal-		between the two on the first packet of a keyframe with a temporal-

	layer ID equal to 0.		layer ID equal to zero.

	In flexible mode, each packet can contain up to three reference		In flexible mode, each packet can contain up to three reference
	indices, which identify all frames referenced by the frame		indices, which identify all frames referenced by the frame
	transmitted in the current packet for inter-picture prediction. This		transmitted in the current packet for inter-picture prediction. This
	(along with the D bit) enables a receiver to identify if a frame is		(along with the D bit) enables a receiver to identify if a frame is
	decodable or not and helps it understand the temporal-layer		decodable or not and helps it understand the temporal-layer
	structure. Since this is signaled in each packet, it makes it		structure. Since this is signaled in each packet, it makes it
	possible to have very flexible temporal-layer hierarchies and		possible to have very flexible temporal-layer hierarchies and
	scalability structures, which are changing dynamically.		scalability structures, which are changing dynamically.


	skipping to change at line 191 ¶		skipping to change at line 193 ¶
	inter-picture dependencies (the reference indices) of the PG MUST be		inter-picture dependencies (the reference indices) of the PG MUST be
	pre-specified as part of the Scalability Structure (SS) data. Each		pre-specified as part of the Scalability Structure (SS) data. Each
	packet has an index to refer to one of the described pictures in the		packet has an index to refer to one of the described pictures in the
	PG from which the pictures referenced by the picture transmitted in		PG from which the pictures referenced by the picture transmitted in
	the current packet for inter-picture prediction can be identified.		the current packet for inter-picture prediction can be identified.

	| Note: A "Picture Group" or "PG", as used in this document, is		| Note: A "Picture Group" or "PG", as used in this document, is
	| not the same thing as the term "Group of Pictures" as it is		| not the same thing as the term "Group of Pictures" as it is
	| commonly used in video coding, i.e., to mean an independently		| commonly used in video coding, i.e., to mean an independently
	| decodable run of pictures beginning with a keyframe.		| decodable run of pictures beginning with a keyframe.

	|
	| The SS data can also be used to specify the resolution of each		The SS data can also be used to specify the resolution of each
	| spatial layer present in the VP9 stream for both flexible and		spatial layer present in the VP9 stream for both flexible and non-
	| non-flexible modes.		flexible modes.

	4. Payload Format		4. Payload Format

	This section describes how the encoded VP9 bitstream is encapsulated		This section describes how the encoded VP9 bitstream is encapsulated
	in RTP. To handle network losses, usage of RTP/AVPF [RFC4585] is		in RTP. To handle network losses, usage of RTP/AVPF [RFC4585] is
	RECOMMENDED. All integer fields in this specification are encoded as		RECOMMENDED. All integer fields in this specification are encoded as
	unsigned integers in network octet order.		unsigned integers in network octet order.

	4.1. RTP Header Usage		4.1. RTP Header Usage


	skipping to change at line 240 ¶		skipping to change at line 242 ¶
	+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

	Figure 1: General RTP Payload Format for VP9		Figure 1: General RTP Payload Format for VP9

	See Section 4.2 for more information on the VP9 payload descriptor;		See Section 4.2 for more information on the VP9 payload descriptor;
	the VP9 payload is described in [VP9-BITSTREAM]. OPTIONAL RTP		the VP9 payload is described in [VP9-BITSTREAM]. OPTIONAL RTP
	padding MUST NOT be included unless the P bit is set.		padding MUST NOT be included unless the P bit is set.

	Marker bit (M): This bit MUST be set to one for the final packet of		Marker bit (M): This bit MUST be set to one for the final packet of
	the highest spatial-layer frame (the final packet of the picture);		the highest spatial-layer frame (the final packet of the picture);

	otherwise, it is 0. Unless spatial scalability is in use for this		otherwise, it is zero. Unless spatial scalability is in use for
	picture, this bit will have the same value as the E bit described		this picture, this bit will have the same value as the E bit
	in Section 4.2. Note this bit MUST be set to one for the target		described in Section 4.2. Note this bit MUST be set to one for
	spatial-layer frame if a stream is being rewritten to remove		the target spatial-layer frame if a stream is being rewritten to
	higher spatial layers.		remove higher spatial layers.

	Payload Type (PT): In line with the policy in Section 3 of		Payload Type (PT): In line with the policy in Section 3 of
	[RFC3551], applications using the VP9 RTP payload profile MUST		[RFC3551], applications using the VP9 RTP payload profile MUST
	assign a dynamic payload type number to be used in each RTP		assign a dynamic payload type number to be used in each RTP
	session and provide a mechanism to indicate the mapping. See		session and provide a mechanism to indicate the mapping. See
	Section 6.1 for the mechanism to be used with the Session		Section 6.1 for the mechanism to be used with the Session
	Description Protocol (SDP) [RFC8866].		Description Protocol (SDP) [RFC8866].

	Timestamp: The RTP timestamp [RFC3550] indicates the time when the		Timestamp: The RTP timestamp [RFC3550] indicates the time when the
	input frame was sampled, at a clock rate of 90 kHz. If the input		input frame was sampled, at a clock rate of 90 kHz. If the input

	picture is encoded with multiple-layer frames, all of the frames		picture is encoded with multiple frames, all of the frames of the
	of the picture MUST have the same timestamp.		picture MUST have the same timestamp.

	If a frame has the VP9 show_frame field set to zero (i.e., it is		If a frame has the VP9 show_frame field set to zero (i.e., it is
	meant only to populate a reference buffer without being output),		meant only to populate a reference buffer without being output),
	its timestamp MAY alternatively be set to be the same as the		its timestamp MAY alternatively be set to be the same as the

	subsequent frame with show_frame equal to 1. (This will be		subsequent frame with show_frame equal to one. (This will be
	convenient for playing out pre-encoded content packaged with VP9		convenient for playing out pre-encoded content packaged with VP9
	"superframes", which typically bundle show_frame==0 frames with a		"superframes", which typically bundle show_frame==0 frames with a

	subsequent show_frame==1 frame.) Every frame with show_frame==1,		subsequent show_frame==1 frame.) Every picture containing a frame
	however, MUST have a unique timestamp modulo the 2^32 wrap of the		with show_frame==1, however, MUST have a unique timestamp modulo
	field.		the 2^32 wrap of the field.
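
The Timestamp rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of either draft: the function name, the seconds-based input, and the `offset` parameter (standing in for the random initial timestamp of [RFC3550]) are assumptions made for the example.

```python
RTP_CLOCK_HZ = 90_000   # clock rate stated in the Timestamp description
TS_MODULUS = 1 << 32    # the RTP timestamp field is 32 bits wide


def rtp_timestamp(capture_seconds: float, offset: int = 0) -> int:
    """Map a sampling instant (in seconds) to a 90 kHz RTP timestamp.

    `offset` is a hypothetical per-stream starting value; the result is
    reduced modulo 2^32, which is the wrap the text refers to.
    """
    ticks = round(capture_seconds * RTP_CLOCK_HZ)
    return (offset + ticks) % TS_MODULUS


# All frames of one picture share the value computed for that picture.
picture_ts = rtp_timestamp(1.0)                             # 90000
wrapped_ts = rtp_timestamp(1.0, offset=TS_MODULUS - 1)      # 89999
```

A sender would compute the value once per picture and stamp every packet of every frame in that picture with it.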

	The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number,		The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number,
	SSRC, and CSRC identifiers) are used as specified in Section 5.1 of		SSRC, and CSRC identifiers) are used as specified in Section 5.1 of
	[RFC3550].		[RFC3550].

	4.2. VP9 Payload Descriptor		4.2. VP9 Payload Descriptor

	In flexible mode (with the F bit below set to one), the first octets		In flexible mode (with the F bit below set to one), the first octets
	after the RTP header are the VP9 payload descriptor, with the		after the RTP header are the VP9 payload descriptor, with the
	following structure.		following structure.

	skipping to change at line 316 ¶		skipping to change at line 318 ¶
	M: | EXTENDED PID | (RECOMMENDED)		M: | EXTENDED PID | (RECOMMENDED)
	+-+-+-+-+-+-+-+-+		+-+-+-+-+-+-+-+-+
	L: | TID |U| SID |D| (Conditionally RECOMMENDED)		L: | TID |U| SID |D| (Conditionally RECOMMENDED)
	+-+-+-+-+-+-+-+-+		+-+-+-+-+-+-+-+-+
	| TL0PICIDX | (Conditionally REQUIRED)		| TL0PICIDX | (Conditionally REQUIRED)
	+-+-+-+-+-+-+-+-+		+-+-+-+-+-+-+-+-+
	V: | SS |		V: | SS |
	| .. |		| .. |
	+-+-+-+-+-+-+-+-+		+-+-+-+-+-+-+-+-+


	Figure 3: Non-flexible Mode Format for VP9 Payload Descriptor		Figure 3: Non-Flexible Mode Format for VP9 Payload Descriptor

			Except as noted, the following field descriptions apply to the
			payload descriptor formats in both Figures 2 and 3.

	I: Picture ID (PID) present. When set to one, the OPTIONAL PID MUST		I: Picture ID (PID) present. When set to one, the OPTIONAL PID MUST
	be present after the mandatory first octet and specified as below.		be present after the mandatory first octet and specified as below.
	Otherwise, PID MUST NOT be present. If the V bit was set in the		Otherwise, PID MUST NOT be present. If the V bit was set in the
	stream's most recent start of a keyframe (i.e., the SS field was		stream's most recent start of a keyframe (i.e., the SS field was
	present) and the F bit is set to zero (i.e., non-flexible		present) and the F bit is set to zero (i.e., non-flexible
	scalability mode is in use), then this bit MUST be set on every		scalability mode is in use), then this bit MUST be set on every
	packet.		packet.

	P: Inter-picture predicted frame. When set to zero, the frame does		P: Inter-picture predicted frame. When set to zero, the frame does

	skipping to change at line 357 ¶		skipping to change at line 362 ¶
	mandatory first octet, the PID, and layer indices (if present) are		mandatory first octet, the PID, and layer indices (if present) are
	as described by "reference indices" below. This bit MUST only be		as described by "reference indices" below. This bit MUST only be
	set to one if the I bit is also set to one; if the I bit is set to		set to one if the I bit is also set to one; if the I bit is set to
	zero, then this bit MUST also be set to zero and ignored by		zero, then this bit MUST also be set to zero and ignored by
	receivers. (Flexible mode's reference indices are defined as		receivers. (Flexible mode's reference indices are defined as
	offsets from the Picture ID field, so they would have no meaning		offsets from the Picture ID field, so they would have no meaning
	if I were not set.) The value of the F bit MUST only change on		if I were not set.) The value of the F bit MUST only change on
	the first packet of a key picture. A "key picture" is a picture		the first packet of a key picture. A "key picture" is a picture
	whose base spatial-layer frame is a keyframe, and thus one which		whose base spatial-layer frame is a keyframe, and thus one which
	completely resets the encoder state. This packet will have its P		completely resets the encoder state. This packet will have its P

	bit equal to 0, SID or L bit (described below) equal to 0, and B		bit equal to zero, SID or L bit (described below) equal to zero,
	bit (described below) equal to 1.		and B bit (described below) equal to one.

	B: Start of Frame. This bit MUST be set to one if the first payload		B: Start of Frame. This bit MUST be set to one if the first payload
	octet of the RTP packet is the beginning of a new VP9 frame;		octet of the RTP packet is the beginning of a new VP9 frame;

	otherwise, it MUST NOT be 1. Note that this frame might not be		otherwise, it MUST NOT be one. Note that this frame might not be
	the first frame of a picture.		the first frame of a picture.

	E: End of Frame. This bit MUST be set to one for the final RTP		E: End of Frame. This bit MUST be set to one for the final RTP

	packet of a VP9 frame; otherwise, it is 0. This enables a decoder		packet of a VP9 frame; otherwise, it is zero. This enables a
	to finish decoding the frame, where it otherwise may need to wait		decoder to finish decoding the frame, where it otherwise may need
	for the next packet to explicitly know that the frame is complete.		to wait for the next packet to explicitly know that the frame is
	Note that, if spatial scalability is in use, more frames from the		complete. Note that, if spatial scalability is in use, more
	same picture may follow; see the description of the B bit above.		frames from the same picture may follow; see the description of
			the B bit above.

	V: Scalability Structure (SS) data present. When set to one, the		V: Scalability Structure (SS) data present. When set to one, the
	OPTIONAL SS data MUST be present in the payload descriptor.		OPTIONAL SS data MUST be present in the payload descriptor.
	Otherwise, the SS data MUST NOT be present.		Otherwise, the SS data MUST NOT be present.

	Z: Not a reference frame for upper spatial layers. If set to one,		Z: Not a reference frame for upper spatial layers. If set to one,
	indicates that frames with higher spatial layers SID+1 and greater		indicates that frames with higher spatial layers SID+1 and greater
	of the current and following pictures do not depend on the current		of the current and following pictures do not depend on the current
	spatial-layer SID frame. This enables a decoder that is targeting		spatial-layer SID frame. This enables a decoder that is targeting
	a higher spatial layer to know that it can safely discard this		a higher spatial layer to know that it can safely discard this

	skipping to change at line 394 ¶		skipping to change at line 400 ¶
	The mandatory first octet is followed by the extension data fields		The mandatory first octet is followed by the extension data fields
	that are enabled:		that are enabled:

	M: The most significant bit of the first octet is an extension flag.		M: The most significant bit of the first octet is an extension flag.
	The field MUST be present if the I bit is equal to one. If M is		The field MUST be present if the I bit is equal to one. If M is
	set, the PID field MUST contain 15 bits; otherwise, it MUST		set, the PID field MUST contain 15 bits; otherwise, it MUST
	contain 7 bits. See PID below.		contain 7 bits. See PID below.

	Picture ID (PID): Picture ID represented in 7 or 15 bits, depending		Picture ID (PID): Picture ID represented in 7 or 15 bits, depending
	on the M bit. This is a running index of the pictures, where the		on the M bit. This is a running index of the pictures, where the

	sender increments the value by 1 for each picture it sends.		sender increments the value by one for each picture it sends.
	(Note, however, that because a middlebox can discard pictures		(Note, however, that because a middlebox can discard pictures
	where permitted by the SS, Picture IDs as received by a receiver		where permitted by the SS, Picture IDs as received by a receiver
	might not be contiguous.) This field MUST be present if the I bit		might not be contiguous.) This field MUST be present if the I bit
	is equal to one. If M is set to zero, 7 bits carry the PID; else,		is equal to one. If M is set to zero, 7 bits carry the PID; else,
	if M is set to one, 15 bits carry the PID in network byte order.		if M is set to one, 15 bits carry the PID in network byte order.
	The sender may choose between a 7- or 15-bit index. The PID		The sender may choose between a 7- or 15-bit index. The PID
	SHOULD start on a random number and MUST wrap after reaching the		SHOULD start on a random number and MUST wrap after reaching the
	maximum ID (0x7f or 0x7fff depending on the index size chosen).		maximum ID (0x7f or 0x7fff depending on the index size chosen).
	The receiver MUST NOT assume that the number of bits in the PID		The receiver MUST NOT assume that the number of bits in the PID
	stays the same through the session. If this field transitions		stays the same through the session. If this field transitions
	from 7 bits to 15 bits, the value is zero-extended (i.e., the		from 7 bits to 15 bits, the value is zero-extended (i.e., the
	value after 0x6e is 0x006f); if the field transitions from 15 bits		value after 0x6e is 0x006f); if the field transitions from 15 bits

	to 7 bits, it is truncated (i.e., the value after 0x1bbe is 0xbf).		to 7 bits, it is truncated (i.e., the value after 0x1bbe is 0x3f).

	In the non-flexible mode (when the F bit is set to zero), this PID		In the non-flexible mode (when the F bit is set to zero), this PID
	is used as an index to the PG specified in the SS data below. In		is used as an index to the PG specified in the SS data below. In
	this mode, the PID of the keyframe corresponds to the first		this mode, the PID of the keyframe corresponds to the first
	specified frame in the PG. Then subsequent PIDs are mapped to		specified frame in the PG. Then subsequent PIDs are mapped to
	subsequently specified frames in the PG (modulo N_G, specified in		subsequently specified frames in the PG (modulo N_G, specified in
	the SS data below), respectively.		the SS data below), respectively.

	All frames of the same picture MUST have the same PID value.		All frames of the same picture MUST have the same PID value.

	Frames (and their corresponding pictures) with the VP9 show_frame		Frames (and their corresponding pictures) with the VP9 show_frame

	field equal to 0 MUST have distinct PID values from subsequent		field equal to zero MUST have distinct PID values from subsequent
	pictures with show_frame equal to 1. Thus, a picture (as defined		pictures with show_frame equal to one. Thus, a picture (as
	in this specification) is different than a VP9 superframe.		defined in this specification) is different than a VP9 superframe.

	All frames of the same picture MUST have the same value for		All frames of the same picture MUST have the same value for
	show_frame.		show_frame.
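
The PID encoding described above (M flag, 7- or 15-bit value, wrap, and width transitions) can be illustrated with the following Python sketch. The function names are hypothetical. The transition behaviour reflects one reading of the text: increment first, then keep only the bits that fit the new width. That reading reproduces the 0x006f example and the 0x3f value shown in the right-hand column.

```python
from typing import Tuple


def parse_pid(data: bytes) -> Tuple[int, int, int]:
    """Parse the PID field; return (pid, pid_bits, octets_consumed).

    `data` must start at the octet that follows the mandatory first
    octet, and the caller must already have checked that I is one.
    """
    if not data:
        raise ValueError("PID field missing")
    if data[0] & 0x80:                      # M set: 15-bit PID, network order
        if len(data) < 2:
            raise ValueError("truncated 15-bit PID")
        return ((data[0] & 0x7F) << 8) | data[1], 15, 2
    return data[0] & 0x7F, 7, 1             # M clear: 7-bit PID


def next_pid(pid: int, next_bits: int) -> int:
    """PID of the following picture when it is sent with `next_bits` bits.

    With an unchanged width this is the ordinary wrap at 0x7f or 0x7fff.
    """
    if next_bits not in (7, 15):
        raise ValueError("PID is either 7 or 15 bits")
    return (pid + 1) & ((1 << next_bits) - 1)
```

For example, `next_pid(0x6e, 15)` gives 0x006f and `next_pid(0x1bbe, 7)` gives 0x3f, while `next_pid(0x7f, 7)` wraps to zero.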

	Layer indices: This field is optional but RECOMMENDED whenever		Layer indices: This field is optional but RECOMMENDED whenever
	encoding with layers. For both flexible and non-flexible modes,		encoding with layers. For both flexible and non-flexible modes,
	one octet is used to specify a layer frame's Temporal-layer ID		one octet is used to specify a layer frame's Temporal-layer ID

	(TID) and Spatial-layer ID (SID) as shown both in Figure 2 and		(TID) and Spatial-layer ID (SID) as shown both in Figures 2 and 3.
	Figure 3. Additionally, a bit (U) is used to indicate that the		Additionally, a bit (U) is used to indicate that the current frame
	current frame is a "switching up point" frame. Another bit (D) is		is a "switching up point" frame. Another bit (D) is used to
	used to indicate whether inter-layer prediction is used for the		indicate whether inter-layer prediction is used for the current
	current frame.		frame.

	In the non-flexible mode (when the F bit is set to zero), another		In the non-flexible mode (when the F bit is set to zero), another

	octet is used to represent Temporal Layer 0 Picture Index (8 bits)		octet is used to represent the Temporal Layer 0 Picture Index (8
	(TL0PICIDX), as depicted in Figure 3. The TL0PICIDX is present so		bits) (TL0PICIDX), as depicted in Figure 3. The TL0PICIDX is
	that all minimally required frames (the base temporal-layer		present so that all minimally required frames (the base temporal-
	frames) can be tracked.		layer frames) can be tracked.

	The TID and SID fields indicate the temporal and spatial layers		The TID and SID fields indicate the temporal and spatial layers
	and can help middleboxes and endpoints quickly identify which		and can help middleboxes and endpoints quickly identify which
	layer a packet belongs to.		layer a packet belongs to.

	TID: The temporal-layer ID of the current frame. In the case of		TID: The temporal-layer ID of the current frame. In the case of
	non-flexible mode, if a PID is mapped to a picture in a		non-flexible mode, if a PID is mapped to a picture in a
	specified PG, then the value of the TID MUST match the		specified PG, then the value of the TID MUST match the
	corresponding TID value of the mapped picture in the PG.		corresponding TID value of the mapped picture in the PG.

	U: Switching up point. When this bit is set to one, if the		U: Switching up point. When this bit is set to one, if the
	current picture has a temporal-layer ID equal to value T, then		current picture has a temporal-layer ID equal to value T, then
	subsequent pictures with temporal-layer ID values higher than T		subsequent pictures with temporal-layer ID values higher than T
	will not depend on any picture before the current picture (in		will not depend on any picture before the current picture (in

	coding order) with a temporal-layer ID value greater than T.		decode order) with a temporal-layer ID value greater than T.

	SID: The spatial-layer ID of the current frame. Note that frames		SID: The spatial-layer ID of the current frame. Note that frames
	with spatial-layer SID > 0 may be dependent on decoded spatial-		with spatial-layer SID > 0 may be dependent on decoded spatial-
	layer SID-1 frame within the same picture. Different frames of		layer SID-1 frame within the same picture. Different frames of
	the same picture MUST have distinct spatial-layer IDs, and		the same picture MUST have distinct spatial-layer IDs, and
	frames' spatial layers MUST appear in increasing order within		frames' spatial layers MUST appear in increasing order within
	the frame.		the frame.

	D: Inter-layer dependency is used. D MUST be set to one if and		D: Inter-layer dependency is used. D MUST be set to one if and
	only if the current spatial-layer SID frame depends on spatial-		only if the current spatial-layer SID frame depends on spatial-
	layer SID-1 frame of the same picture; otherwise, it MUST be		layer SID-1 frame of the same picture; otherwise, it MUST be

	set to zero. For the base-layer frame (with SID equal to 0),		set to zero. For the base-layer frame (with SID equal to
	the D bit MUST be set to zero.		zero), the D bit MUST be set to zero.

	TL0PICIDX: Temporal Layer 0 Picture Index (8 bits). TL0PICIDX is		TL0PICIDX: Temporal Layer 0 Picture Index (8 bits). TL0PICIDX is
	only present in the non-flexible mode (F = 0). This is a		only present in the non-flexible mode (F = 0). This is a
	running index for the temporal base-layer pictures, i.e., the		running index for the temporal base-layer pictures, i.e., the

	pictures with a TID set to zero. If the TID is larger than 0,		pictures with a TID set to zero. If the TID is larger than
	TL0PICIDX indicates which temporal base-layer picture the		zero, TL0PICIDX indicates which temporal base-layer picture the
	current picture depends on. TL0PICIDX MUST be incremented by 1		current picture depends on. TL0PICIDX MUST be incremented by
	when the TID is equal to 0. The index SHOULD start on a random		one when the TID is equal to zero. The index SHOULD start on a
	number and MUST restart at 0 after reaching the maximum number		random number and MUST restart at zero after reaching the
	255.		maximum number 255.
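
As a rough illustration of the layer indices octet and the TL0PICIDX rule above, the Python sketch below unpacks TID, U, SID, and D and advances the running index. The field widths (3, 1, 3, and 1 bits, most significant first) are taken from the L row of Figure 3 as published; they cannot be read off the flattened figure in this diff, so treat them as an assumption here. Both function names are hypothetical.

```python
from typing import Dict


def parse_layer_indices(octet: int) -> Dict[str, int]:
    """Split the layer indices octet into its four fields."""
    return {
        "tid": (octet >> 5) & 0x07,   # temporal-layer ID
        "u": (octet >> 4) & 0x01,     # switching up point
        "sid": (octet >> 1) & 0x07,   # spatial-layer ID
        "d": octet & 0x01,            # inter-layer dependency used
    }


def tl0picidx_for_picture(previous: int, tid: int) -> int:
    """TL0PICIDX carried by a picture, given the previous picture's value.

    The index advances (restarting at zero after 255) only when the
    picture is in the temporal base layer; otherwise it repeats the
    index of the base-layer picture it depends on.
    """
    if tid == 0:
        return (previous + 1) & 0xFF
    return previous
```

A receiver can use the repeated index on TID > 0 pictures to detect that a required base-layer picture was lost.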

	Reference indices: When P and F are both set to one, indicating a		Reference indices: When P and F are both set to one, indicating a
	non-keyframe in flexible mode, then at least one reference index		non-keyframe in flexible mode, then at least one reference index
	MUST be specified as below. Additional reference indices (a total		MUST be specified as below. Additional reference indices (a total
	of up to three reference indices are allowed) may be specified		of up to three reference indices are allowed) may be specified
	using the N bit below. When either P or F is set to zero, then no		using the N bit below. When either P or F is set to zero, then no
	reference index is specified.		reference index is specified.

	P_DIFF: The reference index (in 7 bits) specified as the relative		P_DIFF: The reference index (in 7 bits) specified as the relative
	PID from the current picture. For example, when P_DIFF=3 on a		PID from the current picture. For example, when P_DIFF=3 on a
	packet containing the picture with PID 112 means that the		packet containing the picture with PID 112 means that the
	picture refers back to the picture with PID 109. This		picture refers back to the picture with PID 109. This
	calculation is done modulo the size of the PID field, i.e.,		calculation is done modulo the size of the PID field, i.e.,

	either 7 or 15 bits. A P_DIFF value of 0 is invalid.		either 7 or 15 bits. A P_DIFF value of zero is invalid.

	N: 1 if there is additional P_DIFF following the current P_DIFF.		N: 1 if there is additional P_DIFF following the current P_DIFF.
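
The reference index rules above can be sketched as follows in Python. The octet layout (P_DIFF in the upper 7 bits, N in the least significant bit) comes from Figure 2 as published, which is outside the lines shown in this diff, and the function names are hypothetical.

```python
from typing import List


def parse_p_diffs(data: bytes) -> List[int]:
    """Read one to three P_DIFF values, following the N bit."""
    diffs: List[int] = []
    for index in range(3):
        if index >= len(data):
            raise ValueError("truncated reference indices")
        octet = data[index]
        p_diff, n_bit = octet >> 1, octet & 0x01
        if p_diff == 0:
            raise ValueError("a P_DIFF value of zero is invalid")
        diffs.append(p_diff)
        if not n_bit:
            return diffs
    raise ValueError("more than three reference indices signalled")


def referenced_pid(pid: int, p_diff: int, pid_bits: int) -> int:
    """PID of the referenced picture, modulo the size of the PID field."""
    if pid_bits not in (7, 15):
        raise ValueError("PID is either 7 or 15 bits")
    if not 1 <= p_diff <= 0x7F:
        raise ValueError("P_DIFF must be in the range 1..127")
    return (pid - p_diff) % (1 << pid_bits)
```

With the values from the text, `referenced_pid(112, 3, 7)` yields 109. Near a wrap, `referenced_pid(1, 3, 7)` yields 126.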

	4.2.1. Scalability Structure (SS)		4.2.1. Scalability Structure (SS)

	The SS data describes the resolution of each frame within a picture		The SS data describes the resolution of each frame within a picture
	as well as the inter-picture dependencies for a PG. If the VP9		as well as the inter-picture dependencies for a PG. If the VP9
	payload descriptor's V bit is set, the SS data is present in the		payload descriptor's V bit is set, the SS data is present in the
	position indicated in Figures 2 and 3.		position indicated in Figures 2 and 3.


	skipping to change at line 536 ¶		skipping to change at line 542 ¶
	one, the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be		one, the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be
	present for each layer frame. Otherwise, the resolution MUST NOT		present for each layer frame. Otherwise, the resolution MUST NOT
	be present.		be present.

	G: The PG description present flag.		G: The PG description present flag.

	-: A bit reserved for future use. It MUST be set to zero and MUST		-: A bit reserved for future use. It MUST be set to zero and MUST
	be ignored by the receiver.		be ignored by the receiver.

	N_G: N_G indicates the number of pictures in a PG. If N_G is		N_G: N_G indicates the number of pictures in a PG. If N_G is

	greater than 0, then the SS data allows the inter-picture		greater than zero, then the SS data allows the inter-picture
	dependency structure of the VP9 stream to be pre-declared, rather		dependency structure of the VP9 stream to be pre-declared, rather
	than indicating it on the fly with every packet. If N_G is		than indicating it on the fly with every packet. If N_G is

	greater than 0, then for N_G pictures in the PG, each picture's		greater than zero, then for N_G pictures in the PG, each picture's
	Temporal-layer ID (TID), switch up point (U), and reference		Temporal-layer ID (TID), switch up point (U), and reference
	indices (P_DIFFs) are specified.		indices (P_DIFFs) are specified.

	The first picture specified in the PG MUST have a TID set to zero.		The first picture specified in the PG MUST have a TID set to zero.

	G set to zero or N_G set to zero indicates that either there is		G set to zero or N_G set to zero indicates that either there is
	only one temporal layer (for non-flexible mode) or no fixed inter-		only one temporal layer (for non-flexible mode) or no fixed inter-
	picture dependency information is present (for flexible mode)		picture dependency information is present (for flexible mode)
	going forward in the bitstream.		going forward in the bitstream.


	skipping to change at line 561 ¶		skipping to change at line 567 ¶
	picture dependency structure. However, the frame rate of each		picture dependency structure. However, the frame rate of each
	spatial layer can be different from each other; this can be		spatial layer can be different from each other; this can be
	described with the use of the D bit described above. The		described with the use of the D bit described above. The
	specified dependency structure in the SS data MUST be for the		specified dependency structure in the SS data MUST be for the
	highest frame rate layer.		highest frame rate layer.

	R: The number of P_DIFF fields that are present.		R: The number of P_DIFF fields that are present.

	In a scalable stream sent with a fixed pattern, the SS data SHOULD be		In a scalable stream sent with a fixed pattern, the SS data SHOULD be
	included in the first packet of every key frame. This is a packet		included in the first packet of every key frame. This is a packet

	with the P bit equal to 0, SID or L bit equal to 0, and B bit equal		with the P bit equal to zero, SID or L bit equal to zero, and B bit
	to 1. The SS data MUST only be changed on the picture that		equal to one. The SS data MUST only be changed on the picture that
	corresponds to the first picture specified in the previous SS data's		corresponds to the first picture specified in the previous SS data's

	PG (if the previous SS data's N_G was greater than 0).		PG (if the previous SS data's N_G was greater than zero).
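
The SS fields described above can be read with a short parser. The following is a minimal sketch in Python, not a reference implementation. It assumes the octet layout of the SS figure in the full specification, which this excerpt does not reproduce: a first octet carrying N_S (3 bits), Y, G, and three reserved bits; WIDTH and HEIGHT for each of the N_S + 1 layers when Y is set; an N_G octet when G is set; and, for each picture, one octet carrying TID (3 bits), U, R (2 bits), and two reserved bits, followed by R P_DIFF octets. The names parse_ss, ScalabilityStructure, and PgPicture are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PgPicture:
    tid: int            # Temporal-layer ID
    switch_up: bool     # U bit
    p_diffs: List[int]  # R reference indices


@dataclass
class ScalabilityStructure:
    resolutions: List[Tuple[int, int]] = field(default_factory=list)
    pictures: List[PgPicture] = field(default_factory=list)


def parse_ss(buf: bytes) -> Tuple[ScalabilityStructure, int]:
    """Parse SS data starting at buf[0]; return (ss, octets consumed).

    Assumed layout (taken from the SS figure, not from this excerpt):
    N_S(3) Y(1) G(1) reserved(3), then optional per-layer WIDTH/HEIGHT,
    then optional N_G and per-picture TID(3) U(1) R(2) reserved(2).
    """
    def need(count: int, pos: int) -> None:
        # Bounds check so that malformed input cannot cause an overrun.
        if pos + count > len(buf):
            raise ValueError("truncated SS data")

    need(1, 0)
    n_s = buf[0] >> 5
    y = bool(buf[0] & 0x10)
    g = bool(buf[0] & 0x08)
    pos = 1
    ss = ScalabilityStructure()

    if y:
        # WIDTH and HEIGHT (2 octets each) are present for each layer frame.
        for _ in range(n_s + 1):
            need(4, pos)
            width = int.from_bytes(buf[pos:pos + 2], "big")
            height = int.from_bytes(buf[pos + 2:pos + 4], "big")
            ss.resolutions.append((width, height))
            pos += 4

    if g:
        need(1, pos)
        n_g = buf[pos]
        pos += 1
        for _ in range(n_g):
            need(1, pos)
            octet = buf[pos]
            pos += 1
            r = (octet >> 2) & 0x03
            need(r, pos)
            p_diffs = list(buf[pos:pos + r])
            pos += r
            ss.pictures.append(PgPicture(octet >> 5, bool(octet & 0x10), p_diffs))
        # The first picture specified in the PG MUST have a TID of zero.
        if ss.pictures and ss.pictures[0].tid != 0:
            raise ValueError("first picture in PG must have TID 0")

    return ss, pos
```

The reserved bits are ignored on receipt, as the text requires. An empty pictures list corresponds to G or N_G being zero. Truncated input raises ValueError instead of reading past the end of the buffer.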

	4.3. Frame Fragmentation		4.3. Frame Fragmentation

	VP9 frames are fragmented into packets in RTP sequence number order:		VP9 frames are fragmented into packets in RTP sequence number order:
	beginning with a packet with the B bit set and ending with a packet		beginning with a packet with the B bit set and ending with a packet
	with the E bit set. There is no mechanism for finer-grained access		with the E bit set. There is no mechanism for finer-grained access
	to parts of a VP9 frame.		to parts of a VP9 frame.
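
The fragmentation rule above can be illustrated with a receiver-side reassembly loop. This is a hedged sketch rather than a complete depacketizer: it assumes that packets are presented in RTP sequence number order, that the payload descriptor has already been stripped, and that any gap in sequence numbers invalidates the frame in progress. The names Vp9Packet and reassemble_frames are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class Vp9Packet(NamedTuple):
    seq: int        # RTP sequence number (16 bits)
    b: bool         # B bit: start of a frame
    e: bool         # E bit: end of a frame
    payload: bytes  # VP9 data following the payload descriptor


def reassemble_frames(packets: Iterable[Vp9Packet]) -> List[bytes]:
    """Join payloads from a B-bit packet through the matching E-bit packet."""
    frames: List[bytes] = []
    parts: Optional[List[bytes]] = None  # None: no frame in progress
    prev_seq: Optional[int] = None

    for pkt in packets:
        # Sequence numbers wrap modulo 2**16.
        contiguous = prev_seq is not None and pkt.seq == (prev_seq + 1) & 0xFFFF
        prev_seq = pkt.seq

        if pkt.b:
            # A new frame starts; any unfinished frame is abandoned.
            parts = [pkt.payload]
        elif parts is not None and contiguous:
            parts.append(pkt.payload)
        else:
            # Continuation without a start, or a gap: drop the partial frame.
            parts = None
            continue

        if pkt.e:
            frames.append(b"".join(parts))
            parts = None

    return frames
```

Because there is no finer-grained access to parts of a frame, the sketch discards a frame entirely when any of its packets is missing.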

	4.4. Scalable Encoding Considerations		4.4. Scalable Encoding Considerations


	skipping to change at line 598 ¶		skipping to change at line 604 ¶

	For spatially scalable streams, this means that		For spatially scalable streams, this means that
	"error_resilient_mode" needs to be turned on for the base spatial		"error_resilient_mode" needs to be turned on for the base spatial
	layer; however, it can be turned off for higher spatial layers,		layer; however, it can be turned off for higher spatial layers,
	assuming they are sent with inter-layer dependency (i.e., with the D		assuming they are sent with inter-layer dependency (i.e., with the D
	bit set). For streams that are only temporally scalable without		bit set). For streams that are only temporally scalable without
	spatial scalability, "error_resilient_mode" can additionally be		spatial scalability, "error_resilient_mode" can additionally be
	turned off for any picture that immediately follows a temporal-layer		turned off for any picture that immediately follows a temporal-layer
	0 frame.		0 frame.


	4.5. Examples of VP9 RTP Stream		4.5. Example of a VP9 RTP Stream

	4.5.1. Reference Picture Use for Scalable Structure		4.5.1. Reference Picture Use for Scalable Structure

	As discussed in Section 3, the VP9 codec can maintain up to eight		As discussed in Section 3, the VP9 codec can maintain up to eight
	reference frames, of which up to three can be referenced or updated		reference frames, of which up to three can be referenced or updated
	by any new frame. This section illustrates one way that a scalable		by any new frame. This section illustrates one way that a scalable
	structure (with three spatial layers and three temporal layers) can		structure (with three spatial layers and three temporal layers) can
	be constructed using these reference frames.		be constructed using these reference frames.

	+==========+=========+============+=========+		+==========+=========+============+=========+

	skipping to change at line 913 ¶		skipping to change at line 919 ¶
	are subject to the security considerations discussed in the RTP		are subject to the security considerations discussed in the RTP
	specification [RFC3550], and in any applicable RTP profile such as		specification [RFC3550], and in any applicable RTP profile such as
	RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/		RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
	SAVPF [RFC5124]. However, as "Securing the RTP Framework: Why RTP		SAVPF [RFC5124]. However, as "Securing the RTP Framework: Why RTP
	Does Not Mandate a Single Media Security Solution" [RFC7202]		Does Not Mandate a Single Media Security Solution" [RFC7202]
	discusses, it is not an RTP payload format's responsibility to		discusses, it is not an RTP payload format's responsibility to
	discuss or mandate what solutions are used to meet the basic security		discuss or mandate what solutions are used to meet the basic security
	goals like confidentiality, integrity, and source authenticity for		goals like confidentiality, integrity, and source authenticity for
	RTP in general. This responsibility lies with anyone using RTP in an		RTP in general. This responsibility lies with anyone using RTP in an
	application. They can find guidance on available security mechanisms		application. They can find guidance on available security mechanisms

	in "Options for Securing RTP Sessions [RFC7201]. Applications SHOULD		in "Options for Securing RTP Sessions" [RFC7201]. Applications
	use one or more appropriate strong security mechanisms.		SHOULD use one or more appropriate strong security mechanisms.

	Implementations of this RTP payload format need to take appropriate		Implementations of this RTP payload format need to take appropriate
	security considerations into account. It is extremely important for		security considerations into account. It is extremely important for
	the decoder to be robust against malicious or malformed payloads and		the decoder to be robust against malicious or malformed payloads and
	ensure that they do not cause the decoder to overrun its allocated		ensure that they do not cause the decoder to overrun its allocated
	memory or otherwise misbehave. An overrun in allocated memory could		memory or otherwise misbehave. An overrun in allocated memory could
	lead to arbitrary code execution by an attacker. The same applies to		lead to arbitrary code execution by an attacker. The same applies to
	the encoder, even though problems in encoders are (typically) rarer.		the encoder, even though problems in encoders are (typically) rarer.

	This RTP payload format and its media decoder do not exhibit any		This RTP payload format and its media decoder do not exhibit any

	skipping to change at line 948 ¶		skipping to change at line 954 ¶
	non-reference frames and discard them in order to reduce network		non-reference frames and discard them in order to reduce network
	congestion. Note that discarding of non-reference frames cannot be		congestion. Note that discarding of non-reference frames cannot be
	done if the stream is encrypted (because the non-reference marker is		done if the stream is encrypted (because the non-reference marker is
	encrypted).		encrypted).

	10. IANA Considerations		10. IANA Considerations

	IANA has registered the media type registration "video/vp9" as		IANA has registered the media type registration "video/vp9" as
	specified in Section 7. The media type has also been added to the		specified in Section 7. The media type has also been added to the
	"RTP Payload Format Media Types" <https://www.iana.org/assignments/		"RTP Payload Format Media Types" <https://www.iana.org/assignments/

	rtp-parameters> subregistry of the "Real-Time Transport Protocol		rtp-parameters> registry of the "Real-Time Transport Protocol (RTP)
	(RTP) Parameters" registry as follows.		Parameters" registry group as follows.

	Media Type: video		Media Type: video
	Subtype: VP9		Subtype: VP9
	Clock Rate (Hz): 90000		Clock Rate (Hz): 90000
	Reference: RFC 9628		Reference: RFC 9628

	11. References		11. References

	11.1. Normative References		11.1. Normative References


	skipping to change at line 1062 ¶		skipping to change at line 1068 ¶
	Acknowledgments		Acknowledgments

	Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo,		Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo,
	Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert		Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert
	commented on the development of this document and provided helpful		commented on the development of this document and provided helpful
	feedback.		feedback.

	Authors' Addresses		Authors' Addresses

	Justin Uberti		Justin Uberti

	Google, Inc.		OpenAI
	747 6th Street South		747 6th Street South
	Kirkland, WA 98033		Kirkland, WA 98033
	United States of America		United States of America
	Email: justin@uberti.name		Email: justin@uberti.name

	Stefan Holmer		Stefan Holmer
	Google, Inc.		Google, Inc.
	Kungsbron 2		Kungsbron 2
	SE-111 22 Stockholm		SE-111 22 Stockholm
	Sweden		Sweden

	skipping to change at line 1084 ¶		skipping to change at line 1090 ¶

	Magnus Flodman		Magnus Flodman
	Google, Inc.		Google, Inc.
	Kungsbron 2		Kungsbron 2
	SE-111 22 Stockholm		SE-111 22 Stockholm
	Sweden		Sweden
	Email: mflodman@google.com		Email: mflodman@google.com

	Danny Hong		Danny Hong
	Google, Inc.		Google, Inc.

	1585 Charleston Road		315 Hudson St.
	Mountain View, CA 94043		New York, NY 10013
	United States of America		United States of America
	Email: dannyhong@google.com		Email: dannyhong@google.com

	Jonathan Lennox		Jonathan Lennox
	8x8, Inc. / Jitsi		8x8, Inc. / Jitsi
	Jersey City, NJ 07302		Jersey City, NJ 07302
	United States of America		United States of America
	Email: jonathan.lennox@8x8.com		Email: jonathan.lennox@8x8.com

End of changes. 31 change blocks.
	68 lines changed or deleted		74 lines changed or added
This html diff was produced by rfcdiff 1.48.