At Milia, several people caught a severe case of hype from a company called PacketVideo.
Apparently they use MPEG-4 to transport audio, video and 3D content across
networks, even wireless ones. They were very interested in working with Blender
and think it could become the de facto 3D component in MPEG-4. This is important
because MPEG-4 is well on its way into silicon and is making its way
onto mobile devices (WinCE, EPOC; see an article
in InternetNews).
This document aims to cut through the hype and explain what MPEG-4 is, what
it can and cannot do, and how NaN and Blender could benefit from it (or not).
This primer is based on "MPEG-4: A Multimedia Standard for the Third Millennium", parts 1 and 2.
The IEEE has an excellent MPEG-4 introduction.
MPEG-4 (from the Moving Picture Experts Group) defines a multimedia system for interoperable communication of complex scenes containing audio, video, synthetic audio and graphics material. Unlike MPEG-1 and MPEG-2, which focused on better compression efficiency, the emphasis in MPEG-4 is on new functionality. Mobile as well as stationary user terminals, database access, communications and new types of interactive services will be major applications for MPEG-4.
The MPEG-4 standard provides ways to encode audio and video into so-called media objects for transmission. It describes ways to combine several of these objects into compound media objects and form audiovisual scenes. (See an example.) The standard lists how the data associated with these scenes can be multiplexed and synchronized on network channels providing a Quality of Service (QoS) appropriate for the nature of the specific media objects. Finally, it lets users interact with the audiovisual scene. A scene has a number of video objects, of possibly different shapes, plus a number of audio objects, possibly associated with video objects. Objects can be natural or synthetic, i.e. computer generated.
Visual objects in a scene are described mathematically and given a position in a two- or three-dimensional space. Similarly, audio objects are placed in a sound space. When placed in 3-D space, the video or audio object need only be defined once; the viewer can change his vantage point, and the calculations to update the screen and sound are done locally, at the user's terminal. This is a critical feature if the response is to be fast and the available bit-rate is limited, or when no return channel is available, as in broadcast situations.
MPEG-4's language for describing and dynamically changing the scene is called
the Binary Format for Scenes (BIFS). BIFS commands are available not
only to add objects to or delete them from a scene, but also to change visual
or acoustic properties of an object without changing the object itself; thus
only the color of a 3D sphere might be varied.
BIFS can be used to animate objects just by sending the proper BIFS command
and to define the behavior of those objects in response to user input at the
decoder. It can also be used to put an application screen (such as a Web browser's)
as a "texture" in the scene or map a video stream as a texture onto a virtual
TV screen.
BIFS borrows many concepts from the Virtual Reality Modeling Language (VRML),
which is the method used most widely on the Internet to describe 3D objects
and users' interaction with them. BIFS and VRML can be seen as different representations
of the same data. In VRML, the objects and their actions are described in text,
as in any other high-level language. But BIFS code is binary and is thus shorter
for the same content - typically 10 to 15 times.
More importantly, unlike VRML, MPEG-4 uses BIFS for real-time streaming. That is, a scene does not need to be downloaded in full before it can be played, but can be built up on the fly. Lastly, BIFS allows defining 2D objects such as lines and rectangles. For more information, you may want to read the BIFS FAQ.
Content is encoded before it is transmitted. Full MPEG-4 (a complex compound scene with text, 3D objects, streaming audio and video) is more than just the DivX encoding you see frequently these days; full encoding is a bit more complex. The media objects are assigned to so-called Elementary Streams (ESs). An object may get one or more streams, depending on its characteristics. The basic positioning information also gets its own stream. If the MPEG-4 scene is interactive, an interaction stream (aka upload stream) is added as well.
In the figure on the left, complex content is encoded into several streams: an audio stream (A), two video streams (V) for low- and high-resolution video, a 2D/3D stream (G, for geometry), a basic positioning stream (P) and an upload stream to capture any interactivity from the user. All this is packaged into an MPEG-4 channel that is transmitted across the network to a user. The user's player receives the channel and decodes it into separate streams. These streams are then fed to the MPEG-4 player's "renderer", which composes the original scene again and puts it onto the user's screen. (See below for very detailed technical info on this.)
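As a rough illustration of this stream assignment, here is a Python sketch (not based on any real MPEG-4 library; the stream identifiers simply follow the figure) that tags media-object data with elementary-stream identifiers, interleaves everything into one channel, and splits it back apart on the player side.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Packet:
    stream_id: str   # e.g. "A", "V1", "V2", "G", "P"
    payload: bytes

def multiplex(streams: dict[str, list[bytes]]) -> list[Packet]:
    """Interleave per-stream payloads into one channel (simple round-robin).
    A real MPEG-4 multiplex also adds timing headers; see the SL/FML/TML
    discussion later in this document."""
    channel = []
    longest = max(len(chunks) for chunks in streams.values())
    for i in range(longest):
        for stream_id, chunks in streams.items():
            if i < len(chunks):
                channel.append(Packet(stream_id, chunks[i]))
    return channel

def demultiplex(channel: list[Packet]) -> dict[str, list[bytes]]:
    """The player side: split the channel back into separate streams before
    handing them to the renderer/compositor."""
    streams = defaultdict(list)
    for packet in channel:
        streams[packet.stream_id].append(packet.payload)
    return dict(streams)

if __name__ == "__main__":
    content = {
        "A":  [b"audio0", b"audio1"],     # audio
        "V1": [b"lo-res0", b"lo-res1"],   # low-resolution video
        "V2": [b"hi-res0", b"hi-res1"],   # high-resolution video
        "G":  [b"geometry0"],             # 2D/3D geometry
        "P":  [b"positions0"],            # basic positioning
    }
    channel = multiplex(content)
    assert demultiplex(channel) == content
```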
Another important addition in MPEG-4 is a file format known as mp4, which can be used for exchange of content and which is easily converted. It will be the only reliable way for users to exchange complete files of MPEG-4 content. Content which is present in mp4 format can be decoded by any player, much like audio mp3 files can be played by any player. This is unlike MPEG-2 files, which can only be viewed by a player that supports the same codec as the one used to create the file.
MPEG-4 can have an impact in a few areas:
I'll go into each of these below.
2.1. General
For NaN, it is interesting to know how MPEG-4 combines objects into compound
objects or scenes. It uses a hierarchical scene structure, a graph, to describe
the relationships between objects in the scene. Even more interesting
is that it adopted much of the VRML way of composing objects into a scene.
VRML (Virtual Reality Modeling Language) specifies the composition of complex
scenes containing 3D material plus audio and video streams.
The result is an MPEG-4 standard that specifies how objects representing natural
audio and video with streams attached (microphones and cameras) can be combined
with objects representing synthetic audio and video (2D and 3D material) with
streams attached (3D rooms, faces, figures, etc.). In addition, the standard
conforms to the VRML syntax, with extensions; however, MPEG-4 scenes are binary
encoded for the sake of efficiency (the format is called BIFS). But, just like VRML,
the scene descriptions can be generated in textual format, with or without
help from an authoring tool.
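To make the text-versus-binary point concrete, here is a small Python sketch; the node layout and the packed field codes are invented for illustration and are not real BIFS syntax.

```python
import struct

# A VRML-like textual description of a simple scene (illustrative only).
scene_text = """\
Transform {
  translation 1.0 2.0 0.5
  children [
    Shape {
      appearance Appearance { material Material { diffuseColor 0.8 0.1 0.1 } }
      geometry Sphere { radius 0.25 }
    }
  ]
}
"""

# A made-up binary encoding of the same information: a node-type code,
# three floats for the translation, three for the color, one for the radius.
NODE_SPHERE = 7  # hypothetical node-type code
scene_binary = struct.pack(
    "<B3f3ff",
    NODE_SPHERE,
    1.0, 2.0, 0.5,   # translation
    0.8, 0.1, 0.1,   # diffuse color
    0.25,            # radius
)

print(len(scene_text.encode()), "bytes of text")
print(len(scene_binary), "bytes of binary")  # several times smaller than the text
```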
Blender would natively create mp4 files. These could then be played anywhere
or distributed over the Net. An MPEG-4 encoder would need to be added to Blender.
This is a very tricky subject and the one with the most impact on Blender. It
requires a detailed understanding of MPEG-4, specifically how it deals with non-AV
content and how that content is encoded in the stream. However, how MPEG-4 encodes
3D content is still a large research topic (2D is just treated as flat 3D). It seems
the initial scene is constructed internally, perhaps as surfaces only. This scene
is then converted to VRML, and all subsequent scenes are encoded as translations
and rotations of the original one.
If this is true, then Blender would compose the scene and convert it from its internal
format to VRML. The VRML would then be converted to MPEG-4 in the way
outlined above. Instead of using the rasterizer function to visualize the scene,
it would be fed directly into an encoder of some kind.
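If that interpretation is correct, the export path could look roughly like the sketch below. Everything in it is hypothetical: `scene_to_vrml` and `vrml_to_mp4` stand in for an internal-format-to-VRML converter and a BIFS/mp4 encoder that would have to be written.

```python
def scene_to_vrml(scene) -> str:
    """Hypothetical converter: walk Blender's internal scene data and emit a
    VRML-style textual scene description (surfaces only, as noted above)."""
    lines = ["#VRML V2.0 utf8"]
    for obj in scene:
        lines.append(f"# object {obj['name']}")
        lines.append("Shape { geometry IndexedFaceSet { } }")  # placeholder geometry
    return "\n".join(lines)

def vrml_to_mp4(vrml_text: str) -> bytes:
    """Hypothetical BIFS/MPEG-4 encoder. A real implementation would parse the
    VRML, binary-encode it as BIFS and wrap it in an mp4 container; here we
    just return the text as bytes to show where the encoder would plug in."""
    return vrml_text.encode("utf-8")

def export_scene(scene, path: str) -> None:
    with open(path, "wb") as out:
        out.write(vrml_to_mp4(scene_to_vrml(scene)))

# Usage sketch:
# export_scene([{"name": "Cube"}, {"name": "Camera"}], "scene.mp4")
```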
Reading mp4 as just another type of content is much easier. It merely requires the addition of an MPEG-4 import filter for audio or texture objects. (Please correct me if I'm wrong here!)
PacketVideo and nGame both have questions about Blender: they wonder whether it can be used for server-side rendering. In server-side rendering, a large and powerful server runs the render engine. Some device presents it with a scene or game, the server renders it and sends the result back to the client. This way, a device that is not itself capable of rendering high-quality images, but can display them, can still use the content.
How does this work? Take a look at the following example for PV.
Figure: Blender server-side rendering and MPEG-4 streams to a mobile device.
Here, the game engine runs on a powerful server. The engine architecture
has several components, called Logic Bricks at NaN. There already is an Input
Device logic brick, for instance for a PS2 DualShock controller, a keyboard,
a mouse, etc. The engine also has a brick, the Rasterizer, that creates video
frames ready to be displayed on a monitor. When a user presses the fire button
(fire rocket at tank), the Input Device brick sends the event to the game engine,
which decides what to do next. The game engine generates the next image (rocket
hits tank) and the Rasterizer creates a frame that can be displayed by the video
card the user has. Several Rasterizers have already been tested; the same logic
brick can easily be used to create OpenGL, DirectX 8 or PS2 frames.
Instead of feeding the output of the Rasterizer to a video card, the frame can
also be fed into an MPEG-4 encoding stream. This stream can then be transmitted
like any mp4 stream to an MPEG-4 player, desktop or mobile! Voila, cheap Blender
content on your PDA or phone. To make it interactive, the user's input can be
transmitted over a basic, standard stream (an Elementary Stream, ES) in MPEG-4. The MPEG-4
server then relays that to the game engine's Input Device brick. It is even
possible to treat any remote device with input capabilities as a Remote Input
Device brick, simply a special kind of device.
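A server-side loop along these lines might look like the following Python sketch. The classes (`GameEngine`, `Rasterizer`, `Mpeg4Encoder`) and the stream handling are stand-ins for the logic bricks and encoder described above, not existing NaN or MPEG-4 APIs.

```python
import queue

class GameEngine:
    """Stand-in for the game engine: applies input events and produces the
    next scene state."""
    def __init__(self):
        self.state = {"frame": 0, "events": []}

    def apply_event(self, event):
        self.state["events"].append(event)

    def step(self):
        self.state["frame"] += 1
        return dict(self.state)

class Rasterizer:
    """Stand-in for a Rasterizer logic brick: turns a scene state into a frame."""
    def render(self, state) -> bytes:
        return f"frame {state['frame']}".encode()

class Mpeg4Encoder:
    """Stand-in for an MPEG-4 encoder feeding the outgoing stream."""
    def encode(self, frame: bytes) -> bytes:
        return b"ES:" + frame  # placeholder for real encoding

def serve(input_stream: queue.Queue, output_stream: list, frames: int = 3) -> None:
    engine, rasterizer, encoder = GameEngine(), Rasterizer(), Mpeg4Encoder()
    for _ in range(frames):
        # 1. Relay user input (the upstream elementary stream) to the engine.
        while not input_stream.empty():
            engine.apply_event(input_stream.get_nowait())
        # 2. Advance the game and render the next frame.
        frame = rasterizer.render(engine.step())
        # 3. Encode and send, instead of handing the frame to a video card.
        output_stream.append(encoder.encode(frame))

if __name__ == "__main__":
    events, channel = queue.Queue(), []
    events.put({"button": "fire"})  # e.g. fire rocket at tank
    serve(events, channel)
    print(channel)                  # encoded frames ready for the mp4 stream
```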
Almost the same story can be told for Blender and WAP, a question nGame asked.
Figure: Blender server-side rendering and output to a mobile device over WAP.
In this case, output from the Rasterizer is not fed into an MPEG-4
encoder but into a special downsizing routine. This scales the image down to
320x200 or 160x160, increases contrast and converts it to a black & white
bitmap, per the WAP bitmap standard. The image can be stored on any web server, and
a WML page stored on the server can then create WAP content for a user to see on
his mobile device.
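Such a downsizing routine could be sketched with the Python Imaging Library (Pillow), assuming it is available; the sketch glosses over the exact WAP WBMP file format and just performs the resize, contrast and 1-bit conversion steps described above.

```python
# Minimal sketch of the downsizing step, assuming Pillow is installed
# (pip install Pillow). A real routine would also emit the WAP WBMP format;
# here we only do resize / contrast / 1-bit conversion and save a stand-in file.
from PIL import Image, ImageOps

def downsize_for_wap(src_path: str, dst_path: str, size=(160, 160)) -> None:
    image = Image.open(src_path).convert("L")   # grayscale
    image = image.resize(size)                  # scale down for the handset
    image = ImageOps.autocontrast(image)        # increase contrast
    image = image.convert("1")                  # 1-bit black & white
    image.save(dst_path)

# downsize_for_wap("render.png", "render_wap.png")
```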
Consider, for instance, a golf game. The user only has to select a club, a swing
and an angle, and hit return. The result of the swing can be calculated by the game
engine. It creates a new view for the user of where the ball is, and the user
makes another swing. Codeonline
is already doing exactly this with their WAP Golf game.
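The per-turn calculation the game engine has to do is modest; a toy version (ignoring drag, spin and real club data, with made-up numbers) might look like this:

```python
import math

def ball_carry(speed_mps: float, launch_angle_deg: float) -> float:
    """Toy projectile model for one swing: how far the ball carries (in metres)
    on flat ground, ignoring air resistance and spin."""
    angle = math.radians(launch_angle_deg)
    g = 9.81  # gravitational acceleration, m/s^2
    return speed_mps ** 2 * math.sin(2 * angle) / g

# Example turn: a 60 m/s swing launched at 12 degrees carries roughly 150 m.
print(round(ball_carry(60.0, 12.0)), "metres")
```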
(From: EET, MPEG-4's role unclear in streaming-media era) Exactly how soon it comes and how big the MPEG-4 market will be are questions still under intense discussion in the industry. Aside from the cell phones and PDAs, it remains unclear if the standard will find its way into enough other markets to justify semiconductor companies' modifying their chips for embedded applications like set-top boxes. The industry is further split on which MPEG-4 profiles, levels and feature sets need to be supported in given applications by servers, client systems and chips.
"The games market is another area where the application of MPEG-4 video, still-texture, interactivity and SNHC shows much promise, with 3-D texture mapping of still images, live video, or extended pre-recorded video sequences enhancing the player experience. Adding live video of users adds to the user experience multi-player 3-D games, as does use of arbitrary-shaped video, where transparency could be combined artistically with 3-D video texture mapping."
The real area of interest to NaN is the SNHC part of MPEG-4: "SNHC deals with the representation and coding of synthetic (2D and 3D graphics) and natural (still images and natural video) audiovisual information." In other words, SNHC is the most important aspect of MPEG-4 for NaN, because it combines mixed media types, including streaming and downloaded audiovisual objects. "Application areas include 2D and 3D graphics, human face and body description and animation, integration of text and graphics, scalable textures encoding, 2D/3D mesh coding, hybrid text-to-speech coding, and synthetic audio coding (structured audio)."
The media integration of text and graphics (MITG) layer provides ways to encode, synchronize and describe the layout of 2D scenes. These can be composed of text, audio, video, synthetic graphic shapes, pointers and annotations. A Layout node specifies placement, spacing, alignment, scrolling and wrapping of objects in the MPEG-4 scene. See figure 2 below.
BIFS 3D nodes, an extension of the 3D nodes defined in VRML, allow the creation of virtual worlds. Behavior can be added to objects through scripts, just like in VRML. These Script nodes contain JavaScript code that defines the behavior. The script can perform object animations, change values of nodes' fields, modify the scene tree, etc. MPEG-4 worlds can be more complex than VRML ones because the world contents are not downloaded but streamed and can be continuously modified by users. (See also below for more detail.)
The scene hierarchy is a graph where each leaf is a media object. The structure of the graph is not necessarily static. As relationships change over time, nodes or subgraphs can be added or deleted. All parameters describing these relationships are part of the scene description that is sent to the decoder. The initial snapshot of a scene is sent or retrieved on a dedicated stream. An update of the scene structure may be sent at any time. These updates can access any field of any updatable node in the scene. Updatable nodes have received a unique identifier in the structure and can be accessed using this identifier. Composition information (information about the initial composition and scene updates during the sequence evolution) is delivered in one elementary stream. The composition stream is treated differently from other streams because it provides the information required by the terminal (that renders the scene) to set up the scene structure and map all other streams to the respective media objects.
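The update mechanism can be pictured with a small Python sketch in which nodes carry unique identifiers and update commands address a node by identifier to replace a single field; the command format here is invented, whereas real BIFS updates are binary insertion, deletion and replacement commands.

```python
# Toy scene tree keyed by node identifier; the field names are illustrative.
scene = {
    "sphere1": {"type": "Sphere", "color": (0.8, 0.1, 0.1), "radius": 0.25},
    "label1":  {"type": "Text",   "string": "Hello"},
}

def apply_update(scene: dict, node_id: str, field: str, value) -> None:
    """Apply one scene update: change a single field of an updatable node,
    without retransmitting the node itself."""
    scene[node_id][field] = value

def delete_node(scene: dict, node_id: str) -> None:
    """Remove a node (or the root of a subtree) from the scene."""
    scene.pop(node_id, None)

# Later updates arriving on the composition stream:
apply_update(scene, "sphere1", "color", (0.1, 0.8, 0.1))  # recolor the sphere
delete_node(scene, "label1")                              # drop the caption
```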
***Note that these data objects could, in theory, be transmitted using Terraplay's API. However, whether Terraplay supports the real-time nature of MPEG-4 streams is unknown. I believe it is unlikely that a fat MPEG-4 video stream can be properly supported by Terraplay: the infrastructure does too many things in the GAS (subscriptions, etc.) to ensure that information is delivered in a timely fashion.***
Because MPEG-4 is intended for use on a wide variety of networks with widely varying performance characteristics, it includes a three-layer multiplex standardized by the Delivery Multimedia Integration Framework (DMIF) working group. The three layers separate the functionality of synchronization, content multiplexing and service (transport) multiplexing.
The goal is to exploit the characteristics of each network, while adding functionality that these environments lack and preserving a homogeneous interface toward the MPEG-4 system. Elementary streams are packetized, adding headers with timing information (clock references) and synchronization data (time stamps). They make up the synchronization layer (SL) of the multiplex. Streams with similar QoS requirements are then multiplexed on a content multiplex layer, termed the flexible multiplex layer (FML). It efficiently interleaves data from a variable number of variable bit-rate streams. A service multiplex layer, known as the transport multiplex layer (TML), can add a variety of levels of QoS and provide framing of its content and error detection. Since this layer is specific to the characteristics of the transport network, the specification of how data from SL or FML streams is packetized into TML streams refers to the definition of the network protocols. MPEG-4 doesn't specify it. Figure 1 shows these three layers and the relationship among them.
Figure 1. General structure of the MPEG-4 multiplex. Different cases have multiple SL streams multiplexed in one FML stream and multiple FML streams multiplexed in one TML stream.
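As a rough picture of the layering shown in Figure 1, here is a Python sketch (the stream names and QoS labels are invented) that groups SL streams with similar QoS requirements into FML streams and then wraps all FML streams into a single TML stream; real framing and error detection are omitted.

```python
from collections import defaultdict

# Illustrative SL streams, each tagged with a made-up QoS class.
sl_streams = {
    "audio":     "low-latency",
    "video_hi":  "high-bandwidth",
    "video_lo":  "high-bandwidth",
    "geometry":  "reliable",
    "positions": "reliable",
}

def build_fml(sl_streams: dict[str, str]) -> dict[str, list[str]]:
    """Group SL streams with similar QoS requirements into FML streams."""
    fml = defaultdict(list)
    for name, qos in sl_streams.items():
        fml[qos].append(name)
    return dict(fml)

def build_tml(fml_streams: dict[str, list[str]]) -> list[tuple[str, list[str]]]:
    """Wrap the FML streams into one transport (TML) stream; the real transport
    layer would add framing and error detection here."""
    return list(fml_streams.items())

print(build_tml(build_fml(sl_streams)))
```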
Elementary streams consist of access units, which correspond to portions of
the stream with a specific decoding time and composition time. As an example,
an elementary stream for a natural video object consists of the coded video
object instances at the refresh rate specific to the video sequence (for example,
the video of a person captured at 25 pictures per second). Or, an elementary
stream for a face model consists of the coded animation parameters instances
at the refresh rate specific to the face model animation (for example, a model
animated to refresh the facial animation parameters 30 times per second). Access
units like a video object instance or a facial animation parameters instance
are the self-contained semantic units in the respective streams, which have
to be decoded and used for composition synchronously with a common system time
base.
Elementary streams are first framed in SL packets, not necessarily matching
the size of the access units in the streams. The header attached by this first
layer contains fields specifying timing information (clock references) and synchronization data (time stamps).
The information contained in the SL headers maintains the correct time base for the elementary decoders and for the receiver terminal, plus the correct synchronization in the presentation of the elementary media objects in the scene. The clock references mechanism supports timing of the system, and the mechanism of time stamps supports synchronization of the different media.
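As a concrete picture of SL packetization, here is a small Python sketch; the header fields and the 90 kHz clock are illustrative choices, not the normative SL packet header syntax.

```python
from dataclasses import dataclass

@dataclass
class SLPacket:
    # Illustrative header fields; the real SL header is a configurable binary
    # syntax, not a fixed record like this.
    clock_reference: int         # sender's time base, in clock ticks
    decoding_time_stamp: int     # when the access unit must be decoded
    composition_time_stamp: int  # when the decoded unit enters the scene
    payload: bytes               # (part of) one access unit

def packetize(access_units: list[bytes], fps: float, clock_hz: int = 90_000):
    """Wrap each access unit of an elementary stream in one SL packet.
    For a 25-Hz video object the time stamps advance by clock_hz / 25 ticks."""
    ticks_per_unit = int(clock_hz / fps)
    packets = []
    for i, unit in enumerate(access_units):
        t = i * ticks_per_unit
        packets.append(SLPacket(t, t, t, unit))
    return packets

# Example: three coded video object instances at 25 pictures per second.
print(packetize([b"vop0", b"vop1", b"vop2"], fps=25.0))
```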
Given the wide range of possible bit rates associated with the elementary streams (ranging, for example, from 1 Kbps for facial animation parameters to 1 Mbps for good-quality video objects), an intermediate multiplex layer provides more flexibility. The SL serves as a tool to associate timing and synchronization data with the coded material. The transport multiplex layer adapts the multiplexed stream to the specific transport or storage media. The intermediate (optional) flexible multiplex layer provides a way to group together several low-bit-rate streams for which the overhead associated with a further level of packetization is not necessary or introduces too much redundancy. With conventional scenes, like the usual audio plus video of a motion picture, this optional multiplex layer can be skipped; the single audio stream and the single video stream can each be mapped to a single transport multiplex stream.
The multiplex layer closest to the transport level depends on the specific transmission or storage system on which the coded information is delivered. The Systems part of MPEG-4 doesn't specify the way SL packets (when no FML is used) or FML packets are mapped on TML packets. The specification simply references several different transport packetization schemes. The "content" packets (the coded media data wrapped by SL headers and FML headers) may be transported directly using an Asynchronous Transfer Mode (ATM) Adaptation Layer 2 (AAL2) scheme for applications over ATM, MPEG-2 transport stream packetization over networks providing that support, or Transmission Control Protocol/Internet Protocol (TCP/IP) for applications over the Internet.
MITG provides a way to encode, synchronize, and describe the layout of 2D scenes
composed of animated text, audio, video, synthetic graphic shapes, pointers,
and annotations. The 2D BIFS graphics objects derive from and are a restriction
of the corresponding VRML 2.0 3D nodes. Many different types of textures can
be mapped on plane objects: still images, moving pictures, complete MPEG-4 scenes,
or even user-defined patterns. Alternatively, many material characteristics
(color, transparency, border type) can be applied on 2D objects.
Other VRML-derived nodes are the interpolators and the sensors. Interpolators
allow predefined object animations like rotations, translations, and morphing.
Sensors generate events that can be redirected to other scene nodes to trigger
actions and animations. The user can generate events, or events can be associated
with particular time instants.
MITG provides a Layout node to specify the placement, spacing, alignment, scrolling,
and wrapping of objects in the MPEG-4 scene. Still images or video objects can
be placed in a scene graph in many ways, and they can be texture-mapped on any
2D object. The most common way, though, is to use the Bitmap node to insert
a rectangular area in the scene in which pixels coming from a video or still
image can be copied.
The 2D scene graphs can contain audio sources by means of Sound2D nodes.
Like visual objects, they must be positioned in space and time. They are subject
to the same spatial transformations as their parents, the nodes hierarchically above
them in the scene tree.
Text can be inserted in a scene graph through the Text node. Text characteristics
(font, size, style, spacing, and so on) can be customized by means of the FontStyle
node.
Figure 2 shows a rather complicated MPEG-4 scene from "Le tour de France" with many different object types like video, icons, text, still images for the map of France and the trail map, and a semitransparent pop-up menu with clickable items. These items, if selected, provide information about the race, the cyclists, the general placing, and so on.
Figure 2. An MPEG-4 application called "Le tour de France" featuring many different A/V objects.
The advent of 3D graphics triggered the extension of MPEG-4 to the third dimension. BIFS 3D nodes, an extension of the ones defined in the VRML specification, allow the creation of virtual worlds. As in VRML, it's possible to add behavior to objects through Script nodes. Script nodes contain functions and procedures (the terminal must support the JavaScript programming language) that can define arbitrarily complex behaviors like performing object animations, changing the values of nodes' fields, modifying the scene tree, and so on. MPEG-4 allows the creation of much more complex scenes than VRML: 2D/3D hybrid worlds where contents are not downloaded once but can be streamed to update the scene continuously.
Face animation focuses on delineating parameters for face animation and definition. It has a very tight relationship with hybrid scalable text-to-speech synthesis for creating interesting applications based on speech-driven avatars. Despite previous research on avatars, the face animation work is the first attempt to define in a standard way the sets of parameters for synthetic anthropomorphic models. Face animation is based on the development of two sets of parameters: facial animation parameters (FAPs) and facial definition parameters (FDPs). FAPs allow having a single set of parameters regardless of the face model used by the terminal or application. Most FAPs describe atomic movements of the facial features; others (expressions and visemes) define much more complex deformations. Visemes are the visual counterparts of phonemes and hence define the position of the mouth (lips, jaw, tongue) associated with phonemes. In the context of MPEG-4, the expressions mimic the facial expressions associated with human primary emotions like joy, anger, fear, surprise, sadness, and disgust. For animated avatars, animation streams fit very low bit-rate channels (about 4 Kbps). FAPs can be encoded either with arithmetic encoding or with discrete cosine transform (DCT). FDPs are used to calibrate (that is, modify or adapt the shape of) the receiver terminal default face models or to transmit completely new face model geometry and texture.
A 2D mesh object in MPEG-4 represents the geometry and motion of a 2D triangular
mesh, that is, tessellation of a 2D visual object plane into triangular patches.
A dynamic 2D mesh is a temporal sequence of 2D triangular meshes. The initial
mesh can be either uniform (described by a small set of parameters) or Delaunay
(described by listing the coordinates of the vertices or nodes and the edges
connecting the nodes). Either way, it must be simple: it cannot contain holes.
Once the mesh has been defined, it can be animated by moving its vertices and
warping its triangles. To achieve smooth animations, motion vectors are represented
and coded with half-pixel accuracy. When the mesh deforms, its topology remains
unchanged. Updating the mesh shape requires only the motion vectors that express
how to move the vertices in the new mesh. An example of a rectangular mesh object
borrowed from the MPEG-4 specification appears in Figure 3.
Figure 3. Mesh object with uniform triangular geometry.
Dynamic 2D meshes inserted in an MPEG-4 scene create 2D animations. This results from mapping textures (video object planes, still images, 2D scenes) onto 2D meshes.
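The deformation step can be pictured with a small Python sketch (conceptual only, not the coded mesh syntax): the triangle list stays fixed while motion vectors, expressed in half-pixel units, move the vertices.

```python
# Vertices are stored in half-pixel units so motion vectors stay integral
# (half-pixel accuracy); triangles index into the vertex list and never change.
vertices = [(0, 0), (20, 0), (0, 20), (20, 20)]   # a tiny uniform mesh
triangles = [(0, 1, 2), (1, 3, 2)]                # fixed topology

def deform(vertices, motion_vectors):
    """Produce the next mesh in the sequence: move every vertex by its motion
    vector; the triangle list (topology) is reused unchanged."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(vertices, motion_vectors)]

# One update: shift the right-hand vertices by 1.5 pixels (3 half-pixels).
next_vertices = deform(vertices, [(0, 0), (3, 0), (0, 0), (3, 1)])
print(next_vertices)  # the topology in `triangles` still applies
```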
MPEG-4 supports an ad-hoc tool for encoding textures and still images based
on a wavelet algorithm that provides spatial and quality scalability, content-based
(arbitrarily shaped) object coding, and very efficient data compression over
a large range of bit rates. Texture scalability comes through many (up to 11)
different levels of spatial resolutions, allowing progressive texture transmission
and many alternative resolutions (the analog of mipmapping in 3D graphics).
In other words, the wavelet technique provides for scalable bit-stream coding
in the form of an image-resolution pyramid for progressive transmission and
temporal enhancement of still images. For animation, arbitrarily shaped textures
mapped onto 2D dynamic meshes yield animated video objects with a very limited
data transmission.
Texture scalability can adapt texture resolution to the receiving terminal's
graphics capabilities and the transmission rate to the channel bandwidth.
For instance, the encoder may first transmit a coarse texture and then refine
it with more texture data (levels of the resolution pyramid).
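The resolution-pyramid idea can be sketched in Python with a plain 2x downsampling pyramid standing in for the actual wavelet coder; the point is only that a coarse level can be transmitted first and refined later.

```python
import numpy as np

def build_pyramid(image: np.ndarray, levels: int) -> list[np.ndarray]:
    """Coarse-to-fine resolution pyramid by repeated 2x2 averaging. MPEG-4's
    wavelet coder is more sophisticated, but the idea of sending a coarse
    level first and refining it with later data is the same."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        coarse = img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(coarse)
    return pyramid[::-1]  # coarsest level first: transmit this one first

# A terminal with a small screen might stop after the first level;
# a desktop player keeps receiving the finer levels.
texture = np.random.rand(64, 64)
for level in build_pyramid(texture, levels=4):
    print(level.shape)  # (8, 8), (16, 16), (32, 32), (64, 64)
```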
MPEG-4 audio encompasses 6 types of coding techniques:
Here is a list of some relevant links: