At Milia, several people caught a severe case of hype from a company called PacketVideo.
Apparently they use MPEG-4 to transport audio, video and 3D content across
networks, even wireless ones. They were very interested in working with Blender
and think it could become the de facto 3D component in MPEG-4. This is important
because MPEG-4 is well on its way into silicon and is making its way
onto mobile devices (WinCE, EPOC; see an article
in InternetNews).
This document aims to cut through the hype and explain what MPEG-4 is, what
it can and cannot do, and how NaN and Blender could benefit from it (or not).
This primer is based on "MPEG-4: A Multimedia Standard for the Third Millennium", parts 1 and 2.
The IEEE has an excellent MPEG-4 introduction.
MPEG-4 (from the Moving Picture Experts Group) defines a multimedia system for interoperable communication of complex scenes containing audio, video, synthetic audio and graphics material. Unlike MPEG-1 and MPEG-2, which focused on better compression efficiency, the emphasis in MPEG-4 is on new functionality. Mobile as well as stationary user terminals, database access, communications and new types of interactive services will be major applications for MPEG-4.
The MPEG-4 standard provides ways to encode audio and video into so-called media objects for transmission. It describes ways to combine several of these objects into compound media objects and form audiovisual scenes. (See an example.) The standard lists how the data associated with these scenes can be multiplexed and synchronized on network channels providing a Quality of Service (QoS) appropriate for the nature of the specific media objects. Finally, it lets users interact with the audiovisual scene. A scene has a number of video objects, of possibly different shapes, plus a number of audio objects, possibly associated with video objects. Objects can be natural or synthetic, i.e. computer generated.
Visual objects in a scene are described mathematically and given a position in a two- or three-dimensional space. Similarly, audio objects are placed in a sound space. When placed in 3-D space, the video or audio object need only be defined once; the viewer can change his vantage point, and the calculations to update the screen and sound are done locally, at the user's terminal. This is a critical feature if the response is to be fast and the available bit-rate is limited, or when no return channel is available, as in broadcast situations.
MPEG-4's language for describing and dynamically changing the scene is called
the Binary Format for Scenes (BIFS). BIFS commands are available not
only to add objects to or delete them from a scene, but also to change visual
or acoustic properties of an object without changing the object itself; thus
only the color of a 3D sphere might be varied.
BIFS can be used to animate objects just by sending the proper BIFS command
and to define the behavior of those objects in response to user input at the
decoder. It can also be used to put an application screen (such as a Web browser's)
as a "texture" in the scene or map a video stream as a texture onto a virtual
TV screen.
BIFS borrows many concepts from the Virtual Reality Modeling Language (VRML),
which is the method used most widely on the Internet to describe 3D objects
and users' interaction with them. BIFS and VRML can be seen as different representations
of the same data. In VRML, the objects and their actions are described in text,
as in any other high-level language. But BIFS code is binary and is thus shorter
for the same content - typically 10 to 15 times.
More importantly, unlike VRML, MPEG-4 uses BIFS for real-time streaming. That is, a scene does not need to be downloaded in full before it can be played, but can be built up on the fly. Lastly, BIFS allows defining 2D objects such as lines and rectangles. For more information, you may want to read the BIFS FAQ.
Content is encoded before it is transmitted. Full MPEG-4 (a complex compound scene with text, 3D objects, streaming audio and video) is more than just the DivX encoding you see frequently these days; full encoding is a bit more complex. The media objects are assigned to so-called Elementary Streams (ESs). An object may get one or more streams, depending on its characteristics. The basic positioning information also gets its own stream. If the MPEG-4 scene is interactive, an interaction stream (aka upload stream) is added as well.
In the figure on the left, complex content is encoded into several streams: an audio stream (A), two video streams (V) for low- and high-resolution video, a 2D/3D stream (G, for geometry), a basic positioning stream (P) and an upload stream to capture any interactivity from the user. All this is packaged into an MPEG-4 channel that is transmitted across the network to a user. The user's player receives the channel and decodes it into separate streams. These streams are then fed to the MPEG-4 player's "renderer", which composes the original scene again and puts it onto the user's screen. (See below for very detailed technical info on this.)
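As a rough illustration of this stream assignment, here is a Python sketch (not based on any real MPEG-4 library; the stream identifiers simply follow the figure) that tags media-object data with elementary-stream identifiers, interleaves everything into one channel, and splits it back apart on the player side.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Packet:
    stream_id: str   # e.g. "A", "V1", "V2", "G", "P"
    payload: bytes

def multiplex(streams: dict[str, list[bytes]]) -> list[Packet]:
    """Interleave per-stream payloads into one channel (simple round-robin).
    A real MPEG-4 multiplex also adds timing headers; see the SL/FML/TML
    discussion later in this document."""
    channel = []
    longest = max(len(chunks) for chunks in streams.values())
    for i in range(longest):
        for stream_id, chunks in streams.items():
            if i < len(chunks):
                channel.append(Packet(stream_id, chunks[i]))
    return channel

def demultiplex(channel: list[Packet]) -> dict[str, list[bytes]]:
    """The player side: split the channel back into separate streams before
    handing them to the renderer/compositor."""
    streams = defaultdict(list)
    for packet in channel:
        streams[packet.stream_id].append(packet.payload)
    return dict(streams)

if __name__ == "__main__":
    content = {
        "A":  [b"audio0", b"audio1"],     # audio
        "V1": [b"lo-res0", b"lo-res1"],   # low-resolution video
        "V2": [b"hi-res0", b"hi-res1"],   # high-resolution video
        "G":  [b"geometry0"],             # 2D/3D geometry
        "P":  [b"positions0"],            # basic positioning
    }
    channel = multiplex(content)
    assert demultiplex(channel) == content
```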
Another important addition in MPEG-4 is a file format known as mp4, which can be used for exchange of content and which is easily converted. It will be the only reliable way for users to exchange complete files of MPEG-4 content. Content which is present in mp4 format can be decoded by any player, much like audio mp3 files can be played by any player. This is unlike MPEG-2 files, which can only be viewed by a player that supports the same codec as the one used to create the file.
MPEG-4 can have an impact in a few areas:
I'll go into each of these below.
2.1. General
For NaN, it is interesting to know how MPEG-4 combines objects into compound
objects or scenes. It uses a hierarchical scene structure, a graph, to describe
the relationships between objects in the scene. Even more interesting
is that it adopted much of the VRML way of composing objects into a scene.
VRML (Virtual Reality Modeling Language) specifies the composition of complex
scenes containing 3D material plus audio and video streams.
The result is an MPEG-4 standard that specifies how objects representing natural
audio and video with streams attached (microphones and cameras) can be combined
with objects representing synthetic audio and video (2D and 3D material) with
streams attached (3D rooms, faces, figures, etc.). In addition, the standard
conforms to the VRML syntax, with extensions; however, MPEG-4 scenes are binary
encoded for the sake of efficiency (the format is called BIFS). But, just like VRML,
the scene descriptions can be generated in textual format, with or without
help from an authoring tool.
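To make the text-versus-binary point concrete, here is a small Python sketch; the node layout and the packed field codes are invented for illustration and are not real BIFS syntax.

```python
import struct

# A VRML-like textual description of a simple scene (illustrative only).
scene_text = """\
Transform {
  translation 1.0 2.0 0.5
  children [
    Shape {
      appearance Appearance { material Material { diffuseColor 0.8 0.1 0.1 } }
      geometry Sphere { radius 0.25 }
    }
  ]
}
"""

# A made-up binary encoding of the same information: a node-type code,
# three floats for the translation, three for the color, one for the radius.
NODE_SPHERE = 7  # hypothetical node-type code
scene_binary = struct.pack(
    "<B3f3ff",
    NODE_SPHERE,
    1.0, 2.0, 0.5,   # translation
    0.8, 0.1, 0.1,   # diffuse color
    0.25,            # radius
)

print(len(scene_text.encode()), "bytes of text")
print(len(scene_binary), "bytes of binary")  # several times smaller than the text
```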
Blender would natively create mp4 files. These could then be played anywhere
or distributed over the Net. An MPEG-4 encoder would need to be added to Blender.
This is a very tricky subject and the one with the most impact on Blender. It
requires a detailed understanding of MPEG-4, specifically how it deals with non-AV
content and how that content is encoded in the stream. However, how MPEG-4 encodes
3D content is still a large research topic (2D is just treated as flat 3D). It seems
the initial scene is constructed internally, perhaps as surfaces only. This scene
is then converted to VRML, and all subsequent scenes are encoded as translations
and rotations of the original one.
If this is true, then Blender would compose the scene and convert it from its internal
format to VRML. The VRML would then be converted to MPEG-4 in the way
outlined above. Instead of using the rasterizer function to visualize the scene,
it would be fed directly into an encoder of some kind.
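If that interpretation is correct, the export path could look roughly like the sketch below. Everything in it is hypothetical: `scene_to_vrml` and `vrml_to_mp4` stand in for an internal-format-to-VRML converter and a BIFS/mp4 encoder that would have to be written.

```python
def scene_to_vrml(scene) -> str:
    """Hypothetical converter: walk Blender's internal scene data and emit a
    VRML-style textual scene description (surfaces only, as noted above)."""
    lines = ["#VRML V2.0 utf8"]
    for obj in scene:
        lines.append(f"# object {obj['name']}")
        lines.append("Shape { geometry IndexedFaceSet { } }")  # placeholder geometry
    return "\n".join(lines)

def vrml_to_mp4(vrml_text: str) -> bytes:
    """Hypothetical BIFS/MPEG-4 encoder. A real implementation would parse the
    VRML, binary-encode it as BIFS and wrap it in an mp4 container; here we
    just return the text as bytes to show where the encoder would plug in."""
    return vrml_text.encode("utf-8")

def export_scene(scene, path: str) -> None:
    with open(path, "wb") as out:
        out.write(vrml_to_mp4(scene_to_vrml(scene)))

# Usage sketch:
# export_scene([{"name": "Cube"}, {"name": "Camera"}], "scene.mp4")
```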
Reading mp4 as just another type of content is much easier. It merely requires the addition of an MPEG-4 import filter for audio or texture objects. (Please correct me if I'm wrong here!)
PacketVideo and nGame both have questions about Blender: they wonder whether it can be used for server-side rendering. In server-side rendering, a large and powerful server runs the render engine. Some device presents it with a scene or game, the server renders it and sends the result back to the client. This way, a device that is not itself capable of rendering high-quality images, but can display them, can still use the content.
How does this work? Take a look at the following example for PV.
Figure: Blender server-side rendering and MPEG-4 streams to a mobile device.
Here, the game engine runs on a powerful server. The engine architecture
has several components, called Logic Bricks at NaN. There already is an Input
Device logic brick, for instance for a PS2 DualShock controller, a keyboard,
a mouse, etc. The engine also has a brick, the Rasterizer, that creates video
frames ready to be displayed on a monitor. When a user presses the fire button
(fire rocket at tank), the Input Device brick sends the event to the game engine,
which decides what to do next. The game engine generates the next image (rocket
hits tank) and the Rasterizer creates a frame that can be displayed by the video
card the user has. Several Rasterizers have already been tested; the same logic
brick can easily be used to create OpenGL, DirectX 8 or PS2 frames.
Instead of feeding the output of the Rasterizer to a video card, the frame can
also be fed into an MPEG-4 encoding stream. This stream can then be transmitted
like any mp4 stream to an MPEG-4 player, desktop or mobile! Voila, cheap Blender
content on your PDA or phone. To make it interactive, the user's input can be
transmitted over a basic, standard stream (an Elementary Stream, ES) in MPEG-4. The MPEG-4
server then relays that to the game engine's Input Device brick. It is even
possible to treat any remote device with input capabilities as a Remote Input
Device brick, simply a special kind of device.
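A server-side loop along these lines might look like the following Python sketch. The classes (`GameEngine`, `Rasterizer`, `Mpeg4Encoder`) and the stream handling are stand-ins for the logic bricks and encoder described above, not existing NaN or MPEG-4 APIs.

```python
import queue

class GameEngine:
    """Stand-in for the game engine: applies input events and produces the
    next scene state."""
    def __init__(self):
        self.state = {"frame": 0, "events": []}

    def apply_event(self, event):
        self.state["events"].append(event)

    def step(self):
        self.state["frame"] += 1
        return dict(self.state)

class Rasterizer:
    """Stand-in for a Rasterizer logic brick: turns a scene state into a frame."""
    def render(self, state) -> bytes:
        return f"frame {state['frame']}".encode()

class Mpeg4Encoder:
    """Stand-in for an MPEG-4 encoder feeding the outgoing stream."""
    def encode(self, frame: bytes) -> bytes:
        return b"ES:" + frame  # placeholder for real encoding

def serve(input_stream: queue.Queue, output_stream: list, frames: int = 3) -> None:
    engine, rasterizer, encoder = GameEngine(), Rasterizer(), Mpeg4Encoder()
    for _ in range(frames):
        # 1. Relay user input (the upstream elementary stream) to the engine.
        while not input_stream.empty():
            engine.apply_event(input_stream.get_nowait())
        # 2. Advance the game and render the next frame.
        frame = rasterizer.render(engine.step())
        # 3. Encode and send, instead of handing the frame to a video card.
        output_stream.append(encoder.encode(frame))

if __name__ == "__main__":
    events, channel = queue.Queue(), []
    events.put({"button": "fire"})  # e.g. fire rocket at tank
    serve(events, channel)
    print(channel)                  # encoded frames ready for the mp4 stream
```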
Almost the same story can be told for Blender and WAP, a question nGame asked.
Figure: Blender server-side rendering and output to a mobile device over WAP.
In this case, output from the Rasterizer is not fed into an MPEG-4
encoder but into a special downsizing routine. This scales the image down to
320x200 or 160x160, increases contrast and converts it to a black & white
bitmap, per the WAP bitmap standard. The image can be stored on any web server, and
a WML page stored on the server can then create WAP content for a user to see on
his mobile device.
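Such a downsizing routine could be sketched with the Python Imaging Library (Pillow), assuming it is available; the sketch glosses over the exact WAP WBMP file format and just performs the resize, contrast and 1-bit conversion steps described above.

```python
# Minimal sketch of the downsizing step, assuming Pillow is installed
# (pip install Pillow). A real routine would also emit the WAP WBMP format;
# here we only do resize / contrast / 1-bit conversion and save a stand-in file.
from PIL import Image, ImageOps

def downsize_for_wap(src_path: str, dst_path: str, size=(160, 160)) -> None:
    image = Image.open(src_path).convert("L")   # grayscale
    image = image.resize(size)                  # scale down for the handset
    image = ImageOps.autocontrast(image)        # increase contrast
    image = image.convert("1")                  # 1-bit black & white
    image.save(dst_path)

# downsize_for_wap("render.png", "render_wap.png")
```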
Consider, for instance, a golf game. The user only has to select a club, a swing
and an angle, and hit return. The result of the swing can be calculated by the game
engine. It creates a new view for the user of where the ball is, and the user
makes another swing. Codeonline
is already doing exactly this with their WAP Golf game.
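The per-turn calculation the game engine has to do is modest; a toy version (ignoring drag, spin and real club data, with made-up numbers) might look like this:

```python
import math

def ball_carry(speed_mps: float, launch_angle_deg: float) -> float:
    """Toy projectile model for one swing: how far the ball carries (in metres)
    on flat ground, ignoring air resistance and spin."""
    angle = math.radians(launch_angle_deg)
    g = 9.81  # gravitational acceleration, m/s^2
    return speed_mps ** 2 * math.sin(2 * angle) / g

# Example turn: a 60 m/s swing launched at 12 degrees carries roughly 150 m.
print(round(ball_carry(60.0, 12.0)), "metres")
```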
(From: EET, MPEG-4's role unclear in streaming-media era) Exactly how soon it comes and how big the MPEG-4 market will be are questions still under intense discussion in the industry. Aside from the cell phones and PDAs, it remains unclear if the standard will find its way into enough other markets to justify semiconductor companies' modifying their chips for embedded applications like set-top boxes. The industry is further split on which MPEG-4 profiles, levels and feature sets need to be supported in given applications by servers, client systems and chips.
"The games market is another area where the application of MPEG-4 video, still-texture, interactivity and SNHC shows much promise, with 3-D texture mapping of still images, live video, or extended pre-recorded video sequences enhancing the player experience. Adding live video of users adds to the user experience multi-player 3-D games, as does use of arbitrary-shaped video, where transparency could be combined artistically with 3-D video texture mapping."
The real area of interest to NaN is the SNHC part of MPEG-4: "SNHC deals with the representation and coding of synthetic (2D and 3D graphics) and natural (still images and natural video) audiovisual information." In other words, SNHC is the most important aspect of MPEG-4 for NaN, because it combines mixed media types, including streaming and downloaded audiovisual objects. "Application areas include 2D and 3D graphics, human face and body description and animation, integration of text and graphics, scalable textures encoding, 2D/3D mesh coding, hybrid text-to-speech coding, and synthetic audio coding (structured audio)."
The media integration of text and graphics (MITG) layer provides ways to encode, synchronize and describe the layout of 2D scenes. These can be composed of text, audio, video, synthetic graphic shapes, pointers and annotations. A Layout node specifies placement, spacing, alignment, scrolling and wrapping of objects in the MPEG-4 scene. See figure 2 below.
BIFS 3D nodes, an extension of the 3D nodes defined in VRML, allow the creation of virtual worlds. Behavior can be added to objects through scripts, just like in VRML. These Script nodes contain JavaScript code that defines the behavior. The script can perform object animations, change values of nodes' fields, modify the scene tree, etc. MPEG-4 worlds can be more complex than VRML ones because the world contents are not downloaded but streamed and can be continuously modified by users. (See also below for more detail.)
The scene hierarchy is a graph where each leaf is a media object. The structure of the graph is not necessarily static. As relationships change over time, nodes or subgraphs can be added or deleted. All parameters describing these relationships are part of the scene description that is sent to the decoder. The initial snapshot of a scene is sent or retrieved on a dedicated stream. An update of the scene structure may be sent at any time. These updates can access any field of any updatable node in the scene. Updatable nodes have received a unique identifier in the structure and can be accessed using this identifier. Composition information (information about the initial composition and scene updates during the sequence evolution) is delivered in one elementary stream. The composition stream is treated differently from other streams because it provides the information required by the terminal (that renders the scene) to set up the scene structure and map all other streams to the respective media objects.
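The update mechanism can be pictured with a small Python sketch in which nodes carry unique identifiers and update commands address a node by identifier to replace a single field; the command format here is invented, whereas real BIFS updates are binary insertion, deletion and replacement commands.

```python
# Toy scene tree keyed by node identifier; the field names are illustrative.
scene = {
    "sphere1": {"type": "Sphere", "color": (0.8, 0.1, 0.1), "radius": 0.25},
    "label1":  {"type": "Text",   "string": "Hello"},
}

def apply_update(scene: dict, node_id: str, field: str, value) -> None:
    """Apply one scene update: change a single field of an updatable node,
    without retransmitting the node itself."""
    scene[node_id][field] = value

def delete_node(scene: dict, node_id: str) -> None:
    """Remove a node (or the root of a subtree) from the scene."""
    scene.pop(node_id, None)

# Later updates arriving on the composition stream:
apply_update(scene, "sphere1", "color", (0.1, 0.8, 0.1))  # recolor the sphere
delete_node(scene, "label1")                              # drop the caption
```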
***Note that these data objects could, in theory, be transmitted using Terraplay's API. However, whether Terraplay supports the real-time nature of MPEG-4 streams is unknown. I believe it is unlikely that a fat MPEG-4 video stream can be properly supported by Terraplay: the infrastructure does too many things in the GAS (subscriptions, etc.) to ensure that information is delivered in a timely fashion.***
Because MPEG-4 is intended for use on a wide variety of networks with widely varying performance characteristics, it includes a three-layer multiplex standardized by the Delivery Multimedia Integration Framework (DMIF) working group. The three layers separate the functionality of synchronization, content multiplexing and service (transport) multiplexing.
The goal is to exploit the characteristics of each network, while adding functionality that these environments lack and preserving a homogeneous interface toward the MPEG-4 system. Elementary streams are packetized, adding headers with timing information (clock references) and synchronization data (time stamps). They make up the synchronization layer (SL) of the multiplex. Streams with similar QoS requirements are then multiplexed on a content multiplex layer, termed the flexible multiplex layer (FML). It efficiently interleaves data from a variable number of variable bit-rate streams. A service multiplex layer, known as the transport multiplex layer (TML), can add a variety of levels of QoS and provide framing of its content and error detection. Since this layer is specific to the characteristics of the transport network, the specification of how data from SL or FML streams is packetized into TML streams refers to the definition of the network protocols. MPEG-4 doesn't specify it. Figure 1 shows these three layers and the relationship among them.
Figure 1. General structure of the MPEG-4 multiplex. Different cases have multiple SL streams multiplexed in one FML stream and multiple FML streams multiplexed in one TML stream.
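As a rough picture of the layering shown in Figure 1, here is a Python sketch (the stream names and QoS labels are invented) that groups SL streams with similar QoS requirements into FML streams and then wraps all FML streams into a single TML stream; real framing and error detection are omitted.

```python
from collections import defaultdict

# Illustrative SL streams, each tagged with a made-up QoS class.
sl_streams = {
    "audio":     "low-latency",
    "video_hi":  "high-bandwidth",
    "video_lo":  "high-bandwidth",
    "geometry":  "reliable",
    "positions": "reliable",
}

def build_fml(sl_streams: dict[str, str]) -> dict[str, list[str]]:
    """Group SL streams with similar QoS requirements into FML streams."""
    fml = defaultdict(list)
    for name, qos in sl_streams.items():
        fml[qos].append(name)
    return dict(fml)

def build_tml(fml_streams: dict[str, list[str]]) -> list[tuple[str, list[str]]]:
    """Wrap the FML streams into one transport (TML) stream; the real transport
    layer would add framing and error detection here."""
    return list(fml_streams.items())

print(build_tml(build_fml(sl_streams)))
```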
Elementary streams consist of access units, which correspond to portions of
the stream with a specific decoding time and composition time. As an example,
an elementary stream for a natural video object consists of the coded video
object instances at the refresh rate specific to the video sequence (for example,
the video of a person captured at 25 pictures per second). Or, an elementary
stream for a face model consists of the coded animation parameters instances
at the refresh rate specific to the face model animation (for example, a model
animated to refresh the facial animation parameters 30 times per second). Access
units like a video object instance or a facial animation parameters instance
are the self-contained semantic units in the respective streams, which have
to be decoded and used for composition synchronously with a common system time
base.
Elementary streams are first framed in SL packets, not necessarily matching
the size of the access units in the streams. The header attached by this first
layer contains fields specifying timing information (clock references) and synchronization data (time stamps).
The information contained in the SL headers maintains the correct time base for the elementary decoders and for the receiver terminal, plus the correct synchronization in the presentation of the elementary media objects in the scene. The clock references mechanism supports timing of the system, and the mechanism of time stamps supports synchronization of the different media.
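As a concrete picture of SL packetization, here is a small Python sketch; the header fields and the 90 kHz clock are illustrative choices, not the normative SL packet header syntax.

```python
from dataclasses import dataclass

@dataclass
class SLPacket:
    # Illustrative header fields; the real SL header is a configurable binary
    # syntax, not a fixed record like this.
    clock_reference: int         # sender's time base, in clock ticks
    decoding_time_stamp: int     # when the access unit must be decoded
    composition_time_stamp: int  # when the decoded unit enters the scene
    payload: bytes               # (part of) one access unit

def packetize(access_units: list[bytes], fps: float, clock_hz: int = 90_000):
    """Wrap each access unit of an elementary stream in one SL packet.
    For a 25-Hz video object the time stamps advance by clock_hz / 25 ticks."""
    ticks_per_unit = int(clock_hz / fps)
    packets = []
    for i, unit in enumerate(access_units):
        t = i * ticks_per_unit
        packets.append(SLPacket(t, t, t, unit))
    return packets

# Example: three coded video object instances at 25 pictures per second.
print(packetize([b"vop0", b"vop1", b"vop2"], fps=25.0))
```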
Given the wide range of possible bit rates associated with the elementary streams (ranging, for example, from 1 Kbps for facial animation parameters to 1 Mbps for good-quality video objects), an intermediate multiplex layer provides more flexibility. The SL serves as a tool to associate timing and synchronization data with the coded material. The transport multiplex layer adapts the multiplexed stream to the specific transport or storage media. The intermediate (optional) flexible multiplex layer provides a way to group together several low-bit-rate streams for which the overhead associated with a further level of packetization is not necessary or introduces too much redundancy. With conventional scenes, like the usual audio plus video of a motion picture, this optional multiplex layer can be skipped; the single audio stream and the single video stream can each be mapped to a single transport multiplex stream.
The multiplex layer closest to the transport level depends on the specific transmission or storage system on which the coded information is delivered. The Systems part of MPEG-4 doesn't specify the way SL packets (when no FML is used) or FML packets are mapped on TML packets. The specification simply references several different transport packetization schemes. The "content" packets (the coded media data wrapped by SL headers and FML headers) may be transported directly using an Asynchronous Transfer Mode (ATM) Adaptation Layer 2 (AAL2) scheme for applications over ATM, MPEG-2 transport stream packetization over networks providing that support, or Transmission Control Protocol/Internet Protocol (TCP/IP) for applications over the Internet.
MITG provides a way to encode, synchronize, and describe the layout of 2D scenes
composed of animated text, audio, video, synthetic graphic shapes, pointers,
and annotations. The 2D BIFS graphics objects derive from and are a restriction
of the corresponding VRML 2.0 3D nodes. Many different types of textures can
be mapped on plane objects: still images, moving pictures, complete MPEG-4 scenes,
or even user-defined patterns. Alternatively, many material characteristics
(color, transparency, border type) can be applied on 2D objects.
Other VRML-derived nodes are the interpolators and the sensors. Interpolators
allow predefined object animations like rotations, translations, and morphing.
Sensors generate events that can be redirected to other scene nodes to trigger
actions and animations. The user can generate events, or events can be associated
with particular time instants.
MITG provides a Layout node to specify the placement, spacing, alignment, scrolling,
and wrapping of objects in the MPEG-4 scene. Still images or video objects can
be placed in a scene graph in many ways, and they can be texture-mapped on any
2D object. The most common way, though, is to use the Bitmap node to insert
a rectangular area in the scene in which pixels coming from a video or still
image can be copied.
The 2D scene graphs can contain audio sources by means of Sound2D nodes.
Like visual objects, they must be positioned in space and time. They are subject
to the same spatial transformations as their parents, the nodes hierarchically above
them in the scene tree.
Text can be inserted in a scene graph through the Text node. Text characteristics
(font, size, style, spacing, and so on) can be customized by means of the FontStyle
node.
Figure 2 shows a rather complicated MPEG-4 scene from "Le tour de France" with many different object types like video, icons, text, still images for the map of France and the trail map, and a semitransparent pop-up menu with clickable items. These items, if selected, provide information about the race, the cyclists, the general placing, and so on.
Figure 2. An MPEG-4 application called "Le tour de France" featuring many different A/V objects.
The advent of 3D graphics triggered the extension of MPEG-4 to the third dimension. BIFS 3D nodes, an extension of the ones defined in the VRML specification, allow the creation of virtual worlds. As in VRML, it's possible to add behavior to objects through Script nodes. Script nodes contain functions and procedures (the terminal must support the JavaScript programming language) that can define arbitrarily complex behaviors like performing object animations, changing the values of nodes' fields, modifying the scene tree, and so on. MPEG-4 allows the creation of much more complex scenes than VRML: 2D/3D hybrid worlds where contents are not downloaded once but can be streamed to update the scene continuously.
Face animation focuses on delineating parameters for face animation and definition. It has a very tight relationship with hybrid scalable text-to-speech synthesis for creating interesting applications based on speech-driven avatars. Despite previous research on avatars, the face animation work is the first attempt to define in a standard way the sets of parameters for synthetic anthropomorphic models. Face animation is based on the development of two sets of parameters: facial animation parameters (FAPs) and facial definition parameters (FDPs). FAPs allow having a single set of parameters regardless of the face model used by the terminal or application. Most FAPs describe atomic movements of the facial features; others (expressions and visemes) define much more complex deformations. Visemes are the visual counterparts of phonemes and hence define the position of the mouth (lips, jaw, tongue) associated with phonemes. In the context of MPEG-4, the expressions mimic the facial expressions associated with human primary emotions like joy, anger, fear, surprise, sadness, and disgust. For animated avatars, animation streams fit very low bit-rate channels (about 4 Kbps). FAPs can be encoded either with arithmetic encoding or with discrete cosine transform (DCT). FDPs are used to calibrate (that is, modify or adapt the shape of) the receiver terminal default face models or to transmit completely new face model geometry and texture.
A 2D mesh object in MPEG-4 represents the geometry and motion of a 2D triangular
mesh, that is, tessellation of a 2D visual object plane into triangular patches.
A dynamic 2D mesh is a temporal sequence of 2D triangular meshes. The initial
mesh can be either uniform (described by a small set of parameters) or Delaunay
(described by listing the coordinates of the vertices or nodes and the edges
connecting the nodes). Either way, it must be simple: it cannot contain holes.
Once the mesh has been defined, it can be animated by moving its vertices and
warping its triangles. To achieve smooth animations, motion vectors are represented
and coded with half-pixel accuracy. When the mesh deforms, its topology remains
unchanged. Updating the mesh shape requires only the motion vectors that express
how to move the vertices in the new mesh. An example of a rectangular mesh object
borrowed from the MPEG-4 specification appears in Figure 3.
Figure 3. Mesh object with uniform triangular geometry.
Dynamic 2D meshes inserted in an MPEG-4 scene create 2D animations. This results from mapping textures (video object planes, still images, 2D scenes) onto 2D meshes.
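The deformation step can be pictured with a small Python sketch (conceptual only, not the coded mesh syntax): the triangle list stays fixed while motion vectors, expressed in half-pixel units, move the vertices.

```python
# Vertices are stored in half-pixel units so motion vectors stay integral
# (half-pixel accuracy); triangles index into the vertex list and never change.
vertices = [(0, 0), (20, 0), (0, 20), (20, 20)]   # a tiny uniform mesh
triangles = [(0, 1, 2), (1, 3, 2)]                # fixed topology

def deform(vertices, motion_vectors):
    """Produce the next mesh in the sequence: move every vertex by its motion
    vector; the triangle list (topology) is reused unchanged."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(vertices, motion_vectors)]

# One update: shift the right-hand vertices by 1.5 pixels (3 half-pixels).
next_vertices = deform(vertices, [(0, 0), (3, 0), (0, 0), (3, 1)])
print(next_vertices)  # the topology in `triangles` still applies
```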
MPEG-4 supports an ad-hoc tool for encoding textures and still images based
on a wavelet algorithm that provides spatial and quality scalability, content-based
(arbitrarily shaped) object coding, and very efficient data compression over
a large range of bit rates. Texture scalability comes through many (up to 11)
different levels of spatial resolutions, allowing progressive texture transmission
and many alternative resolutions (the analog of mipmapping in 3D graphics).
In other words, the wavelet technique provides for scalable bit-stream coding
in the form of an image-resolution pyramid for progressive transmission and
temporal enhancement of still images. For animation, arbitrarily shaped textures
mapped onto 2D dynamic meshes yield animated video objects with a very limited
data transmission.
Texture scalability can adapt texture resolution to the receiving terminal's
graphics capabilities and the transmission rate to the channel bandwidth.
For instance, the encoder may first transmit a coarse texture and then refine
it with more texture data (levels of the resolution pyramid).
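The resolution-pyramid idea can be sketched in Python with a plain 2x downsampling pyramid standing in for the actual wavelet coder; the point is only that a coarse level can be transmitted first and refined later.

```python
import numpy as np

def build_pyramid(image: np.ndarray, levels: int) -> list[np.ndarray]:
    """Coarse-to-fine resolution pyramid by repeated 2x2 averaging. MPEG-4's
    wavelet coder is more sophisticated, but the idea of sending a coarse
    level first and refining it with later data is the same."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        coarse = img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(coarse)
    return pyramid[::-1]  # coarsest level first: transmit this one first

# A terminal with a small screen might stop after the first level;
# a desktop player keeps receiving the finer levels.
texture = np.random.rand(64, 64)
for level in build_pyramid(texture, levels=4):
    print(level.shape)  # (8, 8), (16, 16), (32, 32), (64, 64)
```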
MPEG-4 audio encompasses 6 types of coding techniques:
Here is a list of some relevant links: