Beyond SMIL 2.0: Fragmentation and Integration of Media and Syntax

Lloyd Rutledge
Multimedia and Human-Computer Interaction Theme
CWI (Centrum voor Wiskunde en Informatica)
P.O. Box 94079, NL-1090 GB Amsterdam, The Netherlands
Lloyd.Rutledge@cwi.nl

Abstract

SMIL, the W3C recommendation for multimedia on the Web, is quickly approaching its version 2.0 release^SMIL. The new version adds many features to SMIL 1.0, including event-based timing, animation and transitions. This paper explores potential next steps for SMIL: what SMIL 2.0 enables, how other standards can be used with SMIL 2.0, and what could appear in a hypothetical "SMIL 3.0". The suggested features focus on more fine-grained fragmentation and on the integration of the resulting smaller-scale components. This fragmentation and integration applies both to the integrated media content and to the encoding languages of the media and of the integration itself. Other developing W3C formats are also incorporating these features, so their further exploration in SMIL is timely. This fragmentation and integration will in many ways make the Web more seamless for author and user: the boundaries between units of media and between language browser types will become less apparent.

Keywords: SMIL, Multimedia, Integration, XPointer, XHTML, SVG, CSS

1. Introduction

Synchronized Multimedia Integration Language (SMIL) is the W3C language for multimedia on the Web. Version 1.0 of SMIL was released in 1998 with the basic foundation for distributed multimedia [SMIL1]. SMIL 2.0 is expected to be released soon^SMIL as a recommendation [SMIL2]. It adds many features over version 1.0, defining state-of-the-art Web-based multimedia. With the constructs, function and architecture of SMIL 2.0 established, consideration of the possibilities for SMIL 3.0 can begin.

The most important aspect of SMIL is its integration of other media. SMIL 1.0 and 2.0 files contain no information-bearing media of their own -- all media perceived in a SMIL presentation comes from files or streams external to the SMIL code itself. However, not all possibilities for integration have been achieved by version 2.0. This paper discusses how integration can be further incorporated into future versions of SMIL. First, this paper discusses how the integration of media that exists in SMIL can be extended. The issues discussed in this paper and their relation to the issues of fragmentation, integration, media and syntax are laid out in Table 1.

	... Media	... Syntax
Fragmentation of ...	Implementing processing of XPointer references into XML-defined media	Adding per-element CSS style to SMIL
	Adding removal of fragment context to SMIL
	SMIL 2.0 media clipping attribute values
	SMIL 2.0 fit="hidden" attribute assignment
	SMIL 2.0 XPointer bare name values of "href" attribute
Integration of ...	Adding <filter> elements to SMIL	Adding direct content of media to SMIL
	SMIL 2.0 media object elements

Table 1: Issues and their Relation to Fragmentation, Integration, Media and Syntax

4. Direct Inclusion of Media

One possibility for increased integration is the direct inclusion of media in SMIL. This would be an integration of syntax: the format used to define media types could be directly included in SMIL syntax. The benefit would be the reduction of distribution system communication and, quite often, authoring effort. The <ref> elements could directly contain the media they would otherwise locate. Particularly applicable to this is the inclusion of XML-defined media, such as SVG and XHTML. Having <ref> elements in a SMIL presentation each directly contain small SVG or XHTML subtrees will sometimes be easier to manage than a SMIL file with many separate small SVG and XHTML files.

This paper suggests the addition to SMIL of a <mediaContent> element that can be the child of any media object element. The XML-defined media content, such as the full subtree rooted in the root element of an SVG, XHTML or other XML document, would be contained directly in the <mediaContent> element. The <mediaContent> element would have a "type" attribute that functions just like the "type" attribute in SMIL media object elements: it states the MIME type of the media content. A SMIL processor would use this attribute to determine what process to pass the XML-defined direct media content to for presentation, just as it would use the same attribute in a media object element to determine what process receives the media object returned by a URI reference.

Media content not presentable in a given browser can be ignored if it is directly contained, just as it can be ignored if it is referenced: the browser can determine from the MIME type, or by other means, that the media type is not supported and choose not to present it. The only difference with direct media content is that it would still have to be parsed. This would be only a minor disadvantage, since direct media content would mostly be used for very small XML-defined media objects.

This also enables the reuse of directly included media. A possible further extension to SMIL is to allow the "src" attribute of a media object element to refer to another media object element, elsewhere in the same document, that has direct media content. Thus, this direct media content can be defined once in the SMIL file and used in multiple places. The syntax of such "src" attribute values would start with a '#' character and be followed by the unique identifier of the element with direct media content. If XPointer is supported, the attribute value after the '#' character could be an XPointer value locating the reused media object element. This is similar to the function performed by the <defs> element of SVG.
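The reuse mechanism can be sketched as follows. This is a minimal illustration in Python using the standard xml.etree.ElementTree module, not part of any SMIL implementation. The <mediaContent> element is the extension suggested in this paper; the function name resolve_local_src and the sample document are assumptions made for the example. Only bare-name references (a '#' followed by an identifier) are handled, not full XPointer values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <ref> carries direct media content, a second reuses it.
SMIL_DOC = """\
<smil>
  <body>
    <par>
      <ref id="logo" region="left">
        <mediaContent type="image/svg+xml">
          <svg><circle r="10"/></svg>
        </mediaContent>
      </ref>
      <ref src="#logo" region="right"/>
    </par>
  </body>
</smil>
"""


def resolve_local_src(root, media_element):
    """Return the <mediaContent> element that a '#id' src value points to.

    Returns None when src is absent or is not a same-document bare name.
    """
    src = media_element.get("src", "")
    if not src.startswith("#"):
        return None
    target_id = src[1:]
    for candidate in root.iter():
        if candidate.get("id") == target_id:
            # The referenced media object element holds the content as a child.
            return candidate.find("mediaContent")
    return None


root = ET.fromstring(SMIL_DOC)
reuser = root.find(".//ref[@src='#logo']")
content = resolve_local_src(root, reuser)
```

Here the second <ref> resolves to the same content subtree as the first, so a processor could hand that one subtree to the SVG renderer for both regions.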

To demonstrate the implementability of this feature, this paper contains a small XSLT transform that takes any SMIL 2.0 document with extension <mediaContent> elements and turns it into the same SMIL 2.0 document with those elements replaced with external media references to new media files generated from the <mediaContent> element content. This transform also uses the <mediaContent> element "type" attribute to put the right XML declaration in the generated external media files. The transform code is XSLT with extension constructs that enable the creation of multiple files. This paper suggests that this type of use of XSLT transform from extension SMIL to current SMIL be used in research and proposals when possible to demonstrate implementation feasibility. Of course, there will be many cases when equivalent current SMIL code cannot be generated.
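As an illustration of the idea only, the following sketch performs a comparable rewrite in Python rather than XSLT; it is not the transform referred to above. The file naming scheme ("media1.xml", "media2.xml", ...), the function name externalize and the sample document are assumptions. The sketch returns the generated files as a dictionary instead of writing them to disk, and it does not emit XML declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension-SMIL input with one directly included XHTML caption.
SMIL_DOC = """\
<smil>
  <body>
    <ref id="caption1" region="caption">
      <mediaContent type="application/xhtml+xml">
        <html><body><p>To be, or not to be</p></body></html>
      </mediaContent>
    </ref>
  </body>
</smil>
"""


def externalize(root):
    """Replace each <mediaContent> child with an external "src" reference.

    Returns a dict mapping generated file names to serialized media content.
    """
    files = {}
    counter = 0
    # Snapshot the element list first, because the tree is modified below.
    for parent in list(root.iter()):
        holder = parent.find("mediaContent")
        if holder is None or len(holder) == 0:
            continue
        counter += 1
        name = "media%d.xml" % counter  # assumed naming scheme
        files[name] = ET.tostring(holder[0], encoding="unicode").strip()
        parent.set("src", name)
        media_type = holder.get("type")
        if media_type is not None:
            parent.set("type", media_type)  # carry the MIME type over
        parent.remove(holder)
    return files


root = ET.fromstring(SMIL_DOC)
files = externalize(root)
ref = root.find(".//ref")
```

After the call, the <ref> element carries src="media1.xml" and no longer has a <mediaContent> child, which is the shape a current SMIL 2.0 player would expect.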

5. Structure-defined Fragmentation of Media Content

SMIL 2.0 presentations integrate all the media content they present from external sources. Media objects are most frequently included in their entirety, though SMIL 1.0 and 2.0 have constructs for stating that only part of a media object is to be integrated into a presentation. This paper calls this fragmentation. Fragmentation is similar to the term anchoring from the Amsterdam Hypermedia Model (AHM) [Hardman] extension of the Dexter Hypertext model [Halasz]. It states that the media units being integrated are portions of the external media objects referenced. Fragmentation contributes to media adaptation and reuse by not requiring that existing stored media objects be modified into different stored objects that are portions of the original just so that they can be used as desired. This section discusses how fragmentation could be further incorporated in SMIL, particularly with the use of XPointer.

5.1 Non-XML-based Fragmentation in Current SMIL

Since version 1.0, SMIL has had clipping attributes ("clip-begin" and "clip-end" in SMIL 1.0, renamed "clipBegin" and "clipEnd" in SMIL 2.0), which take a temporal media object such as an audio or video file or stream and present only part of it, starting and ending it at particular times. SMIL 1.0 also provides some primitive spatial cropping with the "hidden" value of the "fit" attribute, which crops the bottom and right borders of visual objects that don't fit in the display regions assigned to them. SMIL 2.0 extends this spatial cropping with the "regPoint" construct, which can position an image anywhere within its region and thus have the "hidden" construct crop any amount of any side of the image going beyond the region boundaries.
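The spatial cropping described above amounts to intersecting the image rectangle with the region rectangle. The following Python sketch shows that arithmetic under stated assumptions: the names Rect and visible_part are hypothetical, all values are in pixels, and the image offset is taken as given (however it was derived, for example from a regPoint) rather than computed from SMIL attributes.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def visible_part(region_w, region_h, img_left, img_top, img_w, img_h) -> Optional[Rect]:
    """Return the part of the image, in image coordinates, that lies in the region.

    img_left and img_top give the image's top-left corner relative to the
    region's top-left corner; negative values push the image off the left
    or top edge. Returns None when nothing of the image is visible.
    """
    x0 = max(img_left, 0)
    y0 = max(img_top, 0)
    x1 = min(img_left + img_w, region_w)
    y1 = min(img_top + img_h, region_h)
    if x1 <= x0 or y1 <= y0:
        return None
    return Rect(x0 - img_left, y0 - img_top, x1 - x0, y1 - y0)


# Image anchored at the region's top-left: only right and bottom are cropped.
anchored = visible_part(100, 100, 0, 0, 150, 120)
# Image shifted up and to the left: all four sides can now be cropped.
shifted = visible_part(100, 100, -25, -10, 150, 120)
```

The first call corresponds to the SMIL 1.0 case, where only the right and bottom edges can be lost; the second shows how an offset lets the same clipping rule remove any side.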

Thus, as it stands in SMIL 2.0, fragmentation can occur on integrated media objects as long as it is defined in terms of spatial and temporal measurements. What remains is fragmentation based on the structure internal to the integrated media object. On the Web, this internal structure is typically defined by XML. The primary current XML-defined media content formats are Scalable Vector Graphics (SVG) [SVG] and, of course, XHTML [XHTML]. Other XML-defined media-content formats may be developed as well, such as MPEG-7, which may provide an XML structure interface to segmented audio and video tracks [MPEG7].

5.2 Hyperlinked XML-located Fragments in Current SMIL with XPointer

XPointer is a W3C candidate recommendation that is specifically designed for the task of locating portions of XML documents defined in terms of XML structure [XPointer]. The SMIL 2.0 specification allows XPointer in the SMIL URI attributes such as "src" and "href" to integrate structure-defined XML media fragments. What is not specified is how to process the XML document fragment returned. The XPointer specification leaves it as completely "implementation dependent" what an application should do with the fragment returned by an XPointer reference. The SMIL 2.0 definition makes no further specification of what behavior should result from XPointer references.

One precedent that could be used for handling XPointers is the processing of named anchor references in HTML. This defines hyperlinks to portions of HTML documents rather than to HTML documents as a whole. These are values of the "href" attribute that contain the '#' character. What appears before the '#' character is the URI of a particular HTML document. What appears after is the name of a named anchor in the HTML document. This feature has been widely used since the early days of the Web.

Originally, names were assigned to <a> elements with the "name" attribute, but HTML linking definition has progressed to be consistent with XML unique identifiers and, later on, XPointer. HTML 4.0 [HTML4] introduced the unique identifier, or "id", attribute as a means of assigning a name to an <a> element. XHTML deprecates the "name" attribute in favor of "id", and will eventually remove it altogether. When XPointer was released, it specified the bare name XPointer value as a shortcut for identifier referencing that is consistent with the use of fragment identifiers in "href" attribute values in XHTML. That is, if a URI attribute value has a '#' character followed by a name, then XPointer accepts that as locating the element in the reference document that is assigned that unique identifier. As such, XHTML "href" attribute named anchor references use valid XPointer values.

These constructs for hyperlinking to XML-located fragments have been adopted by SMIL, with the same semantics as they apply to timed multimedia. The "href" attribute of SMIL <a> element hyperlinks can refer to a portion of the same or an external SMIL presentation by having as its value the unique identifier of an element in that presentation. And as with HTML, such references are valid XPointer.

The behavior that XHTML and SMIL apply to these types of XPointers is that the referenced document be loaded for presentation in its entirety, but the presentation is forwarded to where the referenced fragment is displayed. In HTML browsers, this typically means displaying the whole document, but scrolling it so that the referenced section shows in the window. After this initial display, the user can scroll the text up and down in order to see any part of the rest of the document.

The corresponding forwarding that happens in SMIL is of a different nature. The syntactic progression of SMIL is time-based, not text-flow-based like HTML. Thus, forwarding in SMIL does not mean scrolling spatially through the flow of text but "seeking" forward along the timeline. When a SMIL hyperlink is activated that has an XPointer bare name value pointing into a SMIL presentation, that presentation is loaded and played starting at the time that the referenced element would start playing. The default behavior is that the linked-to presentation continues playing until the end of the presentation as a whole, which is not necessarily the ending time of the referenced object.
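The seek target can be pictured as the begin time of the referenced element on the presentation timeline. The Python sketch below computes it for a deliberately simplified timing model, which is an assumption of this example and far narrower than SMIL 2.0 timing: only <seq> and <par> containers, <body> treated as a sequence, media durations given solely by a "dur" attribute in whole or fractional seconds (such as "5s"), and no "begin", "end" or event-based timing. The function names duration and begin_time are hypothetical.

```python
import xml.etree.ElementTree as ET

SMIL_DOC = """\
<smil><body>
  <video id="intro" dur="10s"/>
  <par id="scene">
    <video id="clip" dur="20s"/>
    <seq>
      <img id="slide1" dur="5s"/>
      <img id="slide2" dur="5s"/>
    </seq>
  </par>
  <audio id="outro" dur="8s"/>
</body></smil>
"""

SEQUENTIAL = ("seq", "body")


def duration(el):
    """Duration in seconds under the simplified model described above."""
    if el.tag in SEQUENTIAL:
        return sum(duration(child) for child in el)
    if el.tag == "par":
        return max((duration(child) for child in el), default=0.0)
    return float(el.get("dur", "0s").rstrip("s"))


def begin_time(el, target_id, start=0.0):
    """Return the time at which the element with target_id starts, or None."""
    if el.get("id") == target_id:
        return start
    offset = start
    for child in el:
        found = begin_time(child, target_id, offset)
        if found is not None:
            return found
        if el.tag in SEQUENTIAL:
            offset += duration(child)  # later children start after this one
    return None


root = ET.fromstring(SMIL_DOC)
seek_to = begin_time(root, "slide2")
remaining = duration(root.find("body")) - seek_to
```

A link to "#slide2" would seek to the value in seek_to, and the presentation would then play on for the remaining time, past the end of the slide itself, which matches the default behavior described above.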

5.3 Integrating Fragmented Media without its Context

While this forwarding behavior may typically be desired for hyperlink traversal, it is not always desired when the media fragment is integrated directly into the display. For example, if a passage from a Shakespeare play is used as a caption in a multimedia display, having scrollbars appear with the passage to allow access to the rest of the play is often not desired, and will often clutter the visual appearance of the display as a whole. The behavior of processing XPointer references by loading the whole document in the play space but forwarding the display state to the referenced element is defined for hyperlinking in XHTML and SMIL. But neither this behavior nor any other is specified for the use of bare name XPointer values, or other XPointer values, in the SMIL "src" attribute of media object elements. This is the construct that defines the non-hyperlinked integration of media fragments into SMIL presentations. This section suggests extensions to SMIL that allow media fragments to be integrated independently of the rest of the documents from which they come.

The suggested new construct for defining the behavior is a "context" attribute for SMIL media object elements. This attribute can have a value of either "remove", which is the default, or "keep". A value of "keep" maintains the context of the integrated media fragment, placing the entire referenced media file in the assigned display space, making sure that the referenced fragment is displayed initially, and providing the user with the means to access the rest of the media file. A value of "remove", on the other hand, displays only the referenced fragment, displays none of the rest of the media file, and provides no interface in the presentation for accessing the rest of the media file. The DTD code for this attribute is given below:

  ext:context  (remove|keep)  "remove"

But while none of the content in the rest of the media file should be displayed, some code in the file outside of the referenced object will have an impact on its appearance. For example, if the external document is SVG and a non-rectangular visual fragment is selected with XPointer, the rectangular display of that fragment upon integration should not include what appears around that fragment in the same rectangle in the original document's presentation. However, this rule does not apply if visual objects in the context of the fragment come from style instead of content. If the external document is in XHTML and has a CSS style sheet specifying a background image, then that image is part of the fragment's presentation style. The style from the original document and fragment should be maintained in the integrated presentation unless overridden by style code of the integration.
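The two values of the suggested "context" attribute can be sketched as two ways of preparing the referenced document for rendering. The Python below is an illustration under assumptions, not a specified algorithm: the function name integrate is hypothetical, the source is assumed to be a namespace-free XHTML-like document, fragments are located by "id" only, and "style" is reduced to copying the document's <style> elements alongside the fragment.

```python
import copy
import xml.etree.ElementTree as ET

PLAY = """\
<html>
  <head><style>p { color: navy }</style></head>
  <body>
    <p id="act1">Who's there?</p>
    <p id="soliloquy">To be, or not to be</p>
    <p id="act5">The rest is silence.</p>
  </body>
</html>
"""


def integrate(doc, fragment_id, context="remove"):
    """Return (tree_to_render, id_to_show_first) for a fragment reference."""
    if context == "keep":
        # Whole document is rendered; the fragment is only the initial focus.
        return doc, fragment_id
    fragment = next((e for e in doc.iter() if e.get("id") == fragment_id), None)
    if fragment is None:
        raise KeyError(fragment_id)
    # "remove": build a new document holding the style and the fragment only.
    result = ET.Element(doc.tag)
    head = ET.SubElement(result, "head")
    for style in doc.iter("style"):
        head.append(copy.deepcopy(style))
    body = ET.SubElement(result, "body")
    body.append(copy.deepcopy(fragment))
    return result, fragment_id


doc = ET.fromstring(PLAY)
removed, _ = integrate(doc, "soliloquy")
kept, _ = integrate(doc, "soliloquy", context="keep")
```

With "remove", the rendered tree contains the one passage and the original style rule, so no scrollbars are needed; with "keep", all three passages remain available to the user.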

6. Integration and Style

6.1 Applying Style to Integrated Media

While SMIL 2.0 can state what media gets integrated, and where and when it gets displayed, it says nothing about changes in the media object's style or appearance that should take place for this particular integration. Such changes could mean applying an image-processing filter to integrated visual media to, for example, make it black-and-white, blurry, or faded. This processing would not be part of the original image's appearance, but only applied when integrated in this SMIL presentation. Similarly, such changes could also mean applying a particular CSS style sheet to an XML-defined media object for integration. This would enable, for example, having all XHTML text used as image captions share the same font type, color and size.

This paper suggests extending SMIL with the <filter> element and "filter" attribute of SVG, with some additions. The "filter" attribute would be assigned to the <region> element, which specifies how a media object is to be integrated into the layout. For example, the <region> element states in what rectangular area of the screen a visual image will appear, and how loud an audio object is to be played. The "filter" attribute would reference a file that specifies the style of the media object's presentation. This file could be, for example, a CSS file or an image processing script. The attribute value could also be an XPointer reference to a <filter> element described below.

The <filter> element would appear as a child of a <region> element. Instead of referencing style code, this element would directly contain it. The element would also have a "type" attribute to state the format the style code is in. With the "filter" attribute and element, the <region> element can state many other aspects of how an integrated media object is to be presented. The suggested use of the <filter> element to contain CSS code is discussed in the next section. The <filter> element would, of course, also have the attributes and allowed content that the <filter> element of SVG has, with the same resulting behavior.

6.2 Syntactic Integration of CSS into SMIL Layout

SMIL specifies its own means of defining the layout of a presentation. However, SMIL-defined layout has several relations with CSS [CSS]. One is that it is highly isomorphic with CSS: its constructs have the same or similar names as their equivalents in CSS. Furthermore, ever since version 1.0, a SMIL presentation can have a CSS-defined alternative layout for its SMIL-defined layout. But despite this isomorphism, there are functional differences between what SMIL layout and CSS provide, and it is currently difficult if not impossible to use a combination of both to have access to all the features both provide. This section suggests the further incorporation of CSS into the layout of SMIL to provide access to the functions of both formats.

The suggested extension for encoding this behavior is the use of the extension <filter> element within the SMIL layout element hierarchy. These <filter> elements would contain CSS code defining the style applied to their parent layout element and all of its descendants. This use of CSS would be denoted by setting the "type" attribute of these elements to "text/css". Region attributes whose semantics are representable by CSS would be deprecated in favor of using only CSS code in <filter> elements to specify these semantics. This section also presents suggested extensions to CSS for constructs in SMIL layout with no CSS equivalent. These would enable CSS to encode all layout and style features needed for multimedia on the Web as well as for text. The relevant CSS and SMIL constructs and their impact on the suggested extensions are laid out in Table 2.

CSS Property or Concept	SMIL Construct	Impact on Suggested Extension
top, left, bottom, right, width and height properties	top, left, bottom, right, width and height attributes
background-color property	backgroundColor attribute
background-attachment, background-image, background-position and background-repeat properties	no equivalent
overflow:hidden	fit="hidden"	default in SMIL, not in CSS
overflow:scroll	fit="scroll"	default in CSS, not in SMIL
overflow:visible	no equivalent
border properties	no equivalent
text-related properties	no equivalent	Would apply to SMIL media content, not layout
volume property (applies to synthesized speech)	soundLevel (applies to general audio)	similar semantics but little isomorphism in syntax

Table 2: The relevant CSS and SMIL constructs and their impact on the suggested extensions

The extensions suggested here provide the additional multimedia layout features CSS has but SMIL 2.0 lacks. They also provide text-related constructs which can be applied to integrated text and graphics content, as are described ahead. And finally, they would unify the definition of CSS-encodable layout and style on the Web back within CSS. This section demonstrates the feasibility of this with a CSS-to-SMIL layout converter, which includes the generation of equivalent composite SMIL structures for CSS constructs with no single equivalent SMIL construct.
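To make the single-equivalent rows of Table 2 concrete, the following Python sketch maps a flat list of CSS declarations onto SMIL region attributes. It is an illustration only and not the converter mentioned above: it handles declarations without selectors or comments, covers just the rows of Table 2 that have a direct SMIL counterpart, and reports everything else as unsupported instead of generating composite SMIL structures. The names parse_declarations and css_to_region are hypothetical.

```python
# Rows of Table 2 with a same-named SMIL region attribute.
SAME_NAME = {"top", "left", "bottom", "right", "width", "height"}
# overflow values that Table 2 pairs with a "fit" value.
OVERFLOW_TO_FIT = {"hidden": "hidden", "scroll": "scroll"}


def parse_declarations(css):
    """Split 'name: value; name: value' into a dict (no selectors, no comments)."""
    decls = {}
    for part in css.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip()
    return decls


def css_to_region(css):
    """Return (region_attributes, unsupported_declarations)."""
    attrs, unsupported = {}, {}
    for name, value in parse_declarations(css).items():
        if name in SAME_NAME:
            attrs[name] = value
        elif name == "background-color":
            attrs["backgroundColor"] = value
        elif name == "overflow" and value in OVERFLOW_TO_FIT:
            attrs["fit"] = OVERFLOW_TO_FIT[value]
        else:
            # No single SMIL 2.0 equivalent according to Table 2.
            unsupported[name] = value
    return attrs, unsupported


attrs, unsupported = css_to_region(
    "top: 10px; left: 20px; background-color: black; "
    "overflow: hidden; border: 1px solid red"
)
```

The unsupported dictionary is where a fuller converter would have to either build composite SMIL structures or rely on the CSS extensions suggested in this section.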

6.2.1 Putting the Overlap Constructs Exclusively in CSS

6.2.2 Use of Text-based CSS Features on Integrated Media

6.2.3 Cascading Style in SMIL

Region Hierarchy

Body Hierarchy

6.2.4 Necessary Extensions to CSS for Multimedia

Hierarchical Regions

Viewports

regPoints

7. Integration of Spatial Positions

Another layout integration issue is the ability to refer to the locations of objects within integrated visual objects. For example, this enables placing a pointer on the relevant part of an image. More simply, the implicit dimensions of integrated visual media objects could impact the layout structure in the same way implicit durations can impact timing. This is part of the fragmentation and integration of document properties or structure.

A simple type of property integration is using the "implicit width/height" to set up the position of other regions or images. That is, you could say that one region or image is to be placed just to the left of another, no matter how wide it is. But this raises complications, both in general and with the SMIL layout model. The conflict with SMIL layout is that it is set up as distinct and fixed, in the head, separate from the body where the images are located. The general complication is what happens when an image is replaced and other images depend on it, either on the image itself or on the region it is placed in. Should an image be relocated in the middle of its display if an image whose implicit dimensions are used for its placement is removed, or is replaced with an image with different implicit dimensions?

What is needed for this is "dynamic layout positioning". Perhaps an easier alternative is to say that each relative positioning action happens once, when it is triggered, and does not change. Perhaps then all relative positioning constructs should be in the temporal hierarchy. Once an image is made active, the relative positioning it uses remains until the image media object becomes inactive, even if the images it depends on change.
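The "resolve once" alternative can be illustrated with a small Python sketch. Everything in it is hypothetical: the Visual class, the activate_right_of function and the pixel values are assumptions chosen to show the behavior, not proposed SMIL constructs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visual:
    name: str
    width: int                   # implicit width in pixels
    left: Optional[int] = None   # resolved only when the object becomes active


def activate_right_of(obj: Visual, anchor: Visual) -> None:
    """Resolve obj's position once, from the anchor's geometry at this moment."""
    if anchor.left is None:
        raise ValueError("anchor is not active")
    obj.left = anchor.left + anchor.width


photo = Visual("photo", width=200, left=0)
caption = Visual("caption", width=120)
activate_right_of(caption, photo)   # position fixed at activation time

photo.width = 320                   # photo is replaced by a wider image
# caption.left is deliberately not recomputed while caption stays active
```

Under this model the caption stays where it was placed even though the photo it was positioned against has changed; fully dynamic layout positioning would instead recompute the position on every such change.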

This supports the argument for layout constructs in the SMIL body. There still should be distinct layout, and its use should generally be preferred over inline positioning. But if one image is going to be in one place, and nothing else will ever be in that place, it would be easier to attach region information specifically to that image. One way to model this is to allow the CSS <filter> elements and "filter" attributes this paper proposes on media object elements as well. Here, CSS-defined information could be hierarchically inherited in the SMIL body as it is in the XHTML body and, as proposed, in the SMIL layout tree.

Conclusion

Acknowledgements

The research for this paper was funded in part by the Multimedia Information Analysis (MIA) project [MIA] and by the RTIPA project [RTIPA]. Valuable insights were provided by Jacco van Ossenbruggen, Steven Pemberton and Lynda Hardman of the CWI.

References

[CSS] Bos, B., Lie, H.W., Lilley, C. and Jacobs, I. Cascading Style Sheets, level 2 - CSS2 Specification, World Wide Web Consortium Recommendation, May 1998.
[SMIL2] Cohen, A. et al. (eds.) Synchronized Multimedia Integration Language (SMIL 2.0) Specification, World Wide Web Consortium Last Call Working Draft, September 2000. (Work in Progress)
[XPointer] Daniel, R. Jr., DeRose, S. and Maler, E. XML Pointer Language (XPointer) Version 1.0, World Wide Web Consortium Candidate Recommendation, June 2000.
[SVG] Ferraiolo, J. Scalable Vector Graphics (SVG) 1.0 Specification, World Wide Web Consortium Candidate Recommendation, August 2000.
[Halasz] Halasz, F. and Schwartz, M., "The Dexter hypertext reference model", Communications of the ACM, vol. 37, no. 2, February 1994, pp 30-39.
[Hardman] Hardman, L., Bulterman, D.C.A. and van Rossum, G. "The Amsterdam hypermedia model: adding time and context to the Dexter model", Communications of the ACM, vol. 37, no. 2, February 1994, pp 50-62.
[SMIL1] Hoschka, P. et al. (eds.), Synchronized Multimedia Integration Language (SMIL), World Wide Web Consortium Recommendation, June 1998.
[MPEG7] International Organisation for Standardisation (ISO), Overview of the MPEG-7 Standard, ISO/IEC JTC1/SC29/WG11 N3445, Geneva, May/June 2000.
[MIA] MIA Project, MIA - Multimedia Information Analysis Webpage.
[XHTML] Pemberton, S. et al., XHTML 1.0: The Extensible HyperText Markup Language - A Reformulation of HTML 4 in XML 1.0, World Wide Web Consortium Recommendation, January 2000.
[HTML4] Raggett, D., Le Hors, A. and Jacobs, I. HTML 4.01 Specification, World Wide Web Consortium Recommendation, December 1999.
[RTIPA] RTIPA project, RTIPA project - Real Time Internet Platform Architectures.
[Rutledge99] Rutledge, L., van Ossenbruggen, J., Hardman, L. and Bulterman, D.C.A. "Anticipating SMIL 2.0: The Developing Cooperative Infrastructure for Multimedia on the Web", Proceedings of The Eighth International World Wide Web Conference (WWW8), May 1999.

Vitae

Lloyd Rutledge is a researcher at the CWI. His research involves adaptable hypermedia, and standards for it such as SMIL. He received his Sc.D. from the University of Massachusetts Lowell, where he worked with the Distributed Multimedia Systems Laboratory (DMSL) on developing the HyOctane HyTime-based hypermedia environment. Lloyd Rutledge is a member of the W3C SYMM Working Group, which developed SMIL.

Notes for Reviewers^Notes

[SMIL] SMIL 2.0 is expected to have been released as a recommendation by the due date for final versions of WWW10 papers. All references in this paper to SMIL 2.0 will be updated to cite it as a recommendation. Furthermore, any changes to SMIL 2.0 between the current draft and the recommendation version that affect the content of this paper will be applied to the paper's final draft.
[Notes] These notes are only for the review process and will not appear in the final publication.