Journal of Knowledge Management Practice, December 2001

Content Management Makes Sense - Part 2

Technical Impact Of Semantics On Content Management

Michaël Auffret, Profium

ABSTRACT:

Argues that organizations across the world are facing a new challenge: to efficiently manage the growing volume of static and/or dynamic information accumulated within the enterprise. Contends that by using Semantic Content Management and related technology, organizations working with large volumes of data will soon realize dramatic cost reductions, revenue improvements and opportunities for gaining competitive advantage.

In Part 2 the author clarifies the impact of semantics on the content management industry and explores the characteristics and semantic nature of the next generation of content management solutions. Explains how, based on the World Wide Web Consortium’s (W3C) view of the Semantic Web (the next incarnation of today’s Internet), a number of open standards have emerged. Asserts that two of these standards, XML (Extensible Markup Language) and RDF (Resource Description Framework), have become absolute requirements for all modern content management solutions. Contends that the application of these standards fundamentally re-architects the technological structure of content management systems, removing some of the limitations that plague many current industry applications.

A Glossary of terms is provided as well as online links to other resources.

Introduction

With an explosion in the amount of information made available to us as individuals, our world is often characterized by increasing complexity. Most of the time this wealth of information is considered key to the welfare of both individuals and enterprises. However, handling massive information streams is not a trivial task; on the contrary, it requires a sophisticated IT environment that employs the correct tools and well-chosen standards to offer users the freedom and ability to face the content management challenges of tomorrow.

The area of handling massive streams of information is referred to as ‘content management’. One immediate hurdle is that the term is not clearly defined, which means that anybody producing and/or managing content can claim to be in this particular software hotspot.

· What do we mean by content? Content can be as diverse as film, audio, SMS, email and news streams.

· What is ‘management’? Is it storage, editing, web site structure and/or workflow authorization procedures? Or is it something else entirely?

So, the current situation is unclear. This is understandable when you look at some of the players:

· Website Management vendors – want to develop their existing tools into something larger and more functionally rich.

· Document Management vendors – obviously interested in content management, but carrying a heavy legacy of document databases.

· Database vendors – see content management as a natural add-on to a database solution.

· E-Business Application vendors – see content management as a natural extension of a commercial web site.

· New software companies focused on Web technology products.

· Spin-offs of either consultancies or advanced end-users, who have developed applications they believe may be of use.

Until now there has been no clear leading technology or architecture and no company has a dominating market position.

Raising the Bar: Matching Patterns of Information

The current generation of content management systems can be characterized as follows:

· Management is achieved by way of named content objects (documents, Web pages etc.) and tags.

· Documents are often managed within a proprietary document database.

· Free-text searching is the prevailing means of looking for specific content.

· There are very limited possibilities for the exchange of content objects.

· Incoming objects cannot be classified or categorized in automated ways using a standard vocabulary.

· Proprietary, monolithic product architectures prevail.

Many content management products originate from a Web site administration or document management background. Such products offer basic ‘search-engine’ style keyword-based searches. Although many sophisticated techniques are used, such as word frequency counts and lexical analysis, this method cannot go beyond basic information searches. For example, a search for Woody Allen does not tell us whether the results relate to Woody Allen as a director, an actor or both. Such methods are one-dimensional: the software simply searches against a list of keywords whose meaning is unknown and which are unrelated to one another.

A relational model may offer a higher level of sophistication, yet the concept of tables and relationships remains two-dimensional and the database itself carries no knowledge of the data it contains. All the system may assume is that there is a functional dependency between a database column and the primary key of the table.

Meaning can be formally expressed in a semantic data model based on triples that follow the ‘Subject-Predicate-Object’ construct. For example, “Take the Money and Run”, “has as Actor” and “Woody Allen”. Collections of such triples together form a directed graph (a network, where the subjects and objects represent the nodes). This graph represents a multi-dimensional, multi-hierarchical model. In consequence, the content objects must be self-describing and the process of searching and retrieving is a matter of matching patterns of information. In an environment such as this it should be perfectly possible to set up automated “Query Agents”, which – in both push and pull modes – satisfy users’ information needs on an ongoing basis. This can be achieved by way of matching the patterns of users’ information profiles with the semantic graph.
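The idea of retrieving content by matching patterns against a graph of triples can be illustrated with a minimal Python sketch. It is not taken from any particular product: the `Triple` type, the `match` function and the predicate wording are assumptions made for illustration, and every triple other than the one quoted above was added here only so that the matcher has something to discriminate between.

```python
from typing import Iterator, List, NamedTuple, Optional


class Triple(NamedTuple):
    """One Subject-Predicate-Object statement (an edge of the directed graph)."""
    subject: str
    predicate: str
    obj: str


# The first triple is the one quoted in the text; the others are illustrative.
GRAPH: List[Triple] = [
    Triple("Take the Money and Run", "has as Actor", "Woody Allen"),
    Triple("Take the Money and Run", "has as Director", "Woody Allen"),
    Triple("Casino Royale", "has as Actor", "Woody Allen"),
    Triple("Casino Royale", "has as Director", "John Huston"),
]


def match(
    graph: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield each triple that agrees with the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    for triple in graph:
        if all(want is None or want == have for want, have in zip(pattern, triple)):
            yield triple


# Unlike a keyword search for "Woody Allen", the pattern states the role we mean.
directed = [t.subject for t in match(GRAPH, predicate="has as Director", obj="Woody Allen")]
print(directed)  # ['Take the Money and Run']
```

A keyword search for the name alone would correspond to `match(GRAPH, obj="Woody Allen")`, which returns three triples without distinguishing the actor from the director.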

The Semantic Web

Whereas Web pages today are described by keywords (created by the author of the page), the Semantic Web introduces structure in such a way that content becomes meaningful. Content is described by metadata, which is structured in a schema that is defined according to the fundamental ‘subject-predicate-object’ construct of human languages. The inclusion of semantics also involves the construction of common vocabularies, rules for the processing of vocabularies and for the interchange of metadata between different vocabularies, as well as digital signatures for a higher level of trust across information providers.

The W3C has been central to the development of the Semantic Web and has helped in the development of a number of critical open standards (see later). In the words of Tim Berners-Lee, the inventor of the World Wide Web and director of the World Wide Web Consortium (W3C): “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (Berners-Lee et al., 2001). The Semantic Web has been described in several articles, for example in Scientific American (2001).

The Next Generation: Semantic Content Management

The next generation of content management is best referred to as semantic content management and can be characterized by the following description:

A semantic content management solution is a system where content carries meaning. The meaning is expressed by metadata according to a semantic structure and is actively used in the automated management of the content. Content objects can be of any kind. The syntax and semantics used to describe meaning are based on open standards.

In other words, semantic content management is about managing content objects based on their properties. The objects can be of any type and the meaning of their properties can be recorded in metadata descriptions. These metadata descriptions are like library index cards meant for machine readers, not human readers. The metadata expresses the semantics according to the business environment. It could be customer codes for tenders, people’s identities for digital images, or artists for MP3 files.

Semantic content management is particularly well suited for environments with one or more of the following characteristics:

· Business value of content is high

· A need for complex searches

· Rich content objects with many properties

· Dynamic content

· A need for real-time publication

· A need for multi-channel delivery (and/or capture)

· Many content sources

· High volume

· External content feed and/or delivery

There are several major advantages offered by this next generation of content management solution:

· Only metadata enables computers to deal with information in a meaningful manner, in turn enabling a new, higher level of automation and the development of new services.

· Queries return quality results. Only meaningful results are returned (no ‘spamming’).

· By its very nature, a large collection of structured metadata only gets better as more is added. (This is in contrast to the confusion created by a large collection of objects with little or no meaningful semantic structure, such as today’s Web.)

· Complex queries are much easier to express in a rich semantic context, compared to the level of expressiveness offered by SQL or simple list searches.

· Content can be exchanged between different parties because it has meaning and because the meaning is expressed using technology based on open standards.

· Since the meaning of the content is completely described using metadata, there is no need to move content objects into specialized storage with specialized search facilities.

· For the same reason, there is no limit to the kind of content that can be managed – any content object can be described by metadata.

· Content can be easily reused in many different publishing contexts.

· Peer-to-Peer content networks can be established.

· Since all the system needs to manage the content is the metadata, and since metadata is compact, a semantic content management system can handle very large numbers of content objects without scalability problems.

· Open standards ease implementation efforts and make it easier to employ developers and other specialist staff. At the same time, lock-in to proprietary architectures is avoided, which eliminates arbitrary limitations arising from vendor-specific design problems.

Such benefits represent a new order of expressiveness and ease of dealing with content complexity. Rich semantics enable you to manage and retrieve content based on patterns of information, not just simple ‘one or two-dimensional’ queries. This is the level of functionality required to efficiently deal with the vast information resources of today.
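To make the notion of a query over patterns of information more concrete, the following Python sketch evaluates a query made of several triple patterns that share variables. It is only an illustration under stated assumptions: the `?name` variable convention, the `unify` and `query` helpers and the sample data are hypothetical and do not correspond to any standard query language.

```python
from typing import Dict, List, Optional, Tuple

TripleT = Tuple[str, str, str]
Bindings = Dict[str, str]

# Illustrative data only.
GRAPH: List[TripleT] = [
    ("Take the Money and Run", "actor", "Woody Allen"),
    ("Take the Money and Run", "director", "Woody Allen"),
    ("Casino Royale", "actor", "Woody Allen"),
    ("Casino Royale", "director", "John Huston"),
]


def unify(pattern: TripleT, triple: TripleT, bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that the pattern equals the triple, or return None."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(graph: List[TripleT], patterns: List[TripleT]) -> List[Bindings]:
    """Return every set of variable bindings that satisfies all patterns."""
    results: List[Bindings] = [{}]
    for pattern in patterns:
        results = [
            extended
            for bindings in results
            for triple in graph
            if (extended := unify(pattern, triple, bindings)) is not None
        ]
    return results


# "Which films were directed by one of their own actors?"
answers = query(GRAPH, [("?film", "actor", "?person"), ("?film", "director", "?person")])
print(answers)
```

Because both patterns share `?film` and `?person`, the second pattern narrows the candidates produced by the first, which is the kind of pattern that a flat keyword list cannot express.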

An Overview of Metadata

Semantics is a highly complex discipline that can be characterized as ‘the study of meaning’. Although semantics is a term derived from the world of linguistics, it can be compared to the IT term ‘metadata’, a word that signifies data describing data. Metadata can, for example, be used to describe the meaning of content stored within a data warehouse or an online catalog. Such technology can, therefore, be used as the mechanism for building large, structured descriptions of content within enterprise systems. Also, several Enterprise Application Integration solutions rely heavily on metadata in order to facilitate data exchange between different applications.

As the volume of information at our disposal continues to increase, the need for efficient and targeted automation of information capture and delivery is becoming absolutely necessary. If nothing is done, people will spend more time trying to find information than actually consuming it. There is only one way to enable computers to deal with information in a meaningful way, and that is to describe it in a precise, machine-readable format. This is one of the key reasons for the widespread interest in metadata-based technologies today.

A key approach to managing metadata is XML. It is extremely well designed for systems dependent on metadata, and for that reason it is finding its way into many areas of IT, including application integration, electronic document interchange, data warehousing and now content management.

However, XML is just a standard for syntax. To describe meaning, it is not enough just to have ‘tags’ (or keywords in document databases). You need to have well-defined semantics, based on structured metadata.
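The difference between syntax and semantics can be shown with a small Python sketch. The two XML fragments and their element names are hypothetical; they were invented here to show that a generic XML parser sees two unrelated trees even though a human reader recognizes the same fact in both.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Two hypothetical ways of tagging the same fact.
DOC_A = '<film title="Take the Money and Run"><actor>Woody Allen</actor></film>'
DOC_B = '<person name="Woody Allen"><actedIn>Take the Money and Run</actedIn></person>'


def shape(xml_text: str) -> List[str]:
    """Element names in document order: all that a generic XML parser knows."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def triple_from_a(xml_text: str) -> Tuple[str, str, str]:
    """Format-specific mapping for DOC_A, written with knowledge of its meaning."""
    film = ET.fromstring(xml_text)
    return (film.get("title"), "has as Actor", film.findtext("actor"))


def triple_from_b(xml_text: str) -> Tuple[str, str, str]:
    """Format-specific mapping for DOC_B, written with knowledge of its meaning."""
    person = ET.fromstring(xml_text)
    return (person.findtext("actedIn"), "has as Actor", person.get("name"))


print(shape(DOC_A), shape(DOC_B))                  # the trees differ
print(triple_from_a(DOC_A) == triple_from_b(DOC_B))  # the statement is the same
```

Each tagging convention needs its own hand-written mapping before a program can tell that the two documents agree. A shared semantic model removes the need for such per-format knowledge.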

Open Standards

There are many standards for content objects: HTML, the de facto Microsoft Office standard, Adobe Acrobat PDF, JPEG, MP3 etc. There are also a variety of transport protocols available for the exchange of content objects: HTTP, ICE, WAP, SMS etc. However, until recently there has been no standard for organizations looking to manage content objects or wanting to automatically recognize the nature of content produced by a third party.

This means that users of most content management solutions are locked into the proprietary technology of their current content management vendor. This has a profound impact on:

· The pricing level of the software solutions.

· The cost of converting existing content databases to a new, open standards-based solution, which is extremely high (if conversion is possible at all).

· The serious reengineering challenges that existing content management vendors face as new standardized technology becomes available.

Recent developments mean that standards are now being developed for the content management industry, enabling a new generation of solutions. Although the Semantic Web has a much wider scope than just content management, two of its fundamental standards have had a major impact on the industry.

· XML: because it enables a common syntax between content providers, content managers (computer software) and content users.

· RDF: because it enables common semantics between the involved parties (in particular computer programs) in a content management system.

The current content management landscape is similar to that of the database market of the mid-eighties. The industry was dominated by proprietary, monolithic, closed, but functionally rich products. Yet, in just a few short years relational database technology gained a strong foothold because of the SQL-standard. Many database customers were forced to undertake major rewrites and database conversions simply in order to transform their data to the new, standardized environment.

The roadmap to the Semantic Web contains several standardization considerations:

1. Unicode (standard character sets) and URIs (standard identifiers, such as http://www.profium.com)

2. XML, Namespaces and XML Schema

3. RDF and RDF Schema

4. Ontology vocabulary (interchange of semantics between different schema)

5. Logic (standard language for rules)

6. Proof and Trust (incl. digital signatures)

The first three steps have been standardized by the W3C and the three most important standards for enabling the Semantic Web are already in place and in use:

· XML, a standard syntax, which is particularly well suited for describing metadata.

· XSLT (Extensible Stylesheet Language Transformations), a popular standard for the transformation and formatting of XML documents.

· RDF (Resource Description Framework), one of the latest standards from the W3C, dealing with the structure of semantics, expressed in XML.

In the context of building a content management system, the current XML standards (XML, XSLT and RDF) constitute a powerful platform for meeting emerging market challenges.

Semantics Defined: Introducing RDF

RDF is the mechanism within which semantics are used to describe a set of (content) objects. RDF specifies how to describe ‘Resources’ (objects) with ‘Properties’ using ‘Statements’. This follows the subject-predicate-object triple as it occurs in natural languages.
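The Resource, Property and Statement terminology can be sketched in Python as follows. This is a simplified illustration rather than an implementation of the RDF model: the `Statement` class and `qname` helper are hypothetical names, and the resource URI under example.org is invented. The movie namespace is the one used in this article's example, and the Dublin Core element namespace is included only to show how URIs keep two vocabularies apart.

```python
from dataclasses import dataclass

MOVIE_NS = "http://www.profium.com/moviedata#"   # namespace from the article's example
DC_NS = "http://purl.org/dc/elements/1.1/"       # Dublin Core element set namespace


@dataclass(frozen=True)
class Statement:
    """One RDF-style statement: a resource, one of its properties, and a value."""
    resource: str  # URI of the object being described (the subject)
    prop: str      # URI of the property (the predicate)
    value: str     # literal value, or the URI of another resource (the object)


def qname(namespace: str, local_name: str) -> str:
    """Build a full property URI from a namespace URI and a local name."""
    return namespace + local_name


FILM = "http://example.org/films/take-the-money-and-run"  # hypothetical resource URI

statements = [
    Statement(FILM, qname(MOVIE_NS, "title"), "Take the Money and Run"),
    Statement(FILM, qname(MOVIE_NS, "actor"), "Woody Allen"),
]

# Two vocabularies may both define "title"; as URIs they remain distinct.
print(qname(MOVIE_NS, "title"))
print(qname(DC_NS, "title"))
```

Identifying properties by full URI rather than by bare name is what allows descriptions from independent vocabularies to be mixed without collisions.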

Note: The RDF standard has two elements: The RDF Model and Syntax, which is a W3C recommendation (standard) and the RDF Schema, which is a W3C candidate recommendation (i.e. not finally approved).

In order to leverage semantics, one needs to follow these guidelines:

1. Construct an RDF Schema, which describes the vocabulary to be used in the system. A good option is to base your schema on some existing, industry standard RDF schema (e.g. Dublin Core, see below). Using RDF Schemas involves heavy use of several XML Namespaces. RDF itself has a syntax namespace and a schema namespace. The industry standard schema could be a third namespace, and you could add your own vocabulary to that as a fourth namespace and so forth. Your RDF schemas will be used for the validation of your content objects and consequently represent a significant contribution to the overall success of a system.

2. Describe your content objects with metadata in XML syntax. The vocabulary is restricted to the namespace(s) and RDF Schema(s) behind them.

3. Your content objects (or the metadata documents describing them) may still be XML documents and may be processed by any XML processor.

4. Should you want to take advantage of some of the more advanced features of RDF Schemas, you should employ an RDF validator, which understands constructs such as dependencies and subtypes.

5. Your content management system should take advantage of XSLT to format your content objects (or maybe just the metadata) for different delivery channels and media.

<rdf:RDF
    xmlns="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Property rdf:ID="title">
    <label xml:lang="en">Title</label>
    <comment xml:lang="en">The main title of the movie</comment>
</rdf:Property>
<rdf:Property rdf:ID="director">
    <label xml:lang="en">Director</label>
    <comment xml:lang="en">The person who directed the movie</comment>
</rdf:Property>
<rdf:Property rdf:ID="actor">
    <label xml:lang="en">Actress/Actor</label>
    <comment xml:lang="en">The principal Actress/Actor</comment>
</rdf:Property>
</rdf:RDF>

The example above is an excerpt of an RDF Schema for a movie context and is a small part of a very basic model. The schema refers to the two XML namespaces for RDF (the syntax namespace and the schema namespace) and then simply lists the acceptable properties.
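As an illustration of how software might read such a schema, the following Python sketch extracts the declared property names and labels using only the standard library's generic XML parser. It is a simplification under stated assumptions: a real system would use an RDF-aware parser, and the `read_properties` function is a hypothetical helper, not part of any RDF toolkit.

```python
import xml.etree.ElementTree as ET
from typing import Dict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"

# The schema excerpt from the article, embedded so the sketch is self-contained.
SCHEMA = """\
<rdf:RDF
    xmlns="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Property rdf:ID="title">
    <label xml:lang="en">Title</label>
    <comment xml:lang="en">The main title of the movie</comment>
  </rdf:Property>
  <rdf:Property rdf:ID="director">
    <label xml:lang="en">Director</label>
    <comment xml:lang="en">The person who directed the movie</comment>
  </rdf:Property>
  <rdf:Property rdf:ID="actor">
    <label xml:lang="en">Actress/Actor</label>
    <comment xml:lang="en">The principal Actress/Actor</comment>
  </rdf:Property>
</rdf:RDF>
"""


def read_properties(schema_xml: str) -> Dict[str, str]:
    """Map each declared property ID to its human-readable label."""
    root = ET.fromstring(schema_xml)
    properties: Dict[str, str] = {}
    # ElementTree spells namespaced names as "{namespace-uri}local-name".
    for prop in root.findall(f"{{{RDF_NS}}}Property"):
        prop_id = prop.get(f"{{{RDF_NS}}}ID")
        properties[prop_id] = prop.findtext(f"{{{RDFS_NS}}}label")
    return properties


vocabulary = read_properties(SCHEMA)
print(vocabulary)  # {'title': 'Title', 'director': 'Director', 'actor': 'Actress/Actor'}
```

The resulting dictionary is the vocabulary against which metadata descriptions can later be checked.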

An example of an RDF metadata description of a movie, based on the schema above, could be:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:movie="http://www.profium.com/moviedata#">
<!-- Properties belong inside a node element; rdf:Description without rdf:about describes an unnamed resource -->
<rdf:Description>
    <movie:title>Casino Royale</movie:title>
    <movie:director>John Huston</movie:director>
    <movie:genre>Comedy</movie:genre>
    <movie:actor>Woody Allen</movie:actor>
    <movie:plotsummary>In an early spy spoof, aging Sir James Bond (David Niven) comes out of retirement to take on SMERSH.</movie:plotsummary>
    <movie:publicationyear>1967</movie:publicationyear>
    <movie:runtime>131</movie:runtime>
    <movie:country>USA</movie:country>
    <movie:language>English</movie:language>
    <movie:color>Color</movie:color>
</rdf:Description>
</rdf:RDF>

As can be seen, the metadata description refers to the RDF syntax namespace and to the RDF Schema described above. Admittedly, it is a simple example but it demonstrates the basic mechanics of using RDF to control semantic searching.
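The step from a metadata description to validated statements might look like the following Python sketch. It rests on several assumptions: the embedded record is a trimmed-down description of the same form, with the properties wrapped in an `rdf:Description` element as RDF/XML expects; the set of allowed properties is copied by hand from the schema excerpt; and `read_statements` and `unknown_properties` are hypothetical helpers. The check covers vocabulary membership only, not the richer constraints an RDF validator would apply.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MOVIE_NS = "http://www.profium.com/moviedata#"

# Property names declared in the schema excerpt.
ALLOWED: Set[str] = {"title", "director", "actor"}

DESCRIPTION = """\
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:movie="http://www.profium.com/moviedata#">
  <rdf:Description>
    <movie:title>Casino Royale</movie:title>
    <movie:director>John Huston</movie:director>
    <movie:actor>Woody Allen</movie:actor>
    <movie:genre>Comedy</movie:genre>
  </rdf:Description>
</rdf:RDF>
"""


def read_statements(rdf_xml: str) -> List[Tuple[str, str]]:
    """Return (property local name, value) pairs for every described resource."""
    root = ET.fromstring(rdf_xml)
    prefix = f"{{{MOVIE_NS}}}"
    statements: List[Tuple[str, str]] = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        for child in description:
            if child.tag.startswith(prefix):
                statements.append((child.tag[len(prefix):], child.text or ""))
    return statements


def unknown_properties(statements: List[Tuple[str, str]], allowed: Set[str]) -> List[str]:
    """List property names that the vocabulary does not declare."""
    return sorted({name for name, _ in statements if name not in allowed})


statements = read_statements(DESCRIPTION)
print(statements)
print(unknown_properties(statements, ALLOWED))  # ['genre']
```

Here `genre` is reported because the schema excerpt lists only three properties. A complete schema would declare it, or the description would be rejected.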

One of the big differences between the relational world and the semantic / RDF world is that although an RDF Schema operates on a level very similar to an ER-model, the RDF Schema is changeable and extensible on the fly. In fact, your content management system need not store the content objects themselves; all you need to store in order to manage content is the metadata and RDF Schema(s) necessary to validate and manage them.

Certain industries have defined metadata standards within their scope of interest and described them using RDF. For example, two such standards, Dublin Core and PRISM, are widely used in the media industry. Other examples are RSS (RDF Site Summary), a metadata description and syndication format for Web sites; CC/PP, a forthcoming standard for describing the capabilities of devices such as next-generation mobile handsets; and MusicBrainz MM, a proposed set of metadata for audio and video on the Internet.

Internet browsers Mozilla and Netscape 6 use RDF as does Adobe Acrobat 5, which is using RDF as its metadata language. This is part of Adobe’s XMP (Extensible Metadata Platform) framework, the backbone of Adobe’s approach to Network Publishing. In Adobe’s own words: “XMP provides Adobe applications and partners with a common metadata framework that standardizes the creation, processing and interchange of document metadata across publishing workflows. XMP will be incorporated into all Adobe products eventually …” (Adobe, 2001).

More information about RDF is available (Bray, 2001; Hjelm, 2001; and at http://www.w3.org/RDF/).

Conceptual Architecture Requirements

Although semantic content management systems ultimately follow the ‘Input – Process – Output’ procedure, there are significant architectural differences and challenges in this, the next generation of information delivery.

Since Semantic Web implementations are based on open standards and since semantic content managers will be integral parts of this environment, there are some basic requirements to be fulfilled:

1. Full Unicode support.

2. Content objects should be identified by their URIs.

3. XML and XML Namespaces should underlie the design of the system, as opposed to mapping XML documents to a proprietary internal architecture.

4. Management of content by way of (XML-based) metadata should be part of the basic design of the system.

5. RDF and RDF Schemas must be fully supported.

6. Since XSLT is the standard for format transformations of XML documents, the system should be based on (not only supporting) XSLT transformations.

7. The system should offer open APIs (based on the standard languages) allowing implementers to easily build custom applications and enabling a Best-of-Breed approach to software tool selection (authoring, editing etc.). A good example might be the capability of including a Topic Maps-based tool for navigation of the semantic model.

Consequently, there are new components in a semantic architecture:

· XML-processing as the basic input module for metadata.

· There may be automated ways of identifying metadata (with reference to one or more target RDF Schema(s)) on the input side.

· Validation of metadata against a (collection of) RDF Schema(s).

· Management by way of metadata.

· Automated Query Agents capable of navigating the multi-dimensional, multi-hierarchical structures of the semantic network as described in the RDF Schema(s).

On the other hand, there is no real need for storage of content objects within the new technology. Objects are described by metadata and may be accessed based on their URI, on an as-needed basis. Many of the interesting content objects today are not textual or structured by nature, and the number of formats and media continues to increase. Ultimately, content providers must be able to cope with a number of different content object storage devices.
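A metadata-only store of this kind could be sketched in Python as below. Everything here is hypothetical: the `MetadataCatalogue` class, its method names, the example.org URI and the sample properties are illustrative. The retrieval function is injected so that the catalogue itself never holds content; in practice it might wrap an HTTP client, a file system or a media server.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetadataCatalogue:
    """Holds only metadata; the content stays wherever its URI points."""
    fetch: Callable[[str], bytes]  # injected function that dereferences a URI
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def describe(self, uri: str, **properties: str) -> None:
        """Add or update the metadata recorded for a content object."""
        self.records.setdefault(uri, {}).update(properties)

    def find(self, **criteria: str) -> List[str]:
        """Return the URIs whose metadata matches every criterion."""
        return [
            uri
            for uri, props in self.records.items()
            if all(props.get(key) == value for key, value in criteria.items())
        ]

    def retrieve(self, uri: str) -> bytes:
        """Fetch the content object itself, only when it is actually needed."""
        return self.fetch(uri)


fetched: List[str] = []


def fake_fetch(uri: str) -> bytes:
    """Stand-in for real retrieval; records which URIs were dereferenced."""
    fetched.append(uri)
    return b"<content bytes>"


catalogue = MetadataCatalogue(fetch=fake_fetch)
catalogue.describe("http://example.org/media/clip-17.mp3", artist="Some Artist", format="MP3")
hits = catalogue.find(format="MP3")
print(hits, fetched)  # searching touches metadata only; nothing has been fetched
```

Searching works entirely on the compact metadata records, so the storage format and location of the content objects can vary freely.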

From a software development perspective, the most challenging element of any semantic content manager is the Automated Query Agent. Not only do Agents require complex implementation but it is evident that a successful implementation is only possible if the whole system is built in support of semantics from the ground up.
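The push and pull behaviour of a query agent can be outlined with a deliberately simplified Python sketch. It is not a description of any actual agent implementation: the `QueryAgent` class is hypothetical, user profiles are reduced to flat property and value pairs instead of graph patterns, and the URIs and user names are invented.

```python
from typing import Dict, List, Tuple

Metadata = Dict[str, str]


class QueryAgent:
    """Matches user information profiles against metadata descriptions."""

    def __init__(self) -> None:
        self.profiles: Dict[str, Metadata] = {}          # user -> required properties
        self.inboxes: Dict[str, List[str]] = {}          # user -> delivered URIs
        self.catalogue: List[Tuple[str, Metadata]] = []  # everything published so far

    @staticmethod
    def _matches(profile: Metadata, metadata: Metadata) -> bool:
        return all(metadata.get(key) == value for key, value in profile.items())

    def subscribe(self, user: str, **profile: str) -> None:
        """Register a standing information profile (used in push mode)."""
        self.profiles[user] = profile
        self.inboxes.setdefault(user, [])

    def publish(self, uri: str, metadata: Metadata) -> List[str]:
        """Push mode: deliver a new object to every user whose profile it matches."""
        self.catalogue.append((uri, metadata))
        notified = []
        for user, profile in self.profiles.items():
            if self._matches(profile, metadata):
                self.inboxes[user].append(uri)
                notified.append(user)
        return notified

    def search(self, **profile: str) -> List[str]:
        """Pull mode: answer a one-off request from what is already catalogued."""
        return [uri for uri, metadata in self.catalogue if self._matches(profile, metadata)]


agent = QueryAgent()
agent.subscribe("anna", director="Woody Allen")
agent.publish("http://example.org/films/1", {"title": "Take the Money and Run", "director": "Woody Allen"})
agent.publish("http://example.org/films/2", {"title": "Casino Royale", "director": "John Huston"})
print(agent.inboxes["anna"])
print(agent.search(director="John Huston"))
```

Even in this reduced form, the agent depends on every incoming object carrying structured metadata, which reflects the point that such a component only works when the whole system is built around semantics.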

Summary

To date, the development of information networks has been directed towards human users who actively click hyperlinks to find the relevant information. The additional effort required to encode information with semantic meaning is truly worthwhile and can remove the requirement to click through several hyperlinks to find the correct data.

The lifetime of semantic content objects is greatly improved by not being dependent on any bespoke programming language, but rather promoting a declarative syntax such as Extensible Markup Language (XML) to encode information.

Given the common syntax for encoding semantics, the data model that underlies object descriptions on the information networks can be based on the Resource Description Framework (RDF); one can think of RDF descriptions as library index cards. These cards describe the semantics of the content objects for machines or applications with common syntax and data models. The ability to exploit these library index cards using different computers and operating systems opens up exciting opportunities. Information-intensive industries like the media and finance sectors can greatly automate their workflows by providing customers with services that exploit semantic-based content. Semantic content management, being a subset of the Semantic Web initiative, is truly a shift in the way we view, manage and use electronic content.

References and Links To Other Sources

1. World Wide Web Consortium

· Home page: http://www.w3.org/

2. XML etc.

· Home page: http://www.w3.org/XML/

· XML specification: http://www.w3.org/TR/REC-xml

· Namespaces specification: http://www.w3.org/TR/REC-xml-names

· XSL and XSLT: http://www.w3.org/Style/XSL/

3. RDF

· Home page: http://www.w3.org/RDF/

· Bray, T., “What is RDF?” http://www.xml.com/pub/a/2001/01/24/rdf.html

· Resource Description Framework (RDF) Model and Syntax Specification. http://www.w3.org/TR/REC-rdf-syntax/

· Resource Description Framework (RDF) Schema Specification 1.0. http://www.w3.org/TR/2000/CR-rdf-schema-20000327/

· Beckett, D., Resource Description Framework (RDF) Resource Guide http://www.ilrt.bris.ac.uk/discovery/rdf/resources/

· Hjelm, J., Creating the Semantic Web with RDF, Wiley Professional Developer’s Guide Series, New York, 2001, ISBN 0-471-40259-1

· Medeiros, N., “XML and the Resource Description Framework: The Great Web Hope” http://www.onlineinc.com/onlinemag/OL2000/medeiros9.html

4. Adobe Inc., XMP

· Adobe Inc.’s Press Release re: XMP (Extensible Metadata Platform): http://www.adobe.com/aboutadobe/pressroom/pressreleases/200109/20010924xmp.html

5. The Semantic Web

· Home page. http://www.w3.org/2001/sw/

· Berners-Lee, T., Hendler, J., Lassila, O., The Semantic Web, Scientific American, May 2001. http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html

· Community portal. http://www.semanticweb.org/

· Palmer, S., Article http://infomesh.net/swintro

6. Metadata Vocabularies

· W3C Platform for Internet Content Selection (PICS) http://www.w3.org/PICS/

· Dublin Core home page http://purl.org/DC/

· PRISM Standard home page http://www.prismstandard.org/index.asp

· RDF Site Summary (RSS) 1.0. http://groups.yahoo.com/group/rss-dev/files/specification.html

· VCard in RDF. http://www.dstc.edu.au/Research/Projects/rdf/draft-iannella-vcard-rdf-00.txt

· MusicBrainz Metadata initiative http://musicbrainz.org/MM/

7. Introduction to Network Publishing

· http://www.atkearney.com/pdf/eng/Network_Publishing_Study_S.pdf

· http://www.adobe.com/aboutadobe/pressroom/pressreleases/200109/20010924xmp.html

8. Profium and SIR

· Profium home page is http://www.profium.com/

· Information re: SIR http://www.profium.com/gb/products/sir.shtml

· Download an RDF parser written in Java (an online demo is also available at the same URL): http://www.profium.com/gb/products/developers

Glossary

· ICE (Information and Content Exchange). An industry standard protocol that enables content syndication by standardizing content exchange. The ICE protocol describes catalogues of content packages as subscriptions, and enables delivery of those packages to be scheduled using push/pull methods. It also establishes the parameters within which content should be updated, sets business rules, specifies intellectual property rights, and governs other aspects of automated digital asset exchange. ICE works by defining a standard set of messages and its subscribers. These messages are encoded using XML. The ICE standard consists of definitions of these messages and descriptions of their meaning.

· Metadata. A generic term for machine-understandable information that describes content objects.

· RDF (Resource Description Framework). A declarative language that provides a standard way of using XML to represent metadata in the form of statements about properties and relationships of items. Such items, known as resources, can be almost any type of object.

· RDF Schemas. A declarative representation language that describes metadata vocabulary sets. A schema defines the meaning, characteristics and relationships of a set of properties, and this may include constraints on potential values and the inheritance of properties from other schemas. Within a schema, the meanings of terms are spelled out in detail, enabling independent communities to share vocabularies.

· XML (Extensible Mark-up Language). An open market, non-proprietary standard for defining, validating and storing structured data objects by expressing these objects as tagged text. XML is a subset of an earlier mark-up language, SGML.

· XML Namespaces. These allow RDF statements to reference a particular RDF vocabulary or schema. In a group of applications, materials may be ordered using the same headings and categories. However, properties may have different meanings. Potential conflicts are resolved because, through various programming mechanisms, a tag for a property name can use a short code which signals to which specific application vocabulary that tag belongs.

· XSL (Extensible Stylesheet Language). A stylesheet language for XML. XSL includes an XML vocabulary for specifying formatting. XSL specifies the styling of an XML document by using XSLT to describe how the document is transformed into another document that uses the formatting vocabulary.

· XSLT (XSL Transformations). A language for transforming XML documents into other documents, designed to be used as part of XSL.

· URI (Uniform Resource Identifier). A short string that uniquely identifies a resource on the Web.

· W3C (World Wide Web Consortium). A non-profit consortium dedicated to promoting the evolution and interoperability of the Web by developing common protocols. The consortium was founded in 1994 by Tim Berners-Lee and its current host institutions are MIT (Massachusetts Institute of Technology), INRIA (Institut National de Recherche en Informatique et en Automatique), and Keio University of Japan.

Michaël Auffret is employed at Profium and can be reached at michael.auffret@profium.com