Tuesday, August 26, 2008

Answering mail #9

Brian Reindel writes:

Hi Adriaan,

I just read "Create a maintainable extensible XML format" on IBM's Web site, and I had a quick question. How come the move from validating against a DTD to validating against a schema? I am using PHP and SimpleXML, which only provides DTD validation tools, while the new PHP DOM provides schema validation. I really want to use SimpleXML instead of DOM because it is much more concise. However, I need to do validation because my XML schema will be public and must be followed strictly. Can you tell me at what point it is absolutely necessary to move from using a DTD to using an XSD?

Thanks for any advice you can offer.

Brian Reindel

Hi Brian,

I think you can continue using DTDs for as long as you like. Document Type Definitions are mature, time-proven, and still used to describe even the newest versions of XHTML. Many people use them to describe new XML or SGML formats, not just to maintain old ones.

XML Schemas are popular because they are supported by many tools and recommended by the World Wide Web Consortium. That does not stop a large group of developers from preferring DTDs or an alternative like RELAX NG, which shares some characteristics with DTDs. Personally, I think RELAX NG (Compact Notation) is a bit easier to read than a DTD, but that may be different if you are used to reading and writing DTDs.

The main advantage of a DTD is that you can use inline notation within an XML document to describe its element structure. However, XML Schema and RELAX NG (XML Notation) can be expressed as XML documents themselves, which makes them easier to manipulate and translate.
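
To make the "easier to manipulate" point concrete, here is a minimal Python sketch. It assumes a tiny, made-up XSD fragment (the element names are borrowed from the car example; the schema itself is hypothetical) and reads it with an ordinary XML parser, exactly as it would read any other XML document.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment, invented for illustration only.
XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://car.org/car">
  <xs:element name="brand" type="xs:string"/>
  <xs:element name="type" type="xs:string"/>
  <xs:element name="kind" type="xs:string"/>
</xs:schema>
"""

def declared_elements(xsd_text):
    """Return (name, type) pairs for the top-level element declarations."""
    root = ET.fromstring(xsd_text)
    # The schema is plain XML, so a normal child lookup is all we need.
    return [(el.get("name"), el.get("type"))
            for el in root.findall(f"{{{XS}}}element")]

print(declared_elements(XSD))
```

The same trick does not work on a DTD, because DTD declarations are not XML elements; you would need a dedicated DTD parser instead.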

The main advantage of XML Schema is the possibility to put detailed restrictions on the content within an element. Using a DTD or RELAX NG, you could put these restrictions in a separate Schematron schema, with the added benefit that detailed content validation and element grammar specification are not tightly coupled in one file.
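
The decoupling idea can be sketched in a few lines of Python. To be clear, this is not Schematron; it is only an illustration, under my own assumptions, of keeping content rules in a separate list while the grammar (DTD, RELAX NG) lives elsewhere. The instance document, the rules, and their messages are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"c": "http://car.org/car", "tr": "http://car.org/tire"}
COUNT = "{http://car.org/tire}count"  # the namespaced tr:count attribute

# Hypothetical instance document, loosely based on the car example.
DOC = """\
<car xmlns="http://car.org/car" xmlns:tr="http://car.org/tire">
  <brand>Volvo</brand>
  <type>V70</type>
  <kind>estate</kind>
  <tr:tire tr:count="five">
    <tr:brand>Michelin</tr:brand>
    <tr:type>winter</tr:type>
  </tr:tire>
</car>
"""

# Content rules kept apart from the grammar: (path, test, message).
RULES = [
    ("tr:tire", lambda el: el.get(COUNT, "").isdigit(),
     "tr:count must consist of digits only"),
    ("c:brand", lambda el: bool((el.text or "").strip()),
     "brand must not be empty"),
]

def check(doc_text, rules=RULES):
    """Apply every rule to the matching children of the root element."""
    root = ET.fromstring(doc_text)
    failures = []
    for path, test, message in rules:
        for el in root.findall(path, NS):
            if not test(el):
                failures.append(message)
    return failures

print(check(DOC))
```

The document above would pass a grammar check (every required element is present, in order), yet the content rule still catches the non-numeric tire count.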

If you consider my recent example translation to RELAX NG, you could also imagine doing the same translation to the DTD format below... (with my apologies for any errors: I do not know too much about DTDs)

The bottom line is: there is a lot of choice and it is fully up to you to decide. In this weblog, I'll try to cover more on RELAX NG and Schematron. For my own daily usage, I'd say "XML Schema, unless..." Your example is a legitimate unless, if you ask me! To stress this, here is a quote from Elliotte Rusty Harold's article "The Future of XML":

DOM isn't a least-common-denominator API: it's a worst-common-denominator API. You couldn't design a worse API for processing XML if you tried.


Kind regards,

Adriaan.

<?xml encoding="UTF-8"?>

<!ELEMENT car (brand,type,kind,(tr:tire|wnd:windscreen)+)>
<!ATTLIST car
  xmlns CDATA #FIXED 'http://car.org/car'>

<!ELEMENT brand (#PCDATA)>
<!ATTLIST brand
  xmlns CDATA #FIXED 'http://car.org/car'>

<!ELEMENT type (#PCDATA)>
<!ATTLIST type
  xmlns CDATA #FIXED 'http://car.org/car'>

<!ELEMENT kind (#PCDATA)>
<!ATTLIST kind
  xmlns CDATA #FIXED 'http://car.org/car'>

<!ELEMENT tr:tire (tr:brand,tr:type)>
<!ATTLIST tr:tire
  xmlns:tr CDATA #FIXED 'http://car.org/tire'
  tr:count CDATA #REQUIRED>

<!ELEMENT wnd:windscreen (wnd:brand)>
<!ATTLIST wnd:windscreen
  xmlns:wnd CDATA #FIXED 'http://car.org/windscreen'
  wnd:count CDATA #REQUIRED>

<!ELEMENT tr:brand (#PCDATA)>
<!ATTLIST tr:brand
  xmlns:tr CDATA #FIXED 'http://car.org/tire'>

<!ELEMENT tr:type (#PCDATA)>
<!ATTLIST tr:type
  xmlns:tr CDATA #FIXED 'http://car.org/tire'>

<!ELEMENT wnd:brand (#PCDATA)>
<!ATTLIST wnd:brand
  xmlns:wnd CDATA #FIXED 'http://car.org/windscreen'>

2 comments:

Brian Reindel said...

Hi Adriaan,

Thanks so much for taking the time to answer my question so quickly, and for sharing it with other developers. I think this is extremely useful information, and the tutorial on IBM's Web site was helpful as well.

You touched on one point briefly that I think looks to be a shortcoming of DTDs, and that is that I can only specify the type of data allowed in an ELEMENT declaration (whereas in an ATTLIST I can use an enumeration to specify values). I want to be able to specify an enumeration for elements as well, but I believe only an XSD can accomplish this.

Adriaan de Jonge said...

Hi Brian,

I think you are right about that one. An interesting question to keep in mind is: "how often does your enumeration content change?". If the answer is "never", the schema is probably the right location to store the enumeration.

As the enumeration needs more frequent maintenance, a different construction becomes more attractive:

You can include a small list of options within your XML document itself, each with a clearly defined primary key. At the position where you need to refer to one of the options, you provide a reference to the corresponding element in that list.

This way, the things that change often are treated as content within the document and the schema can be considered constant.
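
A minimal Python sketch of this construction, assuming a made-up document (the element names "garage" and "kinds" and the attributes "id" and "kind-ref" are hypothetical). The options live in the document as content, and a small check reports references that do not point at a defined option.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented.
DOC = """\
<garage>
  <kinds>
    <kind id="k1">estate</kind>
    <kind id="k2">convertible</kind>
  </kinds>
  <car kind-ref="k1"><brand>Volvo</brand></car>
  <car kind-ref="k9"><brand>Saab</brand></car>
</garage>
"""

def dangling_references(doc_text):
    """Return every kind-ref value that has no matching kind id."""
    root = ET.fromstring(doc_text)
    # Primary keys of the option list stored in the document itself.
    known = {kind.get("id") for kind in root.findall("kinds/kind")}
    return [car.get("kind-ref") for car in root.findall("car")
            if car.get("kind-ref") not in known]

print(dangling_references(DOC))
```

In a DTD you could get a similar check from a validating parser by declaring the key attribute as type ID and the reference as type IDREF; the sketch above just does it by hand.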

I hope this is not too brief to understand. Maybe this is also something worth spending more time on in a longer post.

Thanks for your feedback!

Kind regards,

Adriaan.