Sunday, September 14, 2008

Scala: a misguided approach to XML

In 2001, when Scala was conceived, it might have seemed like a good idea to integrate XML capabilities into a new programming language. XML was an emerging technology, Java libraries for XML processing were still young, and it seemed like things could be simplified in comparison to DOM and SAX.

But even back then, some XML technologies were already quite powerful and well capable of handling complex problems. The best examples are XPath 1.0 (16 November 1999) and XSLT 1.0 (also 16 November 1999). Their proper usage just wasn't well known yet. Early versions of Cocoon already showed the first ideas of using their power by combining them into XML pipelines.

The fact is: it is not 2001 anymore. Technologies are even more mature. XPath and XSLT are at version 2.0 and XQuery 1.0 is officially a W3C recommendation (all 23 January 2007). While the 1.0 versions were already powerful, the 2.0 versions and XQuery are a lot more mature and backed up by proper research and documentation in a larger set of Recommendation documents.

At the same time, in 2008, new dynamic languages are gaining popularity as Java is in an existential crisis heading towards version 7.0. And Scala is a language that is becoming more and more popular. And it integrates XML capabilities.

"What could be wrong with adding XML capabilities?" - I hear you ask. This is an XML related weblog after all. What could be wrong with adding XML support?

If you're asking this question, then obviously you haven't seen Scala yet. Or you did, but you're not familiar with proper XML technologies (yet).

A few years ago, it was common to externalize configuration parameters of Java programs in XML descriptors. In more recent years, under the pressure of dynamic languages, it became more common to integrate configuration into the code using annotations and hard-coded values.

And now, using Scala, you can internalize the XML itself, so it becomes easier to output the hard-coded values like this: (Examples taken from scala.xml)

/* examples/phonebook/embeddedBook.scala */
package phonebook

object embeddedBook {

  val company = <a href="http://acme.org">ACME</a>
  val first = "Burak"
  val last = "Emir"
  val location = "work"

  val embBook =
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        {company} corporation.
      </descr>
      <entry>
        <name>{ first+" "+last }</name>
        <phone where={ location }>+41 21 693 68 {val x = 60 + 7; x}</phone>
      </entry>
    </phonebook>;

  def main(args: Array[String]) =
    Console.println( embBook )

}

But that is not the worst example in this Scala XML manual. Nooooooo!

My real disappointment in this manual started when I read this sentence:

For new developments, it is more straightforward to use the more convenient Scala API rather than the cumbersome XSLT syntax, or (if it really must be XSLT), some Java library.

Personally, I do not think of XSLT as cumbersome at all, but I am always open to new ideas and to having my own ideas proven wrong. So let's take a look at the example for XML transformations that the Scala XML manual came up with: (Examples taken from scala.xml)

object transform {
  import scala.xml._ ;
  import scala.xml.dtd._ ;
  import org.xml.sax.InputSource ;

  /* a former version of Scala used regular expression patterns, like
   * in the following code. In the future, we plan to reactivate some
   * well-behaved regular expressions again
  // gimmick: text replacement "scalac" => <code>scalac</code>
  def cook(s: String): Seq[Node] = cook1(s) ;
  def cook1(s: Seq[Char]):List[Node] = s match {
    case Seq( a @ _*, 's','c','a','l','a','c', b @ _* ) =>
      Text(cook2( a )) :: <code>scalac</code> :: cook1( b )
    case _ => List( Text( cook2( s )))
  }
  def cook2(s: Seq[Char]): String = {
    val r = new StringBuffer();
    s.foreach { c:char => val _ = r.append(c); };
    r.toString()
  }
  */

  def transform1(ns: Iterable[Node]): Seq[Node] = {
    val zs = new NodeBuffer();
    for(val z <- ns) { zs &+ transform( z ) }
    zs
  }

  /** this function holds "templates" that transform nodes of an input tree
   * into an iterable representation of a sequence of nodes of the output
   * tree.
   *
   * It is ok to return a single node, since each node is at the same
   * time a singleton sequence. Likewise, the pattern variable x will be
   * of type Seq[Node], although here it is always binding a single node.
   */
  def transform(n: Node):Iterable[Node] = n match {
    case x @ <article>{ ns @ _ * }</article> =>
      <source active="ant" title={ (x \ "title" \ "_").toString() }>
        <header>
          <author>Burak Emir</author>
          <keywords>Scala4Ant</keywords>
          <style type="text/css"></style>
        </header>
        <content>
          <title><scala/> Ant Task</title>
          { transform1( x \ "_" ) }
        </content>
      </source>
    case x @ <sect1>{ _* }</sect1> =>
      <section>{ transform1( x \ "_" ) }</section>
    case x @ <title>{ _* }</title> =>
      <h>{ x \ "_" }</h>
    case x @ <para>{ _* }</para> =>
      <p>{ transform1( x \ "_" ) }</p>
    case x @ <itemizedlist>{ _* }</itemizedlist> =>
      <ul>{ transform1( x \ "_" ) }</ul>
    case x @ <listitem>{ _* }</listitem> =>
      <li>{ transform1( x \ "_" ) }</li>
    case x @ <constant>{ _* }</constant> =>
      // an xml:group is a sequence that appears to the scala type system
      // as a single node. Here it is used to append a text node with a space
      <xml:group><code>{ x \ "_" }</code> </xml:group>
    case x @ <programlisting>{ _* }</programlisting> =>
      <pre>{ x \ "_" }</pre>
    case Elem(namespace, label, attrs, scp, ns @ _*) =>
      Elem(namespace, label, attrs, scp, transform1( ns ):_* )
    case z =>
      z
  };

  def main(args:Array[String]) = {
    if( args.length == 1 ) { // must have one arg
      object ConsoleWriter extends java.io.Writer {
        def close() = {}
        def flush() = Console.flush
        def write(cbuf:Array[char], off:int , len:int ): unit = {
          var i = 0
          while(i < len) {
            Console.print(cbuf(off + i))
            i = i + 1 // advance the index so the loop terminates
          }
        }
      }

      val src = XML.load(new InputSource( args( 0 ))); //use Java parser

      // transform returns an iterable, but we know it is a singleton
      // sequence, so we get its first element
      val n = transform( src ).elements.next
      val doctpe = DocType("html",PublicID("-//W3C//DTD XHTML 1.1//EN","../default.dtd"), Nil)

      /** write document to console, with encoding latin1, xml declaration
       * and doctype
       */
      XML.write(ConsoleWriter, n, "iso-8859-1", true, doctpe)

    }
    else error("need one arg");
  }
}

Talk about cumbersome! The real problem with this code might not even be the notation; that is just a matter of getting used to Scala syntax. The problem is its lack of maintainability. This kind of code assumes that XML documents always follow a fixed pattern, and that assumption is confirmed by an earlier section of the manual.
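For comparison, here is a rough sketch of how a few of the same rules (sect1, para, itemizedlist, listitem, plus a copy-everything-else fallback) might look as XSLT 1.0 templates. It is driven from Scala through the JDK's javax.xml.transform API, which bundles an XSLT 1.0 processor; XSLT 2.0 would need a separate processor. The object name DocbookXslt, the helper transform and the sample input are my own assumptions for illustration. None of this comes from the manual, and it is not an exact equivalent of the manual's code.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

// Hypothetical illustration: names and sample data are assumptions, not taken from scala.xml.
object DocbookXslt {

  // A handful of the manual's rewrite rules, expressed as XSLT 1.0 templates.
  val stylesheet: String =
    """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      |  <xsl:output method="xml" omit-xml-declaration="yes"/>
      |  <!-- identity rule: copy anything that no other template claims -->
      |  <xsl:template match="@*|node()">
      |    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      |  </xsl:template>
      |  <xsl:template match="sect1"><section><xsl:apply-templates/></section></xsl:template>
      |  <xsl:template match="para"><p><xsl:apply-templates/></p></xsl:template>
      |  <xsl:template match="itemizedlist"><ul><xsl:apply-templates/></ul></xsl:template>
      |  <xsl:template match="listitem"><li><xsl:apply-templates/></li></xsl:template>
      |</xsl:stylesheet>""".stripMargin

  // Apply the stylesheet to an XML string and return the serialized result.
  def transform(xml: String): String = {
    val transformer = TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new StringReader(stylesheet)))
    val out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
    out.toString
  }

  def main(args: Array[String]): Unit =
    println(transform("<sect1><para>Use <b>ant</b>.</para></sect1>"))
}
```

Each template names only the element it cares about. Elements it has never heard of, such as the b in the sample input, fall through to the identity rule, so the stylesheet makes no assumption about a fixed document shape.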

The manual (surprisingly adequately) makes a separation between XML that can be considered text with markup and XML that models data structures in a hierarchical tree format. But (disappointingly quickly) after that, the text starts displaying the first signs of a lack of knowledge of XML technologies:
[...]
There are several points of view that can be taken:

1. XML is regarded as text. We ignore the tree structure completely. Some text/regular expression search is used to retrieve or manipulate information. This can get you quite far for small tasks. Go away, use perl :-)
[...]

Whatever happened to XPath?

The Scala designers must have suffered from the Not Invented Here Syndrome and defined their own Scala equivalent of XPath, like this: (Examples taken from scala.xml)

package bib;

object bib {

  import scala.xml.{Node,NodeSeq};
  import scala.xml.PrettyPrinter;

  val biblio =
    <bib>
      <book>
        <author>Peter Buneman</author>
        <author>Dan Suciu</author>
        <title>Data on ze web</title>
      </book>
      <book>
        <author>John Mitchell</author>
        <title>Foundations of Programming Languages</title>
      </book>
    </bib> ;

  val pp = new PrettyPrinter(80, 5);

  def main(args:Array[String]):Unit = {
    Console.println( pp.formatNodes( biblio \ "book" \ "title" ));

    // prints
    // <title>Data on ze web</title><title>Foundations ...</title>

    Console.println( pp.formatNodes( biblio \\ "title" )); // prints the same

    Console.println( pp.formatNodes( biblio \\ "_" )); // prints node and all descendant

    Console.println( pp.formatNodes( biblio.descendant_or_self )); // prints the same

  }
}

Judging from the list of references at the end of the manual, it is more likely that this XPath surrogate is modeled after XPath 1.0 than XPath 2.0. And judging from the opinions of the author, it seems unlikely that Scala will adopt XML standards in a proper way any time soon.
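To make the comparison concrete, here is a minimal sketch of the same bibliography queries written as real XPath expressions, evaluated from Scala through the JDK's javax.xml.xpath API. Note that the JDK only bundles XPath 1.0; XPath 2.0 needs a separate processor. The object name BibXPath, the helper select and the predicate query at the end are my own assumptions for illustration, not something from the manual.

```scala
import java.io.StringReader
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

// Hypothetical illustration: object and helper names are assumptions.
object BibXPath {

  // Same sample data as the manual's bib example, held as plain XML text.
  val biblio: String =
    """<bib>
      |  <book>
      |    <author>Peter Buneman</author>
      |    <author>Dan Suciu</author>
      |    <title>Data on ze web</title>
      |  </book>
      |  <book>
      |    <author>John Mitchell</author>
      |    <title>Foundations of Programming Languages</title>
      |  </book>
      |</bib>""".stripMargin

  // Evaluate an XPath 1.0 expression and return the text content of each selected node.
  def select(expr: String): List[String] = {
    val xpath = XPathFactory.newInstance().newXPath()
    val nodes = xpath
      .evaluate(expr, new InputSource(new StringReader(biblio)), XPathConstants.NODESET)
      .asInstanceOf[NodeList]
    (0 until nodes.getLength).map(i => nodes.item(i).getTextContent).toList
  }

  def main(args: Array[String]): Unit = {
    println(select("/bib/book/title"))                 // counterpart of biblio \ "book" \ "title"
    println(select("//title"))                         // counterpart of biblio \\ "title"
    println(select("//book[count(author) > 1]/title")) // predicate: titles of books with several authors
  }
}
```

The first two queries select both titles in document order; the third uses a predicate to keep only the book with more than one author.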

Probably the most fundamental criticism of this approach is that Scala mixes concerns that are usually kept separate. Usually, the challenge of software development is to separate concerns that are otherwise mixed. So why would anyone work the other way around, in the wrong direction?

In my opinion, the answer to the current Java 7.0 controversy is not in languages like Scala. I think it is more likely that the answer is in proper standards like XPath 2.0, XQuery 1.0 and XSLT 2.0. Over the years since 1999, these technologies have matured into powerful programming platforms themselves. As I research these specs more and more, they turn out to be a lot more interesting than I thought before I started. And I thought I knew these technologies. So a lot more details on these technologies will follow in later posts!

2 comments:

James Iry said...

I'm of mixed emotions about Scala's integration of XML. My comment isn't about that, though.

I want to clarify one thing: Scala is no more a dynamic language than Java is. Scala is statically typed and in fact has a far richer, more self-consistent type system than Java does. Scala's only dynamic metaprogramming facilities are via Java reflection or runtime byte code generation and class loading fu.

Adriaan de Jonge said...

Thanks, I wasn't sure about the right classification when writing this. I put the last entry of the word dynamic between del-tags.

This is an interesting post by the way:

"Scala the statically typed dynamic language"

http://scala-blogs.org/2007/12/scala-statically-typed-dynamic-language.html