Friday, June 29, 2007

 

JHOVE performance note

Recently I noticed, in response to a user complaint about slow performance, that you can get huge performance improvements in JHOVE by disabling the HTML module in your configuration file. Because HTML is defined in such a way that a text file might be a valid HTML file if it just has a TITLE tag somewhere in it, it's necessary to parse the whole file before dismissing it, and this is painfully slow for multi-megabyte files.

If HTML files are in the mix of files you're analyzing, disabling the module may not be an option. In that case, you can still get a speed improvement by moving the HTML module lower in the config file (but still before ASCII and UTF-8), so that well-formed files of other formats are identified by their own modules first and never incur the HTML module's overhead.
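As a rough illustration of the reordering, here is a small Python sketch that moves the HTML module entry to just ahead of the ASCII and UTF-8 entries. The sample config is a simplified stand-in I made up for the example: a real JHOVE config file has an XML namespace and other elements that are omitted here, and the helper names (`module_name`, `move_html_before_text_modules`) are hypothetical. In practice, editing the file by hand is just as easy.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a JHOVE config file (an assumption
# for illustration; a real config carries a namespace and more settings).
SAMPLE_CONFIG = """<jhoveConfig>
  <module><class>edu.harvard.hul.ois.jhove.module.HtmlModule</class></module>
  <module><class>edu.harvard.hul.ois.jhove.module.PdfModule</class></module>
  <module><class>edu.harvard.hul.ois.jhove.module.TiffModule</class></module>
  <module><class>edu.harvard.hul.ois.jhove.module.AsciiModule</class></module>
  <module><class>edu.harvard.hul.ois.jhove.module.Utf8Module</class></module>
</jhoveConfig>"""

# The HTML module has to stay ahead of these catch-all text modules.
TEXT_FALLBACKS = ("AsciiModule", "Utf8Module")


def module_name(module):
    """Return the short class name of a <module> entry, e.g. 'HtmlModule'."""
    return module.findtext("class", default="").strip().rsplit(".", 1)[-1]


def move_html_before_text_modules(root):
    """Move the HTML module entry to just before the first text fallback."""
    modules = root.findall("module")
    html = next((m for m in modules if module_name(m) == "HtmlModule"), None)
    if html is None:
        return  # nothing to move
    root.remove(html)
    for index, child in enumerate(list(root)):
        if child.tag == "module" and module_name(child) in TEXT_FALLBACKS:
            root.insert(index, html)
            return
    root.append(html)  # no text fallbacks configured: put it last


root = ET.fromstring(SAMPLE_CONFIG)
move_html_before_text_modules(root)
order = [module_name(m) for m in root.findall("module")]
print(order)
```

With this ordering, the PDF and TIFF modules get their turn before the HTML module does, while HTML still comes ahead of the plain-text modules.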



Tuesday, June 26, 2007

 

Canonical XML 1.1

Canonical XML 1.1 is now a W3C Candidate Recommendation.

Canonical XML is a standardized physical representation for an XML document such that two XML documents which are logically equivalent will (usually) reduce to the same canonical document. More precisely: "Although two XML documents are equivalent (aside from limitations given in this section) if their canonical forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their canonical forms are identical."

XML equivalence is a complicated issue, and the recommendation's choices may surprise some people. For instance, documents which use different prefixes for the same namespace aren't canonically equivalent.
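The prefix point can be seen in a quick Python sketch. One caveat: Python's standard library (3.8 and later) implements the later C14N 2.0 rather than Canonical XML 1.1, so this is a stand-in, not the 1.1 algorithm itself. In its default configuration it also leaves namespace prefixes alone, which is enough to show the behavior. The namespace URI and element names are made up for the example.

```python
from xml.etree.ElementTree import canonicalize

# Same namespace, different prefixes: logically the same to most readers.
doc_a = '<a:root xmlns:a="http://example.com/ns"><a:item/></a:root>'
doc_b = '<b:root xmlns:b="http://example.com/ns"><b:item/></b:root>'

# Differences that canonicalization does smooth over:
# attribute order, quote style, and empty-element syntax.
doc_c = '<doc b="2" a="1"/>'
doc_d = "<doc a='1' b='2'></doc>"

prefixes_match = canonicalize(doc_a) == canonicalize(doc_b)
attributes_match = canonicalize(doc_c) == canonicalize(doc_d)

print("different prefixes, same canonical form:", prefixes_match)
print("different attribute order, same canonical form:", attributes_match)
```

The first comparison comes out false and the second true: prefixes are preserved in the canonical form, while purely lexical differences are normalized away.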

Just to be a little confusing, Canonical XML 1.1 applies to XML 1.0, not to XML 1.1.


