Friday, May 26, 2006
So you think you know XML
Over the past couple of days, a co-worker and I have engaged in a lot of metaphorical hair-pulling over an attempt to load XML data into a Tamino database. Apparently the server was interpreting the data as ISO 8859-1 encoding, even though it was UTF-8. There was no "encoding" attribute in the XML declaration; I understood this to mean a default of UTF-8. But the database's default encoding was 8859-1, and XML in a Nutshell (Second Edition) says:
All you have to do is tell the parser which character encoding the document uses. Preferably this is done through metainformation, stored in the filesystem or provided by the server. However, not all systems provide character-set metadata so XML allows documents to specify their own character set with an encoding declaration inside the XML declaration...
The encoding attribute is optional in an XML declaration. If it is omitted and no metadata is available, then the Unicode character set is assumed. The parser may use the first several bytes of the file to try to guess which encoding of Unicode is in use. If metadata is available and it conflicts with the encoding declaration, then the encoding specified by the metadata wins. [Emphasis added]
For confirmation, I checked the XML 1.1 specification. As far as I can tell, it doesn't have any clear statement about how the encoding attribute in the XML declaration relates to external metadata. So Tamino may be acting correctly, or at least with a good excuse. But it's hard to tell.
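To make the precedence rule in the quoted passage concrete, here is a minimal sketch in Python of how a loader might pick an encoding: external metadata first, then a byte-order mark, then the encoding declaration, and finally a UTF-8 default. The function name sniff_xml_encoding and its external parameter are made up for illustration. This follows the book's description as I read it; it is not how Tamino or any particular parser is known to behave.

```python
import re

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Encoding declaration inside the XML declaration (ASCII-compatible input only).
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data, external=None):
    """Guess the encoding of an XML document held in `data` (bytes).

    `external` stands in for metadata supplied by a server or filesystem
    (a hypothetical parameter); per the quoted rule, it wins when present.
    """
    if external:
        return external.lower()
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No metadata, no BOM, no declaration: fall back to UTF-8.
    return "utf-8"
```

Called with no external value on a document that has no encoding declaration, this returns "utf-8", which is what I had expected. Called with external="ISO-8859-1", it returns "iso-8859-1" regardless of the file's contents, which matches what we saw from the server.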
Wednesday, May 24, 2006
Microsoft Office open XML standard
pcmag.com reports that Microsoft has submitted the first draft of the Office Open XML file format specification. (The link to the draft, supplied with the article, is broken at the moment.)
It has doubled in size since its submission, and now contains more than 2,000 pages of additional documentation, including more than 160 pages devoted to documenting 356 different spreadsheet formulas alone.
"And this is just the first draft," a Microsoft spokesperson said, adding that the specification represents work from more than five months of technical committee meetings.
2,000 pages? As an open standard, this looks likely to collapse under its own weight. I hate to imagine how long the review process will take.
Friday, May 19, 2006
PDF: Much ado about no security
There has been some concern lately over a so-called security failure in PDF. At least for a while, certain permissions set in PDF files, such as restrictions against printing, could be circumvented simply by opening the file through GMail.
But the truth is that these restrictions aren't, and never were, secure. They are "bozo bit restrictions"; certain bits are set to request that the software comply with the restriction, and that's all. Software that chooses to ignore them, or ignores them because it's buggy, can bypass them. It's also easy to write software that modifies these bits in a PDF document, so that Adobe Reader and other compliant readers will no longer refuse to perform the relevant operations. I don't personally know of such software, but I'm sure it exists. Calling these modifications "cracking" the documents gives them too much credit. The security of these features is like a sign on an unlocked door that says "please do not use this door."
There are actual security features in PDF, such as file encryption, which can't be easily broken. But setting these bits and imagining that they will stop any but the most casual users is wishful thinking.
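To show how little is involved in these restrictions, here is a small sketch in Python that decodes the permission flags. It assumes the bit positions given in the PDF Reference for the integer permissions value (bit 3 for printing, bit 4 for modifying contents, bit 5 for copying, bit 6 for annotations, counting from 1 at the low-order end). The function names are made up for illustration, and the code only interprets an integer; it does not read or alter a PDF file.

```python
# Permission flags, keyed by 1-based bit position (assumed from the PDF Reference).
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
}


def allowed_operations(p):
    """Return the operations that the permissions integer `p` allows.

    A set bit means the operation is permitted; a clear bit is the
    "please do not" request that compliant software honors.
    """
    return [
        name
        for bit, name in sorted(PERMISSION_BITS.items())
        if p & (1 << (bit - 1))
    ]


def may_print(p):
    """The entire check a compliant reader makes before printing."""
    return bool(p & (1 << 2))
```

A reader that never calls something like may_print, whether by design or by accident, prints the document anyway. That is the whole of the "security."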
Wednesday, May 17, 2006
Open Document is ISO standard
One of the backing organizations is OASIS (Organization for the Advancement of Structured Information Standards), which has nothing to do with Harvard's OASIS, mentioned in a previous post.
Tuesday, May 16, 2006
EAD 2002
EAD (Encoded Archival Description) is an XML format for describing archives of information, either online or in hard copy. Harvard's OASIS (Online Archival Search Information System) is moving from the original EAD to EAD 2002; reports from the process indicate that the differences between the versions aren't as well documented as they could be. Unfortunately, EAD isn't described by a schema, only by a DTD. There is a schema on Harvard's website, but it's unofficial and has application-specific attributes.
EAD is a clear example of a format designed by committee, and people who have worked with it suspect that some of its features have literally never been used by anybody. Still, it's a thorough and widely used format for storing metadata about archives.
