The XML document has xmlns attributes which use a URI as the identifier. These URIs identify the name space; they are not URLs, so they cannot be dereferenced to fetch anything. In fact, URIs are case sensitive even in the "host" part, so URIs differing only in case name different name spaces, which is not how URL host names behave. In order for validators to find the schema, the xsi:schemaLocation attribute is used to map the name space URIs to the URLs of the schema documents.
The xsi:schemaLocation mapping is very simple: a white space delimited list of URI/URL pairs. Yet none of the command line tools that I used takes advantage of this attribute to make schema validation simple. This includes xmllint, which uses the libxml2 library. I also tried the Java XML library Xerces, but was unable to get it to work: Xerces does not provide a simple command line utility, and I couldn't figure out the correct java command line to invoke the validator class.
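For example, a document using two name spaces might carry a mapping like this sketch (the name space URIs, schema URLs, and element names here are made-up placeholders, not real schemas):

```xml
<!-- Each pair in xsi:schemaLocation is a name space URI
     followed by the URL of the schema for that name space. -->
<order xmlns="urn:example:orders"
       xmlns:cust="urn:example:customers"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="urn:example:orders    http://example.com/orders.xsd
                           urn:example:customers http://example.com/customers.xsd">
  ...
</order>
```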
My coworker, Patrick, found the blog entry, Nokogiri XML schema validation with multiple schema files, which talks about using xs:import so that a single schema file can support multiple name spaces. With this, we realized that we could finally get our XML document validated.
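The idea, roughly, is a wrapper schema that does nothing but import the schema for each name space; something like the following sketch, where the name space URIs and schema URLs are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A wrapper schema: one xs:import per name space/URL pair. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/nospace">
  <xs:import namespace="urn:example:orders"
             schemaLocation="http://example.com/orders.xsd"/>
  <xs:import namespace="urn:example:customers"
             schemaLocation="http://example.com/customers.xsd"/>
</xs:schema>
```

A validator given this wrapper resolves each imported schema and can then check elements from all of the listed name spaces.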
As I know shell scripting well, I decided to write a script to automate creating a unified schema and validating a document. The tools don't cache the schema documents, so the schemas must be fetched each time you want to validate an XML document. We did attempt to write the schema files to disk and reuse those, but some schemas reference other resources internally; when a schema is not retrieved from the web, those internal resources are not retrieved either, causing errors when validating some XML documents.
With a little bit of help from xsltproc to extract xsi:schemaLocation, it wasn't too hard to generate the schema document and provide it to xmllint.
The code (xmlval.sh):
 1: #!/bin/sh -
 2:
 3: cat <<EOF |
 4: <?xml version="1.0"?>
 5: <xsl:stylesheet version="1.0"
 6: 	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 7: 	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 8: >
 9:
10: <xsl:output method="text"/>
11: <xsl:template match="/">
12: <xsl:value-of select="/*/@xsi:schemaLocation"/>
13: </xsl:template>
14:
15: </xsl:stylesheet>
16: EOF
17: xsltproc - "$1" |
18: sed -e 's/  */\
19: /g' |
20: sed -e '/^$/d' |
21: (echo '<?xml version="1.0" encoding="UTF-8"?>'
22: echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nospace="nospace" targetNamespace="http://www.example.com/nospace">'
23: while :; do
24: 	if ! read a; then
25: 		break
26: 	fi
27: 	if ! read b; then
28: 		break
29: 	fi
30: 	echo '<xs:import namespace="'"$a"'" schemaLocation="'"$b"'"/>'
31: done
32: echo '</xs:schema>') |
33: xmllint --noout --schema - "$1"
Though the script looks complicated, it is a straightforward pipeline:
- Lines 3-16 provide the XSLT document to xsltproc on line 17, which extracts the xsi:schemaLocation attribute.
- Lines 18-20 replace runs of spaces with new lines and delete any blank lines. This should probably also handle tabs, but none of the documents that I have encountered contain tabs. After this, the odd lines contain the URI of the name space, and the even lines contain the URL of the schema.
- Lines 21 and 22 are the header for the new schema document.
- Lines 23-31 read these line pairs and create the necessary xs:import elements.
- Line 32 provides the closing element for the schema document.
- Line 33 gives the schema document to xmllint for validation.
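The split-and-pair steps can be exercised on their own. The sketch below feeds a made-up xsi:schemaLocation value through a simplified version of the logic from lines 18-31, using tr instead of sed (which also handles the tabs mentioned above); the urn:example URIs and example.com URLs are placeholders:

```shell
# A sample xsi:schemaLocation value: URI/URL pairs separated by whitespace.
loc='urn:example:a http://example.com/a.xsd urn:example:b http://example.com/b.xsd'

# Split on spaces/tabs, drop blank lines, then read tokens two at a time:
# each pair becomes one xs:import element.
imports=$(printf '%s\n' "$loc" |
    tr -s ' \t' '\n' |
    sed -e '/^$/d' |
    while read ns && read url; do
        printf '<xs:import namespace="%s" schemaLocation="%s"/>\n' "$ns" "$url"
    done)

printf '%s\n' "$imports"
```

The full script is invoked with the document to validate as its argument, e.g. `sh xmlval.sh document.xml`.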