Tuesday, October 23, 2018

Crash Dumps: Do I submit them?

TL;DR: No, do not submit your crash dumps.  Consumers: No company has sane crash dump policies that ensure your privacy and PII are protected, minimized and secured.  Companies: You need to ensure that crash dumps are handled in a secure manner and that crash dumps are just that: a crash dump. Anything not directly related to a crash dump should be excluded. Usage statistics and the like do not belong in crash reports.

Why Not Send Dumps?

There is a long history of companies failing to minimize the data and to protect it.  Microsoft for years sent crash dumps over the internet in the clear (WER & Privacy concerns).  This allowed the NSA to harvest them, and develop 0-days for issues that MS failed to fix.  Google's Chrome would send a screencap of the entire Desktop along with its crash dumps (link).  It previously would only send the window, but now sends the entire screen.  Though they provide a preview, there is no way to see exactly what information will be sent.

I do not relish advising people to not submit crash dumps, as this will impact developers' ability to fix bugs.  But as with all aspects of security, companies continue to demonstrate that they are not willing to do the work that is necessary to protect users' data and their privacy.


You need to communicate to your users how crash dumps are handled.  Just saying "trust us" does not inspire confidence, as there are a large number of cases of data breaches where the company has said exactly that leading up to leaks.  The policy is the first step to demonstrating that you have thought about users' concerns and decided how you will handle their personal and sensitive data.

The policy also helps shape how employees will treat the data too.  By having the policy, it is a reiteration to the employees that user data isn't simply chaff, but that it needs to be protected and handled with care.

Just saying that it's protected by a privacy policy isn't enough.  For example, Google Chrome's Report an Issue says that the information is protected by their privacy policy, but if you read the Chrome browser Privacy Policy, there is nothing in there that says how the data is handled.  That it is handled like the rest of the data collected does not inspire confidence that the possibly confidential data that may be included will be handled with greater care.

How to handle dumps

The first step is to ensure that what is collected in the dump has minimum information needed to debug issues.  Code paths (back traces) are likely to be safe.  Data, such as arguments to functions, may include user data and needs to be carefully examined.  There are many different types of data that can be released from embarrassing (what website was visited), to security breach (including cookies/tokens for web sites that may not be yours), to confidential intellectual property leaking (source code, designs, etc).  Each of these may have different impact on the user, but should never happen.
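A minimal sketch of that kind of minimization, using an allow-list so that anything not explicitly named is dropped.  The report layout and the field names (signal, app_version, os, frames, args) are assumptions made up for illustration, not any real crash reporter's format.

```python
# Hypothetical crash report layout: a dict with some top-level fields and a
# list of stack frames.  Only allow-listed keys survive; function arguments
# and everything else (cookies, URLs, environment) are dropped.
SAFE_TOP_LEVEL = ("signal", "app_version", "os")
SAFE_FRAME_KEYS = ("function", "file", "line")


def minimize_crash_report(report):
    """Return a new report holding only the allow-listed fields."""
    slim = {key: report[key] for key in SAFE_TOP_LEVEL if key in report}
    slim["frames"] = [
        {key: frame[key] for key in SAFE_FRAME_KEYS if key in frame}
        for frame in report.get("frames", [])
    ]
    return slim
```

An allow-list fails closed: a field added to the report later is not sent until someone decides that it is safe.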

Second, crash dumps need to be transmitted confidentially.  This means either using TLS or encrypting the dumps with a tool like GPG before sending.  This ensures that unauthorized parties are unable to view the contents.  The NSA used the dumps to gather information for their operations; if Microsoft had properly protected their users' data, this would not have happened.
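As a rough sketch of the TLS option, using only the Python standard library: the client verifies the server certificate and host name and refuses anything older than TLS 1.2.  The host and path passed to upload_dump are placeholders for whatever collection endpoint a company runs.

```python
import http.client
import ssl


def make_upload_context():
    """TLS client context: certificate and host name checks on, TLS >= 1.2."""
    ctx = ssl.create_default_context()  # CERT_REQUIRED + check_hostname by default
    # Raise the floor to TLS 1.2 without lowering a stricter system default.
    ctx.minimum_version = max(ctx.minimum_version, ssl.TLSVersion.TLSv1_2)
    return ctx


def upload_dump(host, path, dump_bytes):
    """POST the dump over HTTPS and return the HTTP status code."""
    conn = http.client.HTTPSConnection(host, context=make_upload_context(), timeout=30)
    try:
        conn.request("POST", path, body=dump_bytes,
                     headers={"Content-Type": "application/octet-stream"})
        return conn.getresponse().status
    finally:
        conn.close()
```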

Third, they need to be stored in a secure manner and able to be expunged.  It should even be possible for the user to remove the crash dump if they discover that information was shared when it should not have been.  The lifetime that a company keeps the dumps should be limited.  If you haven't fixed a bug from five years ago, how do you know you can reproduce it, or that if you are able to reproduce it, that the code is still present in your current software?  If the crash is a major issue, it is likely that you'll have more recent dumps that exhibit the same issue, so old dumps are just not as useful compared to the data that may be present.
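A limited lifetime is easy to express in code.  The sketch below picks out the dumps that are past a retention window; the 90 day figure and the (dump_id, received_at) record shape are assumptions for illustration, and actually deleting the data (including from backups) is the hard part that this does not cover.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # assumed policy, pick what your policy states


def expired_dumps(dumps, now=None, retention=RETENTION):
    """Return the ids of dumps older than the retention window.

    dumps is an iterable of (dump_id, received_at) pairs, where received_at
    is a timezone-aware datetime.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [dump_id for dump_id, received_at in dumps
            if now - received_at > retention]
```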

As crash data needs to be deleted, almost any cloud service is immediately excluded unless other precautions are used, such as encryption.  With the cloud, you have zero visibility into how the data is managed and how or when it is backed up.  Cloud providers rarely tell you their retention policies on back ups, and other policies that may keep data around.  Do they securely remove your VM's storage when they migrate it?  Do they ensure that storage is deleted from all clones, shards, servers and backups when you delete it?  If not, how long will that data stay around before it is finally expunged?

Fourth, access to dumps needs to be controlled.  Auditing is a good first step to know who is accessing the data, but additional measures like limiting who has access need to be used.  Not everyone on the team needs access to them.  As they are classified, they can be assigned to teams or people that need access to the data in them.  This helps make sure that an employee isn't trolling for nudes or other confidential information.  It should also limit how easily data can be copied out of the archive.  How these controls are put in place will vary by company.

Edit:  Case in point:  I recently opened a support case with Apple.  Apple provides a program to collect data to send to them to help troubleshoot the issue.   The program collected 280 MB of data.  When uploading the data, Apple informs the user that it is their responsibility to NOT submit any personal information that they don't want.  There is no way most people are qualified to look at the data, let alone redact it properly.  I attempted to do so, and it took a very long time, and I'm not sure that I got everything.  Expecting a normal computer user to be able to do this is insane.

Monday, October 15, 2018

TLS Client Authentication Leaks User Info (pre-TLS1.3)

It's been long known that TLS is not the best privacy protecting protocol in that SNI leaks what domain the client connects to.  I'm a bit surprised that I haven't seen the failure to protect user information when using client authentication mentioned, but it's likely that TLS client authentication is so rarely used that this has not been on anyone's radar.

TL;DR: Just don't use TLS client authentication on anything before TLS 1.3.

With TLS 1.2 and earlier, if you use client authentication, the client certificate is transmitted in the clear.  This contains enough information to uniquely identify the user.  If it didn't, then there would be no way for the server to do the authentication.

The danger of this is that Eve (eavesdroppers) on path will be able to track your users' (or your) connections, where they connect from, figure out how much data they transfer to/from your site and likely profile their usage.

I was confident that this was the case as I know that the entire handshake is in the clear.  It isn't till the Finished messages that the session becomes encrypted.  (TLS 1.3 fixed this by using a new derived key, [sender]_handshake_traffic_secret, to encrypt all the server params, which the client will use to encrypt its response to the certificate request in the server params.)  I decided to verify that this was the case.

I generated a server and a client certificate and key:
openssl req -batch -new -newkey rsa:1024 -days 365 -nodes -x509 -keyout server.key -out server.crt
openssl req -batch -new -newkey rsa:1024 -days 365 -nodes -x509 -keyout client.key -out client.crt

I then launched the server, and included the -Verify and -CAfile options for s_server to request a client certificate:
openssl s_server -accept 5829 -cert server.crt -key server.key -Verify 5 -CAfile client.crt -debug

Then I ran tcpdump to capture the session:
sudo tcpdump -s 0 -n -i lo0 -w clientcert.tcpdump port 5829

And then the client to connect to the server:
openssl s_client -connect localhost:5829 -key client.key -cert client.crt -debug

A usual, non-client authenticated connection and close was about 17 packets, but when I included the client authentication, it became 42 packets (the answer!).

I loaded the packet capture into wireshark, applied the SSL protocol analysis and confirmed that the client certificate was present in clear text:
Wireshark shows TLS handshake with client authentication, with the client certificate displayed in plaintext.
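The same check can be roughed out in code.  The sketch below walks the TLS record layer of one direction of a pre-1.3 connection and lists the handshake message types that appear before ChangeCipherSpec, i.e. the ones sent in the clear.  It assumes that the raw TCP payload has already been reassembled into a byte string and that no handshake message is split across records, which a real capture does not guarantee.

```python
import struct

HANDSHAKE = 22           # TLS record content type: handshake
CHANGE_CIPHER_SPEC = 20  # TLS record content type: change_cipher_spec
CERTIFICATE = 11         # handshake message type: certificate


def cleartext_handshake_types(stream):
    """List handshake message types seen before ChangeCipherSpec."""
    types = []
    pos = 0
    while pos + 5 <= len(stream):
        # Record header: 1 byte type, 2 bytes version, 2 bytes length.
        content_type, _version, length = struct.unpack_from("!BHH", stream, pos)
        body = stream[pos + 5:pos + 5 + length]
        pos += 5 + length
        if content_type == CHANGE_CIPHER_SPEC:
            break  # everything after this point is encrypted
        if content_type != HANDSHAKE:
            continue
        off = 0
        while off + 4 <= len(body):
            # Handshake header: 1 byte type, 3 bytes length.
            types.append(body[off])
            off += 4 + int.from_bytes(body[off + 1:off + 4], "big")
    return types
```

If CERTIFICATE shows up in the result for the client to server direction, the client certificate crossed the wire unencrypted.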

So, there you have it.  Do not use client authentication, pre-TLS 1.3, if you care about the privacy of your users.

It is safe to use client authentication w/ a TLS 1.3 server as long as the server requires all clients be 1.3 clients.  If the key exchange algorithm is one of DHE_DSA, DHE_RSA, or an ECDH key exchange algorithm, the random bytes in the Hello messages are signed and these bytes are used by TLS 1.3 for downgrade protection.  As the signature covers these bytes, the client would be able to detect any attempts to modify the server or client handshake messages to force a downgrade before it would send the client certificate.
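For a server written in Python, requiring 1.3 clients might look like the sketch below.  It uses the standard ssl module; the file names in the usage note are the ones generated earlier in this post, and loading them is optional so the function can be tried without any files on disk.

```python
import ssl


def make_tls13_only_server_context(certfile=None, keyfile=None, client_ca=None):
    """Server context that refuses anything below TLS 1.3 and demands a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # no pre-1.3 handshakes at all
    ctx.verify_mode = ssl.CERT_REQUIRED           # client authentication is mandatory
    if certfile is not None:
        ctx.load_cert_chain(certfile, keyfile)
    if client_ca is not None:
        ctx.load_verify_locations(cafile=client_ca)
    return ctx
```

With the test certificates from above this would be called as make_tls13_only_server_context("server.crt", "server.key", "client.crt").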

Thanks to Mike Hamburg for reviewing an earlier version of this blog post and pointing out that TLS 1.3 was not vulnerable to this and helping w/ some of the research to prove it.


Tuesday, July 3, 2018

Making FreeBSD magnet links

For the last few years, I've been producing torrents and publishing magnet links, but there is some special work that I do to make these.  The first few releases, I inserted a bogus tracker into the torrent, because despite there being plenty of tools out there for producing trackerless (DHT) torrents, they were all GUI and I never found any that were command line based.  The other was there was/is no tool for extracting the info hash and building the magnet link.  There may be tools now, but I couldn't find any when I started 3 years ago.

The following steps are based upon the recent release of FreeBSD 11.2-R, adjust as necessary.

  1. Fetch FreeBSD into a directory (I create a per release directory).  There are a few directories that you have to mirror; I use wget for this.  The mirroring feature for wget isn't great.  After each command I have to remove the CHECKSUM.SHA256, CHECKSUM.SHA512 and index.html* files.
    $ wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/ISO-IMAGES/11.2/
    $ wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/aarch64/Latest/
    $ wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/amd64/Latest/
    $ wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/i386/Latest/
  2. Fetch the signature files:
    $ wget https://www.freebsd.org/releases/11.2R/CHECKSUM.SHA512-FreeBSD-11.2-RELEASE-{amd64,i386,powerpc,powerpc-powerpc64,sparc64,arm64-aarch64}.asc
    $ wget https://www.freebsd.org/releases/11.2R/CHECKSUM.SHA512-FreeBSD-11.2-RELEASE-{amd64,i386,arm64-aarch64}-vm.asc
  3. Verify the GPG key that signed the above files.  This is usually Glen Barber's key, but not always.  I have met and verified his fingerprint in person.  If you have verified someone's key who has signed Glen's key, that is another good way.
  4. Verify the checksum files:
    $ for i in *.asc; do gpg --verify $i; done
    You should see a bunch of lines like:
    Warning: using insecure memory!
    gpg: Signature made Fri Jun 22 09:33:50 2018 PDT
    gpg:                using RSA key 0x031458A5478FE293
    gpg: Good signature from "Glen Barber <gjb@FreeBSD.org>" [full]
    gpg:                 aka "Glen Barber <glen.j.barber@gmail.com>" [full]
    gpg:                 aka "Glen Barber <gjb@glenbarber.us>" [full]
    gpg:                 aka "Glen Barber <gjb@keybase.io>" [unknown]
    gpg: WARNING: not a detached signature; file 'CHECKSUM.SHA512-FreeBSD-11.2-RELEASE-amd64-vm' was NOT verified!
    The last line can be ignored.  The non-.asc files were d/l'd and will not be used.  Make sure that all of the files report Good signature.
  5. In the past I have used BitTornado for other things, so I ended up using it as the basis to make the tool for creating trackerless torrent files.  The modifications were simple.  It appears that the original BitTornado CVS tree is off-line (anyways, it was served insecurely), but it looks like effigies/BitTornado is similar enough that it could be modified and used.  I copied btmakemetafile.py to btmaketrackerless.py and applied the following patch:
    $ diff -u btmakemetafile.py btmaketrackerless.py 
    --- btmakemetafile.py   2004-05-24 12:54:52.000000000 -0700
    +++ btmaketrackerless.py        2016-10-10 17:13:32.742081000 -0700
    @@ -23,9 +23,9 @@
     def prog(amount):
         print '%.1f%% complete\r' % (amount * 100),

    -if len(argv) < 3:
    +if len(argv) < 2:
         a,b = split(argv[0])
    -    print 'Usage: ' + b + ' <trackerurl> <file> [file...] [params...]'
    +    print 'Usage: ' + b + ' <file> [file...] [params...]'
         print formatDefinitions(defaults, 80)
    @@ -33,9 +33,9 @@

    -    config, args = parseargs(argv[1:], defaults, 2, None)
    -    for file in args[1:]:
    -        make_meta_file(file, args[0], config, progress = prog)
    +    config, args = parseargs(argv[1:], defaults, 1, None)
    +    for file in args[0:]:
    +        make_meta_file(file, None, config, progress = prog)
     except ValueError, e:
         print 'error: ' + str(e)
         print 'run with no args for parameter explanations'

    If you notice, the only thing that is done is to drop the first argument, and instead of passing it into make_meta_file, a None is passed instead.  This will simply not add trackers to the torrent file.
  6. I then run the following script to verify the downloaded files, and generate the torrent files:
    $ cat cmp.sh 
    #!/bin/sh -
    # wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/ISO-IMAGES/11.2/
    # wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/aarch64/Latest/
    # wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/amd64/Latest/
    # wget -c -r -l 1 -nd --limit-rate=800k https://download.freebsd.org/ftp/releases/VM-IMAGES/11.2-RELEASE/i386/Latest/
    # wget https://www.freebsd.org/releases/11.2R/CHECKSUM.SHA512-FreeBSD-11.2-RELEASE-{amd64,i386,powerpc,powerpc-powerpc64,sparc64,arm64-aarch64}.asc
    # wget https://www.freebsd.org/releases/11.2R/CHECKSUM.SHA512-FreeBSD-11.2-RELEASE-{amd64,i386,arm64-aarch64}-vm.asc

    grep -h '^SHA512' CHECK*.asc | sed -e 's/SHA512 (\(.*\)) = \(.*\)/\2 \1/' | sort -k 2 > sha512.from.asc

    while read hash fname; do
            if [ -e "$fname" ]; then
                    sigfile=`grep -l -- "$fname" *.asc | head -n 1`
                    echo checking "$fname", sig in: "$sigfile"
                    #res=`sha512 -q "$fname"`
                    res=`shasum -a 512 "$fname" | awk '{ print $1 }'`
                    echo "File is: $res"
                    # stop at the first file that does not match its signed checksum
                    if [ x"$res" != x"$hash" ]; then
                            echo mismatch!  "$fname"
                            exit 1
                    fi
                    # only build the torrent if it does not exist yet
                    if ! [ -e "$fname".torrent ]; then
                            btmaketrackerless.py "$fname"
                    fi
            else
                    echo missing "$fname"
                    exit 1
            fi
    done < sha512.from.asc
  7. Once all the torrents have been generated, I then make the magnet links:
    $ cat btmakemagnet.sh 
    #!/bin/sh -

    # metainfo file.: FreeBSD-10.3-RELEASE-sparc64-bootonly.iso.torrent
    # info hash.....: 06091dabce1296d11d1758ffd071e7109a92934f
    # file name.....: FreeBSD-10.3-RELEASE-sparc64-bootonly.iso
    # file size.....: 203161600 (775 * 262144 + 0)
    # announce url..: udp://tracker.openbittorrent.com:80
    # btshowmetainfo 20030621 - decode BitTorrent metainfo files

    for i in *.torrent; do
            btshowmetainfo.py "$i" | awk '
    $0 ~ "^info hash" { info = $3 }
    $0 ~ "^file name" { name = $3 }
    END {
            print "magnet:?xt=urn:btih:" info "&dn=" name
    }'
    done
  8. I then create the magnet links file, and update the Torrents wiki page.
Sorry about the code formatting.  I don't know how to make it look better in blogger.

Friday, March 16, 2018

Unusable Insecurity

Many people claim that security is hard, and in many cases it is hard, but that isn't an excuse to make it harder than it needs to be. There are many layers to security, but adding extra layers, or making security controls inscrutable is a great way to ensure insecurity. Security needs to be simple and straightforward to configure, and easy to understand. There may be knobs for advanced users, but defaults need to be simple and correct.

I recently looked at using S3 as a shared store for some data. I was using the account New Context created for me that had limited AWS permissions. Creating the S3 bucket was simple enough, and making it not-public was too, but then I wanted to create a user/API key that only had access to the S3 bucket. Per Amazon IAM Best Practices, you should not share your account, but create new users for access. It turns out that I did not have the CreateUser permission. I involved a co-worker who did have permissions to create the user. Adding another person to the task makes things more complex through communication and their availability to work on it instead of their normal work.

As part of creating a user, you have to figure out the Policy that you need to assign to the user. Amazon provides some Bucket Policy Examples, but none of them is a simple policy on granting read and write permissions to the bucket. There is an Amazon Policy Generator for helping you to create the policies, but it doesn't allow you to select buckets from your account (to simplify ARN [Amazon Resource Name] selection), and there are almost 70 actions provided in the selector. After some brief reading, I settled on a simple policy that I thought would allow the new user proper access: 4 permissions: PutObjects, GetObjects, ListObjects and RestoreObjects.

My co-worker created the user and applied the policy, but then I got an error handle code. Amazon does not provide an interface for turning on logging and/or querying why a request failed. Despite the error handle, I had ZERO insight into why the request failed. I could have involved AWS support, but now that would add yet another party in attempting to properly configure S3.

At this stage, I decided to give up, as I had already spent a few hours of my time, some of my co-worker's time, and a couple weeks due to various delays due to availability and other work. In this case, storing the data in S3 was more of a nicety, and I decided that checking the data into a private git repo was adequate compared to the complexities involved in configuring S3. git was a tried and tested way to store data and restrict access while S3 for this usage was not, and hard to configure.

After I wrote this blog post, a coworker linked me to the blog post titled Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket. It is concerning that this blog post has not been integrated, nor linked to from any of the IAM or S3 documentation. This is a valuable resource that should not be hidden.
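For reference, a read/write policy of that general shape can be sketched as below. This is my own illustration and not the policy I tried above; the bucket name is a placeholder. The point it shows is that listing is granted on the bucket ARN itself, while reading and writing objects is granted on the objects under it.

```python
import json


def bucket_rw_policy(bucket):
    """Build an IAM policy document granting list/read/write on one bucket."""
    bucket_arn = "arn:aws:s3:::" + bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # reading and writing apply to the objects in the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [bucket_arn + "/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a hypothetical bucket name.
    print(json.dumps(bucket_rw_policy("example-bucket"), indent=2))
```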

I'm clearly not the only one that has had issues configuring S3 buckets. The end of 2017 has shown a large number of organizations fail to properly secure their S3 buckets, leaving many terabytes of data open for public download. It is unacceptable that such a service is so difficult to configure. The site https://s3stupidity.com/ lists the large number of breaches, many of which are by large companies who should have the technical chops (and $$) to properly configure it.

Security controls need to be simple and clear. Their descriptions need to be accurate and concise in what they do, and how they do it. Amazon does have a number of good resources, but they do not have a comprehensive guide for what each permission does. You cannot blame users for security failures when you make it next to impossible to configure properly.

Edited to remove a couple extra words.

Sunday, September 17, 2017

Adventures in Autobahn/WAMP Security

Or how security continues to suck because: It's Hard and Someone Else's Problem™

For a personal project, I've decided to use WAMP to move some events and messages around between different components.  I decided on the AutoBahn libraries and Crossbar.io as the router.  I was already somewhat familiar w/ AutoBahn from previous work, and the Crossbar.io router seems to just work.  As a security person, I decided to evaluate how to make things as secure as possible.

First off, my projects must be both authenticated and encrypted.  WAMP does not appear to have its own encryption layer, but it does have its own authentication layer.  You really don't want to have to trust two different authentication layers [1], so being able to use TLS Channel Bindings would be an improvement.  With channel bindings, a strong authentication method in WAMP would also ensure that the channel is properly encrypted.  I received confirmation from the Crossbar.io team that it was present.

Autobahn and Crossbar.io support a number of different authentication schemes.  As I plan on putting this behind a reverse proxy (which I realize will have its own issues w/ channel binding), I wanted the strongest security binding between my client and the server (and I'm a glutton for punishment for using unproven tech).  The only one that satisfies this requirement is WAMP-Cryptosign.

After I got basic functionality working to make sure things would be workable w/ this framework, I decided to start working on the authentication piece.  First problem I ran into was that the AutoBahn|JS library does not support TLS channel binding.  There is a good reason the library doesn't support it, and it's for a very bad reason.  There is no support in the browser WebSocket API to query the channel binding information necessary.  The fact that WebSockets was standardized after Channel bindings were demonstrates that the people involved in standardizing the web do not take security seriously.  As usual, they assume that security is not their problem and leave it up to someone else to solve (or at another layer).

Disappointed that I wouldn't be able to use channel bindings w/ the web client for this project (I still had the crappy CA authentication of TLS, so not all was lost), I moved forward w/ CryptoSign.  As has been demonstrated many times, the only way to get security baked in, is to make it as easy as possible to use.  I've been long familiar w/ Crypto Box by djb (and used by the Autobahn libraries), and also the noise protocol (which my friend Trevor created).  Both of these have goals of making it simple to let developers include security in their projects and not mess it up, resulting in a broken system.  As currently implemented, Autobahn's CryptoSign is most definitely not easy to use.  

Though the documentation is decent, some examples are not present (client_ssh_key.py for example from WAMP-cryptosign Static Authentication).  The ApplicationRunner helper class does not document how to make use of authentication.  Though the static authentication page has examples, they make you write quite a bit of boiler plate.

Then even once you do that, you find out that the code doesn't even work on Python 2.7 and have to fix it for them.  Hopefully the pull request (PR) will not be ignored because of the failing CI tests, because the current CI failures are problems with their CI environment, and not the PR.  For CI checks like this, it should only ding your PR on checks that are newly failing, and ignore any checks that were previously failing.  This isn't the first project whose CI environment was broken.

Even w/ the fixes in place, there is no documented method of extracting a public key from a generated ssh key.  I will be adding a method to print this out.

If I (who knows cryptography decently) have to fix and spend hours making this work, it's no wonder that everyone thinks that strong cryptography is hard.  It is hard, but it shouldn't be.

[1] The encryption layer must be authenticated, otherwise any attacker could MiTM the connection.  Most uses of TLS make use of the CA system for authentication (which has serious issues in trust), and most web apps add their own authentication layer on top of it (not using Basic Auth, or other scheme).  The issue w/ this is that if there is no binding between the two layers, the upper layer (application layer) cannot be sure that the lower layer has not been compromised.
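To make the binding idea concrete, here is a generic sketch, not the WAMP-cryptosign wire format (which uses public key signatures).  It uses an HMAC over a shared secret purely for brevity.  The client's proof covers both the server's challenge and the client's own view of the TLS channel, so a proof relayed over a different (MiTM'd) channel fails to verify.  The channel_binding bytes are assumed to come from the TLS library, for example Python's ssl.SSLSocket.get_channel_binding(), and the challenge is assumed to be fixed length.

```python
import hashlib
import hmac


def auth_response(secret, challenge, channel_binding):
    """Client side: prove knowledge of secret, tied to this TLS channel."""
    return hmac.new(secret, challenge + channel_binding, hashlib.sha256).digest()


def verify(secret, challenge, server_channel_binding, response):
    """Server side: recompute using the server's own view of the channel."""
    expected = auth_response(secret, challenge, server_channel_binding)
    return hmac.compare_digest(expected, response)
```

If an attacker terminates TLS in the middle, the client and the server see different channel binding values, and the check fails even though the secret is correct.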

Tuesday, July 28, 2015

Installing and running NetBSD and OpenBSD under bhyve

These instructions assume that you have downloaded the install ISO from the respective sources.  These were done with specific versions, and there may be minor changes with older and newer versions.

These instructions could possibly be simpler, such as not using separate device maps for grub-bhyve.   These were tested on a month old HEAD.

There are other guides that cover most of this, and probably in more detail.  The issue that I had was that the exact grub commands to load the kernels were not well documented.  Both of the images boot and are able to get DHCP leases and pass basic traffic.

Hope this helps others!


NetBSD

  1. Install grub2-bhyve:
    pkg install grub2-bhyve
  2. Create a file called instdev.map containing:
    (cd0) NetBSD-6.1.5-amd64.iso
    (hd1) netbsd.img
  3. Create the file netbsd.img with the correct size:
    truncate -s 3g netbsd.img
  4. Run the following commands (or put into a script file) under sh:
    MEM=512M
    VM=nbsd615
    bhyvectl --destroy --vm=$VM
    grub-bhyve -r cd0 -M $MEM -m instdev.map $VM <<EOF
    knetbsd -h -r cd0a (cd0)/netbsd
    boot
    EOF
    bhyve -A -H -P -s 0:0,hostbridge -s 1:0,lpc \
        -s 2:0,virtio-net,tap3 -s 3:0,virtio-blk,./netbsd.img \
        -s 4:0,ahci-cd,./NetBSD-6.1.5-amd64.iso \
        -l com1,stdio -c 2 -m $MEM $VM
  5. This will run the installer; complete the installation.
  6. Create a file called dev.map containing:
    (hd1) netbsd.img
  7. Now in the future, to run NetBSD from the image, run the following commands:
    bhyvectl --destroy --vm=$VM
    grub-bhyve -r cd0 -M $MEM -m dev.map $VM <<EOF
    knetbsd -h -r ld0a (hd1,msdos1)/netbsd
    boot
    EOF
    bhyve -A -H -P -s 0:0,hostbridge -s 1:0,lpc \
        -s 2:0,virtio-net,tap3 -s 3:0,virtio-blk,./netbsd.img \
        -l com1,stdio -c 2 -m $MEM $VM
  8. Profit!


OpenBSD

  1. Install grub2-bhyve:
    pkg install grub2-bhyve
  2. Create a file called instdev.map containing:
    (cd0) install57.iso
    (hd1) openbsd.img
  3.  Create the file openbsd.img with the correct size:
    truncate -s 3g openbsd.img
  4. Run the following commands (or put into a script file) under sh:
    MEM=512M
    VM=obsd57
    bhyvectl --destroy --vm=$VM
    grub-bhyve -r cd0 -M $MEM -m instdev.map $VM <<EOF
    kopenbsd -h com0 (cd0)/5.7/amd64/bsd.rd
    boot
    EOF
    bhyve -A -H -P -s 0:0,hostbridge -s 1:0,lpc \
        -s 2:0,virtio-net,tap3 -s 3:0,virtio-blk,./openbsd.img \
        -s 4:0,ahci-cd,./install57.iso \
        -l com1,stdio -c 2 -m $MEM $VM
  5. This will run the installer; complete the installation.
  6. Create a file called dev.map containing:
    (hd1) openbsd.img
  7. Now in the future, to run OpenBSD from the image, run the following commands:
    bhyvectl --destroy --vm=$VM
    grub-bhyve -r hd1 -M $MEM -m dev.map $VM <<EOF
    kopenbsd -h com0 -r sd0a (hd1,openbsd1)/bsd
    boot
    EOF
    bhyve -A -H -P -s 0:0,hostbridge -s 1:0,lpc \
        -s 2:0,virtio-net,tap3 -s 3:0,virtio-blk,./openbsd.img \
        -s 4:0,ahci-cd,./install57.iso \
        -l com1,stdio -c 2 -m $MEM $VM
  8. Profit!

Thursday, May 7, 2015

XML Schema Validation for the command line

It turns out that unless you use a full-fledged XML editor, validating your XML document against a schema is difficult.  Most tools require you to specify a single schema file.  If you have an XML document that contains more than one name space this doesn't work too well, as often each name space is in a separate schema file.

The XML document has xmlns attributes which use a URI as the identifier.  These URIs only identify the name space; they are not URLs, so they cannot be used to fetch the schema.  In fact, different cases in the URIs specify different name spaces even in the "host" part, though that is not the case with URLs.  In order for validators to find the schema, the attribute xsi:schemaLocation is used to map the name space URIs to the URLs of the schema.

The xsi:schemaLocation mapping is very simple.  It is simply a white space delimited list of URI/URL pairs.  None of the command line tools that I used uses this attribute to make the schema validation simple.  This includes xmllint which uses the libxml2 library.  I also tried to use the Java XML library Xerces, but was unable to get it to work.  Xerces did not provide a simple command line utility, and I couldn't figure out the correct java command line to invoke the validator class.
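To show how little there is to the mapping, here is a small sketch that pulls the pairs out of a document's root element.  It uses Python's standard xml.etree.ElementTree rather than the shell tools discussed in this post, and it only looks at the root element, which is an assumption about where the attribute lives.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"


def schema_locations(xml_text):
    """Return [(namespace URI, schema URL), ...] from the root element."""
    root = ET.fromstring(xml_text)
    # split() with no argument handles spaces, tabs and new lines alike.
    tokens = root.get(XSI_SCHEMA_LOCATION, "").split()
    if len(tokens) % 2:
        raise ValueError("xsi:schemaLocation must hold URI/URL pairs")
    return list(zip(tokens[0::2], tokens[1::2]))
```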

My coworker, Patrick, found the blog entry, Nokogiri XML schema validation with multiple schema files, which talks about using xs:import to have a single schema file support multiple name spaces.  With this, we realized that we could finally get our XML document verified.

As I know shell scripting well, I decided to write a script to automate creating a unified schema and validate a document.  The tools don't cache the schema documents, requiring fetching the schema each time you want to validate the XML document.  We did attempt to write the schema files to disk, and reuse those, but there are issues in that some schemas reference other resources in them.  If the schema is not retrieved from the web, these internal resources are not retrieved either, causing errors when validating some XML documents.

With a little bit of help from xsltproc to extract xsi:schemaLocation, it wasn't too hard to generate the schema document and provide it to xmllint.

The code (xmlval.sh):

1:  #!/bin/sh -  
3:  cat <<EOF |  
4:  <?xml version="1.0"?>  
5:  <xsl:stylesheet version="1.0"  
6:      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"  
7:      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
8:  >  
10:   <xsl:output method="text"/>  
11:   <xsl:template match="/">  
12:    <xsl:value-of select="/*/@xsi:schemaLocation"/>  
13:   </xsl:template>  
15:  </xsl:stylesheet>  
16:  EOF  
17:      xsltproc - "$1" |  
18:      sed -e 's/  */\
19:  /g' |  
20:      sed -e '/^$/d' |  
21:      (echo '<?xml version="1.0" encoding="UTF-8"?>'  
22:       echo '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:nospace="nospace" targetNamespace="http://www.example.com/nospace">'  
23:       while :; do  
24:          if ! read a; then  
25:              break  
26:          fi  
27:          if ! read b; then  
28:              break  
29:          fi  
30:          echo '<xs:import namespace="'"$a"'" schemaLocation="'"$b"'"/>'  
31:      done  
32:      echo '</xs:schema>') |  
33:      xmllint --noout --schema - "$1"  

Though the script looks complicated, it is a straightforward pipeline:

  1. Lines 3-16 provide the xslt document to xsltproc on line 17 to extract the schema location attribute.
  2. Lines 18-20 replace runs of spaces with new lines and delete any blank lines.  It should probably also handle tabs, but none of the documents that I have had tabs.  After this, we now have the odd lines containing the URI of the name space, and the even lines contain the URL for the schema.
  3. Lines 21 and 22 are the header for the new schema document.
  4. Lines 23-31 pull in these line pairs and create the necessary xs:import lines.
  5. Line 32 provides the closing element for the schema document.
  6. Line 33 gives the schema document to xmllint for validation.
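The schema generating part of the pipeline (lines 21-32) translates to other languages easily.  Below is a sketch in Python that takes a list of (name space URI, schema URL) pairs and returns the wrapper schema as a string; the header mirrors the one in the script, and quoteattr from the standard library takes care of escaping.  Feeding the result to a validator is left out.

```python
from xml.sax.saxutils import quoteattr

SCHEMA_HEADER = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'xmlns:nospace="nospace" '
    'targetNamespace="http://www.example.com/nospace">'
)


def wrapper_schema(pairs):
    """Return a schema document with one xs:import per (namespace, URL) pair."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', SCHEMA_HEADER]
    for namespace, location in pairs:
        # quoteattr() adds the surrounding quotes and escapes &, < and >.
        lines.append("<xs:import namespace=%s schemaLocation=%s/>"
                     % (quoteattr(namespace), quoteattr(location)))
    lines.append("</xs:schema>")
    return "\n".join(lines) + "\n"
```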