Xerces DOMParser funny…

May 1, 2010

Working with Xerces DOMParser I have noticed a weird exception being reported arbitrarily by certain instance documents. This exception message is “The processing instruction target matching “[xX][mM][lL]” is not allowed.“. Looking at the XML source file it does have the normal <?xml version=”1.0″ encoding=”UTF-8″?> prefix so was initially unsure why this exception was firing every now and then?

Turns out that using the Xerces (JDK default) DOMParser ¬†implementation, there is a parser exception ONLY if there are blank lines in the XML document before the <?xml version=”1.0″ encoding=”UTF-8″?> declaration..??

Removing these blank lines (NOT the XML declaration header) allows the parser to complete successfully. I don’t profess to understand the science bit here, but this did knock me off track for a while. Interested to hear what relevance the blank lines do have in a document where whitespace is irrelevant…?


Xerces, XmlBeans and XML Schema Import Resolver

April 30, 2010

I’ve been developing a framework to provide XML validation/assurance tools based on a range of parser/validator implementations in Java. A recent experience warranted flagging, given I’ve been surprised at the number or credible examples out there on yonder interweb.

The Headline

XmlBeans 2.5 and Xerces 2.9 behave differently when using a runtime schema/resource resolver. XmlBeans relies on the use of the EntityResolver interface whereas Xerces relies on the LSResourceResolver interface, but the way the XMLSchema <import /> statement is put together causes some consistency problems in terms of how these resolvers are used…

The Goal

Take a WSDL (with inline-schema) and then use that WSDL to validate XML instance documents. You cannot validate directly with a WSDL, and therefore as a developer you spend time messing with tools to extract schema and so forth. This tool will greatly simplify this process among other things. This approach involves unpacking the inline XML Schema documents in such a way that multiple XSD fragments can be utilised at the point of validation.

The Approach

Open the WSDL, extract the schemas taking care to maintain the namespace declarations from the outer WSDL definitions section now that XSD’s will become independent XSD files and unable to inherit through context. Spin up a version of XmlBeans and Xerces, and generate a n-way schema validation report this catering for variations in the level of compliance in any one validating parser.

The Problem

Some of the WSDL documents are very complex. They contain many, sometimes hundreds of hierarchically related XML Schema fragments, each of which ‘imports’ the namespaces of numerous other schema fragments. Clearly when I separate the XSD’s from the WSDL these imports cannot resolve through inherited context so I have to make sure that I can resolve these at runtime when the validating parsers are fed the root schema. Problem is the inline schemas in the WSDL contain only a namespace import:

<import namespace=”http://some.other.namespace&#8221; />

And when this is exported to a standalone XSD file, you need to use a runtime SchemaResolver to acquire the other related XSD files such that the XmlBeans or Xerces engines can assimilate all of the type and namespace set necessary to validate the specified XML instance document. I spent a lot of time diagnosing why my XmlBeans implementation was not triggering the necessary ‘tell me where this schema lives?‘ events, such that element references were remaining unresolved, and the validation of the XML instance document was failing. With exactly the same input artifacts the Xerces resolver was happily going about it’s business and allowing me to feed it schema fragments all the way home.

So Watch Out For This…

If you are using XmlBeans, and you want to use a root XMLSchema to validate an XML document, but if the root XMLSchema contains <import /> statements, then you must have a schemaLocation present.

<import namespace=”http://some.other.namespace&#8221; ¬†schemaLocation=”file:/some.where”/>

Only with the schemaLocation present will the core XmlBeans.compileXsd() function call out to your custom EntityResolver such that you can then map the requested ‘namespace’ to a physical input source. On the other hand the Xerces parser will work with the shorter format which is already present in my

<import namespace=”http://some.other.namespace&#8221; />

Now the only problem here is that I did not want to modify the XMLSchema artifacts I was extracting from the WSDL on the basis that I am offering a service based on the inputs supplied. However, I have been forced into a position of needing to inject a schemaLocation tag into all the XSD’s as I’m extracting them from the parent WSDL. That said I am not specifying a physical location for the schemaLocation, merely reiterating the same generic namespace. I do this to just force the trigger of the resolver events, from which I then use only the supplied namespace=”http://some.other.namespace&#8221; tag in conjunction with some context information that I know in relation to where I put the schemas in my file-system, to resolve and supply the linkage to the required schema. This is sufficiently generic that I don’t deem it to be overly intrusive into the base schema, and I end up with the following modified import statements:

<import namespace=”http://external.namespace&#8221; ¬†schemaLocation=”http://external.namespace”/&gt;

With this format I get uniform behaviour between XmlBeans and Xerces. Both call out to my runtime resolvers which are then able to use a simple mapping scheme to supply the schema content and I now able to operate consistently.

This did take me a long time to diagnose, so hopefully it will be of use to others.