I have recently created a secondary and more specialised blog called ‘Semantic Integration Therapy’ given my focus is now beginning to shift to this particular discipline. Semantic Integration in my terminology and context relates to achieving more effective application integration and SOA solutions by extending the traditional integration contract (XSD and informal documentation) with a more semantically aware mechanism. As such I’m now deep-diving into the use of XMLSchema plus Schematron, RDF, OWL and commercial tooling such as Progress DXSI. There is a heavy overlap with the Semantic Web Community, but my focus is more on the transactional integration space within the Enterprise, as opposed to the holistic principles of the third generation or semantic web.
I’ve been developing a framework to provide XML validation/assurance tools based on a range of parser/validator implementations in Java. A recent experience warranted flagging, given I’ve been surprised at how few credible examples there are out there on yonder interweb.
XmlBeans 2.5 and Xerces 2.9 behave differently when using a runtime schema/resource resolver. XmlBeans relies on the use of the EntityResolver interface whereas Xerces relies on the LSResourceResolver interface, but the way the XMLSchema <import /> statement is put together causes some consistency problems in terms of how these resolvers are used…
Take a WSDL (with inline-schema) and then use that WSDL to validate XML instance documents. You cannot validate directly with a WSDL, and therefore as a developer you spend time messing with tools to extract schema and so forth. This tool will greatly simplify this process among other things. This approach involves unpacking the inline XML Schema documents in such a way that multiple XSD fragments can be utilised at the point of validation.
Open the WSDL and extract the schemas, taking care to maintain the namespace declarations from the outer WSDL definitions section, now that the XSDs will become independent files and unable to inherit them through context. Spin up a version of XmlBeans and Xerces, and generate an n-way schema validation report, thus catering for variations in the level of compliance in any one validating parser.
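The extraction step described above can be sketched with nothing more than the JDK's DOM and transformer APIs. This is only a minimal illustration, not the framework's actual code: the class name `WsdlSchemaExtractor`, its method names, and the tiny sample WSDL are all assumptions made up for the example. It copies every in-scope `xmlns` declaration from the ancestors of each inline schema onto the schema element before serialising it, so that a prefix such as `tns` declared only on the WSDL definitions element still resolves in the standalone XSD.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names and sample data are illustrative only.
class WsdlSchemaExtractor {
    // Made-up WSDL: the tns prefix is declared only on wsdl:definitions.
    static final String SAMPLE_WSDL =
        "<wsdl:definitions xmlns:wsdl='http://schemas.xmlsoap.org/wsdl/' xmlns:tns='urn:example:svc'>"
      + "<wsdl:types>"
      + "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' targetNamespace='urn:example:svc'>"
      + "<xs:element name='order' type='tns:OrderType'/>"
      + "<xs:complexType name='OrderType'/>"
      + "</xs:schema>"
      + "</wsdl:types>"
      + "</wsdl:definitions>";

    /** Parses XML text into a namespace-aware DOM, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Returns each inline xs:schema as the text of a standalone document. */
    static List<String> extractSchemas(String wsdl) {
        NodeList schemas = parse(wsdl)
            .getElementsByTagNameNS(XMLConstants.W3C_XML_SCHEMA_NS_URI, "schema");
        List<String> standalone = new ArrayList<>();
        for (int i = 0; i < schemas.getLength(); i++) {
            Element schema = (Element) schemas.item(i);
            copyInheritedNamespaces(schema);
            standalone.add(serialize(schema));
        }
        return standalone;
    }

    /** Copies xmlns declarations from ancestors; the nearest declaration wins. */
    static void copyInheritedNamespaces(Element schema) {
        for (Node n = schema.getParentNode(); n instanceof Element; n = n.getParentNode()) {
            NamedNodeMap attrs = n.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr a = (Attr) attrs.item(i);
                boolean isNsDecl = XMLConstants.XMLNS_ATTRIBUTE_NS_URI.equals(a.getNamespaceURI());
                if (isNsDecl && !schema.hasAttribute(a.getName())) {
                    schema.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, a.getName(), a.getValue());
                }
            }
        }
    }

    /** Serialises one element subtree without an XML declaration. */
    static String serialize(Element e) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(e), new StreamResult(w));
            return w.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not serialise schema", ex);
        }
    }

    public static void main(String[] args) {
        extractSchemas(SAMPLE_WSDL).forEach(System.out::println);
    }
}
```

Copying every inherited declaration is deliberately blunt: some prefixes (such as the WSDL one) end up unused in the XSD, but that is harmless and avoids having to scan QName-valued attributes to work out which prefixes are really needed.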
Some of the WSDL documents are very complex. They contain many, sometimes hundreds of hierarchically related XML Schema fragments, each of which ‘imports’ the namespaces of numerous other schema fragments. Clearly when I separate the XSDs from the WSDL these imports cannot resolve through inherited context, so I have to make sure that I can resolve them at runtime when the validating parsers are fed the root schema. The problem is that the inline schemas in the WSDL contain only a namespace import:
<import namespace="http://some.other.namespace" />
And when this is exported to a standalone XSD file, you need to use a runtime SchemaResolver to acquire the other related XSD files such that the XmlBeans or Xerces engines can assimilate all of the type and namespace set necessary to validate the specified XML instance document. I spent a lot of time diagnosing why my XmlBeans implementation was not triggering the necessary ‘tell me where this schema lives?’ events, such that element references were remaining unresolved, and the validation of the XML instance document was failing. With exactly the same input artifacts the Xerces resolver was happily going about its business and allowing me to feed it schema fragments all the way home.
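As a rough sketch of the kind of runtime resolver described above, the following uses the JDK's built-in `javax.xml.validation` API, whose default implementation is derived from Xerces, rather than a standalone Xerces 2.9 install. That substitution, the class names, the in-memory map standing in for the extracted XSD files, and the sample schemas are all assumptions for illustration. The root schema carries a namespace-only import, and the `LSResourceResolver` supplies the imported schema by looking up the requested namespace.

```java
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.SAXException;

// Hypothetical demo class: names and schemas are illustrative only.
class NamespaceResolverDemo {
    static final String XS = "xmlns:xs='http://www.w3.org/2001/XMLSchema'";
    // Root schema with a namespace-only import (no schemaLocation).
    static final String ROOT_XSD =
        "<xs:schema " + XS + " xmlns:t='urn:example:types' targetNamespace='urn:example:root'>"
      + "<xs:import namespace='urn:example:types'/>"
      + "<xs:element name='order' type='t:OrderType'/>"
      + "</xs:schema>";
    static final String TYPES_XSD =
        "<xs:schema " + XS + " targetNamespace='urn:example:types' elementFormDefault='qualified'>"
      + "<xs:complexType name='OrderType'><xs:sequence>"
      + "<xs:element name='qty' type='xs:int'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:schema>";

    /** Answers "where does this namespace live?" from an in-memory map. */
    static class MapResolver implements LSResourceResolver {
        private final Map<String, String> schemasByNamespace;
        MapResolver(Map<String, String> schemasByNamespace) { this.schemasByNamespace = schemasByNamespace; }

        @Override
        public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                       String systemId, String baseURI) {
            String xsd = namespaceURI == null ? null : schemasByNamespace.get(namespaceURI);
            // Returning null tells the parser to fall back to its default lookup.
            return xsd == null ? null : new StringInput(xsd, publicId, systemId, baseURI);
        }
    }

    /** Minimal LSInput that serves schema text from a String. */
    static class StringInput implements LSInput {
        private final String text, publicId, systemId, baseURI;
        StringInput(String text, String publicId, String systemId, String baseURI) {
            this.text = text; this.publicId = publicId; this.systemId = systemId; this.baseURI = baseURI;
        }
        public Reader getCharacterStream() { return new StringReader(text); }
        public void setCharacterStream(Reader r) { }
        public InputStream getByteStream() { return null; }
        public void setByteStream(InputStream s) { }
        public String getStringData() { return null; }
        public void setStringData(String s) { }
        public String getSystemId() { return systemId; }
        public void setSystemId(String s) { }
        public String getPublicId() { return publicId; }
        public void setPublicId(String s) { }
        public String getBaseURI() { return baseURI; }
        public void setBaseURI(String s) { }
        public String getEncoding() { return null; }
        public void setEncoding(String s) { }
        public boolean getCertifiedText() { return false; }
        public void setCertifiedText(boolean b) { }
    }

    /** Compiles the root schema via the resolver, then validates the instance. */
    static boolean isValid(String rootXsd, Map<String, String> schemasByNamespace, String xml) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.setResourceResolver(new MapResolver(schemasByNamespace));
            schema = factory.newSchema(new StreamSource(new StringReader(rootXsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema set did not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the instance document broke a schema rule
        } catch (java.io.IOException e) {
            throw new IllegalStateException("could not read instance document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<r:order xmlns:r='urn:example:root' xmlns:t='urn:example:types'><t:qty>3</t:qty></r:order>";
        System.out.println(isValid(ROOT_XSD, Map.of("urn:example:types", TYPES_XSD), xml));
    }
}
```

The sketch separates a schema set that fails to compile (an exception) from an instance that fails validation (a `false` result), since an unresolved import shows up as the former.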
So Watch Out For This…
If you are using XmlBeans and want to validate an XML document against a root XMLSchema that contains <import /> statements, then each import must have a schemaLocation present.
<import namespace="http://some.other.namespace" schemaLocation="file:/some.where"/>
Only with the schemaLocation present will the core XmlBeans.compileXsd() function call out to your custom EntityResolver such that you can then map the requested ‘namespace’ to a physical input source. On the other hand the Xerces parser will work with the shorter format which is already present in my extracted schemas:
<import namespace="http://some.other.namespace" />
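On the XmlBeans side, the custom EntityResolver mentioned above might look roughly like the sketch below. It implements only the standard SAX `EntityResolver` interface from the JDK, so it compiles without XmlBeans on the classpath. The class name and the in-memory map are hypothetical, and exactly which argument XmlBeans uses to pass the import's namespace is an assumption here (the sketch tries `publicId` first, then `systemId`); check that against the XmlBeans version in use.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps a requested namespace to schema text in memory.
class NamespaceEntityResolver implements EntityResolver {
    private final Map<String, String> schemasByNamespace;

    NamespaceEntityResolver(Map<String, String> schemasByNamespace) {
        // Copy into a HashMap so that null keys can be looked up safely.
        this.schemasByNamespace = new HashMap<>(schemasByNamespace);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Assumption: the namespace arrives as publicId; fall back to systemId.
        String xsd = schemasByNamespace.get(publicId);
        if (xsd == null) {
            xsd = schemasByNamespace.get(systemId);
        }
        if (xsd == null) {
            return null; // let the caller apply its default resolution
        }
        InputSource source = new InputSource(new StringReader(xsd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("http://some.other.namespace", "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>");
        NamespaceEntityResolver resolver = new NamespaceEntityResolver(store);
        System.out.println(resolver.resolveEntity("http://some.other.namespace", "file:/some.where") != null);
    }
}
```

In XmlBeans such a resolver would typically be registered through `XmlOptions.setEntityResolver(...)` and the options passed to `XmlBeans.compileXsd(...)`; that wiring is left out so the example stays runnable on a bare JDK.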
Now the only problem here is that I did not want to modify the XMLSchema artifacts I was extracting from the WSDL, on the basis that I am offering a service based on the inputs supplied. However, I have been forced into a position of needing to inject a schemaLocation attribute into all the XSDs as I’m extracting them from the parent WSDL. That said, I am not specifying a physical location for the schemaLocation, merely reiterating the same generic namespace. I do this just to force the trigger of the resolver events, from which I then use only the supplied namespace="http://some.other.namespace" value, in conjunction with some context information that I know about where I put the schemas in my file-system, to resolve and supply the linkage to the required schema. This is sufficiently generic that I don’t deem it to be overly intrusive into the base schema, and I end up with the following modified import statements:
<import namespace="http://some.other.namespace" schemaLocation="http://some.other.namespace"/>
With this format I get uniform behaviour between XmlBeans and Xerces. Both call out to my runtime resolvers, which are then able to use a simple mapping scheme to supply the schema content, and I am now able to operate consistently.
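The injection itself is a small DOM rewrite. The sketch below is one way it could be done with the JDK alone; the class and method names are made up for the example and this is not the tool's actual code. Every `import` that has a `namespace` but no `schemaLocation` gets a `schemaLocation` that simply repeats the namespace, and imports that already carry a location are left untouched.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class ImportLocationInjector {
    static final String XSD_NS = XMLConstants.W3C_XML_SCHEMA_NS_URI;

    /** Parses schema text into a namespace-aware DOM, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse schema", e);
        }
    }

    /**
     * Adds schemaLocation (equal to the namespace) to imports that lack one.
     * Returns the number of import elements that were changed.
     */
    static int injectSchemaLocations(Document schemaDoc) {
        NodeList imports = schemaDoc.getElementsByTagNameNS(XSD_NS, "import");
        int changed = 0;
        for (int i = 0; i < imports.getLength(); i++) {
            Element imp = (Element) imports.item(i);
            String namespace = imp.getAttribute("namespace");
            // Imports without a namespace have nothing to reiterate, so skip them.
            if (!namespace.isEmpty() && !imp.hasAttribute("schemaLocation")) {
                imp.setAttribute("schemaLocation", namespace);
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:import namespace='http://some.other.namespace'/>"
          + "</xs:schema>");
        int changed = injectSchemaLocations(doc);
        Element imp = (Element) doc.getElementsByTagNameNS(XSD_NS, "import").item(0);
        System.out.println(changed + " import(s) updated, schemaLocation=" + imp.getAttribute("schemaLocation"));
    }
}
```

Because the injected value is only a trigger, the resolver on the other side should key on the namespace rather than trying to open the location as a real URL.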
This did take me a long time to diagnose, so hopefully it will be of use to others.
I’ve been posting about the rise of the informal semantic contract relating to web-services and the deficiencies of XML Schema in adequately communicating the capability of anything other than a trivial service. Formalising a semantic contract by enriching a baseline structural contract (WSDL/XSD) with semantic or content-based constraints effectively creates a smaller window of well-formedness, through which a consumer must navigate the well-formedness of their payload in issuing a request. Other factors such as incremental implementation of a complex business service ‘behind’ the generalised service interface compound the need for a semantic contract.
To clarify the relationship between structural and semantic, I happened upon a great picture which I’ve annotated…
I posted a long time back about my troubles in finding a way of performing schema validation in ruby (see Ruby and Xml Schema). At that time I was using REXML and only able to perform well-formedness checks based on basic structural integrity, but had no way to take an XSD and validate an instance document.
I’m pleased to say that there is a way to do this now, namely libxml-ruby. It’s available as a gem (gem install libxml-ruby) and the process is pretty simple:
require 'libxml'

document = LibXML::XML::Document.file(@xml_filename)
schema = LibXML::XML::Schema.new(@xsd_filename)
result = document.validate_schema(schema) do |message, flag|
  log.debug(message)
  puts message
end
I’ve found this to be a very neat piece of code for dealing with the kind of schema integrity checking I’m looking for, and as I blend this with a number of other Java-based parsers using the Ruby Java Bridge I get a pretty good, consistent perspective on validity.
So there I was looking for the REXML::Document.validate_with_xsd() routine, persuading myself that it must be there somewhere, when suddenly I came to the realisation that it wasn’t! Eh?!
I then happened upon numerous blogs and chat threads clearly explaining why the collective conscience of ‘Ruby’ had thus far deemed XML Schema unfit for inclusion in this ever expanding scripting language, because….well XSD is crap! Woo hoo, I’m glad that was settled so convincingly!
Excellent I thought, if only I were working in an ideological vacuum where XML Schema had been outlawed years ago, but sadly no, the primary currency of integration in my world is XML declared primarily with XSD and supporting semantic information. I might not ‘like’ it, but it’s there….so I need to exploit it to avoid reinventing it….don’t I? Or else what did I miss?
I noted with interest the justification that to effectively use XSD within Ruby, and make sense of a document validator revolving around schema, I’d have to write just as much reactive code as I would have to do if I just coded the document-specific validation routines by hand. Hmmm…not sure….I think ‘I’ would write a lot MORE code if I attempted to do that…than say the seasoned hackers who’re making such assertions with their zen-like-one-ness with the syntax.
REXML gives a structural integrity validation in a single line of code:
But whilst this gives me a warm feeling that I’m not parsing an alien binary artefact, it doesn’t give me much of an insight in terms of whether my structurally intact XML document actually manifests any of the rules/constraints laid out in the existing XSDs that my organisation uses to declare at least some aspects of the XML validation logic. So why not just add an additional routine to provide a yes/no – ahead of all the deep and meaningful reasons as to which constraint has got the n-th degree of infringement….? So what would be the problem in offering another root level operation such as:
I know there are answers – such as ‘well extend REXML yourself and submit it!’ or ‘use another language such as..err…Java!’, so no points for phoning in with those, but I’m just perplexed that such a mainstream component as XML Schema, even at its most basic level, has been forcibly ejected by the Ruby community thus far…
I emphasise I don’t see XML Schema as a shining light of pragmatism, but nor do I see the value in completely ignoring one of the primary currencies in a mature integration landscape…