Friday, May 6, 2011

Can I use XSLT to parse XML into sub-files? (+ Alternative Languages/Methods)

Hey all, I have highly repetitive data, five levels deep (including the root), that needs to be broken apart. (I'll include a quick sample in a minute.) What I'm looking to do is split a ~5 MB XML file into smaller sub-files based on the third-level nodes. But after that, it gets more complicated.

The task's requirements are these:

  1. Each sub-file must keep the ancestors of the extracted 3rd-level node, including their attributes.
  2. Each sub-file must retain all of that node's attributes and child nodes.
  3. If XSLT cannot handle the job, attempt it in Ruby. If you aren't good at XSLT, but can tell me how to do it in Ruby or even Python, please feel free to contribute an answer in those languages. (Otherwise, try to stick with XSLT or pseudo-code.)

DOM Hierarchy:

<xml attr="whatever">
  <major-group name="whatever">
    <minor-group name="whatever">
      <another-group name="whatever">
        <last-node name="whatever"></last-node>
      </another-group>
    </minor-group>
  </major-group>
</xml>

I need to split this on the minor-group element while retaining both its children and its direct ancestors, and write each minor-group (with that context) to its own external file. I have several files to split in this manner.

And... having never before parsed XML in Ruby, and having just begun using XSLT, I cannot yet write a script to accomplish my task with either.

I'm curious to see if XSLT is up to the task. :>

Edit:

Here's my resulting code, which also writes an xml-stylesheet processing instruction at the top of each output file.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml"/>
  <xsl:template match="minor-group">
    <xsl:variable name="filename" select="concat(@name,'.xml')"/>
    <xsl:result-document href="{$filename}">
      <!-- Emit the xml-stylesheet PI directly instead of via disable-output-escaping -->
      <xsl:processing-instruction name="xml-stylesheet">type="text/xsl" href="../web.xslt"</xsl:processing-instruction>
      <xml>
        <xsl:attribute name="whatever"><xsl:value-of select="../../@whatever" /></xsl:attribute>
        <major-group>
          <xsl:attribute name="whatever"><xsl:value-of select="../@whatever" /></xsl:attribute>
          <xsl:copy-of select="."/>
        </major-group>
      </xml>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
From stackoverflow
  • I don't believe you can split one file into multiple output files using XSLT alone.

    If you were to break the XML up into separate XML files with Ruby, and then apply the XSLT to each of those files in turn, it should work.

    The Wicked Flea : It used to be possible with Apache's Xalan, http://www.abbeyworkshop.com/howto/xslt/xslt_split/index.html but it seems defunct. I have found no other related result via Google. :/ (Besides, that breaking up is what I'm trying to do with either Ruby or XSLT -- I just don't know how to preserve it all with Ruby.)
    system PAUSE : @Flea: That sample references the Redirect extension to Xalan. Looks like it's available for Xalan-J (Java version of Xalan), see http://xml.apache.org/xalan-j/extensionslib.html#redirect
    The Wicked Flea : Which I don't know how to get/use. I haven't touched Java, ever. I'll look into it.... :/
    system PAUSE : It's a bit of work to set up. But you can run a transform from the command-line, see http://xml.apache.org/xalan-j/getstarted.html#commandline
  • To extract the list of "minor group" elements, either of the following XPath expressions would work.

    /xml/major-group/minor-group    (the explicit way)
    /*/*/*                          (the generic, any-third-level-element way)
    

    In a scripting language of your choice, read the document into a DOM, construct a loop over the XPath query, writing the results to different output files.

    With XSLT 1.0 it is not possible to generate more than one output document at a time. However, XSLT 2.0 supports this via the <xsl:result-document> instruction.

    If you have an XSLT 2.0 engine at your disposal, you could try that route. A random page I found at IBM's developerWorks website shows how to get started: Tip: Create multiple files in XSLT 2.0

    The Wicked Flea : Thanks very much for the tip-off about XSLT 2.0; this ought to fix my problem, but I'm going to test it first.
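
For anyone taking the scripting route instead, the DOM-and-loop approach from the answer above can be sketched in Python with only the standard library. This is a rough sketch, not a tested solution for the real data: the function name split_minor_groups is made up for illustration, and it assumes every minor-group has a unique name attribute that is safe to use as a file name. It does not write the xml-stylesheet processing instruction.

```python
import copy
import os
import xml.etree.ElementTree as ET


def split_minor_groups(source, out_dir):
    """Write one file per minor-group, wrapped in copies of its two ancestors.

    `source` is a file name or file object; `out_dir` must already exist.
    Returns the list of paths written, in document order.
    """
    root = ET.parse(source).getroot()
    written = []
    for major in root.findall("major-group"):
        for minor in major.findall("minor-group"):
            # Rebuild the ancestor chain with attributes only (no siblings).
            new_root = ET.Element(root.tag, dict(root.attrib))
            new_major = ET.SubElement(new_root, major.tag, dict(major.attrib))
            # Deep-copy the minor-group so its attributes and children survive.
            new_major.append(copy.deepcopy(minor))
            # Assumption: @name is present, unique, and a valid file name.
            path = os.path.join(out_dir, minor.get("name") + ".xml")
            ET.ElementTree(new_root).write(
                path, encoding="utf-8", xml_declaration=True
            )
            written.append(path)
    return written
```

Calling split_minor_groups("input.xml", "out") (both names are placeholders) would write one file per minor-group into the out directory. Each file holds the root element and the enclosing major-group with their attributes, plus a full copy of that single minor-group.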
