Using vendor extensions might make XSLT
processing in Java 50 times faster.
Summary: This article describes Xalan extensions that can make XSLT processing much faster and more convenient at the same time. A small benchmark shows how a short Jython script embedded in an XSL template makes XSL processing 5-50 times faster.

Background: XML was used as a consolidation format for data that comes from different databases.

Problem: We need to create a nice-looking report, with grouping etc., from the XML document for a server-side J2EE application.

Approach to the solution: We will use our input XML document as a database and try to create XSL templates that produce the necessary output. First, we will create an input file generator that helps us generate input files of a desired size. The Jython file generation script looks like this:

    # Jython 2.1 script (Python 2 syntax)
    from java.util import Random
    import sys

    r = Random()
    comp = 1          # running counter that makes company names unique


    def row( k ):
        # One <row> element with the given key and a random value in 0..99
        s = "\n\t\t<row><key>" + k + "</key><val>"
        s = s + str( r.nextInt( 100 ) ) + "</val></row>"
        return s

    def subrpt( sic, brok, name ):
        # One <company> element with three rows keyed a, b and c
        s = "\n<company><sic>" + sic + "</sic><broker>" + brok + "</broker><name>" + name + "</name>"
        s = s + row( "a" )
        s = s + row( "b" )
        s = s + row( "c" )
        s = s + "\n</company>"
        return s

    def nextComp():
        global comp
        comp = comp + 1
        return comp

    def generateRows( compRange, brokerRange, sicRange ):
        # The loop variable comp is local here; it does not touch the global counter
        for comp in range( compRange ):
            for brok in range( brokerRange ):
                for sic in range( sicRange ):
                    v = subrpt( "sic_" + str( sic ), "brok_" + str( brok ), "company_" + str( nextComp() ) )
                    print v

    def report( repeat ):
        print "<report>"
        for i in range( repeat ):
            generateRows( 10, 10, 5 )
        print "\n</report>"

    arg1 = sys.argv[1]
    repeat = int( arg1 )
    report( repeat )

Looking at the file generation script, we can see that one loop produces 500 “company” nodes that cover 50 distinct broker/sic combinations (10 “broker” values and 5 “sic” values). Such a file is 100768 bytes in size.
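The node counts per loop can be checked without generating any XML. The following is only a sketch in plain Python 3 (not the Jython generator itself); the function name count_nodes and its parameters are made up for this illustration, and the defaults mirror the generateRows( 10, 10, 5 ) call in the script.

```python
from itertools import product


def count_nodes(repeat, comp_range=10, broker_range=10, sic_range=5):
    """Count what `repeat` generator loops would emit, without writing XML."""
    companies = 0
    pairs = set()
    for _ in range(repeat):
        for _comp, brok, sic in product(
            range(comp_range), range(broker_range), range(sic_range)
        ):
            companies += 1  # one <company> element per innermost iteration
            pairs.add(("brok_%d" % brok, "sic_%d" % sic))
    return {
        "company": companies,
        "row": companies * 3,  # every company carries rows a, b and c
        "broker_sic_pairs": len(pairs),
        "brokers": len({b for b, _ in pairs}),
        "sics": len({s for _, s in pairs}),
    }


if __name__ == "__main__":
    print(count_nodes(1))
    print(count_nodes(2)["company"])
```

With one loop this reports 500 company elements, and with two loops 1000, which is the size of the input file used in the first benchmark.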
Our test generator produces unique values for “company” nodes but keeps repeating the values of “sic” and “broker” nodes. By supplying a particular number of loops we can control how big our input file will be. Let's generate an input file with 1000 “company” nodes:

    jython data/generateInFile.py 2 > data/in.xml

The first challenge is the lack of a distinct() function in the XSL standard, so we need to use some tricks. With the “Muenchian” method our template might look like this:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:key name="distinctSic" match="sic" use="."/>
      <xsl:key name="distinctBroker" match="broker" use="."/>
      <xsl:key name="distinctKey" match="key" use="."/>

      <xsl:template match="/">
        <rpt>
          <!-- select= keeps each variable a node-set that xsl:for-each can iterate -->
          <xsl:variable name="dSIC"
              select="/report/company/sic[generate-id() = generate-id( key( 'distinctSic', . ) )]"/>
          <xsl:variable name="dBROKER"
              select="/report/company/broker[generate-id() = generate-id( key( 'distinctBroker', . ) )]"/>
          <xsl:variable name="dKEYS"
              select="/report/company/row/key[generate-id() = generate-id( key( 'distinctKey', . ) )]"/>
          <xsl:for-each select="$dSIC">
            <xsl:sort select="."/>
            <xsl:variable name="SIC">
              <xsl:value-of select="."/>
            </xsl:variable>
            <xsl:for-each select="$dBROKER">
              <xsl:variable name="broker">
                <xsl:value-of select="."/>
              </xsl:variable>
              <xsl:for-each select="$dKEYS">
                <xsl:variable name="key">
                  <xsl:value-of select="."/>
                </xsl:variable>
                <xsl:for-each select="/report/company[sic = $SIC and broker = $broker]/row[key = $key]">
                  <xsl:sort select=".."/>
                  <row>
                    <sic><xsl:value-of select="$SIC"/></sic>
                    <broker><xsl:value-of select="$broker"/></broker>
                    <key><xsl:value-of select="$key"/></key>
                    <company><xsl:value-of select="../name"/></company>
                    <v><xsl:value-of select="val"/></v>
                  </row>
                </xsl:for-each>
              </xsl:for-each>
            </xsl:for-each>
          </xsl:for-each>
        </rpt>
      </xsl:template>

    </xsl:stylesheet>

I will not describe here how the “Muenchian” method works, but I find it neither convenient nor self-evident. I would like to see something like SQL's SELECT DISTINCT( my_criteria ) here. Luckily, the Xalan XSLT processor comes with a useful extension that does exactly this: it natively implements the distinct() function from http://exslt.org/sets.
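For readers who want the idea behind both approaches in a few lines, here is a rough Python 3 analogy. It is only a sketch, not how any XSLT processor is implemented: the dictionary plays the role of the xsl:key index, and the index comparison plays the role of the generate-id() test that keeps a node only when it is the first one recorded for its value. The function name distinct is chosen for this illustration.

```python
def distinct(values):
    """Keep the first occurrence of every value, preserving document order."""
    first_index = {}
    for index, value in enumerate(values):
        # Like an xsl:key index: remember where each value first appears.
        first_index.setdefault(value, index)
    # Like the generate-id() comparison: keep an item only if it is
    # the first item recorded for its own value.
    return [v for i, v in enumerate(values) if first_index[v] == i]


if __name__ == "__main__":
    print(distinct(["sic_0", "sic_1", "sic_0", "sic_2", "sic_1"]))
```

The result contains each value once, in the order of first appearance, which is the behaviour the templates rely on for the “sic”, “broker” and “key” lists.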
Let's rewrite the template:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:set="http://exslt.org/sets"
        exclude-result-prefixes="set">

      <xsl:template match="/">
        <rpt>
          <!-- select= keeps each variable a node-set that xsl:for-each can iterate -->
          <xsl:variable name="dSIC" select="set:distinct( /report/company/sic )"/>
          <xsl:variable name="dBROKER" select="set:distinct( /report/company/broker )"/>
          <xsl:variable name="dKEYS" select="set:distinct( /report/company/row/key )"/>
          <xsl:for-each select="$dSIC">
            <xsl:sort select="."/>
            <xsl:variable name="SIC">
              <xsl:value-of select="."/>
            </xsl:variable>
            <xsl:for-each select="$dBROKER">
              <xsl:variable name="broker">
                <xsl:value-of select="."/>
              </xsl:variable>
              <xsl:for-each select="$dKEYS">
                <xsl:variable name="key">
                  <xsl:value-of select="."/>
                </xsl:variable>
                <xsl:for-each select="/report/company[sic = $SIC and broker = $broker]/row[key = $key]">
                  <xsl:sort select=".."/>
                  <row>
                    <sic><xsl:value-of select="$SIC"/></sic>
                    <broker><xsl:value-of select="$broker"/></broker>
                    <key><xsl:value-of select="$key"/></key>
                    <company><xsl:value-of select="../name"/></company>
                    <v><xsl:value-of select="val"/></v>
                  </row>
                  <xsl:text>

                  </xsl:text>
                </xsl:for-each>
              </xsl:for-each>
            </xsl:for-each>
          </xsl:for-each>
        </rpt>
      </xsl:template>

    </xsl:stylesheet>

Now I am pretty satisfied with the clarity of the new template and eager to compare the performance of the two templates. We will generate various input files and measure the processing time.
To make my life easier I will employ an Ant script to do the job:

    <target name="xslt_m" >
      <mkdir dir="temp"/>
      <tstamp>
        <format property="start_m" pattern="hh:mm:ss:S" />
      </tstamp>

      <echo message="${start_m} - start"/>
      <xslt force="true"
            out="temp/out1.xml"
            in="data/in.xml"
            style="style/muenchian.xsl">
      </xslt>
      <tstamp>
        <format property="stop_m" pattern="hh:mm:ss:S" />
      </tstamp>
      <echo message="${stop_m} - stop"/>

    </target>
Let's use the following command sequence to generate the input file and process it:

    jython data/generateInFile.py 2 > data/in.xml;ant xslt_m xslt_d

This is an example output:

    Buildfile: build.xml
    xslt_m:
    [echo] 05:55:07:267 - start
    [xslt] Processing /home/kosta/dev/xsl/scripting/data/in.xml to /home/kosta/dev/xsl/scripting/temp/out1.xml
    [xslt] Loading stylesheet /home/kosta/dev/xsl/scripting/style/muenchian.xsl
    [echo] 05:55:10:910 - stop
    xslt_d:
    [echo] 05:55:10:914 - start
    [xslt] Processing /home/kosta/dev/xsl/scripting/data/in.xml to /home/kosta/dev/xsl/scripting/temp/out1.xml
    [xslt] Loading stylesheet /home/kosta/dev/xsl/scripting/style/distinct.xsl
    [echo] 05:55:13:507 - stop
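Turning these timestamps into durations is simple arithmetic, sketched here in Python 3. The helper names to_millis and elapsed are made up for this illustration, and the sketch assumes both timestamps fall within the same half-day, because the hh pattern is a 12-hour clock. The numbers describe this single example run only, not the averaged benchmark results.

```python
def to_millis(stamp):
    """Convert an 'hh:mm:ss:S' timestamp (as echoed by Ant) to milliseconds."""
    hours, minutes, seconds, millis = (int(part) for part in stamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def elapsed(start, stop):
    """Milliseconds between two timestamps taken within the same half-day."""
    return to_millis(stop) - to_millis(start)


if __name__ == "__main__":
    muenchian = elapsed("05:55:07:267", "05:55:10:910")
    distinct = elapsed("05:55:10:914", "05:55:13:507")
    print(muenchian, distinct)  # 3643 2593
    saving = 100.0 * (muenchian - distinct) / muenchian
    print(round(saving, 1))  # 28.8
```

For this run the Muenchian template took 3643 ms and the set:distinct() template 2593 ms, a saving of about 28.8% in a single measurement.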
Ant prints the start and stop times of the XSLT processing, so we can run our command with various numbers of loops and calculate how long it takes to process a given input file with each XSL template. Here are the results:

Although a 26.6% performance increase is not very impressive, it is clearly visible. With this approach we sacrifice portability (in theory only, because Xalan is a pure Java application and we can use it everywhere), but the gain in template clarity and performance makes me believe that Xalan extensions are at least worth considering. The resulting XML file looks like this:

    <rpt>
    <row>
    <sic>sic_0</sic><broker>brok_0</broker><key>a</key><company>company_102</company><v>40</v>
    </row>
    <row>
    <sic>sic_0</sic><broker>brok_0</broker><key>a</key><company>company_152</company><v>11</v>
    </row>
    <row>
    <sic>sic_0</sic><broker>brok_0</broker><key>a</key><company>company_202</company><v>2</v>
    </row>
    <row>
    <sic>sic_0</sic><broker>brok_0</broker><key>a</key><company>company_252</company><v>83</v>
    </row>
    <row>
    <sic>sic_0</sic><broker>brok_0</broker><key>a</key><company>company_2</company><v>41</v>
    </row>
    ........

Now I want to create an HTML page from this file that displays only the unique values within the groups. First, let's try standard XSL and use preceding-sibling to determine whether the value of a group tag is the same as in the preceding row. My second template looks like this:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:set="http://exslt.org/sets"
        exclude-result-prefixes = "set"
        >

      <xsl:template match="/">
        <html>
          <body>
            <h1>Use sibling</h1>
            <table border="1">
              <xsl:apply-templates select="/rpt"/>
              <xsl:text>
              </xsl:text>
            </table>
          </body>
        </html>
      </xsl:template>

      <xsl:template match="row">
        <tr>
          <td>
            <!-- true when this row starts a new sic group -->
            <xsl:variable name="sic" select="not( sic = preceding-sibling::row[1]/sic)"/>
            <xsl:if test="$sic">
              <xsl:value-of select="sic"/>
            </xsl:if>
            <xsl:if test="not( $sic )">
              <xsl:text disable-output-escaping="yes"><![CDATA[ ]]></xsl:text>
            </xsl:if>
          </td>

          <td>
            <!-- true when this row starts a new broker group -->
            <xsl:variable name="broker" select="not( broker = preceding-sibling::row[1]/broker)"/>
            <xsl:if test="$broker">
              <xsl:value-of select="broker"/>
            </xsl:if>
            <xsl:if test="not( $broker )">
              <xsl:text disable-output-escaping="yes"><![CDATA[ ]]></xsl:text>
            </xsl:if>
          </td>
        </tr>
      </xsl:template>

      <xsl:template match="text()|@*">
      </xsl:template>

    </xsl:stylesheet>

The template above produces the result I expected, but it takes 3-4 minutes(!) to process an input file of moderate size. Such performance is not acceptable, so I decided to look for another Xalan extension. Xalan's ability to extend the XSLT processor with scripts looked especially promising, and I rushed to try it. Xalan uses the Bean Scripting Framework (BSF, http://jakarta.apache.org/bsf/ ) to enable XSL scripting. It allows us to use any BSF-supported language within an XSL template.
Currently BSF supports the following languages: Javascript, Python (using either Jython or JPython), Tcl, NetRexx, and XSLT Stylesheets (as a component of the Apache XML project's Xalan and Xerces). In addition, the following languages are supported with their own BSF engines: Java (using BeanShell, from the BeanShell project), JRuby, and JudoScript.

I decided to use Jython because it is a powerful language in its own right and integrates nicely with Java. I was not disappointed. Here is my template with the Jython scripted extension:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xalan="http://xml.apache.org/xalan"
        xmlns:jython="ext_jython"
        exclude-result-prefixes="jython">

      <xsl:output method="html"/>

      <xalan:component prefix="jython" functions="incCounter getCounter setHashVal getHashVal">
        <xalan:script lang="jpython">

    # hash remembers the last value seen for each group tag
    hashCounter = { 'sic': 2,'broker':1 }
    hash = { 'sic': 1, 'broker':2 }

    def incCounter( k ):
        global hashCounter
        hashCounter[ k ] = hashCounter[ k ] + 1

    def getCounter( k ):
        return hashCounter[ k ]

    def setHashVal( k, v ):
        global hash
        hash[ k ] = v

    def getHashVal( k ):
        return hash[ k ]

        </xalan:script>
      </xalan:component>

      <xsl:template match="/">
        <h1>Use Jython</h1>
        <table border="1">

          <xsl:for-each select="/rpt/row">
            <tr>

              <!-- decide whether to display sic before the stored value is updated -->
              <xsl:variable name="displaySIC">
                <xsl:if test="jython:getHashVal( 'sic' ) = sic">false</xsl:if>
                <xsl:if test="not( jython:getHashVal( 'sic' ) = sic)">true</xsl:if>
              </xsl:variable>

              <xsl:if test="not( jython:getHashVal( 'sic' ) = sic)">
                <xsl:value-of select="jython:setHashVal( 'sic', sic )"/>
              </xsl:if>

              <xsl:variable name="displayBROKER">
                <xsl:if test="jython:getHashVal( 'broker' ) = broker">false</xsl:if>
                <xsl:if test="not( jython:getHashVal( 'broker' ) = broker)">true</xsl:if>
              </xsl:variable>

              <xsl:if test="not( jython:getHashVal( 'broker' ) = broker)">
                <xsl:value-of select="jython:setHashVal( 'broker', broker )"/>
              </xsl:if>

              <td>
                <xsl:if test="$displaySIC = 'true'"><xsl:value-of select="sic"/></xsl:if>
                <xsl:if test="$displaySIC = 'false'">
                  <xsl:text disable-output-escaping="yes"><![CDATA[ ]]></xsl:text>
                </xsl:if>
              </td>

              <td>
                <xsl:if test="$displayBROKER = 'true'">
                  <xsl:value-of select="broker"/>
                </xsl:if>
                <xsl:if test="$displayBROKER = 'false'">
                  <xsl:text disable-output-escaping="yes"><![CDATA[ ]]></xsl:text>
                </xsl:if>
              </td>
            </tr>
          </xsl:for-each>
        </table>
      </xsl:template>

    </xsl:stylesheet>

Now it is time to compare the templates' performance:

    jython data/generateInFile.py 2 > data/in.xml;ant xslt_d xslt_sibl xslt_jython
Wow! Roughly 4 seconds with the Jython script against 184 seconds is a very impressive result. Let's look at what 6 loops mean in more familiar terms: a file size of 1075844 bytes and 3000 “row” nodes. Such a file size is not uncommon, so using scripted extensions with Xalan brings a significant performance gain to real-world use cases.
So, why does Jython outperform standard XSL? The “preceding-sibling” axis causes the XSL engine to perform a query that gets more and more expensive as the number of nodes grows. The Jython code does not have this problem: it uses a dictionary (map) to store the previous values, and therefore performs better and scales very well. In the beginning I promised 50 times better performance but demonstrated only about 46 times (184 seconds against roughly 4). As you can see, however, the performance difference grows with file size; had I used a bigger file, the difference could have reached not only 50 times but an unbelievable 100 times and more.
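The dictionary idea can be shown outside XSLT as well. The following Python 3 sketch mirrors what the scripted template does, under the assumption that rows are already available as plain dictionaries; the function name suppress_repeats and the sample rows are made up for this illustration. Each row costs one dictionary lookup per group tag, no matter how many rows came before it.

```python
def suppress_repeats(rows, group_keys=("sic", "broker")):
    """Blank out a group value when it equals the value in the previous row."""
    last_seen = {}  # group tag -> value seen in the most recent row
    table = []
    for row in rows:
        cells = []
        for key in group_keys:
            value = row[key]
            if last_seen.get(key) == value:
                cells.append("")  # same as the previous row: show nothing
            else:
                cells.append(value)  # new value: show it and remember it
                last_seen[key] = value
        table.append(cells)
    return table


if __name__ == "__main__":
    sample = [
        {"sic": "sic_0", "broker": "brok_0"},
        {"sic": "sic_0", "broker": "brok_0"},
        {"sic": "sic_0", "broker": "brok_1"},
        {"sic": "sic_1", "broker": "brok_1"},
    ]
    for cells in suppress_repeats(sample):
        print(cells)
```

As in the template, each group tag is tracked independently, so a “broker” value that stays the same across a change of “sic” is still suppressed.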
Conclusion: standards are good and provide a solid basis for development, but vendor extensions should not be ignored and can make a lot of sense in certain circumstances.
Happy coding!
Testing environment:
RedHat 9 Linux computer (kernel 2.4.22)
Athlon XP 2000+ with 512 MB of RAM
Sun JDK j2sdk1.4.2_01
To make this solution work, jython.jar from Jython-2.1 and bsf.jar from Xalan 2.5.1 were added to the $JAVA_HOME/jre/lib/endorsed directory.
© 2001 - 2006 Konstantin Ignatyev