In real life, you are going to have little need to echo an XML file with a
SAX parser. Usually, you'll want to process the data in some way in order to
do something useful with it. (If you want to echo it, it's easier to build a
DOM tree and use its built-in printing functions.)
But echoing an XML structure is a great way to see the SAX parser in action.
In this exercise, you'll echo SAX parser events to System.out.
Consider this the "Hello World" version of an XML-processing program. It shows you how to use the SAX parser to get at the data, and then echoes it to show you what you've got.
Note: The code discussed in this section is in Echo01.java. The file it operates on is slideSample01.xml.
Start by creating a file named Echo.java and enter the skeleton for the application:

    public class Echo extends HandlerBase
    {
        public static void main (String argv[])
        {
        }
    }

This class extends HandlerBase, which implements all of the interfaces we discussed in An Overview of the Java XML APIs. That lets us override the methods we care about and default the rest.
Since we're going to run it standalone, we need a main method. And we need command-line arguments so we can tell the app which file to echo.
Next, add the import statements for the classes the app will use:
    import java.io.*;
    import org.xml.sax.*;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;

    public class Echo extends HandlerBase
    {
        ...

The classes in java.io, of course, are needed to do output. The org.xml.sax package defines all the interfaces we use for the SAX parser. The SAXParserFactory class creates the instance we use. It throws a ParserConfigurationException if it is unable to produce a parser that matches the specified configuration of options. (You'll see more about the configuration options later.) Finally, the SAXParser is what the factory returns for parsing.
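As a rough illustration of how the factory, the parser, and the configuration exception relate, here is a minimal sketch. The class name FactorySketch and the helper newParser are hypothetical names introduced only for this example; the JAXP calls themselves (newInstance, setValidating, newSAXParser, isValidating) are standard.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

class FactorySketch {
    // Hypothetical helper: ask the factory for a parser with the
    // requested validation setting.
    static SAXParser newParser(boolean validating)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating);
        return factory.newSAXParser();
    }

    public static void main(String[] args) {
        try {
            SAXParser parser = newParser(false);
            System.out.println("validating: " + parser.isValidating());
        } catch (ParserConfigurationException e) {
            // The factory could not satisfy the requested configuration.
            System.err.println("No matching parser: " + e);
        } catch (SAXException e) {
            System.err.println("Parser could not be initialized: " + e);
        }
    }
}
```

The configuration is set on the factory, not on the parser, so the exception surfaces at the point where the parser is requested.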
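The first order of business is to check the command-line argument and set up the output stream. Add the code below, including the declaration of the static out field that follows the main method:

    public static void main (String argv [])
    {
        if (argv.length != 1) {
            System.err.println ("Usage: cmd filename");
            System.exit (1);
        }
        try {
            // Set up output stream
            out = new OutputStreamWriter (System.out, "UTF8");
        } catch (Throwable t) {
            t.printStackTrace ();
        }
        System.exit (0);
    }

    static private Writer out;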
When we create the output stream writer, we are selecting the UTF-8 character encoding. We could also have chosen US-ASCII, or UTF-16, which the Java platform also supports. For more information on these character sets, see Java's Encoding Schemes.
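To make the effect of the encoding choice concrete, the following sketch writes the same text through an OutputStreamWriter in each of the three encodings and counts the resulting bytes. The class EncodingSketch and the method encodedLength are hypothetical names for this illustration, and the sample text is an assumption chosen because it contains one non-ASCII character.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class EncodingSketch {
    // Returns how many bytes the text occupies when written through an
    // OutputStreamWriter that uses the named encoding.
    static int encodedLength(String text, String encoding) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, encoding);
        out.write(text);
        out.flush();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // four characters, one outside ASCII
        for (String enc : new String[] {"US-ASCII", "UTF8", "UTF-16"}) {
            System.out.println(enc + ": " + encodedLength(text, enc) + " bytes");
        }
    }
}
```

US-ASCII cannot represent the accented character, so the writer substitutes a replacement character instead of failing. That is one reason UTF-8 is the safer choice for echoing arbitrary XML.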
Now (at last) you're ready to set up the parser. Add the factory creation and the parsing lines shown below to set it up and get it started:

    public static void main (String argv [])
    {
        if (argv.length != 1) {
            System.err.println ("Usage: cmd filename");
            System.exit (1);
        }

        // Use the default (non-validating) parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // Set up output stream
            out = new OutputStreamWriter (System.out, "UTF8");

            // Parse the input
            SAXParser saxParser = factory.newSAXParser();
            saxParser.parse( new File(argv [0]), new Echo() );
        } catch (Throwable t) {
            t.printStackTrace ();
        }
        System.exit (0);
    }

With these lines of code, you created a SAXParserFactory instance, as determined by the setting of the javax.xml.parsers.SAXParserFactory system property. You then got a parser from the factory and gave the parser an instance of this class to handle the parsing events, telling it which input file to process.
Note: The javax.xml.parsers.SAXParser class is a wrapper that defines a number of convenience methods. It wraps the (somewhat less friendly) org.xml.sax.Parser object. If needed, you can obtain that parser using the SAXParser's getParser() method.
For now, you are simply catching any exception that the parser might throw. You'll learn more about error processing in a later section of the tutorial, Handling Errors with the Nonvalidating Parser.
Note: The parse method that operates on File objects is a convenience method. Under the covers, it creates an org.xml.sax.InputSource object for the SAX parser to operate on. To do that, it uses a static method in the com.sun.xml.parser.Resolver class to create an InputSource from a java.io.File object. You could do that, too, but the convenience method makes things easier.
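For comparison, here is a minimal sketch of handing the parser an InputSource directly. It does not use the Resolver class mentioned in the note; instead it wraps an in-memory string in a StringReader, which is an assumption made to keep the example self-contained. The class name InputSourceSketch and the method countElements are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

@SuppressWarnings("deprecation") // HandlerBase and AttributeList are SAX 1 types
class InputSourceSketch extends HandlerBase {
    int elementCount = 0;

    // Override only the event we care about; HandlerBase defaults the rest.
    @Override
    public void startElement(String name, AttributeList attrs) {
        elementCount++;
    }

    static int countElements(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        InputSourceSketch handler = new InputSourceSketch();
        // Build the InputSource ourselves instead of passing a File.
        InputSource source = new InputSource(new StringReader(xml));
        saxParser.parse(source, handler);
        return handler.elementCount;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<slideshow><slide/><slide/></slideshow>"));
    }
}
```

An InputSource can wrap a byte stream, a character stream, or a system identifier, which is why the File-based convenience method can be layered on top of it.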
The most important interface for our current purposes is the DocumentHandler interface. That interface requires a number of methods that the SAX parser invokes in response to different parsing events. For now, we are concerned with only five of them: startDocument, endDocument, startElement, endElement, and characters. Enter the code shown below, after the declaration of out, to set up methods to handle those events:

    ...
    static private Writer out;

    public void startDocument ()
    throws SAXException
    {
    }

    public void endDocument ()
    throws SAXException
    {
    }

    public void startElement (String name, AttributeList attrs)
    throws SAXException
    {
    }

    public void endElement (String name)
    throws SAXException
    {
    }

    public void characters (char buf [], int offset, int len)
    throws SAXException
    {
    }
    ...

Each of these methods is declared by the interface as throwing a SAXException. An exception thrown here is sent back to the parser, which sends it on to the code that invoked the parser. In the current program, that means it winds up back at the Throwable exception handler at the bottom of the main method.
When a start tag or end tag is encountered, the name of the tag is passed as a String to the startElement or endElement method, as appropriate. When a start tag is encountered, any attributes it defines are also passed in an AttributeList. Characters found within the element are passed as an array of characters, along with the number of characters (len) and an offset into the array that points to the first character.
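The offset and length matter because the array the parser passes is typically a larger internal buffer, not a copy of just the text. The sketch below uses a stand-in buffer (an assumption for illustration, not real parser output) to show how the range is extracted. CharactersSketch and slice are hypothetical names.

```java
class CharactersSketch {
    // Copies exactly the reported range out of the buffer.
    static String slice(char[] buf, int offset, int len) {
        return new String(buf, offset, len);
    }

    public static void main(String[] args) {
        // Stand-in for a parser buffer: the text of interest sits in the middle.
        char[] buf = "<title>Wake up</title>".toCharArray();
        System.out.println(slice(buf, 7, 7));
    }
}
```

Reading the array from index 0, or to its end, would pick up characters that do not belong to the current event.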
The DocumentHandler methods throw SAXExceptions but not IOExceptions, which can occur while writing. The SAXException can wrap another exception, though, so it makes sense to do the output in a method that takes care of the exception-handling details. Add the code shown below, after the characters method, to define an emit method that does that:

    public void characters (char buf [], int offset, int len)
    throws SAXException
    {
    }

    private void emit (String s)
    throws SAXException
    {
        try {
            out.write (s);
            out.flush ();
        } catch (IOException e) {
            throw new SAXException ("I/O error", e);
        }
    }
    ...

When emit is called, any I/O error is wrapped in SAXException along with a message that identifies it. That exception is then thrown back to the SAX parser. You'll learn more about SAX exceptions later on. For now, keep in mind that emit is a small method that handles the string output. (You'll see it called a lot in the code ahead.)
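To see the wrapping at work, the following sketch forces an I/O failure and then recovers the original exception from the SAXException with getException(). The FailingWriter class, the EmitSketch class, and the wrappedCause helper are hypothetical, and this emit takes the Writer as a parameter (unlike the tutorial's version) so that the example stands alone.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.SAXException;

class EmitSketch {
    // A Writer that always fails, used only to trigger the error path.
    static class FailingWriter extends Writer {
        @Override public void write(char[] cbuf, int off, int len) throws IOException {
            throw new IOException("simulated write failure");
        }
        @Override public void flush() {}
        @Override public void close() {}
    }

    // Same shape as the tutorial's emit, with the Writer passed in.
    static void emit(Writer out, String s) throws SAXException {
        try {
            out.write(s);
            out.flush();
        } catch (IOException e) {
            throw new SAXException("I/O error", e);
        }
    }

    // Returns the exception wrapped inside the SAXException,
    // or null if emit succeeded.
    static Exception wrappedCause(Writer out) {
        try {
            emit(out, "<slide>");
            return null;
        } catch (SAXException e) {
            return e.getException();
        }
    }

    public static void main(String[] args) {
        System.out.println(wrappedCause(new FailingWriter()));
    }
}
```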
There is one last bit of infrastructure we need before doing some real processing. Add the code shown below, after the emit method, to define an nl method that writes the kind of line-ending character used by the current system:

    private void emit (String s)
        ...
    }

    private void nl ()
    throws SAXException
    {
        String lineEnd = System.getProperty("line.separator");
        try {
            out.write (lineEnd);
        } catch (IOException e) {
            throw new SAXException ("I/O error", e);
        }
    }

Note: Although it seems like a bit of a nuisance, you will be invoking nl() many times in the code ahead. Defining it now will simplify the code later on. It also provides a place to indent the output when we get to that section of the tutorial.
Finally, let's write some code that actually processes the DocumentHandler events we added methods for.
Fill in the bodies of startDocument and endDocument as shown below to handle the start-document and end-document events:

    public void startDocument ()
    throws SAXException
    {
        emit ("<?xml version='1.0' encoding='UTF-8'?>");
        nl();
    }

    public void endDocument ()
    throws SAXException
    {
        try {
            nl();
            out.flush ();
        } catch (IOException e) {
            throw new SAXException ("I/O error", e);
        }
    }

Here, you are echoing an XML declaration when the parser encounters the start of the document. Since you set up the OutputStreamWriter using the UTF-8 encoding, you include that specification as part of the declaration.
Note: The IO classes in the platform releases this tutorial targets don't understand the hyphenated encoding names, so you specified "UTF8" rather than "UTF-8". (Later releases of the Java platform accept both names.)
At the end of the document, you simply put out a final newline and flush the output stream. Not much going on there. Now for the interesting stuff. Fill in the bodies of startElement and endElement as shown below to process the start-element and end-element events:

    public void startElement (String name, AttributeList attrs)
    throws SAXException
    {
        emit ("<"+name);
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength (); i++) {
                emit (" ");
                emit (attrs.getName(i)+"=\""+attrs.getValue (i)+"\"");
            }
        }
        emit (">");
    }

    public void endElement (String name)
    throws SAXException
    {
        emit ("</"+name+">");
    }

With this code, you echoed the element tags, including any attributes defined in the start tag. To finish this version of the program, fill in the body of characters as shown below to echo the characters the parser sees:

    public void characters (char buf [], int offset, int len)
    throws SAXException
    {
        String s = new String(buf, offset, len);
        emit (s);
    }
Congratulations! You've just written a SAX parser application. The next step is to compile and run it.
Note: To be strictly accurate, the character handler should scan the buffer for ampersand characters ('&') and left-angle bracket characters ('<') and replace them with the strings "&amp;" or "&lt;", as appropriate. You'll find out more about that kind of processing when we discuss entity references in Substituting and Inserting Text.
To compile the program you created, you'll execute the appropriate command for your system (or use one of the command scripts mentioned below):

Windows:

    javac -classpath %XML_HOME%\jaxp.jar;%XML_HOME%\parser.jar Echo.java

Unix:

    javac -classpath ${XML_HOME}/jaxp.jar:${XML_HOME}/parser.jar Echo.java

where:

- XML_HOME is where you installed the JAXP and Project X libraries.
- jaxp.jar contains the JAXP-specific APIs
- parser.jar contains the interfaces and classes that make up the SAX and DOM APIs, as well as Sun's reference implementation, Project X.

Note: If you are using version 1.1 of the platform, you also need to add %JAVA_HOME%\lib\classes.zip to both the compile script and run script (below), where JAVA_HOME is the location of the Java platform.
To run the program, you'll once again execute the appropriate command for your system (or use one of the command scripts mentioned below):

Windows:

    java -classpath .;%XML_HOME%\jaxp.jar;%XML_HOME%\parser.jar Echo slideSample.xml

Unix:

    java -classpath .:${XML_HOME}/jaxp.jar:${XML_HOME}/parser.jar Echo slideSample.xml

To make life easier, here are some command scripts you can use to compile and run your apps as you work through this tutorial.

                       Unix                            Windows
    Scripts            build, run                      build.bat, run.bat
    Netscape           Click, choose File-->Save As    Right click, choose Save Link As.
    Internet Explorer  -/-                             Right click, choose Save Target As.
The program's output is stored in Echo01-01.log. Here is part of it, showing some of its weird-looking spacing:

    ...
    <slideshow title="Sample Slide Show" date="Date of publication" author="Yours Truly">


        <slide type="all">
          <title>Wake up to WonderWidgets!</title>
        </slide>
    ...

Looking at this output, a number of questions arise. Namely, where is the excess vertical whitespace coming from? And why is it that the elements are indented properly, when the code isn't doing it? We'll answer those questions in a moment. First, though, there are a few points to note about the output:

- The comment <!-- A SAMPLE set of slides --> does not appear in the listing. Comments are ignored by definition, unless you implement a LexicalEventListener instead of a DocumentHandler. You'll see more about that later on in this tutorial.
- Element attributes are listed all together on a single line. If your window isn't really wide, you won't see them all.
- The single-tag empty element you defined (<item/>) is treated exactly the same as a two-tag empty element (<item></item>). It is, for all intents and purposes, identical. (It's just easier to type and consumes less space.)
This version of the echo program might be useful for displaying an XML file, but it's not telling you much about what's going on in the parser. The next step is to modify the program so that you see where the spaces and vertical lines are coming from.
Note: The code discussed in this section is in Echo02.java. The output it produces is contained in Echo02-01.log.
Make the changes shown below to identify the events as they occur. The listing shows the methods as they should look afterward: the trailing nl() in startDocument and the two emit calls that wrote each attribute inline are gone, replaced by the labeled output.

    public void startDocument ()
    throws SAXException
    {
        nl();
        nl();
        emit ("START DOCUMENT");
        nl();
        emit ("<?xml version='1.0' encoding='UTF-8'?>");
    }

    public void endDocument ()
    throws SAXException
    {
        nl();
        nl();
        emit ("END DOCUMENT");
        try {
            ...
    }

    public void startElement (String name, AttributeList attrs)
    throws SAXException
    {
        nl(); emit ("ELEMENT: ");
        emit ("<"+name);
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength (); i++) {
                // One attribute per line
                nl();
                emit(" ATTR: ");
                emit (attrs.getName (i));
                emit ("\t\"");
                emit (attrs.getValue (i));
                emit ("\"");
            }
        }
        // Guard against a null list before asking for its length
        if (attrs != null && attrs.getLength() > 0) nl();
        emit (">");
    }

    public void endElement (String name)
    throws SAXException
    {
        nl(); emit ("END_ELM: ");
        emit ("</"+name+">");
    }

    public void characters (char buf [], int offset, int len)
    throws SAXException
    {
        nl(); emit ("CHARS: |");
        String s = new String(buf, offset, len);
        emit (s);
        emit ("|");
    }
Compile and run this version of the program to produce a more informative output listing. The attributes are now shown one per line, which is nice. But, more importantly, output lines like this one:

    CHARS: | |

show that the characters method is responsible for echoing both the spaces that create the indentation and the multiple newlines that separate the attributes.
Note: The XML specification requires all input line separators to be normalized to a single newline. The newline character is specified as \n in Java, C, and Unix systems, but goes by the alias "linefeed" in Windows systems.
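The normalization described in the note can be observed directly. The sketch below parses a small in-memory document whose text contains a carriage-return/linefeed pair and a lone carriage return, then collects what the characters method receives. NewlineSketch and parsedText are hypothetical names, and the sample document is an assumption made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

@SuppressWarnings("deprecation") // HandlerBase is a SAX 1 type
class NewlineSketch extends HandlerBase {
    private final StringBuilder text = new StringBuilder();

    // Text may arrive in several calls, so accumulate it.
    @Override
    public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len);
    }

    static String parsedText(String xml) throws Exception {
        NewlineSketch handler = new NewlineSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = parsedText("<a>one\r\ntwo\rthree</a>");
        // Report whether any carriage return survived parsing.
        System.out.println("contains CR: " + (text.indexOf('\r') >= 0));
    }
}
```

Because the parser hands the application only \n, the nl method is what restores the platform's own line separator on output.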
To make the output more readable, modify the program so that it only outputs characters containing something other than whitespace.
Note: The code discussed in this section is in Echo03.java.
Make the changes shown below to suppress output of characters that are all whitespace. The listing shows the method as it should look afterward: the "|" delimiters are gone, and the unconditional emit (s) is replaced by a test.

    public void characters (char buf [], int offset, int len)
    throws SAXException
    {
        nl(); emit ("CHARS: ");
        String s = new String(buf, offset, len);
        if (!s.trim().equals("")) emit (s);
    }
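The whitespace test can be exercised on its own. This is a minimal sketch with hypothetical names (WhitespaceSketch, isAllWhitespace); it relies on String.trim() removing every leading and trailing character at or below U+0020, which covers the space, tab, carriage return, and linefeed characters that XML treats as whitespace.

```java
class WhitespaceSketch {
    // True when the reported range holds nothing but whitespace.
    static boolean isAllWhitespace(char[] buf, int offset, int len) {
        return new String(buf, offset, len).trim().equals("");
    }

    public static void main(String[] args) {
        char[] indent = "\n    ".toCharArray();
        char[] text = "  Wake up  ".toCharArray();
        System.out.println(isAllWhitespace(indent, 0, indent.length));
        System.out.println(isAllWhitespace(text, 0, text.length));
    }
}
```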
If you run the program now, you will see that you have eliminated the indentation as well, because the indent space is part of the whitespace that precedes the start of an element. Add the indentString and indentLevel fields, the increment and decrement, and the loop in nl shown below to manage the indentation:

    static private Writer out;
    private String indentString = "    "; // Amount to indent
    private int indentLevel = 0;
    ...
    public void startElement (String name, AttributeList attrs)
    throws SAXException
    {
        indentLevel++;
        nl(); emit ("ELEMENT: ");
        ...
    }

    public void endElement (String name)
    throws SAXException
    {
        nl();
        emit ("END_ELM: ");
        emit ("</"+name+">");
        indentLevel--;
    }
    ...
    private void nl ()
    throws SAXException
    {
        ...
        try {
            out.write (lineEnd);
            for (int i=0; i < indentLevel; i++) out.write(indentString);
        } catch (IOException e) {
        ...
    }

This code sets up an indent string, keeps track of the current indent level, and outputs the indent string whenever the nl method is called. If you set the indent string to "", the output will be un-indented. (Try it. You'll see why it's worth the work to add the indentation.)
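The interplay between the indent level and nl can be tried in isolation. The sketch below is a simplified stand-in, not the tutorial's Echo class: it writes to a StringBuilder, uses "\n" instead of the system line separator, and assumes a four-space indent string. IndentSketch and its methods are hypothetical names.

```java
class IndentSketch {
    private final StringBuilder out = new StringBuilder();
    private final String indentString = "    "; // assumed indent width
    private int indentLevel = 0;

    // Start a new line, then indent to the current level.
    private void nl() {
        out.append('\n');
        for (int i = 0; i < indentLevel; i++) out.append(indentString);
    }

    void start(String name) {
        indentLevel++;              // go one level deeper before writing
        nl();
        out.append("ELEMENT: <").append(name).append('>');
    }

    void end(String name) {
        nl();
        out.append("END_ELM: </").append(name).append('>');
        indentLevel--;              // come back out after writing
    }

    static String demo() {
        IndentSketch s = new IndentSketch();
        s.start("slideshow");
        s.start("slide");
        s.end("slide");
        s.end("slideshow");
        return s.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Incrementing before the start tag and decrementing after the end tag keeps each element's start and end lines at the same depth.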
You'll be happy to know that you have reached the end of the "mechanical" code you have to add to the Echo program. From here on, you'll be doing things that give you more insight into how the parser works. The steps you've taken so far, though, have given you a lot of insight into how the parser sees the XML data it processes. They have also given you a helpful debugging tool you can use to see what the parser sees.
The complete output for this version of the program is contained in Echo03-01.log. Part of that output is shown here:

    ELEMENT: <slideshow
    ...
    CHARS:
    CHARS:
        ELEMENT: <slide
        ...
        END_ELM: </slide>
    CHARS:
    CHARS:
Note that the characters method was invoked twice in a row. Inspecting the source file slideSample01.xml shows that there is a comment before the first slide. The first call to characters comes before that comment. The second call comes after. (Later on, you'll see how to be notified when the parser encounters a comment, although in most cases you won't need such notifications.)
Note, too, that the characters method is invoked after the first slide element, as well as before. When you are thinking in terms of hierarchically structured data, that seems odd. After all, you intended for the slideshow element to contain slide elements, not text. Later on, you'll see how to restrict the slideshow element using a DTD. When you do that, the characters method will no longer be invoked.
In the absence of a DTD, though, the parser must assume that any element it sees contains text like that in the first item element of the overview slide:

    <item>Why <em>WonderWidgets</em> are great</item>

Here, the hierarchical structure looks like this:

    ELEMENT: <item>
    CHARS:   Why
        ELEMENT: <em>
        CHARS:   WonderWidgets
        END_ELM: </em>
    CHARS:   are great
    END_ELM: </item>
In this example, it's clear that there are characters intermixed with the hierarchical structure of the elements. The fact that text can surround elements (or be prevented from doing so with a DTD or schema) helps to explain why you sometimes hear talk about "XML data" and other times hear about "XML documents". XML comfortably handles both structured data and text documents that include markup. The only difference between the two is whether or not text is allowed between the elements.
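The event sequence for that item element can be reproduced with a small handler that records each event as a string. This is a sketch under stated assumptions: MixedContentSketch and trace are hypothetical names, the element is parsed from an in-memory string instead of the sample file, and a parser is free to deliver a run of text in more than one characters call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

@SuppressWarnings("deprecation") // HandlerBase and AttributeList are SAX 1 types
class MixedContentSketch extends HandlerBase {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList attrs) {
        events.add("ELEMENT: <" + name + ">");
    }

    @Override
    public void endElement(String name) {
        events.add("END_ELM: </" + name + ">");
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        events.add("CHARS: " + new String(buf, offset, len));
    }

    // Parses the given XML text and returns the recorded events in order.
    static List<String> trace(String xml) throws Exception {
        MixedContentSketch handler = new MixedContentSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<item>Why <em>WonderWidgets</em> are great</item>")) {
            System.out.println(event);
        }
    }
}
```

The text before and after the em element arrives as separate characters events belonging to item, which is exactly the mixed content that a DTD can allow or forbid.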
Note: In an upcoming section of this tutorial, you will work with the ignorableWhitespace method in the DocumentHandler interface. This method can only be invoked when a DTD is present. If a DTD specifies that slideshow does not contain text, then all of the whitespace surrounding the slide elements is by definition ignorable. On the other hand, if slideshow can contain text (which must be assumed to be true in the absence of a DTD), then the parser must assume that spaces and lines it sees between the slide elements are significant parts of the document.