"Extending URL handling in Java to support new locations for data."
So, for example, the configuration for the common Java logging suite log4j [log4j] can be provided as easily from a local file on the hard disk as from an Internet web site without any changes being required to the application code.
As another example, an overnight batch process might take data via ftp from a remote server, but be more easily tested by running against a sample disk file containing a known dataset.
The standard method of describing such abstract locations for data is through a URL (Uniform Resource Locator), the most common example being web site addresses such as http://www.accu.org.
Java comes with built-in support for URLs, most obviously through the java.net.URL class.
The following simple program uses the java.net.URL class to read data in a location-agnostic manner.
package howzatt;

public class Example {
    public static void main( String[] args ) {
        for ( String uri : args ) {
            read( uri );
        }
    }

    // Copy the data found at the given location to standard output.
    public static void read( String uri ) {
        try {
            java.net.URL url = new java.net.URL( uri );
            // try-with-resources closes the stream even if a read fails
            try ( java.io.InputStream is = url.openStream() ) {
                int ch;
                while ( ( ch = is.read() ) != -1 ) {
                    System.out.print( (char)ch );
                }
            }
        }
        catch ( java.io.IOException ex ) {
            System.err.println( "Error reading " +
                uri + ": " + ex );
        }
    }
}
This program can be run against a local file using an argument like "file:example.txt", a remote file using an argument like "file://server/path/file", or a web site using an argument like "http://www.accu.org/" (subject to any restrictions imposed by network firewalls or proxies).
urn:isbn:978-0-470-84674-2, which is the ISBN of a book. In practice, however, the distinction between the two terms is often blurred. There is a fuller discussion of this issue on the W3C site [W3C - URI].
The identifier before the first colon in a URI defines its 'scheme', and schemes such as http and ftp are globally recognised as standard. The full list of official schemes is held by the Internet Assigned Numbers Authority [IANA].
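As a small illustration of the scheme being the part before the first colon, the sketch below asks java.net.URI for the scheme of a few of the identifiers used in this article. The class name SchemeDemo and the method schemeOf are simply names chosen for the example.

```java
import java.net.URI;

// Hypothetical demo class: prints the scheme of each sample URI.
final class SchemeDemo {
    // Returns the scheme, i.e. the text before the first colon.
    static String schemeOf(String text) {
        return URI.create(text).getScheme();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.accu.org/",
            "file:example.txt",
            "urn:isbn:978-0-470-84674-2",
        };
        for (String sample : samples) {
            System.out.println(schemeOf(sample) + " <- " + sample);
        }
    }
}
```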
Java has support for URIs and URLs through the java.net.URI and java.net.URL classes. Additionally, Java is supplied with built-in support for a number of different schemes; as a minimum, support for the following is guaranteed: http, https, ftp, file, and jar.
Note that Java refers to these schemes as protocols although, for example, processing an http
URL involves two protocols - DNS to resolve the host name and HTTP to access the data.
Although the five standard protocols are often adequate, there are sometimes cases where access is required to other data sources. Often the location of this data can be described using the URI syntax, but it may not use an "official" URI scheme.
For example, data might be obtainable using scp (secure file copy) and the obvious URI of scp://user@host/path/file could be used to represent the location of a file on some remote host.
Or again, data may be supplied in a zip file or some other compressed format and you want to be able to
access the data uncompressed from within your program.
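To see what "unsupported" means in practice, the sketch below checks whether java.net.URL can find a handler for a given address. The class and method names are illustrative, and the sketch assumes a JVM with no additional protocol handlers installed; in that case the scp address is rejected with a MalformedURLException even though it is perfectly acceptable URI syntax.

```java
import java.net.MalformedURLException;
import java.net.URI;

final class ProtocolCheck {
    // True if this JVM can find a protocol handler for the address.
    static boolean hasHandler(String address) {
        try {
            URI.create(address).toURL();
            return true;
        } catch (MalformedURLException ex) {
            // Raised when no handler is known for the protocol
            return false;
        }
    }

    public static void main(String[] args) {
        String scp = "scp://user@host/path/file";
        // Parsing as a URI needs no handler, so this part always works
        System.out.println("URI host:     " + URI.create(scp).getHost());
        System.out.println("http handler: " + hasHandler("http://www.accu.org/"));
        System.out.println("scp handler:  " + hasHandler(scp));
    }
}
```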
Fortunately Java allows us to supply our own protocol handlers to extend the set of supported schemes.
There are existing extensions to the Java protocol handlers, provided by various sites on the Internet and supporting various protocols; one such example is Hansa [Hansa]. If you need support for a well-known protocol you may be able to find a pre-written protocol handler.
However, there may be times when you want to implement a protocol handler yourself - whether for an unsupported official scheme or for a proprietary one.
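A new protocol handler plugs into the java.net.URL class. The process consists of three parts:
1. Write a class, derived from java.net.URLStreamHandler, that knows how to open a connection to URLs of the new scheme
2. Write a class, derived from java.net.URLConnection, to access data from these connections
3. Register the stream handler with the URL class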
The first two steps will obviously depend heavily on the specifics of the protocol being supported, and may involve such actions as opening network connections or invoking external programs. I'll illustrate the process with a very simple example that uses the "quote of the day" service to make the general principle clear without requiring too many protocol-specific details.
The final step involves plugging your new classes into the processing the URL class uses when it comes across a protocol for the first time. The URL class attempts to create an instance of the correct URLStreamHandler class in the following order:
1. If a factory has been registered with the URL class then the createURLStreamHandler method of the factory is called with the protocol name.
2. Otherwise the URL class reads the system property java.protocol.handler.pkgs, which is a "|" delimited list of packages. For each package it tries to load the class <package>.<protocol>.Handler, which, if present, must be the URLStreamHandler for the given protocol.
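The <package>.<protocol>.Handler naming convention can be sketched as follows. This only illustrates how candidate class names are formed from the property value; it is not the code used inside java.net.URL, and HandlerNames and candidates are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class HandlerNames {
    // Builds <package>.<protocol>.Handler for each package in a "|" delimited list.
    static List<String> candidates(String pkgs, String protocol) {
        List<String> names = new ArrayList<>();
        for (String pkg : pkgs.split("\\|")) {
            String trimmed = pkg.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed + "." + protocol + ".Handler");
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical property value naming two handler packages
        System.out.println(candidates("howzatt|com.example.handlers", "qotd"));
    }
}
```

With the property set to "howzatt", a qotd URL therefore leads to the class name howzatt.qotd.Handler.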
For stand-alone applications the easiest way to register your new protocol is to define the system property used by the URL class; so let's see how this might work.
C:> telnet localhost qotd
"We want a few mad people now. See where the sane ones have landed us!"
George Bernard Shaw (1856-1950)
Connection to host lost.
If this attempt fails, you might need to start the service (or connect to another machine that does offer the qotd
service). On Windows it is one of the "Simple TCP/IP Services".
In order to access this service from my example program at the start of the article I need a URL syntax, so I've picked
the simple format qotd://hostname. Since we are using an unofficial scheme there are several alternative ways of encoding the data as a URI.
Here is example code for a simple stream handler for the qotd protocol:
package howzatt.qotd;

public class Handler
    extends java.net.URLStreamHandler {
    // Called by java.net.URL whenever a qotd: URL is opened
    @Override
    protected java.net.URLConnection
    openConnection(java.net.URL u)
        throws java.io.IOException {
        return new QotdConnection( u );
    }
}
and then the actual connection handling code itself:
package howzatt.qotd;

public class QotdConnection
    extends java.net.URLConnection {
    // Well-known TCP port for the "quote of the day" service
    private static final int QOTD = 17;
    private java.net.Socket socket;

    public QotdConnection( java.net.URL u ) {
        super( u );
    }

    @Override
    public void connect()
        throws java.io.IOException {
        if ( connected )
            return; // already connected, nothing to do
        final String host = getURL().getHost();
        socket = new java.net.Socket( host, QOTD );
        connected = true;
    }

    @Override
    public java.io.InputStream getInputStream()
        throws java.io.IOException {
        if ( ! connected )
            connect();
        // Closing this stream also closes the socket
        return socket.getInputStream();
    }
}
Now if we compile these two additional classes, we can use the qotd protocol with the example program shown earlier like this:
java -Djava.protocol.handler.pkgs=howzatt howzatt.Example qotd://localhost
If all is well we get a quote displayed - we have transparently extended our simple application to acquire data from a different source.
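For experiments where setting a system property is inconvenient, java.net.URL also has a constructor that accepts a URLStreamHandler directly. The sketch below is not part of the original example: it uses a made-up "fixed" scheme whose connection returns a canned string, so that it runs without a qotd server, but it has the same two-class shape as the Handler and QotdConnection classes. All class names in it are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

final class FixedDemo {
    // Stand-in for QotdConnection: serves a canned line from memory.
    static final class FixedConnection extends URLConnection {
        FixedConnection(URL u) { super(u); }

        @Override
        public void connect() { connected = true; }

        @Override
        public InputStream getInputStream() {
            connect();
            String text = "quote for " + getURL().getHost() + "\n";
            return new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
        }
    }

    // Stand-in for Handler: opens connections for the made-up scheme.
    static final class FixedHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) {
            return new FixedConnection(u);
        }
    }

    // Reads everything the URL offers, passing the handler explicitly.
    static String read(String address) {
        try {
            URL url = new URL(null, address, new FixedHandler());
            StringBuilder sb = new StringBuilder();
            try (InputStream in = url.openStream()) {
                int ch;
                while ((ch = in.read()) != -1) {
                    sb.append((char) ch);
                }
            }
            return sb.toString();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.print(read("fixed://localhost"));
    }
}
```

The drawback is that only URLs constructed this way know about the handler; code that builds its own URLs from plain strings still depends on one of the registration mechanisms.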
The registration problem is harder because of two design issues.
URL class cannot be changed
As mentioned earlier, one way of registering your URLStreamHandler class with the URL class is to provide a factory object. Unfortunately this mechanism is somewhat inflexible; specifically, the setURLStreamHandlerFactory method can be called at most once in a given Java Virtual Machine.
This may be an acceptable restriction for a small Java application, but it becomes hard to manage when two different parts of the application, possibly written by unrelated teams, each wish to register a factory for their own protocol with the URL class.
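The "at most once" restriction is easy to demonstrate. In the sketch below (the names are illustrative) a do-nothing factory is installed and then installed again. The factory returns null for every protocol, which leaves the normal lookup in place, and the repeated call is rejected with a java.lang.Error.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

final class FactoryOnce {
    // Returns true if a second registration attempt is rejected.
    static boolean secondRegistrationRejected() {
        // Returning null defers every protocol to the default lookup
        URLStreamHandlerFactory deferToDefaults = protocol -> null;
        try {
            URL.setURLStreamHandlerFactory(deferToDefaults);
        } catch (Error alreadySet) {
            // Some other code in this JVM registered a factory first
        }
        try {
            URL.setURLStreamHandlerFactory(deferToDefaults);
            return false;
        } catch (Error expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second call rejected: " + secondRegistrationRejected());
    }
}
```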
However, even leaving this problem aside, the factory approach requires the application code to register the factory explicitly, which makes it hard to add new protocols to existing programs. Adding a protocol without changing the program is exactly what we did earlier with the example program, and it is one of the most powerful aspects of Java's protocol handler support.
On the other hand, using the protocol.Handler convention can be problematic because of the way Java class loaders work.
When a new protocol is detected by the URL class it tries to load the appropriate handler class, but it does so using the class loader that was used to load the URL class itself.
For a stand-alone application this does not usually present a problem, but where the Java code is running inside a web service or as an applet it is normal for user-supplied code to be loaded by a different class loader than the core Java classes.
In these cases, any protocol handler class supplied in the user code will not be found by the system class loader used to load the java.net.URL class.
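The difference between the loaders can be observed directly. The sketch below (LoaderDemo is an illustrative name) compares the class loader of java.net.URL with that of an application class. On typical JVMs the core classes report null, meaning the bootstrap loader, while application classes report some other loader.

```java
import java.net.URL;

final class LoaderDemo {
    // True if the given class was loaded by the same loader as java.net.URL.
    static boolean sharesLoaderWithUrl(Class<?> type) {
        return type.getClassLoader() == URL.class.getClassLoader();
    }

    public static void main(String[] args) {
        System.out.println("URL loader:        " + URL.class.getClassLoader());
        System.out.println("LoaderDemo loader: " + LoaderDemo.class.getClassLoader());
        System.out.println("String shares URL's loader:     " + sharesLoaderWithUrl(String.class));
        System.out.println("LoaderDemo shares URL's loader: " + sharesLoaderWithUrl(LoaderDemo.class));
    }
}
```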
In these cases it may also be less simple to configure the system property used by the URL class externally; instead, the System.setProperty method can be used at runtime to add additional packages. Note, however, that this approach might be barred by the security manager, and care must also be taken to ensure that any existing packages defined by this system property are retained.
For example:
public static void register() {
    // Handler lives in <pkg>.<protocol>, so strip the final component
    final String packageName = Handler.class.getPackage().getName();
    final String pkg = packageName.substring( 0, packageName.lastIndexOf( '.' ) );
    final String protocolPathProp = "java.protocol.handler.pkgs";

    String uriHandlers = System.getProperty( protocolPathProp, "" );
    // Compare whole entries; a substring search would confuse "howzatt" with "howzatt2"
    if ( ! java.util.Arrays.asList( uriHandlers.split( "\\|" ) ).contains( pkg ) ) {
        if ( uriHandlers.length() != 0 )
            uriHandlers += "|";
        uriHandlers += pkg;
        System.setProperty( protocolPathProp, uriHandlers );
    }
}
FileSystemManager class.
The weakness with such an approach is that it does not of itself support handling of additional protocols
when using existing code that uses the java.net.URL class internally to connect to a URL.
Another approach is to use the factory registration, but to provide a factory class that itself supports registration of multiple different stream handlers using different names.
This approach supports code using the java.net.URL class, but it does require a registration call for each protocol, and hence changes are needed to an application before it can make use of the new URLs. However, the approach gets around the problems discussed above with multiple class loaders, since the factory is loaded by the user code class loader rather than by the class loader for the URL class.
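A minimal sketch of such a factory is shown below. The class name RegistryFactory and its register method are inventions for this example, not an existing API; the only standard pieces are the URLStreamHandlerFactory interface and the convention that returning null lets the URL class fall back to its normal lookup.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RegistryFactory implements URLStreamHandlerFactory {
    // Protocol name (lower case) -> handler supplied by some component
    private final Map<String, URLStreamHandler> handlers = new ConcurrentHashMap<>();

    // Associates a handler with a protocol name such as "qotd".
    void register(String protocol, URLStreamHandler handler) {
        handlers.put(protocol.toLowerCase(Locale.ROOT), handler);
    }

    // Called by java.net.URL; null means "not mine, use the defaults".
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        return handlers.get(protocol.toLowerCase(Locale.ROOT));
    }
}
```

An application would create one RegistryFactory, pass it to URL.setURLStreamHandlerFactory once at start-up, and then let each component call register for its own protocol.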
URLConnection model. Security can be a particular problem here since the usual way of including a username/password into a URL uses plain text which is obviously rather insecure.
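The exposure is easy to see: anything placed in the user-information part of a URL can be read back by any code that holds the URL, and will appear wherever the URL is printed or logged. A short sketch, using a made-up host and made-up credentials:

```java
import java.net.URI;

final class UserInfoDemo {
    // Returns the user-information part of the address, or null if absent.
    static String userInfoOf(String address) {
        return URI.create(address).getUserInfo();
    }

    public static void main(String[] args) {
        // Hypothetical address with an embedded password
        System.out.println(userInfoOf("ftp://user:secret@host.example/path/file"));
    }
}
```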
There is a parallel with the way that Unix treats "everything like a file" - even access to system information. This common view of data means that simple tools may have wide applicability. The same principle applies with the use of URLs in Java - the abstraction can make programs able to process a wide range of data from a variety of sources without needing explicit coding.
Java provides a relatively simple mechanism to add new protocols to your applications and hence widen the range of locations for sourcing data.
There is a great deal of power in this approach; sadly the specific details of registering with the URL class are not very flexible, but in most cases there are techniques to work around the limitations.
| log4j | Apache log4j: http://logging.apache.org/log4j/ |
| RFC2396 | Uniform Resource Identifiers (URI): Generic Syntax: http://www.ietf.org/rfc/rfc2396.txt |
| W3C - URI | URIs, URLs, and URNs: Clarifications and Recommendations: http://www.w3.org/TR/uri-clarification/ |
| IANA | Uniform Resource Identifier (URI) Schemes: http://www.iana.org/assignments/uri-schemes.html |
| Hansa | Project Hansa: http://wiki.ops4j.org/dokuwiki/doku.php?id=hansa:hansa |
| VFS | Commons Virtual File System: http://commons.apache.org/vfs/ |