Offisor XML: an improvement?
Using Offisor is easy, but the real question is whether it is
beneficial. Answering that requires analyzing the XML format it
outputs and how easy or difficult that format is to work with.
To test this, I tried saving a Word doc as HTML (Word 2002, just
to test, as it was undocumented), parsing it with Offisor, and
then comparing the results. What I discovered was a significant
difference on a number of levels. Take a look at the following
files:
These four documents differ quite a bit. To start, the XML
version is pretty, readable, intuitive, and clean. The
transformation takes no time -- this is one lean application.
All that is going on behind the scenes for the HTML rendition is
to apply an eXtensible Stylesheet Language Transformations
(XSLT) transformation. I was impressed with the ease of doing
this.
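To make the XSLT step concrete, here is a minimal sketch using the standard
javax.xml.transform API that ships with Java. It is not Offisor's own code:
the element names (document, p), the stylesheet, and the class name
XsltSketch are assumptions made up for illustration, and the XML that
Offisor actually produces will look different.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {

    // Hypothetical stylesheet: maps <document>/<p> (assumed element names,
    // not Offisor's real schema) onto a bare HTML page.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/document\">"
      + "<html><body><xsl:apply-templates select=\"p\"/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match=\"p\">"
      + "<p><xsl:value-of select=\".\"/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the HTML result.
    public static String toHtml(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            toHtml("<document><p>Hello from Word</p></document>"));
    }
}
```

Swapping in a different stylesheet is all it would take to target another
output format, which is the appeal of keeping the content in XML.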
Of the HTML documents, you will find that the Offisor HTML
document has the fewest lines: 187 (note: Offisor also left a
couple of lines stating I was using a demo version). In
comparison, the Word-exported HTML was a whopping 579 lines and
the Dreamweaver-parsed HTML was 264. This is with a simple
cleanup to remove empty spaces. The Dreamweaver example did
retain some of the styling, which arguably could be nice in that
documents look quite familiar after conversion. However, the
styling includes font and other information not as style
classes, but as defined elements. This could be a problem for
content management systems that attempt to apply a single style
site-wide. Overall, I felt that Offisor shone. It also goes
without saying that Offisor documents can be parsed dynamically,
and using XSLT, developers can transform that XML into whatever
they like. The Save As method and Dreamweaver parsing, on the
other hand, both require additional steps prior to uploading
content. The XML output by Offisor is also very intuitive and
straightforward to use. A Word XML reference can be found at the
Davisor site.
Of course, I didn't compare the actual proprietary Word doc
format file. Why? Well, it's a binary format so it really isn't
worth comparing. If this weren't the case, there wouldn't be any
need for tools like Offisor, now would there?
How Offisor can be used
Where I think Offisor gets exciting is all of the potential
applications. Converting documents from a popular and
proprietary format, as well as basic HTML, into a universal one
is undoubtedly a move in the right direction. The possibilities
of what can be done are numerous:
- E-mail Attachment Conversion - Dynamically converting Word attachments in e-mail offers a number of advantages: eliminating virus threats, enabling simple in-browser HTML viewing of these attachments, and allowing attachments to be viewed on computers and devices that don't have Word installed.
- Content Management Delivery - Clearly the most obvious: enabling everyone in the office to simply upload their Word documents, as well as HTML and other documents, and have them translated into HTML, Wireless Markup Language (WML), or any other form of XML that can be read by a variety of devices and applications.
- Universal Content Repository - Storing all HTML and Word docs in a universal format, especially XML, can eliminate numerous data storage issues. Because XML is text-based, as opposed to Word's binary format, documents can be stored in character format.
- Superior Indexing - Once again, because XML is text-based, as opposed to Word's binary format, documents can be indexed and users can perform keyword searches on the content.
- Easy PDF Conversion - Enable easy conversion, even on the fly, to PDF format. The examples even include a working version of this functionality. PDF's read-only nature makes it a superior format in some circumstances, such as for contracts and for reassuring users who are concerned about Word viruses.
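As a rough illustration of the indexing idea in the list above, the sketch
below pulls the text out of an XML document with the standard Java DOM
parser and collects the distinct words, which is the raw material for a
keyword index. The class name XmlKeywordSketch and the sample element names
are assumptions for illustration only; they are not part of Offisor or its
output format.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlKeywordSketch {

    // Returns the distinct lower-cased words found in the document's text.
    public static Set<String> keywords(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder text = new StringBuilder();
            collectText(doc.getDocumentElement(), text);

            Set<String> words = new TreeSet<>();
            // Split on anything that is not a letter or a digit.
            for (String word : text.toString()
                    .toLowerCase(Locale.ROOT)
                    .split("[^\\p{L}\\p{Nd}]+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return words;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Walks the tree and appends each text node, separated by a space so
    // that words in neighboring elements do not run together.
    private static void collectText(Node node, StringBuilder text) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            text.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, text);
        }
    }

    public static void main(String[] args) {
        System.out.println(keywords(
            "<document><p>Contract terms</p><p>Payment schedule</p></document>"));
    }
}
```

A real search feature would hand this text to a proper indexing engine; the
point here is only that text-based XML makes the content reachable with a
few lines of standard code, which a binary Word file does not.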
Conclusion
Quite frankly, I couldn't be more impressed with this tool. It
is simple and cheap (300 Euros, about US$350, tool only, or 600
Euros, about US$700, including 1 year of support and upgrades),
and it does exactly what it says: it converts binary Word docs
and loosely structured HTML docs into XML, and it does it well.
The documentation is comprehensive and the API is simple. I look
forward to being able to parse other Office documents, such as
Excel and PowerPoint files. While these types of translations
are probably within reach for developers in the Microsoft
development environment, in Java/J2EE this has always been a
challenge. I think that Offisor can help us all to deal with
this pervasive issue.
Drew Falkman is the author of the JRun Web Application
Construction Kit and co-author (with Ben Forta) of Reality
ColdFusion: J2EE Integration, both published by Macromedia
Press. Over the past 6 years, Drew has developed over 150 Web
applications of all sizes using ColdFusion and Java. Currently
Drew consults, speaks at events,
writes for numerous publications, and teaches courses at
Portland State University. His latest project through his
consulting company, Veraison LLC, was a real-time cattle auction
using Flash Remoting and Flash Communication Server. In
addition, Drew is a member of Team Macromedia, a certified
ColdFusion Developer and a certified Macromedia instructor.