
CML & The Importance of Documentation

Perhaps you have never heard of CML, a standard for chemists in the computer world. CML stands for Chemical Markup Language, and it was one of the earliest applications of XML. XML provides a structured language for marking up information through the use of tags. Chemists have had to deal with a variety of formats, each usually dependent on the software that created the file. For instance, the program Gaussian saves its file in the .G03 format. This file is completely different from a GAMESS file of the same molecule. CML, and XML in general, triumph here through the use of human-readable tags and structured typing, no matter which application created the file.

The idea here is that there would be a certain structure to a CML file that each chemistry application would support. You could create a molecule or structure in Gaussian and then view it in Chime or Jmol, all using one consistent format. Of course, this vision never fully succeeds, as it requires complete support from the development community, but in many cases the software developers are not the only ones at fault. In this specific case and many others, the standard setters are at fault.

I am currently developing a web interface for chemical integration with database storage, as well as a few Perl scripts to help the process. One of the main things I was working on was converting a Gaussian 03 file into a standards-compliant CML file. Simple, right? I think not. XML in practice should be a very simple thing to create; however, when building to be compliant, you don’t get to create and choose the tags. You are dependent on what was set out in specifications written before your involvement. Sometimes these specifications are genius and it’s simple to bind your code to them. Sometimes it would be simple, except something is lacking, and that thing happens to be the key. It’s documentation.
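To make the goal concrete, here is a minimal Python sketch of the kind of output such a converter is aiming for. It is not my Perl script, and it has not been validated against the CML schema. The namespace URI, the `molecule`/`atomArray`/`atom` element names, and the `elementType`, `x3`, `y3`, `z3` attributes reflect my understanding of common CML usage and should be treated as assumptions. The function name `atoms_to_cml` and the water coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; check it against the schema version you target.
CML_NS = "http://www.xml-cml.org/schema"


def atoms_to_cml(mol_id, atoms):
    """Build a CML-style <molecule> string from (element, x, y, z) tuples.

    Hypothetical helper: the element and attribute names are assumptions
    about common CML usage, not a verified reading of the specification.
    """
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", CML_NS)
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for index, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(atom_array, f"{{{CML_NS}}}atom", {
            "id": f"a{index}",
            "elementType": element,
            "x3": f"{x:.6f}",
            "y3": f"{y:.6f}",
            "z3": f"{z:.6f}",
        })
    return ET.tostring(molecule, encoding="unicode")


# Illustrative water geometry in angstroms, not taken from a real Gaussian run.
water = [
    ("O", 0.0, 0.0, 0.117),
    ("H", 0.0, 0.757, -0.469),
    ("H", 0.0, -0.757, -0.469),
]
cml_text = atoms_to_cml("water", water)
print(cml_text)
```

Generating the tags is the easy part. Knowing which tags and attributes a compliant consumer expects is exactly what the specification has to tell you.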

I spent days and days scanning through the CML home page and Google trying to find standards-compliant example files as well as definitive documentation explaining the correct guidelines for a CML file. All of this is missing. Yeah, there are a few sample XML files, but they are scattered, rare, and often conflict with other examples. The written documentation was also difficult to find and basically useless. How can this happen? How can such a great idea, developing a standard, be ruined so badly? The problem lies in the fact that developers waste time just trying to figure out whether their output files are standards compliant. I’m still not sure whether my Perl script generates valid CML.

This situation perfectly illustrates why we, as developers, must focus so hard on documentation, not only for our benefit, but for the benefit of other organizations and independent developers. Standards mean nothing when no one complies. The only way to get companies and developers to comply with your standards is to not make it a hassle. Honestly, we were actually thinking of creating our own CML-like format just because the documentation and support were so atrocious.

What you need to take out of this is that documentation should not be a secondary thought. It’s not something you lay on some intern or your worst programmer. It’s something that, if not written as the code is developed, should be tackled wholeheartedly later. Documentation is not outsourced or sent off to a lower tier. Documentation should be at the forefront of your product, so that you cause little to no inconvenience to those trying to support your standards, your idea, your product. These potential developers send lots of good press towards your product if they incorporate your content into their packages. Why should you make doing so a nuisance for them? Make it so easy, so simple that they cannot refuse to support your standard or your product.

Learn from the mistakes of CML and make sure that your product is the best it can be and documented the best it can be. Don’t think that everyone will get your work. You must explain it, and don’t half ass it.

Ok, that’s enough for the lecture, but that was pretty fun. Thanks for reading.
-Dustin

  • I didn't come across your comments until today... As Egon mentioned, there's quite a large user community around CML -- surprising you didn't turn up CDK or Open Babel, which can both generate (and parse) compliant CML.

    Of course Open Babel will also read GAMESS and Gaussian files and generate CML for you -- no need for a Perl script.

    I do agree with you that there is a lack of a "CML examples" repository. Well, there are some in the CML repository, but they tend to emphasize esoteric areas of CML which have been added in later revisions.

    But there's a great open source chemistry community, e.g.:
    http://blueobelisk.org/
    http://blueobelisk.org/planetbo/

    Personally, I think you have to consider both the documentation on the net and the authors -- no one can write documentation that covers everything, although some commercial efforts come close (e.g., Apple or Microsoft development sites, Trolltech's excellent Qt documentation...).

    Good luck!
  • Nice rant about the lack of documentation :), and I can confirm this is often a problem. It is interesting to note that there is a shift of customs here; in the past one would contact the author of some program and ask questions or request pointers. With the huge amount of information now freely browsable on the internet, we assume this to be the de facto standard, while in many cases it isn't yet.

    The point: treat CML as an open source project, talk to the user and developer communities, and do not limit yourself to missing or outdated documentation. (E.g. join #cdk on irc.freenode.net)

    BTW, what trouble did you have? Could I help here?