Monday's session was the final formal session of the Bioinformatics workshop I have been attending and blogging about over the last few days. Donat Agosti, who heads up Plazi, entitled his talk "Literature & XML: or How to Have More Time to Think".
Before launching into his presentation, Donat challenged us on our motivations for doing science and on the reasons we get funded to do science at all. On the latter point, science is usually funded to advance our knowledge of the world around us for the benefit of humanity. However, if this knowledge is unavailable to the rest of the scientific (or non-scientific, for that matter) world, science is failing at what it set out to do. So, Donat made a strong case for making science generally accessible and for making use of non-copyrighted resources.
Donat then went on to explain how the information that is available in written format (e.g. in papers) can be marked up using, for example, XML, thereby making the information in these papers accessible in an electronic format. So, for example, if you publish about a species, info on its location, its host and a predator may be included in the manuscript. The words in the paper that describe the location, host and predator of the species could then be tagged as such, linking the species to a location, a host and a predator. Imagine a world where this information is electronically accessible and we no longer have to browse through paper after paper to look for what we need. I imagine that, in addition to saving us hours of time, it would open the door to a myriad of new ideas and analyses and provide a whole new level of understanding of 'how the world works'.
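To make the idea a bit more concrete, here is a minimal sketch in Python of what such a marked-up sentence might look like and how the tagged information could be pulled out again. The element names (`treatment`, `taxon`, `location`, `host`, `predator`) and all the species and place names are invented for illustration; this is not the actual schema Plazi uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag names and all names in the text are made up
# for illustration and do not come from a real publication or schema.
MARKED_UP = """
<treatment>
  <p>
    <taxon>Examplea prima</taxon> was collected in
    <location>Example Valley</location> on <host>Hostplanta secunda</host>,
    where it was preyed upon by <predator>Predatora tertia</predator>.
  </p>
</treatment>
"""


def extract_record(xml_text):
    """Return the first tagged taxon, location, host and predator as a dict."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for tag in ("taxon", "location", "host", "predator"):
        element = root.find(f".//{tag}")
        # A tag that is absent from the text is recorded as None.
        if element is not None and element.text:
            record[tag] = element.text.strip()
        else:
            record[tag] = None
    return record


record = extract_record(MARKED_UP)
print(record)
```

Once the words are tagged like this, linking a species to its location, host and predator becomes a simple lookup rather than a reading exercise.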
The major factor preventing large-scale mark-up of published information is copyright. Only older papers can be marked up without violating copyright. Currently, the Biodiversity Heritage Library (BHL) is scanning large quantities of old publications and marking them up (for taxon names only, though). When you visit the website, you can therefore search for a taxon and access the scanned publications through linked page numbers. If, e.g., your taxon's info is on p. 98 of a 329-page publication, there is no need to trawl through pages of information to try to find your taxon; you can go straight to p. 98, where the taxon is mentioned.
Donat, together with Guido Sautter, a programmer, then went on to explain how they use GoldenGate Editor to mark up documents as XML. Their process is somewhat slower than the BHL's because it is more thorough. Instead of only marking up taxon names, they also mark up other information such as taxonomic treatments, morphology, synonyms, etc. A great deal of the process is automated, so only a little input is needed from people.
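To give a feel for what "automated" could mean here, below is a deliberately naive Python sketch that tags candidate taxon names in plain text. It is not how GoldenGate Editor works (I don't know its internals); it simply assumes a list of known genus names and a regular expression for "Genus species" pairs. The function name `tag_candidates`, the `<taxon>` tag and the genus name are all made up.

```python
import re
from xml.sax.saxutils import escape

# Naive pattern: a capitalised word followed by a lower-case word of at
# least three letters, e.g. "Examplea prima".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


def tag_candidates(text, known_genera):
    """Wrap 'Genus species' pairs in <taxon> tags if the genus is known."""

    def wrap(match):
        name = match.group(1)
        genus = name.split()[0]
        if genus in known_genera:
            return f"<taxon>{name}</taxon>"
        # Leave ordinary capitalised phrases such as "We found" untouched.
        return name

    # Escape &, < and > first so the result stays well-formed XML text.
    return BINOMIAL.sub(wrap, escape(text))


text = "We found Examplea prima near the river & the road."
tagged = tag_candidates(text, {"Examplea"})
print(tagged)
```

A real tool would need to do far more than this (abbreviated genus names, authorities, synonyms and so on), which is presumably where the remaining human input comes in.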
Finally, the information that's extracted from publications can be linked to websites such as Zoobank and GenBank, be fed to GBIF, etc. There is a vast variety of options.