Wednesday 16 December 2009

Visit to South Africa

After the Christmas holidays I will be spending a month in the lab of Belinda Reyers at the CSIR in Stellenbosch, South Africa, where I plan to model the potential for carbon sequestration across the African continent. I hope to gain much from the visit - it sounds like their group is doing a lot of work that relates to my proposed project.

Friday 4 December 2009

Finished

...literally and figuratively. I handed in my 99-page half-way report this afternoon. It was a good exercise in reflecting on what I have done in the first 17 months of my PhD, and where I'm going in the remaining 19. I realize that I have learnt many skills since my start here and that I have gained much from the courses and from the experience of my colleagues. On the other hand, I also feel that I haven't managed as much as I'd hoped to by now, and I hope to make up for at least some of that in the second half of my PhD.
However, for now I'm exhausted and looking forward to not getting home from work at 11pm every night, and, in the not-so-distant future, to a holiday in (hopefully) sunny South Africa.

Wednesday 2 December 2009

While preparing for my half-way examination I have been reading up on the Clean Development Mechanism (CDM), particularly CDM projects in Africa. It's a little disappointing that many of these projects involve establishing plantations of exotic eucalypts on large tracts of natural grassland (Jindal et al. 2009). That's just one more reason for me to think that putting a price on carbon isn't such a good idea; it just seems like one evil replacing another. Having grown up in an area which had largely been transformed from grassland to monospecific plantations, which harboured very few of the indigenous species and caused many of the streams in the region to almost dry out, I cannot see how these can be considered instruments to 'save the environment'.

Thursday 5 November 2009

I have been reading a lot about the distribution of African vegetation types - particularly what determines their distribution. It is interesting to go back to the 'old' literature and discover how people with only field knowledge made the same generalizations that we find today with data obtained using advanced remote sensing techniques, which, in principle, allow us to find statistical relationships without ever setting foot on the continent. In particular, I am impressed by that seminal work, Frank White's vegetation map of Africa. How one man could have gained an overview of the vegetation of this enormous, and often difficult-to-travel, continent is incredible! On the other hand, even though I am working with large-scale datasets, I find that knowing some parts of Africa is immensely helpful for interpreting the patterns I find. I do have to be careful, however, that my interpretation of the results I am obtaining is not biased towards what I know of southern African (and the little I saw of West African) ecology. I should plan a trip to East Africa soon, I think! :)

Wednesday 7 October 2009

A late-night entry

Despite mounting pressure, and days when I feel like my PhD might implode (I hope I'm not the only one who has days like this), this week has been good. I increasingly have direction for the upcoming projects, and I look forward to the legume project, which seems to be taking off. I am excited about the collaboration with the legume group at Kew, which, I am sure, will be inspirational and interesting. And, of course, I look forward to spending some time in London - I love the city (and everyone speaks a language I understand)! Let's hope it'll all work out.

Thursday 24 September 2009

fun at the museum

I remember how, as a little girl, I once got to go 'behind the scenes' at Pietermaritzburg's Natural History Museum. All I remember is seeing a (dead and decomposing and very smelly) crocodile in a huge bath being prepared for, well, I'm not sure what, but some museum collection. That's when I first realized that there is more to a museum than just the exhibits. And since then I have been fascinated with what happens behind museum doors.
So, it's been great attending a course at Stockholm's Natural History Museum. Not only did we walk through exhibits on our way to work every morning, but we also had the opportunity to see where the museum personnel work and store their materials. Yesterday we climbed to the roof of the museum - which was fun and a little frightening (I seem to be developing a fear of heights as I grow older!). On our way up we passed the dusty and almost-forgotten collecting materials, still labelled, of René Malaise, the well-known Swedish entomologist and explorer. Today we walked through the three-million-specimen insect collection and saw, amongst other things, type specimens collected by Linnaeus' students. For somebody like me, coming from a country with a relatively young history of science, this is exciting stuff!

Wednesday 23 September 2009

databasing

What I like about this Bioinformatics workshop is that, after all the presentations and information we have been presented with, we have time to implement what we have learnt. We have three days to work on our projects. The advantage of this is that there is more opportunity to get a grip on the content of the course, and there is opportunity to ask people in the know for advice. I know for a fact that, had we not been given this opportunity, I would have gone home and only gotten round to this work in a few months' time, and, of course, forgotten much! So, at the moment I am learning to use MySQL (another language... sigh). ;) I hope to make good progress on setting up my Fabaceae database before returning to Århus on Friday.
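Just to make it concrete for myself, this is roughly the kind of structure I have in mind for the Fabaceae database - purely a sketch, written against Python's built-in sqlite3 module as a stand-in for MySQL (which I'm still learning), and with table names, columns and the example record all invented by me.
```python
import sqlite3

# Throwaway in-memory database for illustration; the real thing would live in MySQL.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# One table per 'object': species, and the occurrence records that point to them.
cur.execute("""
    CREATE TABLE species (
        species_id INTEGER PRIMARY KEY,
        genus      TEXT NOT NULL,
        epithet    TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE occurrence (
        occurrence_id INTEGER PRIMARY KEY,
        species_id    INTEGER NOT NULL REFERENCES species(species_id),
        latitude      REAL,
        longitude     REAL,
        country       TEXT
    )""")

# A made-up record, just so there is something to query.
cur.execute("INSERT INTO species VALUES (1, 'Acacia', 'karroo')")
cur.execute("INSERT INTO occurrence VALUES (1, 1, -29.6, 30.4, 'South Africa')")

# The kind of join I would eventually run for mapping: where has each species been recorded?
for row in cur.execute("""
        SELECT genus, epithet, latitude, longitude, country
        FROM occurrence JOIN species USING (species_id)"""):
    print(row)
```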

Tuesday 22 September 2009

uBio

I have been perusing the many websites we have been introduced to. Lots of information and tools are available on these websites, so here I summarize the functions available on uBio (links to all tools are provided on the uBio website).
  • uBioRSS - RSS feeds can be searched based on taxonomic names, as defined in the Species 2000 and ITIS Catalogue of Life 2007 Annual Checklist
  • TaxaToy - plots the number of species discovered over time. This can also be queried according to taxon.
  • OntoSpecies - provides links for browsing taxonomic trees, as well as links to several outside web resources
  • LigerCat - Literature and Genomics Resource Catalogue
  • Aging Portal - resource providing information on the lifespan and aging of species
  • LinkIt - allows the user to retrieve taxonomic names (if I remember correctly, even ones that have been misspelled) from websites. (Apparently this tool is meant to be extended to extract info from PDFs in the future.) If these names are present in one of a number of other reference websites (e.g. nomenclatural websites, GBIF, phylogenetic websites, MorphBank, etc.), a link to these names on those other websites is provided.
  • FindIT - taxonomic names are extracted from websites, free text or files that are uploaded (including PDFs). This is great if you want a list of species mentioned in a file/document.
  • ParseIT - separates scientific names into their various components. Can be used for outputting in XML format.
  • MapIT - All names in a document or website are extracted and... Well, the rest I don't really get. Will try to figure it out.
  • CrawlIT - crawls through a website and all links on that website and returns all the names mentioned there.
  • X:ID - allows users to create their own identification keys which can be run over the web or locally.

Monday 21 September 2009

XML Markups

Monday's session of the Bioinformatics workshop I have been attending and blogging about over the last few days was the final formal one. Donat Agosti, who heads up Plazi, entitled his talk "Literature & XML: or How to Have More Time to Think".
Before he launched into his presentation, Donat challenged us on our motivations for doing science and on why we get funded to do science at all. On the latter point, science is usually funded so that advances can be made in our knowledge of the world around us, for the benefit of humanity. However, if this knowledge is unavailable to the rest of the scientific (or non-scientific, for that matter) world, science fails at what it originally set out to do. So, Donat made a strong case for making science generally accessible and for making use of non-copyrighted resources.
Donat then went on to explain how information that is available in written format (e.g. in papers) can be marked up using, e.g., XML, and thereby made accessible in an electronic format. So, for example, if you publish about a species, information on its location, its host and a predator may be included in the manuscript. The words in the paper that describe the location, host and predator of the species could thus be tagged as such, and accordingly link the species to a location, host and predator. Imagine a world where this information was electronically accessible (and spared us from browsing through paper after paper to look for what we need). I imagine that, in addition to saving us hours of time, it would open the door to a myriad of new ideas and analyses and provide a whole new level of understanding of 'how the world works'.
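To illustrate the idea (with element names I have simply made up - the real schemas used for this kind of markup are far richer), a sentence from a paper could be wrapped in tags like the ones below, after which any little script can pull the facts out:
```python
import xml.etree.ElementTree as ET

# A made-up fragment of marked-up text; real markup schemas use different (and many more) elements.
fragment = """
<treatment>
  <taxonName>Genus exemplaris</taxonName> was collected in
  <location>Kakamega Forest, Kenya</location>, growing on its host
  <host>Ficus thonningii</host>, where it is attacked by
  <predator>an unidentified weevil</predator>.
</treatment>
"""

root = ET.fromstring(fragment)
# Once the markup is in place, extracting structured facts is trivial.
for tag in ("taxonName", "location", "host", "predator"):
    print(tag, "->", root.findtext(tag))
```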
The major factor preventing large-scale marking up of published information is copyright. Only older papers can be marked up without violating it. Currently, the Biodiversity Heritage Library (BHL) is scanning and marking up (though only for taxon names) large quantities of old publications. When you visit the website, you can therefore search for a taxon and access the scanned publications, with linked page numbers. If, for example, your taxon's info is on p. 98 of a 329-page publication, there is no need to trawl through pages of information to try to find your taxon - you can go straight to p. 98 where the taxon is mentioned.
Donat, together with Guido Sautter, a programmer, then went on to explain how they use the GoldenGate Editor to mark up documents to XML. The process they use is somewhat slower than BHL's because it is more thorough. Instead of only marking up taxon names, they mark up other information such as taxonomic treatments, morphology, synonyms, etc. A great deal of the process is automated, so that only a little human input is needed.
Finally, the information that's extracted from publications can be linked to websites such as Zoobank and GenBank, be fed to GBIF, etc. There is a vast variety of options.

Sunday 20 September 2009

More backtracking - here is a summary of what was presented on Monday last week.
The workshop was introduced by Torsten Eriksson from Stockholm University. He gave a general introduction to databases (DBs), which I found very useful, as I am only starting to think about setting up DBs. Here are some of the suggestions he made for databasing:
1.) Before you start setting up your DB, decide (as far as possible) what you want to put into it.
2.) Identify the relationships between objects, including the nature of the relationships (one-to-many vs. one-to-one).
3.) Identify the attributes of the objects you want to put into the DB.
4.) Identify unique identifiers for each object (attributes should depend only on the unique identifiers).
As a good exercise, the basic structure of the DB can be sketched out, with objects becoming tables and attributes becoming columns in the tables. The data type must be specified for each column; unique IDs are columns that may not be empty and, as the name suggests, must be unique. Relationships are then modelled between tables, using the IDs as 'connectors'.
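A toy example of those steps, under my own made-up scenario of collectors and the specimens they collect (one collector, many specimens), again using Python's sqlite3 only because it needs no server:
```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway database, purely to try the ideas out
cur = con.cursor()

# Objects become tables, attributes become columns, and each table
# gets a unique identifier that may not be empty (the primary key).
cur.execute("""
    CREATE TABLE collector (
        collector_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL
    )""")

# One collector -> many specimens, so each specimen row carries the
# collector's ID as a foreign key: the ID acts as the 'connector'.
cur.execute("""
    CREATE TABLE specimen (
        specimen_id  INTEGER PRIMARY KEY,
        collector_id INTEGER NOT NULL REFERENCES collector(collector_id),
        taxon        TEXT,
        locality     TEXT
    )""")
```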
Torsten also suggested using open-source software to set up DBs, because it is accessible to everyone everywhere. One suggestion that was made is MySQL. There are also database systems that have been designed for specific purposes, e.g. the ontology DBs mentioned in previous posts, Specify and MX.

Katja Seltmann (see the blog on ontologies below) followed with an introduction to MX as an example of a taxonomic database. It is web-based and the core data object is an operational taxonomic unit (OTU), which means that data can be entered at any taxonomic level. See more here.

In the afternoon Katja and Torsten co-hosted a tutorial during which we were expected to link up the components of the Darwin Core. To assist in setting up the DB, we downloaded phpMyAdmin, which provides an interface for MySQL and is therefore a little easier (particularly in the beginning) than coding a DB entirely by hand. And, even if the DB is created using phpMyAdmin, the SQL code is provided, which is a definite advantage. Installation of phpMyAdmin was semi-complicated, as it wouldn't just run in Windows. However, some web searches led Katja to establish that it may be best to download MySQL and phpMyAdmin together using XAMPP. I have tried phpMyAdmin a little, and it seems relatively intuitive. I hope there'll be time in the next week to develop my DB further.

Morphbank

The other part of Friday's lecture was Debbie Paul's Morphbank introduction. Morphbank is a repository for photos of living organisms, which is meant to serve as an aid for species identification. Photos come with a range of metadata, which, I think, could potentially be useful for a variety of other purposes. Data can be uploaded via the web individually or in batches, or, if the identification of the specimen is not to species level, be sent to Morphbank directly. Morphbank is also linked to various other online databases.
I browsed around Morphbank a little and it seems that, at this stage, the coverage of African species is somewhat sparse. Perhaps that is a reason to raise awareness of this resource amongst African researchers. Also, could it be possible to link Morphbank to other photographic repositories, such as the West African Plant Database or the Flora of Zimbabwe?

Friday 18 September 2009

Ontologies

Today's workshop sessions started with a presentation by Katja Seltmann (see her wiki page to check out some of her other projects!). She has a really diverse background and is currently working as a software developer/entomologist on the Hymenoptera Anatomy Ontology Project (HAO). An ontology comprises the definition of a set of concepts and of the relationships between them. So, the HAO is basically a database that 1.) provides definitions of anatomical structures of hymenopterans, 2.) links these to several other fields, e.g. synonyms, literature references, etc., and 3.) defines the relationships between the structures (e.g. antennae are parts of the head, etc.). When you search for a term, you obtain the definition(s) of the structure, a list of other structures related to the search term, and the nature of the relationships.
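As I understand it, the core of such an ontology boils down to terms, definitions and typed relationships between the terms. A toy version (my own invented wording, not taken from the HAO) might look something like this:
```python
# Toy ontology: each term has a definition plus typed relationships to other terms.
# Definitions are my own rough wording, purely for illustration.
ontology = {
    "antenna": {
        "definition": "Paired sensory appendage attached to the head.",
        "relations": [("part_of", "head")],
    },
    "head": {
        "definition": "Anterior body region bearing the mouthparts and antennae.",
        "relations": [("part_of", "body")],
    },
}

def describe(term):
    """Print a term's definition and how it relates to other terms."""
    entry = ontology[term]
    print(term, "-", entry["definition"])
    for relation, other in entry["relations"]:
        print("   ", term, relation, other)

describe("antenna")  # prints the definition, then: antenna part_of head
```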
The HAO group set up their own ontology; however, a platform for developing these interactively is freely available through OBO-Edit. With this software you can either create your own ontology or access/edit existing ones - for a list of these see the OBO Foundry. The ontologies there are not only anatomical - some are taxonomic, others biochemical, etc. The interface is easy to use and one gets a relatively good overview of the connections between the different elements of an ontology.
I have been thinking about whether such ontologies would be useful for my legume project. I could create a taxonomic ontology, though, for my purposes, a simple database showing the relationships between different taxa would probably suffice. As for character ontologies - probably not that applicable to me either. However, 1.) it is always good to know what options are out there - perhaps something like this may come in useful at a later stage, and 2.) the exercise was informative in explaining how databases can be set up and in introducing some aspects of XMLing (not discussed above).
Just as an aside - the blogging exercise is really helping me to synthesise all the information from the course. It's as Vince said - blogging should perhaps first and foremost be seen as a selfish exercise to keep track of one's activities and synthesise information.

A night amongst great people

After a short stroll through Stockholm city last night, we had a 'communal dinner' at our villa, which progressed to interesting conversations about politics and science. What a privilege to be amongst people from different nations (we were American, Brazilian, Danish, Swedish, British, Estonian, South African) with different experiences of and insights into humanity and its problems, and possible solutions to such problems.

Thursday 17 September 2009

Data sharing platforms and linking and collating information from various websites

I am working backwards now, posting stuff from the previous few days. However, I think it is important to write my thoughts down - more for digesting this information myself than for anyone else at this stage. All the same, any comments are welcome.
On Wednesday we tackled issues from more of a taxonomic angle, though they apply more widely. A major issue that was raised was the availability of biological information on the web, and the fact that it is spread across many different websites that are not linked to one another. Rod Page (University of Glasgow) pointed out that, when doing web searches, Wikipedia is in the (vast) majority of cases the primary source that emerges. However, Wikipedia is completely open to editing, which brings its own problems. In contrast, the Encyclopedia of Life (EOL), as an example of what would generally be considered more reliable information, is closed and dependent on a single person for editing (or for allowing edits). As a result, EOL is often not updated regularly. In addition, EOL has no reference list to indicate where information originated from (in contrast to, e.g., Wikipedia).
As an in-between, Rod Page and Vincent Smith (Natural History Museum, London) offered two alternatives. Vince spoke about Scratchpads as a platform for taxonomists to share information. We played around with them a little. In general, they are user-friendly (especially for someone with my webpage-developing abilities). They link to other databases, thereby providing information on phylogenies and taxonomy, nomenclature, geographic information (via GBIF), images, and more. You can, of course, choose what you would and would not like to have displayed on your site. Most importantly, you can invite people to contribute to the sites in various ways, giving them different levels of administrative rights. You can also choose which information is visible to all viewers of the website, and which is visible only to 'members' of the Scratchpad group. That way, it is a useful platform for data exchange and for allowing experts to update details of the page. Some of the taxonomists who were trying out the Scratchpads at yesterday's workshop had minor suggestions for improvement, and, from the discussions, it was clear that Scratchpads are constantly being developed and improved.
Rod Page, on the other hand, suggested semantic wikis as a way of exchanging data and, more importantly, of linking information from various websites into one website. The difference from Scratchpads is that the privacy level is not as high and that the user needs to put more effort into creating the site. The user can, however, define more precisely what he or she needs from the site. Rod also presented new ideas, many still 'in development', on how, e.g., RSS feeds could be used to extract locality records. This was, again, something I was particularly interested in, as it could potentially be useful for me - if I could extract geographic information from RSS feeds, I could potentially harvest a much wider range of data sources. However, chatting with Rod, it emerged that the accuracy of such an exercise - esp. for my purposes - would probably be compromised. I will, however, keep track of Rod's blogs. He has many interesting and, in the world of bioinformatics, cutting-edge ideas.

On why to provide good metadata - you never know who might find it useful!

Today's course content was mainly aimed at molecular biologists (which is somewhat beyond my field of expertise). Johannes Bergsten provided an overview of DNA barcoding, a way of identifying DNA samples to species level, while Henrik Nielsen introduced techniques allowing molecular biologists to extract data from GenBank to supplement their own data.

What interested me most about both presentations was that they showed that coordinates are often included in the metadata of gene sequences. As part of my PhD I am hoping to set up a database on African Fabaceae to study their biogeography. In some ways this is a mammoth undertaking, given the diversity of the group and the often-lacking data (esp. in electronic format) on African groups. Any source of information could thus be potentially useful. Therefore, I looked for information on locality records of legumes on BOLD Systems, the DNA barcoding website; however, no information on African legumes was available there. As a test, Henrik helped me extract information on localities of South African Fabaceae from GenBank. No latitude/longitude information was available, though some localities were provided. Unfortunately, locality information was, with three exceptions, at the province scale (or broader), making it of limited use for my purposes. What a pity! However, it made me realize once again that providing good metadata when publishing your data can be of great use to the scientific world in the long run. If these data are hidden in the deepest, darkest depths of your scientific publication, they will be difficult to access and of limited use to people like us! (And, as an aside, wouldn't the fact that most research is funded by taxpayers provide more incentive for this money to be put to wider use with minimal effort?)
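For my own notes, something along these lines can be done with Biopython (this is my own sketch, not exactly what Henrik showed me; the search term is simplified, and 'country' and 'lat_lon' are the source-feature qualifiers GenBank uses for locality data - which, as I found, are rarely filled in):
```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Find a handful of nucleotide records for South African Fabaceae (simplified search term).
search = Entrez.read(Entrez.esearch(
    db="nucleotide",
    term="Fabaceae[Organism] AND South Africa",
    retmax=20))

for record_id in search["IdList"]:
    handle = Entrez.efetch(db="nucleotide", id=record_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    for feature in record.features:
        if feature.type == "source":
            # 'country' and 'lat_lon' are the standard locality qualifiers,
            # but most submissions leave them out or keep them very coarse.
            country = feature.qualifiers.get("country", ["?"])[0]
            lat_lon = feature.qualifiers.get("lat_lon", ["?"])[0]
            print(record.id, record.annotations.get("organism", "?"), country, lat_lon)
```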

A new beginning

As part of my PhD I have the privilege of attending courses on all manner of interesting topics. At the moment I am in Stockholm attending GBIF's Bioinformatics Course, organized by Kevin Holston. Yesterday Rod Page gave us an introduction to some of the issues he's been grappling with and ideas he's been trying to develop. He started his lecture by imploring us, the course participants, to start blogging about our work. I was inspired. So, here goes...

I will start by giving feedback on the content of this course, and hopefully continue to blog over the coming months.