What DataGov can learn from the BBC archive

I was walking my dog and listening to a BBC podcast when I heard a story about the digitisation of the BBC archives. The BBC has over 40 years of audio recordings and is currently turning them into a digital, searchable archive. It will be a great source of accurate historical information when completed. However, they face one great challenge: how do they enrich the audio content with metadata tags so that searching by topic (e.g. the people speaking on a show, the themes discussed) becomes possible?

The sheer amount of data rules out a conventional approach such as hiring students for a summer to tag the recordings :)  So how do they source the metadata?

They use an automated speech-to-text recognition system that tries to guess the metadata tags. However, the quality of the recordings, the variety of languages and accents, and other variables make this task extremely difficult, so the machine-generated tags need to be reviewed by humans.

The review is crowd-sourced among the archive's users: the BBC publishes the tags "as is" with a disclaimer and lets users change them whenever they have a better suggestion.

This is a very simple solution that SCALES to large metadata enrichment projects. You will find yourself in a similar situation when you start gathering metadata for a DataGov project. So what can DataGov learn from this?

We should admit that the initial project has no chance of enriching all the metadata (especially business descriptions) to 100% completeness and quality.

We should let users enrich and correct the published metadata through a simple online interface (see the sketch after these points).

We should encourage users to contribute, letting them know that their feedback brings better search results (a direct benefit to them) and is essential to the overall success of DataGov.
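To make the second point concrete, here is a minimal sketch in Python of how published metadata could keep both the machine-generated value and any user corrections. It is not the BBC's system or any existing DataGov tool; the names and fields are hypothetical and only illustrate the idea of publishing "as is" while making corrections cheap and auditable.

    # A minimal sketch (hypothetical names, not a real product's API) of publishing
    # machine-generated metadata "as is" while accepting user corrections and
    # keeping the full history for audit and review.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class MetadataField:
        """A single published metadata value plus its correction history."""
        name: str                     # e.g. "business_description" or "speaker"
        value: str                    # currently published value
        source: str = "auto"          # "auto" = machine-generated, "user:<id>" = corrected
        history: list = field(default_factory=list)

        def suggest(self, new_value: str, contributor: str) -> None:
            """Record a user suggestion, keep the old value, publish the new one."""
            self.history.append((self.value, self.source, datetime.now(timezone.utc)))
            self.value, self.source = new_value, f"user:{contributor}"

    # Usage: an auto-tagged field is published with a disclaimer, then corrected.
    desc = MetadataField(name="business_description",
                         value="Custmer churn table (auto-generated)")
    desc.suggest("Monthly customer churn per product line", contributor="jane.doe")
    print(desc.value, desc.source)  # corrected value is shown, provenance is retained

The design choice mirrors the BBC approach: the imperfect automatic value is good enough to publish immediately, and every correction improves the searchable metadata without losing the original.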

PS: Crowdsourcing the World Service Radio Archive: an experiment from BBC R&D.