Week 3 – Optical Character Recognition vs. Double Re-Keying – Quality vs. Quantity?

In the digital age, the online archive is a historian's greatest ally. However, the search functions in these archives can hide key sources if their accuracy is not fully understood. The texts in digital archives more often than not started life as paper documents that had to be digitised. The most popular method for doing this is Optical Character Recognition (OCR), software that scans a document and produces a digital text of what it judges the document to say. With sources that contain handwriting, especially from previous centuries, this can cause serious problems: spellings and words are not copied accurately, which hinders keyword searches in these archives. Another form of digitisation is 'double re-keying', whereby the text is typed twice by two different typists and the two transcriptions are then compared by computer. Any differences are resolved manually. (1) As double rekeying involves manual labour, it is also much more expensive, with Michael Lesk approximating the cost to be around £1/KB of data vs. £0.13/KB for OCR. (2)
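The comparison step in double re-keying can be sketched in a few lines. This is only an illustration of the idea, not the software any archive actually uses: the function name `find_disagreements` and the two sample transcriptions are hypothetical, and the comparison is done word by word with Python's standard `difflib` module.

```python
import difflib


def find_disagreements(first_keying, second_keying):
    """Return (typist one, typist two) word pairs where the transcriptions differ."""
    a = first_keying.split()
    b = second_keying.split()
    # autojunk=False so that frequent words are never ignored on long texts
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return disagreements


# Hypothetical transcriptions of the same line by two typists
typist_one = "The Irish regiment arrived yesterday at the port"
typist_two = "The Irifh regiment arrived yesterday at the fort"

for left, right in find_disagreements(typist_one, typist_two):
    print(f"typist one: {left!r} | typist two: {right!r}")
```

Only the flagged pairs need a human decision, which is why the method is accurate but labour-intensive.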

In order to discuss the differences between the two types of digitisation, British Newspapers 1600 – 1900 and British History Online, which use OCR and double rekeying respectively, can be compared. Searching each archive for two common words, 'Irish' and 'Yesterday', alongside variants that reflect common misreadings or earlier spellings (such as the long s of early modern print, which resembles an 'f'), gives varying results:

Spelling     British Newspapers 1600 – 1900     British History Online
Irish        837,778                            11,018
Irifh        21,987                             2
Jrish        1,614                              1
Jrifh        309                                0
I rish       4,349                              35
J rish       267                                35
F rish       127                                35
Yesterday    1,537,542                          12,668
Yefterday    165,986                            0
Yesterdae    128                                1*

As these results show, although the British History Online archive has fewer sources, they are altogether more accurate. Of the roughly 866,000 results returned for 'Irish' and its variants in British Newspapers, about 3% (a proportion of 0.03) were misspelled or misread forms, compared with about 0.3% (0.003) on BHO if the 35 results returned by each split-word search are counted once, making BHO roughly ten times more accurate. For 'Yesterday', just under 10% of results were inaccurate forms in British Newspapers against less than 0.01% on BHO, a negligible amount. Even from this small pool of results, it is clear that for historians the difference between OCR and double rekeying can be the difference between finding a key source through a keyword search and missing it.
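The share of variant spellings can be recomputed from the counts in the table. The sketch below is a minimal illustration: the function name `variant_share` is hypothetical, and treating the three split-word searches on BHO as the same 35 results, counted once, is an assumption made here rather than something the archives confirm.

```python
def variant_share(counts, correct_spelling):
    """Fraction of all results that are NOT the correct spelling."""
    total = sum(counts.values())
    variants = total - counts[correct_spelling]
    return variants / total


# Counts taken from the table above
newspapers_irish = {
    "Irish": 837_778, "Irifh": 21_987, "Jrish": 1_614, "Jrifh": 309,
    "I rish": 4_349, "J rish": 267, "F rish": 127,
}
# Assumption: the three split-word searches returned the same 35 results
bho_irish = {
    "Irish": 11_018, "Irifh": 2, "Jrish": 1, "Jrifh": 0,
    "split-word searches (counted once)": 35,
}
newspapers_yesterday = {"Yesterday": 1_537_542, "Yefterday": 165_986, "Yesterdae": 128}
bho_yesterday = {"Yesterday": 12_668, "Yefterday": 0, "Yesterdae": 1}

rows = [
    ("British Newspapers, Irish", newspapers_irish, "Irish"),
    ("BHO, Irish", bho_irish, "Irish"),
    ("British Newspapers, Yesterday", newspapers_yesterday, "Yesterday"),
    ("BHO, Yesterday", bho_yesterday, "Yesterday"),
]
for label, counts, correct in rows:
    print(f"{label}: {variant_share(counts, correct):.3%}")
```

Printing the result as a percentage avoids confusing a proportion such as 0.03 with 0.03%.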

While digital archives are becoming increasingly popular with historians, as they can make the process of finding primary sources much cheaper, many historians lack the technological education that would allow them to access a much greater bank of sources. Both of the archives cost around £2 million to create and maintain, but British Newspapers 1600 – 1900 is only around 22-52% accurate, a very large margin of error, whereas BHO has fewer sources that are all much more accurately transcribed and more searchable for the untrained user. (3) When historians are selecting their online archives from the hundreds available, it is no longer a case of blindly searching for keywords: they must understand the individual archive's search algorithm and take advantage of it. The strength of the British Newspapers 1600 – 1900 archive is its vastness, but this is also its downfall. British History Online benefits from double rekeying not only in the accuracy of its results but also in the quality of searching it offers users. While its price tag can be seen as a limitation by some, the phrase 'you get what you pay for' is relevant in this case. The search for primary sources in the digital age is no longer a question of quantity, but of quality.
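One way to take advantage of an OCR-based archive is to search for likely misreadings as well as the correct spelling. The sketch below is illustrative only: the function name `ocr_variants` and its two rules (a non-final 's' read as 'f', as with the long s, and an initial 'I' read as 'J') are assumptions based on the variants searched for above, not a description of how either archive's search works.

```python
from itertools import product


def ocr_variants(word):
    """Expand a keyword into spellings an OCR engine might have produced."""
    options = []
    for position, char in enumerate(word):
        if char == "s" and position < len(word) - 1:
            # the long s was not used at the end of a word
            options.append(("s", "f"))
        elif char == "I" and position == 0:
            options.append(("I", "J"))
        else:
            options.append((char,))
    return ["".join(combination) for combination in product(*options)]


print(ocr_variants("Irish"))
print(ocr_variants("Yesterday"))
```

Each variant can then be searched in turn and the results combined, so that sources hidden behind a misread letter are not missed.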

* The source with this misspelling had been editorially corrected in the transcription, changing it from 'Yesterdae' (the term actually searched) to 'Yesterda[y]', so that it is included in searches for both spellings. This pre-corrected term helps historians who search only for 'Yesterday' and still gives them results with different spellings. Such a correction would only be made by human typists in double rekeying, not by a computer programme such as OCR, which highlights one of the many benefits of double rekeying.

(1) Clive Emsley, Tim Hitchcock and Robert Shoemaker, ‘Old Bailey Online – About This Project’, Old Bailey Proceedings Online (http://www.oldbaileyonline.org; consulted 27 April 2015)

(2) M. Lesk, Understanding Digital Libraries (California, 2005), p. 55

(3) ‘About This Project – Connected Histories’, Connected Histories (http://www.connectedhistories.org/about.aspx; consulted 27 April 2015)

