Brain vs Processor: Part 3 – PDF Data Mining

Use PyBossa and Crowdcrafting to extract information from PDF documents

7 September 2015

Cover photo by Lennart Tange

Welcome to Part 3 of our Brain vs. Processor blog series, which is examining how the human brain remains a more versatile and effective tool than computers for interpreting certain types of data, and detailing how you can use PyBossa software to tap into these capabilities.

So far we’ve talked about how humans are ace at interpreting images and sounds. This blog post focuses on another amazing skill we possess that you’re doing right now… yup, reading!

Interpretation skills

Computers are becoming more sophisticated at transcribing information from documents. But humans still have the edge when it comes to more advanced analysis, such as extracting specific key pieces of information, understanding the meaning of a document and being able to summarise it, and interpreting old swirly-wurly illegible handwriting. Researchers are using PyBossa and Crowdcrafting to exploit these skills.

Transcribing documents in Crowdcrafting

Scifabric’s head honcho Daniel Lombraña González has kindly recorded a little video that shows how easy it is to set up a PDF transcription project in Crowdcrafting.

Just create a new project, use the platform’s integration with Dropbox to upload your PDF files, then set up tasks for volunteers using Crowdcrafting’s convenient ‘Transcribing Documents’ template and – voilà – you’ve got yourself a data mining project.
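If you like to see what happens under the bonnet, here is a minimal sketch of how a list of PDF links might be turned into task payloads, one per document. It is an illustration rather than the template's actual code: the helper name build_pdf_tasks, the example URLs and the pdf_url and page keys inside info are our assumptions. The info and n_answers fields are the ones PyBossa tasks carry.

```python
import json


def build_pdf_tasks(pdf_urls, n_answers=3):
    """Build one task payload per PDF link.

    Hypothetical helper for illustration only. The keys inside "info"
    (pdf_url, page) are assumptions; adapt them to whatever your
    project's task presenter expects.
    """
    tasks = []
    for url in pdf_urls:
        tasks.append({
            # "info" holds the free-form JSON shown to the volunteer.
            "info": {"pdf_url": url, "page": 1},
            # How many different volunteers should answer each task.
            "n_answers": n_answers,
        })
    return tasks


if __name__ == "__main__":
    # Placeholder links, not real documents.
    urls = [
        "https://example.org/contracts/contract-001.pdf",
        "https://example.org/contracts/contract-002.pdf",
    ]
    for task in build_pdf_tasks(urls):
        print(json.dumps(task))
```

Payloads like these could then be sent to a PyBossa server through its API or Python client. Asking several volunteers to answer each task is what lets you cross-check the transcriptions later.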

We have some awesome projects making the most of these capabilities.

Open Oil

The Open Oil project wants to make the oil industry more transparent. Oil conglomerates can be made up of hundreds of companies related by a complex web of contracts. Regulation often requires that the contracts be made publicly available, but understanding the industry as a whole is nevertheless a massive challenge. Open Oil uploads contracts into its own Crowdcrafting project and asks volunteers to extract key pieces of information. This will help the team gain a better understanding of how the industry works.


Photo by Pete Markham. Has the sun set on secrecy in the oil industry?

Follow Open Oil on Twitter.

Iceland’s Criminal Court

In the Héraðsdómar - sýknað eða sakfellt project, Icelandic citizen Páll Hilmarsson wanted to test a journalist’s published assertions that a particular judge, who had a conviction rate of 95%, was biased. Páll uploaded 4,700 judgements of the Reykjavik district court to Crowdcrafting and asked volunteers to note the identity of the judge and whether there was a conviction in each case. In only 17 days, 300 users completed almost 15,000 tasks. Analysis showed that the judge in question was within the statistical norm and that said journalist needed a lesson in statistics. We think he got one…
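We don’t have Páll’s analysis code to hand, but the general idea can be sketched in a few lines of Python. Almost 15,000 answers for 4,700 judgements works out at roughly three answers per judgement, so one plausible approach (our assumption, not necessarily his method) is to take a majority vote for each judgement and then work out a conviction rate per judge. The function names and the toy data below are invented for illustration.

```python
from collections import Counter, defaultdict


def majority(values):
    """Return the most common value among the volunteer answers."""
    return Counter(values).most_common(1)[0][0]


def conviction_rates(answers):
    """Compute a conviction rate per judge from raw volunteer answers.

    answers: iterable of (judgement_id, judge, convicted) tuples,
    one tuple per volunteer answer. The layout is an assumption.
    """
    # Group the volunteer answers by judgement.
    by_judgement = defaultdict(list)
    for judgement_id, judge, convicted in answers:
        by_judgement[judgement_id].append((judge, convicted))

    # judge -> [number of convictions, number of cases]
    totals = defaultdict(lambda: [0, 0])
    for votes in by_judgement.values():
        # Settle disagreements between volunteers by majority vote.
        judge = majority(j for j, _ in votes)
        convicted = majority(c for _, c in votes)
        totals[judge][0] += int(convicted)
        totals[judge][1] += 1

    return {judge: conv / cases for judge, (conv, cases) in totals.items()}


if __name__ == "__main__":
    # Made-up answers: three volunteers per judgement.
    toy_answers = [
        (1, "A", True), (1, "A", True), (1, "A", False),
        (2, "A", False), (2, "A", False), (2, "B", False),
        (3, "B", True), (3, "B", True), (3, "B", True),
    ]
    print(conviction_rates(toy_answers))
```

Putting each judge’s rate next to the rates of their colleagues is what shows whether a headline figure like 95% is actually out of the ordinary.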


LibCrowds

LibCrowds is a new crowdsourcing platform that aims to make the physical and digital collections of the British Library more accessible. Its maiden project – Convert-a-Card – asks volunteers to transcribe text from printed card catalogues into electronic records in order to make them available to a worldwide audience. The project is initially focused on the library’s Asian and African collections, particularly the Chinese and Indian catalogues. Data identified, transcribed or translated as part of the project will be freely accessible from the British Library’s Explore catalogue.


Photo by Asmund. Indexed cards.

Follow LibCrowds and the British Library on Twitter.

Need a hand?

If you need to do a bit of data mining, PyBossa and Crowdcrafting may provide you with the ideal tool. Need any help or advice on setting your project up? Contact us on
