Siddhesh Rane

Processing Wikipedia with Spark

What Beginners Wished They Knew

Posted by Siddhesh Rane on May 15, 2018 Tags: spark semantic web nlp

For my B.Tech project I had to process every article in Wikipedia. The uncompressed plaintext dump is 14 GB, which doesn’t sound like much until you run even a trivial word count pipeline on it. I had to do far more than that: think matching 100 million labels against every article. That was no job for a single machine, so I used Spark to distribute the load over a cluster. In this post I’ll list everything I wish I hadn’t had to discover on my own.