Siddhesh Rane

My personal blog about tech

Processing Wikipedia with Spark

What Beginners Wished They Knew

For my B.Tech project I had to process every article in Wikipedia. The uncompressed dump of its plaintext is 14GB in size, which doesn’t sound like much until you run even a trivial word count pipeline on it. I had to do far more than that: think matching 100 million labels against all the articles. It was no task for a single machine, so I used Spark to distribute the load over a cluster of machines. In this post I’ll list everything that I wish I didn’t have to discover on my own.

Jekyll vs JBake

In my previous blog post I talked about how Jekyll got me into blogging and why I still ditched it in favour of JBake. In this post I’ll summarize the important architectural differences between Jekyll and JBake and the factors that made me choose JBake.

My First Blog

I was never really into blogging. Even today it’s not the conventional aspects of blogging that have got me started with one. In this post I will explore why I chose a static blog over the obvious choice of a WordPress one.