Siddhesh Rane

Processing Wikipedia with Spark

What Beginners Wished They Knew

Posted by Siddhesh Rane on May 15, 2018 Tags: spark semantic web nlp

For my B.Tech project I had to process every article in Wikipedia. The uncompressed dump of its plaintext is 14GB in size, which doesn’t sound like much until you run even a trivial word count pipeline on it. I had to do far more than that: think matching 100 million labels against all the articles. It was no task for a single machine, so I used Spark to distribute the load over a cluster of machines. In this post I’ll list everything that I wish I hadn’t had to discover on my own.

Jekyll vs JBake

Posted by Siddhesh Rane on June 21, 2017 Tags: blog jbake jekyll

In my previous blog post I talked about how Jekyll got me into blogging and why I still ditched it in favour of JBake. In this post I’ll summarize the important architectural differences between Jekyll and JBake and the factors that made me choose JBake.

My First Blog

Posted by Siddhesh Rane on June 06, 2017 Tags: blog jekyll

I was never really into blogging. Even today it’s not the conventional aspects of blogging that have got me started with one. In this post I will explore why I chose a static blog over the obvious choice of a WordPress one.