Everybody Disco

I’m looking at a new project called Disco.  It was developed by Nokia as an implementation of MapReduce, the programming model popularized by Google for splitting a large computing task into pieces that can be crunched by multiple cores or servers.  Word has it that MapReduce is used extensively at Google, and my guess is that building that big index is one of its jobs.

The core of Disco is implemented in Erlang, the concurrent, fault-tolerant, multicore-ready distributed computing platform developed some time ago at Ericsson.  Erlang is brilliant, though kind of weird, and represents a big educational hurdle for the existing programming population.  It’s just too different from the existing major programming paradigms.

Disco takes a stab at solving that problem, by allowing programmers to write their jobs in Python.  The jobs are executed by the Erlang core, buying all that distributed, fault-tolerant goodness that Erlang provides, but keeping it safely sealed away from application developers who can work in the relatively friendlier world of Python.

Here (lifted directly from the Disco documentation) is a Disco job:

from disco.core import Disco, result_iterator

def fun_map(e, params):
    return [(w, 1) for w in e.split()]

def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

results = Disco("disco://localhost").new_job(
    name = "wordcount",
    input = ["http://discoproject.org/chekhov.txt"],
    map = fun_map,
    reduce = fun_reduce).wait()

for word, frequency in result_iterator(results):
    print word, frequency

So this code snippet builds a word count of some text.  MapReduce always consists of two functions: the Map function, which turns each piece of the input into a set of intermediate key/value pairs, and the Reduce function, which combines those pairs into the final result.  Because each piece of input can be mapped independently, the work spreads naturally across many workers.  (This is the essence of MapReduce, and isn’t tied to a particular technology.)
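To make the two phases concrete, here is a minimal single-process sketch of the textbook MapReduce shape.  It is not Disco's API: the names map_reduce, word_pairs and total are made up for illustration, and the reduce function here is called once per key, which is the classic formulation rather than anything the Disco documentation promises.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run a toy, single-process MapReduce over an iterable of records."""
    groups = defaultdict(list)
    # Map phase: each record independently yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle step: collect the values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's list of values into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_pairs(line):
    # Hypothetical map function: one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def total(key, values):
    # Hypothetical reduce function: add up the counts for one word.
    return sum(values)

counts = map_reduce(["the cat sat", "the dog sat down"], word_pairs, total)
print(sorted(counts.items()))
```

In a real cluster the map calls and the reduce calls would be farmed out to different workers; the loop here only shows how the data flows between the phases.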

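The code above has two fun_* functions.  “Fun” is an Erlang-ism: it is the Erlang keyword for creating an anonymous function, not unlike a lambda in Python.  Here, though, both are ordinary named Python functions, and the prefix is just a naming convention.  The functions themselves are passed into the Disco instance, which spits out the results once the job finishes (presumably that is what the wait() call is waiting for).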
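Passing functions around as values is plain Python, nothing Disco-specific.  A small sketch, with apply_to_all and shout as made-up names, shows that a named function and a lambda can be handed to the same caller interchangeably:

```python
def apply_to_all(fn, items):
    # fn is just a value here: any callable taking one argument will do.
    return [fn(item) for item in items]

def shout(word):
    return word.upper()

# A named function and an equivalent lambda are passed the same way.
named = apply_to_all(shout, ["map", "reduce"])
anonymous = apply_to_all(lambda word: word.upper(), ["map", "reduce"])
print(named == anonymous)
```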

So in the above code example, fun_map gets called with one piece of the input text at a time (e, presumably a line) and emits a (word, 1) pair for every word it finds.  That is where the job gets split up, since each piece can be mapped on a different worker.  Then fun_reduce receives an iterator over those pairs, adds up the counts per word in the dictionary s, and writes each word and its total to the “out” accumulator.  Nothing in this code says how many copies of fun_reduce run or on which machines; that is left to Disco, which ties it all together and returns it as the “results”.
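One way to check how the two functions fit together is to run them by hand, with no Disco involved.  The sketch below copies fun_map and fun_reduce from the job (with items() swapped in for iteritems() so it runs on Python 3) and feeds them through a stand-in for the out object.  ListOut is hypothetical and only models the add() call the job uses; passing None for params and feeding the map function one line at a time are also assumptions.

```python
def fun_map(e, params):
    return [(w, 1) for w in e.split()]

def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.items():  # Python 3 spelling of iteritems()
        out.add(w, f)

class ListOut:
    """Hypothetical stand-in for Disco's out object; only add() is modeled."""
    def __init__(self):
        self.pairs = []

    def add(self, key, value):
        self.pairs.append((key, value))

# Assumption: the map function sees the input one line at a time.
lines = ["to be or not", "to be"]
mapped = [pair for line in lines for pair in fun_map(line, None)]

# A single reduce call tallies every word it is handed.
out = ListOut()
fun_reduce(iter(mapped), out, None)
print(sorted(out.pairs))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Note that one call to fun_reduce handles all the words in its input, which is why it needs the dictionary s in the first place.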

Wait, this gets better.  Disco comes with tools that allow it to be deployed on Amazon’s EC2 computing cloud.  (Hm, Python.  Django-Disco anyone?) Imagine dynamic, linear capacity scaling, on rented compute cycles, with easily written Python jobs.  I think I might be salivating a bit.

I’m a huge fan of anything that can deliver concurrent programming power in a form that’s palatable to programmers who haven’t grown up with it.  I’m going to eagerly watch the Disco project to see how it does.

Posted in programming.