PiCloud introduction

09 Mar 2010 #distributed #picloud #pyro #python

A couple weeks ago my coworker mentioned PiCloud. It claims to be “Cloud Computing. Simplified.” for python programming. Indeed, their trivial examples are too good to be true, basically. I pointed out how the way it was packaging up code to send over the wire was a lot like Pyro‘s Mobile Code feature. We actually use Pyro mobile code quite a bit at work, within the context of our own distributed system running across machines we maintain.

After getting beta access I decided to check it out today. I spent about 15 minutes playing around with it, and decided to do a short writeup because there’s so little info out there. The short version is that technically it’s quite impressive. Simple, but more complicated than square(x) cases are as easy as they say. Information about PiCloud is in pretty short supply, so here’s my playing around reproduced for all to see.

Installing/first using

This is pretty easy. I’m using a virtualenv because I was skeptical, but it’s neat how easy it is even with that. So I’m going to setup a virtualenv, install ipython to the virtualenv, then install the cloud egg. At the end I’ll add my api key to the ~/.picloud/cloudconf.py file so I don’t need to type it repeatedly. The file is created when you first import cloud, and is very straightforward.

chmullig@gore:~$ virtualenv picloud
New python executable in picloud/bin/python
Installing setuptools............done.
chmullig@gore:~$ source picloud/bin/activate
(picloud)chmullig@gore:~$ easy_install -U ipython
Searching for ipython
#snip
Processing ipython-0.10-py2.6.egg
creating /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
Extracting ipython-0.10-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
Adding ipython 0.10 to easy-install.pth file
Installing iptest script to /home/chmullig/picloud/bin
Installing ipythonx script to /home/chmullig/picloud/bin
Installing ipcluster script to /home/chmullig/picloud/bin
Installing ipython script to /home/chmullig/picloud/bin
Installing pycolor script to /home/chmullig/picloud/bin
Installing ipcontroller script to /home/chmullig/picloud/bin
Installing ipengine script to /home/chmullig/picloud/bin

Installed /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
Processing dependencies for ipython
Finished processing dependencies for ipython
(picloud)chmullig@gore:~$ easy_install http://server/cloud-1.8.2-py2.6.egg
Downloading http://server/cloud-1.8.2-py2.6.egg
Processing cloud-1.8.2-py2.6.egg
creating /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
Extracting cloud-1.8.2-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
Adding cloud 1.8.2 to easy-install.pth file

Installed /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
Processing dependencies for cloud==1.8.2
Finished processing dependencies for cloud==1.8.2
(picloud)chmullig@gore:~$ python -c 'import cloud' #to create the ~/.picloud directory
(picloud)chmullig@gore:~$ vim .picloud/cloudconf.py #to add api_key and api_secretkey

Trivial Examples

This is their trivial example, just to prove it’s as easy for me as it was for them.

In [1]: def square(x):
 ...:     return x**2
 ...:
In [2]: import cloud
In [3]: cid = cloud.call(square, 10)
In [4]: cloud.result(cid)
Out[4]: 100

BAM! That’s just stupidly easy. Let’s try a module or two.

In [5]: import random
In [6]: def shuffler(x):
 ...:     xl = list(x)
 ...:     random.shuffle(xl)
 ...:     return ''.join(xl)
 ...:
In [8]: cid = cloud.call(shuffler, 'Welcome to chmullig.com')
In [9]: cloud.result(cid)
Out[9]: ' etcmmhmoeWll.cgcl uioo'

Less-Trivial Example & Packages

So that’s neat, but what about something I wrote, or something that’s off pypi that they don’t already have installed? Also quite easy. I’m going to be using  Levenshtein edit distance for this, because it’s simple but non-standard. For our purposes we’ll begin with a pure python implementation, borrowed from Magnus Lie. Then we’ll switch to a C extension version, originally written by David Necas (Yeti), which I’ve rehosted on Google Code.

(picloud)chmullig@gore:~$ wget -O hetlev.py http://hetland.org/coding/python/levenshtein.py
#snip
2010-03-09 12:13:04 (79.2 KB/s) - `hetlev.py' saved [707/707]
(picloud)chmullig@gore:~$ easy_install http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
Downloading http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
Processing python-Levenshtein-0.10.1.tar.bz2
Running python-Levenshtein-0.10.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mqtK2d/python-Levenshtein-0.10.1/egg-dist-tmp-igxMyM
zip_safe flag not set; analyzing archive contents...
Adding python-Levenshtein 0.10.1 to easy-install.pth file

Installed /home/chmullig/picloud/lib/python2.6/site-packages/python_Levenshtein-0.10.1-py2.6-linux-x86_64.egg
Processing dependencies for python-Levenshtein==0.10.1
Finished processing dependencies for python-Levenshtein==0.10.1
(picloud)chmullig@gore:~$

Now both are installed locally and built. Beautiful. Let’s go ahead and test out the hetlev version.

In [18]: def distances(word, comparisonWords):
 ....:     results = []
 ....:     for otherWord in comparisonWords:
 ....:         results.append(hetlev.levenshtein(word, otherWord))
 ....:     return results
In [24]: zip(words, distances(word, words))
Out[24]:
[('kitten', 0),
 ('sitten', 1),
 ('sittin', 2),
 ('sitting', 3),
 ('cat', 5),
 ('kitty', 2),
 ('smitten', 2)]

Now let’s put that up on PiCloud! It’s, uh, trivial. And fast.

In [25]: cid = cloud.call(distances, word, words)
In [26]: zip(words, cloud.result(cid))
Out[26]:
[('kitten', 0),
 ('sitten', 1),
 ('sittin', 2),
 ('sitting', 3),
 ('cat', 5),
 ('kitty', 2),
 ('smitten', 2)]

Now let’s switch it to use the C extension version of edit distance from the PyLevenshtein package, and try to use it with PiCloud.

In [32]: import Levenshtein
In [33]: def cdistances(word, comparisonWords):
 results = []
 for otherword in comparisonWords:
 results.append(Levenshtein.distance(word, otherword))
 return results
 ....:
In [38]: zip(words, cdistances(word, words))
Out[38]:
[('kitten', 0),
 ('sitten', 1),
 ('sittin', 2),
 ('sitting', 3),
 ('cat', 5),
 ('kitty', 2),
 ('smitten', 2)]

In [39]: cid = cloud.call(cdistances, word, words)
In [40]: cloud.result(cid)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (30, 0))

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (37, 0))
---------------------------------------------------------------------------
CloudException                            Traceback (most recent call last)
CloudException: Job 14:
 Could not depickle job
Traceback (most recent call last):
 File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
 File "/usr/local/lib/python2.6/dist-packages/cloud/serialization/cloudpickle.py", line 501, in subimport
 __import__(name)
ImportError: ('No module named Levenshtein', <function subimport at 0x2290ed8>, ('Levenshtein',))

Installing C-Extension via web

Not too surprisingly that didn’t work – Levenshtein is a C extension I built on my local machine. PiCloud doesn’t really make it obvious, but you can add C-Extensions via their web interface. Amazingly you can point it to an SVN repo and it will let you refresh it. It seems to download and call setup.py install, but it’s a little unclear. The fact is it just worked, so I didn’t care. I clicked on “Add Repository” and pasted in the URL from google code,  http://pylevenshtein.googlecode.com/svn/trunk. It built it and installed it, you can see the output on the right. I then just reran the exact same command and it works.

In [41]: cid = cloud.call(cdistances, word, words)

In [42]: cloud.result(cid)
Out[42]: [0, 1, 2, 3, 5, 2, 2]

In [43]: zip(words, cloud.result(cid))
Out[43]:
[('kitten', 0),
 ('sitten', 1),
 ('sittin', 2),
 ('sitting', 3),
 ('cat', 5),
 ('kitty', 2),
 ('smitten', 2)]

Slightly more complicated

I’ve written a slightly more complicated script that fetches the qwantzle corpus and uses jaro distance to find the n closest words in the corpus to a given word. It’s pretty trivial and dumb, but definitely more complicated than the above examples. Below is closestwords.py

import Levenshtein
import urllib
 
class Corpusinator(object):
        '''
        Finds the closest words to the word you specified.
        '''
        def __init__(self, corpus='http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus'):
                '''Setup the corpus for later use. By default it uses
                http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus, but can by overridden
                by specifying an alternate URL that has one word per line. A number, a space, then the word.
                '''
                raw = urllib.urlopen('http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus').readlines()
                self.corpus = set()
                for line in raw:
                        try:
                                self.corpus.add(line.split()[1])
                        except IndexError:
                                pass
 
        def findClosestWords(self, words, n=10):
                '''
                Return the n (default 10) closest words from the corpus.
                '''
                results = {}
                for word in words:
                        tempresults = []
                        for refword in self.corpus:
                                dist = Levenshtein.jaro(word, refword)
                                tempresults.append((dist, refword))
                        tempresults = sorted(tempresults, reverse=True)
                        results[word] = tempresults[:n]
                return results

Very simple. Let’s try ‘er out. First locally, then over the cloud.

In [1]: import closestwords

In [2]: c = closestwords.Corpusinator()

In [3]: c.findClosestWords(['bagel', 'cheese'], 5)
Out[3]:
{'bagel': [(0.8666666666666667, 'barge'),
           (0.8666666666666667, 'bag'),
           (0.8666666666666667, 'badge'),
           (0.8666666666666667, 'angel'),
           (0.8666666666666667, 'age')],
 'cheese': [(1.0, 'cheese'),
            (0.95238095238095244, 'cheesed'),
            (0.88888888888888895, 'cheers'),
            (0.88888888888888895, 'cheeks'),
            (0.8666666666666667, 'cheeseball')]}

Unfortunately it just doesn’t want to work happily with PiCloud & ipython when you’re running import closestwords. The most obvious won’t work, cloud.call(c.findClosestWords, [‘bagel’]). Neither will creating a tiny wrapper function and calling that within ipython:

def caller(words, n=10):
    c = closestwords.Corpusinator()
    return c.findClosestWords(words, n)
cloud.call(caller, ['bagel'])

I created a stupidly simple wrapper python file, wrap.py:

import closestwords
import cloud
cid = cloud.call(closestwords.caller, ['bagel',])
print cloud.result(cid)

That gives an import error. Even putting that caller wrapper above at the bottom of the closestwords.py and calling it in the __main__ section (as I do below with c.findClosestWords) didn’t work.

However if I stick it directly in closestwords.py, initialize the instance, then run it from there, everything is fine. I’m not sure what this means, if it’s supposed to happen, or what. But it seems like it could be a pain in the butt just to get  it calling the right function in the right context.

if __name__ == '__main__':
        import cloud
        c = Corpusinator()
        cid = cloud.call(c.findClosestWords, ['bagel',])
        print cloud.result(cid)

What passes for a conclusion

I had a good time playing with PiCloud. I’m going to look at adapting real code to use it. If I get carried away AND feel like blogging I’ll be sure to post ‘er up. They have pretty good first tier support for the map part of map/reduce, which would be useful. Two links I found useful when working with PiCloud:


Update 3/11

Aaron Staley of PiCloud wrote me a nice email about this post. He says my problem with the closestwords example was due to a server side bug they’ve fixed. In playing around, it does seem a bit better. A few ways I tried to call it failed, but many of them worked. I had trouble passing in closestwords.caller, either in ipython or the wrapper script. However re-defining caller in ipython worked, as did creating an instance and passing in the instance’s findClosestWords function. A+ for communication, guys.

In [3]: cid = cloud.call(closestwords.caller, ['bagel'])

In [4]: cloud.result(cid)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (30, 0))

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (37, 0))
---------------------------------------------------------------------------CloudException: Job 36: Could not depickle job
Traceback (most recent call last):
 File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
AttributeError: 'module' object has no attribute 'caller'

In [8]: c = closestwords.Corpusinator()
In [9]: cid = cloud.call(c.findClosestWords, ['bagel', 'cheese'], 5)
In [10]: cloud.result(cid)
Out[10]:
{'bagel': [(0.8666666666666667, 'barge'),
 (0.8666666666666667, 'bag'),
 (0.8666666666666667, 'badge'),
 (0.8666666666666667, 'angel'),
 (0.8666666666666667, 'age')],
 'cheese': [(1.0, 'cheese'),
 (0.95238095238095244, 'cheesed'),
 (0.88888888888888895, 'cheers'),
 (0.88888888888888895, 'cheeks'),
 (0.8666666666666667, 'cheeseball')]}

In [11]: def caller(words, n=10):
 ....:     c = closestwords.Corpusinator()
 ....:     return c.findClosestWords(words, n)
 ....:
In [12]: cid = cloud.call(caller, ['bagel'])
In [13]: reload(closestword)
KeyboardInterrupt
In [13]: cloud.result(cid)
Out[13]:
{'bagel': [(0.8666666666666667, 'barge'),
 (0.8666666666666667, 'bag'),
 (0.8666666666666667, 'badge'),
 (0.8666666666666667, 'angel'),
 (0.8666666666666667, 'age'),
 (0.8222222222222223, 'barrel'),
 (0.8222222222222223, 'barely'),
 (0.81111111111111123, 'gamble'),
 (0.79047619047619044, 'vaguely'),
 (0.79047619047619044, 'largely')]}

Update 3/12

I did some more experimentation with PiCloud, posted separately.