A couple of weeks ago my coworker mentioned PiCloud. It claims to be “Cloud Computing. Simplified.” for Python programming. Indeed, their trivial examples look almost too good to be true. I pointed out that the way it packages up code to send over the wire is a lot like Pyro’s Mobile Code feature. We actually use Pyro mobile code quite a bit at work, within the context of our own distributed system running across machines we maintain.
After getting beta access I decided to check it out today. I spent about 15 minutes playing around with it, and decided to do a short writeup because there’s so little info out there. The short version is that technically it’s quite impressive. Cases that are simple, but more complicated than square(x), are as easy as they say. Here’s my playing around, reproduced for all to see.
Installation is pretty easy. I’m using a virtualenv because I was skeptical, but it’s neat how easy it is even with that. So I’m going to set up a virtualenv, install ipython into it, then install the cloud egg. At the end I’ll add my API key to the ~/.picloud/cloudconf.py file so I don’t need to type it repeatedly. The file is created when you first import cloud, and is very straightforward.
    chmullig@gore:~$ virtualenv picloud
    New python executable in picloud/bin/python
    Installing setuptools............done.
    chmullig@gore:~$ source picloud/bin/activate
    (picloud)chmullig@gore:~$ easy_install -U ipython
    Searching for ipython
    #snip
    Processing ipython-0.10-py2.6.egg
    creating /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
    Extracting ipython-0.10-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
    Adding ipython 0.10 to easy-install.pth file
    Installing iptest script to /home/chmullig/picloud/bin
    Installing ipythonx script to /home/chmullig/picloud/bin
    Installing ipcluster script to /home/chmullig/picloud/bin
    Installing ipython script to /home/chmullig/picloud/bin
    Installing pycolor script to /home/chmullig/picloud/bin
    Installing ipcontroller script to /home/chmullig/picloud/bin
    Installing ipengine script to /home/chmullig/picloud/bin
    Installed /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
    Processing dependencies for ipython
    Finished processing dependencies for ipython
    (picloud)chmullig@gore:~$ easy_install http://server/cloud-1.8.2-py2.6.egg
    Downloading http://server/cloud-1.8.2-py2.6.egg
    Processing cloud-1.8.2-py2.6.egg
    creating /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
    Extracting cloud-1.8.2-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
    Adding cloud 1.8.2 to easy-install.pth file
    Installed /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
    Processing dependencies for cloud==1.8.2
    Finished processing dependencies for cloud==1.8.2
    (picloud)chmullig@gore:~$ python -c 'import cloud' #to create the ~/.picloud directory
    (picloud)chmullig@gore:~$ vim .picloud/cloudconf.py #to add api_key and api_secretkey
This is their trivial example, just to prove it’s as easy for me as it was for them.
    In [1]: def square(x):
       ...:     return x**2
       ...:

    In [2]: import cloud

    In [3]: cid = cloud.call(square, 10)

    In [4]: cloud.result(cid)
    Out[4]: 100
BAM! That’s just stupidly easy. Let’s try a module or two.
    In [5]: import random

    In [6]: def shuffler(x):
       ...:     xl = list(x)
       ...:     random.shuffle(xl)
       ...:     return ''.join(xl)
       ...:

    In [8]: cid = cloud.call(shuffler, 'Welcome to chmullig.com')

    In [9]: cloud.result(cid)
    Out[9]: ' etcmmhmoeWll.cgcl uioo'
So that’s neat, but what about something I wrote, or something off PyPI that they don’t already have installed? Also quite easy. I’m going to be using Levenshtein edit distance for this, because it’s simple but non-standard. For our purposes we’ll begin with a pure Python implementation, borrowed from Magnus Lie Hetland. Then we’ll switch to a C extension version, originally written by David Necas (Yeti), which I’ve rehosted on Google Code.
    (picloud)chmullig@gore:~$ wget -O hetlev.py http://hetland.org/coding/python/levenshtein.py
    #snip
    2010-03-09 12:13:04 (79.2 KB/s) - `hetlev.py' saved [707/707]
    (picloud)chmullig@gore:~$ easy_install http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
    Downloading http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
    Processing python-Levenshtein-0.10.1.tar.bz2
    Running python-Levenshtein-0.10.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mqtK2d/python-Levenshtein-0.10.1/egg-dist-tmp-igxMyM
    zip_safe flag not set; analyzing archive contents...
    Adding python-Levenshtein 0.10.1 to easy-install.pth file
    Installed /home/chmullig/picloud/lib/python2.6/site-packages/python_Levenshtein-0.10.1-py2.6-linux-x86_64.egg
    Processing dependencies for python-Levenshtein==0.10.1
    Finished processing dependencies for python-Levenshtein==0.10.1
    (picloud)chmullig@gore:~$
Now both are installed locally and built. Beautiful. Let’s go ahead and test out the hetlev version.
    In [18]: def distances(word, comparisonWords):
       ....:     results = []
       ....:     for otherWord in comparisonWords:
       ....:         results.append(hetlev.levenshtein(word, otherWord))
       ....:     return results

    In [24]: zip(words, distances(word, words))
    Out[24]:
    [('kitten', 0),
     ('sitten', 1),
     ('sittin', 2),
     ('sitting', 3),
     ('cat', 5),
     ('kitty', 2),
     ('smitten', 2)]
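If you don’t want to fetch hetlev.py, here is a minimal sketch of the same kind of edit distance. It is a standard two-row dynamic-programming version, not Hetland’s actual code, and the `word` and `words` values are inferred from the session output above rather than copied from input I showed. It is written to run on current Python, not just the 2.6 I used.

```python
def levenshtein(a, b):
    """Edit distance between a and b; insert, delete and substitute each cost 1."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def distances(word, comparison_words):
    """Same shape as the distances() helper in the session above."""
    return [levenshtein(word, other) for other in comparison_words]


# Assumed inputs, reconstructed from the output shown above.
word = 'kitten'
words = ['kitten', 'sitten', 'sittin', 'sitting', 'cat', 'kitty', 'smitten']
print(list(zip(words, distances(word, words))))
```

The list() around zip only matters on Python 3, where zip returns an iterator.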
Now let’s put that up on PiCloud! It’s, uh, trivial. And fast.
    In [25]: cid = cloud.call(distances, word, words)

    In [26]: zip(words, cloud.result(cid))
    Out[26]:
    [('kitten', 0),
     ('sitten', 1),
     ('sittin', 2),
     ('sitting', 3),
     ('cat', 5),
     ('kitty', 2),
     ('smitten', 2)]
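The pattern in these sessions is always the same: cloud.call hands back a job id, and cloud.result turns that id into a value. To make that shape concrete without a PiCloud account, here is a purely local stand-in built on the standard library. The `LocalJobs` class and its method names are my own invention for illustration. It runs jobs in threads on your own machine and says nothing about how PiCloud actually implements this.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools


class LocalJobs(object):
    """Hypothetical local stand-in: submit a call, get an integer id, fetch the result later."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._ids = itertools.count(1)
        self._futures = {}

    def call(self, func, *args, **kwargs):
        # Schedule func(*args, **kwargs) and return a job id immediately.
        jid = next(self._ids)
        self._futures[jid] = self._pool.submit(func, *args, **kwargs)
        return jid

    def result(self, jid):
        # Blocks until the job finishes; re-raises any exception raised by func.
        return self._futures[jid].result()


def square(x):
    return x ** 2


jobs = LocalJobs()
jid = jobs.call(square, 10)
print(jobs.result(jid))
```

The real service obviously also has to ship the function and its arguments to another machine, which is where the trouble in the next sections comes from.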
Now let’s switch it to use the C extension version of edit distance from the PyLevenshtein package, and try to use it with PiCloud.
    In [32]: import Levenshtein

    In [33]: def cdistances(word, comparisonWords):
       ....:     results = []
       ....:     for otherword in comparisonWords:
       ....:         results.append(Levenshtein.distance(word, otherword))
       ....:     return results
       ....:

    In [38]: zip(words, cdistances(word, words))
    Out[38]:
    [('kitten', 0),
     ('sitten', 1),
     ('sittin', 2),
     ('sitting', 3),
     ('cat', 5),
     ('kitty', 2),
     ('smitten', 2)]

    In [39]: cid = cloud.call(cdistances, word, words)

    In [40]: cloud.result(cid)
    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line statement', (30, 0))

    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line statement', (37, 0))

    ---------------------------------------------------------------------------
    CloudException                            Traceback (most recent call last)

    CloudException: Job 14: Could not depickle job
    Traceback (most recent call last):
      File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
      File "/usr/local/lib/python2.6/dist-packages/cloud/serialization/cloudpickle.py", line 501, in subimport
        __import__(name)
    ImportError: ('No module named Levenshtein', <function subimport at 0x2290ed8>, ('Levenshtein',))
Not too surprisingly that didn’t work, since Levenshtein is a C extension I built on my local machine. PiCloud doesn’t really make it obvious, but you can add C extensions via their web interface. Amazingly, you can point it at an SVN repo and it will let you refresh it. It seems to download the code and call setup.py install, but it’s a little unclear. The fact is it just worked, so I didn’t care. I clicked on “Add Repository” and pasted in the URL from Google Code, http://pylevenshtein.googlecode.com/svn/trunk. It built and installed it, and you can see the output on the right. I then just reran the exact same command and it worked.
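The traceback shows cloudpickle’s subimport calling `__import__(name)` on the worker, which suggests the module is sent by name and has to be importable on the other end. The standard library’s pickle treats functions from importable modules the same way, so here is a small local sketch of that failure mode. The module name `fakelev` is made up, and this uses plain pickle rather than PiCloud’s serializer, so treat it as an analogy and not a reproduction.

```python
import pickle
import sys
import types


def distance(a, b):
    """Placeholder body; only the function's name and module matter here."""
    return abs(len(a) - len(b))


# Pretend `distance` lives in a module called "fakelev" (a made-up name).
fakelev = types.ModuleType('fakelev')
fakelev.distance = distance
distance.__module__ = 'fakelev'
sys.modules['fakelev'] = fakelev

# The payload records the module and function names, not the function's code.
payload = pickle.dumps(distance)

# Simulate a worker that never installed the module.
del sys.modules['fakelev']

try:
    pickle.loads(payload)
    error = None
except ImportError as exc:
    error = str(exc)

print(error)
```

Installing the package on the worker side, which is what the “Add Repository” step did for me, is the equivalent of putting `fakelev` back where the unpickler can import it.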
    In [41]: cid = cloud.call(cdistances, word, words)

    In [42]: cloud.result(cid)
    Out[42]: [0, 1, 2, 3, 5, 2, 2]

    In [43]: zip(words, cloud.result(cid))
    Out[43]:
    [('kitten', 0),
     ('sitten', 1),
     ('sittin', 2),
     ('sitting', 3),
     ('cat', 5),
     ('kitty', 2),
     ('smitten', 2)]
I’ve written a slightly more complicated script that fetches the qwantzle corpus and uses Jaro distance to find the n closest words in the corpus to a given word. It’s pretty trivial and dumb, but definitely more complicated than the above examples. Below is closestwords.py:
    import Levenshtein
    import urllib


    class Corpusinator(object):
        '''
        Finds the closest words to the word you specified.
        '''

        def __init__(self, corpus='http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus'):
            '''Set up the corpus for later use. By default it uses
            http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus, but can be
            overridden by specifying an alternate URL that has one word per
            line: a number, a space, then the word.
            '''
            # Fetch from the URL that was passed in, not a hard-coded one.
            raw = urllib.urlopen(corpus).readlines()
            self.corpus = set()
            for line in raw:
                try:
                    self.corpus.add(line.split()[1])
                except IndexError:
                    # Skip blank or malformed lines.
                    pass

        def findClosestWords(self, words, n=10):
            '''
            Return the n (default 10) closest words from the corpus.
            '''
            results = {}
            for word in words:
                tempresults = []
                for refword in self.corpus:
                    dist = Levenshtein.jaro(word, refword)
                    tempresults.append((dist, refword))
                # jaro() is a similarity (1.0 = identical), so sort descending.
                tempresults = sorted(tempresults, reverse=True)
                results[word] = tempresults[:n]
            return results
Very simple. Let’s try ‘er out. First locally, then over the cloud.
    In [1]: import closestwords

    In [2]: c = closestwords.Corpusinator()

    In [3]: c.findClosestWords(['bagel', 'cheese'], 5)
    Out[3]:
    {'bagel': [(0.8666666666666667, 'barge'),
               (0.8666666666666667, 'bag'),
               (0.8666666666666667, 'badge'),
               (0.8666666666666667, 'angel'),
               (0.8666666666666667, 'age')],
     'cheese': [(1.0, 'cheese'),
                (0.95238095238095244, 'cheesed'),
                (0.88888888888888895, 'cheers'),
                (0.88888888888888895, 'cheeks'),
                (0.8666666666666667, 'cheeseball')]}
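If you want to poke at the same idea without the C extension or a network fetch, here is a self-contained sketch. Everything in it is an assumption for illustration: the sample lines are made up in the “number, space, word” format, `difflib.SequenceMatcher.ratio()` stands in for `Levenshtein.jaro` (both give a score between 0 and 1, but they are different metrics and will rank words differently), and `heapq.nlargest` replaces sorting the whole list and slicing it.

```python
import difflib
import heapq

# Made-up sample in the corpus format: a number, a space, then the word.
SAMPLE_LINES = ['12 bag', '7 badge', '3 angel', '9 cheese', '1 cheers', '', '4 barge']


def parse_corpus(lines):
    """Collect the second whitespace-separated field of each line, skipping short lines."""
    words = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            words.add(parts[1])
    return words


def similarity(a, b):
    # Stand-in for Levenshtein.jaro; difflib's ratio is a different metric.
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, corpus, n=3):
    """Return the n highest-scoring (score, word) pairs, best first."""
    scored = ((similarity(word, ref), ref) for ref in corpus)
    return heapq.nlargest(n, scored)


corpus = parse_corpus(SAMPLE_LINES)
print(closest('cheese', corpus, n=2))
```

For a corpus this small the difference is academic, but nlargest avoids sorting every word when you only want the top handful.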
Unfortunately it just doesn’t want to work happily with PiCloud and ipython when you’re running import closestwords. The most obvious approach, cloud.call(c.findClosestWords, ['bagel']), won’t work. Neither will creating a tiny wrapper function and calling that within ipython:

    def caller(words, n=10):
        c = closestwords.Corpusinator()
        return c.findClosestWords(words, n)

    cloud.call(caller, ['bagel'])
I created a stupidly simple wrapper python file, wrap.py:
    import closestwords
    import cloud

    cid = cloud.call(closestwords.caller, ['bagel',])
    print cloud.result(cid)
That gives an import error. Even putting that caller wrapper above at the bottom of the closestwords.py and calling it in the __main__ section (as I do below with c.findClosestWords) didn’t work.
However if I stick it directly in closestwords.py, initialize the instance, then run it from there, everything is fine. I’m not sure what this means, if it’s supposed to happen, or what. But it seems like it could be a pain in the butt just to get it calling the right function in the right context.
    if __name__ == '__main__':
        import cloud
        c = Corpusinator()
        cid = cloud.call(c.findClosestWords, ['bagel',])
        print cloud.result(cid)
I had a good time playing with PiCloud. I’m going to look at adapting real code to use it. If I get carried away AND feel like blogging I’ll be sure to post ‘er up. They have pretty good first tier support for the map part of map/reduce, which would be useful. Two links I found useful when working with PiCloud:
Aaron Staley of PiCloud wrote me a nice email about this post. He says my problem with the closestwords example was due to a server-side bug they’ve fixed. In playing around, it does seem a bit better. A few ways I tried to call it failed, but many of them worked. I had trouble passing in closestwords.caller, either in ipython or the wrapper script. However, re-defining caller in ipython worked, as did creating an instance and passing in the instance’s findClosestWords method. A+ for communication, guys.
    In [3]: cid = cloud.call(closestwords.caller, ['bagel'])

    In [4]: cloud.result(cid)
    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line statement', (30, 0))

    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line statement', (37, 0))

    ---------------------------------------------------------------------------
    CloudException: Job 36: Could not depickle job
    Traceback (most recent call last):
      File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
    AttributeError: 'module' object has no attribute 'caller'

    In [8]: c = closestwords.Corpusinator()

    In [9]: cid = cloud.call(c.findClosestWords, ['bagel', 'cheese'], 5)

    In [10]: cloud.result(cid)
    Out[10]:
    {'bagel': [(0.8666666666666667, 'barge'),
               (0.8666666666666667, 'bag'),
               (0.8666666666666667, 'badge'),
               (0.8666666666666667, 'angel'),
               (0.8666666666666667, 'age')],
     'cheese': [(1.0, 'cheese'),
                (0.95238095238095244, 'cheesed'),
                (0.88888888888888895, 'cheers'),
                (0.88888888888888895, 'cheeks'),
                (0.8666666666666667, 'cheeseball')]}

    In [11]: def caller(words, n=10):
       ....:     c = closestwords.Corpusinator()
       ....:     return c.findClosestWords(words, n)
       ....:

    In [12]: cid = cloud.call(caller, ['bagel'])

    In [13]: reload(closestword)
    KeyboardInterrupt

    In [13]: cloud.result(cid)
    Out[13]:
    {'bagel': [(0.8666666666666667, 'barge'),
               (0.8666666666666667, 'bag'),
               (0.8666666666666667, 'badge'),
               (0.8666666666666667, 'angel'),
               (0.8666666666666667, 'age'),
               (0.8222222222222223, 'barrel'),
               (0.8222222222222223, 'barely'),
               (0.81111111111111123, 'gamble'),
               (0.79047619047619044, 'vaguely'),
               (0.79047619047619044, 'largely')]}
I did some more experimentation with PiCloud, posted separately.