Second was adding multiprocessing. I figured this would be easy, but the way I originally wrote the code (the function required both an element and a penalty matrix) meant that plain multiprocessing.Pool.map() wasn’t working, since map() hands each call a single argument. I wrapped the inputs up with itertools.izip(iterator, itertools.repeat(matrix)), but that yields a tuple, which the function can’t accept directly. It turns out that this wrapper, basically calculatestar from the Pool example in the docs, is a godsend:
def wrapper(args): return func(*args)
So that got me down to 8 minutes on my laptop. However, the cool part is that on an 8-core server I was down to only 1 minute, 30 seconds. Those are the sorts of iteration times I can deal with.
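A minimal sketch of the wrapper pattern described above, not the actual code from this project. The `score` function and the `penalties` dict are made-up stand-ins for the real function and penalty matrix, and the sketch targets Python 3, where the built-in zip is lazy and takes the place of itertools.izip.

```python
import itertools
from multiprocessing import Pool


def score(element, penalties):
    """Hypothetical worker: add up the penalty of each character in element."""
    return sum(penalties.get(ch, 0) for ch in element)


def wrapper(args):
    """Unpack one (element, penalties) tuple into score's two arguments."""
    return score(*args)


def score_all(elements, penalties, processes=2):
    """Score every element in a worker pool, sharing one penalties object."""
    # Pair each element with the same penalties; repeat() never runs out,
    # so zip stops when elements does.
    jobs = zip(elements, itertools.repeat(penalties))
    with Pool(processes) as pool:
        # Pool.map passes one argument per call, hence the wrapper.
        return pool.map(wrapper, jobs)
```

Call score_all from under an `if __name__ == '__main__':` guard so worker processes can import the module safely. On Python 3.3 and later, Pool.starmap does the tuple unpacking itself, so the wrapper is only needed on older versions.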
Then I decided to try PiCloud. After trying it out earlier this week I thought it would be interesting to test it on a real problem that scales linearly with more cores. They advertise it in the docs, so I figured maybe it would be useful and even faster. Not so fast. The PiCloud version was easy to write once I had multiprocessing working, but the first version literally crashed my laptop. I later figured out that the “naive” way to write it made it suck up all the RAM on the system. After less than 5 minutes I killed it with 1.7GB/2GB consumed. Running it on the aforementioned 8-core/32GB server had it consume 5GB before it finally crashed with an HTTP 500 error. I posted in the forums and got some advice, but still can’t get it working. (Read that short thread for the rest of the story.) This seems like exactly what they should be nailing, but so far they’re coming up empty.
After getting beta access to PiCloud I decided to check it out today. I spent about 15 minutes playing around with it, and decided to do a short writeup because there’s so little info out there. The short version is that technically it’s quite impressive. Simple cases, even ones more complicated than square(x), are as easy as they say. Here’s my playing around, reproduced for all to see.
This is pretty easy. I’m using a virtualenv because I was skeptical, but it’s neat how easy it is even with that. So I’m going to set up a virtualenv, install ipython into it, then install the cloud egg. At the end I’ll add my API key to the ~/.picloud/cloudconf.py file so I don’t need to type it repeatedly. The file is created when you first import cloud, and is very straightforward.
chmullig@gore:~$ virtualenv picloud
New python executable in picloud/bin/python
Installing setuptools............done.
chmullig@gore:~$ source picloud/bin/activate
(picloud)chmullig@gore:~$ easy_install -U ipython
Searching for ipython
#snip
Processing ipython-0.10-py2.6.egg
creating /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
Extracting ipython-0.10-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
Adding ipython 0.10 to easy-install.pth file
Installing iptest script to /home/chmullig/picloud/bin
Installing ipythonx script to /home/chmullig/picloud/bin
Installing ipcluster script to /home/chmullig/picloud/bin
Installing ipython script to /home/chmullig/picloud/bin
Installing pycolor script to /home/chmullig/picloud/bin
Installing ipcontroller script to /home/chmullig/picloud/bin
Installing ipengine script to /home/chmullig/picloud/bin

Installed /home/chmullig/picloud/lib/python2.6/site-packages/ipython-0.10-py2.6.egg
Processing dependencies for ipython
Finished processing dependencies for ipython
(picloud)chmullig@gore:~$ easy_install http://server/cloud-1.8.2-py2.6.egg
Downloading http://server/cloud-1.8.2-py2.6.egg
Processing cloud-1.8.2-py2.6.egg
creating /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
Extracting cloud-1.8.2-py2.6.egg to /home/chmullig/picloud/lib/python2.6/site-packages
Adding cloud 1.8.2 to easy-install.pth file

Installed /home/chmullig/picloud/lib/python2.6/site-packages/cloud-1.8.2-py2.6.egg
Processing dependencies for cloud==1.8.2
Finished processing dependencies for cloud==1.8.2
(picloud)chmullig@gore:~$ python -c 'import cloud' #to create the ~/.picloud directory
(picloud)chmullig@gore:~$ vim .picloud/cloudconf.py #to add api_key and api_secretkey
This is their trivial example, just to prove it’s as easy for me as it was for them.
In [1]: def square(x):
   ...:     return x**2
   ...:

In [2]: import cloud

In [3]: cid = cloud.call(square, 10)

In [4]: cloud.result(cid)
Out[4]: 100
BAM! That’s just stupidly easy. Let’s try a module or two.
In [5]: import random

In [6]: def shuffler(x):
   ...:     xl = list(x)
   ...:     random.shuffle(xl)
   ...:     return ''.join(xl)
   ...:

In [8]: cid = cloud.call(shuffler, 'Welcome to chmullig.com')

In [9]: cloud.result(cid)
Out[9]: ' etcmmhmoeWll.cgcl uioo'
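Both sessions above follow the same shape: submit a function with its arguments, get back a job id, then ask for the result by id. For anyone without a PiCloud account, here is a rough local stand-in that mimics only that shape. `LocalCloud` is a made-up class for illustration, not part of the cloud library; it runs jobs on local threads and does no serialization, so it says nothing about how PiCloud ships code to its servers. Python 3.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class LocalCloud(object):
    """Hypothetical local stand-in for the call/result pattern shown above."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._ids = itertools.count(1)
        self._jobs = {}

    def call(self, func, *args):
        """Submit func(*args) and return an integer job id right away."""
        cid = next(self._ids)
        self._jobs[cid] = self._pool.submit(func, *args)
        return cid

    def result(self, cid):
        """Block until the job with this id finishes, then return its value."""
        return self._jobs[cid].result()


def square(x):
    return x ** 2


local = LocalCloud()
cid = local.call(square, 10)
print(local.result(cid))  # 100
```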
So that’s neat, but what about something I wrote, or something off PyPI that they don’t already have installed? Also quite easy. I’m going to be using Levenshtein edit distance for this, because it’s simple but non-standard. For our purposes we’ll begin with a pure Python implementation, borrowed from Magnus Lie Hetland. Then we’ll switch to a C extension version, originally written by David Necas (Yeti), which I’ve rehosted on Google Code.
(picloud)chmullig@gore:~$ wget -O hetlev.py http://hetland.org/coding/python/levenshtein.py
#snip
2010-03-09 12:13:04 (79.2 KB/s) - `hetlev.py' saved [707/707]

(picloud)chmullig@gore:~$ easy_install http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
Downloading http://pylevenshtein.googlecode.com/files/python-Levenshtein-0.10.1.tar.bz2
Processing python-Levenshtein-0.10.1.tar.bz2
Running python-Levenshtein-0.10.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mqtK2d/python-Levenshtein-0.10.1/egg-dist-tmp-igxMyM
zip_safe flag not set; analyzing archive contents...
Adding python-Levenshtein 0.10.1 to easy-install.pth file

Installed /home/chmullig/picloud/lib/python2.6/site-packages/python_Levenshtein-0.10.1-py2.6-linux-x86_64.egg
Processing dependencies for python-Levenshtein==0.10.1
Finished processing dependencies for python-Levenshtein==0.10.1
(picloud)chmullig@gore:~$
Now both are installed locally and built. Beautiful. Let’s go ahead and test out the hetlev version.
In [18]: def distances(word, comparisonWords):
   ....:     results = []
   ....:     for otherWord in comparisonWords:
   ....:         results.append(hetlev.levenshtein(word, otherWord))
   ....:     return results

In [24]: zip(words, distances(word, words))
Out[24]: [('kitten', 0), ('sitten', 1), ('sittin', 2), ('sitting', 3), ('cat', 5), ('kitty', 2), ('smitten', 2)]
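If you don’t have hetlev.py handy, the distances above can be reproduced with a short dynamic-programming edit distance. This is a generic textbook sketch for Python 3, not Hetland’s implementation. The word list is read off the output above, and using 'kitten' as the reference word is inferred from its distance of 0.

```python
def levenshtein(a, b):
    """Edit distance: fewest single-character inserts, deletes, or substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def distances(word, comparison_words):
    """Distance from word to each entry of comparison_words, in order."""
    return [levenshtein(word, other) for other in comparison_words]


words = ['kitten', 'sitten', 'sittin', 'sitting', 'cat', 'kitty', 'smitten']
print(list(zip(words, distances('kitten', words))))
```

It keeps only two rows of the table at a time, so memory grows with the length of the second word rather than with the product of both lengths.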
Now let’s put that up on PiCloud! It’s, uh, trivial. And fast.
In [25]: cid = cloud.call(distances, word, words)

In [26]: zip(words, cloud.result(cid))
Out[26]: [('kitten', 0), ('sitten', 1), ('sittin', 2), ('sitting', 3), ('cat', 5), ('kitty', 2), ('smitten', 2)]
Now let’s switch it to use the C extension version of edit distance from the PyLevenshtein package, and try to use it with PiCloud.
In [32]: import Levenshtein

In [33]: def cdistances(word, comparisonWords):
    results = []
    for otherword in comparisonWords:
        results.append(Levenshtein.distance(word, otherword))
    return results
   ....:

In [38]: zip(words, cdistances(word, words))
Out[38]: [('kitten', 0), ('sitten', 1), ('sittin', 2), ('sitting', 3), ('cat', 5), ('kitty', 2), ('smitten', 2)]

In [39]: cid = cloud.call(cdistances, word, words)

In [40]: cloud.result(cid)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (30, 0))

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (37, 0))

---------------------------------------------------------------------------
CloudException                            Traceback (most recent call last)

CloudException: Job 14: Could not depickle job
Traceback (most recent call last):
  File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
  File "/usr/local/lib/python2.6/dist-packages/cloud/serialization/cloudpickle.py", line 501, in subimport
    __import__(name)
ImportError: ('No module named Levenshtein', <function subimport at 0x2290ed8>, ('Levenshtein',))
Not too surprisingly, that didn’t work: Levenshtein is a C extension I built on my local machine, so the module doesn’t exist on PiCloud’s servers. PiCloud doesn’t really make it obvious, but you can add C extensions via their web interface. Amazingly, you can point it to an SVN repo and it will let you refresh it. It seems to download the code and call setup.py install, but it’s a little unclear. The fact is it just worked, so I didn’t care. I clicked on “Add Repository” and pasted in the URL from Google Code, http://pylevenshtein.googlecode.com/svn/trunk. It built it and installed it; you can see the output on the right. I then just reran the exact same command and it worked.
In [41]: cid = cloud.call(cdistances, word, words)

In [42]: cloud.result(cid)
Out[42]: [0, 1, 2, 3, 5, 2, 2]

In [43]: zip(words, cloud.result(cid))
Out[43]: [('kitten', 0), ('sitten', 1), ('sittin', 2), ('sitting', 3), ('cat', 5), ('kitty', 2), ('smitten', 2)]
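The earlier “Could not depickle job” failure is what you would expect if a job refers to an imported module by name instead of shipping its code. The sketch below shows that behavior with the standard library’s pickle, not PiCloud’s own cloudpickle, so treat it as an analogy. `fakelev` is a made-up module name standing in for a package that exists locally but not on the remote machine. Python 3.

```python
import pickle
import sys
import types

# Source for a throwaway module; "fakelev" and its toy distance() are hypothetical.
SOURCE = '''
def distance(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
'''

# Create the module in memory and register it, as if it were installed locally.
fakelev = types.ModuleType("fakelev")
exec(SOURCE, fakelev.__dict__)
sys.modules["fakelev"] = fakelev

payload = pickle.dumps(fakelev.distance)
# The payload records the module and function names, not the function body.
print(b"fakelev" in payload, b"zip" in payload)

# Simulate the remote side, where the package was never installed.
del sys.modules["fakelev"]
error = None
try:
    pickle.loads(payload)
except ImportError as exc:
    error = exc
print("unpickling failed:", error)
```

Once the module is importable on the receiving side, the same payload loads fine, which matches what happened after the repository was added through the web interface.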
I’ve written a slightly more complicated script that fetches the qwantzle corpus and uses the Jaro metric (where a higher score means a closer match) to find the n closest words in the corpus to a given word. It’s pretty trivial and dumb, but definitely more complicated than the above examples. Below is closestwords.py:
import Levenshtein
import urllib

class Corpusinator(object):
    '''
    Finds the closest words to the word you specified.
    '''

    def __init__(self, corpus='http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus'):
        '''Set up the corpus for later use. By default it uses
        http://cs.brown.edu/~jadrian/docs/etc/qwantzcorpus, but can be overridden
        by specifying an alternate URL that has one entry per line:
        a number, a space, then the word.
        '''
        # Fetch from the corpus argument so an alternate URL is actually honored.
        raw = urllib.urlopen(corpus).readlines()
        self.corpus = set()
        for line in raw:
            try:
                # Each line is "<number> <word>"; keep only the word.
                self.corpus.add(line.split()[1])
            except IndexError:
                pass

    def findClosestWords(self, words, n=10):
        '''
        Return the n (default 10) closest words from the corpus.
        '''
        results = {}
        for word in words:
            tempresults = []
            for refword in self.corpus:
                dist = Levenshtein.jaro(word, refword)
                tempresults.append((dist, refword))
            # Highest Jaro score first; keep the top n.
            tempresults = sorted(tempresults, reverse=True)
            results[word] = tempresults[:n]
        return results
Very simple. Let’s try ‘er out. First locally, then over the cloud.
In [1]: import closestwords

In [2]: c = closestwords.Corpusinator()

In [3]: c.findClosestWords(['bagel', 'cheese'], 5)
Out[3]:
{'bagel': [(0.8666666666666667, 'barge'),
           (0.8666666666666667, 'bag'),
           (0.8666666666666667, 'badge'),
           (0.8666666666666667, 'angel'),
           (0.8666666666666667, 'age')],
 'cheese': [(1.0, 'cheese'),
            (0.95238095238095244, 'cheesed'),
            (0.88888888888888895, 'cheers'),
            (0.88888888888888895, 'cheeks'),
            (0.8666666666666667, 'cheeseball')]}
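The scores above are consistent with the textbook Jaro similarity: with m matching characters and t transpositions between strings s and t2, the score is (m/len(s) + m/len(t2) + (m - t)/m) / 3. Here is a pure-Python sketch of that formula for Python 3, so the numbers can be checked without the C extension. It is not python-Levenshtein’s implementation and may differ from it in edge cases. The `closest` helper is a made-up name, and it swaps the full sort for heapq.nlargest, which returns the same top n.

```python
import heapq


def jaro(s, t):
    """Textbook Jaro similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if equal and no further apart than this window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half a transposition each.
    s_seq = [ch for ch, hit in zip(s, s_matched) if hit]
    t_seq = [ch for ch, hit in zip(t, t_matched) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def closest(word, corpus, n=10):
    """Return the n highest-scoring (score, refword) pairs, best first."""
    return heapq.nlargest(n, ((jaro(word, ref), ref) for ref in corpus))


print(jaro('cheese', 'cheesed'))
print(closest('bagel', ['angel', 'bag', 'xyz'], 2))
```

For 'cheese' against 'cheesed' all six characters match in order, so the score is (6/6 + 6/7 + 1) / 3, which is the 0.952 shown in the output above.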
Unfortunately it just doesn’t want to work happily with PiCloud & ipython when you’re running import closestwords. The most obvious approach, cloud.call(c.findClosestWords, ['bagel']), won’t work. Neither will creating a tiny wrapper function and calling that within ipython:
def caller(words, n=10):
    c = closestwords.Corpusinator()
    return c.findClosestWords(words, n)

cloud.call(caller, ['bagel'])
I created a stupidly simple wrapper python file, wrap.py:
import closestwords
import cloud

cid = cloud.call(closestwords.caller, ['bagel',])
print cloud.result(cid)
That gives an import error. Even putting that caller wrapper above at the bottom of the closestwords.py and calling it in the __main__ section (as I do below with c.findClosestWords) didn’t work.
However if I stick it directly in closestwords.py, initialize the instance, then run it from there, everything is fine. I’m not sure what this means, if it’s supposed to happen, or what. But it seems like it could be a pain in the butt just to get it calling the right function in the right context.
if __name__ == '__main__':
    import cloud
    c = Corpusinator()
    cid = cloud.call(c.findClosestWords, ['bagel',])
    print cloud.result(cid)
I had a good time playing with PiCloud. I’m going to look at adapting real code to use it. If I get carried away AND feel like blogging I’ll be sure to post ‘er up. They have pretty good first tier support for the map part of map/reduce, which would be useful. Two links I found useful when working with PiCloud:
Aaron Staley of PiCloud wrote me a nice email about this post. He says my problem with the closestwords example was due to a server side bug they’ve fixed. In playing around, it does seem a bit better. A few ways I tried to call it failed, but many of them worked. I had trouble passing in closestwords.caller, either in ipython or the wrapper script. However re-defining caller in ipython worked, as did creating an instance and passing in the instance’s findClosestWords function. A+ for communication, guys.
In [3]: cid = cloud.call(closestwords.caller, ['bagel'])

In [4]: cloud.result(cid)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (30, 0))

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (37, 0))

---------------------------------------------------------------------------
CloudException: Job 36: Could not depickle job
Traceback (most recent call last):
  File "/root/.local/lib/python2.6/site-packages/cloudserver/workers/employee/child.py", line 202, in run
AttributeError: 'module' object has no attribute 'caller'

In [8]: c = closestwords.Corpusinator()

In [9]: cid = cloud.call(c.findClosestWords, ['bagel', 'cheese'], 5)

In [10]: cloud.result(cid)
Out[10]:
{'bagel': [(0.8666666666666667, 'barge'),
           (0.8666666666666667, 'bag'),
           (0.8666666666666667, 'badge'),
           (0.8666666666666667, 'angel'),
           (0.8666666666666667, 'age')],
 'cheese': [(1.0, 'cheese'),
            (0.95238095238095244, 'cheesed'),
            (0.88888888888888895, 'cheers'),
            (0.88888888888888895, 'cheeks'),
            (0.8666666666666667, 'cheeseball')]}

In [11]: def caller(words, n=10):
   ....:     c = closestwords.Corpusinator()
   ....:     return c.findClosestWords(words, n)
   ....:

In [12]: cid = cloud.call(caller, ['bagel'])

In [13]: reload(closestword)
KeyboardInterrupt

In [13]: cloud.result(cid)
Out[13]:
{'bagel': [(0.8666666666666667, 'barge'),
           (0.8666666666666667, 'bag'),
           (0.8666666666666667, 'badge'),
           (0.8666666666666667, 'angel'),
           (0.8666666666666667, 'age'),
           (0.8222222222222223, 'barrel'),
           (0.8222222222222223, 'barely'),
           (0.81111111111111123, 'gamble'),
           (0.79047619047619044, 'vaguely'),
           (0.79047619047619044, 'largely')]}
I did some more experimentation with PiCloud, posted separately.