ano.malo.us pretty typical actually..

Bayesian Networks in the Cloud

“If computers of the kind I have advocated become the computers of the future, then computing may someday be organized as a public utility just as the telephone system is a public utility… The computer utility could become the basis of a new and important industry.”

– John McCarthy, MIT Centennial in 1961

This is the premise behind cloud or utility computing. You simply request some computing power, use it for as long as you need and only pay for what you use. Once you compare this with the usual process — order machines, configure them, use them and then keep paying for them while they sit idle — you realize just how disruptive (in the good sense) cloud computing can be.

At Loudcloud, the original business model was to provide cloud computing and server management services for web startups. Instead of worrying about managing its environment, a startup could focus on its core competencies while Loudcloud used its automated systems to quickly create and scale up computing environments. The changing economy forced Loudcloud to alter its focus and me to alter my address :) but I’ve been interested in the idea ever since.

So, I was really excited when Amazon announced its EC2 (Elastic Compute Cloud) service, which lets anyone request a few compute instances and pay only $0.10 per hour per instance. For $10, you can have a 100-node cluster for an hour!

Amazon’s approach is really simple: they use virtualization to split physical machines into instances. You can use pre-existing Amazon Machine Images (AMIs) or create your own using the tools they provide. I’m not sure what virtualization technology they use but the idea is similar to VMware or Xen. Because the instances are created from machine images, they lose all created data when they are terminated, so you either download your data or store it in Amazon’s S3 storage service.

I started with a public AMI with Fedora and some basic libraries, added Python 2.5 and pebl, and saved the result as a custom image. Now I have a way to learn Bayesian networks in the cloud. I had to run some small jobs today but our lab Xgrid was busy, so instead of waiting or interrupting my web-browsing work by running the jobs on my laptop, I created an instance using my custom AMI and analyzed some data for 20 cents. These were small jobs and didn’t require a cluster, but I have MPI and IPython1 installed on the AMI so I can create an ad-hoc cluster and run larger jobs.

This has the potential to really change academic computing. My lab has a grid which happened to be busy, but many other labs don’t have such resources. Instead of going through the university bureaucracy to get some time on the campus clusters, a student can simply use EC2 to get some analysis done. The best part is that you pay per instance-hour. So 100 nodes running for one hour costs the same as one node running for 100 hours, but you get your results in an hour instead of about 4 days.
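To make the instance-hour arithmetic concrete, here is a rough sketch in Python. The function name and the rounding rule are my own assumptions for illustration: it assumes the job splits evenly across nodes, that each instance is billed in whole hours, and it uses the $0.10 rate quoted above.

```python
import math

# The $0.10 per instance-hour rate quoted in this post, kept in cents
# so the arithmetic stays in integers.
RATE_CENTS_PER_INSTANCE_HOUR = 10


def cluster_cost_and_time(total_node_hours, nodes,
                          rate_cents=RATE_CENTS_PER_INSTANCE_HOUR):
    """Return (cost_in_dollars, wall_clock_hours) for a job.

    Hypothetical helper. Assumes the work divides evenly across nodes
    and that every instance is billed in whole hours (rounded up).
    """
    hours_per_node = math.ceil(total_node_hours / nodes)
    cost_cents = hours_per_node * nodes * rate_cents
    return cost_cents / 100, hours_per_node


# Same 100 node-hours of work, two ways of buying it.
wide = cluster_cost_and_time(100, nodes=100)   # many nodes, short wait
narrow = cluster_cost_and_time(100, nodes=1)   # one node, long wait
print(wide, narrow)
```

Both calls come out to the same dollar amount; only the wall-clock time differs. Real jobs rarely parallelize perfectly, so treat this as a best case.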

I’m going to integrate access to EC2 into the next version of pebl so you can simply specify your dataset, some parameters, your EC2 security keys and have an ad-hoc cluster in the cloud chugging away on your data.