Tuesday, November 04, 2008

Amazon and Azure

I was reading my esteemed colleague Dmitry Sotnikov's overview comparison of the proliferation of cloud platforms, and it inspired me to write down my take on these two approaches to the world of 'cloud computing', whatever that is.

I am a big fan of Amazon Web Services (AWS) because it's available now and you pay only for what you use. No startup fee, no minimum subscription charges, no relentless mailings offering upgrades. I am able to boot up remote systems and interact with them without having to do an install from scratch. This works nicely in my world, but then I am a technical guy, so I am not put off by having to write a program or log in to a machine with ssh to do something.

So it's great, you do everything yourself, and it's a pain because yes, you do everything yourself. As I am often reminded in yoga class, maximum flexibility yields maximum pain. Even though I am not afraid to write a program, there is a big conceptual space to absorb before I can start writing that program. First I need to understand what an Amazon machine image is, how to attach persistent block storage, etc. It's really no surprise that although AWS has received a lot of attention from the press, hobbyists, and academics, it's not what you could call mainstream.

Microsoft's Azure is not yet completely defined, but I think I can see where they are headed. While not as feature-complete as Amazon, they are focusing on providing a nice layer of abstraction that offers what will likely become full-on management of systems that physically reside who-knows-where. A small number of services that work simply and well is arguably better than a raw interface on a comprehensive set of services. Plus, Microsoft is able to buy themselves some time to finish building out the data centers, and use the early adopters as guinea pigs to drive out the operational issues.

So as not to cannibalize their lucrative desktop biz, Microsoft is positioning this so that you have many options for creating applications. They say you can build your application with the new Azure APIs and tools such that you leverage services in the cloud only if and when it makes sense to do so. In other cloud-computing scenarios, you need to commit to a particular application architecture which is either resident 'offsite' or very local. If they deliver on the promise of making it easy to dynamically leverage cloud services, or bring the whole thing inside your firewall, or mix these notions as needed, then there might truly be some useful software here. You bet you are still committing to the Microsoft stack with all its various pros and cons, but the essential feature they deliver with Azure is a new architectural option for lateral scaling, hopefully without having to work too hard.

Somewhere, there is, no doubt, a giant room of people coding like crazy to build out more services, create the tutorials, and clean up the APIs. So while Microsoft is not offering as much granularity as Amazon, they are offering what will eventually be a highly usable web interface that lets people do the things they normally do. You can build on your local machine and push the result to the cloud right from your desktop. No ssh is (apparently) involved, and for most people that's the right answer; it removes an important barrier to entry.

For Amazon, there are things like ElasticFox, a Firefox plug-in for managing one's AWS services. But it's really just syntactic sugar on top of the raw APIs-- it lacks the abstraction that captures the work being done in the way that actual people think about it. Microsoft's focus on application hosting deployed right from the IDE represents an important distinction: Amazon offers the ability to do whatever you want, Microsoft gives you an application development environment AND a nice web front end to what you probably need to do anyway for production systems. There are compromises here because you naturally don't have the same flexibility with Azure that you have with AWS, but in many, many deployment scenarios, all that flexibility just gets in the way.

Monday, September 01, 2008

Enabling Semantic Infrastructure for Collaborative Systems

As more companies embrace the techniques of enterprise social software (ESS), I start wondering about what I am going to do with this extra data coming my way. ESS systems can promote collaboration, which often leads to Even More Data coming into my "information space." Currently it's not so bad-- I have a middling amount of email, about fifty thousand bookmarks, about a hundred RSS feeds-- how hard can that be to keep up with? But think about how much that space expands if I add even my most immediate group of co-workers: in effect, I then have all of their bookmarks, RSS feeds, etc. This is only a good thing if I have some help dealing with it all.

The only way out of the coming information glut will be to have the machines help dig us out. But how to do that? Enterprise search is heavy, expensive, local text search. It can be helpful, but it is not the way to handle the meaning of the data, i.e., its semantics. Semantic data management needs to be baked into a content system, so the generation of metadata just becomes part of the working environment. This can be in the very simple form of tags/folksonomies and standard representations of working groups using techniques such as friend of a friend (FOAF) and description of a project (DOAP). When semantic data management capability is part of the infrastructure in an ESS-type environment, it promises to allow data to be organized and queried in interesting and emergent ways. In a corporate environment, the system can easily link up associated projects, content, or people.
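To make that concrete, here's a toy sketch in Python of how a handful of FOAF/DOAP-style statements lets a system link people to projects. All the names and data are invented for illustration; a real system would use proper RDF vocabularies and a real triple store, not a Python set.

```python
# Minimal in-memory "triple store": (subject, predicate, object) tuples.
# Predicates borrow the flavor of FOAF (foaf:knows) and DOAP
# (doap:developer); everything else here is made up.
triples = {
    ("alice", "foaf:knows", "bob"),
    ("alice", "foaf:knows", "carol"),
    ("projectX", "doap:developer", "alice"),
    ("projectX", "doap:developer", "carol"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Who works on projectX?
devs = {o for (_, _, o) in query(s="projectX", p="doap:developer")}
# Which of alice's contacts also work on projectX? This is the kind of
# "link up associated projects and people" question that falls out for free.
contacts = {o for (_, _, o) in query(s="alice", p="foaf:knows")}
shared = devs & contacts
```

The point is that once the metadata is just sitting there as relationships, these emergent queries cost nothing extra to support.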

This sounds fabulous in theory, but adding such capability to your environment is actually harder than putting in a bigtime enterprise text search server. Semantic web technology is some of the newest 10-year-old technology in our bag of tricks, and it has the capability to capture, analyze, and derive value from relationships inside the data. But, in order to get this kind of semantic connectivity, you need a purpose-built data store, a server to house it, and an analytical/query system to get to it. Even worse, semantic metadata gets big very quickly, which can lead to storage and query-response issues. So, not helpful, right? The software integration problem alone is daunting enough, and adding semantic infrastructure to my overall ESS platform means that I have to adopt yet another system to manage.

Naturally, I want to imagine that I have access to a semantic data server that looks like it lives inside my machine room, and that feels like software-as-a-service (SaaS). I have worked quite a bit with virtualization from external providers: Amazon's EC2, CohesiveFT, and others. These are amazing systems that allow me to sign up, submit some commands, and some magic happens. What's great about that is that I don't have to put the systems up, I don't even need to understand how they work. I can just use them as if I *had* spent 6 months adding a new wing to my server room. With a virtual semantic server, I can integrate the promise of the "linked web of data" into my ESS platform to manage and leverage the meaning of all that data.

Now, it's important to observe that virtualization works in the SaaS model because of the generic nature of the task. The vendors of such services are able to optimize for scale, performance, and reliability without needing to know precisely how the systems are going to be used. Semantic data as a service falls into this same category in that it's generic and needs to be actively optimized for scale, performance, and reliability.

Talis, a UK-based company, is aiming to be the Amazon EC2 of the semantic web, and I think they have a good shot at it. The same principles apply: Talis is concerned with making a semantic store fast, reliable, and scalable, so you don't have to be. Your data is stored and processed somewhere else, but it's always your data. Via a straightforward HTTP-based interface, you add metadata and query against it.

Mind you, this is all very new, and Talis themselves have not yet defined their precise business model, but they are working on it, and making developer access free-for-the-asking for the time being. Clearly, there are many real-world issues to resolve, such as SLAs, privacy, and billing models, but the key notion here is that semantic data processing is quite generic, and we should not each be creating our own semantic servers to manage this data. In consumer-land, "Web 2.0" is creeping toward the linked web of data, where more people are (finally) starting to understand what TimBL was talking about with this 'semantic web' stuff. Now, as ESS systems proliferate, is the time for those of us who glue these systems together to take advantage of what semantic web technology can do for us, and to skip the server set-up part by using a system such as the one Talis provides.

See the Talis.com website and their developer wiki (n2.talis.com) for some overview articles and a taste of how to interact with a Talis data store. A future post will include some of my experiments with the system.

Monday, January 21, 2008

SimpleDB From Amazon Web Services, Part I

I just heard about a great new website called Amazon.com! I don’t know what ‘Amazon’ has to do with selling things, but they have stuff for sale there. Tell your friends you heard it here first! I pride myself on staying hip to startups like these.

A little reading at this “web” site leads me to find that they offer more than a paltry assortment of books, CDs, power tools, sporting goods, etc. Turns out they also have web services for sale for virtual hosting, disk storage, and recently, a database. The database-in-the-sky notion piqued my interest, so I spent some time working with this new service. What follows is an overview only-- in “part 2” on this topic, I’ll talk more about code details, but for now, the goal is to discuss what’s interesting about the beta-level offering called "SimpleDB".

If you are not familiar with Amazon Web Services (AWS), the general idea is that many fundamental computing infrastructure components ought to be available on-demand in the network (or, more fancifully, in the “cloud”). Amazon has a gigantic infrastructure, proven ability to manage it, and, maybe more importantly, they also have a gigantic billing infrastructure. Thus, Amazon can cost-effectively provide virtual machines, disk storage, message queuing, etc. by reselling the bits and pieces of infrastructure that fall out on their machine room floors.

SimpleDB is an Amazon Web Service. All of Amazon's web services are structured as pay-for-what-you-use. There is no startup cost, and you pay tiny amounts for transactions, and for the storage you eventually use. It's a perfect long-tail kind of scenario. For all those people who need to maintain “only” a few thousand records in a table and don't want to run a system with a database on it (and deal with backups, power, redundancy, optimization, etc.) this sort of service is perfect.

I think it’s also interesting in that they made the decision to not just present an interface on a standard relational database model.

Simple is Good

Most of us geek types would over-engineer this and create a multi-tenant database instance using Oracle or Postgres or something. The API would consist of ways to send strings containing SQL to such a service and get back XML chunks of data. But if you think about it, most applications, especially smallish web-based applications, have schemas that are not particularly complicated. Naturally, there are times when you need your own database with thousands of tables, referential integrity, and smart DBAs to keep it all tuned and running. With SimpleDB, AWS is betting that the vast majority of “Web 2.0” applications will be applications with simple data needs.

This is the sweet spot that Amazon is trying to capture with SimpleDB. You do not submit SQL strings; you simply perform something like name-value pair (attribute) gets and sets, and the underlying system performs a spell to produce your data. Offering a service like this gives users a simple interface for simple operations, plus high reliability (in theory—it's just beta now!). Amazon wins too, because they then have a controllable, revenue-generating service.

Of course there are issues regarding service-level agreements (SLAs) if you are going to run your business on such a service, and that's why Amazon has such long beta programs-- so they can evaluate the usage patterns users create to figure out what they can reliably support. The stinkers, they still charge for the beta while using us early adopters as guinea pigs. How come I don’t get a discount for this?

Not Normal

So how can you possibly reap any value from what amounts to a “.properties” file in the cloud? DBAs recoil in horror at the idea of production data that looks more like FileMaker or Excel than Oracle. But if you don't care how “efficient” it is or isn't, and make that someone else's problem, then as long as you get your data back “soon enough,” it's all good.

Consider the following data set for a training log application. Some days I run, and some days I ride my bike. I have different data points for each type of workout. Normally (ha, a pun) I would create a table to hold this stuff like this:

Key, date, type, distance, time, heartrate, route

Which leads you to "CREATE TABLE workout (blah blah blah)" and "INSERT INTO workout VALUES (blah, blah, blah)" and the usual SQL hoops. Since I have about 20 different routes that I run or ride, I would likely end up with another table to hold route data with a key relationship, and you know the rest.
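For contrast, here's roughly what the square-table version looks like, sketched with Python's built-in sqlite3. The schema, route name, and values are invented for illustration; the point is the second table and the join you are now stuck writing.

```python
import sqlite3

# The normalized, relational version of the training log: a workout
# table plus a separate route table, related by key.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE route (
    id INTEGER PRIMARY KEY, name TEXT, distance_km REAL)""")
cur.execute("""CREATE TABLE workout (
    id INTEGER PRIMARY KEY, date TEXT, type TEXT,
    time TEXT, heartrate INTEGER,
    route_id INTEGER REFERENCES route(id))""")
cur.execute("INSERT INTO route VALUES (1, 'lakefront loop', 53.0)")
cur.execute("""INSERT INTO workout
               VALUES (323, '2008-01-12', 'road ride', '1:50', 140, 1)""")

# The join required for every "show me my workouts with routes" report:
cur.execute("""SELECT w.date, w.type, r.name, r.distance_km
               FROM workout w JOIN route r ON w.route_id = r.id""")
row = cur.fetchone()
```

Perfectly respectable, and perfectly heavier than a training log really needs.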

The big advantage of SimpleDB is that it's well, simple. Everything looks like a name-value pair. As long as you can grab something by its key, you can set a name-value pair for that item. Even better, a given attribute for a given key can have multiple values. This is a trick that is sort of cheating in Relational Dataville—you know, where you put a “magic string” (e.g., “Red, Green, Blue”) into a column value which gets interpreted in code after it’s extracted or parsed at query time by an incomprehensible stored procedure. SimpleDB treats this case as typical, and optimizes around it. So, in that case where you might have several potential values for a single attribute (think column), you just set that value too. The effect is that an attribute named “color” can have a query-able value of both “Red” and “Green” without having to make a separate table to achieve join-like behavior.
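A toy Python model of that multi-valued-attribute idea (this imitates the behavior; it is not Amazon's actual API, and the items and colors are invented):

```python
# Each item maps attribute names to a LIST of values, so a multi-valued
# attribute is the normal case rather than a magic-string hack.
items = {
    "shirt-1": {"type": ["shirt"], "color": ["Red", "Green"]},
    "shirt-2": {"type": ["shirt"], "color": ["Blue"]},
}

def matching(attr, value):
    """Return keys of items where ANY value of `attr` equals `value`."""
    return [key for key, attrs in items.items()
            if value in attrs.get(attr, [])]

# "shirt-1" answers to a query for Red AND to a query for Green,
# with no join table and no string parsing.
red_items = matching("color", "Red")
```

No second table, no stored procedure; the multiple values are just there.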

Another aspect that is a little jarring at first is that each item in your store can have its own collection of attributes. If you want to aggregate similar items, make sure they share an attribute that you can use for grouping. A lack of structure allows you to make monumentally messy databases because queries are at the mercy of your ability to follow conventions, but I think it’s a nice balance between perfectly normal square data, and a sparse matrix in Excel. That is, for every item you put into your SimpleDB database, if you want things to answer to queries that need to understand “Category”, make sure you provide an attribute called Category for each item.
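Here's a small sketch of that grouping convention in Python (invented data, not the real SimpleDB interface), including one item that forgot to follow the rules:

```python
from collections import defaultdict

# Items need not share a schema; grouping only works for items that
# follow the convention of carrying a "Category" attribute.
items = {
    "r1": {"Category": "run",  "distance": "10km", "shoes": "trail"},
    "b1": {"Category": "ride", "distance": "53km", "bike": "road"},
    "n1": {"note": "oops, no category on this one"},
}

by_category = defaultdict(list)
for key, attrs in items.items():
    if "Category" in attrs:
        by_category[attrs["Category"]].append(key)
# Items lacking the attribute silently fall out of category queries --
# the flexibility and the mess, in one line.
```

That dropped "n1" item is exactly the monumentally-messy-database risk: nothing stops you from omitting the attribute, and nothing warns you either.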

Here is an illustration using (gasp) lisp-like notation (apologies to lisp purists):
(itemKey (name value) (name value) (name value) …)

E.g.,
(323 ('date' 'Jan 12, 2008') ('type' 'road ride')
('distance' '53km') ('time' '1:50'))

I.e., for unique key ‘323’, set a property called 'date' to 1/12, distance is 53km, etc.

Earlier I said that you can assign multiple values to an attribute. So, using the training example, I might want a property which lists the songs I was listening to during the workout. In square-table land, you'd have to create a second table with a primary key relationship and do a join to see all that data put together. Instead, I can add that data in all its multi-dimensional glory inline like this:

(324 ('date' 'Jan 13, 2008')
('distance' '53km')
('time' '1:50') (music ('tom waits' 'david byrne' 'cat power')))
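In Python terms, that item might be modeled like so (again, just a model of the data shape, not the actual SimpleDB calls):

```python
# One key, a bag of attributes, and a multi-valued 'music' attribute --
# no second table, no join. Illustrative data only.
workouts = {
    "324": {
        "date": ["Jan 13, 2008"],
        "distance": ["53km"],
        "time": ["1:50"],
        "music": ["tom waits", "david byrne", "cat power"],
    },
}

# "Join-like" query: which workouts had tom waits on the playlist?
with_waits = [key for key, attrs in workouts.items()
              if "tom waits" in attrs.get("music", [])]
```

The playlist rides along inside the item, yet each song is still individually query-able.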

Seem interesting? In the second portion of this piece, I’ll talk more about queries and what the programmatic interface is like.

Comments, questions, and corrections are always welcome. Thanks for reading.