Monday, January 21, 2008

SimpleDB From Amazon Web Services, Part I

I just heard about a great new website called Amazon.com! I don’t know what ‘Amazon’ has to do with selling things, but they have stuff for sale there. Tell your friends you heard it here first! I pride myself on staying hip to startups like these.

A little reading at this “web” site leads me to find that they offer more than a paltry assortment of books, CDs, power tools, sporting goods, etc. Turns out they also have web services for sale for virtual hosting, disk storage, and recently, a database. The database-in-the-sky notion piqued my interest, so I spent some time working with this new service. What follows is an overview only-- in “part 2” on this topic, I’ll talk more about code details, but for now, the goal is to discuss what’s interesting about the beta-level offering called "SimpleDB".

If you are not familiar with Amazon Web Services (AWS), the general idea is that many fundamental computing infrastructure components ought to be available on-demand in the network (or, more fancifully, in the “cloud”). Amazon has a gigantic infrastructure, proven ability to manage it, and, maybe more importantly, they also have a gigantic billing infrastructure. Thus, Amazon can cost-effectively provide virtual machines, disk storage, message queuing, etc. by reselling the bits and pieces of infrastructure that fall out on their machine room floors.

SimpleDB is an Amazon Web Service. All of Amazon's web services are structured as pay-for-what-you-use. There is no startup cost, and you pay tiny amounts for transactions, and for the storage you eventually use. It's a perfect long-tail kind of scenario. For all those people who need to maintain “only” a few thousand records in a table and don't want to run a system with a database on it (and deal with backups, power, redundancy, optimization, etc.) this sort of service is perfect.

I think it’s also interesting in that they made the decision to not just present an interface on a standard relational database model.

Simple is Good

Most of us geek types would over-engineer this and create a multi-tenant database instance using Oracle or Postgres or something. The API would consist of ways to send strings containing SQL to such a service and get back XML chunks of data. But if you think about it, most applications, especially smallish web-based applications, have schemas that are not particularly complicated. Naturally, there are times when you need your own database with thousands of tables, referential integrity, and smart DBAs to keep it all tuned and running. With SimpleDB, AWS is betting that the vast majority of “Web 2.0” applications will be applications with simple data needs.

This is the sweet spot that Amazon is trying to capture with the SimpleDB. You do not submit SQL strings, you simply perform something like name-value pair (attribute) gets and sets, and the underlying system performs a spell to produce your data. Offering a service like this provides users a simple interface to perform simple operations and high reliability (in theory—it’s just beta now!). Amazon wins too because they then have a controllable, revenue-generating service.

Of course there are issues regarding service-level agreements (SLAs) if you are going to run your business on such a service, and that's why Amazon has such long beta programs-- so they can evaluate the usage patterns the user create to figure out what they can reliably support. The stinkers, they still charge for the beta while using us early adopters as guinea pigs. How come I don’t get a discount for this?

Not Normal

So how can you possibly reap any value from what amounts to a “.properties” file in the cloud? DBAs recoil in horror at the idea of production data that looks more like FileMaker or Excel than Oracle. But if you don't care how “efficient” it is or isn't, and make it someone else's problem, as long as you get your data back “soon enough” then it’s all good.

Consider the following data set for a training log application. Some days I run, and some days I ride my bike. I have different data points for each type of workout. Normally (ha, a pun) I would create a table to hold this stuff like this:

Key, date, distance, type, distance, time, heartrate, route

Which leads you to "CREATE TABLE workout (blah blah blah)" and "INSERT workout VALUES (blah, blah, blah)" and the usual SQL hoops. Since I have about 20 different routes that I run or ride, I would likely end up with another table to hold route data with a key relationship, and you know the rest.

The big advantage of SimpleDB is that it's well, simple. Everything looks like a name-value pair. As long as you can grab something by its key, you can set a name-value pair for that item. Even better, a given attribute for a given key can have multiple values. This is a trick that is sort of cheating in Relational Dataville—you know, where you put a “magic string” (e.g., “Red, Green, Blue”) into a column value which gets interpreted in code after it’s extracted or parsed at query time by an incomprehensible stored procedure. SimpleDB treats this case as typical, and optimizes around it. So, in that case where you might have several potential values for a single attribute (think column), you just set that value too. The effect is that an attribute named “color” can have a query-able value of both “Red” and “Green” without having to make a separate table to achieve join-like behavior.

Another aspect that is a little jarring at first is that each item in your store can have its own collection of attributes. If you want to aggregate similar items, make sure they share an attribute that you can use for grouping. A lack of structure allows you to make monumentally messy databases because queries are at the mercy of your ability to follow conventions, but I think it’s a nice balance between perfectly normal square data, and a sparse matrix in Excel. That is, for every item you put into your SimpleDB database, if you want things to answer to queries that need to understand “Category”, make sure you provide an attribute called Category for each item.

Here is an illustration using (gasp) lisp-like notation (apologies to lisp purists):
(itemKey (name value) (name value) (name value) …)

E.g.,
(323 ('date' 'Jan 12, 2008') ('type' 'road ride')
('distance' '53km') ('time' '1:50'))

I.e., for unique key ‘323’, set a property called 'date' to 1/12, distance is 53km, etc.

Earlier I said that you can assign multiple values to an attribute. So, using the training example, I might want a property which lists the songs I was listening to during the workout. In square-table land, you'd have to create a second table with a primary key relationship and do a join to see all that data put together. Instead, I can add that data in all its multi-dimensional glory inline like this:

(324 ('date' 'Jan 13, 2007')
('distance' '53km')
('time' '1:50') (music ('tom waits' 'david byrne' 'cat power')))

Seem interesting? In the second portion of this piece, I’ll talk more about queries and what the programmatic interface is like.

Comments, questions, and corrections are always welcome. Thanks for reading.