Amazon SimpleDB / SDS
14 12 2007
Amazon today launched a limited beta of the SimpleDB service. Unlike S3, this service provides storage for structured data. In their developer documentation, it is described as complementary to S3:
Unlike Amazon S3, Amazon SimpleDB is not storing raw data. Rather, it takes your data as input and expands it to create indices across multiple dimensions, which enables you to quickly query that data. Additionally, Amazon S3 and Amazon SimpleDB use different types of physical storage. Amazon S3 uses dense storage drives that are optimized for storing larger objects inexpensively. Amazon SimpleDB stores smaller bits of data and uses less dense drives that are optimized for data access speed.
In order to optimize your costs across AWS services, large objects or files should be stored in Amazon S3, while smaller data elements or file pointers (possibly to Amazon S3 objects) are best saved in Amazon SimpleDB. Because of the close integration between services and the free data transfer within the AWS environment, developers can easily take advantage of both the speed and querying capabilities of Amazon SimpleDB as well as the low cost of storing data in Amazon S3, by integrating both services into their applications.
This new service has undoubtedly generated quite a bit of enthusiasm: O’Reilly, TechCrunch… For some, this is the long-awaited “database” in the cloud, and it is seen as Amazon addressing the requirements and realities of today’s web application development. While this is no doubt an important service that complements Amazon’s S3 and EC2, is it really the ‘database’ that your average web application developer is hoping for? Is this the MySQL in the sky and a challenger to Oracle?
From a technical standpoint, the answer is, “No.” While SimpleDB supports storage and indexing of structured data, it is not a drop-in replacement for an RDBMS for a very simple reason: it supports atomic updates only at the level of a single Item. According to its developer documentation:
The PutAttributes operation creates or replaces attributes in an item. You specify new attributes using a combination of the Attribute.X.Name and Attribute.X.Value parameters. You specify the first attribute by the parameters Attribute.0.Name and Attribute.0.Value, the second attribute by the parameters Attribute.1.Name and Attribute.1.Value, and so on.
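To make the parameter numbering concrete, here is a minimal sketch of how a client might flatten a list of attribute pairs into the Attribute.X.Name / Attribute.X.Value form described above. The function name `put_attributes_params` is hypothetical, and the `Action`, `DomainName` and `ItemName` keys are assumptions that do not appear in the excerpt quoted here. Request signing and the HTTP call itself are left out.

```python
from typing import Dict, Iterable, Tuple


def put_attributes_params(
    domain: str, item: str, attributes: Iterable[Tuple[str, str]]
) -> Dict[str, str]:
    """Flatten (name, value) pairs into numbered request parameters.

    Numbering starts at 0, following the documentation excerpt quoted above.
    The Action/DomainName/ItemName keys are assumptions for illustration.
    """
    params = {"Action": "PutAttributes", "DomainName": domain, "ItemName": item}
    for index, (name, value) in enumerate(attributes):
        params[f"Attribute.{index}.Name"] = name
        params[f"Attribute.{index}.Value"] = value
    return params


params = put_attributes_params(
    "Employee",
    "emp-001",
    [("name", "Alice"), ("department", "Engineering")],
)
for key in sorted(params):
    print(key, "=", params[key])
```

Note that every call built this way names exactly one item, which is the point of the discussion that follows: there is no way to put two items into the same request.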
Multiple REST calls to update multiple entities (Items) mean separate, atomic updates with eventual consistency. A read (GET attributes) of an item immediately following a write (PUT attributes) is not guaranteed to return the updated attribute values, since the read may be served by a node that has yet to receive the update. In simple terms, SimpleDB does not support updates to multiple entities in a single transaction. The granularity of a transaction (taking eventual consistency into account) is the single item.
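The stale-read behavior described above can be illustrated with a toy model. This is not how SimpleDB is implemented; `ToyReplicatedStore` is a hypothetical in-memory class that assumes a write lands on one replica first and reaches the others only when `sync()` is called.

```python
class ToyReplicatedStore:
    """Toy model of eventual consistency across replicas (illustration only)."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet applied: (replica index, item, attrs)

    def put_attributes(self, item, attrs):
        # The write is applied to replica 0 right away and queued for the rest.
        self.replicas[0].setdefault(item, {}).update(attrs)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, item, dict(attrs)))

    def get_attributes(self, item, replica=0):
        # A read is served by whichever replica the caller happens to reach.
        return dict(self.replicas[replica].get(item, {}))

    def sync(self):
        # Propagate all queued updates; after this, every replica agrees.
        for index, item, attrs in self.pending:
            self.replicas[index].setdefault(item, {}).update(attrs)
        self.pending.clear()


store = ToyReplicatedStore()
store.put_attributes("emp-001", {"department": "Engineering"})
print(store.get_attributes("emp-001", replica=1))  # read before propagation
store.sync()
print(store.get_attributes("emp-001", replica=1))  # read after propagation
```

The first read reaches a replica that has not seen the write and returns nothing for the item; only after propagation does the same read return the new value. An application that writes and then immediately reads back has to tolerate the first outcome.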
I think this is an important consideration when one looks at SimpleDB as a ‘replacement’ for MySQL. It all depends on how the application is designed. Strictly speaking, SimpleDB is a structured index with the semantics of a MultiMap (a list of attribute-value pairs, where multiple values can be associated with an attribute). It is schema-less (as in a big HashMap with up to 256 keys) and transaction-less (across multiple entities/items, anyway). It is a structured index not unlike Google Base (ST anyone?). It is not a MySQL equivalent. Application developers need to take this into consideration: if their application assumes consistent, transactional updates across multiple entities (tables), SimpleDB is not the solution.
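The MultiMap analogy can be sketched in a few lines. The `Item` class below is hypothetical and only models the shape of the data: each attribute name maps to a set of values, there is no schema, and the 256-key ceiling is taken at face value from the description above (the real service may count its limit differently).

```python
from collections import defaultdict

MAX_ATTRIBUTES = 256  # per-item ceiling as described in the post (an assumption)


class Item:
    """Schema-less item: each attribute name maps to a set of values."""

    def __init__(self):
        self.attributes = defaultdict(set)

    def put(self, name, value):
        # Only a brand-new attribute name counts against the ceiling.
        is_new_name = name not in self.attributes
        if is_new_name and len(self.attributes) >= MAX_ATTRIBUTES:
            raise ValueError(f"item already has {MAX_ATTRIBUTES} attributes")
        self.attributes[name].add(value)


person = Item()
person.put("name", "Alice")
person.put("phone", "555-0100")
person.put("phone", "555-0101")  # a second value for the same attribute
print(sorted(person.attributes["phone"]))
```

Nothing here declares columns up front, and nothing ties one item to another, which is why the structure behaves like an index over items rather than a set of related tables.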
Having said all this, I think SimpleDB is a great service. If anything, using it will force application developers to carefully consider the data model of their application. Most of the time, running on top of an RDBMS is just overkill given the complexity and overhead of keeping an RDBMS running, not to mention the scalability problems (or the headache that comes with sharding the databases). It very nicely complements S3 by providing the ability to index structured data, saving you from having to implement your own Lucene index sitting on top of data in S3. The only thing I am not so sure about is that all this is implemented as REST over HTTP/S, with the results returned as XML rather than JSON. It would be nice to have an optional, lighter-weight API specifically optimized for clients running on EC2, possibly using a lightweight RPC layer such as Facebook’s Thrift or some clone of Google’s Protocol Buffers?
Update / “Second Thoughts”
Actually, it’s not even quite like having Lucene sitting on S3, because SimpleDB’s query language does not support full text search (technical discussion here). So as-is, each Domain in SimpleDB basically looks like a table where the columns are indexed and more columns (up to 256) can be added at any time. For example, you can have a Domain called Employee with a set of attributes like name and department, and the query language makes searches by name and department trivial. Set operations like union and intersection are also possible. Queries operate on a single given Domain, so it might be tricky to have a separate Domain called Person and try to find, say, the intersection of Employee and Person. Also, if you want to record a Person who is also an Employee as items in separate Domains, SimpleDB’s API does not seem to provide transactional semantics for that. The only way is to denormalize your data into one catch-all Domain/table. For most applications, this is not so much a limitation as an opportunity to design for performance and scalability.
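As a rough picture of what querying a single Domain with set operations looks like, here is a small in-memory sketch. It does not use SimpleDB's actual query syntax; the `match` helper and the sample Employee data are hypothetical, and the Domain is modeled as a plain dictionary of items.

```python
def match(domain, attribute, value):
    """Return the names of items in one domain whose attribute contains value."""
    return {
        item_name
        for item_name, attrs in domain.items()
        if value in attrs.get(attribute, set())
    }


# Hypothetical Employee domain: item name -> attribute name -> set of values.
employee = {
    "emp-001": {"name": {"Alice"}, "department": {"Engineering"}},
    "emp-002": {"name": {"Bob"}, "department": {"Engineering", "Research"}},
    "emp-003": {"name": {"Alice"}, "department": {"Sales"}},
}

engineers = match(employee, "department", "Engineering")
alices = match(employee, "name", "Alice")

print(sorted(engineers & alices))  # intersection: engineers named Alice
print(sorted(engineers | alices))  # union: engineers or anyone named Alice
```

Both result sets come from the same Domain, so combining them is straightforward. A Person Domain would be a separate dictionary with its own item names, and nothing in this model (or, as far as I can tell, in the API) relates the two, which is what pushes you toward a single denormalized Domain.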
The capabilities described above (cross-table queries and multi-entity transactions) are all things that a typical RDBMS user takes for granted. They can no longer be taken for granted in the world of SimpleDB (or, say, Google’s BigTable or HBase), and applications need to be architected accordingly.