Amazon SimpleDB / SDS

14 12 2007

Amazon today launched a limited beta of the SimpleDB service. Unlike S3, this service provides storage for structured data. In the developer documentation, the service is described as complementary to S3:

Unlike Amazon S3, Amazon SimpleDB is not storing raw data. Rather, it takes your data as input and expands it to create indices across multiple dimensions, which enables you to quickly query that data. Additionally, Amazon S3 and Amazon SimpleDB use different types of physical storage. Amazon S3 uses dense storage drives that are optimized for storing larger objects inexpensively. Amazon SimpleDB stores smaller bits of data and uses less dense drives that are optimized for data access speed.

In order to optimize your costs across AWS services, large objects or files should be stored in Amazon S3, while smaller data elements or file pointers (possibly to Amazon S3 objects) are best saved in Amazon SimpleDB. Because of the close integration between services and the free data transfer within the AWS environment, developers can easily take advantage of both the speed and querying capabilities of Amazon SimpleDB as well as the low cost of storing data in Amazon S3, by integrating both services into their applications.

This new service has undoubtedly generated quite a bit of enthusiasm: O’Reilly, TechCrunch… For some, this is the long-awaited “database” in the cloud and is seen as Amazon addressing the requirements and realities of today’s web application development. While this is no doubt an important service that complements Amazon’s S3 and EC2, is it really the ‘database’ that your average web application developer is hoping for? Is this the MySQL in the sky and a challenger to Oracle?

From a technical standpoint, the answer is, “No.” While SimpleDB supports storage and indexing of structured data, it is not a drop-in replacement for an RDBMS for a very simple reason: it supports atomic updates only at the single-Item level. According to its developer documentation:

The PutAttributes operation creates or replaces attributes in an item. You specify new attributes using a combination of the Attribute.X.Name and Attribute.X.Value parameters. You specify the first attribute by the parameters Attribute.0.Name and Attribute.0.Value, the second attribute by the parameters Attribute.1.Name and Attribute.1.Value, and so on.

Multiple REST calls to update multiple entities (Items) mean separate, atomic updates with eventual consistency. A read (GetAttributes) of an item immediately following a write (PutAttributes) is not guaranteed to return the updated attribute values, since the read may be served by a node that has yet to receive the update. In simple terms, SimpleDB does not support the notion of updates to multiple entities in a single transaction. The granularity of the transaction (taking eventual consistency into account) is the single item.
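To make the stale-read scenario concrete, here is a minimal toy simulation. It is only a sketch of the behaviour described above, not how SimpleDB is implemented; the class and method names (EventuallyConsistentStore, replicate, and so on) are hypothetical, and the toy stores an item's attributes wholesale for simplicity.

```python
class EventuallyConsistentStore:
    """Toy model: several replicas, writes reach replica 0 first."""

    def __init__(self, replica_count=2):
        # Each replica is its own dict of item name -> attributes.
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes accepted but not yet copied everywhere

    def put_attributes(self, item_name, attributes):
        # The write is applied to one replica immediately...
        self.replicas[0][item_name] = dict(attributes)
        # ...and queued for the others.
        self.pending.append((item_name, dict(attributes)))

    def get_attributes(self, item_name, replica_index):
        # A read is served by whichever replica the caller happens to hit.
        return self.replicas[replica_index].get(item_name, {})

    def replicate(self):
        # Stand-in for background propagation between nodes.
        for item_name, attributes in self.pending:
            for replica in self.replicas[1:]:
                replica[item_name] = dict(attributes)
        self.pending.clear()


store = EventuallyConsistentStore()
store.put_attributes("emp-1", {"department": "sales"})
stale = store.get_attributes("emp-1", replica_index=1)  # not yet updated
store.replicate()
fresh = store.get_attributes("emp-1", replica_index=1)  # now updated
```

An application that writes an item and reads it back straight away has to tolerate the stale result, or avoid depending on the read at all.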

I think this is an important consideration when one looks at SimpleDB as a ‘replacement’ for MySQL. It all depends on how the application is designed. Strictly speaking, SimpleDB is a structured index with the semantics of a MultiMap (a list of attribute-value pairs, where multiple values can be associated with an attribute). It is schema-less (as in a big HashMap with up to 256 keys) and transaction-less (across multiple entities/items anyway). It is a structured index not unlike Google Base (ST anyone?). It is not a MySQL equivalent. Application developers need to take this into consideration: if their application assumes consistent transactional updates of multiple entities (tables), SimpleDB is not the solution.
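The MultiMap semantics can be sketched in a few lines. This is an illustration only: the Item class and its put method are hypothetical names, the replace flag is a simplification of the "creates or replaces" behaviour quoted above, and treating the 256 figure as a cap on name-value pairs per item is an assumption.

```python
MAX_PAIRS_PER_ITEM = 256  # figure quoted in the post; counting pairs is an assumption


class Item:
    """Toy multimap: each attribute name maps to a set of values."""

    def __init__(self):
        self.attributes = {}

    def pair_count(self):
        # Total number of (name, value) pairs held by this item.
        return sum(len(values) for values in self.attributes.values())

    def put(self, name, value, replace=False):
        existing = self.attributes.get(name, set())
        # replace=True discards old values; otherwise the value is added.
        new_values = {value} if replace else existing | {value}
        new_count = self.pair_count() - len(existing) + len(new_values)
        if new_count > MAX_PAIRS_PER_ITEM:
            raise ValueError("item would exceed the attribute limit")
        self.attributes[name] = new_values


item = Item()
item.put("phone", "555-0100")
item.put("phone", "555-0199")                 # second value, same attribute
item.put("department", "sales")
item.put("phone", "555-0123", replace=True)   # old phone values are dropped
```

There is no schema anywhere in this model: any item can carry any attribute names, which is what makes it a structured index rather than a table.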

Having said all this, I think SimpleDB is a great service. If anything, using it will force application developers to carefully consider the data model of the application. Most of the time, running on top of an RDBMS is just overkill given the complexity and overhead of keeping an RDBMS running, not to mention the scalability problems (or the headaches that come with sharding the databases). It very nicely complements S3 by providing the ability to index structured data, saving you from having to implement your own Lucene index sitting on top of data in S3. The only thing I am not so sure about is that all this is implemented as REST over HTTP/S, with the results returned in XML rather than JSON. It would be nice to have an optional, lighter-weight API specifically optimized for clients running on EC2, possibly using a lightweight RPC layer such as Facebook’s Thrift or some clone of Google’s Protocol Buffers?

Update / “Second Thoughts”

Actually, it’s not even quite like having Lucene sitting on S3, because SimpleDB’s query language does not support full text search (technical discussion here). So as-is, each Domain in SimpleDB basically looks like a table where the columns are indexed and more columns (up to 256) can be added at any time. For example, you can have a Domain called Employee with a set of attributes like name and department, and the query language allows simple searches by name and department. Set operations like union and intersection are also possible. Queries operate on a single Domain, so it might be tricky to have a separate Domain called Person and try to find, say, the intersection of Employee and Person. Also, if you want to keep a record of a Person who is also an Employee as items in separate Domains, SimpleDB’s API does not seem to provide the transactional semantics to update both together. The only way is to denormalize your data into one catch-all Domain/table. For most applications, this is not so much a limitation as an opportunity to design for performance and scalability.
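As a rough picture of what a single-Domain query with set operations amounts to, here is a small sketch. It does not use SimpleDB's actual query syntax or API; the Employee data and the matching_items helper are made up for illustration, and the lookup is a plain scan rather than a real index.

```python
def matching_items(domain, attribute, value):
    """Return the names of items in one domain whose attribute has the value."""
    return {
        item_name
        for item_name, attributes in domain.items()
        if value in attributes.get(attribute, set())
    }


# One denormalized Domain: item name -> {attribute name -> set of values}.
employee = {
    "emp-1": {"name": {"Alice"}, "department": {"sales"}},
    "emp-2": {"name": {"Bob"}, "department": {"sales", "support"}},
    "emp-3": {"name": {"Alice"}, "department": {"engineering"}},
}

in_sales = matching_items(employee, "department", "sales")
named_alice = matching_items(employee, "name", "Alice")

alice_in_sales = in_sales & named_alice   # intersection
alice_or_sales = in_sales | named_alice   # union
```

Both operands come from the same Domain. Intersecting Employee with a separate Person Domain would have to be done by the application, across two independent queries.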

Cases described above are all things that a typical RDBMS user takes for granted. This is no longer the case in the world of SimpleDB (or say Google’s BigTable or HBase) and applications need to be architected accordingly.



Scaling Web Apps with Apache 2.2 and mod_proxy_balancer on Debian

26 07 2007

Using Apache as a reverse proxy to front application servers (like mod_jk with Tomcat) is a common pattern in web application architecture, including applications based on WebLogic or WebSphere. WebLogic, for instance, has a load balancer module for clustering app servers.

In version 2.2 onward, the mod_proxy module of the Apache server has been extended to support load balancing with the mod_proxy_balancer module. For a lot of Rails installations, this seems to be a popular way of scaling. To implement this on Debian, here are the basic steps:

1. Install Apache2
To install Apache2 on Debian, use the apt-get command to install the apache2 package. This step is pretty much automatic:
apt-get install apache2

2. Configure Required Modules
Apache2 on Debian has a slightly different layout. The files are in the /etc/apache2 directory and are divided into modules and sites:

domU-12-31-36-00-31-41:/etc/apache2# ls -alkn
total 60
drwxr-xr-x 7 0 0 4 2007-07-27 05:18 .
drwxr-xr-x 45 0 0 4 2007-07-26 22:41 ..
-rw-r--r-- 1 0 0 24 2007-03-27 12:53 apache2.conf
drwxr-xr-x 2 0 0 4 2007-07-17 07:57 conf.d
-rw-r--r-- 1 0 0 1 2007-03-27 12:58 envvars
-rw-r--r-- 1 0 0 0 2007-07-17 07:57 httpd.conf
drwxr-xr-x 2 0 0 4 2007-07-26 08:06 mods-available
drwxr-xr-x 2 0 0 4 2007-07-26 08:17 mods-enabled
-rw-r--r-- 1 0 0 1 2007-07-17 07:57 ports.conf
drwxr-xr-x 2 0 0 4 2007-07-26 08:08 sites-available
drwxr-xr-x 2 0 0 4 2007-07-26 08:08 sites-enabled

So instead of one monolithic httpd.conf file, things are broken down into small file fragments for each module, and the use of symbolic links allows quick configuration changes. To facilitate configuration, several commands are available: a2enmod, a2ensite, etc. Here is a good reference.

At a minimum, several modules need to be configured and enabled. The module names are basically the file basenames in the mods-available directory.

In this example, the application server nodes are running locally at different ports and we want to reverse proxy requests to these nodes via the load balancer. The configuration files are to be stored in the mods-available directory. Here are the sample configurations.
The balancer:
domU-12-31-36-00-31-41:/etc/apache2# more mods-available/proxy_balancer.conf
<Proxy balancer://app>
# cluster member
BalancerMember http://127.0.0.1:8080 loadfactor=1
BalancerMember http://127.0.0.1:8081 loadfactor=1
</Proxy>

The main config file:
domU-12-31-36-00-31-41:/etc/apache2# more sites-enabled/000-default
#NameVirtualHost *
<virtualhost *:80>
ServerAdmin webmaster@localhost
ProxyPass / balancer://app/
ProxyPassReverse / balancer://app/
ErrorLog /var/log/apache2/error.log
LogLevel warn
CustomLog /var/log/apache2/access.log combined
ServerSignature On
</virtualhost>

It’s important to note the use of ‘/’ in the ProxyPass and ProxyPassReverse directives and how they bind to the cluster name defined in the config file for the balancer (proxy_balancer.conf). Once the config files are ready, the modules need to be enabled.

3. Enable Modules
After the module configuration files are complete, the modules are enabled via:

a2enmod proxy
a2enmod proxy_balancer
a2enmod proxy_http

For some reason, proxy_http is frequently overlooked. Without it, Apache will throw 403 errors when accessing resources at proxied URLs, and the logs will contain error messages like:

[Thu Jul 26 07:06:13 2007] [warn] proxy: No protocol handler was valid for the URL /foo. If you are using a DSO version of mod_proxy, make sure the proxy submodules are included in the configuration using LoadModule.

To show the loaded modules:

domU-12-31-36-00-31-41:/etc/apache2# apache2ctl -t -D DUMP_MODULES
Loaded Modules:
core_module (static)
log_config_module (static)
logio_module (static)
mpm_worker_module (static)
http_module (static)
so_module (static)
alias_module (shared)
auth_basic_module (shared)
authn_file_module (shared)
authz_default_module (shared)
authz_groupfile_module (shared)
authz_host_module (shared)
authz_user_module (shared)
autoindex_module (shared)
cache_module (shared)
cgid_module (shared)
dir_module (shared)
env_module (shared)
mime_module (shared)
negotiation_module (shared)
proxy_module (shared)
proxy_balancer_module (shared)
proxy_http_module (shared)
rewrite_module (shared)
setenvif_module (shared)
status_module (shared)
Syntax OK
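Since a missing proxy_http is easy to overlook, the check can be scripted. The following is a small sketch that parses text in the format of the module listing above and reports which of the three proxy modules are absent; the function names are hypothetical, and it assumes each module appears on a line as "name_module (static)" or "name_module (shared)".

```python
import re

# The three modules this setup depends on.
REQUIRED_MODULES = ("proxy_module", "proxy_balancer_module", "proxy_http_module")


def loaded_modules(dump_text):
    """Extract module names from DUMP_MODULES-style output."""
    pattern = r"^\s*(\w+_module)\s+\((?:static|shared)\)"
    return set(re.findall(pattern, dump_text, re.MULTILINE))


def missing_modules(dump_text, required=REQUIRED_MODULES):
    """Return the required modules that do not appear in the listing."""
    loaded = loaded_modules(dump_text)
    return [name for name in required if name not in loaded]


sample = """Loaded Modules:
 core_module (static)
 proxy_module (shared)
 proxy_balancer_module (shared)
Syntax OK
"""
print(missing_modules(sample))
```

In practice the text would come from capturing the output of the apache2ctl command shown above, for example with subprocess.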

Once the server is restarted (apache2ctl -k restart), the changes will take effect and requests should now be routed to the application server nodes running on ports 8080 and 8081 as configured in the example.

The load balancer module is quite powerful: it has options for setting load factors to better balance load across nodes of varying capacity.  For more information on configuration options, see the documentation.
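To get a feel for what a load factor does, here is a toy weighted scheduler. It is loosely modeled on the request-counting scheme described in the mod_proxy_balancer documentation and is not Apache's actual code; the dictionary layout and the pick_member function are illustrative assumptions.

```python
def pick_member(members):
    """Choose the next backend, favouring members with a higher loadfactor.

    Each member is a dict with 'url', 'loadfactor' and a running 'lbstatus'.
    """
    total_factor = 0
    candidate = None
    for member in members:
        # Every member earns credit in proportion to its load factor.
        member["lbstatus"] += member["loadfactor"]
        total_factor += member["loadfactor"]
        if candidate is None or member["lbstatus"] > candidate["lbstatus"]:
            candidate = member
    # The chosen member pays for the request it is about to serve.
    candidate["lbstatus"] -= total_factor
    return candidate["url"]


members = [
    {"url": "http://127.0.0.1:8080", "loadfactor": 3, "lbstatus": 0},
    {"url": "http://127.0.0.1:8081", "loadfactor": 1, "lbstatus": 0},
]
picks = [pick_member(members) for _ in range(8)]
```

With load factors of 3 and 1, three out of every four requests go to the first member. With equal factors, as in the configuration above, the two members simply alternate.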



Running Amazon EC2

14 07 2007

So I was told the other day that I had 48 hours to migrate one of my older AMI (Amazon Machine Image) instances, because the instance was running on degraded hardware (which apparently had suffered some hardware failure). Amazon was proactive about it and had stopped billing for this particular instance. In any case, I was able to log in to this instance and retrieve the files that are specific to it.

One great thing about EC2 is that there is a wide selection of public AMIs available. This means one can experiment with different system configurations, from the different Linux distributions (e.g. Debian, Fedora, Gentoo) to the software installed (e.g. a full LAMP stack). Since my old instance has been running the Amazon-provided Fedora4, I figure it’s time to try something else and also document the steps.

1. Getting the certificate and private key ready
Amazon provides a set of command-line tools (written in Java) for working with EC2. These tools use HTTPS to communicate with the web service and therefore require X.509 certificates and private keys to be available. Once you are in the EC2 program (currently Beta at aws.amazon.com), you can generate the required certificate and keys. Since I had already generated these, there’s little to do here. Amazon recommends putting these credential files in, say, ~/.ec2, like so:

$ ls -alkn ~/.ec2
total 12
drwx------ 5 501 501 170 Jul 14 15:05 .
drwxr-xr-x 53 501 501 1802 Jul 14 14:11 ..
-rw-r--r-- 1 501 501 689 Jul 14 14:11 cert-A6O5VGEIFPYKTCNTXVK4D2XE5ESNCB7U.pem
-rw-r--r-- 1 501 501 721 Jul 14 14:11 pk-A6O5VGEIFPYKTCNTXVK4D2XE5ESNCB7U.pem

2. Install EC2 Tools
The command line tools are available from the Developer Connection site (here). These tools also expect Java to be installed on your machine. Once installed, the tools are ready for use after setting a few environment variables. For example, if the downloaded zip file unzips to directory foo (~/projects/ec2/ec2-api-2007-03-01 in my case), the environment variables are set as follows:

export EC2_HOME=~/projects/ec2/ec2-api-2007-03-01/ec2-api-tools
export PATH=$EC2_HOME/bin:$PATH
export EC2_PRIVATE_KEY=~/.ec2/pk-A6O5VGEIFPYKTCNTXVK4D2XE5ESNCB7U.pem
export EC2_CERT=~/.ec2/cert-A6O5VGEIFPYKTCNTXVK4D2XE5ESNCB7U.pem

That’s it!

3. Finding public AMIs, checking instances
Several commands are useful:

ec2-describe-images -a shows a list of available AMIs. This has grown to be a pretty long list since the early days of EC2. This information is also available at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=101

Since this new instance will be serving web pages with a Java and S3 backend, an AMI that is small, optimized for Xen, would be ideal. For this, I decided to try out the Gentoo distribution (ami-8b8a6fe2). This particular AMI has the basics like openssh and Apache2 and that’s about it.

ec2-describe-instances shows any currently running instances. There is also a Firefox plugin that simplifies many of these tasks by providing a nice UI. Still, command line tools are scriptable…

4. Running a new AMI instance
With the tools in place and the cert/keys ready, starting up a new instance is pretty trivial. First, we need to generate a keypair for the new instance. An instance of a public AMI has no password, and logging in via ssh requires a public/private keypair. Since this is specific to my instantiations of the AMI, a keypair needs to be generated so that one half of the keypair is embedded in the instance, allowing me to log in later with the other half. To do this, simply use the command:

ec2-add-keypair ami-8b8a6fe2-gentoo-base-eminent

The key name ami-8b8a6fe2-gentoo-base-eminent follows a simple convention that denotes the AMI (ami-8b8a6fe2) and the configuration (gentoo-base-eminent). This key name is needed later when starting up new instances of this AMI. The command prints out the private RSA key necessary for ssh login (via the -i option) later:

-----BEGIN RSA PRIVATE KEY-----
MIIEpgIBAAKCAQEAmmVOcPrBRXgGbo3XtvKxld/Glmuqi9gGKLNzyfUspKCuSjwmgHB91y7e8aH+
tGyHdbYnHPC/nNbh15F3jjdneM5W1GphcUJu4m2HylAklgTOC8pYVdS8XacKiGSBaUXvZimXCsH/
Uzcm3rxfxwNESwWpsg9aPXYi//T0quqM1xvZNFXO1s1s5ZJfKugCUUJrq365afaOR1hiipx+02U5
zKSTYZc9XWKbbaNSSeIDCPh8CZTxEH/FEuutaMxisMJ26uAqD0plnc1sj+mv8NNCl+/XgTlPLzVg
...
-----END RSA PRIVATE KEY-----

Since this output needs to be captured in a file, we can just do this:

ec2-add-keypair ami-8b8a6fe2-gentoo-base-eminent > ~/.ec2/ami-8b8a6fe2-gentoo-base-eminent.id

Change the permission of the id file (chmod 600 ~/.ec2/ami-8b8a6fe2-gentoo-base-eminent.id) or ssh won’t like it! Now that we have the keypair, start up a new instance:

ec2-run-instances ami-8b8a6fe2 -k ami-8b8a6fe2-gentoo-base-eminent

This starts up one instance of the Gentoo AMI (ami-8b8a6fe2) associated with the keypair (ami-8b8a6fe2-gentoo-base-eminent). The instance is started in the default group. In order to gain ssh access, we need to authorize port access (for the default group):

ec2-authorize default -p 22 for ssh, and ec2-authorize default -p 80 for HTTP.
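The permission requirement on the key file mentioned in this step can also be handled from a script. Here is a minimal sketch, assuming a POSIX system, that writes key material to a file readable only by its owner and checks the result; save_private_key and is_private are hypothetical helper names, and the key text would be whatever ec2-add-keypair printed.

```python
import os
import stat


def save_private_key(path, key_text):
    """Write key material to a file that only the owner can read or write."""
    # Create the file with mode 0600 from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as key_file:
        key_file.write(key_text)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)


def is_private(path):
    """True when neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

Running is_private on the id file before calling ssh gives an early, readable failure instead of ssh's complaint about an unprotected private key.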

5. Connecting to the new AMI instance
Now to connect to the new instance, we first find out how to get to it:

$ ec2-describe-instances
...
INSTANCE i-8d688be4 ami-8b8a6fe2 ec2-72-44-51-245.z-1.compute-1.amazonaws.com ...

To connect to it, simply ssh to the hostname listed above:

ssh -i ~/.ec2/ami-8b8a6fe2-gentoo-base-eminent.id root@ec2-72-44-51-245.z-1.compute-1.amazonaws.com

That’s it!
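Since the command line tools are scriptable, the lookup-then-connect step can be automated. The sketch below parses text shaped like the INSTANCE line shown above and builds the matching ssh command. The field positions (instance id second, hostname fourth) are an assumption taken from that sample line, and the function names are hypothetical; an instance that is still starting up may not have a hostname in that position.

```python
def instance_hosts(describe_output):
    """Map instance id -> public hostname from describe-instances style text."""
    hosts = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Only INSTANCE rows with at least id, AMI and hostname are used.
        if len(fields) >= 4 and fields[0] == "INSTANCE":
            hosts[fields[1]] = fields[3]
    return hosts


def ssh_command(key_file, host, user="root"):
    """Build the argument list for logging in with the instance keypair."""
    return ["ssh", "-i", key_file, f"{user}@{host}"]


sample = (
    "INSTANCE i-8d688be4 ami-8b8a6fe2 "
    "ec2-72-44-51-245.z-1.compute-1.amazonaws.com"
)
hosts = instance_hosts(sample)
command = ssh_command(
    "~/.ec2/ami-8b8a6fe2-gentoo-base-eminent.id", hosts["i-8d688be4"]
)
```

The resulting list can be handed to subprocess.run, after expanding the ~ in the key path with os.path.expanduser.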



Hello world!

16 06 2007

Well, here we go.  This is my very, very first post.  In true programming fashion, here’s the prototypical “Hello World!” statement.  From this point on, this blog will document many interesting thoughts, opinions, and experiments, covering interesting topics-of-the-day in technology, commerce, and media.  At least that’s the plan:  to create a place to document and prove that these thoughts actually existed in my head at some point in time…