CS530 S08

TR 11:40-12:55

Olin 245

 Architecture of Large-Scale Information Systems

 
 
    

Getting Started on AWS

The CS530 projects will use Amazon Web Services (AWS) to give us practical experience in using web services and especially clusters.

Amazon has graciously offered to provide us with a prepaid AWS account to support the course. This is a single account, to be shared among all the CS530 students.

1. Conventions for Sharing a Single AWS Account

Because we will all be sharing a single Amazon AWS account, it will be possible (with some effort) for students to decrypt and read one another's Amazon machine image files, or to steal one another's static content files. This is not fundamentally different from the bad old days when CS programming projects were done on a batch computing system and printouts were delivered in sorted piles in public terminal rooms. The University Academic Integrity Policy applies to everything we put into AWS, and you are expected to follow this policy scrupulously.

Also as a consequence of sharing a single account, we will need to modify some of the procedures from the Amazon documentation so different students' projects will not interfere with one another. We have the following issues:

  • The course AWS account is registered to me (ademers@cs.cornell.edu) and only I know the password. Some of the account setup operations described in the AWS Getting Started documentation require this password, but these operations should be done exactly once per account, and I have already done them.

  • Authentication for AWS operations requires either

    • a KeyID and a Secret Key, or
    • an Account ID together with Private Key and X.509 Certificate files in .pem format.

    This nominally "secret" information must be known by every CS530 student. We have not included it in this (public) document; instead, it is available for download from CMS as part of Project 3. Effectively, we are relying on CMS authentication to restrict the AWS account information to registered CS530 students.

  • Some AWS resources are in either global or AWS-account-wide name spaces. We have defined naming conventions for these resources. Here are the conventions regarding the S3 storage service and the EC2 compute service:

    • Buckets (S3 file system directories): There is a hard account-wide limit of 100 bucket names -- only a bit more than two per student -- so we ask you not to create multiple buckets. We currently have buckets named

      edu-cornell-cs-cs530
      edu-cornell-cs-cs530-ajd28
      edu-cornell-cs-cs530-vv39
      The first of these is for course-wide files like the file aws-get-started.html you are currently reading; please don't put anything there. The remaining ones are the personal buckets of the instructor (netid ajd28) and TA (netid vv39). Note the reversed-domain-name convention, similar to the convention for naming Java classes, but using hyphens (-) rather than dots to separate components. You should each create a single personal bucket, named with your own netid according to this convention,
      edu-cornell-cs-cs530-netid
      and do all your work in that bucket. The only exception is that it is okay to create a bucket that extends your conventional bucket name, for example
      edu-cornell-cs-cs530-ajd28-debugging-bucket
      as long as that bucket is used only temporarily, and deleted at the end of every work session. If you think there is a compelling reason why you need additional permanent buckets, please discuss it with the instructor or TA.
    • Image Name Prefixes (discussed below): These should be chosen to be unique by including your netid and then appending whatever descriptive name you want according to the pattern

      im-netid-description
      for example
      im-ajd28-test1
      im-ajd28-load-balancer
      im-vv39-database
      As discussed below, each image name prefix occurs as a prefix of each of the (many) file names of a bundled machine image. Thus, the naming convention will make it possible for students to keep more than one machine image in a single bucket.
    • Keypairs (discussed below): Choose keypair names using the pattern

      kp-netid-description
      for example
      kp-ajd28-0
      kp-ajd28-lb
      kp-vv39-db1
      Keypair names do not correspond directly to S3 file names, but they are used in the Amazon EC2 API and have to be unique across the shared account.
    • Security groups (discussed below): Choose group names using the pattern

      gp-netid-description
      for example
      gp-ajd28-proj1
      gp-vv39-app-servers
      Like keypair names, group names do not correspond directly to S3 file names, but they are used in the Amazon EC2 API and have to be unique across the shared account.

    To help things work smoothly with so many persons sharing the same AWS account, we ask you to adhere scrupulously to the above conventions.
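One way to avoid typing mistakes in these names is to generate them. The following is a minimal sketch, assuming a Unix-like shell; the function names (cs530_bucket and so on) are hypothetical helpers invented for this illustration and are not part of any Amazon tool.

```shell
# Hypothetical helpers (not part of the AWS tools): build shared-account
# resource names from a netid, following the conventions listed above.

# usage: cs530_bucket NETID [SUFFIX]
cs530_bucket() {
    if [ -n "${2:-}" ]; then
        # temporary bucket that extends the personal bucket name
        printf 'edu-cornell-cs-cs530-%s-%s\n' "$1" "$2"
    else
        # the single personal bucket
        printf 'edu-cornell-cs-cs530-%s\n' "$1"
    fi
}

# usage: cs530_image NETID DESCRIPTION    (image name prefix)
cs530_image()   { printf 'im-%s-%s\n' "$1" "$2"; }

# usage: cs530_keypair NETID DESCRIPTION
cs530_keypair() { printf 'kp-%s-%s\n' "$1" "$2"; }

# usage: cs530_group NETID DESCRIPTION    (security group)
cs530_group()   { printf 'gp-%s-%s\n' "$1" "$2"; }
```

For example, "cs530_keypair ajd28 lb" prints kp-ajd28-lb, and "cs530_bucket ajd28" prints edu-cornell-cs-cs530-ajd28.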

The following sections will describe what you need to do to set up your system to use the shared AWS account, in particular pointing out which parts of the Amazon "Getting Started" documentation you should not do.

The examples will follow Linux / MacOS syntax and filename conventions, with occasional hints for Windows users. Long commands may be split across several lines in this documentation for readability; your command interpreter won't let you do that, so type each such command on a single line.

2. Downloading AWS Account Information

To get the shared AWS account information, log on to CMS, go to Project 2, and download the file account-info.txt. Open the file in a text editor and follow the instructions you see there. When you are done, you will have accomplished the following. You will have created a directory

~/.aws
or (for Windows users)
c:\aws
containing two files:
pk-xxxxxxxx.pem
cert-xxxxxxxx.pem
Values for xxxxxxxx will come from the download. These files contain the private key and X.509 certificate used to authenticate to Amazon EC2.

You will also have modified your environment to contain

AWS_ACCOUNT_ID=xxxxxxxx
AWS_KEY_ID=xxxxxxxx
AWS_SECRET_KEY=xxxxxxxx
EC2_PRIVATE_KEY=hhhhhhhh/.aws/pk-xxxxxxxx.pem
EC2_CERT=hhhhhhhh/.aws/cert-xxxxxxxx.pem
or (for Windows users)
AWS_ACCOUNT_ID=xxxxxxxx
AWS_KEY_ID=xxxxxxxx
AWS_SECRET_KEY=xxxxxxxx
EC2_PRIVATE_KEY=c:\aws\pk-xxxxxxxx.pem
EC2_CERT=c:\aws\cert-xxxxxxxx.pem
The actual values for xxxxxxxx will come from the download, and the actual value for hhhhhhhh will be the path to your home directory.
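Before going further it may be worth confirming that the environment really is set up as described. The following is a sketch for Linux / MacOS shells only; check_aws_env is a hypothetical name chosen for this example, and the script checks only that the five variables are non-empty and that the two .pem files are readable, not that their contents are valid.

```shell
# Hypothetical sanity check (not an Amazon tool): report which of the
# five variables listed above are unset or empty, and whether the two
# .pem files they name can be read.
check_aws_env() {
    missing=""
    for name in AWS_ACCOUNT_ID AWS_KEY_ID AWS_SECRET_KEY EC2_PRIVATE_KEY EC2_CERT; do
        # look up the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    for name in EC2_PRIVATE_KEY EC2_CERT; do
        eval "value=\${$name:-}"
        # a variable that is set must also point at a readable file
        if [ -n "$value" ] && [ ! -r "$value" ]; then
            missing="$missing $name(file-not-readable)"
        fi
    done
    if [ -n "$missing" ]; then
        printf 'problems:%s\n' "$missing"
        return 1
    fi
    printf 'environment looks complete\n'
}
```

Running check_aws_env in a freshly opened terminal window is the most useful test, since it shows whether the settings survive beyond the shell in which you made them.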

All done! The following sections will discuss getting started with S3 and EC2 in some detail.

Here is a general guideline: If you go through the Amazon "Getting Started" documents for S3 or EC2 there will be places where you will be instructed to use the Amazon web site to sign up for a service or to create a new X.509 certificate. In other places you will be instructed to create a new keypair, to modify the rules of the default network security group, or to do some other thing that affects the global state of the AWS account. Clearly, if several students were to try this concurrently it would be a Bad Thing. So,

You should ignore such instructions!
The AWS account has already been set up; the AccountID, KeyID, Secret Key, and X.509 certificate have already been created; and you have installed them on your machine. Your bucket names, keypair names, image names and security group names should always be constructed according to the conventions described in Section 1, and you should avoid the default network security group altogether.

3. Getting Started with S3

The Amazon "Getting Started" document for S3 is here.

Read the first couple of sections.

As stated earlier, ignore the section "Subscribing to the Service" -- we've already subscribed, and the KeyID and Secret Key are already in your environment.

The next section, "Authenticating", can be skimmed over quickly for now, as the techniques it describes are built in to any API library you might choose to use.

The Java code samples in the remaining sections assume you are using this Amazon S3 Library in Java. I have tried this library in Java 1.5 and it seems okay. But the example drivers S3Test.java and S3Driver.java create bucket names using a deterministic function of the account KeyID, and so are not safe for use in a shared account. The following changes are needed:

static final String awsAccessKeyId = System.getenv("AWS_KEY_ID");
static final String awsSecretAccessKey = System.getenv("AWS_SECRET_KEY");
static final String myNetID = "ab123";  // replace ab123 with your own netid
static final String bucketName =
     "edu-cornell-cs-cs530-" + myNetID + "-test-bucket";
Make these changes to both S3Test.java and S3Driver.java, in the obvious places. The code will now create a test bucket name using our bucket naming conventions with your own netid, avoiding name collisions with other users.

Make sure you can run both S3Test and S3Driver without errors. At that point you should be able to go successfully through the remaining sections of the Amazon tutorial ("Creating a Bucket" through "Listing Keys").

Eventually (in fact, for Section 4 below) you will need a suite of command line tools (a "shell") to manipulate your S3 buckets and objects. A couple of these are available, for example a Java one here. Or you may want to write a few simple tools yourself to gain experience, for example using the Amazon S3 Library in Java that the Amazon documentation relies on.

I (Demers) personally prefer the jets3t toolkit, as I find its API a bit more natural. Both libraries work.

There are also a couple of S3 shells available as Linux bash scripts, here and here. These are noteworthy in that they can easily be installed and run on an Amazon EC2 image as part of a custom startup package. We will have more to say about this in the main Project 3 document.

4. Getting Started with EC2

The Amazon "Getting Started" document for EC2 is here. Unlike the S3 document, this one is written in a step-by-step "cookbook" style.

As always, you should ignore the early sections about signing up for AWS S3 and EC2 services, as this is already done. Notes about the remaining sections follow.

Setting up the Tools

Follow the instructions for downloading the ec2 command line tools and "Telling the Tools Where They Live" (setting the EC2_HOME environment variable and updating the PATH environment variable).

The final step of this section, "Telling the Tools Who You Are" (which sets EC2_PRIVATE_KEY and EC2_CERT in the environment), has already been done.

The Amazon online documentation includes a reference manual for the command line tools. It is a Really Good Idea to read the manual page for each ami command as you are about to use it, to make sure you understand what it is about to do.

Running an Instance

This section includes a step for "Generating a Keypair", which must be changed to conform to our shared account naming conventions. The Amazon instructions tell you to name your keypair "gsg-keypair". (The "gsg" part presumably stands for "Getting Started Guide.") Instead, you should use a netid-specific name following our naming conventions; for example,

kp-ab123-gsg
where the ab123 part should be replaced by your own netid. In each of the remaining steps that requires the name of an EC2 keypair, substitute "kp-ab123-gsg" for "gsg-keypair". The Amazon document instructs you to store the private key of the keypair in a local file. The logon step where you connect to your instance using an ssh client requires the name of this file, so put it someplace where you can find it, for example
~/.aws/id-rsa-kp-ab123-gsg
Again, the name you use for this file is arbitrary, but you need some convention that will enable you to find the RSA private key file associated with each of your EC2 keypair names.
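One possible way to keep keypair names and private key files in step is sketched below. The function names are hypothetical. The sketch assumes that ec2-add-keypair prints a summary line followed by the private key between the usual "BEGIN RSA PRIVATE KEY" and "END RSA PRIVATE KEY" marker lines, as the example output in the Amazon guide suggests; check that against what you actually see before relying on it.

```shell
# Hypothetical helpers for the ~/.aws/id-rsa-<keypair> file convention.

# usage: keyfile_for KEYPAIR_NAME    (prints the local private key path)
keyfile_for() { printf '%s/.aws/id-rsa-%s\n' "$HOME" "$1"; }

# Copy only the PEM block from standard input to standard output.
# Assumption: the key is delimited by the standard RSA marker lines.
extract_pem() {
    sed -n '/^-----BEGIN RSA PRIVATE KEY-----/,/^-----END RSA PRIVATE KEY-----/p'
}

# usage: ec2-add-keypair kp-ab123-gsg | save_keypair kp-ab123-gsg
save_keypair() {
    file=$(keyfile_for "$1")
    mkdir -p "$(dirname "$file")"
    # create the file unreadable by others; ssh refuses private keys
    # whose permissions are too open
    ( umask 077; extract_pem > "$file" )
    chmod 600 "$file"
}
```

If the ec2-add-keypair command fails, the resulting file will be empty, so look at it before moving on. The same keyfile_for function can then supply the argument to ssh -i in the logon step.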

The Network Security Group is another important issue that is not well covered in the Amazon document. Every EC2 instance runs in a named security group that you specify when you start the instance. The security group has a set of firewall rules that control network connectivity between instances in the group and instances outside it.

If you start an instance without explicitly specifying a security group, the instance runs in a predefined group named "default". Clearly, it would be a Bad Idea for concurrent users of a shared account to have instances running in the (same) default security group; so we use the naming convention described above for security groups. You should create a new security group for the remainder of this exercise using a command like

ec2-add-group   gp-ab123-xxx   -d   "yyyyyyyy"
where, as above, ab123 should be replaced by your own netid, xxx by a string to make the name unique among group names you define, and yyyyyyyy by a short description of the group. For example,
ec2-add-group   gp-ajd28-test   -d   "test group for getting started"

You can check to make sure this worked by typing

ec2-describe-groups   gp-ab123-xxx
or just
ec2-describe-groups
which will list all groups that have been defined by anyone using the shared account.

For some of the later steps in this exercise you will need to specify the group name explicitly rather than allowing it to default to "default". The command that actually starts your instance is the first of these. For example

ec2-run-instances   ami-2bb65342   -g   gp-ajd28-test   -k   kp-ajd28-gsg
starts an instance in the specified group gp-ajd28-test.

In the next step, "Authorizing Network Access," you need to specify your group name in place of "default". For example,

ec2-authorize   gp-ajd28-test   -P   tcp   -p   22
ec2-authorize   gp-ajd28-test   -P   tcp   -p   80
opens the standard TCP ports for ssh (22) and HTTP (80) in the group gp-ajd28-test.
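Putting the pieces of this section together, the sequence of commands for one student looks roughly like the output of the sketch below. The function gsg_plan is hypothetical and only prints the commands rather than running them, so you can review each one first; the group and keypair suffixes ("test" and "gsg") are simply the ones used in the examples above.

```shell
# Hypothetical dry-run helper: print (do not execute) the commands of
# this section with the shared-account names filled in.
# usage: gsg_plan NETID AMI_ID
gsg_plan() {
    netid=$1
    ami=$2
    group="gp-$netid-test"      # security group name, per Section 1
    keypair="kp-$netid-gsg"     # keypair name, per Section 1
    printf 'ec2-add-keypair %s\n' "$keypair"
    printf 'ec2-add-group %s -d "test group for getting started"\n' "$group"
    printf 'ec2-run-instances %s -g %s -k %s\n' "$ami" "$group" "$keypair"
    printf 'ec2-authorize %s -P tcp -p 22\n' "$group"
    printf 'ec2-authorize %s -P tcp -p 80\n' "$group"
}
```

For example, "gsg_plan ajd28 ami-2bb65342" prints the five commands with gp-ajd28-test and kp-ajd28-gsg substituted throughout.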

At this point, you should be able to connect to your instance with the ssh command from the Amazon documentation, using the name of the RSA private key file you saved when you created your EC2 keypair, and the external network address assigned to your running instance, for example

ssh   -i   ~/.aws/id-rsa-kp-ajd28-gsg   root@ec2-67-202-33-73.compute-1.amazonaws.com
Well, "Congratulations!" You have started an instance.

As the Amazon docs warn you, DO NOT go away without remembering to shut down your instance -- the account is charged for the instance for as long as it continues to run.
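Because ec2-describe-instances reports the instances of everyone on the shared account, it helps to pick out your own before shutting anything down. The filter below is a sketch under a loudly stated assumption: that each instance is reported on a line beginning with the word INSTANCE, with the instance id as the second field and with the state ("running") and the keypair name appearing as separate fields somewhere later on that line. The function name is hypothetical. Instances started without a -k argument will not be found this way.

```shell
# Hypothetical filter: read ec2-describe-instances output on standard
# input and print the ids of running instances whose keypair name
# starts with kp-NETID-.
# Assumption: one whitespace-separated INSTANCE line per instance.
# usage: ec2-describe-instances | my_running_instances NETID
my_running_instances() {
    awk -v prefix="kp-$1-" '
        $1 == "INSTANCE" {
            running = 0; mine = 0
            for (i = 3; i <= NF; i++) {
                if ($i == "running") running = 1        # state field
                if (index($i, prefix) == 1) mine = 1    # keypair field
            }
            if (running && mine) print $2               # instance id
        }'
}
```

Check the ids it prints against the full listing before terminating anything, since terminating another student's instance would be a Bad Thing.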

Creating an Image

For this part of the exercise there is a convention that may not be obvious: Examples that use the prompt string

prompt>
are commands you execute on your local machine; while examples using the prompt string
#
are commands you execute on a running EC2 instance to which you have logged in as the root user with ssh. It is important to keep this straight.

In addition, we need to worry about naming conventions for Amazon Machine Image (AMI) bundles. The procedure used in the Amazon document gives the image files default names, and so does not support bundling more than one AMI into the same bucket.

First, you need to copy the private key and X.509 certificate associated with the (shared) account up to the running machine instance you want to bundle. Using our naming and environment conventions, a command something like

prompt> scp   -i   ~/.aws/id-rsa-kp-ajd28-gsg ${EC2_PRIVATE_KEY}   ${EC2_CERT}
    root@ec2-67-202-33-73.compute-1.amazonaws.com:/mnt
will upload these files to the /mnt directory of the running instance.

At this point you are ready to ask the EC2 instance to bundle itself. As you can see from the Amazon document, the default bundling command creates a number of files with names of the form

image.foo.bar
in the /mnt directory of the instance. These names appear in the S3 bucket where the AMI is stored, preventing you from creating more than one AMI in the same bucket. This is a Bad Thing. To avoid it, you need to add a common prefix to the name of each file in the bundle using the -p option to the ec2-bundle-vol command. This is the "Image Name Prefix" discussed in our naming conventions in Section 1. The command
# ec2-bundle-vol   -d   /mnt   -p   im-ajd28-gsg
    -k   /mnt/pk-xxxxxxxx.pem   -c   /mnt/cert-xxxxxxxx.pem
    -u iiiiiiii -r i386
bundles the image to a collection of files on /mnt; all the file names will begin with the prefix "im-ajd28-gsg" rather than "image". Here pk-xxxxxxxx.pem and cert-xxxxxxxx.pem must be replaced by the names of the .pem files that were uploaded using scp above, and iiiiiiii must be replaced by your AWS Account ID, the value of the environment variable AWS_ACCOUNT_ID on your local machine. Sadly, these values don't appear in the environment of the instance, where you are executing the ec2-bundle-vol command, so you will need to cut and paste.
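To reduce the amount of cutting and pasting, you can have your local machine compose the whole bundling command from its own environment and then paste that single line into the ssh session. This is only a sketch: bundle_command is a hypothetical helper, it assumes the Linux / MacOS environment described in Section 2, and it assumes the .pem files were copied to /mnt under their original names.

```shell
# Hypothetical helper, run on your LOCAL machine: print the
# ec2-bundle-vol command line with the .pem file names and account id
# taken from the local environment (EC2_PRIVATE_KEY, EC2_CERT,
# AWS_ACCOUNT_ID). Nothing is executed.
# usage: bundle_command IMAGE_NAME_PREFIX
bundle_command() {
    pk=$(basename "$EC2_PRIVATE_KEY")     # e.g. pk-xxxxxxxx.pem
    cert=$(basename "$EC2_CERT")          # e.g. cert-xxxxxxxx.pem
    printf 'ec2-bundle-vol -d /mnt -p %s -k /mnt/%s -c /mnt/%s -u %s -r i386\n' \
        "$1" "$pk" "$cert" "$AWS_ACCOUNT_ID"
}
```

For example, "bundle_command im-ajd28-gsg" prints one line that can be pasted at the # prompt on the instance.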

The next step is to upload the AMI to S3. The example command in the Amazon document does not reflect our use of an image name prefix. In the manifest file name /mnt/image.manifest.xml you need to replace "image" by the same image name prefix you specified in the ec2-bundle-vol command earlier. For example, the command

# ec2-upload-bundle   -b   edu-cornell-cs-cs530-ajd28   -m   /mnt/im-ajd28-gsg.manifest.xml
     -a   aws-key-id   -s   aws-secret-key
was used to upload an AMI to my own personal bucket. The aws-key-id and aws-secret-key need to be replaced by their true values. These are the values of AWS_KEY_ID and AWS_SECRET_KEY on your own machine, but again, as the upload command is being run on the EC2 instance, the environment variables will not be available and you will have to cut and paste.

Once you have successfully uploaded your AMI you no longer need your running instance; you can shut it down with the command

#   /sbin/shutdown   -h   now
on the instance itself, or you can use the AWS command
prompt>   ec2-terminate-instances   i-nnnnn
which is probably more reliable.

The final step of this lengthy process is to register your AMI so you can start it in a new instance. Again the Amazon documentation needs to be modified to get the manifest file name right. For example, the command

prompt>   ec2-register   edu-cornell-cs-cs530-ajd28/im-ajd28-gsg.manifest.xml
could be used to register the image we uploaded in the previous step.
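The upload and register steps must agree on three names: your personal bucket, your image name prefix, and the manifest file derived from that prefix. The sketch below makes the relationship explicit. Both function names are hypothetical, the commands are printed rather than run, and the key placeholders aws-key-id and aws-secret-key are deliberately left as they appear above so that no secret is echoed to the terminal.

```shell
# Hypothetical helpers: print the upload and register commands for a
# given netid and image name prefix, so the manifest names line up.

# usage: upload_command NETID IMAGE_NAME_PREFIX   (paste on the instance)
upload_command() {
    printf 'ec2-upload-bundle -b edu-cornell-cs-cs530-%s -m /mnt/%s.manifest.xml -a aws-key-id -s aws-secret-key\n' \
        "$1" "$2"
}

# usage: register_command NETID IMAGE_NAME_PREFIX (run on local machine)
register_command() {
    printf 'ec2-register edu-cornell-cs-cs530-%s/%s.manifest.xml\n' "$1" "$2"
}
```

For example, "register_command ajd28 im-ajd28-gsg" prints the ec2-register command shown above.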

At this point you have a registered AMI and can try running it. The command given in the Amazon document has the same issue we discussed when running a public AMI: for a shared account, you should never start an instance in the "default" group, so the command to run your instance should specify one of your own group names, for example

prompt>   ec2-run-instances   ami-5bae4b32   -g   gp-ajd28-test
Comparing this command to the one used to start a public instance, note there is no longer a keypair (-k) argument -- the bundled image implicitly uses the same keypair that it was created with.