Quick and Dirty Amazon S3 Integration

Unless you’ve been hiding under a rock, you probably know about Amazon S3 by now. As far as I’m concerned, it’s about as cheap and easy to use as cloud-based storage gets. If you’re writing your own application, there’s a very simple API for getting stuff into and out of it. If you’ve got a baked application, it’s a bit more complicated. Sometimes you want to take advantage of the cost savings and scalability of S3, but can’t (or don’t want to) modify the web application to use the s3 API directly.

I came up with a quick and dirty workaround for this by installing a little utility called s3cmd to move the files for me. Here are the steps I took. I’m using Ubuntu on this server, but this should work with minor adaptation on any Unix variant.

The basic theory of operation is this: the files in a local directory get synced to s3, then your site redirects visitors (via .htaccess) to the s3 files instead of the local files.

  1. Have the same setup as me: Apache under Ubuntu.
  2. Set up an s3 account.
  3. Install s3cmd:
    sudo apt-get install s3cmd
  4. Configure s3cmd. It will prompt you for your API keys, which you can get from your account page.
    s3cmd --configure
  5. Create a bucket:
    s3cmd mb s3://mah-bucket
  6. Create a script somewhere called s3sync.sh (or whatever you like). Be sure to change “mah-bucket” to something more meaningful.
    #!/bin/sh
    s3cmd sync --acl-public /path/to/local/content/folder/ s3://mah-bucket
  7. Make sure your script is executable:
    chmod 0755 s3sync.sh
  8. Run your script. All your images will be copied to your s3 bucket:
    ./s3sync.sh
  9. Make sure Apache has mod_rewrite turned on, then edit your .htaccess file so it includes the lines below. Don’t forget to change “mah-bucket” to whatever you called your bucket in step 6. Also change “url/path/to/folder/” to the URL path of the folder from step 6. (Note the presence of “^” and the lack of a slash at the beginning.) The “uploads/” in the target URL is the key prefix the files live under in the bucket; adjust or drop it to match where your sync actually puts them.
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteRule ^url/path/to/folder/(.*) http://mah-bucket.s3.amazonaws.com/uploads/$1 [R,L]
    </IfModule>
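
If you want to sanity-check the rule before pointing real traffic at it, here is a rough sketch that mimics the RewriteRule above with sed, so you can see which s3 URL a given request path should end up at. The function name s3_redirect_target is made up for illustration, and it assumes the placeholder path and bucket name from the steps above; it only imitates the pattern and is not how Apache evaluates the rule.

```shell
#!/bin/sh
# Hypothetical helper: imitate the RewriteRule from step 9 so you can
# predict where a request should be redirected.
s3_redirect_target() {
  # In a docroot .htaccess, Apache matches the path without its leading slash.
  path=${1#/}
  # Print the s3 URL if the path matches the rule, otherwise print nothing.
  printf '%s\n' "$path" |
    sed -n 's#^url/path/to/folder/\(.*\)$#http://mah-bucket.s3.amazonaws.com/uploads/\1#p'
}
```

You could then compare its output with what the server really sends back, for example with curl -sI against one of your own file URLs and a look at the Location header.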

That’s it! Your site will now redirect requests for the original files to your s3 bucket. The only caveat is that you will need to run your sync script every time you upload new files so they get copied to s3. I run mine manually via ssh, but this could become a pain if there were a lot of new files. One option would be to cron it to run every few minutes, but bear in mind that s3 is billed by the request and it will add up over time. A better way of auto-syncing might be to write a script that polls your uploads folder for changes, then calls your sync script if it finds something different.

Here’s a modified s3sync.sh that uses md5deep to see if the folder contents have changed. You can cron this as often as you like and it will only talk to s3 if there are actually changes.

  #!/bin/sh

  HASH_FILE=/path/to/home/dir/hashes.txt
  HASH_DIFF_FILE=/path/to/home/dir/tmp_diff_hashes.txt
  LOCAL_DIR=/path/to/local/content/folder
  S3_BUCKET=s3://mah-bucket/

  # First run: record a baseline hash for every file.
  if [ ! -f "$HASH_FILE" ]
  then
    printf '\nCreating new hash file %s\n\n' "$HASH_FILE"
    md5deep -rl "$LOCAL_DIR" > "$HASH_FILE"
  fi

  # Start from a clean diff file.
  rm -f "$HASH_DIFF_FILE"

  # Negative matching (-x): list files whose hash is not in the baseline.
  md5deep -x "$HASH_FILE" -r "$LOCAL_DIR" > "$HASH_DIFF_FILE"

  # A non-empty diff means new or changed files: sync, then refresh the baseline.
  if [ -s "$HASH_DIFF_FILE" ]
  then
    s3cmd sync --acl-public "$LOCAL_DIR" "$S3_BUCKET"
    md5deep -rl "$LOCAL_DIR" > "$HASH_FILE"
  fi

  rm -f "$HASH_DIFF_FILE"
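
The hash pass reads every file on every run, which could get slow on a big folder. A lighter variant, sketched here and not something I have run in production, compares modification times against a timestamp file using find -newer. The function names needs_sync and sync_if_changed and the marker file are all made up for this example.

```shell
#!/bin/sh
# Sketch: sync only when a file changed since the last successful sync.
# The marker file is a hypothetical timestamp file whose mtime records
# when the last sync started.

# Succeeds (exit 0) if the marker is missing, or if any regular file
# under directory $1 has a newer modification time than marker file $2.
needs_sync() {
  dir=$1
  marker=$2
  [ -f "$marker" ] || return 0
  [ -n "$(find "$dir" -type f -newer "$marker" | head -n 1)" ]
}

# Usage: sync_if_changed LOCAL_DIR S3_BUCKET MARKER
sync_if_changed() {
  if needs_sync "$1" "$3"
  then
    # Stamp before syncing so files that change mid-sync are caught next run.
    touch "$3.new"
    s3cmd sync --acl-public "$1" "$2" && mv "$3.new" "$3"
  fi
}
```

From cron you would source this and call something like sync_if_changed /path/to/local/content/folder s3://mah-bucket/ /path/to/home/dir/last_sync. Since it only looks at modification times, it will not notice files that were deleted.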

And that’s that! I hope someone finds this useful. If you did, let me know!

3 Responses to “Quick and Dirty Amazon S3 Integration”

  1. On August 18th, 2010 at 6:36 pm awol said:

    watch out for that dirty hidden 5GB filesize limit on s3 (if your site hosts large files)


  2. On September 8th, 2010 at 4:47 pm d said:

    Very interesting.
    One question: instead of using a hash process (not sure how much time and processing power would it take on a 50,000 images dir) can it be done by listing the files created or modified after the last upload?
    Cheers…


  3. On September 8th, 2010 at 4:50 pm Jaybill McCarthy said:

    @d – Sure, that would work too. Good idea. Feel free to suggest changes to my script there that would do that and I’d be happy to post them and give you credit.