Amazon S3 (Simple Storage Service) is an AWS cloud storage service that provides object storage with seamless scalability and reliability.

S3 is a trusted storage option among developers, and it offers HIPAA- and PCI-DSS-compliant encryption for stored data. The two encryption options are client-side encryption and server-side encryption. Server-side encryption is managed by S3 itself and is the more popular of the two.

The recommended ways to use S3 are through the AWS SDKs for various programming languages and the AWS CLI (command line interface). Although various third-party tools let S3 buckets be used as mounted file systems (for example, see this post on Amazon S3 sync using EMR), relying on such proprietary software is not the most common approach. Many developers instead use S3 through its HTTP/HTTPS endpoints in their code.

AWS S3 Sync with CLI Commands

The core focus of this article is to explore the options available for syncing the contents of an S3 bucket with the contents of a directory on a file system. There are two possible scenarios: in case 1, the file system contents have to be updated to reflect new contents in the S3 bucket; in case 2, the contents of the S3 bucket have to be updated to reflect new contents in the file system directory.

Let  us see how this is done using AWS CLI commands. The commands in this  article assume that the S3 prefixes and directory names being used are  pre-existing, and the AWS CLI is configured with IAM credentials  providing read and write access to the S3 buckets. For detailed  instructions on how to configure AWS CLI for read-write access, see the documentation on configuring and uploading a file to S3 using AWS CLI.
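Before running any sync, it can help to confirm that the prerequisites above actually hold. The following is a minimal sketch, not an official procedure: the function name check_s3_access is hypothetical, and the check only shows that credentials are usable and that the bucket can be listed. It does not prove write access.

```shell
# Hypothetical helper: verifies that the AWS CLI has usable credentials
# and that those credentials can list the given bucket.
check_s3_access() {
  bucket=$1
  # Confirms that the CLI can find and use credentials.
  aws sts get-caller-identity > /dev/null || {
    echo "no usable credentials" >&2
    return 1
  }
  # Confirms read (list) access only; write access is not tested here.
  aws s3 ls "s3://$bucket" > /dev/null || {
    echo "cannot list s3://$bucket" >&2
    return 1
  }
  echo "ok"
}
```

Calling check_s3_access example-bucket prints ok when both checks pass and returns a non-zero status otherwise.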

Use Case 1: Synchronizing (updating) local file system with the contents in the S3 bucket

The use case here is to update the local file system with data newly added to the S3 bucket.

The AWS CLI command aws s3 sync <source> <destination>, with an S3 bucket as the source and a local directory as the destination, downloads any files (objects) in the bucket that aren’t already present on the local file system.

For example, say we want the contents of S3 bucket named example-bucket to be downloaded to the local current directory. The command is as follows:

aws s3 sync s3://example-bucket .

An  S3 object will be downloaded if the size of the S3 object differs from  the size of the local file, the last modified time of the S3 object is  newer than the last modified time of the local file, or the S3 object  does not exist in the local directory. The last modified time of the  local file is changed to the last modified time of the S3 object.
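The three conditions above can be written out as a small decision function. This is only an illustration of the rule as stated in this article, not the AWS CLI's actual implementation; the function name should_download and its argument order are assumptions made for the example.

```shell
# Illustrative only: mirrors the three download conditions described above.
# Arguments: remote_size remote_mtime local_size local_mtime
# Sizes are in bytes, times are epoch seconds. Pass empty strings for the
# local values when the local file does not exist.
should_download() {
  remote_size=$1; remote_mtime=$2; local_size=$3; local_mtime=$4
  # The object does not exist in the local directory.
  if [ -z "$local_size" ]; then echo yes; return 0; fi
  # The sizes differ.
  if [ "$remote_size" -ne "$local_size" ]; then echo yes; return 0; fi
  # The S3 object is newer than the local file.
  if [ "$remote_mtime" -gt "$local_mtime" ]; then echo yes; return 0; fi
  echo no
}
```

Note that a local file that is newer than the S3 object, with the same size, is left alone under this rule.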

Use Case 2: Synchronizing (updating) S3 bucket with the contents of the local file system

Most often, we need the contents of the local file system to be uploaded to an S3 bucket, so that changes and newly added files are propagated to the bucket regularly. This is achieved with the same aws s3 sync command.

For example, say we want the contents of the current directory to be synced to an S3 bucket named example-bucket. The command is as follows:

aws s3 sync . s3://example-bucket

A  local file is uploaded if the size of the local file is different than  the size of the S3 object, the last modified time of the local file is  newer than the last modified time of the S3 object, or the local file  does not exist under the specified bucket and prefix.
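The destination does not have to be the bucket root; it can include a prefix, and files can be filtered out of the upload. The sketch below assumes a hypothetical helper name (upload_to_prefix) and a hypothetical pattern for temporary files; --exclude is a standard option of aws s3 sync.

```shell
# Hypothetical helper: uploads a local directory under a prefix in a bucket,
# skipping files that match *.tmp (an arbitrary pattern chosen for
# illustration).
upload_to_prefix() {
  local_dir=$1; bucket=$2; prefix=$3
  aws s3 sync "$local_dir" "s3://$bucket/$prefix" --exclude "*.tmp"
}
```

For example, upload_to_prefix . example-bucket backups would sync the current directory to s3://example-bucket/backups.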

An interesting behavior of the sync command when used to upload the contents of a local directory is that it won’t upload empty directories from the local file system. The reason is that S3 is an object storage service, so it has different semantics than a regular file system. S3 does not create or use actual physical folders or a directory structure; it has only buckets and objects. Objects whose keys share a common prefix are presented as if they were in a directory.

Therefore, when aws s3 sync is used to upload content to S3 buckets, empty directories are ignored and nothing is uploaded for them. Once a directory contains at least one file, its contents are uploaded. As a quick workaround for situations where even empty directories need to appear in S3, put a dummy file within each such directory.
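The dummy-file workaround can be automated before each upload. This is a minimal sketch: the function name add_placeholders and the file name .keep are arbitrary choices for illustration, and the -empty test assumes GNU or BSD find, since it is not part of POSIX.

```shell
# Hypothetical helper: creates a placeholder file in every empty directory
# under the given root (default: current directory), so that aws s3 sync
# has an object to upload for each of them.
add_placeholders() {
  root=${1:-.}
  # -empty is a GNU/BSD extension; ".keep" is an arbitrary file name.
  find "$root" -type d -empty -exec sh -c 'touch "$1/.keep"' _ {} \;
}
```

Running add_placeholders . followed by aws s3 sync . s3://example-bucket would then carry the otherwise empty directories over as prefixes containing a single .keep object.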

Complete Synchronization between local file system and S3 bucket

Full synchronization between a local directory and an S3 bucket is achieved by regularly running the following commands in sequence through a cron job.

aws s3 sync s3://example-bucket .
aws s3 sync . s3://example-bucket
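For use from cron, the two commands can be wrapped in a small script that stops if the first step fails. The sketch below is one possible arrangement, not a prescribed one: the function name two_way_sync, the script path, the log path, and the 15-minute schedule are all hypothetical.

```shell
# Hypothetical helper: pulls new objects from the bucket first, then pushes
# local changes back. Stops at the first failing step.
two_way_sync() {
  bucket_uri=$1; local_dir=$2
  aws s3 sync "$bucket_uri" "$local_dir" || return 1
  aws s3 sync "$local_dir" "$bucket_uri" || return 1
}

# Example crontab entry (hypothetical paths), running every 15 minutes:
# */15 * * * * /usr/local/bin/two_way_sync.sh >> /var/log/s3-sync.log 2>&1
```

A script installed at the path in the crontab comment would define this function and end with a call such as two_way_sync s3://example-bucket /path/to/dir.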

A question that needs answering here is what happens to files that exist under the specified bucket and prefix but not in the local directory, or vice versa. The answer is that they are not deleted unless the --delete parameter is added to the command.

The command:

aws s3 sync <source> <destination> --delete

In  addition to synchronizing, this command will delete the files that  exist in the destination but not in the source during sync.
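Because --delete removes data, it is worth previewing the effect first. The AWS CLI's --dryrun option prints the operations a command would perform without carrying them out. The wrapper name preview_sync_delete below is hypothetical.

```shell
# Hypothetical helper: shows what a sync with --delete would do, without
# changing anything on either side.
preview_sync_delete() {
  src=$1; dst=$2
  aws s3 sync "$src" "$dst" --delete --dryrun
}
```

For example, preview_sync_delete . s3://example-bucket lists the uploads and deletions that would happen; running the same command without --dryrun then applies them.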

Closing thoughts

There are several other ways to achieve synchronization using third-party proprietary and free software as well. It has been observed that AWS CLI-based synchronization does not perform well when the volume of data is large. Using other methods, such as AWS Step Functions with AWS Data Pipeline, might be a better choice when a huge volume of data has to be synchronized.