Amazon S3 (Simple Storage Service) is an excellent AWS cloud storage option that provides object storage with seamless scalability and reliability.
S3 is a trusted storage option among developers, and it offers HIPAA and PCI-DSS compliant encryption for stored data. The encryption options are client-side encryption and server-side encryption. Server-side encryption is managed by S3 itself, and is the more popular of the two encryption types.
The recommended ways to use S3 are through the AWS SDK for various programming languages and the AWS CLI (command line interface). Though there are various third-party tools and software which let S3 buckets be used as mounted file systems (for example, see this post on Amazon S3 sync using EMR), relying on such proprietary software is not the most popular use case. Many developers use S3 through its HTTP/HTTPS endpoints in their code.
AWS S3 Sync with CLI Commands
The core focus of this article is to explore the options available for syncing the data in an S3 bucket with the contents of a directory on a file system. There are two possible scenarios here: in case 1, the file system contents have to be updated to reflect new contents in the S3 bucket, and in case 2, the contents of the S3 bucket have to be updated to reflect new contents in the file system directory.
Let us see how this is done using AWS CLI commands. The commands in this article assume that the S3 prefixes and directory names being used are pre-existing, and the AWS CLI is configured with IAM credentials providing read and write access to the S3 buckets. For detailed instructions on how to configure AWS CLI for read-write access, see the documentation on configuring and uploading a file to S3 using AWS CLI.
Use Case 1: Synchronizing (updating) local file system with the contents in the S3 bucket
The use case here is to update the contents of the local file system with data newly added to the S3 bucket.
The AWS CLI command
aws s3 sync <source> <destination>, with an S3 bucket as the source and a local directory as the destination, downloads any files (objects) in the bucket that aren’t already present in the local directory.
For example, say we want the contents of S3 bucket named
example-bucket to be downloaded to the local current directory. The command is as follows:
aws s3 sync s3://example-bucket .
An S3 object will be downloaded if the size of the S3 object differs from the size of the local file, the last modified time of the S3 object is newer than the last modified time of the local file, or the S3 object does not exist in the local directory. The last modified time of the local file is changed to the last modified time of the S3 object.
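The three conditions above can be read as a small decision function. The following is a minimal Python sketch of the rule as this article states it, not the AWS CLI's actual implementation; `FileInfo` and `needs_download` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record holding the two attributes the rule compares."""
    size: int     # size in bytes
    mtime: float  # last modified time, in seconds since the epoch


def needs_download(s3_object: FileInfo, local_file: Optional[FileInfo]) -> bool:
    """Return True if the rule described above would download the object."""
    if local_file is None:
        # The object does not exist in the local directory.
        return True
    if s3_object.size != local_file.size:
        # The sizes differ.
        return True
    # Same size: download only if the S3 copy is newer than the local copy.
    return s3_object.mtime > local_file.mtime
```

Under this reading, an object with the same size as the local file and an older or equal timestamp is skipped, which is why repeated runs of the command transfer only what has changed.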
Use Case 2: Synchronizing (updating) S3 bucket with the contents of the local file system
Most often, we need the contents of the local file system to be uploaded to S3 buckets so that changes and newly added files are propagated to S3 regularly. This is achieved with the same aws s3 sync command.
For example, say we want the contents of the current directory to be synced to an S3 bucket named
example-bucket. The command is as follows:
aws s3 sync . s3://example-bucket
A local file is uploaded if the size of the local file is different than the size of the S3 object, the last modified time of the local file is newer than the last modified time of the S3 object, or the local file does not exist under the specified bucket and prefix.
An interesting behavior of the sync command when used to upload the contents of a local directory is that it won’t upload empty directories from the local file system. The reason for this behavior is that S3 is an object storage service, so it has different semantics than a regular file system. S3 does not create or use actual physical folders or a directory structure. S3 has only buckets and objects. Key prefixes shared by objects are what make them appear to be organized into directories.
When aws s3 sync is used to upload content to S3 buckets, empty directories are ignored and nothing is uploaded for them. Once a directory contains files, those files are uploaded. As a quick workaround for situations where we need even empty directories to appear in S3, it is advised to put a dummy file within such directories.
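The dummy-file workaround can be automated before running the sync. The following is a minimal Python sketch, assuming a placeholder file name of `.keep` (a hypothetical choice; any file name works) and a hypothetical helper called `add_placeholders`.

```python
import os
from pathlib import Path
from typing import List


def add_placeholders(root: str, name: str = ".keep") -> List[Path]:
    """Create an empty placeholder file in every empty directory under root.

    Returns the list of placeholder paths that were created.
    """
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory is empty when it has no subdirectories and no files.
        if not dirnames and not filenames:
            placeholder = Path(dirpath) / name
            placeholder.touch()
            created.append(placeholder)
    return created
```

A directory that contains only empty subdirectories needs no placeholder of its own, because the placeholder inside each subdirectory already carries the parent's name in its key prefix.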
Complete Synchronization between local file system and S3 bucket
Full synchronization between a local directory and an S3 bucket is achieved by regularly running the following commands in sequence through a cron job.
aws s3 sync s3://example-bucket .
aws s3 sync . s3://example-bucket
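When the two commands are scheduled, it helps to run them from one script so that the upload pass is skipped if the download pass fails. The following is a minimal Python sketch of such a wrapper, assuming the AWS CLI is installed and configured as described earlier; `two_way_sync` and its `run` parameter are hypothetical names, and `run` exists only so the function can be exercised without calling the real CLI.

```python
import subprocess
from typing import Callable, List

Runner = Callable[..., object]


def two_way_sync(bucket_uri: str, local_dir: str,
                 run: Runner = subprocess.run) -> List[List[str]]:
    """Run the download sync, then the upload sync, stopping on failure."""
    commands = [
        ["aws", "s3", "sync", bucket_uri, local_dir],  # S3 -> local
        ["aws", "s3", "sync", local_dir, bucket_uri],  # local -> S3
    ]
    for command in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so a failed first pass aborts the second.
        run(command, check=True)
    return commands
```

Calling `two_way_sync("s3://example-bucket", ".")` issues the same two commands shown above. A cron entry could invoke a script containing this call on whatever schedule suits the data.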
A question that needs answering here is: what happens to files that exist under the specified bucket and prefix but not in the local directory, or vice versa? The answer is that they are not deleted unless a
--delete parameter is added to the command.
aws s3 sync <source> <destination> --delete
In addition to synchronizing, this command will delete the files that exist in the destination but not in the source during sync.
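The effect of the delete option can be pictured as a set difference between the two sides. The following is a minimal Python sketch of that idea, not the CLI's own code; `planned_deletions` is a hypothetical name, and the inputs are assumed to be relative paths (keys) listed from the source and the destination.

```python
from typing import Iterable, List


def planned_deletions(source_keys: Iterable[str],
                      destination_keys: Iterable[str]) -> List[str]:
    """Return destination entries that have no counterpart in the source."""
    return sorted(set(destination_keys) - set(source_keys))
```

For a source holding `a.txt` and `b.txt` and a destination holding `b.txt` and `old/c.txt`, only `old/c.txt` is selected for deletion. Because deletions cannot be undone by a later sync, it is worth previewing the command first with the CLI's --dryrun option, which lists the operations without performing them.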
There are several other ways to achieve synchronization using third-party proprietary and free software as well. It has been observed that AWS CLI based synchronization does not perform well when the volume of data is large. Other methods, such as AWS Step Functions with AWS Data Pipeline, might be a better choice when a huge volume of data has to be synchronized.