S3 Bash Script: Large File Downloads (Fast & Multi-Part In Parallel)

If you have a large file on S3 that you need to download onto your local machine, it can be a slow process. Since I had to do this many times, I decided to write this script.

Basically, this script downloads the file in “chunks” or parts, and then assembles the parts back into the whole file. By modifying the script slightly, you can control things like the number of parts.

Security First: Create A User With Only Permissions To Download A File (And nothing else)

There are many horrifying ways in which you could end up with a huge AWS bill. So, we want to do this safely. The most common way to get a huge AWS bill is for somebody to get hold of your AWS credentials, and for those credentials to have far more power on AWS than they need.

So, in this guide, we are going to start with creating an AWS user and then allowing that user to do ONLY one thing: Download a particular file on S3.
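
If you prefer the command line to the AWS console, here is a rough sketch of what creating such a user could look like. The bucket name, key, user name and policy name below are placeholders; swap in your own. (The script’s head-object call is also covered by s3:GetObject.)

# A minimal "download one file only" policy (placeholder bucket & key)
cat > download-only-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/path/to/my-file"
        }
    ]
}
EOF

# Create the user, attach the policy to it, and generate an access key for it
aws iam create-user --user-name s3-download-only
aws iam put-user-policy --user-name s3-download-only \
    --policy-name download-one-file \
    --policy-document file://download-only-policy.json
aws iam create-access-key --user-name s3-download-only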

Below is a video I made that goes over the process.

You Will Need “AWS CLI”

The bash script uses the AWS CLI (the AWS command line interface). If you don’t have it and don’t know what it is, here is how AWS describes it:

The AWS Command Line Interface (AWS CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.

– From AWS Command Line Interface Page

How To Install The AWS CLI?

It should be fairly simple. You just go here and click on the correct link for your OS.

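Once it is installed, a quick sanity check from the terminal confirms the CLI is on your PATH (the exact version string will differ on your machine):

# Confirm the AWS CLI is installed and visible on your PATH
aws --version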

The Bash Script

Okay. I am going to assume that you have the AWS CLI installed. I also assume that you have an Access Key and Secret Key. (Ideally of a user that ONLY has permission to download this file and nothing else.)

Let’s go on. Let’s look at the bash script. It’s heavily commented so that you know what’s going on on each line.

#!/bin/bash

# Hello from LiveFireDev.com! :-)

# INPUT REQUIRED: Enter The Access Key & Secret Key That Has The Permission To
# Download The File
export AWS_ACCESS_KEY_ID=<REPLACE_ME!!>
export AWS_SECRET_ACCESS_KEY=<REPLACE_ME!!>

# INPUT REQUIRED: Give the name of the bucket
BUCKET=<REPLACE_ME!!>

# INPUT REQUIRED: Give the "key" or name of the file
KEY=<REPLACE_ME!!>

# INPUT REQUIRED: Choose the number of parts. Or just leave it as 10.
NUMBER_OF_PARTS=10

# We are going to use the AWS cli to get the size of the file in bytes.
# We will use some GREP to clean up the result and just extract the value in bytes
# This grep command will work fine on Linux & on the Windows Subsystem for Linux.
# But the default Mac grep does not support the -P flag used here. See the note
# for Mac users at the bottom of this article.
SIZE_OF_FILE_IN_BYTES=$(aws s3api head-object --bucket ${BUCKET} --key ${KEY} | grep ContentLength | grep -Po "\d+")

# Get the size of each part
SIZE_OF_EACH_PART=$((SIZE_OF_FILE_IN_BYTES/NUMBER_OF_PARTS))

echo "Size of file in Bytes: $SIZE_OF_FILE_IN_BYTES"

BYTE_RANGE_START=0
BYTE_RANGE_END=$SIZE_OF_EACH_PART
PART_NUMBER=0

# While loop to make a request to AWS using the CLI
# When trying to download an object with the AWS CLI, we can specify and ask for only a small
# byte range. We are going to do that here and then make the next request for the next byte range.
# All the downloads will run in parallel.
while [ $PART_NUMBER -le $(($NUMBER_OF_PARTS-1)) ]
do
    echo "Downloading part Number: ${PART_NUMBER}. Byte Range Range: ${BYTE_RANGE_START} to ${BYTE_RANGE_END}"

    aws s3api get-object --bucket ${BUCKET} --key ${KEY} --range bytes=${BYTE_RANGE_START}-${BYTE_RANGE_END} part-${PART_NUMBER}.part &

    BYTE_RANGE_START=$((BYTE_RANGE_START+SIZE_OF_EACH_PART+1))
    BYTE_RANGE_END=$((BYTE_RANGE_START+SIZE_OF_EACH_PART))
    PART_NUMBER=$((PART_NUMBER+1))
done

echo "Waiting for all parts to complete downloading..."

# Just waiting for all the downloads to complete
wait

echo "All parts are downloaded. Combining..."

# Using "cat" to combine all the files into the original file
cat $(ls | grep part-) > $KEY

# Next we remove all the "part files"
rm $(ls | grep part-)

# Done! :-)
echo "File is ready: $KEY"

How To Use The Bash Script?

You probably already know this. But, just for the sake of completeness I am going to give you the broad steps.

  1. Create a file on your machine and copy-paste the above script into it.
  2. Replace all the <REPLACE_ME!!> values (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, BUCKET & KEY).
  3. Make the file executable by running “chmod +x <your file name>”.
  4. Run the file from the terminal with: “./<your file name>” (see the example below).
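
For example, assuming you saved the script as download-large-file.sh (the file name is entirely up to you):

# Make the script executable, then run it
chmod +x download-large-file.sh
./download-large-file.sh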

An Explanation Of What’s Going On

The comments given in the script should give a fairly clear idea about what is going on. But, let me lay it out so that you feel confident about playing with it and modifying it and making it your own.

NOTE: Whenever I have introduced a concept below, I have linked to a resource that can help you dive deeper into it. So click on those links for more information.

Line 7 to 17

# INPUT REQUIRED: Enter The Access Key & Secret Key That Has The Permission To
# Download The File
export AWS_ACCESS_KEY_ID=<REPLACE_ME!!>
export AWS_SECRET_ACCESS_KEY=<REPLACE_ME!!>

# INPUT REQUIRED: Give the name of the bucket
BUCKET=<REPLACE_ME!!>

# INPUT REQUIRED: Give the "key" or name of the file
KEY=<REPLACE_ME!!>

# INPUT REQUIRED: Choose the number of parts. Or just leave it as 10.
NUMBER_OF_PARTS=10

Just setting up the inputs: access keys, bucket, file name and the number of parts.

Line 24

SIZE_OF_FILE_IN_BYTES=$(aws s3api head-object --bucket ${BUCKET} --key ${KEY} | grep ContentLength | grep -Po "\d+")

Using the head-object subcommand of the AWS CLI to get details about the object, and then using grep to extract just the number of bytes.

The response from the head-object command is something like:

{
    "AcceptRanges": "bytes",
    "ContentType": "text/html",
    "LastModified": "Thu, 16 Apr 2015 18:19:14 GMT",
    "ContentLength": 77,
    "VersionId": "null",
    "ETag": "\"30a6ec7e1a9ad79c203d05a589c8b400\"",
    "Metadata": {}
}

We use grep to get the ContentLength line, and then extract just the number of bytes using grep again. (Note: this second grep command does not work on the default Mac install of grep. See the note for Mac users below.)

Finally we store the result of all that into a variable.
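
As a side note, the AWS CLI can do this extraction by itself via its built-in --query option (a JMESPath expression), which avoids grep entirely and therefore works on Mac too. A minimal alternative for line 24:

# Let the AWS CLI extract ContentLength itself -- no grep required
SIZE_OF_FILE_IN_BYTES=$(aws s3api head-object --bucket ${BUCKET} --key ${KEY} --query ContentLength --output text)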

Line 27

SIZE_OF_EACH_PART=$((SIZE_OF_FILE_IN_BYTES/NUMBER_OF_PARTS))

Based on the number of parts, we are figuring out how many bytes each part or clump of the file will have. We are going to use this to request different byte ranges of the file from AWS. Note that bash arithmetic is integer division, so the result is truncated; that is fine, because the final part’s range simply runs a little past the end of the file and S3 returns only the bytes that actually exist.

Line 39 to 48

while [ $PART_NUMBER -le $(($NUMBER_OF_PARTS-1)) ]
do
    echo "Downloading part Number: ${PART_NUMBER}. Byte Range Range: ${BYTE_RANGE_START} to ${BYTE_RANGE_END}"

    aws s3api get-object --bucket ${BUCKET} --key ${KEY} --range bytes=${BYTE_RANGE_START}-${BYTE_RANGE_END} part-${PART_NUMBER}.part &

    BYTE_RANGE_START=$((BYTE_RANGE_START+SIZE_OF_EACH_PART+1))
    BYTE_RANGE_END=$((BYTE_RANGE_START+SIZE_OF_EACH_PART))
    PART_NUMBER=$((PART_NUMBER+1))
done

It’s just a usual while loop. What is interesting here is that we use the --range bytes argument of the AWS CLI to download only the slice of the file we need. Also, note that we end the command with a “&”. This sends each download into the background, so all of them run in parallel.
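
To make the byte math concrete, here is the (hypothetical) echo output for a 1,048,576-byte file split into 10 parts, i.e. SIZE_OF_EACH_PART = 104857:

Downloading part Number: 0. Byte Range: 0 to 104857
Downloading part Number: 1. Byte Range: 104858 to 209715
Downloading part Number: 2. Byte Range: 209716 to 314573
...
Downloading part Number: 9. Byte Range: 943722 to 1048579

Notice that the last range runs a few bytes past the end of the file; S3 just returns the bytes that exist, so nothing breaks.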

Line 53 to 64

wait

echo "All parts are downloaded. Combining..."

# Using "cat" to combine all the files into the original file
cat $(ls | grep part-) > $KEY

# Next we remove all the "part files"
rm $(ls | grep part-)

# Done! :-)
echo "File is ready: $KEY"

First we wait for all the downloads to complete. Then we use the “cat” command to combine the parts, sorted numerically so the order stays correct even with more than 10 parts. Then the rm command cleans up the part files.
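
If you want a quick sanity check, you could append something like this to the end of the script. It simply compares the local byte count against the size S3 reported earlier:

# Optional: confirm the reassembled file is the size S3 reported
LOCAL_SIZE=$(wc -c < "$KEY")
echo "Expected ${SIZE_OF_FILE_IN_BYTES} bytes, got ${LOCAL_SIZE} bytes"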

Delete Your User When You Are Done

Okay. It all worked out. You have your file. Now, it’s time to delete the user you created on the AWS console. (You might choose not to do this, depending on your situation.)
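
If you prefer the terminal to the console, the clean-up can also be done with the AWS CLI. A sketch, assuming the user name and policy name from the earlier example (a user’s access keys and inline policies must be removed before the user itself can be deleted):

# Remove the inline policy, then the access key, then the user itself
aws iam delete-user-policy --user-name s3-download-only --policy-name download-one-file
aws iam delete-access-key --user-name s3-download-only --access-key-id <REPLACE_ME!!>
aws iam delete-user --user-name s3-download-only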


Note For Mac Users

The script should work fine on Mac except for one thing: the version of grep that ships with macOS does not support the -P (Perl-compatible regex) flag used above. So, you need to install a version that does. This article shows you how.
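
One common route, assuming you have Homebrew installed, is to install GNU grep. Homebrew installs it under the name ggrep so it does not shadow the system grep:

# Install GNU grep via Homebrew; it becomes available as "ggrep"
brew install grep
# Then swap the second grep on line 24 for ggrep:
# ... | grep ContentLength | ggrep -Po "\d+"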

If you do not want to do that, you can either hard-code the size of the file in bytes on line 24, or use the --query alternative shown earlier, so that the script no longer depends on grep at all.

Note For Windows Users

Since this article is all about a bash script, in order to use it on Windows you will need something that can run a bash script. I have heard good things about the Windows Subsystem for Linux (WSL).

However, I have not tried it myself. I have tested the above script on Ubuntu, so if you do install WSL, the script should run well.