How to copy data from Azure to S3 programmatically
There are several reasons why you might want to copy files from Azure to AWS S3 (or vice versa). Maybe you want to store a backup on S3, or maybe you subscribe to both clouds and want to keep the data synchronized between the two. I wanted to share this work because I had to piece it together from various sources.
Here’s a fully automated Lambda solution for copying the data on demand or on a schedule.
This blog post covers one approach to migrate data from Microsoft Azure Blob Storage into Amazon S3 using the AWS Lambda service.
Why use the AWS Lambda approach?
How does Lambda ease the work for us? Here are some of the advantages of using it to copy data:
- It is simpler, easier to use, and less expensive than EC2 or Elastic Beanstalk.
- It supports multiple programming languages.
- It’s a cost-effective option as it’s a serverless service.
- You can copy data from Azure to S3 with just a few lines of code.
Step by Step
Here I will walk you through the steps needed to set up the Lambda function in your own AWS account.
Prerequisites:
It is assumed that you have valid accounts with both AWS and Azure. If you are following along with this blog post, I assume that you are familiar with the following and that these requirements are already met.
- Azure Blob Storage: Create a blob container and, if you want to keep the blob private, authorize access to it. Azure Storage offers various options for authorizing access to resources.
- AWS S3 bucket: Create an S3 bucket and block all public access to restrict access to the bucket. Then create AWS Identity and Access Management (IAM) users in your AWS account and grant those users incremental permissions on your Amazon S3 bucket and the folders in it. (A minimal boto3 sketch of the bucket setup follows this list.)
- AWS Lambda service: We will get hands-on and create a Lambda function that copies the data from Azure to S3 using Python.
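If you prefer to script the bucket setup rather than use the console, here is a minimal boto3 sketch; the bucket name and region are placeholders, not values from this walkthrough:
import boto3

# Placeholder bucket name and region -- replace with your own
s3 = boto3.client('s3', region_name='us-east-1')
s3.create_bucket(Bucket='my-azure-migration-bucket')

# Block all public access so only the IAM principals you authorize can reach the bucket
s3.put_public_access_block(
    Bucket='my-azure-migration-bucket',
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)
(In regions other than us-east-1, create_bucket also needs a CreateBucketConfiguration with a LocationConstraint.)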
Let’s get hands-on:
To implement this, we will do the following:
1. Create an S3 bucket (or use an existing bucket).
2. Create an IAM policy and an execution role (you can either create it in the IAM section of the console or choose the option ‘Create a default execution role with basic settings’ while creating the Lambda function).
3. Add the following policies to the Lambda execution role (a minimal policy sketch follows this list):
- S3 bucket access policy
- KMS policy to encrypt/decrypt the environment variables used in the code
- AWSLambdaVPCAccessExecutionRole policy
4. Create a Lambda function and write code to copy the data from Azure to S3 using the Python SDKs for Azure and AWS.
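As a rough sketch of what step 3 amounts to, the inline policy below grants the role s3:PutObject on the destination bucket and kms:Decrypt on the key used for the environment variables. The role name, bucket name, and key ARN are placeholders you would replace with your own:
import json
import boto3

iam = boto3.client('iam')

# Placeholder ARNs -- replace with your bucket and the KMS key that encrypts the env variables
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow writing the copied blobs into the destination bucket
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-azure-migration-bucket/*"
        },
        {
            # Allow decrypting the KMS-encrypted environment variables
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/<key-id>"
        }
    ]
}

iam.put_role_policy(
    RoleName='azure-to-s3-lambda-role',       # placeholder execution role name
    PolicyName='azure-to-s3-copy-policy',
    PolicyDocument=json.dumps(policy_document)
)

# The managed VPC access policy is attached separately
iam.attach_role_policy(
    RoleName='azure-to-s3-lambda-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole'
)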
Here’s a snippet to download the Azure blob (authenticated using Azure Active Directory):
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Get the access token first to access the blob
# Replace the placeholders with your own account and client secrets!
oauth_url = "https://<account>.blob.core.windows.net"
token_credential = ClientSecretCredential(
    "<active_directory_tenant_id>",
    "<active_directory_application_id>",
    "<active_directory_application_secret>"
)

# Download the blob using the access token
# Replace <container_name> and <blob_name> with your container and blob names!
blob_service_client = BlobServiceClient(account_url=oauth_url, credential=token_credential)
blob_client = blob_service_client.get_blob_client(container="<container_name>", blob="<blob_name>")
download_stream = blob_client.download_blob()
data = download_stream.readall()
Get the Azure client credentials (active_directory_tenant_id, active_directory_application_id, and active_directory_application_secret) from the Azure account and store them in environment variables (using KMS encryption) in the Lambda configuration.
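The Lambda console’s encryption helpers can encrypt these environment variables for you, but if you want to produce the encrypted values yourself, a sketch with the KMS client could look like this (the key ID and function name are placeholders, and the encryption context must match what the decryption code uses later):
import boto3
from base64 import b64encode

kms = boto3.client('kms')

# Placeholder key ID, secret, and function name -- replace with your own
encrypted = kms.encrypt(
    KeyId='<kms_key_id>',
    Plaintext='<active_directory_application_secret>'.encode('utf-8'),
    # Must match the EncryptionContext used when decrypting inside the function
    EncryptionContext={'LambdaFunctionName': 'azure-to-s3-copy'}
)

# Base64-encode the ciphertext and store it as the environment variable value
env_value = b64encode(encrypted['CiphertextBlob']).decode('utf-8')
print(env_value)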
In the Python snippet that downloads the blob, the client library handles the authorization of the request for you. Azure Storage client libraries for other languages also handle this. However, if you call an Azure Storage operation with an OAuth token using the REST API directly, you’ll need to construct the Authorization header yourself using the OAuth token.
# Replace {container,file.txt,mystorageaccount,BearerToken} with your container, blob, account and access token!
GET /container/file.txt HTTP/1.1
Host: mystorageaccount.blob.core.windows.net
x-ms-version: 2017-11-09
Authorization: Bearer eyJ0eXAiOnJKV1...Xd6j
To get the access token, use the following REST endpoint:
POST /{tenant}/oauth2/v2.0/token HTTP/1.1  //Line breaks for clarity
Host: login.microsoftonline.com
Content-Type: application/x-www-form-urlencoded

client_id=<client_id>
&scope=https%3A%2F%2Fstorage.azure.com%2F.default
&client_secret=<client_secret>
&grant_type=client_credentials
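If you want to do the same thing from Python without the Azure SDK, a minimal sketch using the requests library might look like this (the tenant ID, client credentials, account, container, and blob names are placeholders):
import requests

# Placeholders -- replace with your tenant ID, client ID and client secret
tenant_id = '<active_directory_tenant_id>'
token_url = f'https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token'
payload = {
    'client_id': '<active_directory_application_id>',
    'client_secret': '<active_directory_application_secret>',
    'scope': 'https://storage.azure.com/.default',   # scope for Azure Storage
    'grant_type': 'client_credentials'
}
access_token = requests.post(token_url, data=payload).json()['access_token']

# Download the blob over REST using the bearer token
blob_url = 'https://mystorageaccount.blob.core.windows.net/container/file.txt'
headers = {
    'Authorization': f'Bearer {access_token}',
    'x-ms-version': '2017-11-09'
}
data = requests.get(blob_url, headers=headers).content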
The data object now holds the blob contents, which you can upload directly to S3 using the following S3 method:
import boto3

# Replace <bucket_name> and <file_name> with your bucket and object key!
s3 = boto3.client('s3')
response = s3.put_object(Bucket='<bucket_name>', Body=data, Key='<file_name>')
boto3 is the Python SDK for AWS; the S3 client’s put_object method uploads the downloaded blob to S3.
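put_object buffers the entire blob in memory, which is fine for small files. For larger blobs, one option (a sketch, reusing the blob_client from the download snippet above) is to stream the download into a buffer and let upload_fileobj perform a managed multipart upload:
import io
import boto3

s3 = boto3.client('s3')

# blob_client is the client created in the download snippet above
buffer = io.BytesIO()
blob_client.download_blob().readinto(buffer)
buffer.seek(0)

# upload_fileobj splits large objects into a multipart upload under the hood
s3.upload_fileobj(buffer, '<bucket_name>', '<file_name>')
Either way the blob still passes through Lambda memory, so very large files may need a higher memory setting or a different tool.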
For this code to run in Lambda, you will have to install some dependencies and add them to a Lambda layer.
This is what the requirements.txt for the Lambda function looks like:
adal==1.2.7
azure-core==1.13.0
azure-identity==1.5.0
azure-storage-blob==12.8.1
cryptography==3.4.7
jwt==1.2.0
isodate==0.6.0
msal==1.11.0
msal-extensions==0.3.0
msrest==0.6.21
oauthlib==3.1.0
portalocker==1.7.1
pycparser==2.20
requests==2.24.0
requests-oauthlib==1.3.0
You can copy this file and install these packages on your machine, or download them from here (if your machine’s OS is not compatible with the OS used by AWS Lambda).
I have Python 3.7 installed on my machine and used the same Python version in Lambda. So I downloaded these packages using pip on my Mac, zipped the dependencies, and uploaded them to a Lambda layer from where the Lambda code picks them up. For more details, you can refer to this article.
If you face any import/package issues, try changing the dependency versions to match the Python version of your Lambda.
Final Lambda code using Python 3.7:
import os
import boto3
from base64 import b64decode
from os import environ
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Read Azure service creds and other details from the Lambda environment
oauth_url = environ.get('oauth_url')
encrypted_active_directory_tenant_id = environ.get('active_directory_tenant_id')
encrypted_active_directory_application_id = environ.get('active_directory_application_id')
encrypted_active_directory_application_secret = environ.get('active_directory_application_secret')

# Decrypt secrets with KMS
active_directory_tenant_id = boto3.client('kms').decrypt(
    CiphertextBlob=b64decode(encrypted_active_directory_tenant_id),
    EncryptionContext={'LambdaFunctionName': os.environ['AWS_LAMBDA_FUNCTION_NAME']}
)['Plaintext'].decode('utf-8')
active_directory_application_id = boto3.client('kms').decrypt(
    CiphertextBlob=b64decode(encrypted_active_directory_application_id),
    EncryptionContext={'LambdaFunctionName': os.environ['AWS_LAMBDA_FUNCTION_NAME']}
)['Plaintext'].decode('utf-8')
active_directory_application_secret = boto3.client('kms').decrypt(
    CiphertextBlob=b64decode(encrypted_active_directory_application_secret),
    EncryptionContext={'LambdaFunctionName': os.environ['AWS_LAMBDA_FUNCTION_NAME']}
)['Plaintext'].decode('utf-8')

# Function to copy the blob without creating any /tmp/ file
def copyBlobInMemory():
    token_credential = ClientSecretCredential(
        active_directory_tenant_id,
        active_directory_application_id,
        active_directory_application_secret
    )

    # Download the blob; replace <container_name> and <blob_name> with your values
    blob_service_client = BlobServiceClient(account_url=oauth_url, credential=token_credential)
    blob_client = blob_service_client.get_blob_client(container='<container_name>', blob='<blob_name>')
    download_stream = blob_client.download_blob()
    data = download_stream.readall()

    # Upload to S3; replace <bucket_name> and the key with your own
    s3 = boto3.client('s3')
    response = s3.put_object(Bucket='<bucket_name>', Body=data, Key='key_name')
    print("Blob copied to S3 in-memory")

def lambda_handler(event, context):
    copyBlobInMemory()
Click Test in the Lambda console and it will copy the blob. You can also set up a schedule to trigger the Lambda daily.
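As a sketch of that daily schedule using EventBridge (CloudWatch Events) from boto3; the rule name and function name are placeholders, and you can just as easily create the same rule in the console:
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder rule and function names -- replace with your own
rule = events.put_rule(
    Name='daily-azure-to-s3-copy',
    ScheduleExpression='rate(1 day)',
    State='ENABLED'
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='azure-to-s3-copy',
    StatementId='daily-azure-to-s3-copy-event',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)

# Point the rule at the Lambda function
function_arn = lambda_client.get_function(FunctionName='azure-to-s3-copy')['Configuration']['FunctionArn']
events.put_targets(
    Rule='daily-azure-to-s3-copy',
    Targets=[{'Id': 'azure-to-s3-copy-target', 'Arn': function_arn}]
)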
If you have a lot of blobs/files to copy and the Lambda function runs out of time (Lambda functions can run for at most 15 minutes), you may consider switching from Lambda to EC2. There are no timeout restrictions on EC2, unlike Lambda.
You may skip the approaches below if you are only interested in the Lambda solution!
Let’s also talk about the other options that are available and why I preferred Lambda over them.
Rclone: Rclone is a command-line program to manage files on cloud storage. It is very similar to rsync, but instead of working with local directories it works with different cloud storage providers, and it supports a wide range of them. To copy data from one storage provider to another, you install Rclone on your machine, configure it, and set up the remotes. To copy data from Azure to S3 you also need the Azure CLI and AWS CLI installed first. Rclone is easy to use and is a recommended option when you don’t want to write any code and prefer to copy the data with Rclone commands. For a programmatic/automated solution, however, you need an EC2 machine where you install Rclone and schedule a cron job that runs Rclone commands to copy your data. That is not a great option when you don’t have an existing EC2 instance and would be setting up a new one just for data migration; in that case you can go with the Lambda approach, as Lambda is less expensive than even the smallest EC2 instances.
AWS Elastic Beanstalk: Another approach to migrate data from Azure to S3 is to use AWS Elastic Beanstalk. The detailed approach of migrating data using the Node.js package azure-blob-to-s3 is explained here. Lambda is simpler and less expensive, while Elastic Beanstalk lets you run full applications and gives you control over their environment. Since copying data from Azure to S3 is not that complicated, I preferred Lambda over Beanstalk.
AzCopy: AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account. AzCopy can copy data from S3 to Azure directly, but it cannot copy data directly in the other direction. You can work around this by mounting the S3 bucket on a VM (for example with s3fs) and then using AzCopy. However, the s3fs method is generally not recommended for continuous transfers from Azure to S3, since there have been frequent reports of s3fs choking. It is reasonable if the transfer from Azure to AWS only has to happen once (i.e. a one-time migration), but for daily sync/data transfer it is not recommended.