Automating Your Snowflake Cloning Strategy with AWS

Serving refreshed data on a daily basis with CloudWatch and Lambda

This topic again?

I’ve written a number of blogs recently on why you shouldn’t be developing with data the way we did in the old world. What I mean is that in a cloud-based ecosystem, there are plenty of ways to achieve data parity across Dev, Test, and Production environments with little effort. However, few companies are taking this type of approach in their data strategy.

I’ve already covered how to do this on GCP for BigQuery and on GCP for Snowflake, and this time I’ll show how to accomplish it on AWS for Snowflake.

How can we accomplish this?

Since we’ve established that we’re using Snowflake, we can take advantage of its Zero-Copy Clone functionality. This means we can easily clone an entire database, at zero additional storage cost as long as we don’t modify the data in the cloned copy. We’ve also established that we’re using AWS, a platform many Snowflake customers are comfortable with, and its Lambda and CloudWatch services are a great fit for this problem. These two services will allow us to:

  1. Execute a command to tell Snowflake what to clone

  2. Tell that command when to run

That solution architecture looks something like this when it’s all put together.

[Image: aws_cloudwatch_lambda_snowflake.png — solution architecture]

Let’s get down to business.

Setting up the Lambda Function

I’ll be using Node.js for this example, but that doesn’t mean that you couldn’t do this in Python, or any other supported Snowflake driver language.

The function’s code

Since we are going to use CloudWatch Events to trigger our function, we need to structure it the way Lambda expects for those events. Your index.js file should look something like this:

//import the necessary package
const snowflake = require('snowflake-sdk');

exports.handler = async (event) => {
  console.log(event);
  //create the connection (consider reading credentials from Lambda
  //environment variables instead of hardcoding them)
  const connection = snowflake.createConnection({
    account: '<your account name here>',
    username: '<your username here>',
    password: '<your password here>'
  });
  //connect, wrapped in a Promise so the async handler actually waits for it
  await new Promise((resolve, reject) => {
    connection.connect((err, conn) => {
      if (err) {
        console.error('Unable to connect: ' + err.message);
        reject(err);
      } else {
        console.log('Successfully connected as id: ' + conn.getId());
        resolve(conn);
      }
    });
  });
  //run the sql statement, again waiting for completion; otherwise the
  //handler returns (and Lambda may freeze the environment) mid-query
  await new Promise((resolve, reject) => {
    connection.execute({
      sqlText: 'create or replace database phils_dev_database clone demo_db;',
      complete: (err, stmt, rows) => {
        if (err) {
          console.error('Failed to execute statement due to the following error: ' + err.message);
          reject(err);
        } else {
          console.log('Successfully executed statement: ' + stmt.getSqlText());
          resolve(rows);
        }
      }
    });
  });
  return 'Finished';
};

What this function will do is establish a connection to Snowflake, and then run the SQL command you specify. In this case, it is:

create or replace database phils_dev_database clone demo_db;
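As an aside, the same zero-copy CLONE syntax also works at the schema and table level, so you can refresh something smaller than a whole database if that’s all Dev needs. The object names below (demo_schema, demo_table, and the dev targets) are hypothetical:

create or replace schema phils_dev_schema clone demo_db.demo_schema;
create or replace table phils_dev_table clone demo_db.demo_schema.demo_table;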

You will need to create a package.json file to accompany this index.js file, and then zip it all up. If you’re not sure how to create that, it’s pretty simple: in the root folder for your function, initialize the package and install the Snowflake driver; the dependency gets recorded in package.json and installed in the node_modules folder.
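A minimal sketch of those two steps, assuming you’re starting in a folder that contains only index.js (npm init -y just accepts all the defaults):

npm init -y                  #generate a default package.json
npm install snowflake-sdk   #install the driver and record it as a dependency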

Pro tip on zipping: if you are on macOS and zipping folders for Lambda, use the command below — Lambda can get a bit picky about folder structuring and this will cure the issue.

zip -r ../<your_folder_name_here>.zip *

Once your zip file is configured and ready, load it into Lambda, and be sure to select the Node.js 8.10 runtime.

[Image: aws_cloudwatch_config.png]
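If you’d rather script the upload than click through the console, a rough CLI equivalent is below. The function name, role ARN, and zip file name are placeholders you’d swap for your own:

aws lambda create-function \
  --function-name snowflake-clone-refresh \
  --runtime nodejs8.10 \
  --handler index.handler \
  --role arn:aws:iam::<account_id>:role/<your_lambda_role> \
  --zip-file fileb://<your_folder_name_here>.zip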

Function configuration

In the spirit of AWS’s poor documentation, I’ll skip how I set up my function… Just kidding! I’ll help you guys and gals out.

Be sure to select CloudWatch Events as the triggering event when setting up the function. We’ll configure CloudWatch soon, but it is important that Lambda is listening to the correct service. Other than that, the rest of the function setup will be the same as for any other Lambda function in your AWS environment; in general, follow your organization’s recommended settings.
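One note if you script the trigger instead of using the console (the console wires this up for you automatically): CloudWatch Events needs permission to invoke the function. A hedged sketch, with the statement ID and rule name as placeholders:

aws lambda add-permission \
  --function-name snowflake-clone-refresh \
  --statement-id allow-cloudwatch-events \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:<region>:<account_id>:rule/<your_rule_name>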

My setup worked fine with a new service role for this function and the minimum amount of memory allocated. I tested it both inside a VPC (with security groups and rules) and without one, and got performant results either way. Like I said above, the most important thing is having the correct listener; beyond that, follow what you usually do for function setups. If you get stuck, this page is a great resource that mirrors what we’re trying to accomplish here.

Be sure to save and test your function once it’s all configured.
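You can smoke-test from the CLI as well; this is a sketch using the hypothetical function name from above, with the response payload landing in out.json (if all went well, it will contain the ‘Finished’ string the handler returns):

aws lambda invoke --function-name snowflake-clone-refresh out.json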

Setting up CloudWatch

This is also relatively simple. Navigate to CloudWatch in the console and create a rule, either from the landing page or by selecting “Rules” in the left-hand menu and creating one on the next page. From there, it’s a few simple clicks to set up the event.

Set the event up to be based on a schedule, and edit the schedule to fit your needs. Then select the function we just created as the target, and leave the other settings in their default states.

Then all we need to do is give it a name and click create.
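If you’d rather create the rule from the CLI, a sketch follows; the rule name and target ARN are placeholders, and the cron expression fires once a day at 6 AM UTC, so adjust it to your needs:

aws events put-rule \
  --name daily-snowflake-clone \
  --schedule-expression "cron(0 6 * * ? *)"

aws events put-targets \
  --rule daily-snowflake-clone \
  --targets "Id"="1","Arn"="arn:aws:lambda:<region>:<account_id>:function:snowflake-clone-refresh"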

Once your function runs for the first time, be sure to head over to Snowflake to see your cloned database!

Wrap Up

Working with fresh data is a no-brainer when it’s so easy to deliver by setting up some Lambda functions and CloudWatch events. Implementing this kind of solution has so much ROI, not only in the form of code quality, but also in terms of time to market and team morale.

What else can we help you with? Reach out for a consultation on how we can help you get the most out of your Snowflake, AWS, cloud and data implementations.