Snowflake: Data Loading

This piece is an introduction to Snowflake data loading. Snowflake supports four options for loading data.

1.     Web Interface: load limited amounts of data via the data loading wizard

2.     Bulk loading: SQL commands executed through the SnowSQL command-line interface

3.     Automated: continuous bulk loading using Snowpipe

4.     Third-party: ETL tools that bulk load data from external sources

Here, the focus is bulk loading of structured data from delimited plain-text files such as CSVs, although Snowflake also supports loading semi-structured data formats such as JSON, Avro, and Parquet.

Bulk loading is performed in two phases. First, upload the data files to a location where Snowflake can access them; this is referred to as staging. Then load the data from the staged files into a designated table.

Snowflake holds files in 'temporary' locations called stages. Each table and user has its own stage, and Snowflake also supports creating named stages. Internal stages enable secure storage of data files without requiring external resources. However, if data files are already staged in a supported cloud storage location, skip the staging phase and load directly from the external location: just provide the URL for the location, as well as access credentials if the location is protected. Named stages can also be created to point at external locations.

Phase two requires a running virtual warehouse. The warehouse extracts the data from each file and inserts it as rows in the table. Warehouse size affects load times: when loading many files or large files, a bigger warehouse processes the load more quickly (see the post on Virtual Warehouses for more information on sizing requirements).

Now, let's demonstrate data loading using SQL and SnowSQL to load CSV files from a local machine into a table. A table is created using the following SQL.

CREATE OR REPLACE TABLE [insert_tablename_here] (
    [insert_fieldname#1_here] [insert_datatype_here],
    [insert_fieldname#2_here] [insert_datatype_here],
    ...
);
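To make the template concrete, a minimal sketch (the table and column names below are invented for illustration):

CREATE OR REPLACE TABLE contacts (
    id INTEGER,
    first_name STRING,
    last_name STRING,
    email STRING
);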

The field names in the table should correspond to the field names in the files containing the input data. We start by creating a named internal stage and then execute a PUT command to upload the CSV files to it.
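The PUT command that follows assumes a named internal stage called csvfiles already exists; a minimal sketch of creating one:

CREATE OR REPLACE STAGE csvfiles;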

PUT file:///tmp/load/[insert_filename_here].csv @csvfiles

Inspecting the code:

a)     file:// specifies the full directory path and the names of the files to stage

b)     The '@' symbol specifies the stage where the files are uploaded

By default, the PUT command compresses data files using gzip. To check that the files are staged, run the LIST command. Now we're ready to load the data from the staged files into the table.
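A quick check, assuming the stage name used above:

LIST @csvfiles;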

First, specify a virtual warehouse to use. Then use the COPY command to load the files.
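Specifying the warehouse can be sketched as follows (the warehouse name is hypothetical):

USE WAREHOUSE mywarehouse;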

COPY INTO [insert_tablename_here] FROM @csvfiles

Inspecting the code:

a)     INTO specifies the table into which the data will be loaded

b)     FROM identifies the location of the staged files

The statement above is the minimum required for a warehouse to process a data load. Snowflake SQL also offers pattern matching to select multiple files, as well as options for error handling whilst data loading.
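A fuller sketch combining these options (the table name, file format settings, and filename pattern are illustrative; FILE_FORMAT, PATTERN, and ON_ERROR are standard COPY INTO options):

COPY INTO contacts
  FROM @csvfiles
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  PATTERN = '.*contacts.*[.]csv[.]gz'
  ON_ERROR = 'SKIP_FILE';

Here ON_ERROR = 'SKIP_FILE' skips any file containing errors rather than aborting the whole load.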

As alluded to earlier, Snowpipe can be used to bulk load data from file stages and external locations. Snowpipe uses the COPY command but with additional features that let you automate the process. Snowpipe also eliminates the need for a user-managed virtual warehouse: instead, it uses Snowflake-managed compute resources to continuously load data as files are staged.
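A minimal sketch of a pipe definition (the pipe, table, and stage names are hypothetical; AUTO_INGEST = TRUE applies when loading from an external stage with cloud event notifications configured):

CREATE PIPE mypipe AUTO_INGEST = TRUE AS
  COPY INTO contacts
  FROM @my_external_stage
  FILE_FORMAT = (TYPE = 'CSV');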

Third-party tools for bulk loading data are often referred to as extract, transform, load (ETL) tools. An expanding ecosystem of applications and drivers for loading data from a range of external sources can be found under the Partner Connect tab (Fivetran makes light work of connecting to multiple data sources with a quick and simple setup).

The fourth option for loading data is the data loading wizard in the Snowflake Web UI. Simply select the table and click the load button. The wizard simplifies loading by combining the two phases into a single operation and automatically deletes the staged files after loading. The wizard, however, is only intended for loading a small number of files with limited amounts of data. For larger amounts of data, it is best to use one of the other options.

This concludes the introduction to data loading.
