Snowflake: Core Concepts
This post presents key concepts of the Snowflake cloud data warehouse. Snowflake is an analytic database that runs on the most popular cloud providers (AWS, Azure, GCP) and handles all aspects of authentication, configuration, resource management, data protection and availability optimisation.
What makes Snowflake So Special?
Shared disk architectures use multiple nodes to access data shared on a single storage system. In contrast, shared nothing architectures store a portion of the data in each node and each cluster in a data warehouse. Snowflake combines both architectures.
Snowflake's data architecture consists of three layers:
1. Storage
2. Compute
3. Services
Each layer scales independently and includes built-in redundancy.
How does it work?
Snowflakes stores data in databases. A database is a logical grouping of objects consisting primarily of tables and views organised into schemas. Snowflake supports structured relational data using standard SQL data types. Additionally, Snowflake's variant data type stores semi-structured non-relational data such as JSON, parquet and so on. Regardless of the data type, ANSI standard SQL is used to perform data-related tasks.
Snowflake uses secure cloud storage to maintain data. As data is loaded into tables, Snowflake converts it into a compressed format and encrypts it using AES 256 encryption.
The compute layer is where queries are executed using resources provisioned from a cloud provider. Unlike traditional architectures, Snowflake creates independent compute clusters that access the same data storage layer without performance degradation.
To create a virtual warehouse, give it a name and specify a size. Snowflake handles the provisioning and configuration of the underlying compute resources. A virtual warehouse can also be scaled up or down without downtime. When a virtual warehouse is resized, subsequent queries take advantage of the additional resources. Snowflake cloud architecture also enables concurrency without resource contention. For example, separate virtual warehouses can be used to handle loading and querying simultaneously. Because virtual warehouses access the same data storage layer, updates or inserts become immediately available to other warehouses.
The services layer manages the entire system. It authenticates users, manages sessions, secures data, and performs query compilation and optimisation. The services layer also coordinates data storage updates and access, ensuring all virtual warehouses see the new version instantaneously once a transaction is completed. A vital component of the services layer is the metadata store, which powers several Snowflake features, including zero-copy cloning, time travel and data sharing. The services layer is maintained by Snowflake using resources distributed across multiple zones to ensure high availability.
Connect to Snowflake through provided clients such as the JDBC or ODBC drivers. Snowflake also provides a command-line client and web interface. A growing ecosystem of external tools include native connectivity with Snowflake.
To illustrate the different layers working together, let's walk through the lifecycle of a query. After connecting to Snowflake through one of the supported clients and initiating a session, specify the virtual warehouse and submit a query. The services layer first validates authorisation. It then creates an optimised query plan. Next, the services layer sends the query execution instructions to the virtual warehouse, allocating resources, requesting data from the storage layer, and executes the query. The results are then returned.
What do you need to manage Snowflake?
Snowflake eliminates most of the parameters required to tune performance by traditional data warehouses. You only need to create the databases, tables and virtual warehouses, load data and execute queries. Snowflake handles the rest.
How much does Snowflake cost?
Pricing is based on usage. Pay only for the storage and compute resources utilised. Storage costs are based on the amount of compressed data stored in database tables, plus the data retained to support Snowflake's data recovery features. Compute prices are based on warehouse size and how long the warehouse runs.
To conclude, Snowflake is faster, easier to use and more flexible than traditional data warehouses.