Wednesday 21 June 2017

Microsoft Azure Data Warehouse and Distribution Keys

I am in the process of widening the window of a data propagation pipeline from one week to twelve weeks. The problem was that the loading time for the process was not scaling well, at least until yesterday. In trying to optimize the process, I have learned a number of important things about Azure SQL Data Warehouse.

An OLTP database developer or DBA cannot think of Azure SQL Data Warehouse in the same way as an OLTP or monolithic system. The most important thing to understand about it is that a developer who connects to the warehouse is actually leveraging up to 60 hidden databases behind the scenes, fronted by one or more controller nodes whose capacity is governed by the DWU. DWU stands for Data Warehousing Unit: a standardized metric that bundles all hardware resources together and scales them proportionally as the DWU setting of the cloud-based warehouse is raised (it is also the basis for billing). This distributed processing and storage is a very important factor when choosing a distribution key for a table and when designing objects and mechanisms for bulk operations.
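As a small illustration of how the DWU is dialled up or down, the warehouse's service objective can be changed with a single statement; the warehouse name MyDataWarehouse and the DW400 tier below are purely illustrative, this is just a sketch of the mechanism:

-- Scale the warehouse by changing its DWU-based service objective
-- (warehouse name and tier are illustrative; run against the logical server).
ALTER DATABASE MyDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW400');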

A distribution key, one of several distribution options, is a table-creation option that tells Azure HOW to distribute a table's data ACROSS those 60 databases. So if one chooses to distribute 100 million records on a low-cardinality field containing only 2 distinct values and NULLs, those 100 million records will be spread across only 3 databases. At the very least, a high-cardinality field or a surrogate key should be chosen so that the data is distributed EVENLY across the nodes and the true power of parallel processing can be leveraged.
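As a minimal sketch of what this looks like in DDL (the table and column names, such as dbo.FactSale and CustomerKey, are hypothetical), the distribution key is declared in the WITH clause at table-creation time:

-- Hash-distribute on a high-cardinality key so rows spread evenly
-- across the 60 distributions (object names are illustrative).
CREATE TABLE dbo.FactSale
(
    SaleKey      BIGINT        NOT NULL,
    CustomerKey  BIGINT        NOT NULL,  -- distribution key
    SaleDate     DATE          NOT NULL,
    Amount       DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);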

If a developer is smart, they will make sure the distribution key they select matches those of the tables it is joined to. To explain what this means, I need to describe how the database performs distribution. The engine applies a hash function to the value in the selected column to assign each record to one of sixty "bucket" databases. This means that a PK/FK value shared by 2 tables hashes to the same bucket, so the database physically stores the matching records of those 2 tables within the same node. Why does this matter? When those 2 tables are joined, the database engine does not need to break context and contact a different node in search of records qualifying for the join; it simply carries out a like-for-like join within its own node. Multiply this by 60, and one has 60 databases performing disparate joins independently and in parallel, returning records to the Azure Warehouse controller as they are matched. Pretty neat.
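To make that concrete, here is a sketch of a related table distributed on the same (hypothetical) CustomerKey column as the earlier dbo.FactSale, followed by a join that each distribution can resolve locally:

-- Distribute the related table on the same key as dbo.FactSale
-- so matching rows land in the same distribution (names illustrative).
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  BIGINT        NOT NULL,
    CustomerName NVARCHAR(200) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Because both tables hash CustomerKey identically, each of the 60
-- distributions performs this join on its own rows, with no data movement.
SELECT c.CustomerName, SUM(f.Amount) AS TotalAmount
FROM dbo.FactSale AS f
JOIN dbo.DimCustomer AS c ON c.CustomerKey = f.CustomerKey
GROUP BY c.CustomerName;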


As for the distribution of data and skew (the term used for unevenly distributed data), I have a number of queries which detail the problem, and I will be posting them in time.
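In the meantime, one quick way to eyeball skew on a given table (again assuming the hypothetical dbo.FactSale) is the PDW space-usage DBCC command, which reports row and space counts per distribution:

-- Row and space counts per distribution; a wide gap between the smallest
-- and largest row counts is a sign of a skewed distribution key.
DBCC PDW_SHOWSPACEUSED('dbo.FactSale');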

A more descriptive post (including visuals) can be found here.
