An OLTP database developer or DBA cannot think of Azure SQL
Warehouse in the same way as an OLTP or Monolithic system. The most important
thing to understand about Azure Warehouse is that when a developer engages it,
they are leveraging up to 60 hidden database systems behind the scenes through
1 to many database controllers (controlled by the DWU). DWU stands for Data Warehousing
Unit - a standardized metric which groups all hardware resources together in a
relative manner and increase them proportionally to the increase in the DWU
configuration of the cloud based warehouse (used for billing). This distributed
processing and storage is a very important factor when choosing a distribution
key for a table and designing objects and mechanisms for bulk operations.
A distribution key, one of many distribution options, is a
table creation option which tells Azure HOW to distribute the data for a given
table ACROSS those 60 databases. So if one chooses to distribute 100 million
records on low cardinality field containing only 2 distinct values and NULLs,
then those 100 million records will be spread across only 3 databases.
Therefore, at the very least, a high cardinality field or a surrogate key
should be chosen to distribute the data EVENLY across the nodes to leverage the
true power of parallel processing.
If a developer is smart, they will make sure the
distribution key they select is the same as those of other tables. To explain
what this means, I will need to mention how the database performs distribution.
The database will use a hash function on the value in the selected column to
relegate the record to one of sixty "bucket" databases, this means
that a pk/fk value shared by 2 tables can be hashed and thus the database will physically
store the records of those 2 tables within the same node. Why does this matter?
Well, it means when those 2 tables are joined, the database engine will not
need to break context and contact a different node in search of records
qualifying the join - it merely carries out a like for like join within its own
node. Multiply this by 60, and one has 60 database performing disparate joins independently
and in parallel and returning records to the Azure Warehouse controller as
they're matched. Pretty Neat.
As for the distributions of data and skewing (a term used to
highlight the case of unevenly distributed data), I have a number of queries
which detail the problem and I will be posting them in time.
A more descriptive post (including visuals) can be found here
Thumbs up!
ReplyDelete