Monday 10 July 2017

ShuffleMoveOperation

Right, so I've had to handle a number of these over the last few weeks while setting up a data pipeline in an Azure SQL Data Warehouse. To explain the ShuffleMoveOperation I'll first give a bit of context on the Azure warehouse architecture, and I'll leave the following post here, which briefly covers the Azure warehouse.

An Azure Warehouse is a double-edged sword. It has multiple databases under the hood, which lets it make noticeably quick work of any given workload involving a few million records
or more. It does this by spreading every table across 60 distributions (slices of storage that the compute nodes divide among themselves). The downside is that if the developer or warehouse admin builds his or her solution in a complex manner, or in a way where the objects of interest are not harmonious in terms of distribution, he or she is going to have a difficult time.
In other words, to succeed in leveraging the warehouse's dog sled of parallelism, one needs to design one's tables in a way that is logically aligned with those distributions.
How do I do this? With the distribution attribute of a table. Distribution tells the database how to disseminate a table's data across those 60 distributions.

There are 3 distribution types I am aware of.

ROUND_ROBIN - This merely spreads records evenly across all 60 distributions in a 'one for you, and one for you, and one for you' manner - no business domain logic is applied, and the records end up evenly and randomly distributed. This is effective if you need to get data loaded fast.
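A minimal sketch of what that looks like in DDL (table and column names here are hypothetical):

```sql
-- Hypothetical staging table: ROUND_ROBIN spreads rows evenly across
-- all 60 distributions with no regard for their content, which makes
-- it a reasonable default for fast initial loads.
CREATE TABLE dbo.StageSales
(
    SaleId     BIGINT         NOT NULL,
    SaleAmount DECIMAL(18, 2)     NULL,
    SaleDate   DATE               NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP  -- heaps tend to load fastest for staging scenarios
);
```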

REPLICATE - This option tells the database to keep a full copy of the table on every compute node. I've used this for dimension tables. So when I add a record to DimCountry, the record is appended to every copy of the table across the nodes. More on this in a moment.
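The DimCountry example sketched out (columns are hypothetical; the usual guidance is to reserve REPLICATE for small dimension tables):

```sql
-- Hypothetical dimension table: REPLICATE keeps a full copy on every
-- compute node, so joins against it never need to move data between
-- nodes.
CREATE TABLE dbo.DimCountry
(
    CountryKey  INT           NOT NULL,
    CountryCode CHAR(2)       NOT NULL,
    CountryName NVARCHAR(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (CountryKey)
);
```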

HASH(COLUMN_NAME) - This distribution option is one of two keys to ridding yourself of the ShuffleMove (more on this in a moment). It is very simple: a high-cardinality column in the HASH function means more even data distribution across the 60 distributions; low cardinality means uneven distribution, with lumps of data on a few of them. Azure distributes the data by passing each row's column value through a deterministic hash function, and the result determines which distribution the record is stored in. So what happens if one uses the gender column for the HASH function? The data will likely land in just three distributions: one for NULL, one for Male and one for Female - which means that when retrieving your data, only 3 of the 60 are doing the heavy lifting. If I use AGE as a distribution key, results may be better, but there are still likely to be lumps. If I use a surrogate key, I get a much finer and more even level of distribution.
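As a sketch (table and column names are hypothetical), the difference between a low- and high-cardinality distribution key looks like this:

```sql
-- Low-cardinality key: nearly every row hashes to one of a handful
-- of the 60 distributions, leaving most of the warehouse idle.
CREATE TABLE dbo.FactVisitSkewed
(
    VisitKey   BIGINT  NOT NULL,
    PatientKey BIGINT  NOT NULL,
    Gender     CHAR(1)     NULL
)
WITH
(
    DISTRIBUTION = HASH(Gender),
    CLUSTERED COLUMNSTORE INDEX
);

-- Same table distributed on its high-cardinality surrogate key:
-- rows spread far more evenly across all 60 distributions.
CREATE TABLE dbo.FactVisit
(
    VisitKey   BIGINT  NOT NULL,
    PatientKey BIGINT  NOT NULL,
    Gender     CHAR(1)     NULL
)
WITH
(
    DISTRIBUTION = HASH(VisitKey),
    CLUSTERED COLUMNSTORE INDEX
);
```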

The thing with Azure is that if I have lumps of data across my distributions, my query will run only as fast as the slowest one: if I have a lump of 50 million records in one distribution while the remainder hold 10, my query will only return once the 50 million are loaded - the others will stand by and wait on the 50 million.
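You can check for those lumps directly. One way (using the hypothetical table from above) is the DBCC command the warehouse provides for exactly this:

```sql
-- Returns one row per distribution, including its row count; a big
-- gap between the smallest and largest counts across the 60 rows
-- returned is the "lump" described above.
DBCC PDW_SHOWSPACEUSED('dbo.FactVisit');
```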

Why not just use round robin for maximum distribution? Because if I want to join my 1 billion row fact table to another table, the Azure nodes will need to talk to each other to complete the join.
For instance: if I have 2 fact tables with a billion rows each, and they're distributed using round robin, the Azure Warehouse will need to redistribute rows so that matching records sit together before it can complete the join. This is done by the internal DMS - Data Movement Service - which runs when the nodes go looking for data on foreign nodes to complete their respective queries. If, instead, the 2 tables are distributed evenly on a shared hash column, their sibling records will already be stored on their own local nodes and no DMS call will be required.
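A sketch of that shared-hash-column setup (all names hypothetical): both fact tables are distributed on the join column, so the join is satisfied distribution-by-distribution with no data movement.

```sql
-- Both fact tables hash-distributed on the same join column, so
-- matching rows live in the same distribution.
CREATE TABLE dbo.FactOrders
(
    OrderKey    BIGINT NOT NULL,
    CustomerKey BIGINT NOT NULL
)
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dbo.FactPayments
(
    PaymentKey  BIGINT NOT NULL,
    CustomerKey BIGINT NOT NULL
)
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

-- This join can run entirely locally on each distribution.
SELECT o.CustomerKey, COUNT(*) AS PaymentCount
FROM dbo.FactOrders AS o
JOIN dbo.FactPayments AS p ON p.CustomerKey = o.CustomerKey
GROUP BY o.CustomerKey;
```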

What about 10 or 20 dimension tables that need to be joined onto a fact table? Well, this is where the aforementioned REPLICATE distribution method comes into play. If every node has a copy of the DimCountry table, a node will never need to invoke the DMS to get country data from another node to satisfy the join.
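Assuming a hypothetical FactSales table hash-distributed on its own surrogate key, and a DimCountry created with DISTRIBUTION = REPLICATE, a star join like the one below can be resolved entirely locally on each node:

```sql
-- Hypothetical fact table, hash-distributed on its surrogate key.
CREATE TABLE dbo.FactSales
(
    SaleKey    BIGINT         NOT NULL,
    CountryKey INT            NOT NULL,
    SaleAmount DECIMAL(18, 2) NOT NULL
)
WITH ( DISTRIBUTION = HASH(SaleKey), CLUSTERED COLUMNSTORE INDEX );

-- Because every compute node holds a full copy of the replicated
-- DimCountry, each node can resolve the country lookup for its own
-- fact rows without calling out to any other node.
SELECT c.CountryName, SUM(f.SaleAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimCountry AS c ON c.CountryKey = f.CountryKey
GROUP BY c.CountryName;
```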


So, if you are seeing a ShuffleMoveOperation, it means the nodes are stepping outside their local data to retrieve rows held on other nodes. The solution is to keep the data local: simpler queries, replicated dimension tables, and HASH distribution on common, high-cardinality join columns.
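To confirm whether a given query shuffles, you can prefix it with EXPLAIN (the XML plan it returns will contain a SHUFFLE_MOVE element if one occurs), or look at the request-step DMV after the fact. A sketch, with hypothetical table names:

```sql
-- EXPLAIN returns the distributed execution plan as XML instead of
-- running the query; look for SHUFFLE_MOVE elements in the output.
EXPLAIN
SELECT a.CustomerKey, COUNT(*) AS Cnt
FROM dbo.FactA AS a
JOIN dbo.FactB AS b ON b.CustomerKey = a.CustomerKey
GROUP BY a.CustomerKey;

-- Steps from recent requests that involved a shuffle, with row
-- counts and how long each step took.
SELECT request_id, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE operation_type = 'ShuffleMoveOperation';
```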