Right, so I've had to handle a number of these over the last few weeks while setting up a data pipeline in an Azure SQL Data Warehouse. To understand the ShuffleMoveOperation, it helps to have a bit of context on the Azure warehouse architecture, so I'll briefly cover that here.
An Azure Warehouse is a double-edged sword. It has multiple databases under the hood, which lets it make noticeably quick work of any given workload involving a few million records or more. It does this by spreading data across 60 distribution databases (which I'll loosely call nodes) in its ecosystem of components. However, the downside is that if the developer/warehouse admin builds his or her solution in an overly complex manner, or in a way where the objects of interest are not aligned in terms of distribution, he or she is going to have a difficult time.
In other words, to succeed in leveraging the warehouse's dog sled of parallelism, you need to design your database tables in a way that is logically aligned with those nodes.
How do I do this? With the distribution key attribute of a
table. Distribution tells the database how to disseminate data across the disks
of those 60 nodes.
There are 3 distribution types I am aware of.
ROUND_ROBIN - This merely distributes records evenly across
all nodes in a 'one for you, and one for you, and one for you' manner - thus
there is no business domain logic applied in the distribution and the records
are evenly and randomly distributed. This is effective if you need to get data
into a location fast.
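As a minimal sketch, with made-up table and column names, a round-robin staging table looks like this:

```sql
-- Hypothetical staging table: rows are sprayed evenly across all
-- 60 nodes with no regard for their content.
CREATE TABLE dbo.StageSales
(
    SaleId      BIGINT        NOT NULL,
    CustomerId  INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP    -- a heap keeps the initial load fast
);
```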
REPLICATE - This option tells the database that it needs to create 60 copies of the object across the 60 nodes. I've used this for dimension tables. So when I add a record to DimCountry, the record is appended to all 60 copies of the table across the nodes. More on this in a moment.
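A replicated dimension is declared the same way; the column names here are again just placeholders:

```sql
-- Hypothetical dimension table: every node keeps a full local copy,
-- so joins against it never need to reach across nodes.
CREATE TABLE dbo.DimCountry
(
    CountryKey  INT          NOT NULL,
    CountryName NVARCHAR(64) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```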
HASH(COLUMN_NAME) - This distribution option is one of the two keys to ridding yourself of the ShuffleMove (more on this in a moment). It is very simple: a high-cardinality column in the HASH function means more even data distribution across the 60 nodes; a low-cardinality column means uneven distribution, or lumps of data on all or a few nodes. Azure distributes the data by passing each row's value for that column through a deterministic hash function, and the result determines which node the record will be stored on. The higher the cardinality, the more evenly the data is distributed. So what happens if one uses the gender column for the HASH function? The data will likely land on 3 nodes: 1 for NULL, 1 for Male and 1 for Female. This means that retrieving your data will use only 3 of the 60 nodes for the heavy lifting. If I use AGE as a distribution key, the results may be better, but there are still likely to be lumps. If I use a surrogate key, I should get a much finer and more even distribution.
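A hash-distributed fact table, sketched with made-up names, looks like this - here the surrogate key CustomerKey is the high-cardinality column doing the work:

```sql
-- Hypothetical fact table: each row's CustomerKey is hashed and the
-- result decides which of the 60 nodes stores the row.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,  -- high cardinality: even spread
    CountryKey  INT           NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```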
The thing with Azure is that if I have lumps of data across my nodes, my query will run only as fast as the slowest node: if I have a lump of 50 million records on one node while the remaining nodes hold 10 each, my query will only return once the 50 million are loaded - the other nodes will stand by and wait on the 50 million.
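You can check for these lumps directly: DBCC PDW_SHOWSPACEUSED reports the row count held by each node for a given table (the table name below is the hypothetical one from the earlier sketch):

```sql
-- One result row per node: large differences in the ROWS column
-- between nodes indicate data skew.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");
```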
Why not just use round robin for maximum distribution? Because if I want to join my 1-billion-row fact table to another table, the Azure nodes will need to talk to each other to complete the join.
For instance: if I have 2 fact tables with a billion rows each, and they're distributed using round robin, the Azure Warehouse will need to draw up a master list of all records to complete the join. This requires something called the internal DMS - the Data Movement Service. This is when the nodes talk to each other in an attempt to find data on foreign nodes to complete their respective queries. Conversely, if the 2 tables are distributed evenly on a shared hash column, their sibling records will be stored on the same local nodes and no DMS call will be required, as the sketch below shows.
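Continuing with the assumed table names from above: both fact tables hash on the same join key, so matching rows land on the same node and the join needs no data movement:

```sql
-- Hypothetical second fact table, hash-distributed on the same key
-- as dbo.FactSales.
CREATE TABLE dbo.FactReturns
(
    ReturnKey   BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,
    SaleKey     BIGINT NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Because both sides are distributed on CustomerKey and the join is
-- on CustomerKey, each node can complete its share of the join locally.
SELECT s.CustomerKey, COUNT_BIG(*) AS SalesWithReturns
FROM dbo.FactSales   AS s
JOIN dbo.FactReturns AS r ON r.CustomerKey = s.CustomerKey
GROUP BY s.CustomerKey;
```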
What about 10 or 20 dimension tables that need to be joined onto a fact table? Well, this is where the aforementioned REPLICATE distribution method comes into play. If every node has a copy of the DimCountry table, a node will not need to invoke the DMS to fetch country data from another node to satisfy the join.
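Continuing with the hypothetical tables above, a fact-to-dimension join like this stays entirely local, because every node already holds the whole of DimCountry:

```sql
-- DimCountry is replicated, so no node needs to ask a neighbour
-- for country rows: the join runs without a shuffle.
SELECT d.CountryName, SUM(s.Amount) AS TotalSales
FROM dbo.FactSales  AS s
JOIN dbo.DimCountry AS d ON d.CountryKey = s.CountryKey
GROUP BY d.CountryName;
```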
So, if you are seeing a ShuffleMoveOperation, it means that the nodes are stepping out of their local domain to retrieve data from another node. The solution is to keep the data local: simpler queries, replicated tables, and hash distribution on common foreign key columns with high cardinality.
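If you want to confirm whether a given query is shuffling, the step-level DMV for the warehouse shows it; the request id below is a placeholder you would take from sys.dm_pdw_exec_requests:

```sql
-- Each row is one step of the distributed plan; a ShuffleMoveOperation
-- step means data is being moved between nodes to complete the query.
SELECT step_index, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'   -- placeholder request id
ORDER BY step_index;
```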