Monday, 10 July 2017

ShuffleMoveOperation

Right, so I've had to handle a number of these over the last few weeks while setting up a data pipeline in an Azure warehouse. To understand the ShuffleMoveOperation, I will give a bit of context on the Azure architecture and will point to the post below, which briefly talks about the Azure warehouse.

An Azure warehouse is a double-edged sword. Under the hood it has multiple databases, which lets it make noticeably quick work of any given workload involving a few million records or more. It does this by using 60 database nodes in its ecosystem of components. The downside is that if the developer or warehouse admin builds the solution in a complex manner, or in a way where the objects of interest are not aligned in terms of distribution, he or she is going to have a difficult time.
In other words, to succeed in harnessing the warehouse's dog sled of parallelism, one needs to design the database tables in a way that is logically aligned with the nodes.
How do I do this? With the distribution key attribute of a table. Distribution tells the database how to disseminate data across the disks of those 60 nodes.

There are 3 distribution types I am aware of.

ROUND_ROBIN - This merely distributes records evenly across all nodes in a 'one for you, and one for you, and one for you' manner - thus there is no business domain logic applied in the distribution and the records are evenly and randomly distributed. This is effective if you need to get data into a location fast.

REPLICATE - This option tells the database that it needs to create 60 copies of the object across the 60 nodes. I've used this for dimensional tables. So when I add a record to the DimCountry - the record is appended to the 60 copies of the table across the nodes. More on this in a moment.

HASH(COLUMN_NAME) - This distribution option is one of the two keys to ridding yourself of the ShuffleMove (more on this in a moment). It is very simple: a high-cardinality column in the HASH function means a more even data distribution across the 60 nodes; a low-cardinality column means an uneven distribution, or lumps of data on only a few nodes. Azure distributes the data by passing each row's column value into an MD5 hash function, and the result determines which node the record is stored on. The higher the cardinality, the more evenly the data is distributed. So what happens if one uses a gender column in the HASH function? The data will likely land on only 3 nodes: one for NULL, one for Male and one for Female, which means that retrieving the data uses only 3 of the 60 nodes for the heavy lifting. If I use AGE as a distribution key, the results may be better, but there are still likely to be lumps. If I use a surrogate key, I get a much finer and more even level of distribution (example table definitions follow below).
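A minimal sketch of each option; the table and column names are hypothetical and the index choices are just illustrative defaults.

-- ROUND_ROBIN: rows are dealt evenly across the 60 distributions, no business logic involved.
CREATE TABLE dbo.StagingEvents
(
    EventId   BIGINT       NOT NULL,
    EventName VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- REPLICATE: a full copy of the (small) dimension table is kept on every node.
CREATE TABLE dbo.DimCountry
(
    CountryKey  INT          NOT NULL,
    CountryName VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- HASH: each row's key is hashed and the result decides which node stores the row.
-- A high-cardinality surrogate key gives an even spread.
CREATE TABLE dbo.FactSales
(
    SalesKey   BIGINT        NOT NULL,
    CountryKey INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(SalesKey), CLUSTERED COLUMNSTORE INDEX);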

The thing with Azure is that if I have lumps of data across my nodes, my query will only run as fast as the slowest node: if I have a lump of 50 million records on one node while the remaining nodes hold only a fraction of that, my query will only return once the 50 million are processed - the other nodes will stand by and wait on the 50 million to load.

Why not just use round robin for maximum distribution? Because if I want to join my 1-billion-row fact table to another table, the Azure nodes will need to talk to each other to complete the join.
For instance: if I have 2 fact tables with a billion rows each, and they're distributed using round robin, the Azure warehouse will need to draw up a master list of all records to complete the join. This requires something called DMS - the Data Movement Service. This is when the nodes talk to each other in an attempt to find data on foreign nodes to complete their respective queries. If, instead, the 2 tables are distributed on a shared hash column, their sibling records are stored on the same local node and no DMS call is required.

What about 10 or 20 dimension tables that need to be joined onto a fact table? Well, this is where the aforementioned REPLICATE distribution method comes into play. If every node has its own copy of the DimCountry table, the node will not need to invoke DMS to fetch country data from another node in order to satisfy the join.


So, if you are seeing a ShuffleMoveOperation, it means the nodes are stepping outside their local domain to retrieve data held on other nodes. The solution is to keep the data local: simpler queries, replicated dimension tables, and hash distribution on shared, high-cardinality join columns.
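If a large round-robin table keeps shuffling against a hash-distributed partner, one practical fix is to rebuild it with CTAS on the shared join key and swap the names. A minimal sketch, with hypothetical table and column names:

-- Rebuild the round-robin fact table as a hash-distributed copy on the join key.
CREATE TABLE dbo.FactOrders_Hash
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT *
  FROM dbo.FactOrders;

-- Swap the names so future joins on CustomerKey stay node-local.
RENAME OBJECT dbo.FactOrders      TO FactOrders_RoundRobin;
RENAME OBJECT dbo.FactOrders_Hash TO FactOrders;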

Thursday, 29 June 2017

Explain Plan in Azure Warehouse (sort of)

Simplistically, the purpose of an explain plan is to reveal how the database plans to deliver the results for a given query. In Oracle we have "SQL> EXPLAIN PLAN FOR <QUERY>;" and in SQL Server we have "right-click -> Display Estimated Execution Plan", which shows a graphical diagram of the execution steps and path. Both are important for diagnosing query performance issues.

In Azure Data Warehouse, we have very little. It takes a certain amount of effort to get something readable, as the aforementioned SSMS right-click option does not work in the studio. There are a number of posts on the internet about how to get this information, but the results seem somewhat lacklustre: one method is to get the execution plan in an XML format that is not compatible with SSMS, so one is left looking at an XML file. The other option is to go to the Azure portal and watch the query running via the web console. Both options fall short of what most database developers are accustomed to.
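For reference, the XML route is just the EXPLAIN statement: prefix the query with it and the warehouse returns its distributed plan as XML instead of running the query. A minimal sketch with hypothetical tables:

-- Returns the distributed execution plan as XML; the query itself is not executed.
EXPLAIN
SELECT c.CountryName,
       SUM(f.Amount) AS TotalAmount
  FROM dbo.FactSales f
  JOIN dbo.DimCountry c ON c.CountryKey = f.CountryKey
 GROUP BY c.CountryName;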

I have found a useful alternative, although it is not exactly a traditional explain plan: it is the sequence of steps the database is taking to produce the results.
These steps can be found in the view "sys.dm_pdw_request_steps" - how I use it is shown below.

I can only use the following queries while my query is running; once the query is done, the steps vanish from the steps table. This is therefore more of a technical query progress report showing the steps that have run, the steps that are running, and the steps that will run. The first task is to identify my session's REQUEST_ID in sys.dm_pdw_exec_requests - I find that the most laborious queries pop up in the top results. Once I have those master requests and their child requests, I use them to find the corresponding segments of steps in the "sys.dm_pdw_request_steps" table.

print ''
print '============================================='
print '============================TOP QUERIES IN DW'
print '============================================='
print ''
go

  select TOP 10
         status,
         request_id as request_id,
         submit_time as submit_time,
         start_time as start_time,
         (total_elapsed_time/1000) as total_elapsed_time,
         substring(command,0,50) command
    from sys.dm_pdw_exec_requests
order by 1 desc,total_elapsed_time DESC
GO

print ''
print '========================================='
print '============================RUNNING STEPS'
print '========================================='
print ''
go


  SELECT DATEDIFF(SECOND, s.start_time, s.end_time) Time_In_Seconds,
         s.request_id AS request,
         s.operation_type AS operation_type,
         s.distribution_type AS distribution_type,
         s.location_type AS location_type,
         s.status AS status,
         s.start_time AS start_time,
         s.end_time AS end_time,
         s.total_elapsed_time AS total_elapsed_time,
         s.row_count AS row_count,
         s.command AS command
    FROM sys.dm_pdw_request_steps s
   WHERE request_id IN (SELECT request_id
                          FROM sys.dm_pdw_exec_requests
                         WHERE status = 'Running'
                           AND total_elapsed_time > 5 ) 
ORDER BY s.request_id, s.step_index;

For simplicity, I reduced the result sets, both column- and row-wise, so I can focus on a session that is taking a while to run. Request QID89235213 is a massive query running in parallel across the warehouse; it is inserting data into a table. Its steps are shown in the RUNNING STEPS block. I can see the time taken by each step, the number of rows per step and the type of operation it is running. At the time of rendering, the results show that the process is running a ShuffleMoveOperation, which is likely processing around 66 million records. It still has a number of PENDING steps to process.

=============================================
============================TOP QUERIES IN DW
=============================================

status                           request_id           total_elapsed_time command
-------------------------------- -------------------- ------------------ --------------------------------------------------
Running                          QID89235213                        1881 INSERT INTO Warehouse.tb_FactTable_WHTest                                                      

=========================================
============================RUNNING STEPS
=========================================

request      operation_type       distribution_type location_type   status     total_elapsed_time row_coun command
------------ -------------------- ----------------- --------------- ---------- ------------------ -------- ------------------------------
QID89235213  RandomIDOperation    Unspecified       Control         Complete   0                  -1       TEMP_ID_1788
QID89235213  OnOperation          AllDistribution   Compute         Complete   78                 -1       CREATE TABLE [tempdb].[dbo].[
QID89235213  ShuffleMoveOperation Unspecified       DMS             Running    936636             -1       SELECT [T1_1].[EventName] AS
QID89235213  OnOperation          AllDistribution   Compute         Pending    NULL               -1       DROP TABLE [tempdb].[dbo].[TE
QID89235213  OnOperation          AllDistribution   Compute         Pending    NULL               -1       DROP TABLE [tempdb].[dbo].[TE


As the engine completes each step, the query moves closer to completion. This helps me know what's happening under the hood of the Azure warehouse database.
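As an aside, when the running step is a ShuffleMoveOperation, I believe one can dig a level deeper with the sys.dm_pdw_dms_workers view, which breaks the data movement down per distribution for a given request. A minimal sketch, reusing the request id from the example above:

-- Data movement workers for one request: rows and bytes moved per distribution.
SELECT w.step_index,
       w.pdw_node_id,
       w.distribution_id,
       w.type,
       w.status,
       w.rows_processed,
       w.bytes_processed,
       w.total_elapsed_time
  FROM sys.dm_pdw_dms_workers w
 WHERE w.request_id = 'QID89235213'
 ORDER BY w.rows_processed DESC;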

Wednesday, 21 June 2017

Microsoft Azure Data Warehouse and Distribution Keys

I am in the process of widening the window of a data propagation line from one week to 12 weeks. The issue was that the loading time for the process was not scaling well - at least until yesterday. In trying to optimize the process, I have learned a number of important things about Azure SQL Data Warehouse.

An OLTP database developer or DBA cannot think of the Azure SQL warehouse in the same way as an OLTP or monolithic system. The most important thing to understand about the Azure warehouse is that when a developer engages it, they are leveraging up to 60 hidden databases behind the scenes through one-to-many database controllers, governed by the DWU. DWU stands for Data Warehouse Unit - a standardized metric which groups all hardware resources together in a relative manner and increases them proportionally as the DWU configuration of the cloud-based warehouse is increased (it is also what you are billed on). This distributed processing and storage is a very important factor when choosing a distribution key for a table and when designing objects and mechanisms for bulk operations.
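As a side note on the DWU: the setting can be changed on demand and the underlying resources scale with it. A minimal sketch of scaling a warehouse, with a hypothetical database name and service objective:

-- Scale the warehouse to a different DWU level (service objective).
ALTER DATABASE MyDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW400');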

A distribution key, one of several distribution options, is a table-creation option which tells Azure HOW to distribute the data for a given table ACROSS those 60 databases. So if one chooses to distribute 100 million records on a low-cardinality field containing only 2 distinct values and NULLs, those 100 million records will be spread across only 3 databases. Therefore, at the very least, a high-cardinality field or a surrogate key should be chosen to distribute the data EVENLY across the nodes and leverage the true power of parallel processing.

If a developer is smart, they will make sure the distribution key they select matches the one used by the tables it will be joined to. To explain what this means, I need to mention how the database performs distribution. The database uses a hash function on the value in the selected column to relegate each record to one of sixty "bucket" databases. This means that a pk/fk value shared by 2 tables hashes to the same bucket, so the database physically stores the records of those 2 tables within the same node. Why does this matter? Well, it means that when those 2 tables are joined, the database engine does not need to break context and contact a different node in search of records qualifying the join - it merely carries out a like-for-like join within its own node. Multiply this by 60, and one has 60 databases performing disparate joins independently and in parallel, returning records to the Azure warehouse controller as they're matched. Pretty neat.
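A minimal sketch of what this alignment looks like in practice, with hypothetical names: both tables are hash-distributed on the column they are joined on, so each node can complete its share of the join locally.

CREATE TABLE dbo.FactOrderHeader
(
    OrderKey    BIGINT NOT NULL,
    CustomerKey INT    NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.FactOrderLine
(
    OrderLineKey BIGINT        NOT NULL,
    OrderKey     BIGINT        NOT NULL,
    Amount       DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderKey), CLUSTERED COLUMNSTORE INDEX);

-- Matching OrderKey values hash to the same distribution, so no data movement is needed here.
SELECT h.CustomerKey,
       SUM(l.Amount) AS TotalAmount
  FROM dbo.FactOrderHeader h
  JOIN dbo.FactOrderLine   l ON l.OrderKey = h.OrderKey
 GROUP BY h.CustomerKey;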


As for the distribution of data and skew (a term used to describe unevenly distributed data), I have a number of queries which detail the problem and I will be posting them in time.
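In the meantime, one quick way to see how a table's rows are spread across the 60 distributions (a sketch of a standard check, not necessarily the queries referred to above) is DBCC PDW_SHOWSPACEUSED; a big spread in the row counts points to skew on the chosen distribution key.

-- Rows and space used per distribution for one (hypothetical) table.
DBCC PDW_SHOWSPACEUSED("dbo.FactSales");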

A more descriptive post (including visuals) can be found here

Friday, 16 June 2017

Data Analytics, Power BI and Crime Statistics

Two weeks ago I was moved into an analytics team to help the data analyst render 150 million records of raw, unstructured data. I was given an overview of the data, the problems preventing it from being rendered, and how it is being collected. The data analyst was trying to render all of this data through a single view. Over 5 days I built a propagation mechanism, or 'data pipeline', that processed the data in a SQL warehouse and aggregated it so it could be consumed without having to perform calculations at consumption time - problem solved. The waiting and crashing were no more, and the analytics team could now visually represent the data they had been sitting on for a number of months. In that short 2-week period, I also got a brief introduction to MS Power BI - an easy-to-use graphical desktop application that can turn most data into something visual and rather meaningful.

After working in the back end for a very, very long time via SQLCMD and SQLPlus, with the aim of designing and optimizing database systems, I consider this application a breath of fresh air. I am now able to see the data, and to value the data for the first time, as opposed to valuing the system that prepares the data on behalf of a client or customer.

This new application had me curious about its other uses. So, I went to the SAPS (South African Police Service) website and downloaded crime statistics for the country. I then flattened the data in Google Docs and brought it into Power BI for the purpose of 'seeing' what was happening in South Africa in relation to crime.


The first screenshot is of a treemap diagram which aggregates all crimes over a 10-year period (2005 to 2014) into categories, similar to a GDP treemap diagram. I've made attempted murder and murder a bright red colour.

Oh yeah, and aaaall my camera equipment was stolen out of my car last week, so that crime fits squarely in the 3rd row, 2nd column: I am one of the 1.3 million "theft out of or from motor vehicle" statistics. Moving on.




The screenshot above displays murders and attempted murders over the same time period.
We can see that the rate drops in 2011 and 2012 and then slowly begins to climb again.
It should be noted that the Soccer World Cup took place in South Africa in 2010, and that our economy was quite resilient during and after the 2008 recession.


Monday, 22 May 2017

Database Source Code Output

Today the development databases are inaccessible. The network team performed a migration over the weekend and now most of the development teams are unable to do their work. I've taken this opportunity to note something down. It's a mechanism I've used for a number of years when working in a Windows environment.

First, a bit of context. I spent several years working with Oracle on Linux and Unix. For those years, I made use of parametrized shell scripting and SQLPLUS to make short work of analytical queries, resolving domain/business problems quickly without the need to write out the SQL each time. I had become accustomed to scripting in a Linux environment and automating my tasks. Moving to a Windows environment effectively clipped my wings, as I was no longer able to use scripts and work with immediacy. Or so I thought. Within a week of working with SQL Server, I made a fantastic transition to using Linux shell scripts in the Windows environment by means of a Linux emulator and SQLCMD. One of the positive effects is that I can drive native Windows applications from the shell script. One of these little inventions is detailed below. The same method can also be used with SQLPLUS on Windows.

Enter SourceDump
The sourcedump function is one that takes in a module name and dumps the entire object out into a .sql file. Once dumped, the file is opened with Notepad++.
1. The first part required is a script that dumps the module out. This is a SQL Server database, so I make use of sys.all_sql_modules instead of Oracle's dba_source.

set nocount on
use $(Database)
go

select top 1 
       definition
  from sys.all_sql_modules (nolock)
 where object_name(object_id) like '%$(search_phrase)%'
   and object_id > 0

I place this query into a file called
c:\mine\sql\r_sql_source_dump.sql

2. The next task is to have a folder to store the files output from the script.
c:\mine\sourcedump

Once these 2 requirements are met, I move onto the shell configuration and functions.
 
 
3. In the Linux emulator, open the .profile file with vim editor
[rayb.] ➤ vim $HOME/.profile

4. Adjust the $PATH variable to increase its scope to the key SQL directories (SQLCMD + DLLs) and to include a text editor (Notepad++ is what I make use of)
export PATH=$PATH:"/drives/c/Program Files/Notepad++":"/drives/c/Program Files (x86)/Microsoft SQL Server/Client SDK/ODBC/130/Tools/Binn/":"/drives/c/Program Files (x86)/Microsoft SQL Server/130/Tools/Binn/":"/drives/c/Program Files (x86)/Microsoft SQL Server/130/DTS/Binn/":"/drives/c/Program Files (x86)/Microsoft SQL Server/130/Tools/Binn/ManagementStudio/":"/drives/c/Program Files/TortoiseSVN/bin":"/drives/c/Users/rayb/AppData/Local/Microsoft/WindowsApps"

5. Add the following shell function to the .profile file
sourcedump()
{
  sqlcmd -y 0 -w 65000 -S "<SERVER_IP>,<INSTANCE_PORT>" -v Database="$1" -U sa -P <PASSWORD> -v search_phrase="$2" -o "c:\\mine\\sourcedump\\$2.sql" -i c:\\mine\\sql\\r_sql_source_dump.sql
  notepad++ "c:\\mine\\sourcedump\\$2.sql"
}

6. Re-source the .profile file
[2017-05-22 09:14.04]  ~
[rayb.       ] ➤ source $HOME/.profile

[2017-05-22 09:14.09]  ~
[rayb.        ] ➤

7. Output a module by calling the function with a database name and part of a module name - for example (hypothetical names): sourcedump DevDB usp_LoadOrders. The matching module's definition is dumped to c:\mine\sourcedump\usp_LoadOrders.sql and opened in Notepad++.

Friday, 12 May 2017

SQL Server Top Running Queries by CPU seconds per minute

I have arrived in a new environment. I handed the Oracle Real Application Clusters system over in a stable condition and have since moved into a new Microsoft environment. In the past week, I've been writing a number of SQL scripts for SQLCMD. The scripts will speed up the process of learning and understanding the environment.

The script below was refurbished for the current environment - it is a TOPCPU script I wrote for another SQL Server environment around 4 years ago, and it follows a similar form to the one I use for Oracle environments. The substrings are there to keep the output width under control when run through SQLCMD. The script divides the CPU time each session has consumed by the duration of that session in minutes, and then orders the results by this figure. It is not 100% perfect, but it often sheds light on what is going on inside the system, and any session that is out of control will usually spike to the top of the report. It has served me well over the years. It can also be adjusted to report on IO, though in my anecdotal experience TOPCPU surfaces bad sessions more often than TOPIO does.

set nocount on

use $(database)

go

print ''

print '___________________________________________________________LOAD UP'

print ''

print '=============================================='
print 'TOP CPU CONSUMING SESSIONS ON : $(database)'
print '=============================================='
print ''
go

     select top 20
            substring(convert(varchar,sp.spid),0,4) sid,
            substring(sp.hostname,0,10) hostname,
            substring(object_name(st.objectid,sp.dbid),0,20) obj_name,
            substring(convert(varchar(45),replace(replace(replace(st.text,char(13),''),char(10),''),char(9),'')),0,45) text,
            substring(sp.loginame,0,10) loginame,
            substring(db_name(sp.dbid),0,10) did,
            CASE WHEN sp.blocked > 0 THEN 'YES' ELSE 'NO' END blx,
            substring(sp.lastwaittype,0,15) lastwaittype,
            substring(sp.status,0,10) status,
            substring(sp.program_name,0,15) program_name,
            substring(convert(varchar(15),replace(replace(sp.cmd,char(13),''),char(10),'')),0,15) cmd,
            substring(sp.nt_domain,0,10) nt_domain,
            substring(sp.nt_username,0,8) nt_username,
            convert(varchar(6),datediff(minute,sp.login_time,current_timestamp )) minutes,
            sp.login_time,
            convert(varchar(6),
                    sp.cpu/case 
                             when datediff(minute,sp.login_time,current_timestamp ) <= 0 then 1 
                             else datediff(minute,sp.login_time,current_timestamp ) end) cpu_pm
       from sys.sysprocesses sp (nolock)
cross apply sys.dm_exec_sql_text(sp.sql_handle) as st 
 order by sp.cpu/case 
                   when datediff(minute,sp.login_time,current_timestamp ) <= 0 then 1 
                   else datediff(minute,sp.login_time,current_timestamp ) end  desc
go

print ''

print '_________________________________________________________COMPLETED'
print ''

Results below - obfuscated owing to the sensitive nature of the business. The script was set up for a screen resolution of 1920x1080 - the output may be jumbled on anything smaller.