Simplistically, the purpose of an explain plan is to reveal how the database intends to deliver the results for a given query. In Oracle we have "SQL> EXPLAIN PLAN FOR <QUERY>;" and in SQL Server we have "click->display estimated execution plan", which produces a graphical diagram showing all the details of the execution steps and path. Both are important for diagnosing query performance issues.
In Azure Data Warehouse, we have very little. It takes a certain amount of effort to get something readable, as the aforementioned SSMS "rclick->display" option does not work in the studio. There are a number of posts on the internet about how to get this information, but the results seem somewhat lacklustre: one method is to get the execution plan in an XML format that is not compatible with SSMS - so one is left to look at an XML file. The other option is to go to the Azure portal and watch the query running via the web console. Both options fall short of what most db developers are accustomed to.
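For reference, the XML route mentioned above is the EXPLAIN statement. A minimal sketch (the table and column are placeholders borrowed from the example further down):
EXPLAIN
SELECT EventName, count(*)
FROM Warehouse.tb_FactTable_WHTest
GROUP BY EventName;
The result is showplan XML describing the distributed plan, including any data movement operations - readable, but not something SSMS will render graphically.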
I have found a useful alternative, although it is not exactly a traditional explain plan: it is the sequence of steps the database takes to achieve the results.
These steps can be found in the DMV "sys.dm_pdw_request_steps" - how I use it is shown below.
These steps are only visible while my query is running; once the query is done, they vanish from the steps table. Therefore, this is more of a technical query progress report showing the steps that ran, the steps which are running, and the steps which will run. The first task is to identify my session's request_id in sys.dm_pdw_exec_requests; I find that the most laborious queries pop up in the top results. Once I have those master requests, I use their request_ids to find the corresponding segments of steps in the "sys.dm_pdw_request_steps" table.
print ''
print '============================================='
print '============================TOP QUERIES IN DW'
print '============================================='
print ''
go
select TOP 10
status,
request_id as request_id,
submit_time as submit_time,
start_time as start_time,
(total_elapsed_time/1000) as total_elapsed_time, -- milliseconds converted to seconds
substring(command,1,50) command                  -- first 50 characters of the statement
from sys.dm_pdw_exec_requests
order by 1 desc,total_elapsed_time DESC
GO
print ''
print '========================================='
print '============================RUNNING STEPS'
print '========================================='
print ''
go
SELECT DATEDIFF(SECOND, s.start_time, s.end_time) Time_In_Seconds,
s.request_id AS request,
s.operation_type AS operation_type,
s.distribution_type AS distribution_type,
s.location_type AS location_type,
s.status AS status,
s.start_time AS start_time,
s.end_time AS end_time,
s.total_elapsed_time AS total_elapsed_time,
s.row_count AS row_count,
s.command AS command
FROM sys.dm_pdw_request_steps s
WHERE s.request_id IN (SELECT request_id
FROM sys.dm_pdw_exec_requests
WHERE status = 'Running'
AND total_elapsed_time > 5 ) -- total_elapsed_time is in milliseconds
ORDER BY s.request_id, s.step_index;
For simplicity, I reduced the result sets both column- and row-wise so I can focus on a session that is taking a while to run. Request QID89235213 is a massive query running in parallel across the warehouse; it is inserting data into a table. Its steps are shown in the RUNNING STEPS block. I can see the time taken for each step, the number of rows per step and the type of operation being run. At the time of capture, the results show that the process is running a ShuffleMoveOperation which is likely processing 66 million records. It still has a number of PENDING steps to process.
=============================================
============================TOP QUERIES IN DW
=============================================
status request_id total_elapsed_time command
-------------------------------- -------------------- ------------------ --------------------------------------------------
Running QID89235213 1881 INSERT INTO Warehouse.tb_FactTable_WHTest
=========================================
============================RUNNING STEPS
=========================================
request operation_type distribution_type location_type status total_elapsed_time row_count command
------------ -------------------- ----------------- --------------- ---------- ------------------ -------- ------------------------------
QID89235213 RandomIDOperation Unspecified Control Complete 0 -1 TEMP_ID_1788
QID89235213 OnOperation AllDistribution Compute Complete 78 -1 CREATE TABLE [tempdb].[dbo].[
QID89235213 ShuffleMoveOperation Unspecified DMS Running 936636 -1 SELECT [T1_1].[EventName] AS
QID89235213 OnOperation AllDistribution Compute Pending NULL -1 DROP TABLE [tempdb].[dbo].[TE
QID89235213 OnOperation AllDistribution Compute Pending NULL -1 DROP TABLE [tempdb].[dbo].[TE
As the engine completes each step, the query moves closer to completion. This helps me know what's happening under the hood of the Azure Warehouse database.
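If I want to drill into a data movement step such as the ShuffleMoveOperation above, sys.dm_pdw_dms_workers reports its progress per worker and distribution. A minimal sketch, assuming the same request_id as in the output above:
SELECT w.request_id,
w.step_index,
w.type,
w.status,
w.rows_processed,
w.bytes_processed,
w.total_elapsed_time
FROM sys.dm_pdw_dms_workers w
WHERE w.request_id = 'QID89235213' -- example request_id from the output above
ORDER BY w.step_index, w.rows_processed DESC;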
Thursday, 29 June 2017
Wednesday, 21 June 2017
Microsoft Azure Data Warehouse and Distribution Keys
An OLTP database developer or DBA cannot think of Azure SQL Warehouse in the same way as an OLTP or monolithic system. The most important thing to understand about Azure Warehouse is that when a developer engages it, they are leveraging up to 60 hidden database systems behind the scenes through one to many database controllers (controlled by the DWU). DWU stands for Data Warehousing Unit - a standardized metric which groups all hardware resources together in a relative manner and increases them proportionally to the increase in the DWU configuration of the cloud-based warehouse (it is also used for billing). This distributed processing and storage is a very important factor when choosing a distribution key for a table and designing objects and mechanisms for bulk operations.
A distribution key, one of many distribution options, is a table creation option which tells Azure HOW to distribute the data for a given table ACROSS those 60 databases. So if one chooses to distribute 100 million records on a low-cardinality field containing only 2 distinct values and NULLs, then those 100 million records will be spread across only 3 databases. Therefore, at the very least, a high-cardinality field or a surrogate key should be chosen to distribute the data EVENLY across the nodes and leverage the true power of parallel processing.
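As a rough sketch of what this looks like (the table and column names here are purely illustrative), the distribution key is declared in the WITH clause when the table is created:
CREATE TABLE Warehouse.tb_FactSales
(
    SaleId     BIGINT        NOT NULL, -- high-cardinality surrogate key
    CustomerId BIGINT        NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId), -- rows are hashed on CustomerId into the 60 distributions
    CLUSTERED COLUMNSTORE INDEX
);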
If a developer is smart, they will make sure the distribution key they select is the same as those of other tables. To explain what this means, I will need to mention how the database performs distribution. The database uses a hash function on the value in the selected column to relegate the record to one of sixty "bucket" databases. This means that a pk/fk value shared by 2 tables hashes to the same bucket, and thus the database will physically store the records of those 2 tables within the same node. Why does this matter? Well, it means that when those 2 tables are joined, the database engine will not need to break context and contact a different node in search of records qualifying the join - it merely carries out a like-for-like join within its own node. Multiply this by 60, and one has 60 databases performing disparate joins independently and in parallel, returning records to the Azure Warehouse controller as they're matched. Pretty neat.
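A quick sketch of the idea, continuing with the illustrative names from above: if a customer table is hashed on the same column as Warehouse.tb_FactSales, a join on that column can be satisfied inside each distribution with no data movement.
CREATE TABLE Warehouse.tb_DimCustomer
(
    CustomerId   BIGINT       NOT NULL,
    CustomerName VARCHAR(200) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId), -- same hash column as Warehouse.tb_FactSales
    CLUSTERED COLUMNSTORE INDEX
);

-- Because both tables are distributed on CustomerId, this join is distribution-compatible:
-- no ShuffleMoveOperation should appear in sys.dm_pdw_request_steps for it.
SELECT c.CustomerName, SUM(f.Amount) AS TotalSales
FROM Warehouse.tb_FactSales f
JOIN Warehouse.tb_DimCustomer c ON c.CustomerId = f.CustomerId
GROUP BY c.CustomerName;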
As for the distribution of data and skewing (a term used to highlight the case of unevenly distributed data), I have a number of queries which detail the problem and I will be posting them in time.
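In the meantime, a quick way to eyeball skew for a single table (the table name below is the illustrative one from above) is the PDW-specific DBCC command, which reports rows and space used per distribution:
DBCC PDW_SHOWSPACEUSED("Warehouse.tb_FactSales");
-- One row per distribution: a large spread in the ROWS column across the 60 distributions indicates skew.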
A more descriptive post (including visuals) can be found here
Friday, 16 June 2017
Data Analytics, Power BI and Crime Statistics
Two weeks ago I was moved into an analytics team to help the data analysts render 150 million records of raw, unstructured data. I was given an overview of the data, the problems preventing it from being rendered, and how it is collected. The data analyst was trying to render all of this data using a single view. Over 5 days I built a propagation mechanism, or 'data pipeline', that processed the data in a SQL Warehouse and aggregated it so it could be consumed without having to perform calculations at consumption time - problem solved. The waiting and crashing were gone, and the analytics team could now visually represent the data they had been sitting on for a number of months. In that short 2 week period, I also briefly learned about MS Power BI - an easy to use graphical desktop application that can turn most data into something visual and rather meaningful.
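The pipeline itself isn't shown in this post, but the pre-aggregation idea can be sketched roughly as follows (all object names are hypothetical): a CTAS statement materialises the summary once in the warehouse, so the reporting layer never has to aggregate the raw rows at consumption time.
-- Hypothetical pre-aggregation step: materialise a small summary table from the raw data
CREATE TABLE Reporting.tb_EventSummary
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT EventCategory,
       YEAR(EventDate) AS EventYear,
       COUNT(*)        AS EventCount
FROM Staging.tb_RawEvents
GROUP BY EventCategory, YEAR(EventDate);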
After working in the back end for a very, very long time via SQLCMD and SQLPlus with the aim of designing and optimizing database systems, I consider this application a breath of fresh air. I am now able to see the data and value it for the first time, as opposed to valuing the system that prepares data on behalf of a client or customer.
This new application had me curious about its other uses. So, I went to the SAPS (Police Force) website and downloaded crime statistics for the country. I then flattened the data in Google Docs and brought it into Power BI for the purpose of 'seeing' what was happening in South Africa in relation to crime.
The first screen shot is that of a Treemap diagram which aggregates all crimes over a 10 year period (2005 to 2014) into categories, similar to a GDP treemap diagram. I've made attempted murder and murder a bright red colour.
Oh yeah, and aaaall my camera equipment was stolen out of my car last week, so that crime fits squarely in the 3rd row, 2nd column: I am one of the 1.3 million "theft out of or from motor vehicle" statistics. Moving on.
[Screenshot: murders and attempted murders over the same period]
The screen shot above displays murders and attempted murders in the same time period.
We can see that the rate drops in 2011 and 2012 and slowly begins to climb again.
It should be noted that the Soccer World Cup took place in South Africa in 2010 and our economy was quite resilient during and after the 2008 recession.