Data Warehousing and Data Science

18 December 2015

SQLBits XV Liverpool 4-7th May 2016

Filed under: Data Warehousing,Event,SQL Server — Vincent Rainardi @ 10:17 pm

I’ve posted 8 sessions for SQLBits XV in Liverpool. Hopefully one of them got chosen so I can to present it.

  1. Populating a Fact Table using SSIS
  2. Testing a SQL Server Data Warehouse
  3. Indexing Fact Tables in SQL Server
  4. SQL Server 2016 Data Warehousing & BI Features (deleted as it’s duplicate with another speaker)
  5. Using a SQL Server Data Warehouse for CRM
  6. Using a SQL Server Data Warehouse for BPM
  7. Loading a Dimension Table in SSIS
  8. T-SQL: Back to Basic

1. Populating a Fact Table using SSIS

In this session I will be doing a demo on my article “Populating a Fact Table using SSIS”: by going through it step by step on the screen. Giving the audience plenty of time to understand the mechanics and to ask questions.

When populating a fact table, people come across issues like these:

  • Where do I populate my fact table from?
  • How do I get the dimension keys to put into my fact table?
  • Where can I get the data for the measures columns?
  • With what do I populate the snapshot date column?
  • What is the primary key of my fact table?
  • The source table doesn’t have a primary key of the source table. What should I do?
  • I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by using an example. So in this session I’m going to do the following, and hopefully by the end of the session the above issues / questions in the audience’s mind will be answered.

  • Describe the background on the company and the data warehouse
  • Create the source tables and populate them
  • Create the dimension tables and populate them
  • Create the fact table (empty)
  • Build an SSIS package to populate the fact table, step by step

2. Testing a SQL Server Data Warehouse

The usual issue when we test a data warehouse is the sheer volume of data, which makes the normal testing method doesn’t work. Over the years, various ways have been developed to provide solutions to this problem, from a manual process to automated process.

In this session I will be demo-ing 3 approaches for testing SQL Server-based data warehouses:

  1. Using Manual Comparison
  2. Using Excel
  3. Using SSIS – SQL Server – SSAS

Along the way I will show the challenges, the mechanics, and the solutions.

Using Excel we will need to sort the data first. And then we use different formula for compare string columns, date column, logical column and numeric columns. We also need to incorporate the tolerance levels. Finally we can present the data quality for each column.

To verify the data in the data warehouse using the 3rd approach, first we need to match the rows to the source system. Then we verify the attribute columns. Then we verify the measure columns. SSIS facilitates the flow of data from both the source system and the warehouse into the test area, which is in SQL Server. A series of stored procedures then do the matching process in SQL Server, and compare the attribute columns and measure columns. The SSAS enables the testers to dissect the compared data, to find out which specific data areas are causing the issue.

3. Indexing Fact Tables in SQL Server

Index Fact Tables in SQL Server are different to Oracle or DB2 because of clustered index. Some people say that it is better to create a clustered index on the fact key, then non-clustered on each dimension key. Some say that it is better to use the clustered index on the snapshot date column. Of course the considerations are different between periodic snapshot fact tables, accumulative snapshot fact tables and transaction fact tables.

In this session I will go through the principles in indexing the 3 types of fact tables, including physical ordering, slimness of the clustered index, multi-column clustered index, a PK doesn’t have to be a clustered index, which dimension key column to index, when to include a column in an index, and of course partitioning, i.e. indexing partitioned fact tables. As always, it is better to explain by example than theory so I will give an example on each principles so that we can understand how it is applied in practice. For example: the performance comparison. I will also add my own “lessons learned”, i.e. mistakes I’ve made in the past so you can avoid making the same mistakes.

The title is “in SQL Server” because principles I will be explaining in this session are only applicable specifically to SQL Server data warehouses. It does not apply to Oracle, DB2 or Teradata data warehouses.

4. SQL Server 2016 Data Warehousing & BI Features

In this session I would like to go through the new features in SQL Server 2016 which are applicable to data warehousing and BI such as DB Engine: Stretch database, JSON, column-store indexes, memory-optimized tables, live query statistics, query store, temporal tables; SSAS: DBCC for SSAS, SSAS Extended event GUI, new DAX functions, tabular model partitions, roles in tabular; SSIS: custom logging in SSIS, Excel 2013 data sources, OData V4 data source, R in Execute SQL Task, HDFS connections, Auto Adjust Buffer Size, reusable control flow template; SSRS: Pin Report Items, project template for SSRS, PPTX format, PDF based printing, custom parameter pane, HMTL 5, Report Builder UI, Tree Map, Sunburst Charts, High DPI, Subscription; MDS: data compression at entity level, super user function, CRUD permissions, transaction log.

I may not be able to fit all of the above points into the session, but will try to cover as many as possible. It is essentially a session which goes through the new features in SQL 2016. Wherever possible I will try to demo the feature rather than talking about it on PowerPoint. Because by seeing we learn a lot more than by discussing them in theory. Also for each feature I will mention how they can be used in data warehousing or BI development. If a feature already exist in 2014, but it is enhanced in 2016 I will show only the improvement, and not going through the fundamentals of that feature.

5. Using a SQL Server Data Warehouse for CRM

This session is taken from my article “Using a Data Warehouse for CRM”, which I wrote with Amar Gaddam:

A data warehouse is not only for BI. It can also be used from Customer Relationship Management (CRM). In fact, a DW is the best platform for doing CRM. In this session I will show how to use a data warehouse built in SQL Server for doing core CRM activities such as: 1. Single Customer View, 2. Permission management, 3. Campaign segmentation, 4. Campaign Results, 5. Customer Analysis, 6. Personalisation, 7. Customer Insight, 8. Customer Loyalty Scheme.

If you don’t work in CRM and not familiar with the above concepts, don’t worry. I will explain them one by one during this sessions. Although it would be help a lot if you read my article above, before coming to this session.

For each point I will try to show the table design in SQL Server DW so we can all see how they are implemented, not just the theory. Due to time limit I may not be able to cover all the above points, but I promise I will try to cover as much as possible.

6. Using a SQL Server Data Warehouse for BPM

BPM = Business Performance Management. A Data Warehouse is not only used for BI. It is also used for CRM and BPM. In this session I will show how to use a Data Warehouse for BPM using Balanced Scorecard. Before the session I will have built a Data Warehouse on SQL Server, for BPM, and during the session I will show the design of this DW.

The Data Warehouse will contain Sales data, Financial data, Customer Service data, and Order Processing data. Each of this data will form part of the Balanced Scorecard. In addition to Fact and Dimension tables, a Data Warehouse used for BPM contains one additional area which stores the KPI scores. I will show how this area is designed, and how the KPI values and score are calculated.

As a take away, I hope the audience can learn how a DW is used outside BI, how the additional area are designed and built, and how it is all implemented on a SQL Server platform. I will also show how the SSIS packages that populate the Data Warehouse from the source system, and the SSRS reports which shows the KPIs, the Balanced Scorecard and Performance Scoring calculation.

7. Loading a Dimension Table in SSIS

We can load a dimension table in SSIS using a) SCD Transformation, b) Merge Join + Conditional Split, c) Change Data Capture, d) Merge command, and e) Upsert using Execute SQL. In this session I will be showing/demo-ing these 5 approaches on the screen one by one, then compare them in terms of efficiency/performance, code clarity/maintenance, and ease of build. It is based on my article:

SCD Transformation and Merge Join + Conditional Split are both using row-by-row operation hence not efficient compared to Upsert. CDC is a mechanism to extract the data changes. To load a dimension table we need to read the CDC output table, and update or insert into the dimension table based on the _$Operation column. The Merge command is buggy, has concurrency issues, requires an index to support performance, and does the Insert twice.

In every data warehouse project, we need to load many dimension tables. So this is a fundamental knowledge to know, for those of us who uses SQL Server and SSIS for your warehouse.

8. T-SQL: Back to Basic

In SQL Server we usually look forward to working with the latest features, such as SQL 2016, Power BI, ML, Azure, etc. So much so that we often forget the basics such as creating/modifying a constraint, PK, FK, index, trigger, partitioned table, synonym; joins, correlated subquery, update from, output an SP into a table, cast, NULLIF, variables, temp tables, table variables, cross apply, while, case; string/date functions, row number, transaction, except, rank, find duplicate rows, etc.

This session is intended to be a refresher, i.e. we know all the above but we forget. We will go back to basic. I will cover 3 sections: a) creating database objects, b) database development, c) database administration. This session is based on my article: So this session will consists of many short SQL scripts (T-SQL) and only T-SQL, i.e. I won’t be using the GUI. I won’t be able to cover every single script in that article (there are over 250 in total!), but I will pick the important ones, and avoid the ones which are similar to the others.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at

%d bloggers like this: