Data Warehousing and Data Science

11 January 2011

4 Sessions for SQLBits 8, Brighton, April 2011

Filed under: Event — Vincent Rainardi @ 2:54 pm

I have submitted 4 sessions for SQLBits 8, 7-9th April 2011, Brighton:

1. Testing Your Data Warehouse Using SSAS

I wrote a book, Building a Data Warehouse. One chapter of the book is about testing a data warehouse. Testing a data warehouse is an art. On the one hand we need to make sure that the figures in the warehouse tally with the source systems. On the other hand there are billions of numbers to test. And to add complexity, sometimes the numbers from the source are modified on their way to the warehouse. One technique to test the data warehouse is to use SSAS. We build a cube and put the numbers from both the source and the DW into this cube. Not only does this technique save a lot of time, but it also lets us quickly pinpoint where the problems are.
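
The session is about doing this reconciliation with an SSAS cube; purely as a rough illustration of the underlying idea in plain T-SQL (not the cube technique itself), here is a sketch of a source-versus-warehouse totals comparison. The database, table and column names are invented for the example:

-- Hypothetical reconciliation query: compare a measure total per month
-- between the source system and the warehouse fact table.
select coalesce(s.MonthKey, w.MonthKey) as MonthKey,
       s.SourceAmount, w.WarehouseAmount,
       s.SourceAmount - w.WarehouseAmount as Difference
from
( select convert(char(6), TransactionDate, 112) as MonthKey, -- YYYYMM
         sum(Amount) as SourceAmount
  from SourceDB.dbo.SalesTransaction
  group by convert(char(6), TransactionDate, 112)
) s
full outer join
( select convert(char(6), d.FullDate, 112) as MonthKey,
         sum(f.Amount) as WarehouseAmount
  from DW.dbo.FactSales f
  join DW.dbo.DimDate d on d.DateKey = f.DateKey
  group by convert(char(6), d.FullDate, 112)
) w on w.MonthKey = s.MonthKey
order by coalesce(s.MonthKey, w.MonthKey)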

2. Advanced Dimensional Modelling

This session is for anybody involved in the design of a data warehouse. Many of us know what dimensions and fact tables are. But as I explained in my blog, dimensional modelling is more than just dimensions and fact tables. This session is about advanced dimensional modelling topics such as Fact Table Primary Key, Vertical Fact Tables, Aggregate Fact Tables, SCD Type 6, Snapshotting Transaction Fact Tables, 1 or 2 Dimensions, Dealing with Currency Rates, When to Snowflake, Dimensions with Multi Valued Attributes, Transaction-Level Dimensions, Very Large Dimensions, A Dimension With Only 1 Attribute, Rapidly Changing Dimensions, Banding Dimension Rows, Stamping Dimension Rows and Real Time Fact Table. Prerequisites: You need to have a basic knowledge of dimensional modelling and relational database design.
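
As a taster of one of the topics above, here is a rough sketch of what an SCD type 6 dimension table can look like (a hybrid of types 1, 2 and 3); the table and column names are invented for illustration:

-- Hypothetical SCD type 6 customer dimension: type 2 row versioning
-- plus a type 1 "current value" column that is overwritten on every change.
create table dbo.DimCustomer
( CustomerKey   int identity(1,1) not null primary key, -- surrogate key
  CustomerCode  varchar(20) not null,                   -- business (natural) key
  City          varchar(50) not null,                   -- value as at this row's validity period (type 2)
  CurrentCity   varchar(50) not null,                   -- latest value, overwritten in all rows (type 1)
  EffectiveDate date        not null,
  ExpiryDate    date        not null,
  IsCurrent     bit         not null
)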

3. Partitioning Your Fact Tables

In data warehousing, partitioning is one of the top 3 things that can speed up performance (the other two being summary tables and indexing, as I explained in my book). Partitioning your fact table can improve performance by up to 10 times, for both querying and loading. Yet 9 times out of 10 the fact tables are not partitioned. One of the reasons is that initially there were only 5-10 million rows. After a year, the fact table contains 1 billion rows and loading time degrades from 2 hours to 5 hours. Queries slow down too, from 1 second to 20 seconds. In this session I'm going to explain only partitioning, but I'll leave some time for Q&A if you want to ask about the other two. Prerequisites: requires a little bit of data warehousing know-how.
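
For a flavour of what this looks like, here is a minimal sketch of a date-partitioned fact table; the boundary values, filegroup and object names are invented for illustration:

-- Hypothetical monthly partitioning of a fact table on an integer date key.
create partition function pfFactDate (int)
as range right for values (20110101, 20110201, 20110301);

create partition scheme psFactDate
as partition pfFactDate all to ([PRIMARY]);

create table dbo.FactSales
( DateKey    int   not null,
  ProductKey int   not null,
  Amount     money not null
) on psFactDate (DateKey);

-- New monthly partitions can then be added with ALTER PARTITION FUNCTION ... SPLIT RANGE,
-- and old ones switched out with ALTER TABLE ... SWITCH.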

4. SQL Server Data Warehousing

Eight years ago people raised their eyebrows when you said you were building a data warehouse on SQL Server. In today's terms, that is probably similar to saying you are building a data warehouse on MS Access. In those days DB2 and Oracle ruled the market.

Thanks to Microsoft, today SQL Server is a respectable platform for data warehousing, and for business intelligence. The situation has completely reversed: if we say we are building a DW-BI system on DB2 or Oracle, people ask, "Why not SQL Server?" This session in a way celebrates this golden era of SQL Server DW-BI, just like my book. As time is limited I'm going to talk about the DW side only, not the BI side (RS, IS, AS): merge, change data capture, star join query, bitmap filter, change tracking, minimally logged insert, resource governor, backup compression, data compression and indexed views.
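
As a small taster of one of those features, here is a hedged sketch of MERGE (new in SQL Server 2008) used to upsert a type 1 dimension from a staging table; the table and column names are made up for the example:

-- Illustrative MERGE upsert of a type 1 dimension from staging.
merge dbo.DimCustomer as target
using staging.Customer as source
  on target.CustomerCode = source.CustomerCode
when matched then
  update set target.CustomerName = source.CustomerName,
             target.City         = source.City
when not matched by target then
  insert (CustomerCode, CustomerName, City)
  values (source.CustomerCode, source.CustomerName, source.City);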

For further info about SQLBits please see the SQLBits web site, www.sqlbits.com.

23 June 2010

SQLBits7

Filed under: Event — Vincent Rainardi @ 8:24 pm

SQLBits 7, the largest SQL Server conference in Europe, will take place in York, 30th September to 2nd October. Unlike SQLBits 6, which was only one day, SQLBits 7 will follow the same format as SQLBits 5 and run for three days:

Thursday: in-depth, full-day seminars (paid)
Friday: deep-dive, advanced conference sessions (paid)
Saturday: community day (free)

Submit a session for SQLBits

SQL Server 2008 Data Warehousing Features

Covers data warehousing features in SQL Server 2008 such as merge, change data capture, star join query/bitmap filter, change tracking, minimally logged insert, resource governor, backup compression, data compression and indexed views.

For each feature I will explain what it is and why/what we can use it for in data warehousing, with a demo so you can see it for yourself.
This is an introductory session. I won't go into deep technical detail about the mechanics of how they work. My goal is to introduce them to you, and if you are interested, you can study them further. Many SQL Server data warehousing "shops" have not used or taken advantage of these features, which is a shame because they are very useful.
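
To give a flavour of one of the features, here is a minimal sketch of enabling change data capture on a source table; the database, schema and table names are assumptions for illustration:

-- Illustrative: enable change data capture (SQL Server 2008) on a source table.
use SourceDB;
exec sys.sp_cdc_enable_db;
exec sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'SalesTransaction',
  @role_name     = null;
-- Changes can then be read from the generated function for the capture instance, e.g.
-- cdc.fn_cdc_get_all_changes_dbo_SalesTransaction(@from_lsn, @to_lsn, N'all').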

18 March 2010

SQL Server 2008 Data Warehousing Features Presentation

Filed under: Event — Vincent Rainardi @ 12:38 am

My presentation tonight, 17th March 2010, at the SQL Server User Group meeting at Microsoft, Victoria, is here, including all the scripts. It is titled SQL Server 2008 Data Warehousing Features.

Neil Hambly presented Indexed Views and Computed Columns. Duncan Sutcliffe from Hitachi (formerly Edenbrook) presented BI in Office 2010 (Excel, PowerPivot and SharePoint). Both were excellent. Chris Testa-O'Neill and Ashwani Roy were the hosts, helping to answer questions. Jamie Thomson was also there, helping out with the Q&A session. The meeting was organised by Tony Rogerson.

21 December 2009

SSAS DMV Nugget at SQL Server User Group

Filed under: Analysis Services,Event — Vincent Rainardi @ 9:58 pm

On 10th June I presented a 10-minute nugget about SSAS DMVs at the SQL Server User Group event at Microsoft TVP. Andrew Fryer presented SQL Server 2008 R2 and Gemini, and Ashwani Roy presented SSAS and MDX Studio. Sybase presented Information Architecture. Details of the event are below.

These are the DMV scripts I used that evening:

select * from $system.dbschema_tables --where table_catalog = 'AW'
select * from $system.dbschema_columns --where table_catalog = 'AW'
select * from $system.dbschema_catalogs
select * from $system.mdschema_cubes
select * from $system.mdschema_hierarchies
select * from AW.[$Dim Product].[$Dim Product] --database dimension
select * from AW.[Sales].[$Dim Product] --cube dimension
select * from AW.[Sales].[Fact Sales] --cube
select * from AW.[$Dim Product].[Dim ProductMG] --dimension cube (1 dim no measure)
select * from $system.mdschema_members where cube_name = '$Dim Product' and [dimension_unique_name] = '[Dim Product]' and [hierarchy_unique_name] = '[Dim Product].[Color]'
select * from AW.[$Dim Product].[$Dim Product] where Color = 'Black' or Color = 'Silver' --no LIKE,<>,in,JOIN
select distinct Color from AW.[Sales].[$Dim Product]
select * from $system.discover_commands
select * from $system.discover_connections
select * from $system.discover_memoryusage
select * from $system.discover_object_memory_usage
select * from $system.discover_object_activity where object_reads > 0
select * from $system.discover_partition_stat
select * from systemrestrictschema($system.discover_partition_stat, CUBE_NAME = 'Sales', DATABASE_NAME = 'AW', MEASURE_GROUP_NAME = 'Fact Sales', PARTITION_NAME = 'Fact Sales')
select * from $system.discover_partition_stat
select * from systemrestrictschema($system.discover_dimension_stat, DIMENSION_NAME = 'Dim Product', DATABASE_NAME = 'AW')
select * from systemrestrictschema($system.discover_performance_counters, PERF_COUNTER_NAME = '\MSAS 2008:Connection\Current connections', PERF_COUNTER_NAME = '\MSAS 2008:MDX\Total Cells Calculated')

An example of a Discover schema rowset XMLA request, for comparison with the DMV approach (the DMV is so much easier, isn't it?):
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
<RequestType>MDSCHEMA_CUBES</RequestType>
<Restrictions>
<RestrictionList>
<CATALOG_NAME>AW</CATALOG_NAME>
<CUBE_NAME>Sales</CUBE_NAME>
</RestrictionList>
</Restrictions>
<Properties>
<PropertyList>
<Catalog>AW</Catalog>
</PropertyList>
</Properties>
</Discover>
</Body>
</Envelope>
<!--Discover
( in RequestType as EnumString,
in Restrictions as Array,
in Properties as Array,
out Resultset as Rowset
)-->
And this is the script to create the linked server and put the DMV output into a table using SELECT INTO, which enables other T-SQL predicates such as LIKE.
The idea of putting the DMV output into a table is that we can combine the SSAS DMVs with the SSAS OlapQueryLog. We can then create a data warehouse (fact and dimension tables) from these tables, and then build a cube on top of that data warehouse.
use SSASQueryLog
select * from QueryLog

exec master.dbo.sp_addlinkedserver @server='SSAS', @srvproduct=N'',
@provider='MSOLAP', @datasrc='VIVA2', @catalog='AW'
select * from OpenQuery(SSAS, 'select [Measures].[Order Quantity] on columns from [Sales]')
select * from OpenQuery(SSAS, 'select * from $system.discover_connections')
select * from OpenQuery(SSAS, 'select * from systemrestrictschema($system.discover_performance_counters, PERF_COUNTER_NAME = ''\MSAS 2008:Connection\Current connections'',
PERF_COUNTER_NAME = ''\MSAS 2008:MDX\Total Cells Calculated'')')
--create/insert into table:
select * into Object_Activity from OpenQuery
(SSAS, 'select * from $system.discover_object_activity')
select * from Object_Activity where Object_Parent_Path like '%.AW.%' and Object_Reads > 0 --like

select * into Object_Memory_Usage from OpenQuery
(SSAS, 'select * from $system.discover_object_memory_usage')
select * from Object_Memory_Usage --where Object_Parent_Path like '%.AW.%'

select * into Memory_Usage from OpenQuery
(SSAS, 'select * from $system.discover_memoryusage')
select * from Memory_Usage --where ObjectParentPath like '%.AW.%'

Details of the event

Evening Meeting on Wed Jun 10, 2009 in READING
SQL 2008 R2 and Gemini; From Data Modelling to Information Architecture and Attribute Relationships, Aggregations and using MDX Studio to its best
Organiser: UK SQL Server User Group
Address: Building 3, Microsoft Campus, Thames Valley Park, Reading, RG6 1WG
Agenda:
5.30pm – 6:00 Registration and networking + Tea/Coffee with biscuits.
Meet and greet.
6:00pm – 6:15pm Round Table discussion and Nuggets – ALL
Take stock and get the latest news in the SQL Server field. This is also a great opportunity to ask any burning questions you have, maybe a problem at work.
It’s also a great opportunity to share your knowledge and gain exposure in the industry by giving a short 1 – 6 minute “nugget” demo/tip.
6.15pm – 7:00pm Andrew Fryer, Microsoft
SQL Server 2008 R2 and Gemini
SQL Server 2008 R2 will be the release with Gemini in it. Gemini is about trying to marry self-service BI for end users with proper control of that data for the IT professional. The beta program has only recently been announced, so it's still early days, but if you want to know what to expect then this is your first real opportunity in the UK. So join Andrew Fryer for an hour of peering into what 2010 BI looks like.
7:00pm – 7:45pm Ruairi Prendiville, Sybase

From Data Modeling to Information Architecture

It used to be about doing the thing right: a tactical approach for data modelling was necessary, because all the data was contained within the same domain. Now it’s about doing the right thing. Management demands information that is strategic and which helps them make business decisions. This information is more complex, housed in multiple domains, created in different languages, and used by different areas of the organisation for opposing purposes. Adding to the complexity is the critical need for it to be timely, accurate and comprehensive.
In this 45-minute session, Andrew de Rozairo and Alex Pratt of Sybase will impart the secrets of successfully building a foundation for a comprehensive and successful information architecture. You’ll learn how to take a new approach to modelling: combining the business view, information view and the technology view to achieve a strategic vision.
7:45pm – 8pm BREAK: Light refreshments
More time to network and ask questions…
8pm – 8:15pm Vincent Rainardi
Nugget (demo) on SSAS
8:15pm – 9.00pm Ashwani Roy
Attribute Relationships, Aggregations and using MDX Studio to its best
Setting proper relationships for the attributes of a dimension is essential from a query performance point of view. It is one of the most important things in dimensional modelling.
Aggregations are pre-calculated summaries of cube data that help enable Microsoft SQL Server 2005 Analysis Services (SSAS) to provide rapid query responses. Aggregations are like fact table data grouped and aggregated to a higher level in the hierarchy. Building the correct aggregations can greatly enhance query performance.
MDX Studio can be used to analyse complex MDX expressions, monitor the performance characteristics of MDX queries and gain insight into how MDX interacts with other UDM features such as attribute relationships. MDX Studio provides a unique visual representation of MDX expressions and allows the user to interact with different stages of MDX execution. It is built by Mosha Pasumansky, the inventor of MDX and one of the architects of Microsoft Analysis Services.

25 November 2009

Building Cubes From Operational Systems (SQLBits Presentation)

Filed under: Analysis Services,Event — Vincent Rainardi @ 9:40 pm

My presentation at SQLBits 5 last Saturday is here: Building Cubes From Operational Systems.

SQL Scripts for the DDL and DSV are here: SQLScripts.zip, and the BIDS project files are here: Broker5.zip.

If you have any problems downloading, please let me know.

22 November 2009

SQLBits 5

Filed under: Uncategorized — Vincent Rainardi @ 9:27 am

On Friday and Saturday, 20th-21st November 2009, I attended SQLBits 5 at Celtic Manor, Newport, Wales. I felt lucky to get the opportunity to meet and talk to famous SQL Server experts like Chris Webb, Henk van der Valk, Satya Jayanti, Bob Duffy, Rob Farley, Alistair Aitchison, Anthony Howcroft, Simon Sabin, Darren Green, Allen Mitchel, Andre Kamman, Ian Marritt, Matt Feltham, Martin Bell, James Rowland-Jones, Andrew Couch, Andrew Fryer, Ramesh Meyyappan, John Tunnicliffe and Ashwani Roy. I attended the sessions of Donald Farmer (R2 SSRS, MDS), Thomas Kejser (Madison/PDW), James Rowland-Jones (2008 DW), Bob Duffy (Gemini/PowerPivot), Christian Bolton (Statistics) and Allan Mitchell (SSIS 2008), and I am grateful to have had the opportunity to learn directly from these experts.

I tried to understand what motivates them: why they would want to spend a lot of time and effort organising a SQL Server conference like SQLBits. During the breaks I asked several people at the conference how they learn a new product or version (by reading books and web sites, and by using the product), how they solve problems (mainly by googling), and how they keep themselves up to date with technology developments (by going to events and subscribing to newsletters and mailing lists). I talked to several vendors and attended their presentations to understand their products: Wherescape (Red), Board, Attunity (Oracle-SSIS CDC), Quest (Spotlight, SSAS) and Solid Quality Mentors. I am particularly impressed with Wherescape Red's ability to design data warehouses, complete with ETL, cubes and documentation, in a very short time, but disappointed with Spotlight's ability to monitor SSAS. I missed SQLSentry, but I have evaluated their product (with regard to SSAS monitoring) in the past few weeks.

I presented a session about building cubes from operational systems. I explained why we would want to do it (the benefits), how to do it (a step-by-step demo in BIDS to build a stock broker cube from a stock broker database), and what risks and pitfalls we should be aware of (performance implications, date dimension, large dimensions, etc.). There were about 20-25 people attending my session, and about 4-5 questions. I had a problem with the projector (again!) even though I had tested it just a couple of hours before. Overall I'm very impressed with SQLBits 5; I learned a lot (both technical/SQL Server and non-technical) from the speakers, the sponsors and the audience. I look forward to the next one.

15 November 2009

Building Cubes from Operational Systems

Filed under: Uncategorized — Vincent Rainardi @ 8:07 am

Three years ago I thought that cubes* must always be built from a data warehouse or data mart*. So that is what I did: I built a data mart first, with the ETL, before creating the cubes. About a year ago a data warehousing colleague of mine asked me whether it is possible to build a cube straight from the transaction system, without building a data warehouse first. I explained to him that in theory it was possible, but there were challenges. Since then I have built several cubes straight from operational systems*. In doing so I have encountered numerous issues, and found ways to overcome them.

*”What is a cube?”, “What is the difference between a data warehouse and a data mart?”, and “What is the difference between an operational system and a transaction system?” are explained at the end of this post.

When we build a cube from an operational system, we face challenges like the following (a sketch of a typical workaround for two of them appears after the list):

  • there is no surrogate key
  • there is no date table
  • attributes are buried in large transaction tables
  • normalised structure – attributes are scattered on multiple tables
  • there is no ‘time of day’ dimension table
  • attributes and measures are located in the same table
  • data is raw – no calculations / transformations
  • performance implications
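
To make two of these concrete ("there is no date table" and "normalised structure"), the usual workaround is a named query in the DSV rather than a physical table. A rough sketch, with invented object names:

-- Hypothetical DSV named query: flatten a normalised product structure
-- into a single dimension-shaped result set.
select p.ProductId, p.ProductCode, p.ProductName,
       c.CategoryName, b.BrandName
from dbo.Product p
join dbo.ProductCategory c on c.CategoryId = p.CategoryId
join dbo.Brand b on b.BrandId = p.BrandId

-- Hypothetical DSV named query: derive a date "dimension" on the fly
-- from the range of transaction dates in the source.
select distinct
       convert(int, convert(char(8), t.TransactionDate, 112)) as DateKey,
       convert(date, t.TransactionDate) as FullDate,
       datename(month, t.TransactionDate) as MonthName,
       year(t.TransactionDate) as CalendarYear
from dbo.SalesTransaction t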

So this Saturday at SQLBits* I will be presenting on exactly that: building cubes from operational systems. I thought the best way to explain it is with a demo, so I plan to spend about 10 minutes laying out the background and describing the benefits, and then we will be off to BIDS* and SSMS* building a cube from a stock broker database.

*SQLBits is a SQL Server conference in the UK, held since 2007. Currently it is held twice a year, i.e. every 6 months. The first one was on 6/10/07 at the Microsoft campus in Reading. The next one, SQLBits 5, is this week, 19-21/11/09, in Newport. The first one I attended was SQLBits 3, 13/9/08 in Hatfield. At SQLBits 4 in Manchester, 28/3/09, I presented a topic about data warehousing. SQLBits usually has 4 main tracks: SQL DBA, SQL Development, BI and new release. SQLBits is usually free to attend, or at least there is a free day. At SQLBits 5, Thursday and Friday are 'paid' days, whereas Saturday is free. I found that it is worth taking a day off work and paying the conference fee to be taught by experts like Chris Webb, Allan Michell, Donald Farmer and Thomas Kejser. The complete agenda is here.

*BIDS = Business Intelligence Development Studio, the development tool that we use to build SSAS cubes, SSIS packages and SSRS reports. It is Visual Studio, but customised specifically for BI development, as opposed to building .NET applications.

*SSMS = SQL Server Management Studio. It is the tool that we use to administer SQL Server 2005 and 2008, as well as for design and development, e.g. creating stored procedures, table structures and database diagrams.

*What is a cube? In daily conversation, a cube is an SSAS database. Amongst SSAS developers, we use the word 'cube' to differentiate our stuff from a 'database'. In the SQL Server community, the word 'database' means a relational database, whereas the word 'cube' means an OLAP database. I know that this is technically incorrect, and in my younger years I would have protested strongly (my background is engineering physics), so for those who prefer a more scientific definition, here it is: "A multidimensional database (MDB) is a structure designed to store measure data (facts) and hierarchical data (dimensions), where the combination of members of each dimension points to zero, one or many measure values. This relationship can be described as a discrete function m[1..n] = f(d1, …, dn); in other words m is a function of d1 to dn, where m is the measure value and d1 to dn are the dimension members. The structure stores the aggregate values as well as the base values, typically in a compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summaries or calculations of the base values."

*What is the difference between a data warehouse and a data mart?
A data mart has only one fact table, whereas a data warehouse has many fact tables. Again, I know this is technically incorrect, so for those who prefer a more formal and precise definition, quoting from my book: "A data warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalized data store. It normally keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system."

*What is the difference between an operational system and a transaction system? An operational system can be a reporting system (mainly read) or a transaction system (mainly write). A transaction system is an IT system which companies use to enter daily transactions such as sales, payments and call logs. A reporting system is an IT system which companies (or other types of organisation) use to extract data from the transaction system, either to understand what happened in the business or to analyse the business.
