Data Platform and Data Science

31 March 2010

SQL Server 2008 Backup Compression

Filed under: SQL Server — Vincent Rainardi @ 6:14 pm

One of the features in SQL Server 2008 that is useful for data warehousing is backup compression. In data warehousing, backup is usually an integral part of the overnight ETL / data loading process. In the overnight warehouse ETL, there are 4 points where backup or restore is normally located:

  1. Backup of the source system, and restore it in the warehouse server.
  2. Backup of stage database, in case we need to reload today’s data incremental extract. Or yesterday’s.
  3. Backup of the ODS, after it is updated successfully.
  4. Backup of the data marts, after they are updated successfully.
  5. Backup of the main warehouse, after loading. In some DW implementations, not only one but several backups, usually differential.
  6. Restore of DW backup, in SSD (scalable shared databases) –  technique to speed up DW queries by having multiple read-only copies of the warehouse.
  7. Restore of DW backup, in DR server, in the event of a disaster, or during routine DR tests.

Ok, that’s more like 7 points rather than 4.

Now, the thing about DW backup is: it’s a big database. The main DB of a DW varies from 50 GB to 5 TB*. The backup time varies from 10 mins to 3 hours*. If your DW is at the high end of this range, backup compression will give you a big benefit. And we don’t only keep 1 copy of the backup on disk. Many DW operations keep 3-5 copies on disk, and many more on tapes. So if backup compression can save 50% of disk space, and 50% of backup time, well, it would be useful. It would be very useful. It could be a reason to upgrade to 2008. It could be a reason to switch to SQL Server.

*well that probably covers about 90% of the population. There are exceptions, as always. Remember we are talking SQL Server data warehousing, and today is 2010. If today were 2005, we would be talking about completely different DW sizes. Probably in the range of 5 to 50 GB. A million-row fact table in 2005 was something to talk about. Today, there’s nothing special about a billion-row fact table. Remember we are talking SQL Server. SQL Server! Hmmm, how fast SQL Server DW grew in the last 5 years. In 2000 the concept of a SQL Server DW wasn’t even there. Back then the words “SQL Server” and “data warehouse” were never in the same sentence. In 2000 people compared SQL Server with Microsoft Access. In 2000 DW was something normally associated with Teradata, DB2, or at least Oracle. But not SQL Server. And now SQL Server PDW is hosting 50 TB DWs. Ah, something to blog about next time: “SQL Server data warehousing: then, now and future”.

True, there are backup compression products out there, such as LiteSpeed. So if your DW is > 500 GB the question is not whether to compress or not to compress. You have to compress, period. Both for the speed/time saving and for the disk/space saving. The question is: which software do you use to compress. Well, the answer is sort of obvious really. Try those products and use the one that gives the highest compression ratio in the shortest possible backup time. With one usual note: budget permitting.

But there is one big advantage of using SQL Server backup compression, compared to any backup compression software: anybody can restore the backup. If you use LiteSpeed for example, to restore the backup you also need to have LiteSpeed. So you end up installing LiteSpeed on many servers. Copying databases from the production to the development SQL Server, for example, you need to have LiteSpeed installed on both servers. Restoring a prod backup on the DR server, for example, you need to have LiteSpeed installed on DR.

But with 2008, backup compression comes as standard. And it’s only a matter of time before all SQL Servers in the company are upgraded to 2008, right? So in the end everybody will throw away their backup compression software, and replace it with SQL Server. IF (and that’s quite a big IF) SQL Server backup compression can squeeze the database to a good size and do it fast.

In 2008 R2, backup compression will be available in Standard edition. Currently (2008 R1) it is an Enterprise edition feature. A compressed backup can be restored by all editions of 2008. Obviously 2005 can’t restore it.

So how do you do a compressed backup? Simple, just add “with compression”, like this:
backup database Test to disk = 'c:\test2.bak' with compression

Or, use GUI: Right click the database, Tasks, Backup:

Then click Options pane, and select Compressed Backup:

And how to restore it?
Simple, same way really. No difference whether the backup we are restoring is compressed or not:
restore database Test from disk = 'c:\Test.bak'

And using GUI:

Now, let’s test how small the database becomes after compression.

For this purpose I prepared a database called Test, which is a dimensional database, containing 2 fact tables and 3 dimension tables. The fact tables contain 1 million rows each with 2 measures, and the dimension tables 1,000 rows each. I’m going to back up this database without compression first, then with compression. I will repeat each test 3 times, and average the results.

-- Backup compression test: first without compression, then with
use master
backup database Test to disk = 'c:\test.bak'

/*
28.579 sec (7.543 MB/sec), 221,492 KB
24.225 sec (8.898 MB/sec), 221,492 KB
26.956 sec (7.997 MB/sec), 221,492 KB
*/

backup database Test to disk = 'c:\test2.bak' with compression
/*
18.211 sec (11.837 MB/sec), 84,259 KB
20.129 sec (10.709 MB/sec), 84,259 KB
19.628 sec (10.982 MB/sec), 84,259 KB
*/

So the above test shows that:
a) The size of the backup is consistent. It doesn’t change each time we back up.
b) The backup time varies. It can differ by up to 18% (24.225 sec compared to 28.579 sec)
c) Using backup compression the disk space is 62% less (221,492 - 84,259 = 137,233 KB saved, i.e. 137,233/221,492)
d) Using backup compression the backup time is 27% less (26.59 - 19.32 = 7.26 sec saved, i.e. 7.26/26.59)

Point c above (saving 62% disk space) is very useful for data warehousing. For some companies it could be the reason to upgrade to 2008. Or even to switch to SQL Server (if your current data warehouse platform does not have backup compression). Because in data warehousing we are talking Terabytes, not Gigabytes, 62% less disk space saves a lot of money.

Point d is also very useful for data warehousing. As I described at the beginning of this article, there are many points where backup is used in the overnight data warehouse load process. So if these points are 27% faster it would decrease the load window, which sometimes is a make or break factor in data warehousing. For example, it could decrease your load window from 6 hours to 5 hours.
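By the way, if you want to check the ratio you are actually getting on your own server, msdb records both the original and the compressed size of every backup (compressed_backup_size is a new column in 2008’s msdb.dbo.backupset). A quick check along these lines (not part of the test above):

-- Compression ratio of recent full backups, straight from msdb
select top 10 database_name, backup_finish_date,
backup_size/1048576.0 as backup_size_mb,
compressed_backup_size/1048576.0 as compressed_size_mb,
backup_size/compressed_backup_size as compression_ratio
from msdb.dbo.backupset
where type = 'D' -- full database backups
order by backup_finish_date desc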

One last thing before I end this article: you can set the SQL Server instance to compress backups by default. Meaning that if you don’t specify anything, the backup will be compressed.

The T-SQL command is:
exec sp_configure 'backup compression default', '1'
reconfigure
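To verify the setting afterwards, you can query sys.configurations (value_in_use = 1 means backups are compressed by default):

select name, value_in_use
from sys.configurations
where name = 'backup compression default'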

And using GUI on SSMS: Right click on the server instance, properties, database settings, compress backup:

28 March 2010

Join brings more rows

Filed under: Data Warehousing — Vincent Rainardi @ 12:59 pm

When populating a fact table from the source system we do a lookup on the dimension tables to get the dimension surrogate keys. We join the dimension tables to the fact stage table on the natural keys:

select D1.SK, D2.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK

(SK = Surrogate Key, NK = Natural Key)

Say the query returns 200,000 rows.
But when we bring dim3 the query returns say 230,000 rows:
select D1.SK, D2.SK, D3.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK
left join dim3 D3 on S.D3NK = D3.NK

This is because 1 row in the fact stage table corresponds to 2 rows (or more) in dim3.
In this case we need to find out which column in dim3 is causing the split.
To find the column causing the split, strip out dim1 and dim2 from the query, group by the dim3 natural key, and count the distinct surrogate keys each natural key matches:
select S.D3NK from fact_stage_table S
left join dim3 D3 on S.D3NK = D3.NK
group by S.D3NK having count(distinct D3.SK) > 1

Say the above query returns 4 rows: NK1 to NK4. This means that for each of these natural keys there is more than 1 row in dim3.
So now we query the dimension table to see which column is causing the issue:
select * from dim3 where NK = 'NK1'

The column(s) causing the issue have different values across the rows. All other columns in the last query should have the same value across the rows, except these 1 or 2 columns that cause the issue.

Common causes are:

  • It’s a type 2 dim, so there are several versions for each NK. One possible solution is to use the latest version, i.e. where Is_Current = ‘Y’ (or use validity dates, e.g. valid from and valid to).
  • It has an outrigger hence there are multiple rows in the dim table for each NK. One possible solution is to take the first version, i.e. where outrigger_id = 1
  • The dim table is populated from multiple source systems. It is possible that NK1 exists in source system 1, but also exists in source system 2.

Then apply those additional conditions to the investigation query:
select S.D3NK from fact_stage_table S
left join dim3 D3 on S.D3NK = D3.NK and D3.Is_Current = ‘Y’ and D3.Outrigger_ID = 1
group by S.D3NK having count(*) > 1

If this returns nothing, then apply that condition to the main query:
select D1.SK, D2.SK, D3.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK
left join dim3 D3 on S.D3NK = D3.NK and D3.Is_Current = ‘Y’ and D3.Outrigger_ID = 1

This should now return 200,000 rows. If the fact stage table has 200,000 rows, after we join it with the dimension tables it should still return 200,000 rows. If it returns less than this (say 170,000), check whether you used an inner join; you should use a left outer join. If it returns more than this (say 230,000), check whether a split occurs (as I explained above). If it does, find out which column causes the split and add the additional where clause for that column.
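A general guard, whichever column turns out to be the culprit, is to force one row per natural key inside the lookup itself, for example by taking the latest version with ROW_NUMBER. A sketch (assuming dim3 has a Valid_From column; use whatever version or validity column your dimension actually has):

select D1.SK, D2.SK, D3.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK
left join
( select NK, SK,
  row_number() over (partition by NK order by Valid_From desc) as rn
  from dim3
) D3 on S.D3NK = D3.NK and D3.rn = 1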

24 March 2010

Disk Space in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 8:46 am

It is surprisingly rare for IT people (in general) to appreciate that data warehousing requires a lot of disk space. Non-IT people, on the other hand, are more appreciative. They understand that if we collect data from many different places and put them in 1 place, then naturally it will require a lot of disk space. But IT people are quite the opposite: they can appreciate that email and file servers are disk hungry, but they cannot understand why data warehousing requires tons of disk space. Many IT people still compare data warehousing with other database applications, i.e. a few GB should be enough.

They cannot understand, for example, why the data warehouse is 500 GB if the source system is only 50 GB. Today was the 4th time I received such astonishment. So tonight I would like to write down the reasons why data warehousing requires a lot more disk than a transactional system. I will try to write for an audience that is not from a data warehousing background, using plain IT language, avoiding any DW terms. So for readers who are from a DW background, please bear with me.

  1. In data warehousing, we make a lot of copies of the source system. This is called “snapshotting”. Say the main transaction table in the source system contains 2 million rows. In the DW, we store all 2 million rows, every single day (or week, or month). So in a year we will be storing 365 x 2 = 730 million rows. In the DW we store the data in a more compact form (using integer keys), so if the source system row is say 500 bytes, in the DW the row width is probably only 250 bytes. And we don’t store every single column from the source system, but only some of them, so the row width in the DW is probably only 100 bytes, one fifth of its original size. So if the transaction table in the source system is 1 GB, the daily ‘snapshot’ in the warehouse is only 200 MB. Multiplied by 365 it’s about 70 GB.
  2. In data warehousing we are very concerned about query performance. We use a lot of indexing to provide good query performance. Indexes take a lot of space. In a transactional system it’s between 20% and 75% of the table/data size. In data warehousing the indexes are between 0.5 and 3 times the size of the tables. So if your DW database is 100 GB without indexes, the indexes would probably take between 50 and 300 GB. 50 GB if you try to be a minimalist, i.e. only index where absolutely necessary, with the primary reasons usually being ETL/load time and disk constraints. 300 GB if you try to be a perfectionist, i.e. every single dimension key column in the fact table*, every single attribute used in the queries, etc.
    *I don’t think I can keep it DW jargon free now
    Note that in SQL Server we don’t have bitmap index like in Oracle – hence more space is required for indexing to the same effect.
  3. In the transaction systems everything is normalised, so tables are relatively small, as there’s no redundant data. In the warehouse, everything is denormalised, so tables are bigger. Sometimes a lot bigger. This is because denormalised data model has redundant data i.e. the same value is repeated many times.
  4. In data warehousing we create materialised views. It’s an Oracle term (well, DB2 too). In SQL Server it’s called an indexed view. The idea of a materialised view is to query a table, then store the result of this query on disk. So when the real query comes, the response time is lightning fast. For example: a report submits this query, which is a typical query in DW, called a “star join”:
    select something, something, sum(something)
    from table1 join table2 on … join table3 on …
    where … and … and … group by … having …

    and the query takes 2 minutes. If we store the result of this query on disk, SQL Server doesn’t have to do this calculation. The original query might read say 100 million rows, but the result may be only 1000 rows. So if the original query takes 2 minutes, reading this materialised view is instant (0 seconds), because the materialised view contains the exact data that the query wants i.e. the 1000 rows.
  5. In data warehousing we like to do partitioning. Partitioning is basically slicing a table into (say) 100 chunks and spreading them across (say) 5 disks. Say the estimated database size is 100 GB. If we don’t do partitioning, we say to SQL Server: create database1 on disk1, size = 100 GB. Now when you partition it into 5 disks, we need to create a filegroup (FG) on each of these disks. When creating these FGs you can’t size each FG = 100 GB/5 = 20 GB. You need to specify more, say 30 GB, because they don’t grow at the same pace. So FG3 may be full whilst FG4 is still half empty. This is because the FGs contain different tables which grow at different rates. This results in more disk space than if you don’t partition the tables, e.g. 30 x 5 = 150 GB instead of 100 GB.
  6. The operation of partitioning takes some space. It’s not a lot but still something. I’ll illustrate with 2 operations: truncating by switching, and sliding window maintenance. In Oracle (I think from version 8) we have the “truncate partition” command. In SQL Server we don’t (not sure why not, probably because of copyright). So we have to prepare an empty partition, then swap it with the partition that we want to truncate (see the sketch after this list). This technique takes seconds, whereas “delete” takes minutes (30-60 minutes are common).
    Sliding window is a technique to maintain a rolling window, say last 24 months. So on the first (working) day of every month we drop the oldest month (month 24) and add the current month. The data for each month is stored in different partitions, so at the beginning of every month we “switch out” the oldest partition and “switch in” the current month partition.
  7. In data warehousing, we do a lot of backups. Apart from the main warehouse database, and the data marts, we also back up the stage databases, standing data database, metadata database, ETL database and ODS database. We also keep the source system backups. And we don’t only store 1 copy of the backup (last night’s only), but 5-10 copies and sometimes 20. For example, we keep a week’s worth of the stage backup on disk (beyond this is usually on tape). This is to enable us to reload the DW if we had 2-3 days of incremental load failure, by redoing the load for those failed days. It is common for the DW backup size to be 5 times the size of the DW databases. If the total of all DW databases is 800 GB, the backup is probably 2-4 TB. Fortunately they are on RAID 0, rather than RAID 10. RAID 10 requires twice as much disk as RAID 0 for the same capacity. More about RAID in point 10 below.
  8. In data warehousing we need a lot of space for development and testing. Dev and test environments are in TB, unlike in the transaction systems which are usually in GB. It is common for the dev environment to be 5 times the production size (the disk, not memory or CPU). Test is also 5 times prod. This is because: a) quite often we have several projects running simultaneously, and each project requires disk space similar to prod, and b) unlike testing a transaction system where a minimal amount of data will do, testing a data warehouse often requires the full set of data, because the analyses/reports need a good amount of data. So if the total size of our DW databases is 500 GB, we can expect the development space to be 2.5 TB and the test/UAT space to be 2.5 TB.
  9. Log expansion at loading. When a warehouse is being loaded at night, the log expands. Yes, it’s true that we use a lot of bulk loading. Yes, it’s true that we set the databases to simple recovery mode. But in reality, at loading time the DW still expands. It’s just the nature of a log-based RDBMS, and SQL Server is no exception. So if one of your marts is say 30 GB when it’s first deployed, expect that during the normal daily load it will expand up to 50 GB, i.e. the data files (MDF & NDF) will stay relatively the same, but the log file (LDF) will expand during the load. Of course, after the DB maintenance plan is executed (backup log etc), the log file size is reduced, but you’ve got to allow some space for expansion during loading. By the way, “backup log with truncate_only” has been removed in 2008.
  10. In data warehousing we like RAID 10 (1+0 that is, not 0+1). This is because we are very concerned about performance. In the transaction systems, tables contain thousands of rows, sometimes hundreds of thousands. In data warehousing, the main tables (they are called “fact tables”) contain millions of rows, sometimes billions. When you have a table with 2 billion rows in it, you would want to a) place it on 10 different disks to spread the IO (see point 5 above about partitioning), and b) put it on RAID 10 rather than 1 or 5 or 0. I’ll skip the technical details as my target reader is non-DW people: to make 1 TB of space, RAID 1 requires 2 TB of disks. RAID 5 requires 1.05 TB of disks (I’m simplifying here). RAID 10 also requires 2 TB of disks. RAID 0, which I mentioned in point 7, requires 1 TB of disks. In data warehousing we need RAID 10 because it gives the best query performance and the best load performance.
  11. Not only do we need RAID 10, but we need many sets of small RAID 10. If you have 100 disks of 146 GB each (formatted capacity 137), you don’t put all of them in 1 big RAID 10 array. No, that’s not good for performance because we can’t create many filegroups on different spindles for partitioning. Instead, we take 4 disks and make 1 RAID 10 of about 274 GB usable. Take another 4 disks and make another RAID 10. So we will end up with 25 RAID 10s. This is good because now we can spread the IO when that 2-billion-row partitioned fact table is being queried.
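To make point 6 concrete, here is a minimal sketch of “truncating by switching”, with hypothetical table names (the switch table must be empty and have the same columns, indexes and filegroup as the partition being removed from the fact table):

-- Move the oldest partition out of the fact table. This is a metadata-only
-- operation, so it takes seconds regardless of how many rows the partition holds.
alter table dbo.FactSales switch partition 1 to dbo.FactSales_Switch

-- The switched-out rows can now be truncated or archived without touching the fact table
truncate table dbo.FactSales_Switch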

I hope the above points have made you aware of the disk space requirements in data warehousing. And this is general really, not only on the SQL Server platform, but all other platforms. Whether you build your DW on Sybase, Informix or DB2, you generally still hit the same problem.

One more thing before I end this writing: exponential growth. It is the nature of data warehouse development to grow exponentially. If you predict that the space you just bought should be enough for the next 3 years, expect that it will be full within 1.5 years.

Vincent Rainardi, 23rd March 2010.

18 March 2010

SQL Server 2008 Data Warehousing Features Presentation

Filed under: Event — Vincent Rainardi @ 12:38 am

My presentation tonight, 17th March 2010, at the SQL Server User Group meeting at Microsoft, Victoria, is here, including all the scripts. It is titled SQL Server 2008 Data Warehousing Features.

Neil Hambly presented Indexed Views and Computed Columns. Duncan Sutcliffe from Hitachi (was Edenbrook) presented BI in Office 2010 (Excel, PowerPivot and SharePoint). Both were excellent. Chris Testa-O’Neill and Ashwani Roy were the hosts, helping answer questions. Jamie Thomson was also there, helping out with the Q&A session. The meeting was organised by Tony Rogerson.

15 March 2010

Bitmap Filter (Star Join Query Optimisation)

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 9:54 pm

Bitmap Filter is an in-memory structure that uses a compact representation of a set of values from one table in one part of a query to filter the rows on the second table located in another part of the query.

In SQL Server data warehousing, the Bitmap Filter is used to reduce the number of fact rows involved in a star join query by eliminating fact rows that do not satisfy the dimension where clause, before the fact table enters the join. This improves the query performance.

A star join query is a SQL query where a large fact table is joined to dimension tables, filtered on dimension attributes and grouped by a set of dimension attributes to get an aggregate of the measures.

Bitmap filter works on:

  • select statements
  • read-only part of update, delete, insert and merge
  • when creating Indexed views (materialised views)
  • reduce query compilation time

Example/demo:

select d1.attr2, d3.attr2, d1.attr1, sum(f.measure1)
from FactNoPK f
inner join dim1 d1 on d1.dimkey1 = f.dimkey1
inner join dim2 d2 on d2.dimkey2 = f.dimkey2
inner join dim3 d3 on d3.dimkey3 = f.dimkey3
where d2.attr2 between 'Value100' and 'Value120'
group by d1.attr2, d3.attr2, d1.attr1

With Bitmap Filter:
PROBE([Opt_Bitmap1005], [Test].[dbo].[FactNoPK] as [f].dimkey1, N'[IN ROW]') AND
PROBE([Opt_Bitmap1006], [Test].[dbo].[FactNoPK] as [f].dimkey2, N'[IN ROW]') AND
PROBE([Opt_Bitmap1007], [Test].[dbo].[FactNoPK] as [f].dimkey3, N'[IN ROW]')

Notes:

  • Bitmap filter only works on inner join. It doesn’t work on outer or cross join.
  • Bitmap filter only works on hash joins. It doesn’t work on merge or nested loop joins.
  • The fact table needs to be > 100 pages.
  • The join must be on a single column.
  • Bitmap filter can slow down queries (when the filter is not selective enough to pay for its overhead)

About joins:

  • Nested loop join: table1 is small and table2 is big, e.g. 5 & 50k rows, and join columns are indexed
  • Merge join: both tables are equally big e.g. 400k & 500k rows and join columns indexed.
  • Hash join: both tables are big but 1 is much bigger e.g. 100k & 1m rows, join columns unindexed.

Bitmap filter doesn’t work if:

  • It’s not a hash join
  • Out of date stats/missing indexes (cardinality/rows)
  • Not enough memory
  • Not enough threads (BF only works in parallel query – you need more than 1 CPU)
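A quick way to see whether the optimiser actually applied a bitmap filter to the demo query above is to look at the plan for a Bitmap operator and the PROBE(Opt_Bitmap…) predicates on the fact table scan. A sketch, assuming the same demo tables (the hash join hint just encourages the conditions above; bitmap filters only appear in parallel hash join plans):

set statistics profile on

select d1.attr2, d3.attr2, d1.attr1, sum(f.measure1)
from FactNoPK f
inner join dim1 d1 on d1.dimkey1 = f.dimkey1
inner join dim2 d2 on d2.dimkey2 = f.dimkey2
inner join dim3 d3 on d3.dimkey3 = f.dimkey3
where d2.attr2 between 'Value100' and 'Value120'
group by d1.attr2, d3.attr2, d1.attr1
option (hash join) -- force hash joins; look for Bitmap Create and PROBE(...) in the plan output

set statistics profile off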

13 March 2010

Creating a Data Mart from a Data Warehouse: Four Questions

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 11:44 pm

I meant a dimensional data warehouse, not a normalised data warehouse here. The things I’d like to cover are:

  1. In the DM, could we change the grain of the fact table? (to a higher grain)
    In doing so could we combine several fact tables into one?
  2. If we have SCD type2 dimension in the DW, should we make it type 1 in the DM?
  3. If the DW is in star schema, could we make it snow flake in DM?
    If the DW is in snow flake schema, could we make it a star?
  4. Should the DM be allowed to change the attribute names and entity names?
    Is there a good reason for doing it?

The answers to these questions largely depend on the reason for creating the DM in the first place. In this article I list 6 reasons for creating a DM from a DW:

  1. Because we need to create a static copy of a few fact & dim tables which do not change every day like the warehouse, for analysis purposes.
  2. Because they need a mart for data mining.
  3. To ease the query workload on the warehouse.
  4. Because they want to change the data to simulate some business scenarios.
  5. To support the query performance of the reports.
  6. Because the data warehouse is in normalised format.

In this article I’d like to focus on #5: a DM which is created to support the reports’ performance (or BI application). The DM is structured differently to the DW.

In the DM, could we change the grain of the fact table? (to a higher grain)
In doing so could we combine several fact tables into one?

Yes by all means. That was the intention of creating the data mart in the first place. To support query/report performance we need to create the fact table at a higher grain and that’s what relational aggregation is all about. Alternatively you could build a cube and have your reports reading the cube.

Yes, by all means combine the fact tables. The idea of optimising a data structure for report performance is that we store the data in a ready-for-consumption format. If this means reading several fact tables from the DW and combining them into a single DM fact table, so be it. This is where a covering index will be very useful. A nonclustered index that includes all ‘grain determinant’ dim keys in the fact table would speed up the join. And no, referring to this article, a “fact key” column would not be useful as you can’t join on it.

If we have SCD type2 dimension in the DW, should we make it type 1 in the DM?

Yes, by all means do that. If you want to make your mart perform, simplify it. The principles of designing a mart are: 1) only take from the DW what you really need, 2) structure it in the best-performing way. The DW, on the other hand, has different design principles, i.e. 1) store anything and everything, and 2) structure it in the most flexible way.

SCD type 2 is a classic example of this ‘flexibility’ principle. Worried that there will be a need to know the value of those attributes at the time the transaction happened, the DW modeller declares the dimension as type 2. Some designers even create “current attribute” columns in that type 2 dim, containing the current value. And some create a “current key” column, which contains the dim key of the current row. For a good background on this Slowly Changing Dimension type 2+1, please read Margy Ross and Ralph Kimball’s article here. Look for the heading “type 2 with current overwrite”. It is not a new concept. They wrote it 5 years ago (March 2005). I first came across it when writing this article. In there I highlighted what Margy Ross said about type 6, which is type 2+3+1, in her article: Combining SCD Techniques. And that was in October 2000!

So yes, if the dimension is type 2 in the DW and you do not need historical value of attributes, you should make the dimension as type 1 in your mart.

Just a few notes on the details when creating a type 1 dim from a type 2 dim:

  1. Use Is Current Flag column instead of date validity columns (it’s faster). A discussion about the data type of the Is Current Flag column is here.
  2. If your source type 2 dimension has “current attribute” columns, use them (instead of the normal attribute column).
  3. If the type 2 dim has both “is_current” flag and “current attribute”, use is_current column regardless whether you use the “current attribute” columns or the normal attribute columns.
  4. You will need to translate the dim keys in the fact table to the current dim keys. If your source dim has a “current dim key” column, it makes the job a lot easier. Otherwise this operation is quite heavy. Essentially we need to prepare a table with 2 columns: dim key and current dim key. We work out the current dim key by looking up the natural key where is_current = 'Y' (a sketch follows this list).
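A minimal sketch of that key translation, with hypothetical table and column names (dim_account with natural_key, dim_key and is_current; adjust to your own schema):

-- Map every historical dim key to the current dim key of the same natural key
select d.dim_key, c.dim_key as current_dim_key
into #key_map
from dim_account d
join dim_account c on c.natural_key = d.natural_key and c.is_current = 'Y'

-- Translate the keys while loading the mart fact table
select km.current_dim_key as account_key, f.measure1
from dw_fact_transaction f
join #key_map km on km.dim_key = f.account_key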

If the DW is in star schema, could we make it snow flake in DM?
If the DW is in snow flake schema, could we make it a star?

First let’s go through the reasons. The reasons people need to normalise a star schema into a snow flake schema are: a) to support a data entry application where the application tags certain information entered against the attribute member, b) to support Reporting Services drop down lists, possibly SharePoint lists also, c) ease of maintenance, because that particular attribute is used by many dimensions, and d) company standard for dimensional modelling.

Whereas the reasons people need to make a star are: a) query performance, b) query performance and c) query performance.

Considering that the purpose of creating the data mart is to support the reports’ query performance, in my opinion the answer to these 2 questions is: yes, by all means make it snow flake or star as necessary to support the report/application.

Some notes when normalising a star schema:

  1. There is another way of normalising a star (other than physicalising it as tables), and that is by creating views, be it materialised views or not (indexed views in SQL Server).
  2. If you have both the attribute code and attribute name in the DW dim then "select distinct attribute_code, attribute_name from dim1" is no problem. But if you don’t have the attribute code, then you need to either a) link on the attribute name, or b) create and maintain the attribute code in the mart. I’d recommend the latter. Of course there’s another option, which is to ask the DW to add the attribute code.

When creating a star from snow flake:

  1. Consider SCD type, e.g. a) parent & child source are type 2, target is type 2, b) parent and child source are type 1, target is type 1, c) parent is type 2, child is type 1 (or vice versa) and target is type 1. To create a type 2 from type 1 is not possible.
  2. Out of the 3 combinations on point 1 above (more if you have 3 levels), the issue is when you want to create a type2 target (option a in point 1 above). We need to make sure that the parent is joining to the child dimension on the right row, e.g. which one is the child row for that parent row etc.
  3. Set your dimension grain (of the target dimension) according to what the application/report need. Usually not at the lowest possible level, but at higher level, e.g. product type – product group – product code: if the report works at product type then set the grain at product type, so the report will have good query performance.
  4. If necessary create 2 dims at different grains, 1 at product type and the other at product code. Reports which work at the product type level don’t have to use the product code dim. The higher level dim could be created using an indexed view rather than a physical table (see the sketch after this list).
  5. If you change the dimension (either the grain or SCD type) you will need to translate the dim keys in the fact table.
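A minimal sketch of that indexed-view approach, with hypothetical names (a product-type-grain dimension exposed as an indexed view over dbo.dim_product):

create view dbo.dim_product_type
with schemabinding
as
select product_type_key, product_type_name,
count_big(*) as row_count -- count_big(*) is required when an indexed view uses group by
from dbo.dim_product
group by product_type_key, product_type_name
go
create unique clustered index ux_dim_product_type
on dbo.dim_product_type (product_type_key, product_type_name)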

Should the DM be allowed to change the attribute names and entity names?
Is there a good reason for doing it?

By ‘entity’ I mean table, e.g. dimension table and fact table.

Yes, in my opinion. The DW is designed to be flexible, incorporating different business areas/units, so in the DW we may create 2 dimensions. Whereas in the DM we only deal with 1 area of the business, so we only need to create 1 dimension.

For example, in wealth management we have 2 account dimensions: private accounts and corporate accounts. They have different attributes because they have different natures of business. They are used in different fact tables. So the dimensions in the DW are accordingly named dim_private_account and dim_corporate_account.

When we create a DM for the private business, it makes sense to call the account dimension dim_account (without the ‘private’ prefix) because in this mart we only deal with the private business.

But if the name changes are because ‘I just simply don’t like it’ I would discourage you from changing the attribute/table name. As much as possible you should try to keep the attribute and table names in the DM the same as in DW. This is for consistency, and also for ease of use. If a measure in DW is called potential_profitability_ratio, and in DM it’s called potential_profitability or profitability_ratio then users will be confused. Not only the end users, but report writers and ETL developers will be confused too. Exact column names give the additional assurance that they are the same, and contain the same data.

Data type of “Is Current Flag” column

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 1:00 pm

In data warehousing we have an Is Current Flag column in SCD type 2 dimension table. This flag indicates whether the row is the current version or not.

Some DW data modellers believe that on the SQL Server platform this column should be created as bit rather than int or Y/N, because it saves space. Int is 4 bytes. Y/N is char(1), which is 1 byte (2 if nchar/unicode). Bit, they say, is 1/8 byte. That’s the reasoning.

But, in SQL Server, if you have 8 or fewer bit columns in the table, they are stored together in 1 byte. So if Is Current Flag is the only bit column in your dimension table, you don’t save space by going to the bit data type.

Contrary to the popular belief that 1/0 (numbers) takes less space than Y/N (letters), it does not. By default people think that the data type for an integer is int; they don’t think about smallint or tinyint. In that case (int) the storage is 4 bytes. If the data modeller is platform aware and knows about smallint and tinyint, then it’s 2 or 1 byte.

Char(1) is 1 byte. But varchar(1) is 3 bytes, not 1 byte. The additional 2 bytes in a varchar column are used to store the length of the value.

So the question then becomes:

which data type is faster for a “where is_current_flag = 1” query: char(1) Y/N or bit 1/0? We know both take 1 byte of storage, but which one gives better query performance?

It’s subject to testing, but I think comparing a bit is faster than comparing a char(1). That’s because bit can only have 1, 0 and null values, whereas char(1) can have many possible values. So for the Is Current Flag column I’d go for the bit data type.

Now let’s test it.

Dim1 is a dimension table with 2 attributes. It has 1000 rows. Fact1 is a fact table with 2 measures. It has 1 million rows.

-- Add Is_Current column as bit
alter table dim1 add is_current bit
update dim1 set is_current = 0 where dimkey1%4 = 0
update dim1 set is_current = 1 where dimkey1%4 <> 0
select * from dim1

-- Measure the query performance
set statistics time on
declare @i int
set @i = 1
while @i <= 3
begin
select d.attr2, sum(measure1) as m1, sum(measure2) as m2
from fact1 f join dim1 d on d.dimkey1 = f.dimkey1
where d.is_current = 1 group by d.attr2
print @i
set @i += 1
end
-- 8885 ms (avg of 8201, 9871, 9159, 8875, 8319)

-- Change Is_Current column to char(1)
set statistics time off
alter table dim1 alter column is_current char(1)
update dim1 set is_current = 'N' where dimkey1%4 = 0
update dim1 set is_current = 'Y' where dimkey1%4 <> 0

-- Measure the query performance
set statistics time on
declare @i int
set @i = 1
while @i <= 3
begin
select d.attr2, sum(measure1) as m1, sum(measure2) as m2
from fact1 f join dim1 d on d.dimkey1 = f.dimkey1
where d.is_current = 'Y' group by d.attr2
print @i
set @i += 1
end

-- 5353.5 ms (avg of 5248, 5299, 5053, 5254, 5764, 5503)

So, quite surprisingly, char(1) is faster than bit. 40% faster. Hmm, in that case I’ll set the Is_Current column as char(1) Y/N. And Y/N is more descriptive than 1/0.
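One thing not tested above, but worth trying in 2008 regardless of which data type you pick, is a filtered index on the current rows, so the is_current predicate becomes a seek over a much smaller index. A sketch against the same dim1 table (change the predicate to is_current = 1 if you stay with bit):

create nonclustered index ix_dim1_current
on dim1 (dimkey1) include (attr2)
where is_current = 'Y'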

12 March 2010

Change Tracking in SQL Server Data Warehousing

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 9:43 pm

Change Tracking is a mechanism to detect that a row has changed. It does not provide information about which column has changed. In SQL Server data warehousing, Change Tracking is typically used for:

  • Propagating transactions from the source system in real time, to ODS or DW
  • Synchronising DMs with DW

Why use Change Tracking instead of CDC or triggers?

  • Real time (synchronous)
  • Lightweight
  • Two way
  • No change to source table
  • Saving months (£) of development time

As I explained here, for incremental extraction we need to rely on last-updated timestamp or identity columns. If we don’t have them, we could create a CRC column to detect changes in rows. The extract of 2 columns (PK & CRC) is compared with the previous day’s extract to detect deletions, new rows and changed rows. We then use the PK to bring those rows into the DW.

Now with Change Tracking the mechanism is provided out of the box. Yes, it’s true that the source needs to be on 2008. But it’s only a matter of time before those 2005 boxes are upgraded to 2008. So in 2-3 years’ time I expect many SQL Server data warehousing projects will be utilising Change Tracking to provide a real-time data warehousing solution.

Disadvantages:

  • Does not provide the data that is changed

Demo/example:

-- create and populate the source table
drop table master3
create table master3
( attr1 int not null primary key clustered,
attr2 varchar(10),
attr3 varchar(10)
)

declare @i int
set @i = 1
while @i <= 1000
begin
insert into master3 (attr1, attr2, attr3)
values(@i, 'Value'+CONVERT(varchar,@i), 'Value'+CONVERT(varchar,@i))
set @i += 1
end

select * from master3

-- enable change tracking and verify
alter database test set change_tracking = on
( change_retention = 3 days, auto_cleanup = on)

alter table master3 enable change_tracking
with (track_columns_updated = on)

select t.name, c.*
from sys.change_tracking_tables c
inner join sys.tables t on c.object_id = t.object_id

-- obtain current version
declare @CT_Ver bigint
set @CT_Ver = change_tracking_current_version()

-- make changes to the table
update master3 set attr2 = 'Value2a' where attr1 = 2
update master3 set attr3 = 'Value1a' where attr1 = 1
delete from master3 where attr1 = 3
insert into master3 (attr1, attr2) values (3000, 'Value3a')

-- query changes
select CT.attr1, CT.sys_change_operation,
CT.sys_change_columns, CT.sys_change_context
from changetable(changes master3, @CT_Ver) as CT

-- check column ID
select c.* from sys.columns c
join sys.tables t on t.object_id = c.object_id
where t.name = 'master3'
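Because Change Tracking only returns the keys and not the changed data (the disadvantage above), the usual pattern is to join CHANGETABLE back to the source table to pick up the current column values, then apply them to the ODS/DW:

-- join the change information back to the source table
declare @CT_Ver bigint
set @CT_Ver = 0 -- or the version saved at the end of the previous extract

select ct.attr1, ct.sys_change_operation, m.attr2, m.attr3
from changetable(changes master3, @CT_Ver) as ct
left join master3 m on m.attr1 = ct.attr1 -- deleted rows come back with null attr2/attr3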

11 March 2010

SSUG BI Evening and SQLBits 6

Filed under: Event — Vincent Rainardi @ 11:21 pm

On 17th March 2010 I’ll be speaking at the SSUG (SQL Server User Group) BI Evening, Microsoft Victoria, London, 18:00-21:00, on SQL Server 2008 Data Warehousing features. Details below.

16th April 2010: SQLBits VI, Church House conference centre, Westminster, London; my proposed session is Query and Loading Performance of a SQL Server Data Warehouse. Details below.

See you there.

SQL Server User Group Evening Meeting on Wed Mar 17, Microsoft Victoria, London

SQL 2008 Data Warehousing Features (Vincent Rainardi); Indexed Views and Computed Columns (Neil Hambly); Business Intelligence in Office 2010 – Excel, PowerPivot, SharePoint (Duncan Sutcliffe); finishing with a Q & A on BI with the panel

LiveMeeting Attendee URL:  Click here to join Meeting
Time: Starts (UK time) at 18:00, Finishes 21:00
Cost: Free
Organiser: UK SQL Server User Group
Address: Microsoft London, Cardinal Place, 100 Victoria Street, London, SW1E 5JL
Directions to Event
Tags: Internals for Beginners; Indexing; Database Structure; File Groups
Note: This event has a max capacity of 102 – this event is now full; we are operating a reserve list – feel free to register as I will be sending reminders out a few days before the event to make sure if you have registered you are still coming.

Come and socialise and learn from your peers; these physical meetings are great places to expand your network, get answers and find out how other people are using SQL Server and what is going on. This is the last meeting this year for London and for those who can we should do drinks after the meeting to continue the SQL chat in an even more informal environment.

If you want to twitter please make sure you use the tag #uksqlug so it is shown on the site.

Your host for the evening will be Ashwani Roy, SQL Server MVP.

Agenda

18:00 – 18:15 Meet & Greet
Meet up and socialise with your friends, meet new people, find out what other people are doing with SQL Server.

18:15 – 19:00 SQL Server 2008 Data Warehousing Features
Vincent Rainardi; http://www.datawarehouse.org.uk

In this session I will explain about 2008 Data Warehousing features such as merge, CDC, data compression, backup compression, star join query, minimally logged insert, parallel query on partition tables, partitioned aligned indexed views, resource governor, grouping sets. I will pick some of them and show implementation example in the form of a demo.

So this session will be particularly useful for people/companies who already have a data warehouse on SQL Server 2005, as it explains to them the benefits of upgrading their warehouse to 2008. It will also be useful for companies who don’t currently have a data warehouse on SQL Server (perhaps they have it on other platform), and are currently deciding whether to use SQL Server 2008 for their warehousing platform. And of course this will benefit Microsoft, as it draws people’s attention to using SQL Server 2008 for their data warehousing platform.

19:00 – 19:15 Using Indexed Views & Computed Columns for Performance
Neil Hambly, http://sqlblogcasts.com/blogs/NeilHambly

One of the key elements of any DW / OLTP system is performance. Alas, no matter how good its design or how powerful its hardware, there are times when the performance of a query operation / function is JUST NOT adequate, even when properly tuned. Generally these are the times we are dealing with lots of data or complex calculations. SQL Server has ways to help deal with those scenarios, some of which include the use of indexed views or computed columns. I will spend a few minutes explaining their usage, pros & cons, and demo some examples.

19:15 – 19:35 Break for Pizza sponsored by Microsoft

19:35 – 20:30 Business Intelligence in Office 2010
Duncan Sutcliffe, Hitachi Consulting (http://www.hitachiconsulting.co.uk)

This session will look at the improvements to self-service BI made in Excel 2010 and SQL Server 2008 R2. It will examine native Excel functionality, the PowerPivot for Excel tool, and deployment of PowerPivot in SharePoint 2010. The session will be in the form of a live demonstration using the most recent CTP and beta software.

20:30 – 21:00 Panel – Q & A session on Business Intelligence
Ashwani Roy and Chris Testa-O’Neill

SQLBits VI, 16th April 2010, Church House conference Centre, Westminster, London
Query and Loading Performance of a SQL Server Data Warehouse
This session discusses the impact of various design/modelling aspects and SQL Server features to the query and loading performance of a SQL Server data warehouse.
  • How various data warehouse design/dimensional modelling aspects like fact table primary key, foreign keys, snowflaking, fact grain, transaction dimension, slowly changing dimension type 2, periodic snapshot, incremental extract, accumulated snapshot and fact normalisation affect the data warehouse query performance and loading performance.
  • How SQL Server specific features affect (can be used to improve) the query and loading performance of a SQL Server data warehouse, for example: partitioning, clustered index, indexed views, filtered indexes, bitmap indexes, change data capture, data compression, merge statement, incremental delete, query/table hints, joins, data types, minimally logged insert, data/log files, filegroups, autogrow, tempdb, recovery model, and scalable shared database.

For a complete list of sessions click here. SQLBits home page is here.
Update: I won’t be speaking at SQLBits VI. My proposed session above was not accepted.

Change Data Capture in SQL Server Data Warehousing

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 10:12 pm

CDC is a mechanism to store data changes (update, insert and delete) happening in a source table by reading the transaction log files. These changes are stored in a change table, enabling us to apply those changes to a target table incrementally.

In SQL Server data warehousing, CDC is typically used for:

  • Propagating changes in the source system into DW
  • Drip feed ODS changes into DW
  • Updating dimensions in data marts
  • Propagating standing data changes into DW

Why use CDC instead of SSIS* or replication:

  • Near real time ETL
  • Small volume of data
  • More efficient than replication
  • Auditable
  • Configurable clean up

*We still use CDC in SSIS, not outside SSIS. The question should be why use CDC in SSIS instead of the normal daily load dataflow package.

Example / demo:
-- enable DB for CDC and verify
use Test
exec sys.sp_cdc_enable_db
select is_cdc_enabled from sys.databases where name = 'Test'
-- output: 1

-- create and populate the source table (must have PK):
create table master1
( attr1 int not null primary key clustered,
attr2 varchar(20) not null
)
insert into master1 values (1,'A')
insert into master1 values (2,'B')
insert into master1 values (3,'C')
select * from master1

-- enable the table for CDC and verify:
exec sys.sp_cdc_enable_table @source_schema = 'dbo',
@source_name = 'master1', @role_name = 'cdc_admin',
@supports_net_changes = 1
select is_tracked_by_cdc from sys.tables where name = 'master1'
-- output: 1

-- update the source table
update master1 set attr2 = 'A2' where attr1 = 1
delete from master1 where attr1 = 2
insert into master1 (attr1, attr2) values (5, 'E')
select * from master1

1     A2
3     C
5     E

-- retrieve the changes:
select * from cdc.fn_cdc_get_all_changes_dbo_master1
(sys.fn_cdc_get_min_lsn('dbo_master1'), sys.fn_cdc_get_max_lsn(), 'all')

select * from cdc.fn_cdc_get_net_changes_dbo_master1
(sys.fn_cdc_get_min_lsn('dbo_master1'), sys.fn_cdc_get_max_lsn(), 'all')

CDC functions:

-- check CDC metadata tables

select * from cdc.captured_columns

select * from cdc.change_tables

select * from cdc.dbo_master1_CT

select * from cdc.lsn_time_mapping

exec sys.sp_cdc_help_change_data_capture

Disadvantages:

  • Lots of change tables and functions
  • Bad for big changes e.g. truncate & reload

Mechanism:

  • Internally implemented as sp_replcmds so reduces log contention
  • 2 agent jobs: capture and clean up
  • Column data type changed: still works
  • Column dropped: null
  • New column: ignored
  • Table dropped: change table dropped

Optimisation:

  • Stop the capture job during load
  • When applying changes to the target, it is ideal to use merge (see the sketch below)
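A minimal sketch of that merge, applying the net changes from the demo capture instance above to a hypothetical target table dbo.target1 (in net changes, __$operation is 1 = delete, 2 = insert, 4 = update):

declare @from_lsn binary(10), @to_lsn binary(10)
set @from_lsn = sys.fn_cdc_get_min_lsn('dbo_master1') -- normally the LSN saved after the previous load
set @to_lsn = sys.fn_cdc_get_max_lsn()

merge dbo.target1 as t
using cdc.fn_cdc_get_net_changes_dbo_master1(@from_lsn, @to_lsn, 'all') as s
on t.attr1 = s.attr1
when matched and s.__$operation = 1 then delete
when matched and s.__$operation = 4 then update set t.attr2 = s.attr2
when not matched by target and s.__$operation = 2 then insert (attr1, attr2) values (s.attr1, s.attr2);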

CDC Templates:

