CDC is a mechanism to store data changes (update, insert and delete) happening in a source table by reading transaction log files. These changes are stored in a change table, enabling us to apply those changes to the a target table incrementally.
In SQL Server data warehousing, CDC is typically used for:
- Propagating changes in the source system into DW
- Drip feed ODS changes into DW
- Updating dimensions in data marts
- Propagating standing data changes into DW
Why use CDC instead of SSIS* or replication:
- Near real time ETL
- Small volume of data
- More efficient than replication
- Auditable
- Configurable clean up
*We still use CDC in SSIS, not outside SSIS. The question should be why use CDC in SSIS instead of the normal daily load dataflow package.
Example / demo:
— enable DB for CDC and verify
use Test
exec sys.sp_cdc_enable_db
select is_cdc_enabled from sys.databases where name = ‘Test’ –output:1
— create and populate the source table (must have PK):
create table master1
( attr1 int not null primary key clustered,
attr2 varchar(20) not null
)
insert into master1 values (1,’A’)
insert into master1 values (2,’B’)
insert into master1 values (3,’C’)
select * from master1
— enable the table for CDC and verify:
exec sys.sp_cdc_enable_table @source_schema = ‘dbo’,
@source_name = ‘master1’, @role_name = ‘cdc_admin’,
@supports_net_changes = 1
select is_tracked_by_cdc from sys.tables where name = ‘master1’ –output: 1
— update the source table
update master1 set attr2 = ‘A2’ where attr1 = ‘1’
delete from master1 where attr1 = 2
insert into master1 (attr1, attr2) values (5, ‘E’)
select * from master1
1 A2
3 C
5 E
— retrieve the changes:
select * from cdc.fn_cdc_get_all_changes_dbo_master1
(sys.fn_cdc_get_min_lsn(‘dbo_master1’), sys.fn_cdc_get_max_lsn(), ‘all’)
select * from cdc.fn_cdc_get_net_changes_dbo_master1
(sys.fn_cdc_get_min_lsn(‘dbo_master1’), sys.fn_cdc_get_max_lsn(), ‘all’)
CDC functions:
— check CDC metadata tables
select * from cdc.captured_columns
select * from cdc.change_tables
select * from cdc.dbo_master1_CT
select * from cdc.lsn_time_mapping
exec sys.sp_cdc_help_change_data_capture
Disadvantages:
- Lots of change tables and functions
- Bad for big changes e.g. truncate & reload
Mechanism:
- Internally implemented as sp_replcmds so reduces log contention
- 2 agent jobs: capture and clean up
- Column data type changed: still works
- Column dropped: null
- New column: ignored
- Table dropped: change table dropped
Optimisation:
- Stop the capture job during load
- When applying changes to target, it is ideal to use merge
CDC Templates:
[…] Changed data capture: enables us to drip changes from DW to DMs, or from ODS/OS to DW […]
Pingback by SQL 2008 Data Warehousing Features « Data Warehousing and Business Intelligence — 11 March 2010 @ 10:17 pm |
By nature DM is batch process that involve level of agregation. CDC is ideal process for DW tables population by capturing events on operational sources. Properly designed DW (on the event level) with CDC enabled on operational sources will help you to avoid data quality issues that exists within oparational databases.
A tarnsaction in operational database can have multiply events or one event only. The point is that you have to design DW to capture all events around the transaction. And thats where you use CDC. If you fall in trap and your DW design is transaction centric, than practically you are migrating business rules and all data issues from transactional databases to DW.
Comment by Milovan Banicevic — 21 April 2011 @ 4:24 am |