Data Warehousing and Data Science

29 July 2010

The database does not have a valid owner

Filed under: SQL Server — Vincent Rainardi @ 11:20 pm

When I tried to create a database diagram (expand the DB name in Object Explorer, right-click Database Diagrams and select New Database Diagram), I got this error message:

I executed sp_helpdb DBName and got the owner as domain\vrainardi. Then I realised that I was actually working offline. My laptop was not connected to the network. Hence no domain, and my user ID became invalid, or unauthenticated to be more precise. This happens even though I’m working on a local SQL Server instance, i.e. localhost!

Following the hint in the error message I opened the Files page of the Database Properties dialog and got:

So I changed the owner to sa and now when I try to create a database diagram I get:

which is normal. This is because the diagram support objects have not been installed. Instead of going straight to New Database Diagram I should have selected Install Diagram Support first, or so I thought:

But when I clicked on Install Diagram Support as above, I got the same error message again:

So I went back to click New Database Diagram and after clicking yes to the “install diagram support” dialog box, it came up with the Add Table screen*:

*20 years ago a modal box like that used to be called a “dialog box”, but these days people call them “screen” or “page”. I guess if it’s a web app then “page” is appropriate, but for a Windows app? Well “screen” is probably right, because Windows apps used to be GDI (form based), but these days they are WPF.

Anyway, now I can create my database diagram/ERD.
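
As a side note, the owner can also be changed programmatically. Here is a minimal sketch using SMO; the server and database names are placeholders, and it assumes a reference to the Microsoft.SqlServer.Smo assemblies:

// Sketch: change a database owner to sa using SMO
using Microsoft.SqlServer.Management.Smo;

class ChangeOwner
{
    static void Main()
    {
        Server server = new Server("localhost");    // placeholder instance name
        Database db = server.Databases["DBName"];   // placeholder database name
        db.SetOwner("sa");                          // same effect as changing the owner in the UI
        System.Console.WriteLine(db.Owner);         // verify the new owner
    }
}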

SSIS: Looping with ADO Enumerator

Filed under: SSIS — Vincent Rainardi @ 11:43 am

In SSIS, using the Foreach Loop Container we can loop around 7 different things: file, item, ADO, ADO.NET Schema Rowset, Variable, Nodelist and SMO.

  1. File: we loop around a collection of files in a folder
  2. Item: we loop around rows of data that we define ourselves (can be more than 1 column)
  3. ADO: we loop around rows of an existing table
  4. ADO.NET schema rowset: we loop around a collection of any of these: Catalogs, Character Sets, Check Constraints, Collations, Column Privileges, Columns, Columns Domain Usage, Constraint Column Usage, Constraint Table Usage, Foreign Keys, Indexes, Key Column Usage, Primary Keys, Procedure Columns, Procedure Parameters, Procedures, Provider Types, Referential Constraints, Schemata, Statistics, Table Constraints, Table Privileges, Tables, Translations, Usage Privileges, View Column Usage, Views, View Table Usage, SQL Languages, DB Info Keywords, DB Info Literals, Check Constraints by Table, Table Statistics, Tables Info, and Trustees.
  5. Variable: we loop around the values of a variable, for example a DataTable where we loop around the rows.
  6. Nodelist: we loop around a set of XML nodes, which are the results of applying XPath expression to an XML file.
  7. SMO: we loop around SQL Server Management Objects, such as tables, users, statistics, jobs, traces, brokers.

In this article we will use the ADO enumerator. We’ll create a package that:

  1. Create a DataSet
  2. Create a DataTable in the DataSet
  3. Populate the DataTable with 2 rows, row 1: file1.txt, row 2: file2.txt
  4. Assign that DataSet to variable Var1
  5. Have a FELC ADO Enumerator with source = Var1 to loop around the rows of the DataTable
  6. Set the Variable Mapping of this FELC to Var2
  7. Inside this FELC loop create a Script Task that checks if a file with file name = the value of Var2 exists or not.

The overall package looks like this:

True, we don’t need an FELC to loop around the DataTable rows. We could do it in the first C# Script Task, right after step 3 (populating the DataTable), as sketched below. The reasons we might want to use an FELC are a) to use SSIS functionality such as transactions and checkpoints, and b) to reuse the variable that was transferred to the FELC in downstream task(s).
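
For comparison, here is a minimal sketch of that in-script alternative; it reuses the DS1 and Col1 names (and the folder path) from the script that follows:

// Sketch: loop over the DataTable rows inside the Script Task itself,
// instead of handing the DataSet to an FELC
foreach (DataRow row in DS1.Tables[0].Rows)
{
    string fileName = row["Col1"].ToString();
    if (System.IO.File.Exists("C:\\Data\\FELC per row\\" + fileName))
    {
        // process the file here
    }
}

The trade-off is as described above: we lose the FELC’s transaction/checkpoint support and the per-iteration variable.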

1-4 Create & Populate DataTable, assign to Var1

OK let’s start. First create a variable called Var1, of type object, scope = package:

Create the Script Task, and set Var1 as a ReadWrite variable:

Edit Script, and create the DataSet and DataTable:

// Create the table
DataSet DS1 = new DataSet();
DataTable T1 = new DataTable();
DataColumn C1 = new DataColumn();
C1.DataType = Type.GetType("System.String");
C1.ColumnName = "Col1";
T1.Columns.Add(C1);
DS1.Tables.Add(T1);

Usually we populate the datatable from a database table, but in this instance I’m going to populate it from a file:

// Populate the table from the file (StreamReader requires using System.IO;)
StreamReader sr1;
sr1 = new StreamReader("C:\\Data\\FELC per row\\list.txt");
String s1;
DataRow DR1;
while ((s1 = sr1.ReadLine()) != null)
{
    DR1 = DS1.Tables[0].NewRow();
    DR1["Col1"] = s1;
    DS1.Tables[0].Rows.Add(DR1);
}
sr1.Close();

The List.txt file only contains 2 lines:

After populating the DataSet, we assign it to Var1:

// Set Variable
Dts.Variables["Var1"].Value = DS1;
Dts.TaskResult = (int)ScriptResults.Success;

5. Create the Foreach Loop Container

Have a FELC ADO Enumerator with source = Var1 to loop around the rows of the DataTable

Set the Collection of the FELC to the ADO enumerator, with the enumeration mode set to “Rows in the first table”:

6. Set the Variable Mapping

Set the Variable Mapping of this FELC to Var2 (create Var2 first, scope = package, type = string):

This means that on each iteration the current row of the DataTable is assigned to Var2. Only the first column is assigned (Index 0 = first column).

7. Add a Script Task inside the loop

Add a Script Task inside the loop to process each row. Set Var2 in ReadOnlyVariables because we want to use it within the script.

Display the value of Var2 and check the existence of the file named in Var2:

// File.Exists requires using System.IO;
string Var2 = Dts.Variables["Var2"].Value.ToString();
System.Windows.Forms.MessageBox.Show(Var2);
if (File.Exists("C:\\Data\\FELC per row\\" + Var2))
{
    System.Windows.Forms.MessageBox.Show(Var2 + " exists");
}
else
{
    System.Windows.Forms.MessageBox.Show(Var2 + " does not exist");
}
Dts.TaskResult = (int)ScriptResults.Success;


Now if we run it, it will display the value of Var2 and whether that file exists or not:

Rainer Stropek, founder of “software architects”, wrote a good article about the Foreach ADO Enumerator on The Code Project.

27 July 2010

MDX: Except

Filed under: Analysis Services,MDX — Vincent Rainardi @ 1:52 pm

Background

It’s a stock brokerage firm. They run promotion campaigns and would like to know how many customer accounts in each campaign never bought or sold a certain stock. They would like to have this as a calculated measure in the cube.

Let’s do it in SQL first, then in MDX. This way we know what the result should be, so we can verify the MDX output.

So the article will flow like this:

  1. Create the SQL tables. Keep them as simple as possible, with only the necessary columns.
  2. We populate these tables with data.
  3. Write a SQL query which gives us the count of customer accounts in each campaign that never bought or sold a certain stock (say ‘stock1’)
  4. Create the SSAS cube from those tables.
  5. Write an MDX query which gives us the count of customer accounts in each campaign that never bought or sold stock1.
  6. Explain the MDX

If you are looking only for the MDX just scroll down to the end of this article.

1. Create the SQL tables

Let’s create the table:

-- Create tables
create table dim_campaign
( campaign nvarchar(10) )
-- simplified (no media, target, design)

create table dim_account
( account nvarchar(10) )
-- simplified version (no name, number, customer details)

create table dim_stock
( stock nvarchar(10) )
-- simplified (no symbol, company name, sector, segment, cap)

create table fact_campaign
( campaign nvarchar(10),
  account nvarchar(10),
  hit int
)
-- simplified version (no run date, duration & budget)

create table fact_transaction
( account nvarchar(10),
  stock nvarchar(10),
  qty int
)
-- simplified (no dates, price, amount)

I simplified every table, only creating columns which were absolutely necessary.

2. Populate the tables

-- Dim Campaign
insert into dim_campaign values ('campaign1')
insert into dim_campaign values ('campaign2')

-- Dim Account
insert into dim_account values ('account1')
insert into dim_account values ('account2')
insert into dim_account values ('account3')
insert into dim_account values ('account4')

-- Dim Stock
insert into dim_stock values ('stock1')
insert into dim_stock values ('stock2')

-- Fact Campaign (the table has 3 columns, so a hit value is required; 1 assumed)
insert into fact_campaign values ('campaign1', 'account1', 1)
insert into fact_campaign values ('campaign1', 'account2', 1)
insert into fact_campaign values ('campaign2', 'account3', 1)
insert into fact_campaign values ('campaign2', 'account4', 1)

-- Fact Transaction
insert into fact_transaction values ('account1', 'stock1', 100)
insert into fact_transaction values ('account2', 'stock1', 100)
insert into fact_transaction values ('account3', 'stock1', 100)
insert into fact_transaction values ('account2', 'stock2', 100)
insert into fact_transaction values ('account3', 'stock2', 100)
insert into fact_transaction values ('account4', 'stock2', 100)

Now let’s query them:

select * from dim_campaign
select * from dim_account
select * from dim_stock
select * from fact_campaign
select * from fact_transaction

3. Write the T-SQL query:

So the tables are all populated now. We need to find out how many accounts in each campaign never bought or sold stock1. Let’s try this first:

select distinct account from fact_transaction where stock <> 'stock1'

Output: account2, account3, account4

This is not what we want. Account2 and account3 have purchased stock1.
What we want is account4 only.

So looking at the fact_campaign table, campaign1 is for account1 and 2, and campaign2 is for account3 & 4:

So how many accounts in each campaign never bought or sold stock1?
Campaign1   0
Campaign2   1   (account4)

That’s the answer. Now let’s write the T-SQL query, then the MDX query.

select distinct account from fact_transaction
where account not in (select distinct account from fact_transaction where stock = 'stock1')
Output: account4

Good, that’s heading in the right direction. Now bring in the campaign:

select distinct FT.account, FC.campaign
from fact_transaction FT
join fact_campaign FC on FC.account = FT.account
where FT.account not in (select distinct account from fact_transaction where stock = 'stock1')

Output:

So far so good. Now let’s list all campaigns, then join to the above output.

select DC.campaign, count(x.account) as count_of_accounts
from dim_campaign DC
left join
( select distinct FT.account, FC.campaign
  from fact_transaction FT
  join fact_campaign FC on FC.account = FT.account
  where FT.account not in (select distinct account from fact_transaction where stock = 'stock1')
) x on x.campaign = DC.campaign
group by DC.campaign

Output:

In the above SQL we “group by and count”, because the requirement is to find out how many accounts in each campaign never bought or sold stock1.

So that’s the SQL query. Now let’s do the MDX. But before that we need to create the cube first.

4. Create the cube

Data source view:

Cube structure:
(as the dimensions are straightforward, I used the cube wizard to create the dimensions)

Define many-to-many relationship:

Browse the cube:

5. Write the MDX

Now we write the MDX:

with member [Measures].[X] as
distinctcount
( nonemptycrossjoin
  ( { [Measures].[Qty] },
    { except
      ( [Dim Stock].[Stock].[Stock].Members,
        [Dim Stock].[Stock].CurrentMember
      )
    },
    { [Dim Account].[Account].[Account].Members }
  )
)
select X on columns,
non empty [Dim Campaign].[Campaign].[Campaign].members on rows
from [Campaign]
where [Dim Stock].[Stock].&[stock1];

Output:

This is not what we are looking for, because account1, account2 and account3 did buy stock1. The question was: how many accounts in each campaign never bought or sold stock1? So for campaign1 (accounts 1 & 2) the answer is 0, and for campaign2 (accounts 3 & 4) the answer is 1 (only account4).

Here’s the correct MDX:

with
set set1 as
nonempty
( [Dim Account].[Account].[Account].Members,
  [Dim Stock].[Stock].&[stock1]
)
member x as
count
( except
  ( nonempty
    ( [Dim Account].[Account].[Account].Members,
      ( [Dim Campaign].[Campaign].CurrentMember,
        [Measures].[Qty]
      )
    ), set1
  )
)
select {x} on columns,
non empty [Dim Campaign].[Campaign].[Campaign].members on rows
from [Campaign]

Output:

6. Explanation

set1 is the accounts which have bought/sold stock1, i.e. {account1, account2, account3} as shown below:

with set set1 as
nonempty
( [Dim Account].[Account].[Account].Members,
  [Dim Stock].[Stock].&[stock1]
)
select set1 on columns
from [Campaign]

The nonempty inside the except is to get the accounts for each campaign:

with set x as
nonempty
( [Dim Account].[Account].[Account].Members,
  ( [Dim Campaign].[Campaign].CurrentMember,
    [Measures].[Qty]
  )
)
select {x} on columns,
non empty [Dim Campaign].[Campaign].[Campaign].members on rows
from [Campaign]

Output:

Except takes away the members of one set from another set. For example, campaign2 is for accounts 3 & 4. So if we take away accounts 1, 2 & 3 from accounts 3 & 4, what’s left is account4:

with set x as
except
( { [Dim Account].[Account].[Account].[Account3],
    [Dim Account].[Account].[Account].[Account4]
  },
  { [Dim Account].[Account].[Account].[Account1],
    [Dim Account].[Account].[Account].[Account2],
    [Dim Account].[Account].[Account].[Account3]
  }
)
select {x} on columns
from [Campaign]

Output:

The complete MDX is basically saying:

  1. Create set1 = all accounts which have bought or sold stock1
  2. Find the accounts for each campaign using nonempty
  3. Take away 1 from 2 using except
  4. Count the members in that set

with
set set1 as
nonempty
( [Dim Account].[Account].[Account].Members,
  [Dim Stock].[Stock].&[stock1]
)
member x as
count
( except
  ( nonempty
    ( [Dim Account].[Account].[Account].Members,
      ( [Dim Campaign].[Campaign].CurrentMember,
        [Measures].[Qty]
      )
    ), set1
  )
)
select {x} on columns,
non empty [Dim Campaign].[Campaign].[Campaign].members on rows
from [Campaign]

23 July 2010

Parent Child Dimension

Filed under: Analysis Services — Vincent Rainardi @ 9:22 pm

In this post I’d like to point out 3 things:

  1. The column contains ParentID, not the ChildID
  2. MembersWithData property
  3. What happens to nodes without a parent

1. The column contains the ParentID, not the ChildID

When we create a Parent Child Dimension, we have an ID column and a name column, then we have a third column that points to the ID column. This third column should contain the parent ID, not the child ID:

In the table we specify the parent node rather than the child node because the hierarchy is in pyramid format, whereby a node can only have 1 parent but can have several children. Hence we can’t set the third column to be the ChildID, because a node can have several children.

Note that in the table:

a)    The top node has no parent, and

b)    The ChildID (NodeID) is the primary key of the table, meaning we can’t have duplicate node numbers; a duplicate would mean a node has more than one parent.

When we create the SSAS dimension, we make the ChildID (NodeID) as the key of the dimension. The “self referencing column”* contains ParentID, not ChildID. Like this:

*a “self referencing column” is a column that points to another column on the same table.

Let’s create the table first:

create table PC
( ID nvarchar(10) null,
  Name nvarchar(10) null,
  ParentID nvarchar(10) null
)

insert into PC values (1, 'Node1', NULL)
insert into PC values (2, 'Node2', 1)
insert into PC values (3, 'Node3', 2)
insert into PC values (4, 'Node4', 2)
select * from PC
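
To illustrate the parent-child shape (a node has at most one parent but can have several children), here is a minimal C# sketch that prints this table as a tree by following the ParentID column recursively. The connection string is a placeholder; it assumes the PC table above exists in that database:

// Sketch: print the PC table as a tree
using System;
using System.Data;
using System.Data.SqlClient;

class PrintTree
{
    static DataTable pc;

    static void Main()
    {
        // load the PC table (placeholder connection string)
        using (SqlConnection conn = new SqlConnection(
            "Data Source=localhost;Initial Catalog=MyDB;Integrated Security=SSPI"))
        {
            SqlDataAdapter da = new SqlDataAdapter("select ID, Name, ParentID from PC", conn);
            pc = new DataTable();
            da.Fill(pc);
        }

        // a node without a parent is a top level node
        foreach (DataRow row in pc.Rows)
            if (row.IsNull("ParentID"))
                Print(row, 0);
    }

    static void Print(DataRow node, int level)
    {
        Console.WriteLine(new string(' ', level * 2) + node["Name"]);

        // a node can have several children, so scan for all rows pointing at this node
        foreach (DataRow child in pc.Rows)
            if (!child.IsNull("ParentID") && (string)child["ParentID"] == (string)node["ID"])
                Print(child, level + 1);
    }
}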

Then create the DSV relationship (drag the ParentID column and drop it onto the ID column):

On the DSV, the relationship is: Source = ParentID, Destination = ID:

Then create the dim (from existing table):

No need to define the “Usage” property of the ParentID attribute as “parent”; BIDS automatically does this for us.

Process the dim, then browse:

2. MembersWithData property

In the above picture, the nodes are repeating because of a property called “MembersWithData”. This property is currently set to NonLeafDataVisible:

PS. This is the property of the ParentID attribute, not the property of the dimension.

If we set it to NonLeafDataHidden, process the dim, browse the dim, and reconnect, the repeating nodes disappear:

3. What happens to nodes without a parent (orphan nodes)

For this let’s add a row with ParentID = null to the table:

insert into PC values (5, 'Node5', NULL)
select * from PC

Process the dimension, browse:

We see that a node without a parent becomes a top-level node.

4. Unknown

I know I said 3 points, but oh well, 1 more:

In the above picture the Unknown member is there because the UnknownMember property is set to Visible. Set it to Hidden and the Unknown member disappears:

(This is the property of the dimension, not the property of the attribute)

Note that making the unknown member hidden is not recommended because fact rows mapped to the unknown member of this dim will “disappear from the cube”, causing the total of the measure to be incorrect (less than what it should be).

Yes, the usual counter argument is: “but we have set the dim key to 0 on the ETL, so there would be no rows on the fact table with dim key = null”.

Well in Data Warehousing there’s an old saying: “If it is not enforced, it will happen” 🙂 Meaning: things like RI, duplication, and nullability, if they are not physically enforced, will happen. Maybe not in the first month of going live, but eventually they will happen.

PS. RI is enforced using FKs, duplication is prevented using PKs, and nullability is enforced using the “Allow Nulls” column property.

“But the FKs would slow down the ETL”. I know, I know, … classic.

7 July 2010

Top 10 Tips: Cube Testing

Filed under: Analysis Services — Vincent Rainardi @ 10:21 pm
  1. Make sure that the high level totals are correct. Test for double counting. Randomly sample the “leaf level data”, comparing it with the operational system. Accuracy is the most important thing in cube testing.
  2. Give your cube to 3 key users for functionality testing. They understand the business more than you, and it’s them who will be using the cube, not you. Choose users who are familiar with both the BI tool and the data.
  3. To test the query performance, submit complex queries with sub totals, with many attributes in rows/columns/background. Warm cache. Obtain execution duration using Profiler. Consider a) tailoring the aggregations to improve slow queries, b) having snapshot fact tables in DW to support cube performance, c) moving the large attributes into the same dimension.
  4. Stress test using VS Team Suite. It simulates many users querying the cubes simultaneously. A lower-cost alternative is to create an ADOMD app that submits MDX queries (see the sketch after this list).
  5. For BAU change requests, e.g. additional attributes, compare the changed cube in UAT with the Prod cube. Verify that they produce the same high level numbers.
  6. Test processing the cube and time it. If you use scheduler software, use it in testing. If the cube processing is monitored, check that the alert is working. If you do dimension process update + cube process default, check that a) you don’t miss any dimension, b) there is no issue with collation (upper/lower case) when the data is updated, and c) query performance is good. Using incremental processing could halve the processing time.
  7. If you have proactive caching, test the response lead time by changing the underlying source data that is being monitored. If it’s scheduled polling, the lead time should match the polling interval.
  8. Test the calculated measures, named sets and scoping assignments by randomly sampling a few cases with different data characteristics. Compare with manual calculations.
  9. Test the actions (URLs, reports, drillthrough). Make sure they open the correct URL/reports and provide the correct drillthrough numbers. Also test KPIs.
  10. Test the security. Use white box method. On SSMS, access the cube as different users, check if they can only see the dimensions and members they are allowed to see. Also test from the BI tool, using a test domain account, included as a member of the cube role.
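
Regarding tip 4, here is a minimal sketch of such an ADOMD app. The connection string and the MDX are placeholders, and it assumes a reference to Microsoft.AnalysisServices.AdomdClient:

// Sketch: submit an MDX query via ADOMD and time it
using System;
using System.Diagnostics;
using Microsoft.AnalysisServices.AdomdClient;

class CubeQueryTest
{
    static void Main()
    {
        // placeholder connection string: point it at the cube under test
        using (AdomdConnection conn = new AdomdConnection(
            "Data Source=localhost;Catalog=MyOlapDB"))
        {
            conn.Open();

            // placeholder MDX: use the complex queries described in tip 3
            AdomdCommand cmd = new AdomdCommand(
                "select [Measures].Members on columns from [MyCube]", conn);

            Stopwatch sw = Stopwatch.StartNew();
            CellSet cs = cmd.ExecuteCellSet();
            sw.Stop();

            Console.WriteLine("{0} cells in {1} ms", cs.Cells.Count, sw.ElapsedMilliseconds);
        }
    }
}

To simulate many users, run a number of these in parallel threads, each with its own connection.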

There is very little public information about cube testing. I told my colleague he could write a book about cube testing as the gap is too big. Gap between need and supply I mean. The only reference I found is Thomas Ivarsson’s. I hope these top 10 tips help fill the gap. I’m aware that cube testing is a subject where different people have different opinions.

As usual I welcome any comments and questions at vrainardi@gmail.com

Top 10 Tips: Building Cubes

Filed under: Analysis Services — Vincent Rainardi @ 7:11 pm
  1. Understand what the cube will be used for. This is a super critical success factor.
  2. Be careful when materialising a referenced dimension. It could produce incorrect results. See Alberto’s post. I have experienced this 3 times.
  3. Define aggregations, from actual cube usage if possible. They will speed up queries.
  4. Partition your cube. Align the cube partitions with the physical table partitions. This will speed up processing time.
  5. Define your SQL queries in database views, not in named queries. Materialise them if possible, using indexed views. Only select the columns that you need.
  6. Keep it simple. Only create required attributes and measures. You can add more attributes and measures later (and you will).
  7. Be careful when defining attribute relationships. They could produce incorrect results. Product1 is currently under category1, but next year it could be under category2. See this post.
  8. Arrange your DSV diagram and cube structure diagram by hand. Don’t use Arrange Tables. This is more of a personal taste; I like to know where my tables are in the diagram.
  9. Test your cube for accuracy, functionality and performance*. Accuracy is paramount in a cube.
  10. Get early feedback. Give your key user early version of the cube. It’s OK if it’s not finished. Incorporating user feedback is a super critical success factor.

*Hmm, feels like another top 10 tips coming up soon: cube testing!

I came across Hilmar Buchta’s Self Service BI article last week. And when I came back to his blog this week, I saw his top 10 tips. So, yes, I was inspired by it. But the above is an encapsulation of my own experience, honest 🙂

6 July 2010

AMO: Enumerating DSV Tables and Columns

Filed under: Analysis Services — Vincent Rainardi @ 11:44 pm

To list the tables in the DSV (and the views and named queries) and their columns using AMO, we use the DataSourceView class. First we get the DSV directly from the cube, like this:
DataSourceView DSV1 = Cube1.DataSourceView;

Then we loop around each table on the DSV, like this:
for (int i = 0; i <= DSV1.Schema.Tables.Count - 1; i++)
{
}

And inside that loop we retrieve the table names (or named query names) using the “DbTableName” extended property, like this:
F.WriteLine(DSV1.Schema.Tables[i].ExtendedProperties["DbTableName"]); // F is a StreamWriter

Notes:

  1. Instead of writing to file or screen, we can assign to arrays for later processing.
  2. Extended property: If you browse it on Local variables, the ExtendedProperties are here: DSV > Schema > Table > Non Public members > List > n > ExtendedProperties
  3. DSV1.Schema.Tables[i].TableName contains the element name, not the named query name.
  4. The FriendlyName extended property is changeable in the SSAS DSV (select the view/table, right-click, properties – see below). Some cube designers use this FriendlyName property to rename a view/table (not a named query).

To get the SQL for a named query, we use the “QueryDefinition” extended property, like this:
F.WriteLine(DSV1.Schema.Tables[i].ExtendedProperties["QueryDefinition"]);

To get the column names, we loop around the columns, like this:
for (int j = 0; j <= DSV1.Schema.Tables[i].Columns.Count - 1; j++)
{ F.WriteLine(DSV1.Schema.Tables[i].Columns[j].ColumnName); }

We do have DbColumnName and FriendlyName extended properties (see below left). The values of DbColumnName are the same as ColumnName, but FriendlyName is changeable (see the column properties in the SSAS DSV screen, below right). The “design-time-name” extended property contains a GUID.

We can also get the data type for each column, as well as max length and ordinal, like this:

DSV1.Schema.Tables[i].Columns[j].DataType
DSV1.Schema.Tables[i].Columns[j].MaxLength
DSV1.Schema.Tables[i].Columns[j].Ordinal

To add a column we use the Add method, giving 2 parameters, column name and data type, like this:
DSV1.Schema.Tables[i].Columns.Add("Column9", System.Type.GetType("System.Int32"));

To add a table/named query we use the Add method, like this:
DSV1.Schema.Tables.Add("Table9");

Then use the Update method to update the DSV:
DSV1.Update();
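
Putting the pieces together, here is a minimal end-to-end sketch. The server, database, cube and output file names are placeholders, and it assumes a reference to the Microsoft.AnalysisServices (AMO) assembly:

// Sketch: write all DSV table and column names to a file
using System.IO;
using Microsoft.AnalysisServices;

class EnumerateDsv
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("localhost");                            // placeholder server
        Database db = server.Databases.FindByName("MyOlapDB");  // placeholder database
        Cube Cube1 = db.Cubes.FindByName("MyCube");             // placeholder cube

        DataSourceView DSV1 = Cube1.DataSourceView;
        using (StreamWriter F = new StreamWriter("C:\\Temp\\DSV.txt"))
        {
            for (int i = 0; i <= DSV1.Schema.Tables.Count - 1; i++)
            {
                // table / view / named query name
                F.WriteLine(DSV1.Schema.Tables[i].ExtendedProperties["DbTableName"]);

                // column names and data types
                for (int j = 0; j <= DSV1.Schema.Tables[i].Columns.Count - 1; j++)
                {
                    F.WriteLine("  " + DSV1.Schema.Tables[i].Columns[j].ColumnName
                        + " (" + DSV1.Schema.Tables[i].Columns[j].DataType + ")");
                }
            }
        }
        server.Disconnect();
    }
}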
