Data Warehousing, BI and Data Science

20 May 2011

LINQ2SQL and TSQL and CLR

Filed under: SQL Server — Vincent Rainardi @ 10:33 pm

Somebody just asked me today about the differences between LINQ 2 SQL, TSQL and CLR. I thought I’d post my short answer as it might be useful for other people.

Language Integrated Query, or LINQ, is an application-oriented database query language, for example:

var q =
from c in db.Customers
where c.City == "London"
select c;
(source: http://msdn.microsoft.com/en-us/library/bb425822.aspx)

SQL is a row-oriented database query language, for example:
select * from Customers where City = 'London'

TSQL is the Microsoft dialect of SQL, PL/SQL is the Oracle dialect, and NZSQL is the Netezza dialect.

LINQ is good for application-database integration, but the problem is that databases accept SQL; they don’t accept LINQ.
LINQ 2 SQL is a tool that lets developers write LINQ in their application and translates it into SQL automatically so that the database can execute it. The output from the database (rows) is then translated back into LINQ objects.

A Stored Procedure is a collection of SQL statements written in a dialect such as TSQL, PL/SQL or NZSQL.
CLR is a way of writing Stored Procedures using application-oriented languages such as C#, C++ and VB.
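
For illustration, here is a minimal T-SQL sketch of wiring up a CLR stored procedure; the assembly path, namespace, class and method names below are made up for the example:

-- allow CLR code to run on this instance
exec sp_configure 'clr enabled', 1;
reconfigure;

-- register the compiled .NET assembly (path and names are examples only)
create assembly MyClrLibrary
from 'C:\Assemblies\MyClrLibrary.dll'
with permission_set = safe;

-- expose a method in the assembly as a stored procedure
create procedure dbo.usp_MyClrProc
as external name MyClrLibrary.[MyNamespace.MyClass].MyMethod;

-- call it like any other stored procedure
exec dbo.usp_MyClrProc;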

22 April 2011

T-SQL: Exists versus In – Performance Comparison

Filed under: SQL Server — Vincent Rainardi @ 4:25 pm

In this test I’m going to query which numbers exist in Table1 but not in Table2, using “Exists” and “In”.

Exists Query:

select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 )

In Query:

select * from Table1 where col1 not in
( select col1 from Table2 )

There will be 2 tests:

  • Test 1: Table2 is large
  • Test 2: Table2 is small

I believe that in Test 1 “Exists” is a lot faster, but that in Test 2 “In” is slightly faster than “Exists”.
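
The timings below are averages over several runs. One way to measure them (a sketch, not necessarily how the original tests were run) is with SET STATISTICS TIME:

set statistics time on;

select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 );

select * from Table1 where col1 not in
( select col1 from Table2 );

set statistics time off;
-- the elapsed times are reported in the Messages tab under "SQL Server Execution Times"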

Test 1: Table 2 is large

Data Setup:
Table1 contains numbers from 1 to 100,000.
Table2 contains numbers from 10 to 100,000.

Expected results: Exists is faster than In.
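
The post doesn’t show the table definitions; a minimal sketch, assuming two int columns as implied by the insert statements below:

create table Table1 (col1 int, col2 int);
create table Table2 (col1 int, col2 int);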

Script to populate both tables with data:

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i, @i)
set @i = @i + 1
end;
go

truncate table table2;

declare @i int
set @i = 10
while @i <= 100000
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;

Results:

  • Exists Query: average of 167 ms.
  • In Query: average of 1892 ms.

This is as I predicted.

Test 2: Table 2 is small

Data Setup:
Table1 contains 100,000 rows. Col1 contains numbers from 0 to 49 (the result of @i % 50).
Table2 contains 49 rows. Col 1 contains numbers from 2 to 50.

Expected results: In is slightly faster than Exists.

Script to populate both tables with data:

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i%50, @i)
set @i = @i + 1
end;
go

select * from Table1

truncate table table2;

declare @i int
set @i = 2
while @i <= 50
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;

select * from Table2

Results:

  • Exists Query: average of 219 ms.
  • In Query: average of 253 ms.

So I was wrong. Even when Table2 is small, “Exists” is still slightly faster than “In”. And when Table2 is large, “Exists” is a lot faster than “In”. So it is better to use Exists. Always.

Null in Date Column

Filed under: SQL Server — Vincent Rainardi @ 6:14 am

I was having one of those blunder moments when, a few days ago, I said that a date column is not nullable. I was in SSIS, in a Derived Column transform, writing one of those conditional statements: “a ? b : c”. I put “null” in c and got an error, hence my silly conclusion that a date column is not nullable. But of course it is; any column is nullable. And in SSIS it’s not “null”: we write it using the NULL functions, e.g. “NULL(DT_DBDATETIME)”.

20 April 2011

T-SQL: Exists versus In

Filed under: SQL Server — Vincent Rainardi @ 6:31 am

I keep forgetting the syntax for “Not Exists” so I thought I’d write a post for it.

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) 

The above query is known as a “correlated” query: the subquery refers to the outer query through the A.Col1 = B.Col1 condition.

You need to have the “A” and “B” aliases in the code above. Otherwise you will have to type in the whole table name:

select * from Table1 where not exists
( select * from Table2 where Table2.col1 = Table1.col1 ) 

Well, not really. You can get away with not specifying the Table2 alias. But you have to specify Table1’s:

select * from Table1 where not exists
( select * from Table2 where col1 = Table1.col1 ) 

If you take away the Table1 prefix, the query returns nothing, because there is no correlation between the two tables: col1 = col1 now compares Table2’s column with itself, so the subquery matches every row:

select * from Table1 where not exists
( select * from Table2 where col1 = col1 ) 

Exists versus In

And while I’m at it I might as well explain the benefit of “Exists” compared to “In”. And the disadvantage.

Here’s “In”: (it’s always easier to explain by example)

select * from Table1 where col1 not in
(select col1 from Table2) 

If we want to compare 2 columns:

select * from Table1
where col1 not in (select col1 from Table2)
  and col2 not in (select col2 from table2) 

Whereas using “Exists” to compare 2 columns:

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 and A.Col2 = B.Col2 ) 

So comparing 2 columns using “In” is not the same as using “Exists”. The “In” version is not really comparing the 2 columns together: the two IN predicates are evaluated independently, so a row passes even when the col1 match and the col2 match come from different rows of Table2.

To make the “In” really comparing 2 columns we need to do this:

select * from Table1
where convert(varchar,col1) + '|' + convert(varchar,col2) not in
(select convert(varchar,col1) + '|' + convert(varchar,col2) from Table2)

Above, we concatenate the two columns and compare the concatenated string.

Query Performance

Due to time limitations I thought I’d write about the performance in another post. But considering that performance is the most important factor for some people, I will write the principles here and leave the proof for another time.

It is my belief that if the result set we compare to is small, then In is faster, but if the result set we compare to is large, then Exists (which the optimiser implements as a join) is faster.

For example, compare these 2 SQL Statements, one for Exists, one for In:

 select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) -- Query1

select * from Table1 where col1 not in
( select col1 from Table2 ) -- Query2 

If Table2 is small (say 10 rows), the subquery in Query 2 runs fast; SQL Server pins the result in memory and the lookup happens in memory, hence it is fast. But if Table2 is large (say 4 million rows), SQL Server spills the result to disk, and the lookup has to search the disk, hence it is slow.

On the other hand, the correlation is implemented as a join and is very efficient at dealing with 2 large sets. Say both Table1 and Table2 contain 4 million rows. If (and that’s a big if) Col1 is indexed on both tables, Query 1 will run fast because Col1 is the join criterion.

The question I had for years is about the efficiency of Query 2 when dealing with a small data set. Say Table1 has 4 million rows and Table2 has 10 rows. Can Query 2 beat Query 1? In SQL Server 7 and 2000 I believe it could. But in 2008 I don’t believe that is the case. It depends on the indexes. And the logic for hash joins in SQL Server 2008 has improved a lot. We have 3 types of hash join in SQL Server: in-memory hash join, grace hash join and recursive hash join (see here). Quote: “SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.” Note that SQL Server doesn’t know the size of a table if its statistics are out of date, hence it is important to keep the statistics up to date.
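
A quick way to refresh the statistics before re-running the comparison (a sketch, using the table names from the examples above):

update statistics Table1;
update statistics Table2;
-- or, to refresh every table in the current database:
exec sp_updatestats;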

2 April 2011

Collation Conflict When Querying a Linked Server

Filed under: SQL Server — Vincent Rainardi @ 7:45 am

Sometimes we want to query a table on a linked server, to compare with data from the local SQL Server like this:

select * from LinkServer1.Database1.dbo.Table1
where Column1 in (select Column2 from Schema1.Table2)

When we do that we may get this error:

Msg 468, Level 16, State 9, Line 1
Cannot resolve the collation conflict between “SQL_Latin1_General_CP1_CI_AS” and “Latin1_General_CI_AS” in the equal to operation.

This is because the collation is different between the two SQL Servers. To solve this, convert the collation using “collate database_default”, like this:

select * from LinkServer1.Database1.dbo.Table1
where Column1 collate database_default in
(select Column2 from Schema1.Table2)
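
If you are not sure which collations are clashing, you can check them first; a sketch using the names from the example above:

-- default collation of the local server
select serverproperty('Collation');

-- collation of the local column used in the comparison
select name, collation_name
from sys.columns
where object_id = object_id('Schema1.Table2') and name = 'Column2';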

30 March 2011

Update A Table Based on Itself

Filed under: SQL Server — Vincent Rainardi @ 5:34 pm

In the example below we want to update 29th March data based on 30th March.

(The “Before Update” and “After Update” screenshots are not reproduced here.)

SQL:

update A
set Col1 = B.Col1, Col2 = B.Col2, Col3 = B.Col3
from Table1 A
join
( select ID, Col1, Col2, Col3
  from Table1 where Date = 20110330
) B on A.ID = B.ID
where A.Date = 20110329
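
For testing, here is a minimal sketch of the kind of table and data the update above assumes (the column names come from the query; the data in the original screenshots is not shown, so the values below are made up):

create table Table1
( ID int, Date int, Col1 int, Col2 int, Col3 int );

-- 29th March rows (to be overwritten) and 30th March rows (the source values)
insert into Table1 values (1, 20110329, 10, 20, 30);
insert into Table1 values (1, 20110330, 11, 21, 31);
insert into Table1 values (2, 20110329, 40, 50, 60);
insert into Table1 values (2, 20110330, 41, 51, 61);

-- after running the update, the 29th March rows carry the 30th March values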

5 March 2011

SSIS: Importing a File with Dynamic Columns

Filed under: SQL Server,SSIS — Vincent Rainardi @ 9:25 am

One of the most difficult things to deal with in the ETL is importing a file with dynamic columns. The file does not have a set number of columns. The number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below are some examples of such a file:

File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

In the above example Col3 has 3 columns (Col3t1 to Col3t3). But it could be 4, 5, 6 or more; we don’t know.

Before we look at the ETL side, we need to look at the data modelling side first: what kind of target table should we create to host these files?

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
a,a,t2,1
a,a,t3,1
b,b,t1,2
b,b,t2,2
b,b,t3,2
c,c,t1,3
c,c,t2,3
c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn’t matter how many columns in the source file, we will always have 4 columns in the table.
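
A minimal sketch of this normalised table (the data types are assumptions based on the sample data):

create table Table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3Type varchar(10),
  Col3Value int
)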

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
from Table1
) P1
pivot
( sum(Col3Value) for Col3Type in (t1, t2, t3)
) as P2

Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

If we have 2 tables like that we need to join them together on the common column(s), and it becomes even trickier. If we are dealing with a million rows, we risk serious performance degradation. This might not be a problem for a data mart, but for a transaction system where millisecond response times are expected, this design does not serve the purpose.

Alternative Design

The alternative is to find out how many columns Col3 can have. We know it could be 3 to 5, but could it be 100 columns? Spending 5 minutes with the subject matter expert to understand the characteristics of Col3 is well worth it. I have seen this dynamic-columns case 3 times (the first time was about 12 years ago), and in every case they followed the book/theory (normalise) and every time they regretted it. On the second case (about 8 years ago), when I was involved in troubleshooting the performance issue, in the end we changed the table structure to denormalise it.

So, I would recommend finding out how many columns Col3 can have. 90% of the time it is not that many. The response I usually get from the SME is something like: “Most likely Col3 varies between 3 and 5 columns; it certainly won’t exceed 8.” There is a natural meaning to this data, which is why the range is limited. For example, Col3 could be the number of teams, projects, web sites or product types. In the business world most data is within a certain range. You can’t have 100 teams, for example; it’s physically impossible for a company, as it is limited by budget constraints.

So once we have that information (the maximum number of columns), we create the table according to this limitation. Let’s assume that the SME said the max is 8; so we design Table1 with 10 Col3 columns (just in case), as follows:

create table table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int,
  Col3t3 int,
  Col3t4 int,
  Col3t5 int,
  Col3t6 int,
  Col3t7 int,
  Col3t8 int,
  Col3t9 int,
  Col3t10 int
)

Unless it is a staging table, don’t forget to add a PK column. It is a good design to be able to uniquely identify a row at all times. In this example I don’t add a PK column because the target is a staging table.

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimics the structure of the source file and is able to accommodate the dynamic nature of Col3.

The Data Load

I’m going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the one with 4 columns, will have to wait until next time I’m afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns, please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use a Script Component as a source. There are a lot of resources on the internet about the Script Component. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows. First create a Connection Manager for a File Connection (no need to use a Flat File connection). Name this connection File1, like this:

Then create a Data Flow and drag a Script Component onto it. When asked, specify it as a Source.

Then:

  • On the Connection Managers tab, set the connection to File1 (that’s the middle column). In the first column (Name), call it “File1Conn”, matching the name used in the script below.

  • On the Inputs and Outputs tab, rename “Output 0” to File1Output. Then create the Output Columns one by one, as per the structure of Table1:

    Col1, Col2, Col3t1, Col3t2, … Col3t10.


  • In this example we will convert the data type of the Col3 columns from string to int not in the Script Component, but later on, using a Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.

  • On the Script tab, click on Edit Script and edit it as follows. Don’t forget to add “using System.IO;” and, if you want to use MessageBox for debugging, “using System.Windows.Forms;”. Remember that you can’t debug an SSIS Script Component (see my post here), but you can debug an SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
  // Get the connection for File1
  IDTSConnectionManager100 CM = this.Connections.File1Conn;
  File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
  // Create a reader for File1
  base.PreExecute();
  SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
  // Declare variables
  string nextLine;
  string[] columns;
  char[] delimiters;
  int Col3Count;
  String[] Col3Value = new string[10];

  // Set the delimiter
  delimiters = ",".ToCharArray();

  // Read the first line (header)
  nextLine = SR.ReadLine();

  // Split the line into columns
  columns = nextLine.Split(delimiters);

  // Find out how many Col3 there are in the file
  Col3Count = columns.Length - 2;
  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file
  nextLine = SR.ReadLine();
  while (nextLine != null)
  {
    // Split the line into columns
    columns = nextLine.Split(delimiters);
    // Add a row
    File1OutputBuffer.AddRow();

    // Set the Col3Value variable to the values in the file
    for (int i = 0; i <= Col3Count - 1; i++)
    {
      Col3Value[i] = columns[i + 2];
    }

    // Set the values of the Script Component output according to the file content
    File1OutputBuffer.Col1 = columns[0];
    File1OutputBuffer.Col2 = columns[1];
    File1OutputBuffer.Col3t1 = Col3Value[0];
    File1OutputBuffer.Col3t2 = Col3Value[1];
    File1OutputBuffer.Col3t3 = Col3Value[2];
    File1OutputBuffer.Col3t4 = Col3Value[3];
    File1OutputBuffer.Col3t5 = Col3Value[4];
    File1OutputBuffer.Col3t6 = Col3Value[5];
    File1OutputBuffer.Col3t7 = Col3Value[6];
    File1OutputBuffer.Col3t8 = Col3Value[7];
    File1OutputBuffer.Col3t9 = Col3Value[8];
    File1OutputBuffer.Col3t10 = Col3Value[9];

    // Read the next line
    nextLine = SR.ReadLine();
  }
}

public override void PostExecute()
{
  // Close the reader
  base.PostExecute();
  SR.Close();
}
  • Save the script, close SSIS script editor, click OK to get out of the Script Transformation Editor, back to the Data Flow.
  • I would suggest not putting the output into a database table directly, but writing it to a file first to check the result. This way you don’t need to worry about data conversion between string and int, etc. Create a Flat File connection with 12 columns. All the columns are string, with length = 10, as follows:

  • Run it. If it fails, check that the source file matches the sample near the top of this post, and that the script matches the listing above.

  • Create an OLE DB Destination or SQL Server Destination (see my post here on why, practically speaking, we can’t use the SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using the SQL Server Destination (because you are loading to your local instance), then on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you’ll get an error along the lines of “Unable to prepare SSIS bulk insert”.
  • Create a data conversion to convert Col3 columns from string to int, like this:

  • Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:

  • Now execute it again, and check the target table:

  • Overall the Data Flow looks like this:

As usual I welcome any question and discussion at vrainardi@gmail.com. Vincent, 5/3/11.


27 February 2011

Creating Test Data on SQL Server 2008

Filed under: SQL Server — Vincent Rainardi @ 11:31 pm

The key thing here is that you need Visual Studio 2010 to do this. You cannot generate test data using SQL Server 2008 alone. You need Visual Studio 2010 Premium or Ultimate; the normal Visual Studio (Professional edition) cannot do it. Alternatively you can do it using Visual Studio 2008 Database edition. In VS 2010 there is nothing called a Database edition; it only exists in VS 2008.

Unfortunately VS 2010 Premium is £3,695, very different from the normal VS 2010 (Professional edition), which is £694.

As for how to do it, Arshad Ali wrote a very good post here, and the formal MSDN guide is here. One of the most important things to notice is that the test data is rubbish, e.g. it contains a series of random, unreadable characters. For numeric columns it’s OK, they contain random numbers, which is what we want, but again the numbers are rubbish. Most users are used to meaningful data, so when they view reports containing the test data, their usual reaction is that they can’t stand looking at the reports. In their view the reports are rubbish because they contain rubbish data. You won’t understand it unless you see it for yourself, so I’d recommend doing it (showing the user a report with dummy data). I’m of the opinion that reports with dummy data are more appropriate for developers and testers within IT, not the end users.

If rubbish data is not what you have in mind, don’t just set aside 3 days in your project plan to generate test data; you will need to do it the old way, using stored procedures. Mitchel Sellers wrote a very good article on this topic: link.
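
For illustration, a minimal sketch of the “old way”: a set-based script that builds readable test data from small lists of realistic values (the table and column names below are made up for this example):

-- hypothetical target table
create table dbo.TestCustomer
( CustomerId int identity(1,1),
  FirstName varchar(20),
  LastName varchar(20),
  City varchar(20),
  CreditLimit money
);

-- cross join small lists of readable values to produce many meaningful-looking rows
insert into dbo.TestCustomer (FirstName, LastName, City, CreditLimit)
select f.Name, l.Name, c.Name, abs(checksum(newid())) % 10000
from (values ('John'),('Jane'),('Amir'),('Maria'),('Chen')) f(Name)
cross join (values ('Smith'),('Jones'),('Patel'),('Garcia'),('Lee')) l(Name)
cross join (values ('London'),('Leeds'),('Bristol'),('York')) c(Name);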

Alternatively, you can use the Red Gate data generator tool (link), which is £175 per user, cheaper than buying VS 2010 Premium at £3,695. But, again, as with any test data generator tool, the data is rubbish: just a series of random, unreadable characters.

3 September 2010

Q&A on Resizing TempDB

Filed under: SQL Server — Vincent Rainardi @ 5:43 pm

Q: Can we change the growth increment of tempdb from 10% to 1 GB?

A: Yes. To do this, stop SQL Server, then from a command prompt run: sqlservr -c -f. Doing this resets the TempDB data file to 2,304 KB (not 1 MB as in KB 307487) and the log file to 504 KB. The growth is reset to 10% for the data file and 1 MB for the log file. SQL Server is in single-user mode, so there is no risk of TempDB corruption. (No, you can’t see from Configuration Manager that SQL Server is running.) Then run this from SSMS:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILEGROWTH = 1GB)

Verify the growth increment before and after by issuing “sp_helpdb tempdb”. The growth column for the data file (tempdev) should be 10% before and 1048576 KB (1 GB) after.

Control-C on the command prompt to stop SQL Server. Restart SQL Server from Config Mgr.

Q: Can we shrink the TempDB?

A: Yes, we can. Same as above: stop SQL Server and restart it in single-user mode from the command prompt. But this time we specify the size, like this:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', SIZE = 3GB)
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', SIZE = 1GB)

Note: we can only go higher but can’t go lower. For example, after setting it to 5 GB we can set it to 6 GB but we can’t set it to 4 GB. If we try to set it to 4GB it will remain 5GB.

We can specify the growth and the size at the same time.
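
For example, a single statement can set both (adjust the sizes to your environment):

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', SIZE = 3GB, FILEGROWTH = 1GB)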

Q: My TempDB is 200GB. If I shrink it to 1 GB will it come back to 200 GB?

A: If you execute the same queries against the same data, yes. If the big query that caused the TempDB to blow to 200GB is only a one off, no.

Q: Is it worth shrinking TempDB?

A: It depends on your disk space. If you have ample disk space then you shouldn’t worry. But a question like this usually comes up because disk space is limited. One point I’d like to make: if you are working with a 1 TB warehouse and running queries against a 1-billion-row fact table, with joins to 5 dims and “sum group by” etc., don’t be surprised if the TempDB is 200 GB. True, most SQL DBAs who have never administered a SQL warehouse would be surprised, as in their books a TempDB “should be” around 5-10 GB. Yes, the TempDB of a SQL Server used for warehousing is different from that of one used for “data entry” applications (known as “transaction systems” in the BI world).

If you are working with a SQL warehouse, the question should be more about how to optimise TempDB performance rather than disk space, e.g. 1 TempDB data file per processor core, placing TempDB files on different spindles, putting them on RAID 1+0 rather than RAID 5, etc. If your company can’t buy 1 TB of disk space then it probably shouldn’t be doing data warehousing. Almost by definition, data warehousing is about disk space.
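
Adding extra TempDB data files, for instance, is a plain ALTER DATABASE statement (a sketch; the file name, path and sizes below are examples only):

ALTER DATABASE tempdb
ADD FILE (NAME = 'tempdev2', FILENAME = 'E:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 1GB)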

Q: Can I shrink TempDB without stopping SQL Server?

A: Yes, using DBCC SHRINKFILE (or DBCC SHRINKDATABASE). Consider the risk of corruption when shrinking TempDB while it is in use; that risk looks scary for prod DBs but may be acceptable for dev.
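
A minimal sketch (target sizes are in MB; adjust to your environment):

use tempdb;
dbcc shrinkfile (tempdev, 1024); -- shrink the data file to roughly 1 GB
dbcc shrinkfile (templog, 512);  -- shrink the log file to roughly 512 MB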

11 August 2010

Using Pivot for 2 Value Columns

Filed under: SQL Server — Vincent Rainardi @ 3:08 pm

Say you need to pivot a source table with the columns product, trans_type, sales_amount and profit_amount into a result with one row per product and the columns sales_type1, sales_type2, profit_type1 and profit_type2. (The before and after screenshots from the original post are not reproduced here.)

I usually use the old “case when” for this, but it being 2010, I’d prefer to use the pivot statement. We can use pivot for 1 value column, but unfortunately we can’t use it for 2 value columns. Meaning: in the above source table, if we had sales amount only, or profit amount only, we could use pivot.

Using “case when” it’s like this:

select product,
sum(case when trans_type = 'type1' then sales_amount end) as sales_type1,
sum(case when trans_type = 'type2' then sales_amount end) as sales_type2,
sum(case when trans_type = 'type1' then profit_amount end) as profit_type1,
sum(case when trans_type = 'type2' then profit_amount end) as profit_type2
from source_table
group by product

Using pivot it’s like this:

select * from source_table
pivot
( sum(sales_amount)
for trans_type in (type1, type2)
) as p

But how about the profit_amount column? It needs to be pivoted for type1 and type2 too. As I said, you can’t put 2 measures in pivot, like this:

select * from source_table
pivot
( sum(sales_amount)
for trans_type in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type in (type1, type2)
) as p2

We can’t specify type1 twice like above, because it means we will have 2 columns with the name = [type1]. When we run the above pivot statement SQL Server gives these 4 errors:

Msg 207, Level 16, State 1, Line 11
Invalid column name ‘trans_type’.
Msg 265, Level 16, State 1, Line 11
The column name “type1” specified in the PIVOT operator conflicts with the existing column name in the PIVOT argument.
Msg 265, Level 16, State 1, Line 11
The column name “type2” specified in the PIVOT operator conflicts with the existing column name in the PIVOT argument.
Msg 8156, Level 16, State 1, Line 12
The column ‘type1’ was specified multiple times for ‘p2’.

SQL Server 2008 Books Online gives this syntax for “FROM” clause:

[ FROM { <table_source> } [ ,n ] ]
<table_source> ::=
{ …
| <pivoted_table>

}

<pivoted_table> ::=
table_source
PIVOT <pivot_clause>
table_alias

<pivot_clause> ::=
( aggregate_function ( value_column [ [ , ]…n ])
FOR pivot_column
IN ( <column_list> )
)

<column_list> ::=
column_name [ ,…n]

Notice that on the value_column part there is “, …n”, meaning that we can specify more than 1 value column. So let’s try it:

select * from source_table
pivot
( sum(sales_amount, profit_amount)
for trans_type in (type1, type2)
) as p

But SQL Server gives this error message:

Msg 8144, Level 16, State 2, Line 1
Procedure or function sum has too many arguments specified.

Aaarrggghhh…

So what’s the solution?

We need to provide 2 trans_type columns with different values. That is done by selecting the trans_type column twice, adding a “p” prefix the second time:

select product, trans_type as trans_type1, 'p' + trans_type as trans_type2, sales_amount, profit_amount
from source_table

Using the above as a derived table, we pivot it twice, like this:

select * from
( select product, trans_type as trans_type1, 'p' + trans_type as trans_type2,
sales_amount, profit_amount
from source_table
) s
pivot
( sum(sales_amount)
for trans_type1 in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type2 in (ptype1, ptype2)
) as p2

Then we group them on product, like this:

select product, sum(type1) as sales_type1, sum(type2) as sales_type2,
sum(ptype1) as profit_type1, sum(ptype2) as profit_type2
from
( select product, trans_type as trans_type1, 'p' + trans_type as trans_type2,
sales_amount, profit_amount
from source_table
) s
pivot
( sum(sales_amount)
for trans_type1 in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type2 in (ptype1, ptype2)
) as p2
group by product

So we can have 2 value columns: we pivot the table twice. The key things are:
a) the pivot column for the second pivot must be different from the first pivot (e.g. trans_type1 and trans_type2), otherwise SQL Server will complain about duplicate columns, and
b) the values in the 2nd column must be different from the 1st column, e.g. ‘type1’ and ‘ptype1’.

If we have a lot of value columns to be pivoted then it’s worth creating a bit of dynamic SQL.
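
For example, here is a sketch of such dynamic SQL, assuming the source_table, trans_type, sales_amount and profit_amount names used above. It builds the column lists from the distinct trans_type values and then runs the double pivot with sp_executesql:

declare @in1 nvarchar(max), @in2 nvarchar(max), @sel nvarchar(max), @sql nvarchar(max);

-- column list for the first pivot: [type1],[type2],...
select @in1 = stuff((select distinct ',' + quotename(trans_type)
                     from source_table for xml path('')), 1, 1, '');

-- column list for the second pivot: [ptype1],[ptype2],...
select @in2 = stuff((select distinct ',' + quotename('p' + trans_type)
                     from source_table for xml path('')), 1, 1, '');

-- select list: sum([type1]) as sales_type1, ..., sum([ptype1]) as profit_type1, ...
select @sel = stuff((select distinct ', sum(' + quotename(trans_type) + ') as sales_' + trans_type
                     from source_table for xml path('')), 1, 2, '')
            + ', '
            + stuff((select distinct ', sum(' + quotename('p' + trans_type) + ') as profit_' + trans_type
                     from source_table for xml path('')), 1, 2, '');

set @sql = N'
select product, ' + @sel + N'
from
( select product, trans_type as trans_type1, ''p'' + trans_type as trans_type2,
         sales_amount, profit_amount
  from source_table
) s
pivot ( sum(sales_amount) for trans_type1 in (' + @in1 + N') ) as p1
pivot ( sum(profit_amount) for trans_type2 in (' + @in2 + N') ) as p2
group by product;';

exec sp_executesql @sql;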

Hope this helps. As usual I welcome any discussion, correction and questions on vrainardi@gmail.com
