Data Warehousing and Data Science

28 December 2012

SSIS 2012: Converting Date from Excel for Lookup

Filed under: SSIS — Vincent Rainardi @ 12:12 am

We have an Excel file like this (date column in DD-MM-YYYY):

Excel File

We have a Date Dimension table like this (date column in YYYY-MM-DD):

Date Dimension

Using SSIS 2012, we want to convert the date column in Excel from DD-MM-YYYY to YYYY-MM-DD so we can look up the Date_Key value in the Date Dimension.

We want to output it to a text file (error rows go to the Error File):

Output File
Error File

Overall

We need to build an SSIS 2012 package like this:

SSIS 2012 Package

The package reads the Excel file. It converts the date column in the Excel file to YYYY-MM-DD using a Derived Column. It reads the Date Dimension table using a Lookup transform, then looks up the Date_Key column. It then outputs the rows to the Output File, and the error rows (failed lookups) to the Error File.

Before We Begin

The SSIS Excel adapter can only run in 32-bit mode, so right-click your project name, choose Properties, and set Run64BitRuntime to False:

Run64bit

Step by Step

Step 1. Read the Excel File

From the SSIS Toolbox, drag a Data Flow Task onto the Control Flow canvas:

Data Flow Task

In the SSIS Toolbox, double-click Source Assistant and choose Excel:

Source Assistant

Click New and choose the Excel file to import. Set Data access mode = Table or View and Name of the Excel sheet = Sheet1$. Click Preview:

Preview Excel File

Click Columns in the left pane to verify that the column names are OK, then click OK to close the Excel Source Editor.

Step 2. Convert to YYYY-MM-DD Using Derived Column

From the SSIS Toolbox on the left-hand side, drag a Derived Column onto the Data Flow canvas and configure it as follows:

Derived Column

Expression: (DT_WSTR,4)YEAR(Date) + "-" + RIGHT("0" + (DT_WSTR,2)MONTH(Date),2) + "-" + RIGHT("0" + (DT_WSTR,2)DAY(Date),2)

Date being the column from the Excel file.

At this step I would recommend setting up a Flat File destination and outputting the Derived Column result into it, to check that the date column in the Excel file has been converted correctly.

If not, at least set up a Data Viewer so you can see what's flowing out of the Derived Column.

Step 3. Read the Date Dimension using Lookup and get the Date Key

Drag a Lookup transform from the Toolbox onto the Data Flow canvas. Set up a connection to your SQL Server, go to the Connection pane, and configure it as follows:

Lookup Transform

In the above SQL query box, you need to convert from varchar to nvarchar, otherwise it won't match the output of the Derived Column, which is DT_WSTR (wide string / Unicode). In SSIS data types, DT_STR = varchar and DT_WSTR = nvarchar.
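As an illustration, the query in that box could look something like this (a sketch only; Date_Dim and Full_Date are assumed names for the date dimension table and its date column, so adjust them to your schema):

-- convert the varchar date column to nvarchar so it matches
-- the DT_WSTR output of the Derived Column
select Date_Key, convert(nvarchar(10), Full_Date) as Full_Date
from Date_Dim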

Click on Preview to see the data from the Date Dim:

Lookup - Preview

Click the Columns pane on the left and configure it as follows:

Lookup - Columns

On the General pane, set "Specify how to handle rows with no matching entries" to "Redirect rows to error output":

Lookup - General

Step 4. Configure the Output File and Error File

Drag Flat File Destination from the Toolkit panel onto the canvas and configure as follows:

Output File - General

Output File - Columns

Output File - Advanced

In the Output Flat File Destination Editor, configure the Mappings pane as follows:

Output Flat File Destination Mapping

Do the same with the Error Flat File Destination Editor:

Error Flat File Destination Mapping

That's it. You can now run the package:

Overall

26 December 2012

Formatting Dates in SQL Server 2012

Filed under: SQL Server — Vincent Rainardi @ 1:46 pm

Finally, after a long wait, we have Format in SQL Server, which enables us to format dates. In Oracle we have to_char to format a date to any format we want, e.g. to_char(date1, 'YYYY-MM-DD'), and this has been there since Oracle 9i (2001). Now in SQL Server 2012 we have the Format function, which we can use to format a date any way we want.
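As a quick side-by-side sketch (to_char on the Oracle side, Format on the SQL Server side):

-- Oracle
select to_char(sysdate, 'YYYY-MM-DD') from dual
-- SQL Server 2012
select format(getdate(), 'yyyy-MM-dd')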

It's quite straightforward to use, especially if you know .Net formatting, but a few notes will probably help:

  1. Month is M (m is minute). Year is yyyy, not YYYY.
  2. M is single-digit month, MM is double-digit.
  3. MMM is the month short name (e.g. Dec), MMMM is the long name (e.g. December).
  4. Day of the week is ddd (Wed) or dddd (Wednesday).
  5. hh is 12-hour, HH is 24-hour (h or H for single digit).
  6. Milliseconds are fff; ff is hundredths of a second, not microseconds (F is 1/10, FFFF is 1/10000).
    For a zero digit, f prints 0 whereas F prints nothing (see the fff vs FFF examples below).
  7. The escape character is \.

Examples:

select format(getdate(), 'yyyy-MM-dd')
2012-12-26

select format(getdate(), 'MM/dd/yyyy')
12/26/2012

declare @a datetime2 = '2012-12-26 13:30:27.1234'
select format(@a, 'yyyy-MM-dd hh:mm:ss.fff')
2012-12-26 01:30:27.123

declare @a datetime2 = '2012-12-26 13:30:27.0004'
select format(@a, 'yyyy-MM-dd hh:mm:ss.FFF')
2012-12-26 01:30:27

declare @a datetime2 = '2012-12-26 13:30:27.0004'
select format(@a, 'yyyy-MM-dd hh:mm:ss.fff')
2012-12-26 01:30:27.000

select format(getdate(), 'ddd')
Wed

select format(getdate(), 'dddd')
Wednesday

declare @a datetime2 = '2012-12-26 11:00:00'
select format(@a, 'HH:mm\h')
11:00h

The complete list of formatting patterns is here:
http://msdn.microsoft.com/en-US/library/8kb3ddd4.aspx

Note that we can use the standard .Net formats too, e.g. d, D, g, G:
http://msdn.microsoft.com/en-us/library/az4se3k1.aspx

declare @a datetime2 = '2012-12-26 13:30:27.0004'
select format(@a, 'd')
select format(@a, 'D')
select format(@a, 'g')
select format(@a, 'G')
12/26/2012
Wednesday, December 26, 2012
12/26/2012 1:30 PM
12/26/2012 1:30:27 PM
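And one more example, illustrating note 1 above, i.e. that lowercase m means minutes, not month:

declare @a datetime2 = '2012-12-26 13:30:27'
select format(@a, 'yyyy-mm-dd')
2012-30-26

The mm picked up the minutes (30) instead of the month (12).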

23 December 2012

Bits and Pieces on Oracle

Filed under: Oracle — Vincent Rainardi @ 6:04 pm

A few things on Oracle have proved very useful to me. I'll share them below because I'm sure some of them will be useful to you too.

1. Searching for a column name

select * from all_tab_columns where table_name like 'C%' and column_name like '%ST%'

2. Add/delete a column

alter table account add account_number varchar2(30)

alter table account add (account_number varchar2(30), balance decimal(20,4))

alter table account drop column account_number

3. Change column name / data type

alter table account rename column account_no to account_number

alter table account modify account_number varchar2(35)

4. Change table name

alter table account rename to customer_account

5. Update a column based on another table

update account set account_number =
( select account_number
from account_mapping
where account_mapping.customer_id = account.customer_id
) where account_number is null
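As a variation on the above, the same update can be written as a merge (a sketch; it assumes Oracle 10g or later, where the update branch of a merge can take a where clause, and that customer_id is unique in account_mapping):

merge into account a
using account_mapping m
on (m.customer_id = a.customer_id)
when matched then update
set a.account_number = m.account_number
where a.account_number is null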

6. Create a table from another table
(this is very useful for backing up before updating)

create table account_bak as
select * from account

(add "where 1=2" if you only want to copy the table structure, without the data)

7. Escaping & (ampersand)

update table1 set column1 = 'M' || chr(38) || 'S'
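If the statement runs in SQL*Plus, an alternative to chr(38) is to switch off substitution variables around it (a sketch):

set define off
update table1 set column1 = 'M&S';
set define on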

17 December 2012

Composite

Filed under: Business Intelligence — Vincent Rainardi @ 8:19 pm

VR, Update 11/10/2015: Composite is now formally called Cisco Data Virtualisation Suite

Composite enables us to integrate data from various sources to provide BI. This approach is called data virtualisation (DV). It is the opposite of a data warehouse (DW): a DW collects data from various sources and stores it in a new data store, whereas with DV the data stays in the sources.

The biggest advantage of using Composite is its speed to market. Without building a data warehouse, we can provide the BI in 2-3 months, as opposed to 12-18 months. We don’t need to design the DW (3-6 months), we don’t need to design the ETL (6-9 months), and we don’t need to spend a lot of time testing the ETL (2-3 months).

As a DV tool, Composite is very good. It has its own SQL-like language with many functions. It has adapters for most RDBMSs, as well as Hadoop and Teradata, plus flat files and Excel. We can create views and procedures which integrate many sources. We can cache them, so that the BI tool gets fast query performance. The cache can be a database table, or we can cache it in Hadoop, so that the query is super fast.

Rather than caching the data, we can also build loading procedures to retrieve data from the source systems incrementally. This way we lighten the burden on the source systems. We can also set up various triggers to invoke those loading procedures, either on a timer or event-based. And we can create a "batch" procedure which calls the loading procedures one by one, so there is none of the waiting and delay that happens if we arrange them on timers.

On the publication side, there is a facility to publish the views and procedures that we create as web services. These can then be consumed by BI tools, for example by Spotfire Information Designer.

Those are the strong points. What are the weaknesses? The scripting language is primitive; compared to Java, C# or even PL/SQL there are a lot of gaps to close. The data manipulation functions are poor; in this area Informatica's functionality is far superior. The programming environment is also poor: it's like going back to 1992, with no debugger, no breakpoints, no watches.

The database functionality is also weak. For example, we can't truncate a partition, something that is very basic in data loading. Data inspection (and DQ in general) is also weak: we can't easily generate statistics on the source data, e.g. the distribution of values.

Would I recommend Composite for BI projects? Yes, of course, because it provides an excellent, easy-to-use tool for integrating data from many places, to be consumed by the reporting/BI tool. It can cut development time significantly, as in 3 months instead of 12.

Disclosure: I receive no financial benefit from any company for saying/writing this. The opinion is entirely mine, and it doesn't necessarily reflect the opinion of my employer or my clients. I have used both Informatica and Composite hands-on, as well as SSIS, BO, Oracle and Spotfire. As always I may be wrong and I'm open to suggestions; if I said something incorrect above, I would be happy to be corrected.

Vincent Rainardi, 17/12/2012

3 December 2012

Why Do We Need a Data Warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 6:52 pm

A lot of times people question the value of data warehousing. Why spend a year building a data warehouse? We can't wait that long. Let's just install QlikView/Spotfire, feed the transaction system straight into it, and we have BI!

Absolutely! You can. You can buy BO, MicroStrategy, QlikView, Spotfire or any BI tool you like, then report straight from the transaction system. Or, if you fancy, you can create a cube first (SSAS, Cognos or Hyperion), then install an appropriate client tool (Tableau, Strategy Companion, etc.). Of course you can. And this is the best way to learn about data warehousing: by not doing it.

What you will experience is:

a)    Data quality issues
b)    Low confidence level among users
c)    Quick report turnaround, but unstable data
d)    Issues with data consistency
e)    Issues with performance

The whole year spent building a data warehouse is essentially spent providing a quality data source. A data warehouse has the following characteristics:

a)    Integrated
b)    Consistent
c)    Contains historical data
d)    Tested and verified
e)    Performant

A data warehouse integrates data from multiple sources correctly. This integration doesn't happen overnight. A Business Analyst spent weeks analysing the sources and writing a specification of how the data should be integrated. A Data Architect took that spec and designed a performant star schema to host the data. An ETL Architect looked at the star schema design and wrote an ETL population spec. An ETL Developer studied the ETL spec and built the workflows. And finally, a Tester verified the data.

That takes months, but as a result we have an integrated, consistent, clean data source containing correct and valid data. And it is performant: your query doesn't need to join 15 tables in a horrible way. All the data is in one centralised place, ready for you to query.

But, as I said, experience is the best teacher. Once you experience how difficult it is to build business intelligence without a solid data warehouse, then you will truly appreciate the value of a data warehouse.

PS. I have just posted another article with the same title (but different): link

2 December 2012

Tibco Spotfire

Filed under: Business Intelligence — Vincent Rainardi @ 12:49 am

Like QlikView, Spotfire serves BI users with analytical reports, with which they can slice and dice the data and export it to Excel/PDF, in various types of visualisations including pivot tables and charts.

Unlike SSAS, Spotfire doesn't require an OLAP cube to be built first. We define a set of data tables which are then used by the visualisations.

Spotfire has a good back-end capability called the Information Designer. In it we can create columns and procedures, from which we then create Information Links that feed data to the reports. Info Links accept parameters and filters, which make them flexible enough to retrieve only the data we need.

Most of a Spotfire developer's time is spent on three things:

  1. Creating Information Links (20%)
  2. Creating data tables (30%)
  3. Creating visualisations (30%)

Once the Info Links are set up, they usually repeat 2 and 3 with the users in an iterative loop. Occasionally they also do some automation (using Automation Services) and administration (arranging library content/folders and administering users/access).

Spotfire also has a development kit, which plugs into Visual Studio (using a .NET language such as C#), where we can write code to customise the visualisations and data tables. Alternatively, we can do the programming inside Spotfire itself, i.e. as code behind a button in a text area, using IronPython as the scripting language. The scripting capability has variables, branching and looping as usual, and API access to all the visualisation and data table objects within Spotfire.

As with most BI tools, Spotfire is a rapid development tool. An analytic report is typically developed within a week, from the point where we sit down with the business (or a BA) detailing the requirements to presenting the first draft back to the business. From there, of course, there are many iterations to fine-tune it until we get the finished product. This fine-tuning typically takes 2-3 weeks, depending on how wide the business community involved is and how fast or slow their responses are. And of course it depends on how complex the report is: a one-pager with 2 visualisations and 3 data tables is about 2 days of development time, while four pages with 7 different visualisations and 15 data tables take about a week.

Most of the issues in Spotfire development lie in the data tables. This is not specific to Spotfire; it is a general phenomenon in BI: the data is a very difficult thing to get right. The reports, on the other hand, are easy; they just reflect the numbers in the data. If somebody could guarantee that the data in the database/data warehouse was 100% correct and accurate, then making the reports accurate would be easy. I would say that 90% of the BI effort goes into preparing a database that is accurate, and only 10% into the reporting.

Overall, Spotfire is a good BI tool for presenting data to the users and letting them analyse it. It has a Web Player and an Enterprise Player: the Web Player (WP) runs in web browsers as a thin client, and the EP is a thick client. Like in other BI tools, the thick client has more features than the Web Player; your slice-and-dice capability is more limited in the Web Player, for example you can't change the axes.

Strengths of Spotfire? I would say the back end, i.e. the Information Designer and the data tables. Also the calculated columns: there is so much we can do there, with very rich functions. The API is another strength. Weaknesses? The visualisations, in my opinion. For example, arranging visualisations on a page is unbelievably difficult. Bar charts: we can't put the numbers on each bar. Using filters to slice data is difficult; Tableau and Strategy Companion are a lot easier to use when it comes to slicing and dicing the data.

Would I recommend it as a BI tool? Absolutely. Along with QlikView, I think Spotfire is one of the best visualisation/reporting tools in the BI world.
