Data Warehousing and Data Science

29 September 2010

Updating Partition Source SQL

Filed under: Analysis Services — Vincent Rainardi @ 7:44 am

Say you have a measure group with 100 partitions, each with source SQL “select [some columns] from table1 where col1 = X”, where X = 1 to 100. Table1 has been renamed to Table2, and you need to change all the partition source SQLs from “Table1” to “Table2”.

If you had only 10 partitions you could update them one by one in BIDS: five minutes and the job is done. But 100 partitions? Hmm… do you want to spend an hour on it? Not really.

So how do we update all partitions in 1 go?

1. Replace All in XMLA

Export the cube to XMLA, do a “Replace All” in SSMS, execute the XMLA, then recreate the BIDS project from the resulting cube. In the Find & Replace dialog box you can use a regular expression if necessary (in the “Use” drop-down list, select “Regular Expression”; available in SSMS 2008 and 2008 R2).

2. Using XMLA alter command

In SSMS Object Explorer, navigate to the measure group containing the partitions, then script it as ALTER. The result looks like this:

<Alter …>
   …
   <Partitions>
      <Partition>
        <ID>1</ID>
        …
        <QueryDefinition>SELECT … FROM Table1 WHERE col1 = 1</QueryDefinition>
        …
      </Partition>
      <Partition>
         <ID>2</ID>
         …
         <QueryDefinition>SELECT … FROM Table1 WHERE col1 = 2</QueryDefinition>
         …
      </Partition>
      … etc (all the partitions are scripted here)
   </Partitions>
</Alter>

Do a “Replace all” to replace “Table1” with “Table2” and execute the XMLA. Then recreate the BIDS project from the resulting cube.

3. Update the partition file

In Windows Explorer, navigate to the folder where your BIDS project is located and find the *.partitions file. Edit it using Management Studio (you can drag and drop it from Windows Explorer into SSMS; no need to go through File, Open, File).

The first part of this file is the aggregation design. Keep scrolling down until you find the <Partitions> tag, like this (yes, it is the same structure as the XMLA snippet in point 2 above):

<Partitions>
  <Partition … >
    <ID>1</ID>
    <Name>1</Name>
    …
    <Source … >
      <DataSourceID>DataSource1</DataSourceID>
      <QueryDefinition>SELECT … </QueryDefinition>
    </Source>
    …
  </Partition>
  <Partition … >
    <ID>2</ID>
    <Name>2</Name>
    …
    <Source … >
      <DataSourceID>DataSource1</DataSourceID>
      <QueryDefinition>SELECT … </QueryDefinition>
    </Source>
    …
  </Partition>
  …
 </Partitions>

Do a “replace all” to change “Table1” to “Table2”. The word “Table1” could exist somewhere else in the file (not only in the partitions), so it is safer to select the text between <Partitions> and </Partitions> and then do a replace all restricted to the selection (set “Look in” to “Selection” in the Find and Replace dialog).

Save it, then open the solution in BIDS as normal and look at the Partitions tab of the cube: the partition source SQLs are all changed.

4. Change the Partition SQL using AMO

Hmm everything is using AMO these days 🙂

Here we go:

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.AnalysisServices;
using System.IO;

namespace AMO
{
  class Program
  {
    static void Main(string[] args)
    {
      // Output to text file
      StreamWriter F = new StreamWriter("C:\\Data\\Apps\\DotNet Projects\\AMO Partition Source SQL\\File.txt");

      // Connect to server
      Server SSAS_Svr = new Server();
      SSAS_Svr.Connect("Data Source=localhost;Initial Catalog=Adventure Works DW 2008;Integrated Security=SSPI;");
      if ((SSAS_Svr != null) && (SSAS_Svr.Connected))
      {
         // Find the database
         Database UW_DB = SSAS_Svr.Databases.FindByName("Adventure Works DW 2008");
         if (UW_DB != null)
         {
           Cube UW_Cube = UW_DB.Cubes.FindByName("Adventure Works");
           if (UW_Cube != null)
           {
             // Find the Measure Group
             MeasureGroup MG = UW_Cube.MeasureGroups.FindByName("Internet Sales");
             F.WriteLine("Measure Group: " + MG.Name);

              // Get the partition source SQL for each partition in the
              // measure group, and replace the source table.
             foreach (Partition P in MG.Partitions)
             {
               F.WriteLine("Partition: " + P.Name);

               // Get the Partition Source SQL
               F.WriteLine("Before:");
               F.WriteLine(((QueryBinding)P.Source).QueryDefinition.ToString());
               // Partition Source can be DsvTableBinding or QueryBinding.
               // Only QueryBinding Source has QueryDefinition. Hence the cast.

                // Replace the source table
                StringBuilder NewSQL = new StringBuilder(((QueryBinding)P.Source).QueryDefinition.ToString());
                NewSQL.Replace("[dbo].[FactInternetSales]", "[dbo].[FactInternetSales2]");
                ((QueryBinding)P.Source).QueryDefinition = NewSQL.ToString();

                // Send the change to the server; without Update() the new
                // QueryDefinition exists only in memory.
                P.Update();

               // Write the new Source SQL
               F.WriteLine("After:");
               F.WriteLine(((QueryBinding)P.Source).QueryDefinition.ToString());
             }
           }
         }

         // Disconnect from AS Server
         SSAS_Svr.Disconnect();
       }

       //Close the file writer
       F.Close();
     }
  }
}

Hope it’s useful. I welcome comments and discussion at vrainardi@gmail.com.
To copy and paste code from SSMS to WordPress with colour and tabs retained, you just need to wrap it in a “sourcecode” tag.

Vincent Rainardi, London, 29/9/2010.

28 September 2010

Why write a blog?

Filed under: Other — Vincent Rainardi @ 8:30 am

Some people asked why I write a blog. Some people asked why I wrote a book. Why I spent so many hours on it. Why I don’t do other things instead, like relaxing in the garden. They wonder what’s in it for me.

I write blog posts usually at night, or on the train like now. I spend on average probably 3 hours a week writing blog posts. In 2007 I spent the whole year, from January to December, writing a book. Almost every night and every weekend. And my friend asked why I did that.

I also spend a lot of hours preparing presentations. For SQLBits, for SSUG and for other events. For the Stock Broker case study I presented at SQLBits 6 in Wales last November, I spent a lot of hours creating the data. My friends wondered why I gave it away. I spent a lot of hours preparing the demos for SSUG at Microsoft (and the coming SQLBits 7). My colleagues wondered why.

I write and present because I want to share my experience and thoughts with others. I believe that the experience, thoughts and ideas that I have were given by God, and in return I have to share them. I understand that not everybody believes in God. Many people don’t. So I’ll share this next reason:

By writing we increase our knowledge. If, say, your level is 5 before writing a book, then after finishing that book your level would be 8. The same goes for writing a blog. This is because you have to do research. You have to read and you have to test things. You need to be sure that what you write is correct. And in doing so you grow your knowledge.

By writing we learn. And we learn more by writing than by reading.

Ratio in SSAS

Filed under: Analysis Services — Vincent Rainardi @ 7:34 am

A ratio is a number divided by another number, e.g. A/B. An example of a ratio in the retail industry is profitability, which is profit divided by sales. Another example is capacity utilisation (in telco and manufacturing), which is capacity used divided by total capacity.

The point of this post is perhaps obvious to some people, but to others it may not be: in SSAS, to create a ratio we need to present both measures (A and B) to the cube, and then create either a measure expression or a calculated measure. We cannot calculate the ratio in the data warehouse and then present the pre-calculated ratio to the cube.

The reason for this is that ratios are not additive. If product1 profitability is 20% and product2 profitability is 10%, the profitability of both products together is not 30%. It is a figure between 10% and 20%, depending on the sales volume.

Because ratios are not additive, in the fact table we need to create two measures: the numerator (the top number) and the denominator (the bottom number).
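To see why in plain SQL (a sketch only; FactSales, Profit, SalesAmount and ProductCategory are made-up names), compare aggregating a pre-calculated ratio with aggregating the numerator and denominator separately and dividing afterwards. Only the second form is correct at every level, and that final division is exactly what the measure expression or calculated measure does in the cube:

-- Wrong: the ratio is calculated per row and then aggregated
SELECT ProductCategory, AVG(Profit / SalesAmount) AS Profitability
FROM FactSales
GROUP BY ProductCategory

-- Right: aggregate the numerator and denominator, then divide
SELECT ProductCategory, SUM(Profit) / SUM(SalesAmount) AS Profitability
FROM FactSales
GROUP BY ProductCategory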

A ratio is not an issue in reports, but it is an issue in cubes. In a report there is only one level, so if we want to (to boost performance) we can calculate and store the ratio in the fact table. But cubes have many levels; that is the essence of having a cube: so we can aggregate up and drill down. Hence the ratio needs to be correct at all levels, and hence we can’t calculate it in advance and store it in the fact table.

For a ratio to be correct at all levels, we need to store two measures in the fact table, then do the division in the cube. As I said, this might be obvious to some people, but not to others. As always I welcome discussion and questions at vrainardi@gmail.com. Hope it’s useful, Vincent Rainardi 28/9/2010.

21 September 2010

Aggregate Not Shown

Filed under: Analysis Services — Vincent Rainardi @ 6:10 pm

Perhaps this post will come in handy if one day you encounter this issue: the cube displays Measure1 at the day level but not at the month or year level. My guesses were:

  1. The dimension relationship between Measure1 and the other dimensions (other than the date dim)
  2. The attribute relationship from day to month in the date dimension
  3. The aggregation design assigned to the partition where Measure1 is located
  4. A SCOPE statement in the MDX calc script that overwrites the month and year levels with null

After scratching my head investigating all of the above systematically, the answer was: none of the above. The cause was a default member set on one of the dims.
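A quick way to check for stray default members (a sketch, assuming SSAS 2008 or later; run it in an MDX query window against the cube database and substitute your own cube name for ‘YourCube’) is the schema rowset DMV:

SELECT [DIMENSION_UNIQUE_NAME], [HIERARCHY_UNIQUE_NAME], [DEFAULT_MEMBER]
FROM $SYSTEM.MDSCHEMA_HIERARCHIES
WHERE [CUBE_NAME] = 'YourCube'

Any hierarchy whose DEFAULT_MEMBER is not its “All” member is a suspect.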

A tip to troubleshoot “measure is empty” issue:

  1. Copy the BIDS project to project2.
  2. Open project2 and in the Dimension Usage tab remove all of the MGs except one (the one that you are having a problem with)
  3. In the Dimension Usage tab remove half of the dimensions (the ones that you think are unlikely to cause the issue)
  4. Deploy and process the cube under a different name, e.g. cube2.
  5. If the measure is displayed, it means that one of the dims you removed caused the problem. If the measure is not displayed, it means that one of the dims that is still there caused the problem.
  6. Keep doing 3 & 4 until you find the dim in question.

Note: when doing #3, instead of removing the dim you can also set the relationship to none (right-click the intersection cell and choose Delete; no need to click the ellipsis and go into the dialog box).

As usual I welcome any discussion and questions at vrainardi@gmail.com. Not just about this post, but also about data warehousing and BI in general.

Renaming Attribute and Dimension

Filed under: Analysis Services — Vincent Rainardi @ 12:35 pm

Say the attributes and dimensions in the enterprise data warehouse are renamed and your cubes are impacted. Apart from renaming the attributes and dimensions themselves, when estimating the effort we also need to consider:

  1. Attribute relationships
    After changing the attribute names, the attribute relationships will show a yellow circle with an exclamation mark. You will need to delete and recreate them. No, editing them won’t work.
  2. Hierarchies
    Many of the renamed attributes are used in hierarchies. The levels on these hierarchies should also be renamed, to match the attribute names. The hierarchy name itself should also be renamed.
  3. MDX Scripts/Calculations
    This looks like a trivial “replace all”, but be careful when doing replace all. Sometimes it makes a few things stop working, for example:
    a) Same dim name, attribute & level, i.e. [Branch].[Branch].[Branch]. You meant to replace the Dim Name only, but the “replace all” replaced all three.
    b) The word is part of a hierarchy name, e.g. [Electronic Product Hierarchy]. You meant to rename the [Product] dim to [Service], but the hierarchy name in the calculation would also be replaced.
  4. Default Values
    If an attribute has a default value, it needs to be changed too. For example: dim = account, attribute = account type, current default value: [Account].[Account Type].&[Bond]. If the Account dim is renamed to Product and the Account Type attribute is renamed to Product Type, then the default value becomes: [Product].[Product Type].&[Bond].
  5. Dimension Usage
    If you change the name of a dimension, the dimension name on the Dimension Usage tab should also be changed. Dimension Usage is the tab that contains the relationships between dimensions and MGs.
  6. Translations
    Both on the Cube Translations (the dim names are here) and the Dimension Translation (the attribute names are here). For example, if Account Type is changed to Product Type, then you don’t want the French translation to stay as Type De Compte. You would want it changed to Type De Produit. And the Spanish: from Tipo De Cuenta to Tipo De Producto.
  7. Analytics Reports
    Remember that the attribute names, dimension names and measure names on all analytics reports (Strategy Companion, Tableau, ProClarity, SSRS, Excel/Vivid) would need to be changed too. This is why, once a lot of reports have been built, it is practically impossible to change an attribute or dimension name.

I would definitely discourage the “shortcut” practice of: “export to XMLA, search and replace all, deploy to the AS server, recreate the BIDS project from the server”. True, this is quick, but it won’t work except for the simplest cubes. And if it is a very simple cube, then renaming it in BIDS is quick and simple too. So definitely do it in BIDS.

If you are changing measure names as well, things that are impacted are:

  1. Calculations / MDX Scripts
  2. Default Measure of the cube

Hope this helps your impact analysis/estimation. As usual I welcome discussion and questions at vrainardi@gmail.com

10 September 2010

Q&A on: Browsable Folders When Restoring Cubes

Filed under: Analysis Services — Vincent Rainardi @ 10:54 am

Q: When restoring cubes, can I “browse” rather than “type” to select the backup file?
A: Yes, enter the folder path on the “AllowedBrowsingFolders” server property.

Q: Can I have more than one folder on the restore cube dialog box?
A: Yes, when entering the folder paths on the “AllowedBrowsingFolders” server property, separate them using a pipe (|).
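For example, the property value might look like this (the folder paths are illustrations only):

AllowedBrowsingFolders = C:\OLAP\Backup|D:\CubeBackups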

Q: I entered 2 folders separated by |, but why only 1 appears on the restore cube dialog box?
A: That folder is invalid: either a) there is a typo in the folder path in the server property, or b) the folder doesn’t exist. For example: the property is “Log|Backup”, the Backup folder gets deleted, and now only Log appears on the restore cube dialog box.

Q: Can the backup file be located on another server / file share?
A: Yes, for example \\server1\folder1\file1.abf. You need to have read permission on this remote folder. But no, you can’t add a remote path like that to the “AllowedBrowsingFolders” server property. If you do, it won’t appear on the restore cube dialog box.

3 September 2010

Q&A on Resizing TempDB

Filed under: SQL Server — Vincent Rainardi @ 5:43 pm

Q: Can we change the growth increment of tempdb from 10% to 1 GB?

A: Yes. To do this, stop SQL Server, then start it from the command prompt with: sqlservr -c -f. Doing this resets the TempDB data file to 2,304 KB (not 1 MB as in KB 307487) and the log file to 504 KB. The growth is reset to 10% for data and 1 MB for log. SQL Server is in single-user mode, so there is no risk of TempDB corruption. Note that Configuration Manager will not show SQL Server as running, because it was started from the command prompt. Then do this from SSMS:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILEGROWTH = 1GB)

Verify the growth increment before and after by issuing “sp_helpdb tempdb”. The growth column for the data file (tempdev) should be 10% before and 1048576 KB (1 GB) after.

Press Ctrl+C in the command prompt to stop SQL Server, then restart SQL Server from Configuration Manager.

Q: Can we shrink the TempDB?

A: Yes, we can. Same as above: stop SQL Server and restart it in single-user mode from the command prompt. But this time we specify the size, like this:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', SIZE = 3GB)
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', SIZE = 1GB)

Note: we can only go higher but can’t go lower. For example, after setting it to 5 GB we can set it to 6 GB but we can’t set it to 4 GB. If we try to set it to 4GB it will remain 5GB.

We can specify the growth and the size at the same time.
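For example (a sketch of the statement described above; the sizes are illustrations only):

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', SIZE = 3GB, FILEGROWTH = 1GB)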

Q: My TempDB is 200GB. If I shrink it to 1 GB will it come back to 200 GB?

A: If you execute the same queries against the same data, yes. If the big query that caused TempDB to blow up to 200 GB was only a one-off, no.

Q: Is it worth shrinking TempDB?

A: It depends on your disk space. If you have ample disk space then you shouldn’t worry. But questions like this usually come up because disk space is limited. One point I’d like to make is: if you are working with a 1 TB warehouse and running queries against a 1-billion-row fact table, with joins to 5 dims and “sum group by” etc, don’t be surprised if TempDB is 200 GB. True, most SQL DBAs who have never administered a SQL warehouse would be surprised, as in their books a TempDB “should be” around 5-10 GB. But the TempDB of a SQL Server used for warehousing is different from one used for “data entry” applications (known as “transaction systems” in the BI world).

If you are working with a SQL warehouse, the question should be more about how to optimise TempDB performance rather than disk space, e.g. one TempDB data file per processor core, placing the TempDB files on different spindles, putting them on RAID 1+0 rather than RAID 5, etc. If your company can’t buy 1 TB of disk space then they probably shouldn’t do data warehousing. Almost by definition, data warehousing is about disk space.
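As a sketch of the “one data file per core” point (the file name, path and sizes are illustrations only), adding a second data file looks like this:

ALTER DATABASE tempdb ADD FILE (NAME = 'tempdev2', FILENAME = 'D:\TempDB\tempdev2.ndf', SIZE = 3GB, FILEGROWTH = 1GB)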

Q: Can I shrink TempDB without stopping SQL Server?

A: Yes, use DBCC SHRINKFILE (or DBCC SHRINKDATABASE). Consider the risk of data corruption, though: it looks scary for prod DBs but may be acceptable for dev.
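A sketch (the target size, in MB, is an illustration only):

DBCC SHRINKFILE ('tempdev', 1024)   -- shrink the TempDB data file to 1024 MB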

Standardising Entity and Column Names

Filed under: Analysis Services — Vincent Rainardi @ 12:48 pm

I was in a meeting this week about standardisation of attribute names throughout the enterprise. I realised that one of the most difficult duties of a data architect (or data architecture team) is to maintain a standard for the entity and column names. This includes the data warehouses and data marts (but not limited to just them).

This is a very difficult task that nobody is eager to do, because it is laborious and tedious, because it involves a lot of business discussions, over and over, and because it is nearly impossible to implement: after you have standardised 500 column names, changing many different systems throughout the enterprise is a multi-year project, costing a few million, with no ROI. It would probably take the board only 2 minutes to veto a project like that, mainly because of the ROI.

Because nobody wants to do it, this task usually falls onto the shoulders of a newly formed body called the Data Governance team (or “council”). Some companies call it the data quality team, which is not really the same thing. It is not as difficult to pitch the importance of having a data governance team, especially in industries that are heavily dependent on data, such as financial services. Almost by definition, the members of this team should be quite senior. Amongst other things, its main duty is to define the formal terminologies and business rules used by the business, and therefore used in IT. This team has the authority on everything related to data. They understand the business meaning, create data standards, and enforce compliance with these standards.

Because of this “enforcing the compliance” business, you need very senior people on it. You also need people who are very sound in terms of business knowledge, aka the “Business Architect”. And you need senior people in the DG team because they will need to sell the ideas of the “projects without ROI” I described earlier. They have to convince the other members of the board that there is a huge benefit in enforcing naming and rule conformance across all systems.

For large enterprises, data governance is a must. Not a luxury; a necessity. International banking groups such as Citi, BOA and RBS, for example, definitely need data governance. If they don’t have it, they are in trouble. For small companies, this responsibility falls on the shoulders of the data architecture team. Their job is not just designing databases, creating ERDs, etc, but also maintaining naming standards.

Changing column names will certainly break all the applications accessing the database. These applications need to be changed. Impact analysis, however detailed, will not be 100% accurate. There will always be columns that we miss in the impact analysis, e.g. we think app1 accesses 100 columns, but actually it accesses 120 columns. That’s why the testing must be thorough.

Yesterday was the fifth time I have come across a “naming standardisation” initiative, and based on past experience I’m a bit sceptical. If it’s only changing 1 or 2 apps, fine. If it’s only renaming the “front app”, e.g. BO / Cognos / RS reports, that’s fine. But if it’s 500 columns and 2 warehouses, 8 marts, 5 source systems, used by 22 apps, hmmm… is it worth it?

As usual I welcome any comments and discussion at vrainardi@gmail.com

Preserving the History of the Facts

Filed under: Data Warehousing — Vincent Rainardi @ 11:41 am

One of the questions frequently asked by somebody who is learning dimensional modelling is “How do we preserve the history of the facts?” This question usually comes up after they learned about SCD (Slowly Changing Dimension), which made them realise that we could preserve the history of dimensional attributes.

What they meant by the word “fact” was the measure, i.e. the numerical column in the fact table. For example, in an insurance warehouse, if in the source system a premium of £100 later changed to £200, what would happen to the fact table? Earlier today somebody came to me with this question.

I explained to him the concept of a “Transaction Fact Table” versus a “(Periodic) Snapshot Fact Table”, and that in the case of a Transaction Fact Table we overwrite the £100 with £200, while in the case of a Snapshot Fact Table (say a daily snapshot) we store £100 for snapshot date1 and £200 for snapshot date2. E.g. if the premium was £100 on 1st July and on 10th July it was changed to £200, then in the snapshot fact table the premium is £100 for snapshot dates 1st to 9th July, and £200 from 10th July onwards.

So a simple answer to the question of “how do we preserve the history of the fact” is: using snapshot fact table.
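As a sketch (the table and column names are made up), the daily snapshot fact table from the premium example holds one row per policy per snapshot date:

CREATE TABLE FactPremiumDailySnapshot
( SnapshotDateKey int,   -- e.g. 20100701
  PolicyKey       int,
  PremiumAmount   money
)
-- snapshot dates 1st to 9th July: PremiumAmount = 100
-- snapshot dates 10th July onwards: PremiumAmount = 200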

Then he went on to say, “But what if it’s a retail case and the price of the product is updated from £100 to £200, and it’s not overwritten, but inserted with a different date?” I explained that the price is encapsulated in the sales transaction. E.g. sales1 on 1st July for product1 has price = £100. If on 10th July the price of product1 is changed to £200, then sales2, which happened on 10th July, will have price = £200. So in the transaction fact table there will be 2 rows, one for each sales transaction: sales1 has price = £100 and sales2 has price = £200. So it is the sales transaction that is captured, and the price is captured as “part of the sale”.
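A sketch of those two rows in the transaction fact table (table and column names are made up):

FactSales (transaction grain, one row per sales transaction):
SalesDateKey   ProductKey   UnitPrice
20100701       1            100
20100710       1            200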

So this case is not about preserving the history of the fact. It’s a different matter: the price is “captured” every time there is a sales transaction.

Would you store the price as a dimension? Hmmm, that’s for another blog post. But the short answer is (as always in dimensional modelling): it depends on how it would be used. Generally it is better to store price as a fact (with a date/time key) rather than as a dimension. In retail it is price; in Forex, banking and insurance it is a “rate” rather than a “price”.

“But what if”, he asked again, “in the first case above (premium), the snapshot fact table is monthly, and the premium is changed from £100 to £200, then back to £100 again, within the same month? Would we lose the history?” What he meant was: say on 1st July the premium was created as £100, then on 10th July it was updated to £200, then on 20th July it was changed back to £100. In this case the July row for that premium in the snapshot fact table would be £100. Yes, we would lose the “history”, i.e. the warehouse never captures the fact that the premium was £200 for 10 days. But if the snapshot fact table were daily, the warehouse would capture it.

Hope this is useful. As usual I welcome any comments and questions on vrainardi@gmail.com.
