Parallel Data Warehouse is the formal name Microsoft has given to Project Madison. I have been waiting for this for a long time. After doing SMP warehousing in SQL Server, I did MPP warehousing on Teradata for a while, and it wowed me. The performance, I mean: both loading performance and query performance. It was in a different league from SQL Server. I also admired the way it worked. I got infected by two colleagues who worked for Travelocity at that time. They were so enthusiastic about MPP technology and about data warehousing. At that time I was thinking that Oracle and Microsoft must be planning to get into this MPP DW market. Then, when I was doing SQL Server warehousing again, I came across Netezza. It didn’t really wow me this time, as I already knew how MPP worked. Basically Netezza works the same way, just a different variant, at a lower price.
Then about 2 years ago I heard that Microsoft had acquired DATAllegro. And then I read about Professor David DeWitt leading the work on parallel query optimisation under the name “Project Madison”, as the main work is done at the Jim Gray Systems Lab in Madison, Wisconsin. In the database world David DeWitt is well known; see this work with Jim Gray, this and this. He has implemented three parallel database systems: DIRECT, Gamma and Paradise. At that time I read that the project was essentially replacing Ingres with SQL Server, and Linux with Windows. But once I knew that Prof DeWitt was involved, I had high hopes for Project Madison. I was hoping that:
- The build is solid, especially the internal network between the nodes and the parsing engine (data distribution mechanism/hashing, query/workload distribution, distinct count handling, etc). By “solid” I mean the fault tolerance capability, e.g. if the primary node fails then the data is read from or written to the secondary node.
- The performance scales linearly, and is in the same league as Teradata. The query and loading performance of an MPP DW depends on the performance of its nodes. But to a large degree it is the way the nodes collaborate (shredding and distributing queries, then combining the results) that determines how linearly the overall system scales.
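To make the two hopes above a bit more concrete, here is a minimal Python sketch of the general MPP idea: rows are hash-distributed across nodes, each node's slice has a mirror, a query is run per node and the partial results are combined, and a failed node's slice is read from its mirror. This is only an illustration under my own assumptions. It is not how SQL Server PDW, Teradata or Netezza actually implement it, and every name in it (`node_for`, `distribute`, `read_slice`, the 4-node cluster, CRC32 as the hash) is hypothetical.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Map a distribution key to a node using a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes=NUM_NODES):
    """Hash-distribute rows; keep a mirror copy of every node's slice."""
    primary = defaultdict(list)
    mirror = defaultdict(list)  # assumed to live on a different physical node
    for row in rows:
        n = node_for(row[key_column], num_nodes)
        primary[n].append(row)
        mirror[n].append(dict(row))
    return primary, mirror


def read_slice(n, primary, mirror, failed):
    """Read node n's slice, falling back to the mirror if node n is down."""
    return mirror[n] if n in failed else primary[n]


def parallel_sum(primary, mirror, column, failed=frozenset(), num_nodes=NUM_NODES):
    """Each node sums its own slice; the coordinator adds the partial sums."""
    partials = [
        sum(r[column] for r in read_slice(n, primary, mirror, failed))
        for n in range(num_nodes)
    ]
    return sum(partials)


def distinct_count(primary, mirror, column, failed=frozenset(), num_nodes=NUM_NODES):
    """Per-node distinct counts cannot simply be added up, so each node
    returns its set of values and the coordinator unions them."""
    seen = set()
    for n in range(num_nodes):
        seen |= {r[column] for r in read_slice(n, primary, mirror, failed)}
    return len(seen)


# Usage: 100 sales rows distributed on the customer column.
rows = [
    {"customer": f"c{i}", "region": ["north", "south", "east"][i % 3], "amount": i}
    for i in range(100)
]
primary, mirror = distribute(rows, "customer")
total = parallel_sum(primary, mirror, "amount")
total_with_failure = parallel_sum(primary, mirror, "amount", failed={1})
regions = distinct_count(primary, mirror, "region")
```

The sum gives the same answer with or without node 1, which is the kind of "solid" I mean. The distinct count shows why that operation needs special handling: the same region appears on several nodes, so the coordinator has to merge values rather than add counts, unless the table happens to be distributed on that column.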
Last month at SQLBits I attended Thomas Kejser’s session on Project Madison/R2’s Parallel Data Warehouse and I was disappointed. I thought I would be able to see the details of how it works. But no, sadly they were not revealed. I was looking for a) the DDL syntax, i.e. the ‘create table’ statement and how and where we place data and indexes, and b) how it works technically. I know how MPP works in general, so I was looking for how Microsoft’s version works. I admire Thomas for his technical knowledge and his work at SQL CAT. I have learned many things from him, DW loading being one of them and SSAS processing phases being another.
After Thomas’ session I met Anthony Howcroft, who spoke at the next session (the Fast Track). He was with DATAllegro prior to the acquisition. I wish him the very best of luck in marketing PDW to the European MPP market. I believe that at Microsoft’s price point (which I expect to be lower than Netezza’s) it may be able to take off. I believe that the price barrier is one of the determining factors in why there are only 30-35 Teradata customers in the UK (that was what I heard 3 years ago, not sure how many now). I believe that many companies (banks, insurers, healthcare, retail) need MPP/PDW, and there could be hundreds of them. Definitely not 35.
I must admit PDW is a good name. Outside the computer science world, the term “MPP” is not that popular. Say it slowly and hear it for yourself: Parallel Data Warehouse. Massively Parallel Processing. The former sounds familiar and is easily understood. The latter sounds like a laboratory, academic. Naming a product after what the product does is, I must admit, a very clever idea. “Parallel Data Warehouse”, what a clever name.
If the PDW market is hundreds of companies in the UK (say 500) and thousands in Europe (say 4000), then there is hope that it is an attractive market. Skills market I mean, job market. Worldwide we could be looking at 20,000 companies using PDW. Not SQL Server PDW specifically, I meant MPP DW. I’m not going to use the term ‘MPP DW’ anymore; I’m going to refer to MPP DW as PDW, and I will refer to the Microsoft version as SQL Server PDW. See, it is very clever marketing, isn’t it, calling that new product Parallel Data Warehouse? So, yes, 20,000 worldwide. Good job market? Not sure if it’s good, but it’s certainly not a narrow one. Teradata was marketed as ‘no DBA required’, ‘save money on human resources’. I wonder what Microsoft’s strategy is. Is it the same as Teradata’s: SQL Server PDW is easy, you don’t need a DBA? That is an inherently dangerous statement, isn’t it? To the SQL DBA market I mean. And not getting support from the SQL DBA market worldwide would be suicide for SQL Server PDW. So no, that wouldn’t be a good strategy.
I am very much hoping (and predicting) that Microsoft will go the other way. Market SQL Server PDW as ‘a step up from SQL DBA’ kind of thing. Make PDW the best friend of SQL DBAs. Create the exam. Be certified. This is the daddy of all databases. Come this way and get better pay, etc, etc. Data warehouse designers and developers will find it attractive too. With the staggered approach into the big DW market (Fast Track on level 1, SQL Server PDW on level 2), supported by the hub-and-spoke concept, SQL Server 2008 R2 Parallel Data Warehouse is in a very good position to get a significant share of the parallel data warehouse market, competing against Exadata, Netezza, Teradata, Greenplum and Neoview.