Data Platform and Data Science

30 July 2023

Why they are called FLAT files

Filed under: Data Warehousing — Vincent Rainardi @ 4:00 pm

Do you know why they are called “FLAT” files? What is flat about them? Which part of the file is flat?

There are two reasons why they are called FLAT files:

1. They are called FLAT files because the content is flat, meaning it does not have levels (hierarchy). For example, a JSON file can contain levels/hierarchy: the patient_id key is on the first level, whereas the doctor keys such as medical_field are on the second level (in JSON the attributes are called keys, and the data are called values).

A flat file, in contrast, does not have hierarchy or levels. Everything is on the same level: in a flat file there is only one level, and every line represents a row.
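To make this concrete, here is a small Python sketch (the patient and doctor fields are made up, just to mirror the example above). The first structure is hierarchical; the second is the flat, one-row-per-line equivalent.

```python
import json

# Hierarchical (not flat): patient_id is on the first level,
# the doctor keys such as medical_field are on the second level.
patient = {
    "patient_id": 123,
    "name": "Jane Doe",
    "doctor": {
        "doctor_id": 45,
        "medical_field": "Cardiology",
    },
}
print(json.dumps(patient, indent=2))

# Flat: everything is on one level, and every line represents a row.
flat_lines = [
    "patient_id|name|doctor_id|medical_field",
    "123|Jane Doe|45|Cardiology",
]
print("\n".join(flat_lines))
```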

2. The second reason they are called FLAT files is that the content is FLATTENED, meaning it is not relational (tables related to other tables).

Consider, for example, a relational database with two related tables, A and B.

The content of tables A and B can be “FLATTENED” into a single table (this is called denormalisation), and that single table, when exported to a file, is called a FLAT file. It contains rows and columns (because it is a single table), as in the sketch below.
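As a small sketch of that flattening (the tables and columns below are invented for illustration), here is the denormalisation done with pandas: two related tables joined into one single table, which is then exported as a flat file.

```python
import pandas as pd

# Table A: customers (parent). Table B: orders (child), related via customer_id.
customers = pd.DataFrame({"customer_id": [1, 2],
                          "customer_name": ["Acme", "Brown Ltd"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [100.0, 250.0, 75.0]})

# "Flatten" (denormalise) the two related tables into a single table.
flat = orders.merge(customers, on="customer_id", how="left")

# Export that single table as a flat file: one line per row, one field per column.
flat.to_csv("orders_flat.csv", index=False)
print(flat)
```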

That is why they are called FLAT files. Flat means no levels/hierarchy. And it can also mean related tables flattened/denormalised into a single table.

Flat files consist of lines and fields. The lines represent rows and the fields represent columns (in a table). Flat files come in one of these two shapes (a short parsing sketch follows the list):

  1. Delimited: in a line, fields are separated by a certain character.
  2. Fixed length: in a line, fields have fixed length.
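A minimal Python sketch of the two shapes (the columns and widths are made up):

```python
# 1. Delimited: fields in a line are separated by a certain character (here a pipe).
delimited_line = "123|Jane Doe|Cardiology"
print(delimited_line.split("|"))          # ['123', 'Jane Doe', 'Cardiology']

# 2. Fixed length: each field occupies a fixed number of characters.
#    Assume widths of 5, 10 and 12 characters for the same three fields.
fixed_line = "123  Jane Doe  Cardiology  "
widths = [5, 10, 12]
fields, pos = [], 0
for w in widths:
    fields.append(fixed_line[pos:pos + w].rstrip())
    pos += w
print(fields)                             # ['123', 'Jane Doe', 'Cardiology']
```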

24 July 2023

Building Data Lake on Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 3:17 pm

Building a data warehouse/mart on Snowflake is quite obvious, but how do you build a data lake on Snowflake?

It turns out that finding the answer is quite tricky. No one, including Snowflake, is clear on the technical architecture (see Ref #1 below). That page explains the benefits without explaining how to build it. The book on that page (Ref #2) goes to great lengths to avoid answering the question: it explains what data lakes are, the benefits and the use cases, but does not explain how to build a data lake on Snowflake. This made me think that the true answer is probably: you can’t.

The best clue is given by Prathamesh Nimkar (Ref #3), who explains that it uses external tables:
(image credit: Prathamesh Nimkar)

The above diagram is in line with what I wrote in January 2022 (Ref #5) and with what Saurin Shah says (see Ref #6). Notice that in the diagram the data lake is not on Snowflake; it is on S3 or ADLS. So yes, you can’t create a data lake on Snowflake. You create it on S3, then create the File Format and external tables on Snowflake so you can query it using SnowSQL. Then you can do the medallion model (the Silver and Gold layers) on Snowflake, like this (I’ll simplify to just JSON and CSV files on S3):
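Here is a rough sketch of that pattern using the snowflake-connector-python library to run the DDL. All the names (bucket, stage, file format, tables, columns) are hypothetical, and the exact options you need will depend on your account, so treat it as an outline rather than a recipe.

```python
import snowflake.connector

# Connect to Snowflake (credentials and account are placeholders)
conn = snowflake.connector.connect(user="...", password="...", account="...",
                                   warehouse="...", database="LAKE", schema="BRONZE")
cur = conn.cursor()

# 1. A stage pointing at the S3 bucket where the data lake files live
cur.execute("""
    CREATE OR REPLACE STAGE lake_stage
    URL = 's3://my-data-lake/sales/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# 2. A file format describing the CSV files in the lake
cur.execute("""
    CREATE OR REPLACE FILE FORMAT csv_format
    TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1
""")

# 3. An external table over those files -- the data itself stays on S3 (the Bronze layer)
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE ext_sales
    WITH LOCATION = @lake_stage
    FILE_FORMAT = (FORMAT_NAME = 'csv_format')
""")

# 4. A materialised view on Snowflake over the external table (towards Silver/Gold)
cur.execute("""
    CREATE OR REPLACE MATERIALIZED VIEW silver_sales AS
    SELECT value:c1::DATE AS sale_date, value:c2::NUMBER(18,2) AS amount
    FROM ext_sales
""")
```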

As shown above, you can create materialised views on the external tables. Reporting and analytics mostly read from the Gold layer, while ML mostly reads from the Bronze layer. I said “mostly” because some ML reads from the Gold layer and some reports read from the Bronze layer.

There you go, that’s how you build a data lake on Snowflake.

References:

  1. Snowflake for Data Lakes, by Snowflake: link.
  2. Cloud Data Lakes for Dummies, by Snowflake (David Baum): link.
  3. Augment your Data Lake Analytics with Snowflake, by Prathamesh Nimkar: link.
  4. Introduction to Data Lakes, by Databricks: link.
  5. Data Lake Architecture, by Vincent Rainardi: link.
  6. External Tables are Now Generally Available on Snowflake, by Saurin Shah: link.
  7. Emerging Architecture for Modern Data Infrastructure, by Matt Bornstein, Jennifer Li, and Martin Casado: link.

23 July 2023

Data Architect jobs in UK

Filed under: Data Architecture — Vincent Rainardi @ 3:35 pm

On JobServe in the last week there were 29 data architect job ads in the UK (both perm and contract). The frequency of terms within those ads is as follows:

We can see here that Governance, Design and Analytics are the most popular terms, whereas Warehouse, Lake, Snowflake, TOGAF and NoSQL are the least popular. This is quite surprising to me because I thought Warehouse and Lake would be much higher.

The term “SQL” is another surprise: only 7% of ads contain the word SQL. This indicates that employers do not expect data architects to be fluent in SQL queries.

The other thing we can learn from these numbers is about cloud platforms: 28% of ads mentioned Cloud and 24% mentioned Azure, but AWS and GCP are at only 10% each. This shows that Azure is more in demand than AWS or GCP, at least for this week (17th to 23rd July 2023).

Looking at the 31%, there are 3 skills there. I get why Model and Quality are there, because data modelling and data quality are part and parcel of data architecture. But the word “Security” is at the same level of demand as Model and Quality, which shows that data security is a very important part of a data architect’s job.

If we look at the 7%, all three of them surprise me: SQL, DevOps and Standard. 7% is way too low for them. SQL, DevOps and data standards are very important in data architecture; in my opinion they are part of the job. But apparently employers don’t really ask for them.

TOGAF is another area of declining demand: cloud skills are 9 times more in demand than TOGAF (28% vs 3%).

That’s it. That’s what the hiring managers are looking for when they hire a data architect in the UK this week.

18 July 2023

Learning AI – Where to start?

Filed under: Machine Learning — Vincent Rainardi @ 5:29 pm

Yesterday someone commented on one of my recent articles. In that article I said something like “What if you don’t do ML? Well then you really need to look at yourself in the mirror.” I apologise if that was a bit harsh. I think it was the “Not a Data Warehouse, but a Data Platform” article. The commenter said that he has built data warehouses for many years and now it’s time to learn AI, but there is a lot of material. So the question is: where to start? Is there good introductory material, like a book or a course?

Yes, I appreciate it can be quite daunting, because there are many faces of AI. On the one hand we have prediction and forecasting, like predicting sales or forecasting stock prices, maybe using Linear Regression, Random Forest or a neural network technique like LSTM. Then we have something completely different: NLP (Natural Language Processing), the likes of chatbots, Alexa and ChatGPT. And then we also have self-driving cars, which employ image classification, video recognition (on moving objects), reading text and understanding its meaning, interpreting data feeds from radars and lidars, and controlling the steering wheel, brakes and accelerator.

Then we have robotics, which also uses AI, with sensors and actuators/motors. And we have generative AI, which can make music/songs, paintings, films, poems, etc. It seems that everything is so different, and yet they all use AI. So if you want to learn AI, where do you start? Oh, and then there are things like AlphaGo, chess and other games, which use Reinforcement Learning. That’s another completely different family of AI algorithms.

And yet all those introductory materials talk about supervised and unsupervised algorithms, which makes it seem like there are 2 broad types of AI. To many people it doesn’t feel like there are only 2 types of AI; it feels like there are 20 different types. How can we comprehend it all?

So let me paste last night’s answer here first, then I’ll add a few comments at the end.

Hi, it’s good to start with fundamentals of ML (Ref #1 and #2 below), then go through ML algorithms in the order below:

1. Linear Regression and Logistic Regression
2. Naive Bayes
3. Support Vector Machine
4. Decision Tree and Random Forest
5. Gradient Boosting
6. Clustering
7. Natural Language Processing
8. Large Language Model
9. Convolutional Neural Network
10. Recurrent Neural Network
11. Reinforcement Learning

When learning ML algorithms, I would recommend not going too deep into the nitty gritty (how to code them, etc.), but only deep enough to answer these 3 questions:
1. What is that algorithm for?
2. How does that algorithm work? (at a high level)
3. What real world cases can be solved by that algorithm?

Then learn how to select the right algorithm to use, and also about feature selection. Again, only at high level (how it works), don’t get into code level.

By this point you will have an understanding of what AI/ML is in great detail, even though you can’t code yet. You know how various real world problems are solved using ML, and which algorithm to use. So now it’s time for the second part. This part consists of 3 things:
1. Statistics
2. ML Algorithms
3. DevOps

For the Statistics part, learn about Exploratory Data Analysis, probability, inferential statistics and hypothesis testing, all with coding (a small sketch follows below).
For the ML Algorithms part, relearn the 11 ML algorithms above, but now it’s about how to build them (yes, with coding).
The last part (DevOps) is about understanding the software/platform that you need to develop ML models, deploy them into production, and monitor them. It is best to learn this from the vendors; see the list below.
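To give a flavour of the statistics part, here is a tiny sketch of a hypothesis test in Python (the numbers are made up): a two-sample t-test with scipy, the kind of thing you would code up when learning inferential statistics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two made-up samples, e.g. a metric from an A/B test
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=10.5, scale=2.0, size=200)

# Hypothesis test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Cannot reject H0")
```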

At this point you know how to develop ML models for various cases. You can work in AI as an ML engineer, an AI developer. But at this point you’ll realise (as I did) that you are only at the beginning of a very long journey. When Geoffrey Hinton talks about capsule networks, you are able to understand what he’s talking about (Ref #3 below). When you read in the newspaper about ChatGPT, you know what GPT is and how it works. It is because you have enough knowledge now that you feel small in a big world. At this point I say: welcome to AI. You are now inside the bubble. You are now an insider, inside the world of AI. And you realise that it is a very large world. But the feeling is amazing compared to being the outsider you once were, because now you understand the technicalities and the nitty-gritty. How it all works.

References:
1. The Fundamentals of Machine Learning by Interactions: link
2. Machine Learning Fundamentals by Jean de Dieu Nyandwi: link
3. Capsule Neural Network: link

List of ML DevOps, by vendors:
1. Microsoft Azure: link
2. Databricks: link
3. AWS: link
4. GCP: link
5. Dataiku: link

———————————————-

Right, that was my answer. Now my comments:

I think you should go through it in two stages like above (keep it shallow the first time around, then dive deep on the second round) for 2 reasons:

  1. AI is large; get the big picture first so you understand the various flavours of AI. If you dive too deep the first time, you might not have time to go through them all and therefore won’t get the complete picture. It’s like an ice cream shop with 20 different flavours: if you have a big scoop you’ll only know the taste of one or two flavours, but if you have a tiny spoonful of each flavour, you’ll get to taste them all.
  2. You may NOT WANT to be able to code all the algorithms. Maybe you don’t like LLMs (Large Language Models), maybe you don’t like RL (Reinforcement Learning). But if you don’t know what they are and what they can be used for, how could you decide which ones not to like? You can’t, right? You need to know them a little bit to be able to decide.

Second comment: it is important to learn about ML Ops for 2 reasons:

  1. It is different to the usual DevOps. Usually it is: gather user requirements, create a BRD (business requirements document), write the functional spec, technical spec, data model and report design, then testing (unit test, system test, UAT), and then CI/CD, release pipelines and monitoring/operation. But in ML Ops it is different: EDA, then feature engineering, then model building, model training, model evaluation/analysis, looping back to model building, before the release-deploy-operate-monitor stage. And before EDA, in ML Ops we have a full cycle called the data cycle: design, collection, curation, data transformation/augmentation, data validation, data cleansing/quality, data prep and data processing. There’s no point going into the nitty gritty here; suffice to say that in ML the DevOps is different to the usual data or software development process.
  2. You may be able to create a good model, but if you don’t know how to deploy it to production then it won’t get used. Plus, model deployment (and monitoring) is a completely different ball game to model development. You might be good with Python, Numpy, Pandas, Colab, Keras, etc., you may be very good at creating models, you may know how to combine different algorithms into one model, but if you don’t know how to deploy it, then that very good model will stay in development.

That’s it. So, learning AI, where to start? Start by getting a good intro by reading the 2 links above (Ref #1 and #2), then go through all the algorithms just deep enough to answer 3 questions: what is it for, how does it work, and what real world cases can it solve.

Happy learning!

16 July 2023

Not a Data Warehouse, but a Data Platform

Filed under: Data Platform — Vincent Rainardi @ 8:03 am

I’ve been building data warehouses since 2005. First it was for a utility company (waste and water). Then a travel company, healthcare, insurance, banks and asset management companies.

A data warehouse is very useful for BI. And for reporting. And for “data mining”. And for answering questions. For decision making too. We design the schema, then load the data. Then we build the BI data model around that data and use a BI tool to visualise that model (Qlik, Tableau, Power BI, take your pick). Or, in the old days, SSAS, SSRS, TM1, Hyperion, Cognos, Business Objects, ProClarity. Yup, I’ve been in this BI business since 2005. 18 years!

Building a DWH took time. About a year or two. You can build it on any database: SQL Server, Oracle, Netezza, Teradata, Exadata, take your pick. Yup, I’ve built data warehouses on all those databases, and then built BI/reports on top.

But in the last 3 years I built a data platform. We built a data platform (there were 10 of us). Initially we built a data warehouse, but then migrated to a data platform.

What exactly is a data platform? It’s a data lake + data marts. Plus data sharing, data APIs, data quality, data governance, data dictionary, data catalogue and data ingestion. Yup, sometimes with data streaming too. The whole shebang.

Why a data platform, not a data warehouse? Because of its ability to share raw data. While some users need BI reports, some users need the raw data. Picture this: you buy data from external companies (in investment banking we have Markit CDS prices, Bloomberg securities and issuer data, and FX rate data from Reuters, just as an example) and your users need to see this raw data. How do you do it? Are you going to share the staging tables in your warehouse? No, they are emptied and reloaded every time (kill and fill, as they say, or truncate-reload). The users need to see yesterday’s data too, not just today’s. Eventually you’ll end up sharing a folder where all those daily files are placed.

The second reason is your data scientists. Many companies nowadays have data scientists. Instead of BI, they produce predictive analytics (I wrote about this a few days ago; see my article “It’s not BI any more, it’s Analytics!”). They need to write Python notebooks, use ML libraries like Scikit-learn, PyTorch and Keras, data manipulation and analysis libraries such as Pandas and Numpy, and visualisation libraries such as Matplotlib and Seaborn, and place the output back into the lake. Initially into a “sandbox” area, then into a persistent area once the process has been productionised. A data lake provides this environment. Those Jupyter notebooks need to read/consume those external and internal data sources, all of which are available in the lake.
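As a rough sketch of that workflow (the paths and columns are made up; in reality the notebook would read from S3/ADLS rather than local files): read raw data from the lake, train a simple model with Scikit-learn, and write the scored output back into a sandbox area of the lake.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read raw data from the lake (e.g. the Bronze layer)
df = pd.read_parquet("bronze/customers.parquet")

# Train a simple churn model
features = ["tenure_months", "num_products", "balance"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))

# Score all customers and place the output back into the lake (sandbox area)
df["churn_probability"] = model.predict_proba(df[features])[:, 1]
df[["customer_id", "churn_probability"]].to_parquet("sandbox/churn_scores.parquet")
```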

A data lake has 3 layers, known as the “medallion” architecture because they are Bronze, Silver and Gold (see Ref #1 below). The Bronze layer contains raw data; the Silver layer contains data which has been filtered, cleaned and augmented; the Gold layer contains business-level aggregates/summary data. Everybody’s interpretation is different, but that’s the most popular view.
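A toy sketch of those three layers using pandas (invented columns; in a real lake this would typically be Spark over Parquet/Delta files):

```python
import pandas as pd

# Bronze: raw data, exactly as it arrived from the source
bronze = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "country": ["UK", "uk", "uk", None],
    "amount": [100.0, 250.0, 250.0, 75.0],
})

# Silver: filtered, cleaned and de-duplicated
silver = (bronze.drop_duplicates(subset="order_id")
                .dropna(subset=["country"])
                .assign(country=lambda d: d["country"].str.upper()))

# Gold: business-level aggregates ready for reporting
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```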

With a data lake, you can provide business-level data without building a data warehouse. But in most cases you will still need to build some kind of data warehouse. Not a big data warehouse covering all the data in the lake, but a small data mart covering just one business area. For example: just sales, or just finance. Or just client reporting. Or just pricing. This way, it is quick to build and benefits the users straight away (a quick win).

Architecturally a Data Platform looks like this:

Why do you still need to build data marts (star schemas) when you have a data lake? Because most BI tools work best with a star schema. Many reporting tools too. Can you store the star schema in the data lake? Yes you can, especially if your lake has ACID properties like Databricks Lakehouse (see Ref #2 below).
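For example, here is a minimal sketch of storing a small star schema as Delta tables (the schema, table and column names are hypothetical, and it assumes a running Spark session with Delta enabled, e.g. on Databricks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny dimension and fact table for the star schema
dim_customer = spark.createDataFrame(
    [(1, "Acme"), (2, "Brown Ltd")], ["customer_key", "customer_name"])
fact_sales = spark.createDataFrame(
    [(1, "2023-07-01", 100.0), (2, "2023-07-02", 250.0)],
    ["customer_key", "sale_date", "amount"])

# Store them in the lake as Delta tables (ACID, so they can be updated safely).
# Assumes a "gold" schema/database already exists.
dim_customer.write.format("delta").mode("overwrite").saveAsTable("gold.dim_customer")
fact_sales.write.format("delta").mode("overwrite").saveAsTable("gold.fact_sales")

# BI tools can then query the star schema with plain SQL
spark.sql("""
    SELECT c.customer_name, SUM(f.amount) AS total_sales
    FROM gold.fact_sales f
    JOIN gold.dim_customer c ON f.customer_key = c.customer_key
    GROUP BY c.customer_name
""").show()
```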

So there you have it. You should not build a data warehouse, you should build a data platform. Because analytics is not just BI; it covers ML output too. What if you don’t do ML? Well then you need to look at yourself in the mirror. Really, in this day and age you don’t want to do AI? You don’t find any use cases for AI in your company? It’s probably time to move company then! Even the agriculture business has AI. From fisheries, pharmaceuticals, finance, mining and retail to telecoms, everybody uses AI. Businesses without AI might not survive the next decade.

Can machine learning read data from the warehouse (so you don’t have to build a data platform, just a data warehouse)? Yes it can, but it is not versatile/flexible. If there is a schema change in the data source, the data warehouse is not as flexible as a data lake (because a warehouse is “schema on write”, not “schema on read”; see Ref #3 below). And building a warehouse just to feed ML is not the right approach because it would take a year or two (and cost a lot). Feeding ML from the lake is the right approach.

The other reason for having a data lake is semi-structured data like JSON, and unstructured data (weather, geospatial, images, etc.), which is often required by machine learning. If you want to read the detailed arguments, please read Danil and Lynda’s book in Ref #4 below.

So there you go. You should not build a data warehouse, you should build a data platform.

References:

  1. Medallion Architecture by Databricks: link.
  2. Databricks Lakehouse: link.
  3. Schema on read vs on write by Deepa Vasanthkumar: link.
  4. Designing Cloud Data Platform by Danil Zburivsky and Lynda Partner: link.

13 July 2023

It’s not BI any more, it’s Analytics!

Filed under: Business Intelligence — Vincent Rainardi @ 8:16 am

5 years ago everybody called it BI. Business Intelligence. It sounds great (intelligent, smart) and it has a great meaning too: it means that the company has business insight. They understand the trends happening in their business. Which customer segments need attention, which product lines are in demand. It means that the management knows how the business performs. Keep their finger on the pulse, as they say. And more importantly, that company is perceived by the market as a company that makes decisions based on data, not based on instinct.

In the job market, we say “BI Developer”. In the software market, we say “BI Tools”. It is called Power BI*. Tableau is a BI tool. In IT department we call it “BI reports”. BI Developers are people who develop BI reports, using BI tools.

Note: Microsoft is well known for naming their products using plain English words that describe what they are. For example: Microsoft Word, SQL Server, SQL Data Warehouse and Power BI.

But today that word is obsolete. The right word is Analytics.

With the advance of AI, we now have 4 types of analytics (2D2P). They are: descriptive analytics, diagnostic analytics, predictive analytics and prescriptive analytics. The 2 Ds are the old BI (descriptive and diagnostic analytics). The 2 Ps are AI-based (predictive and prescriptive analytics).

What are those 4 types of analytics? Please read Ref #2 below which provides a good explanation. Basically:

  • Descriptive analytics: describes the past
  • Diagnostic analytics: explains why an event happened
  • Predictive analytics: what will happen in the future
  • Prescriptive analytics: actions we should take

The 3rd one (predictive analytics) is extremely valuable. It predicts the future. Which customers are likely to leave your business this year? Which products are likely to decrease in sales in the next few months? Every business wants to know the future.

One of the jobs on JobServe I was looking at this morning is from an automotive retailer. They are hiring a data scientist to do machine learning forecasting based on the weather! It is important for retailers (shops) to forecast sales based on the weather, economic indicators (like interest rates) and other factors.

My dissertation is about stock forecasting. For investment companies it is important to forecast the value of their holdings based on various factors, including past prices (this is called time series forecasting; see Ref #3).
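As a tiny flavour of time series forecasting (purely illustrative: the “prices” below are a random walk, not real data), fitting an ARIMA model with statsmodels and forecasting a few steps ahead:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Made-up daily "prices": a random walk standing in for a real stock series
rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 250)),
                   index=pd.date_range("2022-07-01", periods=250, freq="B"))

# Fit a simple ARIMA(1,1,1) model on the history
fitted = ARIMA(prices, order=(1, 1, 1)).fit()

# Forecast the next 5 business days
print(fitted.forecast(steps=5))
```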

Another predictive analytics that banks and investment companies do is credit rating. Which companies will move from credit rating AA plus to AA minus in the next 12 months? Which industry sector will experience a general decline next year? You get the picture.

Descriptive and diagnostic analytics are BI, but predictive and prescriptive analytics are AI-based. BI explains the past, whereas the 2P analytics explain the future (using machine learning or AI). So it’s up to you: if you just want to explain the past, use BI; if you want to predict the future, use the 2P. When you say BI, it means just the 2D. When you say Analytics, it means all four. And in many cases what we mean is all four. That’s why we should say Analytics, not BI.

So there you have it. It is not BI any more, it is Analytics!

Reference:

  1. BI vs Business Analytics by Tableau: link
  2. Four Types of Analytics by Analytics8: link
  3. Complete Guide to Time Series Analysis and Forecasting by Marco Peixeiro: link

12 July 2023

Good Bye Data Centre – It must be in the Cloud

Filed under: Data Warehousing — Vincent Rainardi @ 10:41 am

I remember in 1998 I was working for a manufacturing company in Essex and we had a server room at the back of the building. Database servers, email servers, web servers, application servers, storage (SAN, DAS) and switches (Cisco) were all stored there in the cabinets. Behind the cabinets there were lots of yellow Cat 5 cables connecting everything. In 2005 I was working for a utility company near London and we had the same thing: a server room. But we also rented a few cabinets in a data centre. We put some servers in our building, and some in the data centre.

In the following 10 years I worked for various companies in London (banks, insurance, asset management, healthcare, travel) and they always had a data centre. They managed the servers in the data centre. They did the installation, the upgrades, the patching, antivirus, the whole shebang. We had network engineers, server administrators, DBAs, infrastructure people working on those servers.

For us who worked in data warehousing, the database servers like Oracle and SQL Server were in the data centre. Getting a new database server took something like 2 months. We had to spec it, raise a purchase order, get it approved, wait a few weeks for it to arrive, install the server in the cabinet, install the data storage (SAN), install the O/S, carve the storage, install the database engine and configure it, and give the developers access to it. That’s the Dev SQL server done. A few months down the line, when we were about to go live, we did the same thing for the Prod server (and did the testing on the Prod server as it was still empty). A few months later we repeated the whole thing when we purchased a Test server (because testing in production is a big NO).

You see, at this rate, nearly half of the development team’s time was spent on getting the servers ready (I was a development manager at that time, with a small team of 5). We had web servers, SQL servers and app servers. We had 3 environments: Dev, Test, Prod. And later on a 4th one: Staging.

I don’t know about the US and Australia, but until 5 years ago that was what was happening with IT teams in the UK. Data centre was the name, data centre was the game. Everybody had a data centre, with a DR plan to flip over to a second data centre. Oh yes, we had servers marked as Tier 1, Tier 2, etc. Tier 1 servers were replicated (or backed up and restored) every few hours to the DR site, Tier 2 servers once a day, and Tier 3 once a week. And just like a fire drill, we had a “DR drill” every 6 months. It was a nightmare. It took many goes before we got it right. We had to repeat the DR run book a few times before everything flipped over correctly. It took a huge amount of time from many different teams: infrastructure, operations, development (yes, the development team!), helpdesk and management. Basically everybody was involved in managing the servers!

Note: DR = Disaster Recovery. Basically it’s a plan for what we are going to do if we lose the whole building (bomb, fire, flood, earthquake, you name it). Nowadays it is called BC, which stands for Business Continuity.

During those times I often thought: “How nice it would be to be a business user!” They didn’t have to deal with these things, they just used the data. They enjoyed the BI, while we dealt with all that malarkey.

If installing a SQL Server, Oracle or Informix server sounds difficult, try installing Teradata, Exadata and Netezza! I’ve been there and they were a nightmare, believe me.

Oh well, those were the times. Fast forward to today, and now we have Azure, AWS and Google Cloud. This is the era when we in the data business say “DR? What DR?” with a big smile on our faces. It’s a huge relief for everybody not having to do DR. Let Amazon take care of it (or Microsoft, or Google). They have redundancies all over the place (servers, not people!). Database servers are replicated to another set of servers in another building, in another city, or in another country, take your pick.

When we need a new database server, it takes just 5 minutes to “spin it up”. I tried it with RDS on AWS and Cosmos DB on Azure. Amazing! When you want to increase capacity, like more CPUs or more storage, it takes 5 minutes too! I tried it with a “managed SQL instance” in Azure. For us who worked in data warehousing, this is unbelievable. It’s like a dream come true.
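To give a flavour of how little is involved, here is a hedged sketch of spinning up a small RDS SQL Server instance with boto3 (the identifiers, sizes and credentials are placeholders; the engines and instance classes available depend on your region and account):

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-2")

# Ask AWS for a new database instance -- no purchase order, no cabinet, no SAN
response = rds.create_db_instance(
    DBInstanceIdentifier="my-dev-sqlserver",
    DBInstanceClass="db.t3.medium",
    Engine="sqlserver-ex",              # SQL Server Express edition
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",  # placeholder only
    AllocatedStorage=20,                # GB
)
print(response["DBInstance"]["DBInstanceStatus"])  # e.g. "creating"
```

A few minutes later the instance is available and you connect to it like any other SQL Server.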

And these days it’s not a “server”. It is not a SQL server. There is no server! There is a database, but there is no server. What is it then? It’s a service. It’s a SQL Server database service. We don’t get access to the servers, let Microsoft worry about it. We just use the database. We don’t need to worry about the server. That’s why it’s called PAAS, Platform as a Service. In this case it is a database platform (SQL Server or RDS, delivered as a service).

That’s it, we don’t need to worry about patching, upgrading or installing. Just use the SQL DB. Easy life!

Note: instead of PAAS, you can go one level below (IAAS, Infrastructure as a Service) where you create a VM (Virtual Machine) and install the SQL server yourself. Or you can go one level up (SAAS, Software as a Service) like Snowflake and Salesforce, where you don’t even have to manage the database. Just use the software!

That’s the beauty of cloud database service. We just have to use it, without worrying about the server, the patching, upgrading or installing. Just use the database!

And it’s Pay As You Go. We only pay for what we use, when we use it. This is the main reason why no one is buying servers any more: they cost thousands of pounds/dollars upfront. Tens of thousands. Whereas in Azure we don’t pay a penny upfront. Ditto AWS and GCP. Financially this is a dream come true! Who wants to pay thousands of dollars upfront? No one. Pay As You Go is the name, Pay As You Go is the game.

So there you have it folks. Goodbye data centre. All BI must be in the cloud. All data marts and warehouses must be in the cloud. All data lakes, data platforms, data anything: they all must be in the cloud. And ML/AI too, it must be in the cloud.

Reference:

  1. Azure Managed SQL Instance: link.
  2. Amazon Relational Database service (RDS): link.
  3. Designing Cloud Data Platforms, by Danil Zburivsky and Lynda Partner: link.

Note: Danil and Lynda’s book (#3 above) is truly amazing. I can’t recommend it highly enough. Their book opened my eyes to what Cloud Data Platforms are and how they should be designed.

11 July 2023

What about Hadoop?

Filed under: Data Warehousing — Vincent Rainardi @ 4:32 pm

When I wrote the “Databricks vs Snowflake” article yesterday I couldn’t stop wondering about 1 thing: Should a data platform look like Databricks, or like Snowflake?

A data platform must have a data lake and data warehouse/marts, and must support streaming (real time), data sharing, data engineering and machine learning too. Therefore, in my opinion, a data platform should look like Databricks, not like Snowflake, which can only do data warehousing and data sharing.

But what about Hadoop? 5 years ago everybody was on Hadoop. The power of Big Data, as they say. We know that Databricks is Spark-based. Is Spark better than Hadoop? Or should an ideal data platform be built on Hadoop?

A data platform must be in the cloud, that’s a given (not in your own data centre). And by that I mean a public cloud, i.e. Azure, AWS or GCP. In Azure, AWS and GCP we can choose between Hadoop and Spark. Now it seems that no one is choosing Hadoop any more, not like 5 years ago. Why?

  1. Because Hadoop is not suitable for small files. HDFS stores files in blocks with a default size of 128 MB, so a large number of small files creates a lot of block metadata overhead on the name node.
  2. Hadoop is slow when processing data. Hadoop is basically Map plus Reduce operations, which are slow because MapReduce reads and writes the data to and from disk (HDFS). Spark can be a hundred times faster than MapReduce because it processes everything in memory (see the sketch after this list).
  3. Hadoop can’t process streaming / real time data. The Hadoop architecture is designed for processing a large amount of input data in batches, because MapReduce works in 2 steps: Map converts the input data into key-value pairs, and then Reduce processes those pairs together.
  4. In Hadoop, compute and storage are bundled. A Hadoop cluster consists of name nodes and data nodes (aka master and worker nodes), which in turn consist of CPU, memory and disks. A small cluster typically contains 1 name node and a few data nodes. It is possible to add storage and compute, but it is not as flexible as S3 in AWS, ADLS in Azure or cloud storage in GCP.
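To illustrate point 2, here is a small PySpark sketch (the path and columns are hypothetical): the chained transformations run in memory, and caching keeps the data there for repeated queries, instead of writing intermediate results back to HDFS between a Map and a Reduce step.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-vs-mapreduce").getOrCreate()

# Read raw events from cloud storage (e.g. S3 or ADLS)
events = spark.read.json("s3://my-lake/bronze/events/")

# Cache the DataFrame in memory so repeated queries avoid re-reading from disk
events.cache()

# Chained transformations are executed in memory, not via intermediate disk writes
daily_counts = (events
                .filter(F.col("event_type") == "purchase")
                .groupBy("event_date")
                .count())
daily_counts.show()
```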

The same goes for Hadoop-based systems like Cloudera, HBase, Hive and HDInsight: they have the same weaknesses. Amazon EMR, as the name implies (Elastic MapReduce), is more flexible in terms of increasing storage and compute, but not as flexible as “Data Lake on AWS” (yes, that’s the name: link) or ADLS.

So the era when big companies built their own Hadoop clusters is gone. People now use the public clouds: AWS, Azure, GCP. And for data lakes, Spark is better than Hadoop because of the above limitations. A Spark cluster is memory-based, and that resolves all the issues we have with Hadoop above.

References:

  1. Hadoop documentation: link.
  2. Cloudera documentation: link.
  3. Limitation of Hadoop by Data Flair, link.
  4. Amazon EMR documentation: link.
  5. Data Lake on AWS documentation: link.
  6. Azure Data Lake Storage (ADLS) documentation: link.
  7. HDInsight documentation: link.
  8. Google Cloud Data Lake documentation: link
  9. Databricks documentation: link

7 July 2023

Databricks vs Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 8:38 pm

1. Databricks provides a platform for data warehousing, BI, data lake, data engineering, data science, machine learning (AI), data sharing, real time streaming and data governance. Whereas Snowflake provides a platform for data warehousing, BI and data sharing.

Note: Snowflake’s tools for machine learning (Snowpark ML and MLOps) are not yet available (they are coming). Snowflake is designed for bulk loading. For streaming, Snowflake has a Kafka connector, but Snowpipe writes the data into temporary stage files (so even though the input is streaming, the loading into the data warehouse is in batches, not real-time streaming). To be truly streaming you need to replace Snowpipe with Snowpipe Streaming, but this feature is still in preview (being tested).

2. Databricks is a PAAS (Platform as a Service), whereas Snowflake is a SAAS (Software as a Service). Meaning that you need to employ engineers to use Databricks, whereas for Snowflake you don’t need engineers. Databricks is more complex to set up, but you have more control over the compute nodes, etc. This PAAS vs SAAS distinction also means that your Databricks lives in your own Azure/AWS/GCP tenant, whereas your Snowflake lives not in your tenant/cloud but in Snowflake’s tenant.

Note: Both Databricks and Snowflake are available on Azure, AWS and GCP.

3. Databricks can process structured, semi-structured and unstructured data, whereas Snowflake can process structured and semi-structured data. Databricks uses SQL, Python, R, Scala, Java and Spark, whereas Snowflake only uses SQL natively (it is possible to use Python in Snowflake, but you will need to install a Python connector).

Note: It is possible to read unstructured data files in Snowflake but you will need to use Python UDF handler (User Defined Function). Meaning that you need to write a custom function using Python. Alternatively you can also write the UDF in Java. In other words, Snowflake does not handle unstructured data natively like Databricks.

So it depends on what you will be using it for in your company. If you use it for data warehousing or data sharing, then use Snowflake because it is easier to use. But if you use it for data lake, data engineering, streaming and data science / ML in addition to data warehousing and data sharing, then use Databricks.

The references are below if you need more detailed information. I’d recommend starting with #7 (Fujitsu), followed by #10 (Macrometa).

Date of writing: 7th July 2023. Yes I have used both Databricks and Snowflake.

References:

  1. Databricks documentation, 19 May 2023: link.
  2. Snowflake documentation, 27th June 2023: link.
  3. Databricks vs Snowflake by Venkatakrishnan on Medium, 24th April 2023: link.
  4. Snowflake vs Databricks 2023: Key Differences by Project Pro, 4th July 2023: link.
  5. Databricks vs Snowflake – The Better Choice in 2023? by Anesu Kafesu, 15th February 2023: link.
  6. Databricks vs Snowflake: A Comprehensive Comparison for Data Analysts and Data Scientists by Antonio Di Nicola, 31st March 2023: link.
  7. A Practitioner’s Guide to Databrick vs Snowflake by Fujitsu Data & AI, 5th March 2023, link.
  8. Snowflake vs Databricks – The Ultimate Comparison, Which One is Right for You? by DecisionForest, 23rd April 2023, link.
  9. Databricks vs Snowflake (2023) – A detailed comparison, 6th July 2023, by Firebolt, link.
  10. Databrick vs Snowflake – A Side by Side Comparison, 22nd June 2023, by Macrometa and Inbound Square, link.

2 July 2023

Machine Learning for Asset Management

Filed under: Machine Learning — Vincent Rainardi @ 6:29 pm

BlackRock uses ML for trade execution and liquidity risk (link), whereas Abrdn uses ML for portfolio allocation strategy (link). Schroders uses ML to enhance investors’ views (link) and to find companies to invest in (link). JP Morgan uses NLP/LLMs, time series analysis and RL for managing risk, marketing, customer experience and fraud prevention (link). They have 600 ML engineers working on over 300 AI use cases in production.

Fidelity uses NLP for social listening to uncover potential risks, flagging non-compliant language and images in posts and videos (link). Man Group uses ML for predicting potential market movements (alpha), for trade execution to get the lowest-cost market access with the minimum of market impact, and for execution in futures and FX markets (link).

Acadian Asset Management uses NLP for constructing transition-aligned thematic portfolios (link). They use a predictive signal called GEO (Green Economy Opportunities) to identify companies associated with the energy transition to a lower-carbon economy, and to gauge their commitment to sustainability and the environment.

The best paper summarising how ML is used in asset management is by Derek Snow, published on Sov.ai and Medium (link). Broadly speaking, in the world of investing ML is used for 4 purposes:

  1. To forecast the future prices of securities
  2. To predict financial events like corporate defaults, earnings surprises and acquisitions.
  3. To estimate values such as credit ratings, volatilities and company revenues.
  4. To optimise portfolios and trade execution, including managing risks and capital.

In that paper Snow demonstrated how to use RL (Reinforcement Learning) on the VIX CMF (Constant Maturity Future), finding industry return predictability, the effect of oil prices on the Norwegian currency, and forecasting stock prices.

Karatas and Malhotra described the application of ML in asset management (link), including answering these questions:

  • How does the price / volume move?
  • Which sector / asset is going to outperform?
  • Which manager is better?
  • What is the product’s success rate?
  • Is there an up or down trend?
  • What is the real time impact of news?
  • What should the allocation be?

CFA Institute explains (link) that in asset management ML is used for the following:

  • Estimating expected returns
  • Portfolio optimisation
  • Generating a provisional trading list
  • Determining optimal execution strategies, e.g. to minimise transaction costs
  • Algorithmic trading
  • Risk management (both market risk and credit risk)
  • Robo-advisors
