Data Warehousing and Data Science

25 January 2022

Data Lake on Oracle

Filed under: Data Warehousing — Vincent Rainardi @ 8:33 am

A modern data warehouse consists of a data lake and a dimensional data warehouse (link). The data lake is just a collection of files, but there is a SQL layer on top, enabling us to query the data lake using SQL which is called Data Lakehouse (link). I know that Databricks and Snowflake have this feature (link) but it turned out that Oracle has this feature too (link). It is based on Oracle external table which creates a SQL layer on top of data lake files (link). It works not on Oracle Cloud files, but also on S3, Azure and GCP files.

Oracle Autonomous Database is a cloud database that uses machine learning to automate database tuning, security, backups, updates and other routine database management tasks which usually performed by a DBA (link). Data Catalog is a metadata management service that provides an inventory of assets, a business glossary and a metadata store for data lakes. The Autonomous Database can automatically create external tables for thousand of files automatically harvested by the Data Catalog (link). This way, users can immediately query all the data in the data lake using SQL. For each file in the data lake the Data Catalog creates an entity automatically, without us having to manually define the columns and data types. And this entity metadata is automatically synchronised to the files in the lake (link), so the Autonomous Database always reflects what’s in the data lake.

This way, the data lake objects/files are harvested into the Data Catalog, and the Data Catalog is automatically sync into the Autonomous Database, enabling users to query the data lake using SQL, as Martin Gubar describes in the Oracle blog (link):

This makes Oracle a good platform for creating a modern data warehouse, i.e. a data lake + a dimensional data warehouse. Oracle’s relational database engine is legendary, making it a very good platform for hosting the dimensional data warehouse (star schema). And the Autonomous Database with Data Catalog describe above can provide a SQL queryable layer on top of data lakes in AWS S3, Azure, GCP, or Oracle Cloud Infrastructure (OCI, link).

For data movement, on OCI we can use Data Integration to ingest data into the data lake, and into the warehouse. It is a cloud based, managed service (link) with rich transformation, data explorer and data profiler features. We can monitor the running pipelines using various metrics, notifications or alarms. We can use Data Flow to create and run Spark application (link). It’s a fully managed service so we don’t need to create Spark clusters or worry about the underlying infrastructure. We can use Data Transfer to migrate large volume of data into the data lake (link).

Like Azure, AWS and GCP, OCI also provides machine learning platform. It is called Data Science. We can use it to build, train and deploy ML models (link). It is fully managed and serverless so we don’t need to worry about the underlying infrastructure such as updating Python libraries dependencies, Conda or JupyterLab. Anomaly Detection enables us to identify undesireable events in real time (link), such as IOT. It automatically analyses the dataset to build ML models for multivariate signals by considering their correlations.

For BI & reporting, we can use Analytics Cloud to explore, load, model and analyse data by creating reports and dashboards (link). We can do development operations such as release pipelines, build, deployment approval, code repository and continuous integration using a tool called DevOps (link).

So all in all, OCI provides an end-to-end platform to create a modern data warehouse, i.e. a data lake and a data warehouse. I’m glad to learn this today, because all this time I was wondering what Oracle has been doing in this era of cloud data warehousing. Azure, Amazon and Google all are offering cloud data platform for data lakes and data warehousing – what about Oracle, who is the leader in relational databases? It turns out that Oracle also has good cloud data platform offering.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: