Data Warehousing and Machine Learning

8 November 2021

What is a Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 9:05 am

When asked what a data lake is, most people describe its function rather than what it is physically. That is like answering the question “What is a car?” with “A vehicle that we can use to move from place A to place B”. This is because not many people know what a data lake is, physically. So in this article I would like to answer that seemingly simple question.

Collection of Files

A data lake, physically, is a collection of files. Files can be structured (tabular, hierarchical, graph, etc.) or unstructured (images, audio, videos, documents, etc.). Typically, structured data files are stored as delimited text files, such as CSV or pipe-delimited files. Other commonly found formats for data files are JSON, XML and Excel.
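To make the point about delimited text files concrete, here is a minimal Python sketch that reads the same small table once as CSV and once as a pipe-delimited file. The sample data and the function name `read_delimited` are made up for illustration; in a real lake the text would come from files in storage.

```python
import csv
import io

# Hypothetical sample data standing in for two files in the lake.
CSV_TEXT = "id,name,amount\n1,Alice,10.5\n2,Bob,7.25\n"
PIPE_TEXT = "id|name|amount\n1|Alice|10.5\n2|Bob|7.25\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(CSV_TEXT, ",")
pipe_rows = read_delimited(PIPE_TEXT, "|")

# Both files carry the same table; only the delimiter differs.
print(csv_rows[0]["name"])
print(csv_rows == pipe_rows)
```

Note that every value comes back as a string: a delimited text file carries no data types, so the reader has to supply them.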

Unstructured data files such as multimedia files are stored in their native formats, such as:

  • Images: JPEG, PNG, GIF, BMP
  • Audio: WAV, MP3, M4A
  • Video: MPEG, AVI, WMV
  • Document: PDF, HTML, DOCX

Other types of files stored in the data lake are:

  • Database backup files (typically .BAK but can be other extensions)
  • Log files (typically .LOG but can be other extensions)
  • Email files (typically MSG, PST, EDB, OST)
  • Social media data such as Facebook and Twitter

BLOB (Binary Large Object)

Let me clarify a term which is used often, but whose physical meaning is not usually clear: BLOB. A blob, or binary large object, is a file containing binary data, such as a multimedia file or an executable file. So originally, blob did not include human-readable files such as text files. But in the data lake world, blob generally means any file, including human-readable files.

But if we want to be precise and get physical, a blob is not quite the same thing as a file on a disk. A blob is an object held by a storage service, and it can be stored in different ways.

In Azure, there are 3 types of blobs: block blobs, append blobs and page blobs. When people say “file”, they generally mean a block blob, which is the type used for ordinary text and binary files. Block blobs are optimised for uploading large amounts of data. Append blobs are optimised for appending data at the end of the blob (existing blocks cannot be updated or deleted). Page blobs are random-access storage, used for example for virtual machine disks.

In AWS, files are called objects. What we call “blob storage” in Azure is called an object store in AWS (Amazon S3). An object store in AWS identifies each object by a unique key within its bucket. An object in AWS consists of the data (the file itself) and metadata describing the object.
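The idea of an object as "key plus data plus metadata" can be sketched in a few lines of Python. This is a toy in-memory model, not the AWS API: the class names `StoredObject` and `ToyBucket`, the bucket name and the key are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str                      # unique name of the object within the bucket
    data: bytes                   # the file content itself
    metadata: dict = field(default_factory=dict)  # describes the object


class ToyBucket:
    """In-memory stand-in for an object store: a mapping from key to object."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data, **metadata):
        # Writing to an existing key replaces the whole object.
        self._objects[key] = StoredObject(key, data, dict(metadata))

    def get(self, key):
        return self._objects[key]


bucket = ToyBucket("example-bucket")
bucket.put("sales/2021/orders.csv", b"id,amount\n1,10\n", content_type="text/csv")

obj = bucket.get("sales/2021/orders.csv")
print(obj.key, obj.metadata)
```

The sketch keeps one thing from the real service: an object is looked up by its full key, not by walking a directory tree.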

Containers

The files in the data lake are organised in folders. The top-level folders are called “containers” or “buckets”: in Azure they call it a container, and in AWS and Google Cloud they call it a bucket. Containers themselves cannot be nested, but the files inside one can be organised in multiple levels (folders within folders).
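In object stores such as Amazon S3, the keys inside a bucket form a flat list, and "folders within folders" are commonly presented by grouping keys that share a prefix up to a delimiter. The sketch below imitates that grouping in plain Python. The function name `list_level` and the sample keys are assumptions for illustration, and stores with a true hierarchical namespace (such as Azure Data Lake Storage Gen2) keep real directories instead.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (folders, files) found directly under `prefix` in a flat key list."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Something deeper exists: report only the next folder level.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            files.append(key)
    return sorted(folders), sorted(files)


# Hypothetical keys in one bucket.
keys = [
    "raw/sales/2021/orders.csv",
    "raw/sales/2021/items.csv",
    "raw/readme.txt",
    "curated/sales.csv",
]

print(list_level(keys))          # (['curated/', 'raw/'], [])
print(list_level(keys, "raw/"))  # (['raw/sales/'], ['raw/readme.txt'])
```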

Each container belongs to an account. In Azure these accounts are called “storage accounts”. Users are granted permission to access these storage accounts. But users can also be granted permission to access a single container or a single file.
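The three levels of permission (account, container, file) can be pictured as nested scopes, where a grant on an outer scope also covers everything inside it. The following Python sketch is a deliberately simplified model of that idea, not how any cloud provider implements access control. The function `is_allowed`, the user names and the resource paths are all hypothetical.

```python
def is_allowed(grants, user, resource):
    """Check access to `resource`, written as 'account/container/path/to/file'.

    `grants` is a set of (user, scope) pairs. A scope may be an account,
    a container, a folder or a single file. In this toy model a grant on
    a scope covers everything beneath it.
    """
    parts = resource.split("/")
    scopes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any((user, scope) in grants for scope in scopes)


grants = {
    ("ana", "examplelake"),                       # whole storage account
    ("ben", "examplelake/raw"),                   # one container
    ("cy", "examplelake/raw/sales/orders.csv"),   # one file
}

print(is_allowed(grants, "ana", "examplelake/curated/sales.csv"))
print(is_allowed(grants, "ben", "examplelake/curated/sales.csv"))
print(is_allowed(grants, "cy", "examplelake/raw/sales/orders.csv"))
```

Real services layer roles, policies and access control lists on top of this, so treat the sketch only as a picture of the scoping.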

HTTP Access

Files in the data lake are accessible via HTTP. A data lake has a RESTful architecture, so each file has a URI (in practice, a URL).

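For example, in Azure Blob Storage the URL of a file has the form https://<storage account>.blob.core.windows.net/<container>/<path to file>.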
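As a minimal sketch, URLs of this kind can be assembled in Python as follows. The host name patterns are the standard ones for Azure Blob Storage and Amazon S3 (virtual-hosted style), while the account, container, bucket, region and file names are hypothetical. Knowing the URL is not enough to read the file: a request normally also needs credentials, unless the file has been made public.

```python
from urllib.parse import quote


def azure_blob_url(account, container, blob_path):
    """URL of a blob: storage account, then container, then path."""
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_path)}"


def s3_object_url(bucket, region, key):
    """Virtual-hosted-style URL of an S3 object: bucket in the host, key in the path."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# Hypothetical names, for illustration only.
azure_url = azure_blob_url("examplelake", "raw", "sales/2021/orders.csv")
s3_url = s3_object_url("example-bucket", "eu-west-2", "sales/2021/order list.csv")

print(azure_url)
print(s3_url)
```

`quote` leaves the slashes alone and percent-encodes characters such as spaces, so the folder structure stays visible in the URL.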

Database-like Access

Structured data files in the data lake can be accessed as if they are tables in a database. We can do this in Azure, AWS and Google Cloud. For that we use technologies like Hive and Databricks. In Databricks we create a Hive table which is linked to a data file. In Databricks, a collection of tables is called a “database”.

Afterwards, we can query that Hive table using an SQL SELECT statement (Spark SQL). To run those Spark SQL queries we need to create a Spark cluster.
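The two steps described above (link a table to a data file, then query it with SELECT) might look like the following sketch in Python with Spark SQL. The table name, column names and file path are hypothetical, and the file is assumed to be a CSV with a header row. The functions only build the SQL text; `run_on_spark` executes it and needs PySpark and a running Spark cluster, so it is defined here but not called.

```python
def create_table_sql(table, path):
    """Spark SQL that defines a table on top of an existing CSV file."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"USING CSV OPTIONS (path '{path}', header 'true', inferSchema 'true')"
    )


def top_customers_sql(table, limit=10):
    """An ordinary SELECT against that table (assumes customer and amount columns)."""
    return (
        f"SELECT customer, SUM(amount) AS total FROM {table} "
        f"GROUP BY customer ORDER BY total DESC LIMIT {limit}"
    )


def run_on_spark(table, path):
    """Execute both statements. Requires PySpark and a Spark cluster."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-query-sketch").getOrCreate()
    spark.sql(create_table_sql(table, path))
    return spark.sql(top_customers_sql(table))


# Hypothetical file location in an Azure data lake.
ddl = create_table_sql(
    "orders", "abfss://raw@examplelake.dfs.core.windows.net/sales/orders.csv"
)
query = top_customers_sql("orders", limit=5)
print(ddl)
print(query)
```

On a cluster, `run_on_spark("orders", path).show()` would print the result. The data itself is never copied into a database: the table is only a definition that points at the file.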

Database-like access is a very important feature of a data lake. It means that the data lake is queryable as if the data is stored in SQL tables. Many BI tools (such as Power BI) can access Databricks tables, be it in Azure, AWS or Google Cloud. So we don’t need to put the data into a SQL Server or Oracle database. The BI tools can directly query the files in the data lake as if they are database tables.
