Data Warehousing and Data Science

22 May 2022

Watermark in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:54 am

When we load data into a target table, we write down the last row of data we loaded from the source table. That’s watermark. Say we have 5 rows in the source table, and we loaded all of them to the target table. So we write down “5” (meaning that the last row we loaded was row 5), see the image below:

So now in the target table we have 5 rows.

Note: The above image is slightly in correct. In data warehousing, the pump doesn’t “withdraw” the water from the source (pumping it out). Instead it “copies” the water (water in the source container is not gone but is duplicated).

Let’s say there are 3 more rows in the source table. So now there are 8 rows in the source. We load from row 6 of course, because the last row we loaded was row 5. And we need to load the 3 new rows from source into the target, see the image below:

After we loaded the 3 new rows, we update the watermark to 8, see the image below:

This mechanism is well documented in the Microsoft Azure web page for ADF pipeline (link):

Note here that step 2 is done before step 3. This ensure that the data loading is water tight. Meaning, any row created in the source after the watermark in step 2 will be loaded next time.

Also note above that Microsoft uses the word “watermark” for ADF pipeline (as opposed to “delta mark”).

But where did the word “watermark” come from? Well if you see on the Wikipedia (link), the word watermark means an obscured image on paper (almost hidden). If you look at a £20 note, you’ll see a £20 watermark, see the red arrow below:

But that is not the word watermark that we use in data loading 🙂

The word watermark we use in data loading is from this:

It is a water level stick in a dam or a river, and there is a marking on the stick to mark high water level in the past event such as a flood (source: link, link).

Sometime the water mark line is drawn on the wall of a house, to mark the level of flooding that hit that house. We can see below left that in 1893 the water level was 41 inches, whereas on the right there were various water marks going back centuries, the most recent one being 1970 and 1982 (source: link, link). Those are known as “high water mark”.

Below left is the high water mark in the Old Town Hall in the city of Passau, Germany (link, link). And below right is the record flood levels at the Boat Inn in Coalport, near Telford, England (link):

That was where the word “watermark” in data loading from! Not the hidden writing on money, but the “high-water marks” due to flooding.

Now you know what a “watermark” in data warehousing is, and where it is from 🙂

Blog at WordPress.com.