Data Warehousing, BI and Data Science

21 February 2019

Data Files – Delimiter and Qualifier

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

Suppose you got to specify a file spec for your data supplier to send their data via FTP. Would you request them to send you Excel files or CSV files? Do you prefer a tab delimited text file or pipe delimited? Do you ask them to qualify their string with double quotes?

Excel or CSV

I would prefer CSV files than Excel files, because the number of columns in Excel which can be read by SSIS is limited to 255 columns (see link). Whereas in CSV files there are no limitation regarding the number of columns.

To overcome this limitation we need to import it twice (or three times) using two data source components and then join them. But in order to join them we will need to have an identity column, which will become the join criteria (see link). We also need to be careful with the performance when joining because merge join can be slow.

The second issue with Excel file is the OLE DB Provider installed in the server where SSIS is running, otherwise we could get an error message saying that “OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered” (see link). The “Use 32 bit runtime” in the SQL Agent job step and Run64BitRunTime in the project property are also affecting this.

The other disadvantage of importing an Excel file is dealing with zero in the front of a string, such as “007”, which automatically becomes number 7.

Also numbers which are in scientific notation, such as 1.23E21, which will then be imported as a string rather a number, causing a failure. If it is in CSV it written as 1230000000000000000000 in the file and imported as a number.

The other limitation of Excel is about cells containing long strings, such as 2000 characters, being cut to 256 characters. This issue only happens before 2007 edition.

Delimiter

What delimiter do you prefer: comma, pipe or tab?

The problem with CSV or a comma delimited file is that we have comma in the data. This causes misalignment when the file is imported. For example, if there are 10 columns, and one of them has a comma, this row will become 11 columns. This problem is known in the IT world as “delimiter collision”, see here: link.

Comma in the data is very common when dealing with numeric fields, such as “123,000”. Or, in countries like Indonesia which uses comma as a decimal point, it is like this: “100,23”.

We can enclose it with double quotes, but why not eliminate the problem in the first place? That’s why for text files people prefer pipe delimiter or tab delimiter. Text file with tab delimiter is known as tab-separated values, or TSV (link), but pipe delimited files are not known as PSV. We do have DSV though, which stands for Delimiter Separated Values (link).

Pipe is generally preferable because of the perception that it is rarer than tab. Some strings may contains tab, for example in a commentary field.

Qualifier

Qualifier means enclosing the data with something, such as double quote. Double quote is the most common delimiter. Other delimiters (but far less common) are single quote and brackets. There are 3 different types of bracket, i.e. [ ], < > and { }.

The issue with double quote delimiter is that the data may contain double quote, such as in commentary fields. This applies to other delimiters too. That’s why it is ideal to use pipe delimiter, without qualifier.

So that’s my preferred choice for the format of a data file, if I can choose it: pipe delimited text file without qualifiers. But unfortunately, in reality, we very rarely get the opportunity to choose the format. Usually the vendor dictates the format because the same file goes to many customers.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: