U-SQL is a new language used in Azure Data Lake. It is a mix between T-SQL and C#. It is used to load files into Data Lake.
To use U-SQL, we need to add a New Data Lake Store in our Azure Portal, click Sign Up for Preview (we need to use IE; in Chrome we can’t click Sign Up). We will get an email when it is approved (can be a few hours, can be a few weeks).
Here is an example for calculating the duration for each currency in the portfolio:
@rawData = EXTRACT SecurityName string?, Weight decimal?, ModifiedDuration decimal, MaturityDate DateTime, Currency string
FROM “/FixedIncome/EMD.csv” USING Extractors.Csv(silent: true);
@aggData = SELECT Currency, SUM(Weight * ModifiedDuration) as Duration FROM @rawData GROUP BY Currency;
OUTPUT @aggData TO “/Report/Duration.csv” ORDER BY Currency USING Outputters.Csv();
The above U-SQL code imports a comma delimited file called Portfolio.csv, calculates the duration for each currency and create a file called Duration.csv. I’ll explain the 3 lines above (EXTRACT, SELECT, OUTPUT) one by one below.
EXTRACT
Case sensitive: U-SQL is case sensitive and strong-typed. It is case sensitive because it is using C# syntax. So “EXTRACT” must be in upper case, “string” must be in lower case and “DateTime” must be in mixed case.
EXTRACT is used to import a file into a row set. It can be used to import both a structured file (all rows have the same number of columns) or unstructured files (rows have different number of columns). Documentation: here.
The data type is C# data type, not T-SQL. Full list of C# built-in data types is here. DateTime is a structure, not a data type. It has Date, Time, Year, Month, Day, Hour, Minute, Second, Today, Now, TimeOfDay and DayOfWeek properties. See here for documentation.
The ? at the end of “string?” and “decimal?” means NOT NULL.
One thing we need to remember is that in U-SQL it is “schema on read”. Which means that we define the data types of the file we import as we import them.
When importing a file we need to use an extractor. It is a data interface program which reads byte stream in parallel into a row set. U-SQL has 3 built-in Extractors: one reads text files (Extractors.Text), one reads tab delimited files (Extractors.Tsv) and one reads Comma Separated Values files (Extractors.Csv). The last two are derived from the first one. By default U-SQL split the imported file into several pieces and import the pieces in parallel. U-SQL imports unstructured files using user-defined extractor. See here for documentation on extractors.
Silent is a parameter. It means if the row in the CSV file doesn’t have 5 columns, ignore the row (we expect the CSV file to contains 5 columns: SecurityName, Weight, ModifiedDuration, MaturityDate, and Currency). There are other parameters such as: (documentation here)
- rowDelimiter: the character at the end of the row, e.g. carriage return, linefeed
- quoting: whether the values are wrapped in double quotes or not
- nullEscape: what string represents a null value
@rawData is the name of the row set containing data we import from the CSV file. We can then refer to this row set with this name.
SELECT
Once we have this @rawData row set, we can do a normal SQL statement to it, such SELECT, UPDATE, INSERT and DELETE. But we need to use C# syntax. So the where clause is not “where AssetClass = “Options” ” but “where AssetClass == “Options” ”. And SELECT, UPDATE, INSERT, DELETE all needs to be in upper case. Other supported SQL statement in U-SQL are: JOIN, UNION, APPLY, FETCH, EXCEPT, and INTERSECT.
- SELECT: U-SQL supports WHERE, FROM, GROUP BY, ORDER BY and FETCH. Documentation here. FETCH means get the first N rows, link. For ALIAS see here.
- JOIN: U-SQL supports INNER JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, FULL (OUTER) JOIN, CROSS JOIN, SEMIJOIN, and ANTISEMIJOIN. Documentation here. Both the “ON T1.Col1 == T2.Col2” and the old style “WHERE T1.Col1 == T2.Col2” are supported (link).
- UNION: U-SQL supports both UNION and UNION ALL. Doc: here.
- APPLY: U-SQL supports both CROSS APPLY and OUTER APPLY. Doc: here.
- Interaction between row sets (these are hyperlinked): EXCEPT, INTERSECT, REDUCE, COMBINE, PROCESS.
- Aggregate functions: SUM, COUNT, FIRST, LAST, MAX, MIN, AVG, VAR, STDEV, ARRAY_AGG, MAP_AGG. Doc: here. ARRAY_AGG combines values into an array. Split breaks an array into values (link).
- Operators: IN, LIKE, BETWEEN. Doc: here.
- Logical operators: Remember that U-SQL is using C# syntax so AND is &&, OR is ||, = is ==, <> is != and not is !
- “condition ? if_true : if_false” is also supported, see here.
OUTPUT
The output of a SQL statement (called Expression in U-SQL) can be assigned to a row set like in the example above i.e. “@aggData = “, or we can do these 3 things:
- Output it to a new file using OUTPUT
- Output it to a new table using CREATE TABLE AS
- Output it to an existing table using INSERT INTO
The OUTPUT expression saves the row set into a file. There are 3 built-in outputters in U-SQL: Outputters.Tcv for tab delimited files, Outputters.Csv for comma delimited files, and Outputters.Text for any delimiter. They accept these parameters: quoting, nullEscape, rowDelimiter, escapeCharacter, and encoding. Documentation: here.
Documentation
The U-SQL language reference is here (still work in progress), and the U-SQL tutorial is here.
Data Lake documentation is here and the learning guide is here.
Articles on U-SQL from Microsoft: