U-SQL

7 May 2016

U-SQL

Filed under: Analysis Services — Vincent Rainardi @ 7:26 pm

U-SQL is a new language used in Azure Data Lake. It is a mix between T-SQL and C#. It is used to load files into Data Lake.

To use U-SQL, we need to add a New Data Lake Store in our Azure Portal, click Sign Up for Preview (we need to use IE; in Chrome we can’t click Sign Up). We will get an email when it is approved (can be a few hours, can be a few weeks).

Here is an example for calculating the duration for each currency in the portfolio:

@rawData = EXTRACT SecurityName string?, Weight decimal?, ModifiedDuration decimal, MaturityDate DateTime, Currency string
FROM “/FixedIncome/EMD.csv” USING Extractors.Csv(silent: true);

@aggData = SELECT Currency, SUM(Weight * ModifiedDuration) as Duration FROM @rawData GROUP BY Currency;

OUTPUT @aggData TO “/Report/Duration.csv” ORDER BY Currency USING Outputters.Csv();

The above U-SQL code imports a comma delimited file called Portfolio.csv, calculates the duration for each currency and create a file called Duration.csv. I’ll explain the 3 lines above (EXTRACT, SELECT, OUTPUT) one by one below.

EXTRACT

Case sensitive: U-SQL is case sensitive and strong-typed. It is case sensitive because it is using C# syntax. So “EXTRACT” must be in upper case, “string” must be in lower case and “DateTime” must be in mixed case.

EXTRACT is used to import a file into a row set. It can be used to import both a structured file (all rows have the same number of columns) or unstructured files (rows have different number of columns). Documentation: here.

The data type is C# data type, not T-SQL. Full list of C# built-in data types is here. DateTime is a structure, not a data type. It has Date, Time, Year, Month, Day, Hour, Minute, Second, Today, Now, TimeOfDay and DayOfWeek properties. See here for documentation.

The ? at the end of “string?” and “decimal?” means NOT NULL.

One thing we need to remember is that in U-SQL it is “schema on read”. Which means that we define the data types of the file we import as we import them.

When importing a file we need to use an extractor. It is a data interface program which reads byte stream in parallel into a row set. U-SQL has 3 built-in Extractors: one reads text files (Extractors.Text), one reads tab delimited files (Extractors.Tsv) and one reads Comma Separated Values files (Extractors.Csv). The last two are derived from the first one. By default U-SQL split the imported file into several pieces and import the pieces in parallel. U-SQL imports unstructured files using user-defined extractor. See here for documentation on extractors.

Silent is a parameter. It means if the row in the CSV file doesn’t have 5 columns, ignore the row (we expect the CSV file to contains 5 columns: SecurityName, Weight, ModifiedDuration, MaturityDate, and Currency). There are other parameters such as: (documentation here)

rowDelimiter: the character at the end of the row, e.g. carriage return, linefeed
quoting: whether the values are wrapped in double quotes or not
nullEscape: what string represents a null value

@rawData is the name of the row set containing data we import from the CSV file. We can then refer to this row set with this name.

SELECT

Once we have this @rawData row set, we can do a normal SQL statement to it, such SELECT, UPDATE, INSERT and DELETE. But we need to use C# syntax. So the where clause is not “where AssetClass = “Options” ” but “where AssetClass == “Options” ”. And SELECT, UPDATE, INSERT, DELETE all needs to be in upper case. Other supported SQL statement in U-SQL are: JOIN, UNION, APPLY, FETCH, EXCEPT, and INTERSECT.

SELECT: U-SQL supports WHERE, FROM, GROUP BY, ORDER BY and FETCH. Documentation here. FETCH means get the first N rows, link. For ALIAS see here.
JOIN: U-SQL supports INNER JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, FULL (OUTER) JOIN, CROSS JOIN, SEMIJOIN, and ANTISEMIJOIN. Documentation here. Both the “ON T1.Col1 == T2.Col2” and the old style “WHERE T1.Col1 == T2.Col2” are supported (link).
UNION: U-SQL supports both UNION and UNION ALL. Doc: here.
APPLY: U-SQL supports both CROSS APPLY and OUTER APPLY. Doc: here.
Interaction between row sets (these are hyperlinked): EXCEPT, INTERSECT, REDUCE, COMBINE, PROCESS.
Aggregate functions: SUM, COUNT, FIRST, LAST, MAX, MIN, AVG, VAR, STDEV, ARRAY_AGG, MAP_AGG. Doc: here. ARRAY_AGG combines values into an array. Split breaks an array into values (link).
Operators: IN, LIKE, BETWEEN. Doc: here.
Logical operators: Remember that U-SQL is using C# syntax so AND is &&, OR is ||, = is ==, <> is != and not is !
“condition ? if_true : if_false” is also supported, see here.

OUTPUT

The output of a SQL statement (called Expression in U-SQL) can be assigned to a row set like in the example above i.e. “@aggData = “, or we can do these 3 things:

Output it to a new file using OUTPUT
Output it to a new table using CREATE TABLE AS
Output it to an existing table using INSERT INTO

The OUTPUT expression saves the row set into a file. There are 3 built-in outputters in U-SQL: Outputters.Tcv for tab delimited files, Outputters.Csv for comma delimited files, and Outputters.Text for any delimiter. They accept these parameters: quoting, nullEscape, rowDelimiter, escapeCharacter, and encoding. Documentation: here.

Documentation

The U-SQL language reference is here (still work in progress), and the U-SQL tutorial is here.

Data Lake documentation is here and the learning guide is here.

Articles on U-SQL from Microsoft:

From Microsoft U-SQL team: link
Ed Macauley U-SQL tutorial: link, click on Tutorial under Prerequisites.
From Mike Rys: link

Comments (10)

10 Comments »

I have a column 00099 in my input file in Blob . How can I extract that column in U_SQL using EXTRACT command as the column name is integer and U-SQL engine is not recognizing the integer columns .

Comment by Gayathri — 26 July 2018 @ 2:11 pm | Reply
- Hi Gayathri, try enclosing the numeric column name with square brackets, i.e. [00099].
  
  Comment by Vincent Rainardi — 27 July 2018 @ 5:00 am | Reply
  - Hi Vincent , It worked ! Thanks a lot 🙂
    
    Comment by Gayathri — 27 July 2018 @ 7:58 am
Is it possible in USQL to have all the letters of a column in Upper case.If so please tell me how to do it in USQL

Comment by Shankar — 2 August 2018 @ 7:53 am | Reply
- Yes it is possible Shankar to make all the letters of a column in upper case, using UPPER function.
  Like this: SELECT UPPER(Column1) FROM Table1
  
  Comment by Vincent Rainardi — 3 August 2018 @ 6:48 am | Reply
3. Is it possible to extract the content from the latest file in blob using U-SQL? Like I have multiple files with filename_20180803.csv , filename_20180802.csv . I would like to extract the filename_20180803.csv only

Comment by Gayathri — 3 August 2018 @ 8:58 am | Reply
- Hi Gayathri, you will need to read all files in the folder, then get the latest file using:
  @files = EXTRACT file_id int, file_name string FROM “/folder/filesets example/{filename}.csv” USING Extractors.Csv();
  @latest_file = SELECT MAX(file_name) FROM @files;
  Then you extract this latest file.
  
  Alternatively you can use a user-defined extractor to find the latest file using C#. See: https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extract-expression-u-sql
  
  Comment by Vincent Rainardi — 6 August 2018 @ 7:25 am | Reply
Is it possible to extract exponential numbers from the input file using Extract ? Currently I am extracting the column values as string and then converting to decimal in select . Is there any way so that I can directly extract the numbers having E in the values in Extract?

Comment by Gayathri — 23 August 2018 @ 9:34 am | Reply
- Hi Gayathri, try using float or double data type, rather than int or decimal. The float and double data types can handle exponent (E), whereas int and decimal can’t. See https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-built-in-extractors, search for “AllowExponent”
  
  Failing that, create a user-defined extractor, see https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/extract-expression-u-sql, search for “Convert.ToInt32” in the DriverExtractor section.
  
  Comment by Vincent Rainardi — 23 August 2018 @ 6:05 pm | Reply
  - It worked 🙂 Thanks
    
    Comment by Gayathri — 31 August 2018 @ 8:45 am