Implement Analytical SQL Function in Data Service
To accommodate the analytical requirement of business, both PL SQL and T SQL have developed similar analytical function over the years. For example, row_number(), rank(), dense_rank(), max(), sum(), lead(), lag(). The main purpose of these functions is to help business get insight of data more conveniently. It works differently from ordinary aggregation function in that it can add aggregation in the same row, as oppose to regular aggregation which only list the unique combination of columns you grouped by
For example, you have data like this
Name | Value |
---|---|
A | 100 |
A | 200 |
B | 150 |
When you run the aggregation like the follows
select name, sum(value) as total from data group by name
you will get
Name | Total |
---|---|
A | 300 |
B | 150 |
When you run analytical SQL function, you would write like this
select name, value, sum(value) over (partition by name, order by name) as Total from data
Run it, you will get
Name | Value | Total |
---|---|---|
A | 100 | 300 |
A | 200 | 300 |
B | 150 | 150 |
If you want to calculate different A row value contribution to total, the second approach has advantage, you just use Value/total. The regular approach will have to join back to original dataset to do calculation.
Now let us get to data service part. Recently, my colleague hand me a project which contains script using Lead and lag analytical function. The backend database is Oracle.
Lead and lag function is used to retrieve the value from next row and previous row.
Find more about lead and lag function at here
https://oracle-base.com/articles/misc/lag-lead-analytic-functions
I have not used Data service to try translating those function. As far as my research goes, I know data service can create row number within group using gen_row_num_by_group. Also, it has is_group_changed and previous_row_value function. Somebody suggested online, but no examples. We might give a shot.
https://archive.sap.com/discussions/thread/3885872
Let us set up the stage by creating a dataset that we want to experiment on
Using the following script, we create a dataset
Now, I can use this table to try some analytical function
As you can see, I have use Lag, Lead and sum function to get some value. We use dept as group, we can get previous record, next record and sum of salary with in same dept.
Question is how we can implement this on the fly with data service?
- First, let us try with embedded data service function
This is the overview
In the order query transform, I set data order by dept asc, salary asc, so that I can use it for dept group and for set the sequence for previous_row_value
In the group query transform grouprecordid definition
isgroupchanged definition
previous_salary defination. When the group changes (is_group_changed(“order”.DEPT ) =1), it indicate it is a new group, the first record should not have previous record, so I set is as null. Otherwise it use previous row value
After I execute the job, this is what I got
Notice
grouprecordid correctly identify there are 2 group and marked correctly.
isgroupchanged column also mark correctly where group changed.
The result in previous salary, however, is not what we expected.
For group finance, the second record should be 1000, but it mark as null (in yellow shade); for group IT, second line should be 2000, but mark as 3000 (in red line)
The issue looks like the previous_row_value function did not notice there is order in data somehow. Maybe someone can find out something not right in my setting.
Since this approach did not work as expected, I start work on an alternative solution, self join.
- Self join
This is the setting overview, notice, I use the same table twice, also you need to create grouprecordid as the previous method in advance
Two table join setting in the self_join_lag as follows. I map one table’s grouprecordid = another table’s grouprecordid +1
In the field mapping part, I include one table’s all field, also include another table’s grouprecrodid. I call it map_record to see if it maps correctly
The most important thing is include another’s table’s salary as previous salary
After execution, the result is what we expected.
This is an example for solving Lag() analytical funciton. For lead(), you just need to change join funciton from +1 to -1. for sum(), avg(), count() etc. analytical function, what I think, the easiest way is to create an ordinary aggregation table first and join back to orginal table using the group by column.
Thank for visiting my site.
If you want to follow along, the code is here.
Good luck!
Wenlei