Tag Archives: sql

Is SAP HANA SQL/SQLScript Turing complete?

There was an interesting (aka nerdy) question in the SAP Community Q&A the other day:
“Are SQLScript and HANA SQL Turing complete?”

I took a swing at it and the following is my result (a slightly edited version of my original answer).


Alright, let’s do this! (even though I have no idea why it would matter at all…)

So, I’m not a computer scientist, and I am only a third of the way through “The Annotated Turing” (which seems to be a great book, btw). This means what follows is a layman’s approach to this.

First off, I had to understand what is meant by “Turing complete” and how to show that a language has that characteristic.

For that, there are a couple of Wikipedia entries and SO discussions available and I leave that for everyone to read (e.g. here and here).
One of those discussions linked to a presentation that claimed to prove that PostgreSQL-SQL with recursive common table expressions (CTEs) is Turing complete. In order to prove this, the author of the presentation (here) said that it’s enough to prove that a language can emulate another language that has already been shown to be Turing complete.
Fair call.
The author’s choice was a cyclic tag system with a specific rule set (rule 110) which apparently has been shown to be Turing complete.
Then the author goes on and implements this cyclic tag system with a recursive common table expression and thereby proves the claim.

Yippie.

So, what does that mean for SAP HANA SQL?
SAP HANA SQL/SQLScript does not support recursive common table expressions (much to the distaste of everyone who tries to handle hierarchies and does not know about SAP HANA’s special hierarchy views and functions (look there)), and it also does not support recursive procedure calls.

Bummer, one might think.
Fortunately, every recursion can be expressed as an iteration (compare here), so I thought, let’s try this cyclic tag system in SQLScript.
This is the result (HANA 1, rev.122.15, on my NUC system). The SQL code is also available in my GitHub repo.

do begin
declare prods VARCHAR(4) ARRAY;
declare currProd, initWord, currWord VARCHAR(300); -- 300 is arbitrary and would be exceeded for more runs
declare currProdNo integer = 0;
declare runs, maxruns bigint = 0;

	initWord :='11001';                   -- the starting/initial 'word'
	maxruns := 100;                       -- a limit to the number of iterations 
                                              -- rule 110 is suspected to run indefinitely
    prods = ARRAY ('010', '000', '1111');     -- the three 'producer rules' stored in a string array

    currWord := :initWord;
    runs := 0;
    -- dummy table var to monitor output
    tmp = select :runs as RUNS, :currProd as CURRPROD, :currWord as CURRWORD 
    from dummy;

    while (:runs < :maxruns) do
        runs := :runs+1;
        
        currProdNo :=  mod(:runs,3)+1;    -- pick rule no. 1,2 or 3 but never 0
                                          -- as SQLScript arrays are 1 based
        currProd := :prods[:currProdNo];

        if (left (:currWord, 1)='1') then  -- add current producer to the 'word'
            currWord := :currWord || :currProd;
        end if;
        
        currWord := substring (:currWord, 2); -- remove leftmost character

        -- save current state into temp table var
        tmp = select RUNS, CURRPROD, CURRWORD from :tmp
              union all 
              select :runs as RUNS, :currProd as CURRPROD, :currWord as CURRWORD 
              from dummy;

    end while;

    select * from :tmp;                      -- output the table var
end;

Running this gives the following output:

/*
Statement 'do begin declare prods VARCHAR(4) ARRAY; declare currProd, initWord, currWord VARCHAR(300); declare ...' 
successfully executed in 717 ms 39 µs  (server processing time: 715 ms 590 µs)
Fetched 101 row(s) in 2 ms 517 µs (server processing time: 0 ms 424 µs)

RUNS    CURRPROD    CURRWORD                                                                             
0       ?           11001                                                                                
1       000         1001000                                                                              
2       1111        0010001111                                                                           
3       010         010001111                                                                            
4       000         10001111                                                                             
5       1111        00011111111                                                                          
6       010         0011111111                                                                           
7       000         011111111                                                                            
8       1111        11111111                                                                             
9       010         1111111010                                                                           
10      000         111111010000                                                                         
11      1111        111110100001111                                                                      
12      010         11110100001111010                                                                    
13      000         1110100001111010000                                                                  
14      1111        1101000011110100001111                                                               
15      010         101000011110100001111010                                                             
16      000         01000011110100001111010000                                                           
17      1111        1000011110100001111010000                                                            
18      010         000011110100001111010000010                                                          
19      000         00011110100001111010000010                                                           
20      1111        0011110100001111010000010                                                            
21      010         011110100001111010000010                                                             
22      000         11110100001111010000010                                                              
23      1111        11101000011110100000101111                                                           
24      010         1101000011110100000101111010                                                         
25      000         101000011110100000101111010000                                                       
[...]
*/

That looks suspiciously like the output from the Wikipedia link (above) and does not seem to stop (except for the super-clever maxruns variable).
With that, I’d say it’s possible to create a cyclic tag system with rule 110 in SQLScript.
The cyclic tag system is Turing complete, therefore SQLScript must be Turing complete as well.

Still, I have no idea why this would matter at all and that’s really all I can say/write about this.

Cheers,
Lars

What’s your number?

Have you been using the ISNUMERIC() function of MySQL, MS SQL Server or PostgreSQL and wondered what to use in SAP HANA instead?
If so, this is the blog for you.

Just recently I’ve seen a customer using my old workaround (first published here) and then a few days later the related stackoverflow question got updated.

This topic is not dead

The new answer to the SO discussion took a popular approach to the IS_NUMERIC problem:

If the string can successfully be converted into a numeric type, then the value is numeric (IS_NUMERIC returns 1 for TRUE), else it’s not (IS_NUMERIC returns 0 for FALSE).
This is a completely feasible approach (used on other platforms as well, e.g. Oracle or PostgreSQL) and provides a result that allows deciding whether or not the tested value can be turned into a number data type in HANA.
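The original post does not show the conversion-based function itself, so here is a minimal sketch of what it could look like (the name mirrors the IS_NUMERIC_CONV used in the measurements below; the choice of TO_DECIMAL and the exception handler are my assumptions, and whether exception handlers are allowed in scalar functions depends on the HANA revision – on older revisions this would have to be a procedure instead):

FUNCTION "DEVDUDE"."sandpit::IS_NUMERIC_CONV" (
                 in stringToCheck nvarchar(5000))
         RETURNS isNumeric integer
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
     DECLARE tmp DECIMAL(38, 10);
     -- if the conversion throws an error, the exit handler flags the value as not numeric
     DECLARE EXIT HANDLER FOR SQLEXCEPTION isNumeric := 0;
     tmp := TO_DECIMAL(:stringToCheck);   -- throws for non-convertible input
     isNumeric := 1;
END;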

IT DEPENDS, as usual

The problem, or rather the limitation, of this approach is that “numeric” can mean different things in different circumstances.
Sometimes, one might want to accept ‘123.456,–‘ as numeric. Sometimes “$500.00 –” shall be OK, and so on.
Checking the T-SQL definition of ISNUMERIC (here and here) shows that there are plenty of other options.

All this means is that there are alternative approaches to this problem.
One of those alternatives could be to use regular expressions (regex), which have been supported by HANA for quite some time now (since SP10 if I’m not mistaken).
With tools like https://regex101.com/ one can comfortably work out a pattern that matches or doesn’t match the kind of data one is interested in.

It might well be, that in your application the question is not about whether a string or sub-string can pass IS_NUMERIC() but rather something more specific.
The customer that I mentioned earlier, for example, wanted to see if a string looks like an ICD10 code (one character followed by two digits, ‘B13’, ‘A00’, etc.). Snipping out the first character and putting the rest into an IS_NUMERIC() function is not ideal here.

Instead, the regex ‘^\D\d{2}.’ can be used to capture the whole pattern.
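As a sketch, such a check can sit directly in a SQL statement (table and column names here are made up for illustration):

select diag_code,
       -- 1 = looks like an ICD10 code, 0 = does not
       map (locate_regexpr (START '^\D\d{2}' IN diag_code), 0, 0, 1) as looks_like_icd10
from diagnoses;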

Show me how it’s done

Let’s take a rather simple regex that matches any character that

  • is not a digit 0-9
  • is not a plus (+), minus (-), comma (,) or full stop (.)
  • is not the exponent character ‘E’ (as in 1.23E02)

Such a regex is ‘[^0-9\-\.\,\+e]’ (check here) with flags for global, case-insensitive matching.

In SAP HANA SQL we can use this regex in the LOCATE_REGEXPR() function. If it doesn’t find any matching string, then the result is 0, else it will return the location of the first matching character.
Since we are only interested in those cases where no matches were found, we can easily MAP this output to make up an IS_NUMERIC value:
1 = no unwanted characters found, it’s numeric
0 = unwanted characters found, it’s not numeric

MAP (locate_regexpr(START '[^0-9\-\.\,\+e]' 
                    FLAG 'i' 
                    IN <string to be searched>)
     , 0, 1
     , 0)

So, what’s better now?

Using the “try-the-conversion”-approach can only be done via a procedure or a scalar function. This brings some overhead with it for the function invocation, which can be avoided when using the regex directly in the SQL statement.

As an example, I ran a quick test against 10.000.000 unique values (all tests were done against HANA1 1.00.122 and HANA2 2.00.020).

Conversion-approach function

select 
       sum("IS_NUMERIC_CONV"(id_char) )
from 
       "DEVDUDE"."DIM10000000"
where 
       "IS_NUMERIC_CONV"(id_char) = 1
   and "IS_NUMERIC_CONV"(id_char) IS NOT NULL;

This ran for roughly 19.5 seconds with 8 CPU cores at 100% usage.
And this is the runtime when all values actually can be converted correctly. With the same number of non-numeric values, the runtime goes up to 2:45 mins.

Regex-approach pure SQL

select 
       SUM (IS_NUM_REGEX)
from 
     (select 
             MAP (locate_regexpr(START '[^0-9\-\.\,\+e]' 
                                 FLAG 'i' 
                                 IN id_char)
                  , 0, 1
                  , 0) IS_NUM_REGEX
       from 
            "DEVDUDE"."DIM10000000")
where 
     IS_NUM_REGEX = 1 
 and IS_NUM_REGEX IS NOT NULL;

This ran for roughly 8.6 seconds with 8 CPU cores at 100% usage. Same runtime for both convertible and non-convertible values, here.

Too good to be true

By now you may think: “Gee, that regex thing is a lot faster, but also really ugly code. Let’s put that into a function and have our cake and eat it too!”

Off we go:

FUNCTION "DEVDUDE"."sandpit::IS_NUMERIC_REGEX" (
                 in stringToCheck nvarchar(5000)) 
         RETURNS isNumeric integer
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
     isNumeric := MAP (locate_regexpr(START '[^0-9\-\.\,\+e]' 
                                      FLAG 'i' 
                                      IN :stringToCheck)
                       , 0, 1
                       , 0);
END;

A statement using this function takes roughly 30 seconds with 8 CPU cores at 100% usage.

No cake for us!

As long as all our input data is clean, the conversion-function approach works nicely, but it becomes really expensive when there are many cases for which the exception handling needs to be executed. The lesson here is probably that it’s not a great (performance) idea to make exception handling part of your main processing path.

A truly high-performing solution would need to be implemented as a native HANA function, so for now it needs to be carefully checked what exactly the job of IS_NUMERIC() shall be in your application and how many matches/non-matches are expected.

There you go, now you know!

Cheers,
Lars

Eco-friendly slide recycling with HANA

Hey there!

I recently presented a short session on my experiences with developing and debugging non-classic-SAP HANA solutions at the Leading Insights conference in Melbourne (27./28.3.2017).

And since the slides are prepared in a way that they could be used without my talking as well, I thought, I might just post them. So, there they are on slideshare:

If you feel this is a shameless content plug, you’re probably right, but I got the feeling that this might still be interesting to some readers of the “SAP HANA” tag.

Content-wise, this is what you can expect (just in case you don’t want to swap to page 2 which shows the table of contents):

  • an exciting example for a pure HANA development without any NetWeaver/ERP background
  • some practical tips for finding performance issues with SQL statements
  • and tidbits of performance gone bad and how to fix it.

There you go; now you know.

Cheers,
Lars

On multiple mistakes with IN conditions

Based on SAP HANA SPS 11

Update 1.6.2017

Retesting the APPLY_FILTER() approach on rev. 112.07 showed that it is now possible to simply hand APPLY_FILTER() an IN condition with a list of integers.
This approach for working around the general limitation on multiple input parameter values for scripted calculation views is also shown in SAP note “2315085 – Query with Multi-Value Parameter on Scripted Calculation View Fails with Incorrect Syntax Error”.

Therefore, if you are using a current version of SAP HANA and still develop/use scripted calculation views, it is not recommended to use the CE-function approach I explained in this blog post. The limitation that this approach worked around has been fixed in SAP HANA.


Dear readers

there is a long-standing modelling problem with SAP HANA calculation views:

Using multiple input parameters to filter data similar to the SQL IN predicate.

This discussion Handling multi value input parameters can be taken as a comprehensive example.

It seems so straightforward at first and so practical.

Once the input parameter is defined the data preview tool built into SAP HANA Studio or your reporting client of choice can read the meta data for it and present the user with a nice UI dialog to specify values.

Something as fancy as this:

[screenshot: input parameter value selection dialog]

Now, the way that this works is rather counterintuitive.

For graphical calculation views, there are a couple of nicely written blog posts available, like Using Multiple Values in Input parameter for filtering in Graphical Calculation View, but it seems that scripted calculation views simply did not want to be as flexible.

For those, rather clunky (and not very well performing) solutions had to be built to make it possible at all (see SAP HANA: Handling Dynamic Select Column List and Multiple values in input parameter or How to process and use multi-value input parameter in a scripted view in HANA).

Either the solution involved dynamic SQL or some form of parameter string mangling with loops and pseudo-dynamic temporary result set constructs.

Other approaches proposed to avoid the problem altogether and use multiple parameters (instead of one multi-valued parameter).

Developer arrogance driving solution finding…

The last time I read one of those discussions (yesterday) I thought:

 

“This cannot be the right solution. There must be some easier way to do it!”

So arrogance got the better of me – HA! It cannot be that difficult. (It’s so cheesy that for once Comic Sans is a fitting choice).

I dare to guess that nearly every developer had that feeling every now and then (if not, I would have a hard time finding a good explanation for so many drastically underestimated development efforts…)

Attacking the problem

My first impulse was to use the APPLY_FILTER() function, but I soon learned what many others probably discovered before: it doesn’t solve the problem.

The reason for that is the way APPLY_FILTER() works.

It takes the table variable and your filter string and constructs a new SQL statement.

For example, if your table variable is called vfact and your input parameter selection was 1, 2 and 5 your scripted calculation view could look like this:

/********* Begin Procedure Script ************/
BEGIN
  declare vfiltD10 nvarchar(50); -- this is a temp variable to construct the filter condition
  vfact = select * from fact;
  vfiltD10 := ' "DIM10" IN ( ' || :IP_DIM10 || ' )';
  var_out = APPLY_FILTER (:vfact, :vfiltD10);
END /********* End Procedure Script ************/

This compiles fine and if you try to run it with some parameters you are greeted with a surprise:

SELECT
     "DIM10", "DIM100", "DIM1000", "DIM1000000",
     "KF1", "KF2"
FROM "_SYS_BIC"."devTest/MULTIIP"
        ('PLACEHOLDER' = ('$$IP_DIM10$$','1,3,6')) ;

Could not execute ‘SELECT “DIM10”, “DIM100”, “DIM1000”, “DIM1000000”, “KF1”, “KF2” FROM “_SYS_BIC”.”devTest/MULTIIP” …’ in 373 ms 962 µs .

SAP DBTech JDBC: [2048]: column store error: search table error:  [2620] “_SYS_BIC”.”devTest/MULTIIP/proc”: [130] (range 2) InternalFatal exception: not a valid number string ‘1,3,6’

Not only is this error annoying, but it’s FATAL… shudder!

After some investigation, I found out that the input parameter not only provides the digits and the separating commas but also the enclosing single-quotes.

Nothing easier than getting rid of those:

  vfiltD10 = ' "DIM10" IN ( ' || replace (:IP_DIM10 , char(39), '')  || ' )';

With this, the single-quotes get easily removed (39 is the ASCII value for the single quote and the CHAR function returns the character for the provided ASCII code – this just makes it easier to handle the double-triple-whatever-quotation syntax required when the single-quote character should be put into a string).
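For comparison, the same line written with the doubled-quote syntax that CHAR(39) helps to avoid:

  -- '''' is a string literal containing exactly one single-quote character
  vfiltD10 := ' "DIM10" IN ( ' || replace (:IP_DIM10, '''', '') || ' )';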

Of course, seeing that we have not yet reached the end of this blog post, you already know: that wasn’t the solution.

The problem here was not only the quotation marks but also that SAP HANA does not parse the string for the input parameter value. The result for the filter variable is that we do not get the condition

  actual condition          ===> syntax structure

  “DIM10” IN ( 1, 3, 6)     ===> X IN ( c1, c2, c3)

but

  “DIM10” IN ( >’1, 3, 6′<) ===> X IN ( c1 )

So even when we remove the quotation marks, we still end up with just one value (I enclosed this single value in >’ ‘< for easier distinction).

Interlude

The different syntax structures pointed out above are easily overlooked in standard SQL as well. Often developers do not fully realise that an IN condition with 3 parameters is structurally different from an IN condition with 2 or 4 parameters.

Whenever the number of parameters of the IN condition changes, the statement is effectively a new statement to the database, requiring new parsing and optimisation and also allocating its own space in the shared SQL cache.

This is another detail that ABAP developers do not need to worry about, since the SAP NetWeaver database interface gracefully splits up IN-lists into equal chunks and recombines the result set automatically. See the ancient piece SAP Support case “FOR ALL ENTRIES disaster” for more details.

One approach to avoid this issue can be to use temporary tables instead of the IN condition. Especially when parsing/query optimisation takes a long time for your application, this might be an approach worth implementing.
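As a sketch (using the FACT table from this example), the temporary-table variant could look like this – the statement text then stays stable no matter how many filter values there are:

create local temporary table #dim10_filter (val integer);
insert into #dim10_filter values (1);
insert into #dim10_filter values (3);
insert into #dim10_filter values (6);

select "DIM10", "DIM100", "DIM1000", "DIM1000000", "KF1", "KF2"
from fact
where "DIM10" in (select val from #dim10_filter);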

Back to the main topic though!

So, the “obvious” approach of using APPLY_FILTER() does not help in this case.

Is it possible that it is just not possible to take multiple input parameter values into an IN list? But graphical calculation views can do it – and rather easily.

And in this observation lay the key to the solution. What is different between graphical and scripted calculation views?

Right, graphical calculation views do not produce SQL for the boxes we draw up.

Technically speaking it replaces them with Plan Operators – very much similar to the abandoned CE_-functions.

Do you see where this is heading?

Yes, indeed. The solution I found works with CE_-functions.

Oh, how very heretic!

May the performance gods get angry with me for making the SAP HANA execution switch engines…

But first, let’s look at the solution, shall we?

/********* Begin Procedure Script ************/
BEGIN
     vfact = select * from fact;
  var_out = CE_PROJECTION(:vfact,
                     [ "DIM10", "DIM100", "DIM1000", "DIM1000000"
                     , "KF1", "KF2" ],
                      'IN ("DIM10", $$IP_DIM10$$)');
END /********* End Procedure Script ************/

Easy to see, this approach mimics the filter approach for graphical calculation views.

To not overcomplicate things I only used the CE_PROJECTION function for the filter part – everything else is still in efficient, familiar SQL.

Important to note is that this only works when the input parameter is referenced in the $$<name>$$ format.

Also important to recall is that the complete filter expression needs to be provided as one string enclosed in single quotation marks ( ‘ <filter expression goes here> ‘ ).

“OK!”, you may say, “this works, but now you broke the holy rule of CE_-functions damnation. The performance of this surely is way worse due to the implicit engine change!”

Well, let’s have a look into this!

First the explain plan for the SQL based statement:

SELECT
     "DIM10","DIM100","DIM1000", "DIM1000000",
     "KF1","KF2"
FROM FACT
where DIM10 IN (1,3,6) ;

OPERATOR_NAME   OPERATOR_DETAILS                                         EXEC_ENGINE SUBTREE_COST
COLUMN SEARCH   FACT.DIM10, FACT.DIM100, FACT.DIM1000, FACT.DIM1000000,  COLUMN      1.645529062
                FACT.KF1, FACT.KF2
                (LATE MATERIALIZATION, OLTP SEARCH, ENUM_BY: CS_TABLE)
  COLUMN TABLE  FILTER CONDITION:
                (ITAB_IN (DIM10))
                FACT.DIM10 = 1 OR FACT.DIM10 = 3 OR FACT.DIM10 = 6       COLUMN

Now the scripted calculation view version:

SELECT
     "DIM10","DIM100","DIM1000", "DIM1000000",
     "KF1","KF2"
FROM "_SYS_BIC"."devTest/MULTIIP"
        ('PLACEHOLDER' = ('$$IP_DIM10$$','1,3,6')) ;

OPERATOR_NAME   OPERATOR_DETAILS                                         EXEC_ENGINE SUBTREE_COST
COLUMN SEARCH   FACT.DIM10, FACT.DIM100, FACT.DIM1000, FACT.DIM1000000,  COLUMN      1.645529062
                FACT.KF1, FACT.KF2
                (LATE MATERIALIZATION, OLTP SEARCH, ENUM_BY: CS_TABLE)
  COLUMN TABLE  FILTER CONDITION:
                (ITAB_IN (DIM10))
                FACT.DIM10 = 1 OR FACT.DIM10 = 3 OR FACT.DIM10 = 6       COLUMN

See any difference?

No?

That’s right, there is none. And yes, further investigation with PlanViz confirmed this.

SAP HANA tries to transform graphical calculation views and CE_-functions internally to SQL equivalents so that the SQL optimizer can be leveraged. This does not always work since the CE_-functions are not always easy to map to a SQL equivalent, but a simple projection with a filter works just fine.

Now there you have it.

Efficient and nearly elegant IN condition filtering based on multiple input parameters.

There you go, now you know.

Have a great weekend everyone!

Lars

 

Fiddle riddle – Amazing: stuff gets faster and takes less space!

In Quick notes during a fiddle-session I played around with a table that contained all possible values for a date column.

To fill it I used a FOR LOOP with an INSERT and a commit every 10000 records.

That took 9:37 minutes for 3.652.061 records.

That’s terrible performance and nothing that needs to be accepted with SAP HANA!

I found a way to do the same much quicker:

Statement ‘insert into rmanydates (select * from all_days)’

successfully executed in 6.120 seconds  (server processing time: 4.630 seconds) – Rows Affected: 3652060 ;

Question 1:

The dates inserted above have not been pre-computed in this example.

How do I do this?

——–

Answer 1

 

The 6.120 seconds to create every possible date once and insert it into the row table is a lot faster than the one-by-one looping approach I took in the procedure (mainly for clarity purposes).

But, as Fernando Da Ros demonstrated, there are other options to do that.

The one I chose was part of a rather new feature in SAP HANA: SAP HANA Series

SELECT to_date(GENERATED_PERIOD_START)
    FROM SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999')
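Put together, the mass insert from above then presumably boils down to a single statement (assuming rmanydates is the row table from the earlier post):

insert into rmanydates
    (select to_date(GENERATED_PERIOD_START)
     from SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999'));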

——–

I also looked at the memory consumption for the stored date tuples.

In the blog post I just used a row store table, because I thought: well, there’s not going to be any column store compression anyhow.

(you know, because column store compression mainly builds on compressing duplicate column values. But with every possible date once, there are no duplicates, so no compression – or is there?)

However, I managed to get the data loaded into a column store table and use a lot less memory.

Question 2:

How to get from

 

BEFORE

— General —

Total Memory Consumption (KB): 25.719

Number of Entries: 3.652.061

Size on Disk (KB): 23.152

Memory Consumption in Main Storage (KB): 25.700

Memory Consumption in Delta Storage (KB): 19

Estimated Maximum Memory Consumption (KB): 25.719

to

AFTER

— General —

Total Memory Consumption (KB): 1.645

Number of Entries: 3.652.061

Size on Disk (KB): 12.912

Memory Consumption in Main Storage (KB): 1.626

Memory Consumption in Delta Storage (KB): 19

Estimated Maximum Memory Consumption (KB): 1.645

Both tables were fully loaded when the space consumption was analyzed.

If you know how that works, put your answers into the comments section! 😀

——–

Answer 2

This one is again a feature from SAP HANA Series: the series table.

 

create column table smanydates (dday date)
SERIES (  EQUIDISTANT INCREMENT BY INTERVAL 1 DAY
     MINVALUE '01.01.0001'
     MAXVALUE '31.12.9999'
     PERIOD FOR SERIES (dDAY));

The above CREATE TABLE statement specifies that the timestamp information in column DDAY is not explicitly stored.

Instead, the internal storage is merely a calculation formula that “knows” how every date between MINVALUE and MAXVALUE with a granularity of 1 DAY can be computed.

Instead of storing a full date now, only a number uniquely identifying a day needs to be stored.

Given the starting date ‘01.01.0001’ the current day would then be stored just as the integer 735.832 (20 bits).
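That number can be cross-checked with plain SQL – at the time of writing, the following returns an integer in exactly that region:

-- number of days the series table has to encode for the current date
select days_between (date'01.01.0001', current_date) from dummy;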

Of course there are also limitations to this type of data storage, but for ever-increasing regular date or timestamp information this is really an option to save a lot of memory.

For more information around SAP HANA Series please check the SAP HANA documentation.

——–

– Lars

Quick notes during a fiddle-session

For a while now I have been working as a Solution Architect for the SAP Custom Development organisation.

This is the team you’d call when you want custom-built solutions that perfectly fit into your standard solution landscape.

My job there is not coding or implementing systems, but rather designing solutions and turning those designs into commercial offers.

Obviously, it seems like a good idea to every now and then go and check that what I remember of doing “real work” is actually still true, and so I try to keep up with our tools and platforms as well as anyone.

While trying out different stuff for something that might become a blog post later on, I came across some things I considered noteworthy, “interesting” or otherwise out of the expected.

All this is based on SAP HANA rev. 101 and SAP HANA Studio 2.1.6.

Change of default statement preparation in SAP HANA Studio

In Issue: MDX Query not executing in Hana Studio 2.0.8 Version and Re: Regarding “Generate Time Data” Option in HANA folks complained about the fact that MDX statements suddenly started to return error messages even though they worked in older SAP HANA Studio versions.

The error message was:

SAP DBTech JDBC: [2]: general error: MDX query doesn’t support prepared statement

I first proposed to circumvent this by creating a little wrapper procedure but this is inconvenient.

Shady Shen provided the correct hint: un-check the checkbox “Prepare statements before execution”.

This seems to have been set to active by default in the more recent versions of SAP HANA Studio. Once again a change that doesn’t show up in any documentation or what’s-new note… 🙁

At least it’s not necessary to close/re-open the SQL editor to get this changed setting active.

No DEFAULT value specification in INSERT/UPDATE/UPSERT possible

Let’s say you were diligent with your table column specification and not only provided a name and data type but also a NOT NULL constraint and a DEFAULT value clause.

Something as extravagant as this:


create table test (id int not null default 0);


Now you want to make use of this and by following SQL standard you use the DEFAULT keyword to specify that you want to have the pre-defined default value:


insert into test (id) values (default);


SAP DBTech JDBC: [260]: invalid column name: DEFAULT: line 1 col 31 (at pos 30)

This is not to say SAP HANA doesn’t support default values.

If you want to use default values, you simply must not include the column in the DML statement at all.

So the following works without problems:


alter table test add (id2 int not null default 1);
insert into test (id) values (1);
select * from test;


ID ID2
1 1

Would be lovely to find such nuances in the documentation.

Limits of date operations

When working with date data types it’s real easy to forget about the limitations of this data type.

One example could be that non-existing dates like February 30 or dates between 05.10.1582 and 14.10.1582 are rejected or mapped to the correct dates.

Another limitation I actually ran into while generating random garbage data for a date column is the smallest/largest date.

This time the documentation is clear about this: The range of the date value is between 0001-01-01 and 9999-12-31.

Usually I wouldn’t think much about this, but when using expressions like the following to create random dates, I hit an error:


select add_days (date'01.01.1980', rand()*10000000) from dummy;


[303]: invalid DATE, TIME or TIMESTAMP value: internal error in AddDays() at function add_days()

In my example, the 10000000 was, in fact, a sequence number in a rather large data set.

So, the quick fix would be to set an upper limit for the value that is fed into add_days().


select add_days (date'01.01.1980', least(rand()*10000000, 3560)) from dummy;


The least() function comes in real handy here as it makes sure that the largest possible value returned will be 3560 (roughly 10 years’ worth of days).

Together with its sister the greatest() function, it’s easy to create upper and lower limits for generated values.
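As a sketch, combining the two clamps the generated value into a fixed range:

-- the offset fed into add_days() now always stays between 0 and 3560
select add_days (date'01.01.1980',
                 greatest (least (rand()*10000000, 3560), 0))
from dummy;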

But, what if we don’t want to arbitrarily set the limit below what would be possible with this data type?

Easy too:


select days_between (date'31.12.9999', current_date) from dummy;


DAYS_BETWEEN(31.12.9999,CURRENT_DATE)

-2916233

Gives you the number of days between today and the end of times – huh… gotta hurry!

Admittedly this example is super artificial and has no connection to real-life dates.

However, it shows that there are limitations to the date data type in SQL and that it’s not too difficult to step into them.

There are of course other examples where a richer semantic for this data type would be useful.

Being able to have a MIN_DATE or MAX_DATE entry that would explicitly show up like this would be one such improvement.

Wandering off…

The date topic side-tracked me at this point and I looked into some other points, too:

I’ve shown the “generator query”-trick (cross join of objects system table to generate millions of rows) before.
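For those who haven’t seen it, a minimal sketch of that trick (assuming the SYS.OBJECTS system table is large enough on your system for the cross join to reach the desired row count):

-- cross join of a sizeable system table with itself yields plenty of rows;
-- row_number() turns them into a usable sequence
select row_number() over () as rn
from sys.objects a cross join sys.objects b
limit 1000000;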

This is of course not the only way to create a lot of rows.

Another option is to write SQL Script and this was the first time I used the new SPS 10 capability to run SQL script directly in the SQL console without the need for a wrapper procedure:


create table manydates (dday date);
do
begin
    declare i bigint;
    for i in 0 .. 3652060 do
        insert into manydates (dday) values (add_days(date'01.01.0001', :i) );
        if (mod(:i, 10000) = 0 )
        then
            commit;
        end if;
    end for;
end;


This little gem runs for a while and fills a table with all possible date values.

Statement ‘do begin declare i bigint; for i in 0 .. 3652060 do insert into manydates (dday) values …’

successfully executed in 9:37.694 minutes  (server processing time: 9:37.681 minutes) – Rows Affected: 0

Now, above we claimed that no “non-date” entries would be created.

And checking for duplicates via


select count(distinct dday) from manydates;


and


select dday, count(*) from manydates
group by dday having count(*) >1;

show that there aren’t any.

But how’s this?

We are looking at how many years worth of days here? 9999 exactly.

So, leaving out leap years we should end up with 365 * 9999 = 3.649.635 separate days.

That number is 2.426 days short of what we actually find in our table.

Alright, so then just taking 366 days per year will help us, maybe?

Nearly: 3.659.634 days result here, which is 7.573 days too many.

Maybe trying to account just for the leap years separately can help here.

Leap years occur roughly every four years, so for 9999/4 = 2.499 years, we should add one day.

That gets us to 3.652.134, which is just 73 days off from the actual number of days in this dates table.

We can even check this in SQL again:


select count(distinct year(dday))  from manydates
where mod(to_integer(year(dday)), 4) = 0;

COUNT(DISTINCT YEAR(DDAY) )

2499

But what leap years didn’t we cover with our simple arithmetic?


(select year(dday), count(dday)
    from manydates
    where  mod(to_integer(year(dday)), 4) = 0
    group by year(dday))
     
EXCEPT
(select year(dday),  count(dday)
    from manydates
    group by year(dday) having count(dday) > 365
)
order by year(dday);

Fetched 63 row(s) in 21 ms 617 µs

YEAR(DDAY) COUNT(DDAY)
1700 365
1800 365
1900 365
2100 365
2200 365
2300 365
2500 365
2600 365
2700 365
2900 365
3000 365
3100 365
3300 365
3400 365
3500 365
3700 365
3800 365
3900 365
4100 365
4200 365
4300 365
4500 365
4600 365
4700 365
4900 365
5000 365
5100 365
5300 365
5400 365
5500 365
5700 365
5800 365
5900 365
6100 365
6200 365
6300 365
6500 365
6600 365
6700 365
6900 365
7000 365
7100 365
7300 365
7400 365
7500 365
7700 365
7800 365
7900 365
8100 365
8200 365
8300 365
8500 365
8600 365
8700 365
8900 365
9000 365
9100 365
9300 365
9400 365
9500 365
9700 365
9800 365
9900 365

If you followed up to here, you’ll notice, that we were looking for 73 “missing days”.

Comparing the set of years where the number of the year is divisible by 4 without remainder and the set of years where the number of days is larger than 365 revealed 63 days.

So, where are the last 10 days here?

Remember that odd reference to the Gregorian Calendar reformation?

That took ten days away in 1582!


select count(distinct dday) from manydates
where year(dday) = 1582;

COUNT(DISTINCT DDAY)

355

So far, so good.

We found all the “missing records”, but why aren’t these counted as leap years anyhow?

That’s because I used a too simple rule here.

Leap Year nearly every 4 years explains that a leap year is only counted as such if

  • The year is evenly divisible by 4;
  • If the year can be evenly divided by 100, it is NOT a leap year, unless;
  • The year is also evenly divisible by 400. Then it is a leap year.

Let’s run that against our dates table:


select year(dday) yr, count(dday) day_cnt
from manydates
where
       mod(to_integer(year(dday)), 4) = 0
       and
       not (mod(to_integer(year(dday)), 100) = 0)
      or
        mod(to_integer(year(dday)), 4) = 0
        and
        mod(to_integer(year(dday)), 400) = 0
group by year(dday)
;

Fetched 2424 row(s) in 123 ms 147 µs (server processing time: 1 ms 665 µs)

Alright – the rule gives us 12 years less than what is in our dates table.

Which ones?


select mod_y.yr mod_yr, long_y.yr long_yr
from
  (select year(dday) yr, count(dday) day_cnt
  from manydates
  where
        mod(to_integer(year(dday)), 4) = 0
        and
        not (mod(to_integer(year(dday)), 100) = 0)
       or
         mod(to_integer(year(dday)), 4) = 0
         and
         mod(to_integer(year(dday)), 400) = 0
        group by year(dday)
  ) mod_y
     
FULL OUTER JOIN
  (select year(dday) yr,  count(dday) day_cnt
     from manydates
     group by year(dday) having count(dday) > 365
  ) long_y
  on mod_y.yr = long_y.yr
WHERE
   mod_y.yr is null or long_y.yr is null;

MOD_YR LONG_YR
? 100
? 500
? 900
? 1300
? 200
? 600
? 1000
? 1400
? 300
? 700
? 1100
? 1500

Can you spot the reason?

All of those records violate the rule of not being divisible by 100 except when they are divisible by 400.

Why had those years been generated with one day too many by our add_days() function?

The Gregorian Calendar Reformation clearly did change history but it didn’t change the past 🙂

Before this regulation got active, the leap years didn’t follow those rules (which called for the Reformation in the first place).

And as we see there are no false leap years after the year 1500.

That’s it for today.

Funny things come up when one spends some hours on a rainy Sunday afternoon fiddling with SAP HANA SQL.

There you go, now you know!

– Lars

The 7th Niceness of new toys’n’tools

If you’re not living behind some red moon and you’re reading this blog, you will likely be aware that SAP HANA SPS 10 has just been released.

With it, the SAP HANA Studio has been improved and extended as well.

Since I personally like eye-candy and enjoy exploring new features when they come with a button to click, I have installed the new version on my laptop and played with it – against an SPS 9 SAP HANA database!

That’s right – everything I describe here works just fine on a Rev. 96, which means you can use these features even without upgrading your server.

(This is one of the benefits of the de-coupling of SAP HANA Studio releases and SAP HANA Server releases – indicated by the different version numbers.)

So here we go with the nice things I found in SAP HANA Studio 2.1.4.

Drag and drop of design time objects

Basically since forever I have wished I could just drag and drop information models into the SQL editor so that I could execute SELECT statements against them.

No idea what took the dev-colleagues so long, but now that’s actually possible.

Simply drag the design time object (you know, the object as it appears under the Content folder) over into the SQL editor and the corresponding runtime object name (yes, the _SYS_BIC.”<package_name>… <object_name>”) is placed right at your cursor.

Alternatively, there is a context menu entry “Copy Column View Name” that puts the name into the clipboard.

Nice!

SELECT statement generation incl. input parameters

Similar to the above, you might have stumbled over models with input parameters.

The usual workaround to use the SQL generated by the data preview is alright, but a bit clumsy.

With SAP HANA Studio 2.1.4 you can now simply right-click and select “Generate Select SQL“.

After a few moments of checking the metadata, you get a new SQL editor with the basic SELECT SQL:

 

As you can tell, someone automatically pressed the “Format” button as well.

Nice!

Compare two plans in PlanViz

More than once I got the question of how to perform comparisons between two PlanViz traces, and the usual answer had to be: you put the two PlanViz editors next to each other and compare the Plan Operator boxes one by one. This is of course tedious.

Now, there is the (once again – anyone remember Show me the timelines, baby! ? 😉 ) well hidden feature to compare two plans.

I found it in the context menu of an executed PlanViz by clicking on “Compare With Another Plan…” – not sure if there is another way to this feature.

Next you’re greeted by a little selection dialogue window, that I found rather self-explanatory:

Seems to be straight forward – so I selected one of the other open PlanViz displays by clicking “Other visualized plan“.

DISCLA(I)MER: for the next step, you should have a wide monitor. Or two monitors. Wide ones. Really! (It’s better for you anyhow, if you use it right.)

See? I told you so!

As I haven’t really worked with the feature there’s little I can report beyond that it’s there.

Go play around yourself!

In my books, this feature really is:

Nice!  🙂

Immediate execution for PlanViz

Alright, you have your SQL statement. You want the PlanViz for it.

Up to now you were always confronted with some “Prepare” step and a graphical version of the Explain Plan.

“How useful… not!” you might have thought (so did I most of the time 😉 ).

No more of that!

Modern days SAP HANA Studio comes with this shortcut in the context menu:

Nice!

Identify plan operations in PlanViz by name

Ok, the following is not actually a new feature, but I have just found it and decided to put it in this list anyhow.

A major difficulty in analysing PlanViz outputs often is to map the model design to the actual execution.

This is inherent to the calculation view instantiation process but you can make it a bit easier for you by naming the nodes in your model with your own names.

To understand what I mean, just look at those two models, that do pretty much the same thing:

[screenshots: default node names vs. custom node names]

At this point there is already a benefit as the intended function for the nodes is much clearer.

Now, when the model gets executed and a PlanViz is created, we can find those node names in several places:

1. In the POP node details:

2. In the “Tables Used” list:

3. In the “Operator List”:

You will still find a lot of things that you cannot easily match back to the model, but with the custom node names, you should be able to find some anchors in the PlanViz.

Not sure about you, but I think that’s Nice!.

Comments in information models

One of the things you typically wish you had when working with complex models that you or someone else created a while ago is: documentation.

  • Why, just why is this condition there?
  • What exactly is the idea behind this calculated column?
  • Where should this data go?

Wish no more!

With the SPS 10 SAP HANA Studio you can now add comments/notes to virtually anything in the model.

Click the little “Maintain Comment” button in the model, when hovering over a node…

… then enter some amazingly useful comments!

Once you close the comments editor you’ll see a little indicator at the node-icon to show that there’s a note for you:

Let’s say it together:

Nice!

Data Lineage in Modeler

When working with the modeler it can get quite difficult to understand where the data of a certain output column actually comes from.

To help with that, there is now the Data Lineage feature available.

The red borders around the model nodes indicate that the data is “touched in there” and the little annotation boxes show you the column name per node.

Nice!

Alright, given that I only had a few hours to look into it, I’m personally quite happy about the result here.

Seven “NICE” for the current major release of SAP HANA Studio (and there are more, like the on-demand loading of the Contents sub-folders, but I couldn’t really capture this with a screen shot 😀 ).

 

There you go, now you know!

– Lars

Quick note on IDENTITY column in SAP HANA

Not sure if you knew this already, but starting with SPS 8 (it seemed to work with Rev. 74 as well), SAP HANA comes with an option to have an IDENTITY column in column store tables.

The idea here is that you can insert data into the table and every row will automagically get a new unique number assigned.

Of course that’s possible e.g. by defining a sequence and using <sequence>.nextval for the INSERT/UPDATE command, but having it “built-in” to the table is kind of neat and can help make data loading easier.
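For comparison, a sketch of that manual, sequence-based variant (object names are made up):

create sequence name_id_seq start with 1;

create column table some_names_classic
         (ID bigint not null primary key,
          NAME nvarchar(30));

-- the ID has to be provided explicitly via the sequence
insert into some_names_classic values (name_id_seq.nextval, 'Huey');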

To create such an identity column just use the following syntax:

CREATE COLUMN TABLE <table_name> (<column_name> <num_data_type> GENERATED ALWAYS AS IDENTITY);

CREATE COLUMN TABLE <table_name> (<column_name> <num_data_type> GENERATED BY DEFAULT AS IDENTITY);

Note (edit 27.06.14): the first syntax above (GENERATED ALWAYS, struck out in the original post) was actually only half correct.

Although it technically works it does not yield the correct result, since it would always generate an ID value even if one is provided.

The newly introduced GENERATED BY DEFAULT option for SPS 8 does handle this in a better way. It only generates values if no value is provided and thus the default value should be used. This matters especially for migration use cases where existing data with existing values for the IDENTITY column has to be inserted into the table.

Make sure not to put IDENTITY into any kind of quotation marks!

For example:

create column table some_names

         (ID bigint not null primary key generated by default as IDENTITY,

          NAME nvarchar(30));

And now we can do things like this:

insert into some_names (name) values ('Huey');

insert into some_names (name) values ('Dewey');

insert into some_names (name) values ('Louie');

select * from some_names;

ID NAME
1 Huey
2 Dewey
3 Louie
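To see the GENERATED BY DEFAULT behaviour in action, an explicitly provided ID should survive while omitted IDs still get generated (a sketch continuing this example):

insert into some_names (id, name) values (42, 'Scrooge'); -- the provided 42 is kept
insert into some_names (name) values ('Webby');           -- gets the next generated ID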

The magic behind this is of course created by sequences:

select column_name, column_id from table_columns where table_name = 'SOME_NAMES';

COLUMN_NAME COLUMN_ID
ID 145210  
NAME 145211

select * from sequences where sequence_name like '%145210%';

SCHEMA_NAME SEQUENCE_NAME SEQUENCE_OID START_NUMBER MIN_VALUE MAX_VALUE INCREMENT_BY IS_CYCLED RESET_BY_QUERY CACHE_SIZE
SYSTEM _SYS_SEQUENCE_145210_#0_# 145215 1 1 4611686018427387903 1 FALSE select max(“ID”)+1 from “SYSTEM”.”SOME_NAMES” 1

Aaaaand, that’s it 🙂

With this it is even easier to migrate from other platforms to SAP HANA.

There you go – now you know!

Cheers,

Lars

Of groups of serial shoppers…

based on SAP HANA revision 74

>>> update 09.02.2016

>>> if you are trying to work on a similar problem, please make sure to also review the solutions posted in discussion Fetch the first record with 24 hr interval based on first record timestamp

>>> The solutions posted there are much more efficient and solve the problem better in multiple ways.

>>> update 09.02.2016

Ok, I pondered this one for a while and I did not find a satisfying solution.

So I thought: let’s take this to the community and see what others can come up with.

But let’s start at the beginning.

A few days ago a colleague sent me an email asking for advice:

I inserted the colors and markers to make the problem a bit easier to understand.

Basically, the report should aggregate shopping transactions based on the time relation between each of them.

For every transaction of every customer, it needs to be checked whether there had been a transaction happening within a time frame of 14 days before.

If there is at least one such transaction, these two now belong to the same group.

This of course accumulates like a rolling sum, so whenever a transaction still ‘glues’ to the group via the 14 days range rule, it needs to be included in the group.

 

In the final result, however, the dates of the very first and the very last transaction in this group (or chain) of transactions need to be shown along with the SUM of all AMOUNTs.

 

It took me quite some thinking, but after a while I came up with nested window functions – something like this:
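The screenshot of that query is not reproduced here, but the idea was roughly the following sketch (assuming the LAG() window function is available on your revision):

select cust_id, tx_date, amount,
       case
           when lag(tx_date) over (partition by cust_id order by tx_date)
                >= add_days(tx_date, -14)
           then null     -- within 14 days of the predecessor: same group
           else tx_date  -- otherwise a potential group start date
       end as start_date
from customers;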

At this point I was nearly sure to have nailed it.

Just bring down the START_DATE to the NULLed columns and GROUP BY CUST_ID, START_DATE, right?

Wrong. Unfortunately 🙁

Unfortunately there is no way to get the first non-NULL value from the preceding records.

 

Without that, there is no combination of columns available to perform the grouping as required.

We’re toast!

 

So, the solution I found is this:

drop type customers_tt;
create type customers_tt as table
    (cust_id integer, from_date date, to_date date, amount decimal(10,2));

create global temporary column table f14_custom_group
    ( cust_id integer
    , group_id integer
    , tx_date date
    , amount decimal (10,2));

drop procedure forth_nightly_shopping;
create procedure forth_nightly_shopping ( OUT result customers_tt)
language sqlscript
as
begin
    declare cur_group_start date := to_date('01-01-1950', 'DD-MM-YYYY');
    declare cur_group int := -1;
    declare cur_group_cust_id int := -1;
    declare cur_cust_id int := -1;
    declare cur_tx_date date;
    declare last_tx_date date := to_date('01-01-1950', 'DD-MM-YYYY');
    declare cur_amount decimal (10,2);
    declare cursor c_pre_aggr for select cust_id, tx_date, sum(amount) as amount
                                  from customers
                                  group by cust_id, tx_date
                                  order by cust_id, tx_date;

    truncate table f14_custom_group;

    for cur_row as c_pre_aggr do
        cur_tx_date := cur_row.tx_date;
        cur_cust_id := cur_row.cust_id;
        cur_amount  := cur_row.amount;

        -- new group check: (new cust_id) or
        -- (old cust_id but tx_date is more than 14 days away from last transaction date)
        if cur_cust_id <> cur_group_cust_id then
            -- new customer -> new group:
            -- increase the group id counter,
            -- set the current transaction date to be the start date for the group,
            -- set the current customer id to the current customer id
            cur_group := cur_group + 1;
            cur_group_cust_id := cur_cust_id;
            cur_group_start := cur_tx_date;
        elseif cur_cust_id = cur_group_cust_id
               and cur_tx_date >= add_days (last_tx_date, 14) then
            -- old customer
            -- but new group
            cur_group := cur_group + 1;
            cur_group_start := cur_tx_date;
        end if;

        insert into f14_custom_group (cust_id, group_id, tx_date, amount)
                              values (cur_cust_id, cur_group, cur_tx_date, cur_amount);

        last_tx_date := cur_tx_date;
    end for;

    result = select cust_id
                  , min (tx_date) as from_date
                  , max (tx_date) as to_date
                  , sum (amount) as amount
             from f14_custom_group
             group by group_id, cust_id
             order by cust_id, min(tx_date);

    truncate table f14_custom_group;
end;

call forth_nightly_shopping ( ?);

It does the trick and the performance might be acceptable (3.2 Mio records grouped and aggregated in 23 secs).

However, the problems are obvious:

  • Further parallelism will be difficult to build into this (maybe doable with SP8)
  • the implementation is very inflexible
  • Any option to add selection criteria will need to be included in the procedure and the call
  • Massive copying of data required

Sooo… Does anybody have a better idea to tackle this one?

Cheers,

Lars