Point-in-time-recovery considered harmful

When I first got in contact to the possibility to bring back a database to the state it had, when a backup was taken I was not too much impressed.
I though “Well, it’s like any other backup – copy the saved file back to the place I found it.”

Getting deeper into the tasks of a database developer (who had to take care of the database instance as well), I soon figured out that there is a much cooler feature available: point-in-time-recovery (PITR).

This feature allows to re-execute every change made to the database after the backup has been taken, so that it is possible to bring the database right back to any state it had between the time the backup has been taken and the moment where most recent change had been done.

One major opportunity of this feature that is broadly advertised by all database vendors: the ability to undo human mistakes – even if they are recognized hours or days after the mistakes happened.
Just recover the latest backup and ‘roll-forward’ just before the disastrous “DROP TABLE” was issued (for some reason this must be really the most frightening thing to the developers of the database documentation …).

So if this is that great – what is wrong with it?

The feature itself is a very nice thing that – applied correctly – brings many advantages.

However, the problem is that it is most often not applied correctly. Although the PITR guarantees to result in a transaction-consistent database, the use of it may lead to severe inconsistencies in your “data world”.

Usually database developers expect a database to be the only storage of information. When this assumption hold true, the PITR is completely nice. Unfortunately this assumption is always never true.

Synchronize with the real world

The information stored in the database represent facts we know about the reality. That could be orders, invoices, customer names and the like.
If we set back just the database to a state it had in previous point in time the facts in the real world are not sat back as well.

Even worse, the real world has no way to know that the database from which your company takes all information for its business processes is not “up-to-date” anymore.
If your real world supplier gets a second order for material he usually will not suspect that you made a mistake – he will fulfil the order and invoice you.

It is even worse – in today’s companies there is not a single database installation, there are tens, hundreds or thousands of production database running. Many of them are connected via interfaces to synchronize data and trigger the execution of processes based on this data. Now assume one central database to be set back to a point in time, let us say yesterday morning. All processes triggered and eventually finished since then, will be re-triggered.

Ask yourself, what would happen in your company if such a thing occurs.

Root cause

So what is the reason for all these problems?
Basically, the problems occur because there is no general synchronization between systems.
Systems cannot be synchronized well by timestamps as it is impossible to bring the exact same time to every system – at least the deviations will not be small enough.
Therefore, database systems rely on time-independent mechanisms to bring order to the sequence of changes made to the data.

In Oracle, this is the System-Change-Number, in MaxDB, it is called log-IO-number and there are similar concepts in the other DBMS as well. DBMS just keep counting the number of changes that had been done to the data. That way it’s possible to say: “Ok, I restore a database backup where the last change was change number 1000 and now, by using the redo-information, I roll-forward to change number 1200”.

Obviously, this change counting approach can only work within one system.
Other systems (databases, web-services, real world…) all have their own change counting in place (for the real world this would be the real time) – all of them completely separated from the other.

One way out

One way to come around this is to implement the change counting on a lower level of the IT-structure of your company: at the storage level.

If you keep all data ever stored in your company on a single storage facility (SAN, NAS, … ) than this storage-“thing” can do the change-counting to keep track of all changes made to the whole data-“world” of your company.
With a setup like this, it is of course possible to perform a PITR of your whole data-“world”.
Anyhow, even if now all data in your company is consistent again: the real world is still not set back in time. Your supplier still would deliver the order that has been sent a second time.

So, what can we do about it?

The easy answer is: Do not do point-in-time recoveries!
Do not take them as a valid option to get your data back quick and easy!

In most cases where a PITR has been done, the major effort had to be invested into making the database consistent again with other systems and with the real world afterwards.

Even the often taken approach to restore “just” one table from a database backup in a different instance up to a certain point in time, then copy the table to the target database needs huge effort to result in a consistent database again. For SAP systems, it is yet worse as referential constraints are not implemented at database level but only by the SAP work processes.
Therefore after such a partly PITR of the database one even has to run check reports (if such ones exist for the table in question) to be sure the database is consistent again.

In my view, PITR is not a feature for production databases or databases that are connected to any other system or that have any real relevance to the real world. So which one could that be?

Test systems, Training-Systems and  Q/A-Systems are the only systems that could bear a PITR.

All other systems, especially production- and development-systems should never be exposed to a PITR.
Obvious to see – these are also the systems where nobody should ever be able to manually drop a table or delete all data of a table.

But we need this feature…

Now – what should you do if a user accidentally deletes important data?
What if the user does the wrong change to the wrong data entry (e.g. to invoicing the wrong customer)?

The answer is “It depends on what you would do in the real world.” If you send an invoice to the wrong customer, you usually would try to cancel it in your system and try to prevent it from delivery.
If the wrong invoice is delivered, one would send a mail to say “Sorry, this was a mistake, please ignore the invoice. Thanks!”
Therefore, your application should give the same options to you.

If it is possible that a user makes a mistake that could be undone like any action in WORD – the application should implement UNDO functionality for that (for example keep versions of specific data).

The punch line is: complex UNDO-functions have to be implemented at the level where the changes happened – not on any low-level like the database.

So better forget or banish the PITR option from your DBA toolset.
It is much more dangerous than useful.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.