31 Temporal Databases
Dr R. Baskaran
TEMPORAL DATABASES
Over the last three decades, the computer science community has done extensive work on temporal data, and especially on bi-temporal data. During that same period of time, the IT community has developed various forms of versioning, all of which are methods of managing one of the two kinds of uni-temporal data. Asserted Versioning may be thought of as a method of managing both uni- and bi-temporal data which, unlike the standard model of temporal data management, recognizes that rows in bi-temporal tables represent versions of things and that, consequently, these rows do not stand alone as semantic objects. Versions that are versions of the same thing are related to one another by that very fact. Versions of the same thing that together represent an unbroken period of time in the life of that thing are even more closely related to one another. Groups of temporally contiguous versions of the same thing are semantic objects that must be managed as single temporal objects, even though they may consist of any number of physical rows. These single semantic objects are what we call episodes.
Asserted Versioning may also be thought of as a form of versioning, a technique for managing historical data that has evolved in the IT industry over the last quarter-century. But unlike existing best practice variations on that theme, Asserted Versioning supports the full semantics of versions, i.e. everything that it is conceptually possible to do with versions. In addition, Asserted Versioning also integrates the management of versions with the management of assertions and with the management of bi-temporal data. As we pointed out earlier, it directly manages bi-temporal physical tables, and manages both forms of uni-temporal tables, as well as conventional tables, as views over those bi-temporal tables.
Besides embracing contributions from computer science research and from business IT best practices, we believe that Asserted Versioning introduces three new concepts to the field of temporal data management. The first concept is that of an episode. In making episodes central to its management of temporal data, Asserted Versioning breaks with a basic component of the relational model, which is that integrity constraints apply only to individual rows. For example, referential integrity, in the relational model, requires that every foreign key in one row expresses a valid relationship which that row has to one other row. In Asserted Versioning, a temporalized foreign key expresses a relationship which one row, called a version, has to an episode, not to any single row which is part of that episode.
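To make this concrete, here is a minimal sketch of an asserted version table with a temporalized foreign key. All table, column and datatype choices below are hypothetical illustrations, not the actual Asserted Versioning schema:

CREATE TABLE Policy_AV (
  oid          INTEGER    NOT NULL,  -- identifies the policy these rows are versions of
  eff_beg_dt   DATE       NOT NULL,  -- effective time period, closed-open
  eff_end_dt   DATE       NOT NULL,
  asr_beg_dt   TIMESTAMP  NOT NULL,  -- assertion time period, closed-open
  asr_end_dt   TIMESTAMP  NOT NULL,
  epis_beg_dt  DATE       NOT NULL,  -- begin date of the episode this version belongs to
  client_oid   INTEGER    NOT NULL,  -- temporalized foreign key: designates an episode
                                     -- of a client, not any single row in a Client table
  policy_type  CHAR(3),
  PRIMARY KEY (oid, eff_beg_dt, asr_beg_dt)
);

Note that client_oid carries no DBMS-declared FOREIGN KEY constraint. Since the DBMS can only relate a row to one other row, the version-to-episode constraint is one that the Asserted Versioning Framework itself must enforce.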
The second concept is that of the internalization of pipeline datasets. We define a pipeline dataset as any collection of business data that is not a production table, but that contains data whose destination or origin is such a table. Pipeline datasets which contain data destined for production tables are inflow pipeline datasets. Pipeline datasets which contain data derived from production tables are outflow pipeline datasets. History tables are one example of a pipeline dataset. Sets of transactions, accumulated in files or tables and waiting to be applied to their target tables, are another example. While the use of versions eliminates history tables by internalizing them within the tables whose history they track, the use of deferred transactions and deferred assertions eliminates batch files of transactions waiting to be applied to a database by also internalizing them within their target tables. Here, we will show how the use of these internalized managed objects reduces the costs of maintaining databases by replacing external files or tables, such as history tables, transaction files and log files, with structures internalized within the production tables themselves.
The third concept is that of encapsulation, as it applies to the management of temporal data. Asserted Versioning fully encapsulates the complexities of maintaining temporal tables by distinguishing between temporal transactions and physical transactions. Temporal transactions are inserts, updates and deletes, as written by their authors, whose targets are asserted version tables. They are submitted to the Asserted Versioning Framework (AVF), not directly to the DBMS. The AVF translates them into physical insert and update transactions which it submits to the DBMS. These physical transactions implement the intentions expressed in those temporal transactions, and we note that, except for temporal insert transactions, one temporal transaction will always be translated into multiple physical transactions. The practical orientation of Asserted Versioning is manifest in its encapsulation of the complexities of temporal data structures and the processes that manage them. Asserted Versioning is an integrated method of managing temporal data which relieves data modelers of the burden of designing and maintaining data models that must explicitly define temporal data structures and integrity constraints on them. It also relieves developers of the burden of designing, writing and maintaining code that enforces the rules which provide the semantics of temporal data. And it relieves query authors, whether developers or end users, of the burden of writing complex queries that must explicitly check for temporal gaps or overlaps, along one or two temporal dimensions, among a set of rows accessed by the query.
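As one illustration of this encapsulation, consider a temporal delete. The transaction syntax, column names and values below are hypothetical; the point is only the pattern by which one temporal transaction becomes several physical ones. The author writes something like DELETE FROM Policy_AV WHERE oid = 55, and the AVF might submit, in one atomic unit of work:

UPDATE Policy_AV                           -- withdraw the currently asserted version;
   SET asr_end_dt = :now-utc               -- nothing is ever physically deleted
 WHERE oid = 55
   AND asr_end_dt = '9999-12-31-00.00.00';

INSERT INTO Policy_AV                      -- re-assert the version, with an effective
       (oid, eff_beg_dt, eff_end_dt,       -- period that now ends when the delete
        asr_beg_dt, asr_end_dt,            -- takes effect
        epis_beg_dt, client_oid, policy_type)
VALUES (55, '2015-02-19', CURRENT DATE,
        :now-utc, '9999-12-31-00.00.00',
        '2015-02-19', 22, 'PPO');

Here :now-utc stands for a single captured value of Now() in UTC, used both as the assertion end date of the withdrawn row and as the assertion begin date of its replacement, so that the two physical transactions form one seamless transition in assertion time.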
Computer Science Research
In 2000, Dr. Rick Snodgrass stated that academics had published, at that time, over 1500 articles on temporal data management [2000, Snodgrass, xviii]. And over the last decade, much additional work has been done. For our purposes, this work falls into two categories: (i) research relevant to our task of designing and building a method of temporal data management that works with today’s technology; and (ii) research on future directions for the implementation of temporal functions in commercial DBMSs, and for the specification of these functions in extensions to the current SQL standard.
Clocks and Clock Ticks
An atomic clock tick is the smallest interval of time recognized by the DBMS that can elapse between any two physical modifications to a database. We note that the standard computer science term for an atomic clock tick is a chronon. A clock tick is an interval of time defined on the basis of atomic clock ticks, and that is used in an Asserted Versioning database to delimit the two time periods of rows in asserted version tables, and also to indicate several important points in time. In asserted version tables, clock ticks are used for effective time begin and end dates and for episode begin dates; and atomic clock ticks are used for assertion time begin and end dates, and for row create dates.
Clock ticks may be defined at any level of granularity supported by the underlying DBMS. A timestamp is a typical clock tick; a date represents a daily clock tick; a 1-month clock tick represents a situation in which a database is updated from a collection of transactions at most once a month. When time periods measured in clock ticks of different granularities are compared, the technical issues involved in the comparison are complex, even if the conceptual issues are not. In our examples we use a monthly clock tick, and use the first day of each month to represent each tick. This means that changes to these asserted version tables happen on the first of each month and, by the same token, that the state of the database after each monthly update will remain unaltered for the rest of the month. But no matter what granularity we choose for our clock ticks, there is the issue of how clock ticks of one level of granularity are mapped onto those of another level.
For example, if we choose a clock tick of one day, when does that day start? Does June 1st, 2016 begin at 2016-06-01-00:00:00.000, or perhaps at 2016-06-01-00:00:00.001, or perhaps even at 2016-05-31-23:59:59.999?
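If we do not want to leave that mapping to the DBMS, we can pin it down ourselves. A minimal sketch in DB2 syntax, with hypothetical host-variable names, truncating a timestamp to the daily and then to the monthly clock tick it falls within:

-- daily clock tick: any moment on June 1st, 2016 maps to 2016-06-01
SET :daily-tick = DATE(:my-local-time-value);
-- monthly clock tick: any moment in June 2016 also maps to 2016-06-01
SET :monthly-tick = DATE(:my-local-time-value) - (DAY(:my-local-time-value) - 1) DAYS;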
The simplest solution is to let the DBMS determine the mapping for us; and in most cases this is perfectly adequate. But IT data management professionals should at least be aware that issues like these do exist. They should also be aware of another issue: how SQL timestamps map to when things happen in the real world. SQL uses Coordinated Universal Time (UTC), which is based on cesium clocks, which might lead us to conclude that SQL timestamps are extremely accurate. Precise they may be; but issues of accuracy involved in their use do exist.
For example, suppose we know that an astronomical event which has just happened will happen again in exactly 725 days, 5 hours and 23 seconds. If we know the SQL timestamp when the event occurred, can we calculate exactly what the SQL timestamp will be when the event happens again? We can, of course, calculate the number of seconds that will have elapsed between the two events, that number being 62,658,023. But it is a mistake to conclude that from knowing the timestamp of the first event and also knowing the number of seconds until the second event, we can calculate the timestamp of the second event. In fact, we cannot reliably do that. The reason is that from the moment the first timestamp is current to the moment that the second timestamp is current, one or more leap seconds may have been added to or subtracted from the count of cesium clock ticks between those two timestamps, and for the same reason that a leap day is added to the Gregorian calendar once every four years. But unlike the leap year adjustment, which is regular enough to count on, we do not know, in advance, how many leap seconds might be added or subtracted between now and the next occurrence of that astronomical event, or when those adjustments may occur.
The unpredictable adjustment of UTC and its SQL timestamps by means of leap seconds is seldom an issue in business IT. The reason is that for business purposes, we all tag our data with SQL timestamps, and we all regard those timestamps as an accurate record of when business events happen in the real world. In addition, for most business purposes, we assume that a SQL timestamp plus a defined interval of time will result in a second timestamp that represents when some second event will occur.
Time Periods and Date Pairs
SQL does not recognize a period of time as a managed object. Instead, we have to use a pair of dates. There are four ways we can use a pair of dates to do this. Either the beginning date, or the ending date, or both, may or may not be included in the time period they delimit. If a date is not included, the time period is said to be open on that end; otherwise it is said to be closed.
A discussion of all four ways of representing time periods with begin and end dates can be found in the book presenting the standard temporal model [2000, Snodgrass] and also in the book presenting the alternative temporal model [2002, Date, Darwen, Lorentzos]. Here, we discuss just two of the four combinations: closed-open and closed-closed. This is because the closed-open representation is actually the best way of representing time periods, whereas the closed-closed representation merely appears to be.
Consider two versions of the same object, V1 and V2. In both cases, in this example, the first time period for the object starts on 2/19/2015 and ends on 5/22/2015, and the second starts on 5/23/2015 and ends on 10/14/2016. Consequently there is no gap in time between these two time periods: 5/22/2015 and 5/23/2015 are contiguous clock ticks. If we wanted to argue in support of the closed-closed representation, we could try to show that the closed-open representation should not be used by querying for the time period that contains the date 5/23/2015. On the closed-open representation, it might seem that we can't tell to which version 5/23/2015 belongs. But in fact, we can. We just need the following WHERE clause predicate in our SQL query:
WHERE D1 <= '05/23/2015' AND '05/23/2015' < D2
With this predicate, the query will correctly pick out V2. So one reason we might have thought that the closed-closed representation is right is that its begin and end dates are the same dates we used to set up the example, when we said that one period begins and ends on one set of dates and the other period begins and ends on the other set of dates. Another reason we might have thought that the closed-closed representation is right is that we are looking for a pair of dates that a third date is between. Between, in one sense, is on display in the statement "Pick a number between 1 and 10". We all know that 1 and 10 are both numbers that we can pick. And the SQL BETWEEN operator corresponds to this sense of the word. So if we use the closed-closed representation, we can write:
WHERE '05/23/2015' BETWEEN D1 AND D2
In other words, when a closed-closed representation is used, we can rely on SQL’s BETWEEN to express what we ordinarily mean by “between”, which is what we might call the inclusive sense of “between”. But there is another sense of between, which is on display in the statement “The abandoned car is somewhere between mileposts 6 and 10, along I-65N, heading out of Pensacola”. We all know that to find the car, we should start at milepost 6 and continue up to milepost 10. In particular, we know that we don’t need to search past milepost 10, i.e. past the start of the tenth mile heading out of Pensacola. This is the sense of between used in the closed-open convention.
The closest English equivalent would be "from ... up to", in the sense of "from" and then "up to but not including". But since each SQL predicate returns the correct result, provided each is used with its corresponding method of representing periods of time, each method is correct. What, then, is the advantage of using the closed-open representation? Well, look again at the two ways of representing V1 and V2. In both cases, V1 and V2 are contiguous. We know this because we have set up the example that way. And with the closed-open representation, we can immediately see that there is no gap between V1 and V2. But with the closed-closed representation, we don't know whether there is a gap or not. To make that determination, we also need to know at what rate the clock ticks. Assuming for this particular example that the clock ticks once a day, then only when given this additional information can we determine, from a closed-closed representation, that V1 and V2 have no gap between them.
This difficulty of determining whether or not two consecutive versions of the same object are contiguous is not just a problem for the human beings trying to understand the data. It is also a problem for the code, which will often be called upon to determine, of two consecutive versions, whether or not they are contiguous, i.e. whether or not there is a clock tick between them. With a closed-open representation, the code only has to compare two values to see whether or not they are EQUAL. Making matters worse, the granularity of DBMS timestamp datatypes has changed over time. Originally, these timestamps were expressed in seconds. Later, they were expressed as milliseconds, then as microseconds, and now we are almost at the point where timestamps will be expressed in nanoseconds. With this sliding scale, across DBMSs and DBMS upgrades, we might not even know the granularity at which to try to determine whether two versions are or are not contiguous. With a closed-open representation, the code to make this determination is trivial. With a closed-closed representation, it may actually be impossible.
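In SQL terms, and reusing the hypothetical column names from earlier examples, the closed-open contiguity test is nothing more than an equality comparison between adjacent versions:

-- closed-open: V1 and V2 are contiguous if and only if V1's end date equals V2's begin date
SELECT v1.oid
  FROM Policy_AV v1, Policy_AV v2
 WHERE v2.oid = v1.oid
   AND v1.eff_end_dt = v2.eff_beg_dt;      -- trivial, and independent of granularity

-- closed-closed: the same test must add exactly one clock tick, so the code has to know
-- the tick duration; assuming a one-day tick, in DB2 date arithmetic:
--   AND v1.eff_end_dt + 1 DAY = v2.eff_beg_dt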
Once the SQL standards groups can agree on temporal extensions to the SQL standard, part of that agreement will certainly be a way of representing time periods directly, without relying on the confusing circumlocutions of various ways of using pairs of dates to represent time periods. But our concern, in this book, is with today's SQL, and so we must choose a date-pair method of representation. Therefore, from this point forward, we will use the closed-open representation for time periods.
9999 and 12/31/9999
We will use the notation 9999 in our illustrations, to represent the latest date that the DBMS can recognize and manipulate.
That latest date cannot be used in effective or assertion begin dates. If it is used as business data, then it has whatever meaning its users assign to it, which will probably be as the designation of a day which is still a long way off. But if it is used in effective or assertion end dates, it is treated as a date by the DBMS but is not interpreted as a date by the users of the system. Instead, to those users, it means “later than now”. Equivalently, we can say that it means “end of the time period not known, but assumed to not have happened yet”. We will generally use the phrase until further notice to represent this semantics—in effect until further notice, or asserted as true until further notice.
When we specify an insert or an update against a conventional table, we normally do not know when, if ever, the target row will next be updated or deleted. By the same token, when we specify a temporal insert or a temporal update against a bi-temporal table, we normally do not know when, if ever, the effective time period of the new version will end, because we don't know when, if ever, a subsequent update or deletion will occur. Some data modelers will argue that NULL should be used for unknown data. While that is logically correct, the performance of the physical model will benefit from using a non-null value instead. Moreover, we do know one thing about this unknown date: for as long as it is unknown, we choose to assume that it has not happened yet. In other words, for as long as it is unknown, we want the DBMS to tell us that it is greater than the value of Now(), whenever we ask that question.
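For example, the question "which versions are in effect right now?" becomes the following predicate, sketched with the hypothetical columns used earlier and a daily clock tick:

SELECT *
  FROM Policy_AV
 WHERE eff_beg_dt <= CURRENT DATE
   AND eff_end_dt >  CURRENT DATE;   -- 12/31/9999 is greater than any current date, so
                                     -- until-further-notice versions always qualify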
If we used NULL instead of a data value, the DBMS would not give us that answer to that question. If an effective end date is unknown, then, instead of using NULL, we will set it to 12/31/9999, or to whatever value represents the future-most date that the particular DBMS can manage. So the semantics is date unknown (but still in the future), but the implementation is a real date, one in this case that is nearly 8000 years from now. How does a date like that represent semantics like that?
First, we assume that the date 12/31/9999 will not be required to represent that far-off New Year’s Eve. So it is available to use for a special purpose. In the case of an effective end date, we often insert or update versions without knowing when the effective time period of the new version may end. So in this case, 12/31/9999 means that the end of the effective time period of the version thus marked is unknown, and that the time period will remain in effect until further notice. For example, when a customer’s address is added to the database, we usually will not know when that customer will move from that address.
So the status of that address is that it will remain in effect until that customer moves, at some unknown date in the future. In the case of an assertion end date, that date is always set to 12/31/9999 when a row is created. This reflects the assumption that we would never want to assert that a version is true, while also knowing that, at some later point in time, we will change our minds and no longer be willing to make that assertion. While we adopt this assumption in this book, and in the software which is our implementation of these concepts, we note that a more robust semantics might not include this assumption and that, consequently, later releases of our Asserted Versioning Framework may permit non-12/31/9999 assertion end dates to be specified on temporal transactions. An effective end date may be changed, i.e. the time period it ends may be shortened or lengthened, as long as the change does not violate temporal integrity constraints. As for an assertion end date, it may be changed for one of four reasons.
First, an assertion end date may be changed because we realize that the assertion is incorrect, and we have the correction for that error. In that case, the incorrect assertion will cease to be asserted, i.e. it will be withdrawn, as part of the same atomic unit of work that inserts the correcting assertion, and the same date will be used both for the assertion end date of the incorrect assertion and the assertion begin date of the correcting assertion.
Second, an assertion end date may be changed because we realize that an assertion is incorrect even though we do not know what the correct data is, or else just because, for some reason, we do not wish to make that assertion any longer. Third, we may conclude that an assertion about the future is no longer actionable, probably because both we and our competitors have more recent and presumably more accurate forecasts. In either case, the row representing that assertion will cease to be asserted on that date even if no correcting assertion is supplied to replace it. The last reason an assertion end date may be changed is to lock an assertion which has been updated or deleted by a deferred transaction, until the resulting deferred assertion becomes current.
Now() and UTC
Keeping our notation DBMS-agnostic, and keeping the clock tick granularity generic, we will refer to the current moment, to right now, as Now(). SQL Server may use getdate(), and DB2 may use CURRENT TIMESTAMP or CURRENT DATE. Depending on our clock tick duration, we might need to use a date formatting function to set the current granularity. In our examples, we generally use one month as our clock tick granularity. However, for our purposes, Now() can take on values at whatever level of granularity we choose to use, including day, second or microsecond.
Now() is usually assumed to represent the current moment using local time. But local time may change because of entering or leaving Daylight Saving Time. And another issue is time zone. At any one time, data about to update a database may originate in a different time zone than the database itself. Users about to retrieve data from a database may be in a different time zone than the database itself. And, of course, federated queries may attempt to join data from databases located in different time zones. So the data values returned by Now() can change for reasons other than the passage of time. Daylight Saving Time can change those values. At any one point in time, those values can differ because of time zones. Clearly, we need a reference framework, and a set of values, that will not change for any reason other than the passage of time, and that will be the same value, at any point in time, the world over and year round. This reference framework is Coordinated Universal Time (UTC). To make use of UTC, our Asserted Versioning Framework will convert local time to UTC on maintenance and queries, and will store Asserted Versioning temporal parameters, such as begin and end dates, in UTC. For example, with Policy_AV being an asserted version table of insurance policies, we would insert a policy like this:
INSERT INTO Policy_AV (oid, asr_beg_dt, ...)
VALUES (55, CURRENT TIMESTAMP - CURRENT TIMEZONE, ...)
Queries will perform better if we do the time conversion before using the value as a selection predicate in the SQL itself. This is because most optimizers treat functions that appear in predicates as non-indexable. For example, in DB2, we should write:
SET :my-cut = TIMESTAMP(:my-local-time-value) - CURRENT TIMEZONE

SELECT ... FROM ... WHERE oid = 55 AND asr_beg_dt <= :my-cut AND asr_end_dt > :my-cut

rather than

SELECT ... FROM ... WHERE oid = 55 AND asr_beg_dt <= TIMESTAMP(:my-local-time-value) - CURRENT TIMEZONE AND ...
However, if these functions are used for display purposes, then there is no reason to exclude them from the queries. For example:
SELECT asr_beg_dt + CURRENT TIMEZONE AS my_local_asr_beg_dt, ... FROM ...
It would also be useful to add alternate columns for the temporal dates in our views that have the translation to local time performed already.
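A hypothetical sketch of such a view, in DB2 syntax, assuming the assertion dates are stored as UTC timestamps:

CREATE VIEW Policy_AV_Local AS
SELECT oid,
       asr_beg_dt,                                         -- stored in UTC
       asr_beg_dt + CURRENT TIMEZONE AS local_asr_beg_dt,  -- translated for display
       asr_end_dt,
       asr_end_dt + CURRENT TIMEZONE AS local_asr_end_dt
  FROM Policy_AV;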
The Allen Relationships
Given two time periods on a common timeline, either they have at least one clock tick in common or they do not. If they do, we will say that they [intersect] one another. If they do not, we will say that they [exclude] one another. If there is an [intersects] relationship between two time periods, then either one [fills] the other or each [overlaps] the other. If one time period [fills] another, then all its clock ticks are also in the time period it [fills], but not necessarily vice versa. If one time period [overlaps] another, then the latter also overlaps the former; but, being the later of the two time periods, we say that the latter time period has the inverse relationship, [overlaps^-1].
In the overlaps cases, each has at least one clock tick that the other does not have, as well as having at least one clock tick that the other does have. If two time periods [exclude] one another, then they do not share any clock ticks, and they are either non-contiguous or contiguous. If there is at least one clock tick between them, they are non-contiguous and we say that one is [before] the other. Otherwise they are contiguous and we say that one [meets] the other.
If one time period [fills] the other, then either they are [equal], or one [occupies] the other. If they are [equal], then neither has a clock tick that the other does not have. If one [occupies] the other, then all the clock ticks in the occupying time period are also in the occupied time period, but not vice versa.
If one time period [occupies] the other, then either they share an [aligns] relationship, or one occurs [during] the other. If they are aligned, then they either start on the same clock tick or end on the same clock tick, and we say that one either [starts] or [finishes] the other. Otherwise, one occurs [during] the other, beginning after the other and ending before it. Note that if two time periods are aligned, one cannot both [start] and [finish] the other, because if it did, it would be [equal] to the other.
If one time period [starts] another, they both begin on the same clock tick. If one [finishes] the other, they both end on the same clock tick. If one time period [occupies] another, but they are not aligned, then one occurs [during] the other.
Now let’s consider the special case in which one of the two time periods is a point in time, i.e. is exactly one clock tick in length, and the other one contains two or more clock ticks. This point in time may either [intersect] or [exclude] the time period. If the point in time [intersects] the time period, it also [fills] and [occupies] that time period. If it [aligns] with the time period, then it either [starts] the time period or [finishes] it. Otherwise, the point in time occurs [during] the time period. If the point in time [excludes] the time period, then either may be [before] the other, or they may [meet].
Finally, let's consider one more special case, that in which both the time periods are points in time. Those two points in time may be [equal], or one may be [before] the other, or they may [meet]. There are no other Allen relationships possible for them. As we will see later, four of these Allen relationship categories are especially important; a sketch of how they reduce to SQL predicates follows the list below.
- (i) The [intersects] relationship is important because for a temporal insert transaction to be valid, its effective time period cannot intersect that of any episode for the same object which is already in the target table. By the same token, for a temporal update or delete transaction to be valid, the target table must already contain at least one episode for the same object whose effective time period does [intersect] the time period designated by the transaction.
- (ii) The [fills] relationship is important because violations of the temporal analog of referential integrity always involve the failure of a child time period to [fill] a parent time period. We will be frequently discussing this relationship from the parent side, and we would like to avoid having to say things like "... failure of a parent time period to be filled by a child time period". So we will use the term "includes" as a synonym for "is filled by", i.e. as a synonym for [fills^-1]. Now we can say "... failure of a parent time period to include a child time period".
- (iii) The [before] relationship is important because it distinguishes episodes from one another. Every episode of an object is non-contiguous with every other episode of the same object, and so for each pair of them, one of them must be [before] the other.
- (iv) The [meets] relationship is important because it groups versions for the same object into episodes. A series of versions for the same object that are all contiguous, i.e. that all [meet], fall within the same episode of that object.
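As promised above, here is a sketch of how these four relationships reduce to predicates over closed-open date pairs. The variables :eb and :ee are hypothetical host variables holding the effective begin and end dates of a transaction's time period, compared against a stored period (eff_beg_dt, eff_end_dt):

-- [intersects]: the two periods share at least one clock tick
--   eff_beg_dt < :ee AND :eb < eff_end_dt
-- [fills]: the transaction's period lies wholly within the stored period
--   eff_beg_dt <= :eb AND :ee <= eff_end_dt
-- [before]: the stored period ends at least one clock tick before the other begins
--   eff_end_dt < :eb
-- [meets]: contiguous, sharing no clock ticks
--   eff_end_dt = :eb OR :ee = eff_beg_dt

-- e.g. the validity check for a temporal insert: the transaction must be rejected if
-- any episode for the same object [intersects] its requested effective time period
SELECT COUNT(*)
  FROM Policy_AV
 WHERE oid = :oid
   AND eff_beg_dt < :ee
   AND :eb < eff_end_dt;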