30 Temporal Database
Dr R. Baskaran
TEMPORAL DATABASES: AN INTRODUCTION
Historical data first manifested itself as the backups and log files we kept and hoped to ignore. We hoped to ignore those datasets because if we had to use them, it meant that something had gone wrong, and we had to recover a state of the database prior to when that happened. Later, as data storage and access technology made it possible to manage massively larger volumes of data than ever before, we brought much of that historical data on-line and organized it in two different ways. On the one hand, backups were stacked on top of one another and turned into data warehouses. On the other hand, log files were supplemented with foreign keys and turned into data marts.
We don't mean to say that this is how the IT or computer science communities thought of the development and evolution of warehouses and marts, as it was happening over the last two decades. Nor is it how they think of warehouses and marts today. Rather, this is more like what philosophers call a rational reconstruction of what happened. It seems to us that, in fact, warehouses are the form that backup files took when brought on-line and assembled into a single database instance, and data marts are the form that transaction logs took when brought on-line and assembled into their own database instances. The former is history as a series of states that things go through as they change over time. The latter is history as a series of those changes themselves.
But warehouses and data marts are macro structures. They are structures of temporal data at the level of databases and their instances. In this book, we are concerned with more micro-level structures, specifically structures at the level of tables and their instances. And at this level, temporal data is still a second-class citizen. To manage it, developers have to build temporal structures, and the code to manage them, by hand. In order to fully appreciate both the costs and the benefits of managing temporal data at this level, we need to see it in the context of methods of temporal data management as a whole.
Temporal Data Management:
Temporal data management is not a new development. From the earliest days of business data processing (as it was called back then), transactions were captured in a transaction log and the files and tables those transactions were applied to were periodically backed up. With those backups and log files, we could usually recreate what the data looked like at any point in time along either of the temporal dimensions we will be discussing. Indeed, together they contain all the “raw material” needed to support fully bi-temporal data management. What has changed about temporal data management, over the decades, is accessibility to temporal data. These days, it takes less effort to get to temporal data than it used to, and it takes less time as well. But significant additional progress is possible, and computer scientists are working on it.
In one sense, temporal data has been accorded only second-class status since the advent of computers and their use in managing business data. Neither database management systems (DBMSs) and the tables they manage, nor access methods and the files they manage, provide explicit mechanisms and structures to distinguish data about the past, present or future of the things we keep track of. Instead, unless developer-designed data structures and developer-written code are deployed, every object is represented by one and only one row in a table. If the row is there, the corresponding object is represented in our databases; otherwise it is not. If something about a represented object changes, the row is retrieved, updated and rewritten to reflect that change.
This focus on current data is reflected in a basic paradigm that has been used since we began managing data with computers. The paradigm is that of one data structure to represent a type of object or event, containing multiple other data structures, each representing an instance of an object or event of that type. Contained within the latter data structures are additional structures, each representing a property of the instance in question, or a relationship it has to another instance of the same type or (more usually) a different type.
This paradigm has manifested itself in such terminologies as (i) files, records, fields and pointers; (ii) tables, rows, columns and foreign keys; and (iii) classes, objects, slots and links. For the remainder of this book, we will use the table, row, column and foreign key terminology, although the concepts of uni-temporal and bi-temporal data management apply equally well to data managed by directly invoking access methods, to data managed with proprietary software, and to data managed in terms of classes, objects, slots and links.
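As a minimal sketch of this paradigm, the following Python fragment (using the standard sqlite3 module, with a hypothetical policy table and illustrative column names) shows the conventional non-temporal pattern: one row per object, updated in place, so that the prior state is simply overwritten.

```python
import sqlite3

# One row per object, updated in place. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (policy_id TEXT PRIMARY KEY, status TEXT, amount INTEGER)")

# The object is represented by exactly one row.
conn.execute("INSERT INTO policy VALUES ('P138', 'Pending', 500)")

# When something about the object changes, the row is retrieved, updated
# and rewritten in place; the prior state is lost.
conn.execute("UPDATE policy SET status = 'Active' WHERE policy_id = 'P138'")

# Only the current state survives; nothing in the table records that the
# policy was ever 'Pending'.
print(conn.execute("SELECT * FROM policy").fetchall())   # [('P138', 'Active', 500)]
```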
Temporal Extensions to SQL
By the early 90s, significant computer science research on bi-temporality had been completed. To the extent that word of these developments made its way into the business IT community, bi-temporality was understood as a distinction between logical time and physical time. Logical time, corresponding to what computer scientists called valid time, was generally referred to by IT professionals as effective time. It was understood to be that period of time, denoted by either a single date or by a pair of dates, during which the object represented by a row conformed to the description that row provided. The term “effective time” derives from the fact that for specific and non-overlapping periods of time, each of these rows is in effect as the representative of an object—as the authorized description of what the object is like during that specific period of time. As for physical time, it was understood to be a single date, the physical date on which the bi-temporal data is created. This view was, in fact, either a misunderstanding of what the computer scientists were saying, or else an independently developed understanding of two kinds of time that were relevant to data.
Either way, it fell short of full bi-temporality. For while it acknowledged that one kind of time is a period of time, it believed that the other kind of time is a point in time. With only one temporal extent represented, this was at best a quasi-bi-temporal model of data. This misunderstanding aside, the computer science work on bi-temporality resulted in a proposal for bi-temporal extensions to the SQL language. The extensions were formulated originally as TSQL, later superseded by TSQL2. This proposal was submitted to the SQL Standards Committee in 1994 by Dr. Rick Snodgrass, but to this date has still not been ratified. Nonetheless, there is much that can be done to support bi-temporal functionality using today’s technology, and much to be gained from doing so.
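To make the distinction concrete, here is a minimal sketch of full bi-temporality, using hypothetical column names (eff_from, eff_to, asserted_from, asserted_to) that are not taken from any product or standard. Each row carries two periods rather than a period and a single creation date, so we can ask not only what the object was like at some effective time, but also what we believed about it at some earlier time.

```python
from datetime import date

# Hypothetical bi-temporal version rows for one policy.
rows = [
    # What we said on Jan 5: the policy is 'Basic' from Jan 1 onward.
    dict(policy="P138", state="Basic",
         eff_from=date(2024, 1, 1), eff_to=date(9999, 12, 31),
         asserted_from=date(2024, 1, 5), asserted_to=date(2024, 3, 10)),
    # On Mar 10 we corrected ourselves: it was actually 'Premium' from Jan 1.
    dict(policy="P138", state="Premium",
         eff_from=date(2024, 1, 1), eff_to=date(9999, 12, 31),
         asserted_from=date(2024, 3, 10), asserted_to=date(9999, 12, 31)),
]

def as_of(rows, effective, asserted):
    """What did we believe, as of `asserted`, the object was like at `effective`?"""
    return [r for r in rows
            if r["eff_from"] <= effective < r["eff_to"]
            and r["asserted_from"] <= asserted < r["asserted_to"]]

# The quasi-bi-temporal model can only ask the first question; full
# bi-temporality can also ask the second.
print(as_of(rows, date(2024, 2, 1), date(2024, 2, 1))[0]["state"])   # Basic
print(as_of(rows, date(2024, 2, 1), date(2024, 4, 1))[0]["state"])   # Premium
```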
What does it mean to say that temporal data has become more accessible? One thing it might mean is "more available than by using backups and logfiles". And the most salient feature of the advance from backups and logfiles to these other methods of managing historical data is that backups and logfiles require the intervention of IT Operations to restore desired data from off-line media, while history tables, warehouses and data marts do not. When IT Operations has to get involved, emails and phone calls fly back and forth. The Operations manager complains that his personnel are already overloaded with the work of keeping production systems running, and don't have time for these one-off requests, especially as those requests are being made more and more frequently.
What is going on is that the job of Operations, as its management sees it, is to run the IT production schedule and to complete that scheduled work on time. Anything else is extra. Anything else is outside what their annual reviews, salary increases and bonuses are based on. And so it is frequently necessary to bump the issue up a level, and for Directors or even VPs within IT to talk to one another. Finally, when Operations at last agrees to restore a backup and apply a logfile (and do the clean-up work afterwards, the manager is sure to point out), it is often a few days or a few weeks after the business use for that data led to the request being made in the first place. Soon enough, data consumers learn what a headache it is to get access to backed-up historical data. They learn how long it takes to get the data, and so learn to do a quick mental calculation to figure out whether or not the answer they need is likely to be available quickly enough to check out a hunch about next year’s optimum product mix before production schedules are finalized, or support a position they took in a meeting which someone else has challenged. They learn, in short, to do without a lot of the data they need, to not even bother asking for it.
But instead of the comparative objective of making temporal data "more available" than it is, given some other way of managing it, let's formulate the absolute objective for availability of temporal data. It is, simply, for temporal data to be as quickly and easily accessible as it needs to be. We will call this the requirement to have seamless, real-time access to what we once believed, currently believe, or may come to believe is true about what things of interest to us were like, are like, or may come to be like in the future. This requirement has two parts. First, it means access to noncurrent states of persistent objects that is just as available to the data consumer as access to current states. The temporal data must be available on-line, just as current data is. Transactions to maintain temporal data must be as easy to write as are transactions to maintain current data. Queries to retrieve temporal data, or a combination of temporal and current data, must be as easy to write as are queries to retrieve current data only. This is the usability aspect of seamless access.
Second, it means that queries which return temporal data, or a mix of temporal and current data, must return equivalently sized results in an equivalent amount of elapsed time. This is the performance aspect of seamless access.
A taxonomy is a special kind of hierarchy. It is a hierarchy which is a partitioning of the instances of its highest-level node into different kinds, types or classes of things. While an historical approach tells us how things came to be, and how they evolved over time, a taxonomic approach tells us what kinds of things we have come up with, and what their similarities and differences are. In both cases, i.e. in the previous chapter and in this one, the purpose is to provide the background for our later discussions of temporal data management and, in particular, of how Asserted Versioning supports non-temporal, uni-temporal and bi-temporal data by means of physical bi-temporal tables.
Partitioned Semantic Trees
As borrowed by computer scientists, the term “taxonomy” refers to a partitioned semantic tree. A tree structure is a hierarchy, which is a set of non-looping (acyclic) one-to-many relationships. In each relationship, the item on the “one” side is called the parent item in the relationship, and the one or more items on the “many” side are called the child items. The items that are related are often called nodes of the hierarchy. Continuing the arboreal metaphor, a tree consists of one root node (usually shown at the top of the structure, and not, as the metaphor would lead one to expect, at the bottom), zero or more branch nodes, and zero or more leaf nodes on each branch.
Tree structure:
Each taxonomy is a hierarchy. Therefore, except for the root node, every node has exactly one parent node. Except for the leaf nodes, unless the hierarchy consists of the root node only, every node has at least one child node. Each node except the root node has as ancestors all the nodes from its direct parent node, in linear ascent from child to parent, up to and including the root node. No node can be a parent to any of its ancestors.
Partitioned:
The set of child nodes under a given parent node are jointly exhaustive and mutually exclusive. Being jointly exhaustive means that every instance of a parent node is also an instance of one of its child nodes. Being mutually exclusive means that no instance of a parent node is an instance of more than one of its child nodes. A corollary is that every instance of the root node is also an instance of one and only one leaf node.
Semantic:
The relationships between nodes are often called links. The links between nodes, and between instances and nodes, are based on the meaning of those nodes. Conventionally, node-to-node relationships are called KIND-OF links, because each child node can be said to be a kind of its parent node.
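The following is a minimal sketch of such a partitioned semantic tree in Python. The node names (an insurance-policy taxonomy) are hypothetical; the point is the parent/child KIND-OF links, the single-parent rule, and the ascent from any node to the root.

```python
# A hypothetical taxonomy expressed as a dict of parent -> child nodes.
taxonomy = {
    "policy": ["group policy", "individual policy"],      # root and its children
    "group policy": [],                                    # a leaf node
    "individual policy": ["HMO policy", "PPO policy"],     # a branch node
    "HMO policy": [],
    "PPO policy": [],
}

def parent_of(node):
    """Except for the root node, every node has exactly one parent node."""
    parents = [p for p, children in taxonomy.items() if node in children]
    assert len(parents) <= 1, f"{node} would violate the tree structure"
    return parents[0] if parents else None

def ancestors(node):
    """All nodes in linear ascent from the node's parent up to the root."""
    parent = parent_of(node)
    return [] if parent is None else [parent] + ancestors(parent)

print(ancestors("PPO policy"))   # ['individual policy', 'policy']
```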
Methods for Managing Temporal Data:
In terms of granularity, temporal data can be managed at the level of databases, or tables within databases, or rows within tables, or even columns within rows. And at each of these levels, we could be managing non-temporal, uni-temporal or bi-temporal data. Of course, with two organizing principles— four levels of granularity, and the non/uni/bi distinction—the result would be a matrix rather than a hierarchy.
The Root Node of the Taxonomy
The root node of a taxonomy defines the scope and limits of what the taxonomy is about. Our root node says that our taxonomy is about methods for managing temporal data. Temporal data is data about, not just how things are right now, but also about how things used to be and how things will become or might become, and also about what we said things were like and when we said it. The two nodes which partition temporal data management are reconstructable data and queryable data. Reconstructable data is the node under which we classify all methods of managing temporal data that require manipulation of the data before it can be queried. Queryable data is the opposite.
But the defining feature of reconstructable methods is not the movement of data from off-line to on-line storage. The defining feature is the inability of users to access the data until it has been manipulated and transformed in some way. For this reason, among all these temporal data management methods, reconstructable temporal data takes the longest to get to, and has the highest cost of access. Besides the time and effort involved in preparing the data for querying—either through direct queries or via various tools which generate queries from graphical or other user directives—many queries or reports against reconstructed data are modified from production queries or reports. Production queries or reports point to production databases and production tables; and so before they are used to access reconstructed data, they must be rewritten to point to that reconstructed data. This rewrite of production queries and reports may involve changing database names, and sometimes table names and even column names. Sometimes, a query that accessed a single table in the production database must be modified to join, or even to union, multiple tables when pointed at reconstructed data.
Queryable Temporal Data
Queryable temporal data, in contrast, is data which can be directly queried, without the need to first transform that data in some way. In fact, the principal reason for the success of data warehousing is that it transformed reconstructable historical data into queryable historical data. Queryable data is obviously less costly to access than reconstructable data, in terms of several different kinds of costs. The most obvious one, as indicated previously, is the cost of the man-hours of time on the part of IT Operations personnel, and perhaps software developers and DBAs as well. Another cost is the opportunity cost of waiting for the data, and the decisions delayed until the data becomes available. In an increasingly fast-paced business world, the opportunity cost of delays in accessing data is increasingly significant.
But in our experience, which combines several decades of work in business IT, the greatest cost is the cost of the business community learning to do without the data they need. In many cases, it simply never crosses their minds to ask for temporal data that isn't already directly queryable. The core of the problem is that satisfying these requests is not the part of the work of computer operators, DBAs and developers that they get evaluated on. If performance reviews, raises, bonuses and promotions depend on meeting other criteria, then it is those other criteria that will be met. Doing a favor for a business user you like, which is what satisfying this kind of one-off request often amounts to, takes a decided second place. To paraphrase Samuel Johnson, "The imminent prospect of being passed over for a promotion wonderfully focuses the mind".
Queryable Temporal Data: Events and States
Having distinguished queryable data from reconstructable data, we move on to a partitioning of the former. We think that the most important distinction among methods of managing queryable data is the distinction between data about things and data about events. Things are what exist; events are what happen. Things are what change; events are the occasions on which they change.
The issue here is change, and the best way to keep track of it. One way is to keep a history of things, of the states that objects take on. As an object changes from one state to the next, we preserve the before-image of its current state and apply the update to a copy of that state, not to the original. The updated copy represents the new current state.
Another way to keep track of change is to record the initial state of an object and then keep a history of the events in which the object changed. For example, with insurance policies, we could keep an event-based history of changes to policies by adding a row to the Policy table each time a new policy is created, and after that maintaining a transaction table in which each transaction is an update or delete to the policy. The relevance of transactions to event-based temporal data management is this: transactions are the records of events, the footprints which events leave on the sands of time.
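As a minimal sketch of the two ways of tracking change just described, the following Python fragment uses hypothetical field names for one insurance policy. The state-based approach keeps one row per state the policy has passed through; the event-based approach keeps the initial state plus a transaction table, from which the current state can be reconstructed by replay.

```python
# (1) State-based history: one row per state the policy has passed through.
state_history = [
    {"policy_id": "P138", "status": "Pending", "valid_from": "2024-01-01", "valid_to": "2024-02-01"},
    {"policy_id": "P138", "status": "Active",  "valid_from": "2024-02-01", "valid_to": "9999-12-31"},
]

# (2) Event-based history: the initial state plus a transaction table
# recording each change as it happened.
initial_state = {"policy_id": "P138", "status": "Pending", "created_on": "2024-01-01"}
transactions = [
    {"policy_id": "P138", "event": "update", "set": {"status": "Active"}, "on": "2024-02-01"},
]

def replay(initial, events):
    """Reconstruct the current state of the object from its event history."""
    state = dict(initial)
    for e in events:
        if e["event"] == "update":
            state.update(e["set"])
    return state

print(replay(initial_state, transactions)["status"])   # Active
```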
Event Temporal Data
Methods for managing event data are most appropriately used to manage changes to metric values of relationships among persistent objects, values such as counts, quantities and amounts. Persistent objects are the things that change, things like policies, clients, organizational structures, and so on. As persistent objects, they have three important features:
(i) they exist over time;
(ii) they can change over time; and
(iii) each is distinguishable from other objects of the same type.
In addition, they should be recognizable as the same object when we encounter them at different times (although sometimes the quality of our data is not good enough to guarantee this). Events are the occasions on which changes happen to persistent objects. As events, they have two important features:
(i) they occur at a point in time, or sometimes last for a limited period of time; and
(ii) in either case, they do not change. An event happens, and then it’s over. Once it’s over, that’s it; it is frozen in time.
For example, the receipt of a shipment of product alters the on hand balance of that product at a store. The completion of an MBA degree alters the level of education of an employee. The assignment of an employee to the position of store manager alters the relationship between the employee and the company. Of course, the transactions which record these events may have been written up incorrectly. In that case, adjustments to the data must be made. But those adjustments do not reflect changes in the original events; they just correct mistakes made in recording them.
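A minimal sketch of this point, using a hypothetical receipt-of-shipment event and illustrative names: the record of an event never changes, so a recording mistake is corrected by appending an adjusting transaction rather than by rewriting the original event.

```python
from dataclasses import dataclass

@dataclass(frozen=True)            # frozen: the record of an event cannot change
class ReceiptEvent:
    product: str
    quantity: int
    received_on: str

event_log = [ReceiptEvent("widget", 100, "2024-03-01")]

# Suppose the shipment was really 90 units. We do not rewrite the original
# event; we append an adjustment that corrects the recording mistake.
event_log.append(ReceiptEvent("widget", -10, "2024-03-02"))

on_hand = sum(e.quantity for e in event_log)
print(on_hand)   # 90
```

Modeling the correction as a negative-quantity adjustment is only one illustrative convention; the essential point is that the log is append-only.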
A Special Relationship: Balance Tables
The event transactions that most businesses are interested in are those that affect relationships that have quantitative measures. A payment is received. This is an event, and a transaction records it. It alters the relationship between the payer and payee by the amount of the payment. That relationship is recorded, for example, in a revolving credit balance, or perhaps in a traditional accounts receivable balance. The payment is recorded as a credit, and the balance due is decreased by that amount. These records are called balance records because they reflect the changing state of the relationship between the two parties, as if purchases and payments are added to opposite trays of an old-fashioned scale which then tilts back and forth. Each change is triggered by an event and recorded as a transaction, and the net effect of all the transactions, applied to a beginning balance, gives the current balance of the relationship. But it isn't just the current balance that is valuable information. The transactions themselves are important because they tell us how the current balance got to be what it is. They tell us about the events that account for the balance. In doing so, they support the ability to drill down into the foundations of those balances, to understand how the current state of the relationship came about.
They also support the ability to re-create the balance as of any point in time between the starting balance and the current balance by going back to the starting balance and applying transactions, in chronological sequence, up to the desired point. We no longer need to go back to archives and logfiles, and write one-off code to get to the point in time we are interested in—as we once needed to do quite frequently. Conceptually, starting balances, and the collections of transactions against them, are like single-table backups and their logfiles, respectively, brought on-line. Organized into the structures discovered by Dr. Ralph Kimball, they are fact/dimension data marts. Of course, balances aren’t the only kind of relationship. For example, a Customer to Salesperson cross-reference table—an associative table, in relational terms—represents a relationship between customers and salespersons. This table, among other things, tells us which salespersons a business has assigned to which customers. This table is updated with transactions, but those transactions themselves are not important enough to keep on-line. If we want to keep track of changes to this kind of relationship, we will likely choose to keep a chronological history of states, not of events. A history table of that associative relationship is one way we might keep that chronological history of states.
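A minimal sketch of that re-creation, with a hypothetical starting balance and transaction amounts: applying the transactions, in chronological sequence, up to the desired point in time yields the balance as of that point.

```python
from datetime import date

starting_balance = 1000           # balance at the start of the period
transactions = [                  # (date, amount): purchases positive, payments negative
    (date(2024, 1, 10),  250),
    (date(2024, 1, 25), -400),
    (date(2024, 2, 15),  125),
]

def balance_as_of(start, txns, as_of):
    """Apply, in chronological order, every transaction up to the desired point."""
    balance = start
    for txn_date, amount in sorted(txns):
        if txn_date <= as_of:
            balance += amount
    return balance

print(balance_as_of(starting_balance, transactions, date(2024, 1, 31)))   # 850
print(balance_as_of(starting_balance, transactions, date(2024, 2, 28)))   # 975
```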
To summarize: businesses are all about ongoing relationships. Those relationships are affected by events, which are recorded as transactions. Financial account tables are balance tables; each account number uniquely identifies a particular relationship, and the metrical properties of that account tell us the current state of the relationship. The standard implementation of event time, as we mentioned earlier, is the data mart and the fact/dimension, star or snowflake structures that it uses.
State Temporal Data
Event data, as we have seen, is not the best way of tracking changes to non-metric relationships. It is also not ideal for managing changes to non-metric properties of persistent objects, such as customer names or bill of material hierarchies. Who ever heard of a data mart with customers or bill of material hierarchies as the fact tables? For such relationships and such objects, state-based history is the preferred option. One reason is that, for persistent objects, we are usually more interested in what state they are in at a given point in time than in what changes they have undergone. If we want to know about changes to the status of an insurance policy, for example, we can always reconstruct a history of those changes from the series of states of the policy. With balances, and their rapidly changing metrics, on the other hand, we generally are at least as interested in how they changed over time as we are in what state they are currently in. So we conclude that, except for keeping track of metric properties of relationships, the best queryable method of managing temporal data about persistent objects is to keep track of the succession of states through which the objects pass. When managing time using state data, what we record are not transactions, but rather the results of transactions, the rows resulting from inserts and (logical) deletes, and the rows representing both a before-and an after-image of every update. State data describes those things that can have states, which means those things that can change over time.
An event, like a withdrawal from a bank account, as we have already pointed out, can’t change. Events don’t do that. But the customer who owns that account can change. The branch the account is with can change. Balances can also change over time, but as we have just pointed out, it is usually more efficient to keep track of balances by means of periodic snapshots of beginning balances, and then an accumulation of all the transactions from that point forward. But from a logical point of view, event data and state data are interchangeable. No temporal information is preserved with one method that cannot be preserved with the other. We have these two methods simply because an event data approach is preferable for describing metric-based relationships, while a state data approach is better at tracking changes to persistent objects and to relationships other than metric balances.
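The following sketch, with hypothetical column names, illustrates both points: state data as a succession of rows, each carrying the period during which it was the current description of the object, and the logical interchangeability of the two methods, since the change events can always be recovered from consecutive states by taking their differences.

```python
customer_states = [
    {"cust_id": "C42", "name": "Acme Ltd",  "branch": "Downtown",
     "valid_from": "2023-01-01", "valid_to": "2023-07-01"},
    {"cust_id": "C42", "name": "Acme Corp", "branch": "Downtown",
     "valid_from": "2023-07-01", "valid_to": "2024-02-01"},
    {"cust_id": "C42", "name": "Acme Corp", "branch": "Riverside",
     "valid_from": "2024-02-01", "valid_to": "9999-12-31"},
]

def state_as_of(states, as_of):
    """The state the object was in at a given point in time."""
    return next(s for s in states if s["valid_from"] <= as_of < s["valid_to"])

print(state_as_of(customer_states, "2023-09-15")["name"])     # Acme Corp

def derive_events(states):
    """Recover the change events from consecutive states by differencing them."""
    for before, after in zip(states, states[1:]):
        changed = {k: after[k] for k in after
                   if k not in ("valid_from", "valid_to") and after[k] != before[k]}
        yield {"on": after["valid_from"], "changes": changed}

print(list(derive_events(customer_states)))
# [{'on': '2023-07-01', 'changes': {'name': 'Acme Corp'}},
#  {'on': '2024-02-01', 'changes': {'branch': 'Riverside'}}]
```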