11 Data Warehouse

Architecture, in the data warehousing world, is the concept and design of the databases and technologies used to load the data. A good architecture enables scalability, high performance, and easy maintenance.

Data warehouse architecture consists of the following interconnected layers:

 

Operational database layer

 

The operational database layer serves as the source for the data warehouse. It may include the Operational Data Store (ODS) and other similar sources (e.g., flat files).

 

Data access layer

 

The data access layer is the part that extracts data from multiple sources, cleanses and transforms it, and loads it into the warehouse.

 

Metadata layer

 

The metadata layer is the data directory, which is usually more detailed than an operational system's data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

 

Informational access layer

 

The informational access layer comprises the data accessed for reporting and analysis and the tools used to report on and analyze that data; business intelligence tools fall into this layer.

 

Famous authors and data warehouse experts Ralph Kimball and Bill Inmon give two different design methodologies for building a data warehouse.

 

Kimball's approach is a bottom-up design: data marts are created first for specific subject or business areas and provide reporting and analysis capability for those areas. These data marts are then combined to create a data warehouse. This approach gets the data ready more quickly for individual subjects or business areas; the major task in this design is maintaining consistent (conformed) dimensions across multiple data marts.

 

Inmon defines a data warehouse as a centralized repository for the entire enterprise, designed using a normalized enterprise data model. Data is stored in the data warehouse at the lowest level of detail. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse.

 

Inmon states that the data warehouse is subject-oriented, integrated, time-variant, and non-volatile. This methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes, and generating new dimensional data marts from the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost of implementing a data warehouse this way is significant, and the time from the start of the project to the point where end users see initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.

 

There are now newer methodologies, commonly called hybrid designs, that combine these two approaches and provide a more comprehensive and robust data warehouse design.

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

 

  • OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second. An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).
  • OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema). A short sketch contrasting the two query styles follows this list.
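
To make the contrast concrete, the sketch below uses Python's built-in sqlite3 module and a hypothetical orders table (all names and values are illustrative, not taken from any particular system) to issue a few typical short OLTP transactions followed by a typical OLAP-style aggregation query.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical operational table holding detailed, current data.
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    product TEXT,
    quantity INTEGER,
    amount REAL,
    order_date TEXT)""")

# OLTP style: many short transactions -- single-row INSERTs and UPDATEs,
# with effectiveness measured in transactions per second.
cur.execute("INSERT INTO orders VALUES (1, 101, 'Widget', 2, 19.98, '2024-01-05')")
cur.execute("INSERT INTO orders VALUES (2, 102, 'Gadget', 1, 49.99, '2024-01-06')")
cur.execute("UPDATE orders SET quantity = 3 WHERE order_id = 1")
conn.commit()

# OLAP style: relatively few, complex queries that aggregate historical data,
# with effectiveness measured by response time.
cur.execute("""SELECT product,
                      strftime('%Y-%m', order_date) AS month,
                      SUM(amount) AS total_sales,
                      COUNT(*)    AS order_count
               FROM orders
               GROUP BY product, month""")
for row in cur.fetchall():
    print(row)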

The following table summarizes the major differences between OLTP and OLAP system design.

                          OLTP system                        OLAP system
  Data content            Detailed, current data             Aggregated, historical data
  Typical operations      Short INSERT/UPDATE/DELETE         Complex queries with aggregations
  Schema design           Entity model (usually 3NF)         Multi-dimensional (usually star schema)
  Effectiveness measure   Transactions per second            Query response time

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting instead of processing transactions. The source data for OLAP comes from Online Transactional Processing (OLTP) databases and is commonly stored in data warehouses. OLAP data is derived from this historical data and aggregated into structures that permit sophisticated analysis. OLAP data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis. This organization makes it easy for a PivotTable report or PivotChart report to display high-level summaries, such as sales totals across an entire country or region, and also display the details for sites where sales are particularly strong or weak.

 

OLAP databases are designed to speed up the retrieval of data. Because the OLAP server, rather than Microsoft Office Excel, computes the summarized values, less data needs to be sent to Excel when you create or change a report. This approach enables you to work with much larger amounts of source data than you could if the data were organized in a traditional database, where Excel retrieves all of the individual records and then calculates the summarized values.

 

OLAP databases contain two basic types of data: measures, which are numeric data, the quantities and averages that you use to make informed business decisions, and dimensions, which are the categories that you use to organize these measures. OLAP databases help organize data by many levels of detail, using the same categories that you are familiar with to analyze the data.
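
As a rough illustration of measures and dimensions, the sketch below uses the pandas library and a small made-up sales dataset (all names and figures are hypothetical) to summarize one measure, the sale amount, along two dimensions, first at a high level and then at a finer level of detail.

import pandas as pd

# Hypothetical detail-level data: each row is one sale.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "city":    ["Oslo",  "Bergen", "Lyon",  "Lyon",  "Nice"],
    "product": ["Widget", "Widget", "Gadget", "Widget", "Gadget"],
    "amount":  [120.0, 80.0, 200.0, 60.0, 90.0],   # the measure
})

# High-level summary: total sales per region (one dimension).
print(sales.pivot_table(values="amount", index="region", aggfunc="sum"))

# Drill down: total sales per region and city, broken out by product --
# the same measure organized by more detailed dimension levels.
print(sales.pivot_table(values="amount", index=["region", "city"],
                        columns="product", aggfunc="sum", fill_value=0))

In an OLAP environment this kind of summarization is performed by the OLAP server on the cube itself, so only the aggregated results travel to the client tool.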

 

 

The following sections describe each component in more detail:

 

Cube A data structure that aggregates the measures by the levels and hierarchies of each of the dimensions that you want to analyze. Cubes combine several dimensions, such as time, geography, and product lines, with summarized data, such as sales or inventory figures. Cubes are not “cubes” in the strictly mathematical sense because they do not necessarily have equal sides. However, they are an apt metaphor for a complex concept.

 

Measure A set of values in a cube that are based on a column in the cube’s fact table and that are usually numeric values. Measures are the central values in the cube that are preprocessed, aggregated, and analyzed. Common examples include sales, profits, revenues, and costs.

 

Member An item in a hierarchy representing one or more occurrences of data. A member can be either unique or non-unique. For example, 2007 and 2008 represent unique members in the year level of a time dimension, whereas January represents a non-unique member in the month level, because there can be more than one January in the time dimension if it contains data for more than one year.

 

Calculated member A member of a dimension whose value is calculated at run time by using an expression. Calculated member values may be derived from other members’ values. For example, a calculated member, Profit, can be determined by subtracting the value of the member, Costs, from the value of the member, Sales.
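
A minimal sketch of that example in plain Python, with invented member values:

# Hypothetical base member values read from a cube.
members = {"Sales": 125000.0, "Costs": 98000.0}

# A calculated member is evaluated at run time from other members' values.
members["Profit"] = members["Sales"] - members["Costs"]

print(members["Profit"])   # 27000.0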

 

Dimension A set of one or more organized hierarchies of levels in a cube that a user understands and uses as the base for data analysis. For example, a geography dimension might include levels for Country/Region, State/Province, and City. Or, a time dimension might include a hierarchy with levels for year, quarter, month, and day. In a PivotTable report or PivotChart report, each hierarchy becomes a set of fields that you can expand and collapse to reveal lower or higher levels.

 

Hierarchy A logical tree structure that organizes the members of a dimension such that each member has one parent member and zero or more child members. A child is a member in the next lower level in a hierarchy that is directly related to the current member. For example, in a Time hierarchy containing the levels Quarter, Month, and Day, January is a child of Qtr1. A parent is a member in the next higher level in a hierarchy that is directly related to the current member. The parent value is usually a consolidation of the values of all of its children. For example, in a Time hierarchy that contains the levels Quarter, Month, and Day, Qtr1 is the parent of January.

 

Level Within a hierarchy, data can be organized into lower and higher levels of detail, such as Year, Quarter, Month, and Day levels in a Time hierarchy.
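
The sketch below models a small Time hierarchy (Year > Quarter > Month) as a nested Python dictionary and consolidates child values up to their parents, in line with the Hierarchy and Level descriptions above; the figures are invented for illustration.

# Hypothetical Time hierarchy with a measure recorded at the Month level.
time_hierarchy = {
    "2008": {
        "Qtr1": {"January": 100, "February": 120, "March": 90},
        "Qtr2": {"April": 80, "May": 110, "June": 130},
    }
}

def rollup(node):
    # A parent's value is usually a consolidation of the values of its children.
    if isinstance(node, dict):
        return sum(rollup(child) for child in node.values())
    return node

for year, quarters in time_hierarchy.items():
    for quarter, months in quarters.items():
        print(year, quarter, rollup(months))   # Quarter level
    print(year, rollup(quarters))              # Year level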

 

Star Schema

 

 

The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.


Fact Tables

 

A fact table typically has two types of columns: foreign keys to dimension tables, and measures, the columns that contain numeric facts. A fact table can contain facts at the detail level or at an aggregated level.

 

Dimension Tables

 

A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value; they are normally descriptive, textual values. Dimension tables are generally much smaller than the fact table.

 

Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
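
A minimal sketch of such a star schema, using Python's built-in sqlite3 module with hypothetical table and column names: the fact table holds foreign keys and numeric measures, each de-normalized dimension table holds descriptive attributes, and the query at the end is a typical star join that aggregates the measures by dimension attributes.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# De-normalized dimension tables: descriptive, textual attributes.
cur.execute("""CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY, day INTEGER, month TEXT, quarter TEXT, year INTEGER)""")
cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT)""")
cur.execute("""CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY, city TEXT, region TEXT)""")

# Fact table: foreign keys to the dimensions plus numeric measures.
cur.execute("""CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL,
    PRIMARY KEY (time_id, product_id, store_id))""")

# A typical star-join query: aggregate the measures by dimension attributes.
query = """SELECT t.year, p.category, s.region, SUM(f.amount) AS total_sales
           FROM fact_sales f
           JOIN dim_time    t ON f.time_id    = t.time_id
           JOIN dim_product p ON f.product_id = p.product_id
           JOIN dim_store   s ON f.store_id   = s.store_id
           GROUP BY t.year, p.category, s.region"""
print(cur.execute(query).fetchall())   # empty until the tables are loaded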

 

The main characteristics of the star schema:

  • Simple structure -> easy-to-understand schema
  • Great query effectiveness -> small number of tables to join
  • Relatively long time to load data into dimension tables -> de-normalization and data redundancy mean the tables can become large
  • The most commonly used schema in data warehouse implementations -> widely supported by a large number of business intelligence tools

Snowflake Schema

 

 

A snowflake schema is a logical arrangement of tables in a relational database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables whereas the star schema’s dimensions are denormalized with each dimension being represented by a single table. When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables (“forks in the road”), a complex snowflake shape starts to emerge. The “snowflaking” effect only affects the dimension tables and not the fact tables.
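
As a rough contrast with the star version, the sketch below (same sqlite3 setup, hypothetical names) normalizes a product dimension into related product and category tables, which is the essence of snowflaking; the fact table is unchanged, but queries by category now need one extra join.

import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Snowflaked (normalized) product dimension: the category attribute is
# moved out of dim_product into its own related table.
cur.execute("""CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT)""")
cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id))""")
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id), amount REAL)""")

# Aggregating by category now requires an extra join through dim_product.
print(cur.execute("""SELECT c.category_name, SUM(f.amount)
                     FROM fact_sales f
                     JOIN dim_product  p ON f.product_id  = p.product_id
                     JOIN dim_category c ON p.category_id = c.category_id
                     GROUP BY c.category_name""").fetchall())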

 

Fact Constellation Schema

 

 

For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at a different level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.
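
A minimal sketch of the idea, again with sqlite3 and hypothetical names: two fact tables, sales and inventory, sharing the same time and product dimension tables.

import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Shared dimension tables.
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT)")

# Two fact tables sharing the dimensions above: a fact constellation.
cur.execute("""CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL)""")
cur.execute("""CREATE TABLE fact_inventory (
    time_id       INTEGER REFERENCES dim_time(time_id),
    product_id    INTEGER REFERENCES dim_product(product_id),
    units_on_hand INTEGER)""")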

 

The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.

 

Granularity

 

The single most important design issue of the data warehouse is granularity. Granularity refers to the level of detail or summarization of the units of data in the data warehouse: the more detail there is, the lower the level of granularity; the less detail there is, the higher the level of granularity. Granularity is a major design issue because it profoundly affects the volume of data in the warehouse.

 

The figure below shows the issue of granularity in a data warehouse.

 


Dual levels of Granularity:

 

Sometimes there is a great need for efficiency in storing and accessing data together with the ability to analyze the data in great detail. When an organization has huge volumes of data, it makes sense to consider two or more levels of granularity in the detailed portion of the data warehouse. The figure below shows two levels of granularity in a data warehouse, using a phone company example that fits the needs of most shops. There is a huge amount of data at the operational level; data up to 30 days old is stored in the operational environment, after which it shifts to the lightly and highly summarized zones.
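
The sketch below imitates this dual-level design in plain Python with made-up call records: detail for roughly the last 30 days is kept as-is, while older calls are rolled up into a lightly summarized form (calls and minutes per customer per month). All dates and figures are invented.

from collections import defaultdict
from datetime import date, timedelta

# Hypothetical call-detail records: (customer_id, call_date, minutes).
calls = [
    (101, date(2024, 6, 20), 12),
    (101, date(2024, 6, 25), 4),
    (102, date(2024, 4, 3), 30),
    (102, date(2024, 4, 18), 7),
]

cutoff = date(2024, 7, 1) - timedelta(days=30)

# Detail level: keep every record from roughly the last 30 days as-is.
recent_detail = [c for c in calls if c[1] >= cutoff]

# Lightly summarized level: older data rolled up per customer per month.
summary = defaultdict(lambda: {"calls": 0, "minutes": 0})
for customer, call_date, minutes in calls:
    if call_date < cutoff:
        key = (customer, call_date.strftime("%Y-%m"))
        summary[key]["calls"] += 1
        summary[key]["minutes"] += minutes

print(len(recent_detail), "detail rows kept,", len(summary), "summary rows")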

 

This handling of granularity helps not only the data warehouse and the data marts it feeds; it also supports the process of exploration and data mining. Exploration and data mining take masses of detailed historical data and examine them to uncover previously unknown patterns of business activity.

 

A multidimensional database is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored, viewed, and analyzed from different perspectives. These perspectives are called dimensions.

 

Getting answers to typical business questions from raw data often requires viewing that data from various perspectives. For example, an automobile marketer wanting to improve business activity might want to examine sales data collected throughout the organization.

 

The evaluation would entail viewing historical sales volume figures from multiple perspectives such as:

 

•  Sales volumes by model

•  Sales volumes by color

•  Sales volumes by dealership

•  Sales volumes over time

 

Analyzing the Sales Volumes data from any one or more of the above perspectives can yield answers to important questions such as:

 

What is the trend in sales volumes over a period of time for a specific model and color across a specific group of dealerships?
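
A small sketch of answering that question with the pandas library and a hypothetical sales-volume table; the model, color, and dealership filters and the grouping by period correspond to the perspectives listed above.

import pandas as pd

# Hypothetical sales-volume data with the four perspectives (dimensions).
sales = pd.DataFrame({
    "model":      ["Sedan", "Sedan", "Sedan", "Coupe", "Sedan"],
    "color":      ["Blue",  "Blue",  "Red",   "Blue",  "Blue"],
    "dealership": ["D1",    "D2",    "D1",    "D3",    "D1"],
    "quarter":    ["Q1",    "Q1",    "Q2",    "Q2",    "Q2"],
    "volume":     [10, 7, 5, 3, 12],
})

# Trend over time for a specific model and color across a group of dealerships.
subset = sales[(sales["model"] == "Sedan") &
               (sales["color"] == "Blue") &
               (sales["dealership"].isin(["D1", "D2"]))]
print(subset.groupby("quarter")["volume"].sum())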

 

Having the ability to respond to these types of inquiries in a timely fashion allows managers to formulate effective strategies, identify trends, and improve their overall ability to make important business decisions. Certainly, relational databases could answer the question above, but the query results must also come to the manager in a meaningful and timely way. End users needing interactive access to large volumes of data stored in a relational environment are often frustrated by the poor response times and lack of flexibility offered by relational database technology and its SQL query-building tools. What follows now is an explanation of the reason for their frustration and a set of examples that assist in comparing the multidimensional database to its most common alternative, the relational database.

Granularity is the most important issue for the data warehouse architect because it affects all the environments that depend on the data warehouse for data. The main issue of granularity is getting it at the right level: the level of granularity needs to be neither too high nor too low.

 

 

Raw Estimates

 

The starting point for determining the appropriate level of granularity is a rough estimate of the number of rows that will be in the data warehouse. If there are very few rows, then any level of granularity will be fine. After these projections are made, the index data space projections are calculated: we identify the length of the key or data element and determine whether the key will exist for each and every entry in the primary table.
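
A back-of-the-envelope sketch of such a projection in plain Python; every figure below is an invented assumption, not a measured value.

# Assumed inputs for the rough estimate (all figures are hypothetical).
rows_per_day     = 2_000_000   # detail rows loaded per day
years_of_history = 5
bytes_per_row    = 200         # average row length in the fact table
key_length_bytes = 12          # length of the key used by the index
key_exists_for_every_row = True

total_rows  = rows_per_day * 365 * years_of_history
table_bytes = total_rows * bytes_per_row
index_bytes = total_rows * key_length_bytes if key_exists_for_every_row else 0

print(f"projected rows:        {total_rows:,}")
print(f"projected table space: {table_bytes / 1e9:,.1f} GB")
print(f"projected index space: {index_bytes / 1e9:,.1f} GB")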

 

Data in the data warehouse grows at a rate never seen before; the combination of historical data and detailed data produces a phenomenal growth rate. It was only with data warehousing that the terms terabyte and petabyte came into common use. As data keeps growing, some of it becomes inactive; such data is sometimes called dormant data, and it is better to move dormant data to external storage media.

 

Data stored externally is much less expensive to keep than data that resides on disk storage. However, because the data is external, it can be more difficult to retrieve, and this causes performance issues that in turn affect the choice of granularity. It is usually the rough estimates that tell whether overflow storage should be considered.

 

Levels of Granularity

 

After this simple analysis is done, the next step is to determine the level of granularity for the data residing on disk storage. Determining the level of granularity requires a degree of common sense and intuition. A very low level of granularity does not make sense, because many resources are needed to analyze and process the data; a very high level of granularity means that analysis has to be done against the data residing in external storage. This is a tricky issue, so the only way to handle it is to put the data in front of the users and let them decide what the data should look like. The figure below shows the iterative loop that needs to be followed.

 

 

The process that needs to be followed is:

  • Building a small subset quickly based on user feedback
  • Prototyping
  • Looking at what other people have done
  • Working with experienced users
  • Looking at what the organization has now
  • Having sessions with simulated output