This article contrasts data warehouses to operational (or transaction processing) applications. It then goes on to explore effective techniques for building data warehouses. It concludes with a discussion of how a data warehouse allows an architect to design better operational systems.
The major topics are as follows:
Data Warehouse Introduction
The goal of a data warehouse is to integrate timely, accurate information and to make it available to an organization's employees and decision makers. The source of this data is the organization's various operational systems. The population of a data warehouse is shown in Figure 1.
Figure 1: Populating a Data Warehouse
Data is extracted from important operational systems on an ongoing basis. The data is cleaned up and made consistent (e.g. assigned standard customer id, product id, given common units of measure, etc.). The standardized data is then loaded into the data warehouse. In addition to the detailed operational data, summarized data is often produced to allow for easier querying and analysis.
Once data is assembled in the data warehouse, it is available for users to access and analyze. Various tools help the user access the data. Data access tools range from simple ad hoc reporting tools to more sophisticated data mining, Decision Support Systems (DSS) and Online Analytical Processing (OLAP) products.
With easy access to consolidated current and historical company data, ad hoc queries can be created, trends can be analyzed, data can be correlated and anomalies can be identified. The results of this data analysis should help knowledge workers improve the quality of their decision-making.
Contrasting an Operational System with a Data Warehouse
A Data Warehouse differs from an Operational System in many ways.
Table 1 contrasts the nature of data in an operational system compared to data in a warehouse.
Table 1: Nature of Data
|Operational System||Data Warehouse|
|Data currency||Current||Somewhat out-of-date (data might be updated in near real time, daily, weekly, etc.)|
|Data life||Data is short lived and generally leaves the system soon after its associated transactions have been completed or processed (e.g. after an invoice is fully paid).||Data is long lived. Historical data is usually retained for extended periods to allow for long-term analysis.|
|Application breadth||Data is generally associated with a single application, although selected data from other applications may be available.||Goal is to provide organization-wide view of data that goes across applications.|
Table 2 illustrates how data access differs in an operational system as compared to a data warehouse.
Table 2: Data Access
|Operational System||Data Warehouse|
|Reads||Reads generally involve smaller sets of data. Also, the types of reads are generally predictable (i.e. searches use standard search criteria).||Reads often involve large amounts of data as ad hoc queries are issued or data analysis is performed. The nature of data access is frequently unpredictable since users retrieve data in new ways to answer their business questions.|
|Updates||Inserts and updates are mostly user-initiated and must be completed within tight timeframes. Data updates are often highly overlapping so locking and transaction integrity are key to maintaining data integrity.||Users don't update data. Inserts and updates are performed in batch/background mode by special loading programs. As a result, update performance is of little consequence to users (as long as the data arrives). Updates are generally sequential and non-conflicting in nature (since overlaps can be controlled with batching).|
|Read / update trade-offs||A balance between read and update performance must be sought. Improved read performance (through use of indexes and denormalization) has to be traded off with update performance.||Read performance is paramount. Indexes can be used almost without limit. Data summarization and even denormalization has a much lower penalty since update performance is much less visible.|
Table 3 illustrates how the requirements and success factors are different for an operational system compared to a data warehouse.
Table 3: Business Perspective
|Operational System||Data Warehouse|
|Users||The system is used by operational staff who are concerned with day-to-day business operations.||The warehouse is used by knowledge workers or management who want to analyze data to help improve the running and management of business over the longer term.|
|Requirements||Requirements are generally defined ahead of time to support well defined business operations.||Requirements are often unknown to begin with. Requirements emerge as the data warehouse is used to support decision making.|
|Key Success Factors||The application must provide efficient performance which includes fast real time updates, high availability and data integrity.||The application must provide access to a wide array of standardized and well integrated data. The architecture must be flexible so that data can be adjusted and new tools used.|
Building a Data Warehouse
Few organizations undertake the building of a data warehouse all at once. The effort of building a complete, consolidated data warehouse would require an enormous effort and would take too long. Most organizations take one of two approaches:
A data mart is like a "warehouse" that applies to a single application. A data mart can be created essentially by copying an operational database while retaining historical information. When the operational data is easy to access (e.g. a well-designed relational database), a data mart can be created with relatively little effort.
The problem with the standalone data mart approach is that it fails to integrate data between applications and across the organization. For example, without a standardized customer identifier, it would be very difficult for a user to identify those customers with high overdue balances (i.e. Accounting data) that place most of their orders by phone (i.e. Sales data). In addition, inconsistent units of measure (e.g. hours, days, weeks) could make it almost impossible to compare data that comes from different data marts.
Creating data marts that can grow into an integrated data warehouse requires up front architectural effort. An architecture must be put into place so that there are clear guidelines for data mart builders. The following elements should be included in this architecture:
Keys must be standardized across the data marts (i.e. integrated warehouse) to allow data from different areas to be associated. All primary keys should be generated inside the data warehouse itself, with relevant keys from operational systems being defined as alternate keys. Internal data warehouse primary keys are not dependent on any operational system so changes to those operational systems do not impact the core structure of the data warehouse.
It is important that data from feeder operational systems be standardized before it is loaded into the data warehouse. For example:
When the data is standardized, it becomes feasible for data warehouse users to extract data from multiple sources and to compare and correlate that data.
Terminology needs to be standardized throughout the warehouse. For example, "cust_class" should mean the same thing in the sales data mart and the accounting data mart. Standardizing the terminology might mean changing the definition in some areas or it might require a refinement of the definition to avoid ambiguity (e.g. "sales_cust_class" instead of "cust_class").
Consolidated Core Information
Core information about primary entities in the data warehouse should be consolidated in one place. For example, core customer information like customer name, industry, status and total sales should be gathered together from sales, marketing and accounting systems so this core information is easily accessible to data warehouse users. Detailed information on these entities (e.g. A/R contact, preferred billing method, etc.) would be left in the individual application area or data mart.
Using a Data Warehouse to Build Better Operational Systems
For reasons discussed through this article, the availability of a well designed data warehouse provides considerable value to an organization. Another important benefit of a data warehouse is that it allows the IT organization to build better operational systems.
In the past, there were only operational applications so it was necessary for each application to keep historical data as long as it might be needed. Also, operational applications had to support many complex query and reporting requirements. Supporting these requirements meant keeping a large database that had a large number of read-related optimizations (e.g. many indexes, denormalization, etc.). The large database and many read optimizations made it difficult to to provide good update performance.
To help improve update performance, various deferred update strategies had to be employed (e.g. batch and background jobs). These deferred updates made the design of the applications more complex and more difficult to maintain.
With the availability of data warehouses, the architect of an operational application should be asking the following questions:
It is important for the architect to verify that the users of the target operational application have effective tools to access the data warehouse as well as proper training. When this is the case, users will be much more open to the idea of offloading requirements from the operational application to the data warehouse.
In summary, when complex read requirements are satisfied by the data warehouse, the architect will be able to design a better operational system.
... back to Architecture Topics