The Future Problem with Enterprise Data Warehouses

A friend of mine with years of analytics and management experience at large companies wrote to me recently. He put his finger squarely on a real issue with enterprise data warehouses: they’re built to solve today’s problems (not tomorrow’s), and they’re designed for analysts who are more statisticians than experts on the business.

I wanted to provide some comments on the enterprise-wide data warehouse and the challenges it presents at large corporations. Jim Novo certainly seems to support the roll-up approach (I’m on his mailing list), but I agree with Juice that, the way most large companies build them, it is too slow, too costly, and results in restricted analytics. Most of the large data warehouses I’ve worked with only include data variables that are key to managing the business TODAY, because the warehouses are too big and costly to store variables that are rarely used. They also attempt to cleanse the data by classifying it. This makes life easier for an analyst with statistical experience but limited knowledge of the business. However, you’re losing information. Problem: you do not know what will be important in the future. Distributed databases at a line-of-business or product level tend to store more of the raw data needed to manage and analyze the business. Sure, the amount of space used would be the same if you simply put the data into the warehouse, but that is not the way decisions are made. Decision makers look at how often the data variables are used (TODAY) and the cost to include them, then decide which get deleted or excluded. Also, the analysts who are disconnected from the business lines do not understand the raw data.

Let me give you a real-world example. Our data warehouse classifies claims into a limited number of claim reason categories. This might be considered a poor design, but please see above about analysts who are disconnected from the business lines. When a new type of claim is developing, the person classifying the claims (the claims rep) does not have a category to select, so they just pick whichever pre-defined category fits best. Information is lost because of the restrictions on the categories allowed within the data warehouse. If the notes from the claims system (an unforeseen variable) had been stored in the warehouse and text mining analytics had been run on them, a specific word would have been found associated with claims at an alarming rate. Identifying this word would have allowed for early recognition of the issue. The missing data cost us a lot of money in claims, but no one had thought to include the notes because of the size and cost. Well, we have them now.
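To make the text mining idea concrete, here is a minimal sketch of how a word showing up "at an alarming rate" could be spotted once the raw claim notes are kept. Everything in it is an assumption for illustration: the function names, the thresholds, and the sample notes are invented (the letter does not say what the word was or how the analysis would be run), and a real system would need stop-word handling, stemming, and far more data.

```python
import re
from collections import Counter


def doc_rates(notes):
    """Return the fraction of notes that contain each word.

    A word is counted at most once per note, so one long note
    cannot dominate the rate.
    """
    counts = Counter()
    for note in notes:
        counts.update(set(re.findall(r"[a-z']+", note.lower())))
    total = len(notes)
    return {word: n / total for word, n in counts.items()} if total else {}


def emerging_terms(baseline_notes, recent_notes,
                   min_rate=0.5, min_lift=3.0, floor=0.01):
    """Flag words that are common in recent notes but were rare before.

    min_rate, min_lift and floor are illustrative thresholds, not
    recommendations. floor avoids dividing by zero for words that
    never appeared in the baseline period.
    """
    base = doc_rates(baseline_notes)
    recent = doc_rates(recent_notes)
    flagged = []
    for word, rate in recent.items():
        lift = rate / max(base.get(word, 0.0), floor)
        if rate >= min_rate and lift >= min_lift:
            flagged.append((word, rate, lift))
    # Largest jump relative to the baseline first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical claim notes, invented for this example.
baseline_notes = [
    "customer reported water damage in kitchen",
    "vehicle collision on highway, rear bumper",
    "roof damage after storm",
    "stolen laptop reported by customer",
]
recent_notes = [
    "customer reported mold behind drywall",
    "mold found in basement after leak",
    "vehicle collision in parking lot",
    "tenant complains of mold smell",
]

flagged = emerging_terms(baseline_notes, recent_notes)
for word, rate, lift in flagged:
    print(f"{word}: in {rate:.0%} of recent notes, lift {lift:.0f}x")
```

The point of the sketch is that none of this is possible if the warehouse only keeps the pre-defined claim reason category: the signal lives in the free-text notes, and the comparison needs the older notes as a baseline.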