The Future Problem with Enterprise Data Warehouses

A friend of mine with years of analytics and management experience at big companies wrote recently. He puts his finger squarely on a real issue with enterprise data warehouses.

"I wanted to provide some comments on the enterprise-wide data warehouse and the challenges it presents at large corporations. Jim Novo certainly seems to support the roll-up approach (I’m on his mailing list), but I agree with Juice that it is too slow, too costly, and results in restricted analytics the way most large companies build them. Most of the large data warehouses I’ve seen only include data variables that are key to managing a business TODAY, as the warehouses are too big and costly to store data variables with a low usage frequency. They also attempt to cleanse the data by classifying it. This makes life easier for an analyst with statistical experience but limited knowledge of the business. However, you’re losing information. Problem: You do not know what will be important in the future. Distributed databases at a line-of-business or product level tend to store more raw data. Sure, the amount of space used would be the same if you simply put that data into the warehouse, but that is not the way decisions are made. Decision makers look at the frequency of use of the data variables (TODAY) and the cost to include them. Also, the analysts who are disconnected from the business lines do not understand the raw data."

"Let me give you a real-world example. Our data classifies claims into a limited number of claim reason categories. When a new type of claim is developing, the person classifying the claims (claims rep) does not have a category to select, so they just select whatever fits best among the pre-defined categories. Information is lost due to the restrictions of the allowed categories within the data warehouse. If the notations from the claims system (an unforeseen variable) had been stored in the warehouse and text mining analytics had been run on them, the word "mold" would have been found associated with claims at an alarming rate. This would have allowed for early recognition of the issue. It cost us a lot of money in mold claims due to the missing data, but who would have thought to include the notes, given the size and costs? Well, we have them now."
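
To make the text mining idea concrete, here is a minimal sketch in Python of the kind of check that could have surfaced "mold" early. It is an illustration only, not what my friend's company actually runs: the sample claims, the quarter labels, the function names (`word_rates`, `emerging_words`), and the 50% threshold are all made up for this example. It looks for words that show up in a large share of recent claim notes but never appeared in an earlier baseline period, regardless of which pre-defined category the claims rep picked.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample data: (quarter, assigned category, free-text claim note).
CLAIMS = [
    ("Q1", "water damage", "Pipe burst under kitchen sink, floor soaked."),
    ("Q1", "water damage", "Roof leak after storm, ceiling stained."),
    ("Q1", "other", "Tree limb fell on fence."),
    ("Q2", "water damage", "Slow leak behind wall, mold on drywall."),
    ("Q2", "other", "Tenant reports mold smell in basement."),
    ("Q2", "water damage", "Mold found under carpet after dishwasher leak."),
    ("Q2", "other", "Hail dented the garage door."),
]


def word_rates(claims):
    """Return {period: {word: share of that period's notes containing the word}}."""
    totals = Counter()
    hits = defaultdict(Counter)
    for period, _category, note in claims:
        totals[period] += 1
        # Count each word once per note, so the rate is a share of claims.
        for word in set(re.findall(r"[a-z]+", note.lower())):
            hits[period][word] += 1
    return {p: {w: n / totals[p] for w, n in hits[p].items()} for p in totals}


def emerging_words(claims, baseline, recent, min_rate=0.5):
    """Words in at least `min_rate` of `recent` notes that are absent from `baseline` notes."""
    rates = word_rates(claims)
    return sorted(
        word
        for word, rate in rates[recent].items()
        if rate >= min_rate and word not in rates[baseline]
    )


print(emerging_words(CLAIMS, "Q1", "Q2"))  # prints ['mold']
```

Notice that the category column never enters the calculation. The mold claims are spread across "water damage" and "other", which is exactly why a warehouse that kept only the categories would miss them. A real system would need far more than this (stop words, stemming, a longer baseline, proper statistics), but none of it is possible if the notes were never stored.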