Alligators in the Data Swamp
By David Scott
Ah, such a beautiful summer day! The water in the lake is so clear, people are skiing, world-record fishing is just around the corner, and there's not a cloud in the sky. This is the promise of the data lake - a clear view into business decisions, gliding through tremendous volumes of data to find key business insights with incredible clarity… And then there is reality: losing time mucking around in the swamp, frantically trying to find the phone you just dropped in the water - and there are alligators.
Working with huge volumes and varieties of data can be like that.
Data lakes were conceived and implemented to bring 'one-stop shopping' to businesses and to let them explore new questions across the multitude of data sources and formats that inevitably accompany complex business processes. When a corporate data lake is properly maintained and cataloged, it can bring tremendous value out of these deep waters. Data can be stored in multiple formats, which facilitates quicker acquisition and easier loading. Centralized access and security controls can make security management simpler and better controlled.
Large amounts of data are typical of a data lake; it is not unusual to store petabytes of information from various business processes and sources. Most importantly, this large and varied pool of data can improve the ability to investigate and answer new, complex questions, and provide new insight into the business. Alas, there can be serious pitfalls in data lake implementation - enough that some witty pundit once coined the term "data swamp" to describe the results! When full of dubious, unidentified, and polluted data, the data lake can turn ugly and brackish. And there are plenty of problem "alligators" that may inhabit the resulting quagmire of issues, eat up your time, and thwart your business needs.
One of the ugliest alligators is the "Store now, understand later" approach to building the data lake. This approach masquerades as flexibility, but can easily inflate cost and complexity. If you encounter phrases like "We don't have time to figure it out" or "We might need this later", the business value of the data may become unclear - like swamp water.
A closely related difficulty occurs when the path to business value is not well-defined, clear, or documented. Though technical solutions abound, there are few business-specific recipes for building value quickly; this often causes companies to "re-invent the wheel" or settle for suboptimal results. And from a budget standpoint, projects without a clearly defined return on investment are destined to be swallowed up by other priorities.
Data quality issues can take a long time to resolve and consume a lot of effort, especially if there are conflicting, incorrect, or missing data values. Traditional data warehouses often attack this alligator with additional quality processing before the data becomes available to business users and analysts.
Security is another alligator. In the rush to make data easily available, it is easy to inadvertently create security issues; if requirements such as PII protection, HIPAA, and SOC 2 are not considered during planning and implementation, the resulting compliance and security incidents can be difficult, time-consuming, and expensive to resolve.
A lack of clear data organization and documentation can keep users from understanding the relevant information accurately. This can cause business users to spend a lot of time fishing for insight and come up empty-handed. Or worse, users may make wrong decisions based on incomplete or incorrect data. Instead of that world-record fish, you may get an old shoe for your effort!
Technical problems can also take an alligator bite out of your progress, especially when manually configured query tools and infrastructure do not adjust to changes in the data lake. Resource unavailable? That's like a ski boat with a malfunctioning motor.
Finally, performance can suffer when large volumes of data overwhelm infrequently updated retrieval technologies. A query or process that once took only seconds may cloud your enthusiasm when it cannot perform in a timely manner. And who wants to wait around for that?
However, there are some ways to help steer clear of the dangers in the data swamp and avoid many of the alligators. Look out for part two of this article for these insights.