During the great California gold rush, early prospectors sifted through rocks and streams in the hopes of finding a fortune. But bigger hauls were often claimed by going below the surface and digging deep into mines, where a treasure trove of gold lay waiting to be unearthed. The gold rush may be ancient history, but there's still potential treasure to be tapped (and crises to be prevented) by companies willing to delve into their dark data.
Dark data is information, collected as a function of an organization’s normal operations, that is rarely or never analyzed or used to make intelligent business decisions. Instead, it gets buried within a vast and unorganized collection of other data assets. It’s often referred to as “data exhaust,” but there can be a lot of value in this overlooked info. And the portions that aren’t of value can be a significant drain on resources, including wasted digital storage space.
Amazingly, 52% of all stored data across the globe is dark, per Veritas’ “The Databerg Report: See What Others Don’t. Identify the Value, Risk and Cost of Your Data,” published in February. The report indicates that, by the year 2020, dark data will needlessly cost organizations worldwide a cumulative $5.2 trillion to manage if left unchecked.
Mohammad Nayeem Teli, assistant professor in the department of computer science at Harrisburg University of Science and Technology, says there are many reasons why a lot of data turns dark and is never used. “Not being able to keep pace with the amount of data you’re generating, and lacking the resources to analyze it can cause it,” says Teli. “And we have a lot of data coming at us from all directions, but cannot utilize it because of a lack of experts.”
Brad Anderson, VP of Big Data informatics for Liaison Technologies, agrees. “There is a decided lack of tools, vendors, and open source projects with solutions that can truly tackle the difficult problem of integrating, managing, and analyzing dark data,” says Anderson.
Companies that overcome their dark data malaise often outpace their competition in top-line revenue, growth, and efficiency. "And previously unused data may be able to inform decisions and alter operating procedures," Anderson adds. Case in point: Say you're a digital publisher who relies on external users/contributors to provide and post content. If your platform can capture clickstream and scrollstream data, or support A/B testing, the dark data created can offer feedback about how your audience consumes your contributors' work.
“Does most of your audience scroll to a certain paragraph then click away, not finishing the piece? Are there fewer clicks on certain types of titles? Analyzing the dark data generated can answer these questions and make a content platform stickier for contributors,” says Anderson.
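The scrollstream analysis Anderson describes can be sketched in a few lines of Python. Everything here is illustrative: the event tuples and article names stand in for whatever scroll-depth telemetry a real publishing platform would capture.

```python
from collections import defaultdict

# Hypothetical scrollstream events captured by the platform's front end:
# (article_id, fraction of the piece the reader scrolled through).
events = [
    ("how-to-guide", 0.95), ("how-to-guide", 0.80), ("how-to-guide", 1.00),
    ("listicle", 0.20), ("listicle", 0.35), ("listicle", 0.15),
]

depths = defaultdict(list)
for article, frac in events:
    depths[article].append(frac)

# Flag pieces that most readers abandon early (average depth under 50%).
for article, fracs in depths.items():
    avg = sum(fracs) / len(fracs)
    if avg < 0.5:
        print(f"{article}: readers abandon early (avg depth {avg:.0%})")
```

The same aggregation applied to click counts on titles would answer Anderson's other question about which kinds of headlines draw fewer clicks.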
Information that is not managed properly can also expose an enterprise to considerable vulnerabilities. "Distributed and duplicated content, without explicit oversight, weakens security. Hackers have more potential entry points, and leaked, lost, stolen, or breached dark data can result in damaged reputations as well as loss of competitive strength," says Greg Milliken, VP of marketing for M-Files. "In addition, the sheer volume of dark data impacts the costs for searching and producing appropriate information and imposes a wasted storage cost in operating budgets."
To better harness dark data, experts recommend a variety of strategies. The first step is to secure and encrypt all data to prevent it from being exploited. Second, prevent it from piling up and getting out of control, which requires analyzing it regularly.
“You need to gain visibility into your dark data and understand what you have, where it resides, who has access to it, and when it was last touched,” says David Moseley, global solutions marketing senior manager for Veritas. “Then, you can start to take actions like classifying legacy data to give it meaning going forward. You can assign owners to data, so they can review and decide if there is value. You can set up retention policies to automate how data is handled going forward. And you can delete dark data that’s considered redundant, obsolete, and trivial.”
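Moseley's first step, gaining visibility into where data resides and when it was last touched, can be approximated with a simple filesystem scan. This is a minimal sketch: the two-year staleness threshold and the example directory path are assumptions, not recommendations.

```python
import time
from pathlib import Path

STALE_AFTER_DAYS = 365 * 2  # assumed threshold; tune to your retention policy

def find_stale_files(root):
    """Yield (path, age_in_days) for files untouched past the threshold."""
    now = time.time()
    for path in Path(root).rglob("*"):
        if path.is_file():
            age_days = (now - path.stat().st_mtime) / 86400
            if age_days > STALE_AFTER_DAYS:
                yield path, round(age_days)

# Candidates for owner assignment, reclassification, or deletion:
# for path, age in find_stale_files("/srv/shared"):
#     print(f"{path} last modified {age} days ago")
```

A real file analysis tool would also record ownership and access permissions, which a `stat()`-based scan like this can be extended to read.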
Dark data is huge and unstructured, but machine learning offers tools and techniques to study this information and recognize patterns in it. “This data needs to be processed using smart algorithms, because it cannot be analyzed manually,” Teli says. “Frameworks like Hadoop provide a platform to break this data into chunks that can be managed and studied, and systems like IBM Watson provide useful insights into unstructured data.”
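Teli's point about breaking data into manageable chunks is the core idea behind frameworks like Hadoop. As a toy illustration (plain Python, not Hadoop), a MapReduce-style term count over a stream of log lines looks like this; the log lines are invented for the example.

```python
from collections import Counter
from itertools import islice

def chunks(lines, size):
    """Split a stream of lines into fixed-size chunks (the 'split' step)."""
    it = iter(lines)
    while batch := list(islice(it, size)):
        yield batch

def map_chunk(batch):
    """Map step: count terms within one chunk (each chunk is independent,
    so this step could run in parallel across many workers)."""
    counts = Counter()
    for line in batch:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge per-chunk counts into a global view."""
    total = Counter()
    for c in partials:
        total += c
    return total

log_lines = ["error disk full", "warning disk slow", "error network down"]
totals = reduce_counts(map_chunk(b) for b in chunks(log_lines, 2))
# totals["error"] == 2, totals["disk"] == 2
```

Because the map step never needs to see the whole dataset at once, the same pattern scales to data far too large to analyze manually, which is exactly the property Hadoop exploits.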
Additionally, consider employing metadata (the "data about data") to identify, link, curate, and cross-reference information in a way that unlocks its relevance and usefulness. This can involve applying metadata tags programmatically, for instance through an API. "Metadata is like a set of omni-directional headlights that navigate very specifically through dark data while illuminating associations and relationships between items and users all the way," adds Milliken.
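One way to picture Milliken's tagging approach is a small in-memory index that cross-references documents by shared tags. The class, document names, and tags below are all hypothetical, standing in for whatever a real metadata store would hold.

```python
from collections import defaultdict

class MetadataIndex:
    """Toy metadata store: maps each tag to the documents carrying it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, doc_id, *tags):
        for t in tags:
            self._by_tag[t].add(doc_id)

    def find(self, tag):
        """All documents carrying a given tag."""
        return sorted(self._by_tag[tag])

    def related(self, doc_id):
        """Documents sharing at least one tag with doc_id."""
        linked = set()
        for docs in self._by_tag.values():
            if doc_id in docs:
                linked |= docs
        linked.discard(doc_id)
        return sorted(linked)

idx = MetadataIndex()
idx.tag("contract-2014.pdf", "legal", "customer:acme")
idx.tag("invoice-0042.pdf", "finance", "customer:acme")
idx.find("customer:acme")        # both documents surface together
idx.related("contract-2014.pdf")  # -> ['invoice-0042.pdf']
```

The payoff is that a document filed away years ago resurfaces whenever anything sharing its tags is examined, instead of staying dark.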
A file analysis tool may further help to sift through dark data. “With better context, you can make more informed decisions and better answer the question of what to get rid of and what to archive to meet business and regulatory retention requirements,” says Moseley, who notes that some file analysis tools are actually integrated with enterprise archiving solutions, some of which offer automated classification to simplify management tasks.
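The automated classification Moseley mentions often reduces to rule-based retention logic. A deliberately simplified sketch, with made-up categories and retention windows that a real policy would replace:

```python
def classify(category, age_days):
    """Return a retention action for one item of classified legacy data.
    Categories and windows are illustrative, not regulatory advice."""
    if category == "legal":
        return "archive"  # assumed regulatory hold: never delete
    if category == "finance":
        # assumed 7-year retention window before archiving
        return "archive" if age_days > 7 * 365 else "keep"
    if age_days > 2 * 365:
        return "delete"   # stale and unclassified: redundant/obsolete/trivial
    return "keep"

classify("finance", 8 * 365)  # -> "archive"
classify("memo", 100)         # -> "keep"
```

Encoding retention as explicit rules like these is what lets an archiving solution apply them automatically across millions of files.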
Despite the challenges involved in illuminating and managing dark data, its importance to your operations can be enormous and worth the effort, Teli believes. “Dark data could indicate consumer buying trends that would help better understand customer preferences. It could be used to create solutions that are tailor-made for a certain segment or used by service providers to improve service,” says Teli. “It could also contain sensitive and personal data of your customers and organization that, if it ends up in the wrong hands, can be really dangerous.”