By: Ignacio Barros
A few days ago, I read an article in Data In Formation, a magazine specializing in technology topics, whose theme I found very interesting. It can be summarized as follows: when you are in a conversation about analytics and data science, do you ever think about aluminum? Chances are your answer is no. What does aluminum have to do with data, you might ask. That would be the response of the vast majority. Until now.
Aluminum has many applications. It comes in a wide range of sizes. It can be flexible or very rigid. It can be recycled indefinitely and turned into any number of new objects once the original has served its purpose. When it was first discovered, aluminum had little practical value. However, today it is an essential part of our lives, and its value has increased accordingly.
Does this description sound familiar? Replace “aluminum” with “data” in the previous paragraph and read it again. There is an even more important parallel. In its natural state (a material called bauxite), aluminum looks like any other rock you might come across and ignore. But hidden in that stone, which at first glance seems worthless, is something without which we wouldn’t have passenger airplanes, spacecraft, fuel-efficient cars, or hundreds of other products essential to modern life.
Whatever you call it (perfect camouflage, hiding in plain sight, or going undercover), the problem is that a great deal of value is easily overlooked.
And now comes the surprising revelation: the same is true for your organization’s unstructured data. Examples include images, videos, call recordings, scanned documents, chat logs, PDF files, and all the other types of files that do not fit a neat format of rows and columns. In fact, it is estimated that 80% of all new data created in organizations every day is of the “unstructured” type. Yet analysts and data scientists often overlook it because it is not available in an easy-to-use format.
Historically, this type of data has not been included in analytical datasets or catalogs. Even companies that sell data mesh platforms and structures, which claim to make “all” of an organization’s data visible and accessible, inevitably exclude unstructured data, because they prefer not to deal with a beast of that size.
But if that’s the case, why should I care about my unstructured data? Just let it go to waste.
The point is that traditional structured data is good for answering “what” questions: What were yesterday’s sales? What is our current customer satisfaction level? What is the average production of Unit #3? But structured data tends to look backward and cannot answer “why” questions: Why were yesterday’s sales 10% above plan? Why has customer satisfaction dropped 5 percentage points this week? Why does Unit #3 produce at half the level of the other units?
Unstructured data, on the other hand, represents current readings in real time: news sources, audio files from call centers, sensor outputs. In the three previous examples, it could explain what happened: an analysis of the news showed that a sudden cold snap caused an increase in coat purchases, boosting yesterday’s sales; an analysis of customer service calls uncovered a recurring problem that continues to affect satisfaction survey results; an analysis of the operating parameters of a specific production machine indicated that it needs maintenance, which explains its lower output.
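To make the “what” versus “why” distinction concrete, here is a minimal sketch in Python. It assumes the call recordings have already been transcribed and annotated with a topic. The dates, scores, topic names, and function names are all hypothetical and invented for illustration; they do not come from any real case.

```python
from collections import Counter

# Hypothetical structured data: the daily satisfaction score (the "what").
satisfaction = {"2024-02-05": 82, "2024-02-06": 77}

# Hypothetical annotated call records (unstructured data after labeling).
calls = [
    {"date": "2024-02-05", "topic": "billing"},
    {"date": "2024-02-06", "topic": "late_delivery"},
    {"date": "2024-02-06", "topic": "late_delivery"},
    {"date": "2024-02-06", "topic": "billing"},
]

def top_topic_by_day(records):
    """Return the most frequent call topic for each date."""
    per_day = {}
    for record in records:
        per_day.setdefault(record["date"], Counter())[record["topic"]] += 1
    return {day: counts.most_common(1)[0][0] for day, counts in per_day.items()}

def explain(scores, records):
    """Pair each day's score with that day's dominant call topic."""
    topics = top_topic_by_day(records)
    return [(day, score, topics.get(day)) for day, score in sorted(scores.items())]

report = explain(satisfaction, calls)
for day, score, topic in report:
    print(day, score, topic)
```

The score alone only shows that satisfaction fell; the dominant call topic placed next to it suggests where to look for the reason.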
All these insights and answers are only possible if unstructured data is included in models and analyses. Most organizations will say they are already capturing this type of data, and even storing it in a data mart (a subject-specific subset of a data warehouse). They may have it, but they are not really using it.
Unstructured data is not routinely introduced into AI models and is not part of most BI analyses, mainly because doing so requires work. Unstructured data must be tagged, annotated, or transcribed before it can be absorbed by any advanced technological platform. Unfortunately, most organizations are not prepared to do that kind of work: there is no one whose job title is “Data Labeler” and no one experienced in recruiting and managing data labelers.
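As a rough illustration of what “tagging” means in practice, the sketch below turns free-text chat messages into rows with labels. It is a deliberately simple stand-in: real labeling is usually done by trained annotators or dedicated tools, and the keyword rules, label names, and sample messages here are all hypothetical.

```python
import re

# Hypothetical labeling rules: each label is triggered by a few keywords.
LABEL_KEYWORDS = {
    "billing": ["invoice", "charge", "refund"],
    "delivery": ["late", "shipping", "package"],
}

def label_text(text):
    """Return the sorted labels whose keywords appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    labels = sorted(
        label for label, keywords in LABEL_KEYWORDS.items()
        if words & set(keywords)
    )
    return labels or ["unlabeled"]

def to_rows(messages):
    """Turn free-text messages into row-like records with labels."""
    return [
        {"id": i, "text": message, "labels": label_text(message)}
        for i, message in enumerate(messages, start=1)
    ]

chat_log = [
    "My package is late again.",
    "I was charged twice, please refund me.",
    "Thanks, have a nice day!",
]
rows = to_rows(chat_log)
for row in rows:
    print(row["id"], row["labels"])
```

Once each message carries labels, it can sit in rows and columns alongside the structured data that an AI or BI tool already uses.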
So, for forward-thinking leaders who are aware of the value locked in their unstructured data, the solution is to find someone in the organization with “data” in their title. And that’s how data labeling often ends up being assigned to data engineers, which is tragic. It’s not that they are unfamiliar with the process or incapable of doing it, but it’s a very costly way to do the work, both in direct cost and in opportunity cost (if these highly skilled individuals are working on data annotation, the jobs they were hired for are probably not getting done). Moreover, these are resourceful people tasked with doing something they would rather not do. So they find quick fixes, such as buying already labeled data from someone else or, worse, short-circuiting the process by using a generative AI tool to create synthetic data. But nothing is as powerful and unique as an organization’s own data, accurately annotated and ready to give a significant boost to the structured data already in an AI or BI tool.
If labeling that data is not something you can do, a data services provider can do it for you (just make sure to find out where the work will actually be performed and whether the company has experience in the specific domain for your business).
Back to the initial question. What does all this have to do with aluminum? A lot.
Most people would walk past a piece of bauxite without a second glance. It looks nothing like aluminum; it’s not something they’ve used or needed before. But refine it into a pure aluminum ingot, and the potential uses quickly present themselves.
All those scanned image files, thousands of recorded customer service calls, chat logs, and huge geospatial surveys are exactly the same kind of uninteresting ore until they are labeled or annotated. And then, suddenly, the value of the metal hidden in what looked like waste rock appears.
Do not overlook your unstructured data. Recognize its true potential and refine it into valuable ingots.
Sources:
- Data is the new aluminium (Data in Formation, Feb 2024)
- Using new data to measure and manage work (The Wall Street Journal, Jan 2024)
- Why Google would drop USD 2.6 billion on an analytics company (Wired, Nov 2023)
- Internal cases of SYNERGOS