Unstructured data

When considering current data issues, the nettle of ‘unstructured’ data is rarely grasped.

Describing it as ‘data that isn’t structured’ is an accurate but unhelpful definition. The next approach might be to say ‘anything that doesn’t fit sensibly into a spreadsheet’ – but although this is helpful, it’s not accurate. We take a pragmatic view: unstructured data can encompass many types of digital items – graphics, pictures, photos, scans, free text in e-mails and social media, documents, videos, audio files, biometric information… the list goes on (but it specifically excludes data in structured systems which has simply not been understood!)

There is also confusion between ‘dark data’ and ‘unstructured data’. Dark data may include unstructured data, but is not exclusively unstructured. John Kelly, IBM senior VP and “father of Watson”, stated at the Third Cognitive Colloquium in 2015:

“Today, 80% of all data is dark and unstructured. We can’t read it or use it in our computing systems. By 2020, that number will be 93%.”

Unstructured data, on the other hand, is only dark data where it is not understood.

What can be stated with confidence is that unstructured data is estimated to form approximately 80% of organisational data, and on that basis alone, organisations ignore its management at their peril. The task of managing it should be a persistent exercise – not an inconsiderable one, given that growth in unstructured data is estimated at 40% per annum. The Data Genomics Project, initiated by Veritas, is salutary reading and its findings should have been a wake-up call:

  • 41% of the total environment hasn’t been modified in the past 3 years (i.e. since 2016). Of that, 12% is classified as ‘ancient’, not having been touched in 7 years;
  • The most prevalent file types aren’t what you’d expect;
  • The traditional office file types are overly taxing on the environment;
  • Orphaned information is disproportionately overweight and extra stale;
  • There are some attractive storage cost efficiencies from system archiving, and I mean attractive;
  • Storage capacity is growing faster than file creation, but only by 9% – the average PB contains 2,312,000,000 files (which can’t be good for indexing, searching or analysis);
  • The composition (according to file types) has changed radically.

The risk with information growth, particularly in unstructured data, is that if it remains unmanaged, its usefulness deteriorates as volume increases – so you could, in effect, be paying for a lot of storage and disk space with diminishing returns! This is where the GDPR ‘stick’ could actually trigger business benefits by forcing us to evaluate whether we really need, or ought, to retain and store all that we do.

There are also benefits from considering which file formats offer the best value in terms of storage cost returns, and strangely, one of these is Geographic Information System file types (GIS or spatial data), which, even more bizarrely, we produce most of in the autumn! The worst offenders are image and video files (videos in particular) and developer files (although the latter is more forgivable). One of the reasons for this again comes back to metadata: orphaned, unadopted files typically tend to be content-rich (e.g. video) and are not ‘owned’ to the same extent that office files are. If no one is taking responsibility for the files, then who is managing them? If content is worth keeping, it is worth owning!

So how do you get to the rewards? The Data Genomics Project points out:

“If 41% of the environment is stale, you could be spending as much as $20.5 million per year to manage data that hasn’t been touched in three years. But cleaning it up is tough. That 4.1PB equates to 9,479,200,000 individual file decisions to classify, delete, or archive.”
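
A quick, purely illustrative sanity-check shows how the two figures quoted above fit together – only the report’s numbers are real, the rest is arithmetic for arithmetic’s sake:

```python
# Sanity-check of the Data Genomics figures quoted above (illustrative only).
files_per_pb = 2_312_000_000   # average number of files per petabyte, per the report
stale_pb = 4.1                 # volume of stale data quoted in the report, in PB

decisions = files_per_pb * stale_pb
print(f"{decisions:,.0f} individual file decisions")   # -> 9,479,200,000
```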

Unstructured data is a data management nightmare with no easy solution in sight:

  • Size: unstructured data is often bulky with consequent infrastructure, storage and workflow issues;
  • Diversity of origination: unstructured data originates from a plethora of applications, each with its own file suffix, and new ones emerge every day;
  • Standardisation: unstructured data formats, until fairly recently, were not standardised;
  • Metadata: typically organisations have been slack in either maintaining or inputting their metadata, as this has largely relied on the dedication and savvy of operational staff rather than automation. Poor metadata impacts searchability and archiving capabilities, performance and storage costs. Metadata is essential for comprehending your digital estate;
  • Workflow: unstructured data tends to move through more workflow steps than structured data;
  • Security, sharing, permissioning: unstructured data does not tend to be in the frame when these issues are considered, yet in many respects, it is more vulnerable than structured data;
  • Storage & archive: rarely is unstructured data culled or optimised (to reduce inefficiencies in storage or performance).

Genuine solutions are as rare as hen’s teeth. Autonomy thought they had one, and we all know how that turned out! Perhaps, then, the management of unstructured data requires a more strategic approach than simply issuing a tender for content management? After all, unstructured data accounts for approximately 80% of your organisation’s digital assets. Is there any point worrying about security and sharing, for instance, until you are certain which files you are going to retain?

Because of its nature, unstructured data is more dependent than structured data on metadata in order to understand the content and context of the files. Similarly, there is no point in managing unstructured data objects unless they are worth it, i.e. of good data quality; otherwise, as the Data Genomics Project points out, you are simply going to be paying for a lot of unnecessary storage. So a lot of effort has to go into preparation and cleansing, to ensure the metadata quality (and therefore data quality) of your unstructured data so that, at the very least, you understand and have visibility of what you have. Classification cannot be done until you have something meaningful to classify, i.e. quality metadata.
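
By way of illustration, a minimal sketch of such a ‘visibility’ pass is below – it simply walks a directory tree and records basic technical metadata per file. The paths in the usage note are hypothetical, and a real estate would need far more robust tooling:

```python
import csv
from pathlib import Path
from datetime import datetime, timezone

def inventory(root: str, out_csv: str) -> None:
    """Walk a directory tree and record basic technical metadata for each file."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "extension", "size_bytes", "last_modified_utc"])
        for path in Path(root).rglob("*"):
            if path.is_file():
                info = path.stat()
                writer.writerow([
                    str(path),
                    path.suffix.lower(),
                    info.st_size,
                    datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
                ])

# Hypothetical usage – build a first-cut inventory of a legacy shared drive:
# inventory("/mnt/legacy-share", "inventory.csv")
```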

Going forward, most metadata can be added automatically at the point of creation. This applies to administrative, descriptive, structural, preservation and technical metadata, as well as metadata received (in accordance with documented collecting policies) from data creators, other archives, repositories or data centres. The challenge is with historic material: assigning appropriate metadata when it is incomplete or meaningless, or where the provenance and referential integrity have been lost. If, for instance, your organisation has grown through acquisition or consolidation, it is perfectly possible that the metadata in older files could be completely random: the result of data input misunderstandings (e.g. whole sentences!) compounded by ‘munging’ (mashed up until no good!) – we have seen this. In any event, unstructured data quality, and therefore its value, relies on quality metadata.
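
To illustrate what adding metadata at the point of creation might look like in practice, here is a minimal sketch that writes a small ‘sidecar’ record alongside a new file. The field names and the sidecar convention are assumptions, not a standard:

```python
import hashlib
import json
from pathlib import Path
from datetime import datetime, timezone

def stamp_metadata(path: str, creator: str, source_system: str) -> Path:
    """Write a sidecar JSON record of basic administrative and technical metadata."""
    p = Path(path)
    record = {
        "file": p.name,
        "created_utc": datetime.now(timezone.utc).isoformat(),  # administrative
        "creator": creator,                                     # administrative / descriptive
        "source_system": source_system,                         # provenance
        "size_bytes": p.stat().st_size,                         # technical
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),   # preservation / fixity
    }
    sidecar = p.parent / (p.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar

# Hypothetical usage at the point of creation:
# stamp_metadata("reports/q3_summary.docx", creator="j.smith", source_system="finance-share")
```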

In arriving at a metadata taxonomy – because you will want to control metadata generation going forward – you will soon realise that use cases determine data models, and therefore taxonomies. The application, processing and governance of unstructured content will vary, largely according to file type (graphics versus office documents, for instance), so you are unlikely to be able to standardise taxonomies to the same extent as for structured data. The good news is that you are more likely to be able to segment unstructured data according to file types or use cases, as these tend to be more specific, which will reduce the workload somewhat.
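
As a sketch of that segmentation idea, the mapping below groups file extensions into illustrative segments, each of which could then carry its own taxonomy. The segment names and extension lists are assumptions:

```python
# Illustrative segmentation of files by type – intended only to show the shape of the approach.
SEGMENTS = {
    "office":    {".docx", ".xlsx", ".pptx", ".pdf"},
    "image":     {".jpg", ".jpeg", ".png", ".tif"},
    "video":     {".mp4", ".mov", ".avi"},
    "spatial":   {".shp", ".geojson", ".gpkg"},   # GIS / spatial data
    "developer": {".py", ".java", ".cs", ".js"},
}

def segment_for(extension: str) -> str:
    """Map a file extension to a segment, each of which can carry its own taxonomy."""
    ext = extension.lower()
    for segment, extensions in SEGMENTS.items():
        if ext in extensions:
            return segment
    return "other"

print(segment_for(".MOV"))   # -> video
```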

Unstructured data, as the medical records example demonstrates, is an essential, if challenging, part of an overall data strategy. It cannot be ignored, as it forms some 80% of your organisation’s data assets. Given such a significant proportion, it is at least as important as structured data when considering the data being put to work in your organisation. Securing value from unstructured data therefore means being able to manage and use it, which, as explained, relies on its metadata. Putting effort into the metadata of your unstructured content will help to mitigate costly mistakes from incorrect data models, unnecessary storage, poor performance and assumed use cases, as well as maximising value from the effective use of the unstructured data itself.

What you need to do is tackle the classification problem. Archiving and deletion rely mostly on practicalities and common sense. Classification, indexing and therefore searchability rely on metadata, and this is essential because metadata is the means of ‘structuring’ unstructured data.
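
A minimal sketch of metadata-driven classification, reusing the staleness thresholds quoted earlier (3 years = stale, 7 years = ancient); the suggested actions are assumptions, not recommendations:

```python
from datetime import datetime, timezone, timedelta

def classify(last_modified: datetime) -> str:
    """Classify a file purely from its last-modified metadata (illustrative thresholds)."""
    age = datetime.now(timezone.utc) - last_modified
    if age > timedelta(days=7 * 365):
        return "ancient – destroy or move to permanent offline archive"
    if age > timedelta(days=3 * 365):
        return "stale – candidate for the archive tier"
    return "active – keep on primary storage"

print(classify(datetime(2012, 5, 1, tzinfo=timezone.utc)))
```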

The single malt solution

Consider the situation of an IT Manager tasked with migrating unstructured data to Cloud repositories in the brave, new world of GDPR:

  • What visibility or understanding is there of the material in scope prior to migration?
  • What about ownership? Which person, department or business unit owns the files? What is their provenance – did they come from an acquired entity?
  • Do they contain personally identifiable information? Are any sensitive?
  • Do they require longer retention periods – for financial diligence for instance? Are they merely stale, or are they ‘ancient’ and destined either for destruction or offline permanent archive?
  • Should the organisation still hold them at all (bearing in mind the likely culling of marketing databases arising from GDPR where prior consent has not been obtained)?
  • Is there a ‘Right to be Forgotten’ list to avoid porting over data that was meant to be ‘forgotten’?
  • How will you know what to move without discovery? Has the metadata been investigated and what is the impact of this on the overall data model and process flows?

Such questions are difficult enough to contend with regarding structured data, but in terms of unstructured data… well, you could understand the IT Manager reaching for the single malt.
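
For illustration, a first pre-migration discovery pass might look something like the sketch below, which flags text files that appear to contain personal data. The patterns are deliberately crude placeholders and the path is hypothetical; real discovery needs proper tooling:

```python
import re
from pathlib import Path

# Deliberately crude, illustrative patterns for personal data.
PATTERNS = {
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "uk_ni_number":  re.compile(r"\b[A-Z]{2}\d{6}[A-D]\b"),
}

def flag_possible_pii(root: str):
    """Yield (path, pattern name) pairs for text files that appear to contain personal data."""
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                yield str(path), name

# Hypothetical usage before deciding what to migrate:
# for path, kind in flag_possible_pii("/mnt/legacy-share"):
#     print(f"{path}: possible {kind}")
```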

Before reaching for the bottle, though, there are some practical steps:

  • Establish the relevant corporate and governance policies, so you know what is supposed to happen;
  • Does the organisation have a data strategy? If not, you need a plan, as well as relevant policies;
  • You don’t need to deal with it all at once – prioritise;
  • Adopt agile practices so you don’t do the wrong thing really well;
  • Focus on your future desired operations rather than repeating historical mistakes by recreating what you already had;
  • Build for change – data-centric not application-centric;
  • Target value – these are your data assets, after all.

Elephant Carpaccio might be more effective…