Look after your metadata and it will look after you
Gartner Research stresses the future importance of active metadata. “By 2023, organizations utilizing active metadata, machine learning and data fabrics to dynamically connect and automate data management processes will reduce their time to data delivery, and impact on value by 30%.” Active metadata implies curated metadata: it must be up to date and accurate to be of any use, and trusted enough to serve as the basis for decision-making and analytics.
Active not passive
The increasing availability of tools such as metadatabases, metadata management, metadata inventory, enterprise metadata cataloguing and so forth demonstrates increasing awareness of the need for metadata curation. But such tools began life largely as passive administrative repositories where captured metadata was seen as an end in itself.
Programmatic, not manual
Going forward, most metadata can and should be added programmatically and automatically at the point of creation. Get into good habits now; it will pay dividends in the future.
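As a minimal sketch of what "at the point of creation" could look like, the Python function below writes a file and a JSON sidecar describing it in a single step. The sidecar naming convention (`.meta.json`) and the `owner` and `purpose` fields are illustrative assumptions, not a standard; a real system would more likely write to its catalogue or object store.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_with_metadata(path: Path, content: bytes, owner: str, purpose: str) -> dict:
    """Write a file and, in the same step, a JSON sidecar that describes it."""
    path.write_bytes(content)
    metadata = {
        "filename": path.name,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "owner": owner,      # hypothetical field, chosen for illustration
        "purpose": purpose,  # hypothetical field, chosen for illustration
    }
    # Assumed convention: the sidecar sits next to the file as <name>.meta.json
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

Because the metadata is captured by the same call that creates the object, nobody has to remember to add it later.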
To remediate or not to remediate, that is the question
Remember that on average, 80% of the digital estate contains data which is not understood or even known – the same applies to its metadata. But remediation is not a straightforward issue.
Governance
If metadata is part of a system of record then any proposed remediation, corrections or insertion (to remedy absent metadata) will need careful consideration. Poor management of metadata can impact negatively on upstream systems, especially compliance.
Workflows that create metadata for an object need to factor in three things: that the metadata facilitates timely discovery or protection of key information; that this information is collected in the first place; and that any processes connected with it are logged for audit.
Use case
Understand why metadata is added to an object and the scenarios which would require the object’s metadata to be updated or removed. Understanding these helps to identify existing workflows that may need updating to cater for new requirements, along with the conditions those workflows must evaluate.
Lifecycle
Workflows that manage the lifecycle of metadata or tags frequently contain functions to validate that an object has the correct tags assigned. These functions are often used to determine whether a task to add, update or remove a tag has succeeded, and they are also useful for searching. But there may be a time lag before newly added items become visible in the system, so assuming immediate consistency can lead to unexpected results or unhandled errors. Queries may return no results for very recently added objects because the creation workflow has not yet completed. Similar issues arise in metadata management where the system executing the CRUD operation is not the same as the system that responds to queries.
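One common way to cope with this lag is to retry the validation query for a bounded period rather than treating the first empty result as a failure. The sketch below assumes a hypothetical `fetch_tags` callable standing in for whatever query API your system exposes; it is expected to return `None` while the object is not yet visible.

```python
import time
from typing import Callable, Mapping, Optional


def wait_for_tag(
    fetch_tags: Callable[[str], Optional[Mapping[str, str]]],
    object_id: str,
    key: str,
    expected: str,
    attempts: int = 5,
    delay_seconds: float = 0.5,
) -> bool:
    """Return True once the query side reports the expected tag value.

    fetch_tags is a hypothetical stand-in for the system's query API.
    Returns False if the tag is still missing or wrong after all attempts.
    """
    for attempt in range(attempts):
        tags = fetch_tags(object_id)
        if tags is not None and tags.get(key) == expected:
            return True
        if attempt < attempts - 1:
            # Exponential back-off: wait longer after each failed check
            time.sleep(delay_seconds * (2 ** attempt))
    return False
```

A `False` result then means "not confirmed within the time allowed", which the calling workflow can log for audit or escalate instead of failing silently.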
Comparative analysis
How do you assign appropriate metadata in a historic or archive environment where the existing metadata may be incomplete or meaningless, and where provenance and referential integrity have been lost? If, for instance, your organisation has grown through acquisition or consolidation, it is perfectly possible that the metadata in older acquired files is completely random: the result of data input misunderstandings (e.g. whole sentences input as metadata), compounded by ‘munging’ (mashed up until no good) and poor database management.
Structured data
Curating structured data should be done through the lens of analysis or forward purpose, such as AI. Both benefit from canonical data and metadata, so metadata curation can be useful in identifying anomalies that would negatively affect downstream activity.
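A simple illustration of anomaly spotting is to normalise each metadata value to a canonical form and compare it with an agreed vocabulary. The sketch below is an assumption-laden example rather than a recommended rule set: the `allowed` vocabulary and the `max_words` threshold for free text are both hypothetical and would need tuning to your own data.

```python
from typing import Dict, List, Set


def canonical(value: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(value.split()).casefold()


def find_anomalies(values: List[str], allowed: Set[str], max_words: int = 4) -> Dict[str, List[str]]:
    """Sort suspicious metadata values into empty, free-text and unknown buckets."""
    allowed_canonical = {canonical(v) for v in allowed}
    report: Dict[str, List[str]] = {"empty": [], "free_text": [], "unknown": []}
    for raw in values:
        value = canonical(raw)
        if not value:
            report["empty"].append(raw)
        elif len(value.split()) > max_words:
            # Long values are probably sentences typed into a metadata field
            report["free_text"].append(raw)
        elif value not in allowed_canonical:
            report["unknown"].append(raw)
    return report
```

Values such as "Finance" and "  finance " pass as the same canonical entry, while a misspelling or a whole sentence is surfaced for review rather than silently carried downstream.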
Unstructured data
Unstructured data is often content-rich, so you are more likely to be able to segment it according to file types or use cases (as these tend to be more specific), and therefore make reasonable inference as to what the metadata should be.
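As a rough sketch of that kind of inference, the function below uses Python's standard `mimetypes` module to guess a file's type from its name and suggest a content category. The category mapping is a hypothetical example, and anything inferred this way is flagged as such so it can be reviewed before being trusted.

```python
import mimetypes
from pathlib import PurePath

# Hypothetical mapping from MIME type prefix to a suggested content category
CATEGORY_BY_MIME_PREFIX = {
    "image/": "image",
    "audio/": "audio",
    "video/": "video",
    "text/": "document",
    "application/pdf": "document",
}


def suggest_metadata(filename: str) -> dict:
    """Suggest basic metadata for a file based only on its name and extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    category = "unknown"
    if mime_type:
        for prefix, name in CATEGORY_BY_MIME_PREFIX.items():
            if mime_type.startswith(prefix):
                category = name
                break
    path = PurePath(filename)
    return {
        "filename": path.name,
        "extension": path.suffix.lower(),
        "mime_type": mime_type,
        "suggested_category": category,
        "inferred": True,  # mark as a guess, not a curated value
    }
```

In practice the extension is only a first hint; inspecting the content itself would give a more reliable segmentation.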
Winning ways
Google is said to have no ‘dark’ data, and it has built comprehensive tools for understanding unstructured data, such as audio-visual search. Search engines reap the rewards of understanding, knowing and finding data because the metadata exists, is curated and is continuously validated by users (e.g. the ‘report an error’ option in Google Maps):