Moving data to Cloud

Moving data to the Cloud (or to another Cloud) never happens without context. Every modern data system relies upon there being ‘data about data’, at the most granular level, in order to determine who can access data, how long data should be kept, and what data should be masked or tagged for privacy or security purposes. As the lines of business blur across collaborative supply chains, corporate boundaries, ecosystems, regulations and jurisdictions, this becomes ever more important if regulatory pace and technological innovation are to keep in step.

“As the world around us increases in technological complexity, our understanding of it diminishes. Underlying this trend is a single idea: the belief that our existence is understandable through computation, and more data is enough to help us build a better world.” (Source: Amazon’s summary of ‘New Dark Age: Technology and the End of the Future’, James Bridle, 2018.)

Data is typically the organisational ‘Cinderella’. Organisations devote significant spend to building code and developing applications, but to data… not so much. It’s like Hamlet without the Prince!

The opportunity cost of poor data is not trivial:

“Gartner measures the average financial impact of poor data on businesses at $9.7 million per year. These costs, however, are not solely financial; businesses can see loss of reputation, missed opportunities and higher-risk decision making as a result of low confidence in data.” (Source: Forbes)

Cloud needs quality data (properly parsed and de-duplicated) and quality metadata (disciplined according to agreed keyword taxonomies based on governance and risk policies). It is no longer sufficient to think in terms of data alone. The data record (data plus metadata) must be accurate, valid, described and complete; have referential integrity; and be fit for its intended use in operations, decision-making and planning.
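
A minimal sketch of what such checks might look like in practice, assuming a simple tabular extract (the file names, field names and retention-class taxonomy are illustrative assumptions, not a prescription):

```python
# Illustrative data-record quality checks; all field names and rules are assumptions.
import pandas as pd

customers = pd.read_csv("customers.csv")   # hypothetical source extract
orders = pd.read_csv("orders.csv")         # hypothetical related extract

report = {
    # Completeness: no empty values in fields the record is expected to carry.
    "missing_required": customers[["customer_id", "country", "retention_class"]]
        .isna().sum().to_dict(),
    # Uniqueness: duplicate keys indicate the extract was not properly de-duplicated.
    "duplicate_keys": int(customers["customer_id"].duplicated().sum()),
    # Validity: values must come from the agreed taxonomy.
    "invalid_retention_class": int(
        (~customers["retention_class"].isin({"keep-7y", "keep-1y", "delete"})).sum()
    ),
    # Referential integrity: every order must point at a known customer.
    "orphan_orders": int(
        (~orders["customer_id"].isin(customers["customer_id"])).sum()
    ),
}
print(report)
```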

“84% of CEOs are concerned about the quality of the data they’re basing their decisions on. The role data plays in enabling future technologies such as artificial intelligence and the Internet of Things is critical—but one that will be undermined if businesses do not make data quality a priority.” (Source: Forbes)

The performance of Cloud is adversely impacted by unusable or unnecessary data, which will also attract equally unnecessary costs, particularly upon exit. Overloading the Cloud with poor or unnecessary data will also exacerbate any bandwidth choke points on your network. So making sure that your shiny new Cloud does not get data indigestion is a key step.

Data sources may include packaged on-premise applications (which are increasingly being remaindered by global vendors in favour of Cloud versions); legacy or proprietary databases; big data lakes; the amalgam arising from mergers and acquisitions; data from packaged systems whose data models contain duplications and inconsistencies arising from a software acquisition strategy; migrations from other Clouds; consolidation; storage; or simply spreadsheets that should have been thrown away years ago.

Each will have its own characteristics and may be informed by a desire to remainder obsolete systems for efficiency’s sake; by technology stack upgrades or changes; by changes in technical leadership; or by a drive to pursue a common technology platform or to harmonise data exchange formats, models or storage mechanisms.

“Interviews with executives and analysts suggest that confidence in data may be low for various reasons, including silos of information, difficulty in securing executive buy-in, not to mention the sheer quantity of legacy data a company may possess…” (Source: Forbes)

Different data sources require different discovery and preparation. Financial records need to retain integrity and auditability. Customer data needs parsing to achieve and maintain consistency. If the source data is in a packaged application, this does not necessarily guarantee that it will be meaningfully described or complete, and empty cells can hinder the export process. (When extracted to Excel, many reports from a certain packaged application do not come out in a clean table format; because of the source data structures, they appear with blank cells that have to be logically inferred.) The data models of complex, packaged applications may require you to reverse engineer from the package’s own data dictionary. It may not be easy to obtain or extract metadata from table names that are not informed by natural-language descriptions. Then there is the challenge of determining the pre-eminence of multiple fields with the same name within the source application. And for DPOs, finding sensitive personal data in the corporate haystack may be a nightmare without proper tagging (that is, where required fields and metadata are absent). Source data may sit in remaindered or proprietary silos, requiring an agnostic, broad base of technical skill and expertise to effect discovery, preparation, extraction and migration.
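
Where that kind of export is the only route out, the ‘logical inference’ usually amounts to carrying group values down into the blank cells. A minimal pandas sketch of that clean-up, plus a crude first pass at flagging untagged personal data for a DPO, might look like the following (the file and column names are assumptions):

```python
# Carry group values down into the blank cells that some packaged-application
# reports leave behind when extracted to Excel. Column names are illustrative:
# values such as cost centre or account appear only on the first row of a group.
import pandas as pd

raw = pd.read_excel("gl_export.xlsx")              # hypothetical export file

grouped_cols = ["cost_centre", "account"]          # columns left blank within a group
clean = raw.copy()
clean[grouped_cols] = clean[grouped_cols].ffill()  # carry group values downwards

# Drop the subtotal/separator rows the report inserts between groups.
clean = clean.dropna(subset=["amount"]).reset_index(drop=True)

# Crude flag for potentially sensitive personal data where tagging is absent,
# to give a DPO somewhere to start looking (illustrative pattern only).
email_like = clean.apply(
    lambda col: col.astype(str).str.contains(r"[^@\s]+@[^@\s]+\.[^@\s]+", regex=True).any()
)
print("Columns that may contain email addresses:", list(email_like[email_like].index))
```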

Cloud is suited generally to structured data, which leaves the conundrum of what to do with unstructured data, which may account for as much as 80% of organisational data. Unstructured data (and its provenance) may be a crucial part of the digital record and should not be ported to Cloud without careful thought. Dark data is data which is collected but is unknown, unrecognised and undescribed, and therefore useless. It is either unnecessary (generating consequent inefficiencies and costs) or has, as yet, unrealised value (because it is not understood). Unstructured data and dark data often overlap, since it is unstructured data that is most likely not to be understood.

“Today, 80% of all data is dark and unstructured. We can’t read it or use it in our computing systems. By 2020, that number will be 93%.” (Source: John Kelly, IBM’s “father of Watson” and senior VP, speaking at the Third Cognitive Colloquium, 2015.)

Clearly it makes no sense to port data unless it is either useful now or may be in the future. So organisations need first to discover what data they have; assess which data is useful or of suitable quality and where remedial action is required; and determine what unstructured or dark data is held and begin the process of comprehending it. Standardising data formats, descriptions and metadata taxonomies, and having a corporate ontology, facilitates the preparation and wrangling of structured data. Unstructured data not only requires these techniques but has to be accompanied, at the very least, by meaningful metadata so that it is understood and capable of analysis, even if that analysis has to happen at a future point. Then, thought should be given to how to conserve provenance, case management attachments and the referential integrity of the record whilst linking in situ BLOBs to new Cloud pathways.
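
A minimal sketch of the kind of metadata record that could accompany unstructured content, assuming a simple corporate taxonomy and a record linking an in situ file to its new Cloud pathway (all field names, taxonomy terms and locations are illustrative assumptions):

```python
# Illustrative metadata record for unstructured content; taxonomy and fields are assumptions.
from dataclasses import dataclass, field
from datetime import date

CORPORATE_TAXONOMY = {"contract", "invoice", "correspondence", "drawing"}  # assumed terms

@dataclass
class UnstructuredRecord:
    source_path: str            # where the BLOB currently lives (kept in situ)
    cloud_uri: str              # the new Cloud pathway linked to the record
    doc_type: str               # must come from the agreed taxonomy
    owner: str
    created: date
    tags: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.doc_type not in CORPORATE_TAXONOMY:
            raise ValueError(f"'{self.doc_type}' is not in the corporate taxonomy")

record = UnstructuredRecord(
    source_path=r"\\fileserver\legal\2014\msa_acme.pdf",
    cloud_uri="https://storage.example.com/legal/msa_acme.pdf",
    doc_type="contract",
    owner="legal-ops",
    created=date(2014, 3, 7),
    tags=["acme", "master-services-agreement"],
)
print(record)
```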

This leads on to consideration of data models. Models need a consistent approach to data descriptors throughout an ecosystem. But the incumbent data model may no longer align with Cloud data models.

  • SaaS Clouds, for instance, typically have constrained data models (fewer fields, pre-determined field names, shorter field sizes), which may prompt a review of existing data models in case any re-alignment is needed (assuming the organisation already has a standard metadata taxonomy to enforce user input disciplines and consistent tagging). For instance, does the Cloud target use the same field names as your source data model? If not, how will you handle that? Is it therefore a good idea to insist on canonical data throughout? (A field-mapping sketch follows this list.)
  • Within a DaaS infrastructure, the data models belong to the service provider and your data may not be available for download!
  • On some vendor platforms, replication of the entire data model is possible, but this still has to be treated with some caution, as shifts in data gravity increase the possibility of lock-in. Focus on the consistency and management of the end-to-end model is informed by the latest governance pressures on corporate computing, such as GDPR. These regulations will change, and your data model will therefore evolve, but it must remain consistent overall. So data models must be capable of rapid evolution to incorporate governance requirements and to facilitate sustainable self-governance; but this cannot happen in isolation.
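
The field-mapping sketch promised above: a minimal illustration of renaming source fields to a more constrained target model and truncating values to its shorter field sizes. The field names and length limits are assumptions, not any particular vendor’s model.

```python
# Map a source data model onto a more constrained SaaS target: rename fields to
# the target's pre-determined names and truncate values to its shorter sizes.
FIELD_MAP = {            # source field      -> target field (all assumed)
    "customer_full_name": "Name",
    "primary_email_addr": "Email",
    "acct_status_code":   "Status",
}
TARGET_MAX_LEN = {"Name": 80, "Email": 80, "Status": 10}   # assumed SaaS limits

def to_target(source_row: dict) -> tuple[dict, list[str]]:
    """Return the mapped row plus warnings for anything silently altered."""
    mapped, warnings = {}, []
    for src, tgt in FIELD_MAP.items():
        value = str(source_row.get(src, ""))
        limit = TARGET_MAX_LEN[tgt]
        if len(value) > limit:
            warnings.append(f"{src}: truncated from {len(value)} to {limit} chars")
            value = value[:limit]
        mapped[tgt] = value
    unmapped = set(source_row) - set(FIELD_MAP)
    if unmapped:
        warnings.append(f"no target field for: {sorted(unmapped)}")
    return mapped, warnings

row, notes = to_target({"customer_full_name": "Ada Lovelace",
                        "primary_email_addr": "ada@example.com",
                        "acct_status_code": "ACTIVE",
                        "legacy_score": "42"})
print(row, notes)
```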

Then there are business processes: what do you want to do with the data, and do you have the data to do it with? Data should have purpose and usability. As more organisations use agile, evaluating ‘stories’ to envision a desired process capability and outcome, viable data and a consistent data model are required to ensure that the right data, in the right format, is available to the right person at the right time and in the right place.

“Data today is often compared with oil, as in its raw form, its uses are limited. It is through refinement that oil becomes useful as kerosene, gasoline and other goods, and similarly it is through the refinement process of cleansing, validation, de-duplication and ongoing auditing that data can become useful in the kinds of advanced analytics that are starting to shape our world.” (Source: Forbes)

Don’t simply replicate existing processes; use your move to Cloud as an opportunity to view matters afresh!

What of the future? Increasingly, Clouds (whatever variety of ‘as a Service’ they may be) are key nodes of fluid ecosystems, as organisations with communities of interest seek to innovate, transform and seize new collaborative opportunities. Even the rate of change of ecosystems is itself changing rapidly:

“Successful businesses are those that evolve rapidly and effectively. Yet innovative businesses can’t evolve in a vacuum. They must attract resources of all sorts, drawing in capital, partners, suppliers, and customers to create cooperative networks…not as a member of a single industry but as part of a business ecosystem that crosses a variety of industries. In a business ecosystem, companies co-evolve capabilities around a new innovation: They work cooperatively and competitively to support new products, satisfy customer needs, and eventually incorporate the next round of innovations.”  (Source: Harvard Business Review; James F. Moore, “Predators and prey: A new ecology of competition.” )

Key to this is ensuring that data, the ‘crude oil’ of any corporate ecosystem, has maximum quality, flow and interoperability. The increasing prevalence of Internet of Things (IoT) devices and apps is putting severe performance demands on existing infrastructure and changing the nature of the information that is collected. For instance, locational information (spatial data) is increasingly pervasive and has its own peculiar characteristics. And, as they say, everything is somewhere.

There remain longer-term data management issues: sustainable data quality and data curation, especially within ecosystems where there is a need to synchronise multi-lateral input and update vectors (understood as input from multiple entities along multiple vectors, as opposed to unilateral or bidirectional data input). Why multi-lateral?

“The bilateral exchange of information is a long, slow, complicated process!”

For instance, is there more than one version of the truth, or is data being shared intelligently so that when bad data is corrected, all copies of it are also updated? This is vital for ecosystems, where the whole really is greater than the sum of the individual parts.
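
One possible shape of ‘shared intelligently’ is a single golden record whose corrections are pushed to every copy. The in-memory sketch below is illustrative only, standing in for what would more usually be an MDM hub, change-data-capture feed or event stream:

```python
# Corrections are applied once to a golden record, then pushed to every
# subscribed copy (class and field names are illustrative assumptions).
from collections import defaultdict

class GoldenRecordStore:
    def __init__(self):
        self.records = {}                       # key -> current trusted values
        self.subscribers = defaultdict(list)    # key -> callbacks holding copies

    def subscribe(self, key, callback):
        self.subscribers[key].append(callback)

    def correct(self, key, **changes):
        """Apply a correction once, then propagate it to every copy."""
        self.records.setdefault(key, {}).update(changes)
        for notify in self.subscribers[key]:
            notify(key, dict(self.records[key]))

store = GoldenRecordStore()
store.subscribe("CUST-001", lambda k, rec: print("billing copy updated:", k, rec))
store.subscribe("CUST-001", lambda k, rec: print("CRM copy updated:", k, rec))
store.correct("CUST-001", postcode="EC1A 1BB")   # one fix, all copies updated
```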

Cloud vendors may enforce proprietary data formats, proprietary APIs or compressed data formats which will hinder data interoperability. Poor Cloud commissioning can also lead to show-stoppers such as a lack of data interoperability (data transfers being neither viable nor possible due to lock-in). So early consideration needs to be given to inbound and outbound data, data flows and data usability.
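
One pragmatic hedge against that kind of lock-in is to keep an outbound copy of key data in open, documented formats from day one. A minimal sketch, assuming a simple customer extract (file names and fields are illustrative):

```python
# Keep an outbound copy in open formats: CSV for the tabular data, JSON for the
# schema that describes it, so the data remains usable outside the vendor.
import csv, json

schema = {
    "entity": "customer",
    "fields": [
        {"name": "customer_id", "type": "string", "description": "Primary key"},
        {"name": "country", "type": "string", "description": "ISO 3166-1 alpha-2"},
    ],
}
rows = [
    {"customer_id": "CUST-001", "country": "GB"},
    {"customer_id": "CUST-002", "country": "IE"},
]

with open("customers_export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[c["name"] for c in schema["fields"]])
    writer.writeheader()
    writer.writerows(rows)

with open("customers_export.schema.json", "w") as f:
    json.dump(schema, f, indent=2)   # the schema travels with the data
```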

With interoperability in mind, Incorvus follows the Open Standards principles set out by the Cabinet Office Open Standards Board, so that software may interoperate through open protocols and data exchanges may occur smoothly between software and data stores.

Open Standards alone, though, may not be the entire answer: they are only one enabling aspect of interoperability. The data of the ecosystem must follow common standards and be fit for purpose; but beyond that, its curation has to be ordered so that the lifecycle rolls like a well-drilled orchestra, passing data along the line according to time-delimited precedence, orchestrated from multiple sources, so that harmony, not cacophony, is produced and continues to be produced. This isn’t a one-time exercise; it’s more like painting the Forth Bridge.

Therefore, maximise your opportunity. View your move to Cloud as a chance to improve your data quality, your data circulation and your data curation. Equip your organisation for evolution!

“More than ever, the ability to manage torrents of data is critical to a company’s success. But even with the emergence of data-management functions and chief data officers (CDOs), most companies remain badly behind the curve. Cross-industry studies show that on average, less than half of an organization’s structured data is actively used in making decisions—and less than 1% of its unstructured data is analyzed or used at all. More than 70% of employees have access to data they should not, and 80% of analysts’ time is spent simply discovering and preparing data. Data breaches are common, rogue data sets propagate in silos, and companies’ data technology often isn’t up to the demands put on it.” (Source: Harvard Business Review)

If you are considering the use of Intelligence Augmentation, Artificial Intelligence, Machine Learning, blockchain or other new technologies, your first step has to be ensuring the fitness of your data and evaluating how this would play out within your data strategy. This informs our open, agnostic, data-centric approach to moving data to Cloud.

Quality Cloud needs quality data!