Big data

The bigger the data, the bigger the management problem and the bigger (and more costly) the consequences of failure. Why? Partly because sheer volume makes big data difficult to interrogate; but also because disciplines such as data ontology and data quality have been overlooked, resulting in a data swamp which cannot be interrogated at all.

Early definitions of big data (e.g. the OED's) were based chiefly on sheer volume, pointing to chunking or brute force as the only options when considering big data and its analysis. Thinking has moved on: Oracle now defines big data as a “holistic, information management strategy” (see Oracle’s big data guide).

Big data is about far more than the data itself. The data may be used, processed, hosted or managed by far more parties than just the controlling organisation. It is distributed to many locations and devices; there are many different sources of data input or update (social media, for example); and it can be gathered from and published to a multitude of devices. Some locations will be in the Cloud; some on premises. There will be various choke points depending on the quality of the infrastructure (bandwidth, networks and hubs) available. This is the complex, interrelated nature of the digital ecosystems which are likely to be the outcome of digital transformation initiatives:

  • big data is dispersed – it involves processes that transcend domains, enterprises and firewalls;
  • the Internet of Things means everything, everywhere will be generating data, so big data growth will be well beyond expectation (smart phone applications have simply given us an indication of what is to come). The sheer number of devices and activities generating new types of data will mean organisations will need to be more rigorous in prioritising relevant information;
  • big data growth is exponential, so infrastructure will always struggle to cope; adding resources solves the problem only in the short term, and there has to be a longer-term, sustainable strategy;
  • big data will have a significant unstructured component which does not lend itself easily to the Cloud;
  • big data is data in a state of constant evolution (e.g. blockchain), so systems must take account of data in motion, data at rest and data in time – they must assume the potential for change (a minimal versioning sketch follows this list). In blockchain theory, this is likely to generate even larger collections of data to allow for soft and hard forks;
  • big data governance raises many taxing questions about architectures and how to deliver governance functionality without impeding performance; the increasing use of APIs still leaves many questions unanswered;
  • user expectations of big data access, readiness, openness and availability are high – users expect it at their fingertips, immediately, so systems have to be performant;
  • big data is too big to analyse or visualise in its entirety (without terabytes of memory and petabytes of hard disk), so it has to be distilled into denser but more valuable chunks. As a result, the process of analysis becomes one of selection and approximation to produce fast, meaningful insights rather than ponderous but absolute ones (see the sampling sketch after this list);
  • the value of big data is realised by capturing the interrelationship between the data and the processes that act on it, so there is additional metadata to be captured and stored (the provenance sketch after this list shows one such record).
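
To make the “data in time” point concrete, here is a minimal sketch in Python (the class and method names are hypothetical, not taken from any particular product) of an append-only store in which updates never overwrite earlier values but add new timestamped versions, so both the current state and the state at any past moment can be queried. Keeping every version is also precisely why such systems tend to generate ever larger collections of data:

```python
from datetime import datetime, timezone

class TemporalStore:
    """Append-only store: updates never overwrite, they add a new
    timestamped version, so 'data in time' is directly queryable."""
    def __init__(self):
        self._versions = {}   # key -> chronological list of (timestamp, value)

    def put(self, key, value):
        entry = (datetime.now(timezone.utc), value)
        self._versions.setdefault(key, []).append(entry)

    def get(self, key, as_of=None):
        """Latest value, or the value as it stood at time `as_of`."""
        history = self._versions.get(key, [])
        if as_of is not None:
            history = [v for v in history if v[0] <= as_of]
        return history[-1][1] if history else None

store = TemporalStore()
store.put("customer:42:address", "1 Old Street")
store.put("customer:42:address", "2 New Road")   # a new version, not an overwrite
print(store.get("customer:42:address"))          # -> "2 New Road"
```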
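
The “selection and approximation” point can also be illustrated. Below is a minimal sketch, assuming a Python environment, of reservoir sampling – one well-known way to distil a stream that is too large to hold in memory into a fixed-size, uniformly random sample that can be analysed quickly. The function name and the simulated stream are illustrative only:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of
    unknown (and potentially unbounded) length, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform in [0, i]; chance of j < k is k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: estimate the mean of a huge stream from a 1,000-item sample.
# (The stream is simulated here; in practice it might be a device feed,
# a log pipeline or a social media firehose.)
readings = (x * 0.001 for x in range(1_000_000))
sample = reservoir_sample(readings, k=1_000, seed=42)
print(sum(sample) / len(sample))   # a fast, approximate insight
```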
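
Finally, capturing the interrelationship between data and process in practice means storing lineage metadata alongside the data itself. A minimal, hypothetical sketch of such a provenance record (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Metadata linking a data item to the process that produced it."""
    dataset_id: str                  # what the data is
    source: str                      # where it came from (device, feed, upstream job)
    process: str                     # the process or transformation applied
    process_version: str             # which version of that process ran
    inputs: list[str] = field(default_factory=list)   # upstream dataset ids
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# Each transformation appends a record of this kind, so lineage can be
# traced end to end.
record = ProvenanceRecord(
    dataset_id="sales-2024-q1-clean",
    source="pos-feed-eu",
    process="deduplicate-and-normalise",
    process_version="2.3.1",
    inputs=["sales-2024-q1-raw"],
)
```

Records like this are the “additional data” referred to in the final bullet: they are small individually, but they accumulate with every transformation and must themselves be governed and stored.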
As Cloud uptake increases, emphasis is shifting from IT to the business; from applications to data – its origins, discovery, curation, dispersal, lifecycle, transport, governance and usage. Organisations should adopt data-centric philosophies when considering their next strategic moves in IT.