Optimise data processing

As data volumes grow exponentially and formats become more varied, data transformation techniques are vital. Routine transformations ensure that the data format matches the target destination. More importantly, data transformation aims to optimise the efficiency of downstream data processing and analytics tasks through techniques such as:

  • collecting, aggregating or integrating various datasets;
  • connecting data sources in order to analyse the resulting integrated data as a whole;
  • converting or harmonising dataset parameters, e.g. measurements from imperial to metric.
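As a minimal sketch of the harmonisation step, the Python below converts imperial measurements to metric. The record layout (`id`, `value`, `unit`), the `harmonise` helper and the units covered are assumptions made for illustration, not a prescribed schema.

```python
# Conversion factors from imperial to metric units.
# The selection of units is illustrative; extend it as your data requires.
TO_METRIC = {
    "in": ("cm", 2.54),
    "ft": ("m", 0.3048),
    "mi": ("km", 1.609344),
    "lb": ("kg", 0.45359237),
}


def harmonise(record):
    """Return a copy of the record with its value and unit expressed in metric."""
    unit = record["unit"]
    if unit not in TO_METRIC:
        # Already metric, or a unit we do not recognise: leave it unchanged.
        return dict(record)
    metric_unit, factor = TO_METRIC[unit]
    return {**record, "value": round(record["value"] * factor, 4), "unit": metric_unit}


# Hypothetical sample data mixing imperial and metric readings.
readings = [
    {"id": 1, "value": 10.0, "unit": "in"},
    {"id": 2, "value": 3.0, "unit": "m"},
    {"id": 3, "value": 2.0, "unit": "lb"},
]

harmonised = [harmonise(r) for r in readings]
```

Unrecognised units pass through untouched rather than raising an error; in a stricter pipeline you might prefer to reject them so that mixed units never reach the target system.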

Filter & aggregate, slice & dice

When confronted with massive volumes of structured data, the task has to be a reducing one: to filter out anything that is unnecessary. Aggregation is an enriching but also a reducing technique, often driven by the need to answer a business question which requires combining datasets to produce information, not just data.

Or, for BI, the business may want to look at data from a different perspective, e.g. national sales figures by product rather than branch totals. In our BI experience, filtering, aggregating or summarising should precede reporting, because these steps reduce the load on the server and make reporting faster and more accurate.
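To make the filter-then-aggregate idea concrete, here is a small Python sketch that drops unwanted rows and then totals the same sales data two ways: by branch and by product. The field names, the `status` filter and the `total_by` helper are hypothetical; a real BI stack would more likely do this in SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical sales records; field names are assumptions for illustration.
sales = [
    {"branch": "North", "product": "widget", "amount": 120.0, "status": "complete"},
    {"branch": "North", "product": "gadget", "amount": 80.0, "status": "complete"},
    {"branch": "South", "product": "widget", "amount": 200.0, "status": "cancelled"},
    {"branch": "South", "product": "widget", "amount": 50.0, "status": "complete"},
]


def total_by(rows, key):
    """Aggregate: sum the 'amount' field for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# Filter first, so that later steps only handle the rows that matter.
completed = [row for row in sales if row["status"] == "complete"]

# Slice the same filtered data from two perspectives.
by_branch = total_by(completed, "branch")
by_product = total_by(completed, "product")
```

Filtering before aggregating means every later step works on a smaller dataset, which is the load reduction described above.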

Prepare for BI, AI or ML

Before any BI, AI or ML project, use data transformation techniques to:

  • reduce data, or consolidate it (e.g. accounts);
  • speed reporting or reduce data processing loads;
  • make data more useful and valuable.

It may seem obvious, but data scientists can spend up to 80% of their time preparing data to feed into AI pipelines instead of learning from it. They have to, because AI and ML can only learn from the data they are given: garbage in, garbage out (GIGO).

Migration

Migrating to the cloud can be a frustrating process if your data is not suitably transformed before you attempt the ETL process.

Data conversion ensures that data structures are translated to a row and column structure suited to database ingestion. In addition, sorting the data (e.g. by timestamp) puts it in an order better suited to an existing database schema.
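The sketch below illustrates both steps in Python: nested records are flattened into rows with one column per field, then sorted by timestamp. The sample events, the `flatten` helper and the underscore naming of nested columns are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical nested records, e.g. as exported from a source application.
events = [
    {"id": 2, "ts": "2023-03-01T10:00:00", "customer": {"name": "Bo", "city": "Leeds"}},
    {"id": 1, "ts": "2023-01-15T09:30:00", "customer": {"name": "Al", "city": "York"}},
]


def flatten(record, prefix=""):
    """Turn a nested dict into a single flat row, one column per leaf field."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested structure: recurse, prefixing child columns with the parent name.
            row.update(flatten(value, prefix=f"{column}_"))
        else:
            row[column] = value
    return row


# Flatten every record, then order the rows by their parsed timestamp.
rows = sorted(
    (flatten(event) for event in events),
    key=lambda row: datetime.fromisoformat(row["ts"]),
)
columns = list(rows[0])
```

Parsing the timestamp before sorting avoids relying on string order, which only works when every timestamp shares exactly the same format.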

ETL (extract, transform & load) initially focuses on data format consistency, to ensure compatibility of extracted data with the target destination and any pre-existing data.

In data migration, parse first for efficiency. Cloud migrations can fail because cloud environments are more constrained than on-premise source applications, in both volume and format. Gaps in the ETL process or in the data can stall migrations but can, as a last resort, be resolved by statistical imputation. However, this largely applies to structured data.
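As one simple example of statistical imputation, the Python sketch below fills missing values in a numeric column with the mean of the observed values. Mean imputation is only one of several possible techniques, and the `impute_mean` helper and the sample column are hypothetical.

```python
from statistics import mean


def impute_mean(rows, column):
    """Return copies of the rows with missing values in `column` set to the observed mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Nothing to learn a fill value from: return the rows unchanged.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical structured data with one gap.
rows = [
    {"day": 1, "temp_c": 10.0},
    {"day": 2, "temp_c": None},
    {"day": 3, "temp_c": 14.0},
]

filled = impute_mean(rows, "temp_c")
```

Imputed values are estimates, so it is usually worth flagging which rows were filled so that downstream reports can distinguish them from observed data.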