Extract, Load, Transform?

February 28, 2020

Even as technology changes, many things stay the same, not least in the world of enterprise-scale data processing, where massive volumes of data often need to be reformatted, restructured, and moved from one place to another (for a broader introduction, you can read our Key Criteria report on data pipelines).

The Extract-Transform-Load (ETL) model emerged with traditional, data-centric architectures in which large-scale, structured databases were king. Cloud computing is changing the game, for several reasons:

  • ‘Elastic’ infrastructure (which scales according to need) reduces the constraints on what data can be stored where and when
  • Serverless compute models work on an event-driven basis, which is more conducive to (potentially infrequent or nonlinear) streaming data sources
  • Cloud-based data architectures offer greater flexibility, for example, by not requiring data to be fully structured in advance of storing it

As a result of such drivers, models are emerging that are more akin to Extract-Load-Transform, i.e., ELT, than ETL. Data can be loaded into a cloud-based data warehouse directly from the source, whether as database exports, content, or data streams, and then treated 'in situ.' While the difference in naming may not appear huge, its impact is profound.
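
To make the pattern concrete, here is a minimal sketch of the load-then-transform flow in Python, with sqlite3 standing in for a cloud data warehouse purely for illustration; the table names, columns, and sample payloads are hypothetical, and the in-database JSON functions assume SQLite's JSON1 extension (bundled with recent Python builds).

```python
import sqlite3

# Raw source records, kept exactly as they arrive (no upfront schema work).
raw_events = [
    '{"user": "alice", "amount": "19.99", "ts": "2020-02-28T10:00:00"}',
    '{"user": "bob", "amount": "5.00", "ts": "2020-02-28T10:05:00"}',
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: land the payloads untouched in a staging table.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(e,) for e in raw_events])

# Transform: shape the data in situ, inside the store, only once it is needed.
conn.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.user')                 AS user_name,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           json_extract(payload, '$.ts')                   AS ordered_at
    FROM raw_orders
""")

print(conn.execute("SELECT user_name, amount FROM orders").fetchall())
```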

First, and perhaps most obviously, it streamlines the process overall: data can be loaded in raw form even as you decide how to pre-process and manipulate it. In addition, data only needs to be transformed when and where it is required. Rather than having to process all data into a new format before it can be analyzed or used, you can apply transformations to just the subset you need. Note: this is different from schema-on-read models, which bring transformation and query closer together, but do not reduce the gap between transformation and load.
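
Continuing the same illustrative setup, the sketch below shows transformation 'according to need': only the slice relevant to a particular question is shaped, via a view evaluated at query time, and nothing else is reprocessed or materialized. As before, sqlite3 and all of the names are stand-ins, not a specific warehouse's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?)", [
    ('{"region": "EU", "amount": "19.99"}',),
    ('{"region": "US", "amount": "5.00"}',),
])

# Only the EU subset is transformed, and only when the view is queried;
# the rest of the raw data stays untouched.
conn.execute("""
    CREATE VIEW eu_revenue AS
    SELECT CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_orders
    WHERE json_extract(payload, '$.region') = 'EU'
""")

print(conn.execute("SELECT SUM(amount) FROM eu_revenue").fetchone())
```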

ELT also reduces risk. Historically, a transformation could only be checked once the data was loaded. By loading the data first, transformations can be tested in situ (likely against a sample subset). The overall result is potentially more effective data pipelines, particularly in situations where data structures are harder to lock down. You should also see higher levels of cost efficiency: rather than building multiple small bridges between data islands, you are creating a single data 'mainland' as a basis for data transformation at scale.
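
As a rough illustration of that testing step, the following sketch validates a candidate transformation against a random sample of already-loaded rows before running it across the full set; the transform logic and validation rules are invented for the example.

```python
import random

def transform(raw: dict) -> dict:
    """Candidate transformation: tidy the user field and normalize amounts to cents."""
    return {"user": raw["user"].strip().lower(),
            "amount_cents": round(float(raw["amount"]) * 100)}

def validate(row: dict) -> bool:
    # Illustrative sanity checks on the transformed output.
    return row["amount_cents"] >= 0 and bool(row["user"])

loaded_rows = [  # stand-in for rows already loaded in raw form
    {"user": " Alice ", "amount": "19.99"},
    {"user": "bob", "amount": "5.00"},
    {"user": "carol", "amount": "0.00"},
]

# Test the transformation in situ on a small sample first...
sample = random.sample(loaded_rows, k=2)
assert all(validate(transform(r)) for r in sample), "transform failed checks"

# ...then apply it at full scale once it holds up.
full_result = [transform(r) for r in loaded_rows]
print(full_result)
```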

However, these benefits come with a number of ramifications and considerations:

  • Potentially, the sky is the limit in terms of which data sources could be loaded; you will still need to decide what you actually need to store.
  • In the case of serverless data use, you can define and configure events to trigger how data is loaded and transformed (see the sketch after this list).
  • You will need to define (in advance) a cloud-based data architecture that is suited to the multiplicity of data types that you are looking to manage.
  • Compliance and security considerations remain as important as ever, with policies applying to the data pool as a whole, rather than to isolated data silos.
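
To illustrate the serverless point above, here is a minimal event-handler sketch in the style of an AWS Lambda function reacting to object-storage notifications. The event shape follows the common S3 notification format, while load_raw_file is a hypothetical stand-in for whatever ingestion call your warehouse provides.

```python
def load_raw_file(bucket: str, key: str) -> None:
    # Placeholder: in practice this would issue a COPY/LOAD statement or
    # call the warehouse's ingestion API for the newly arrived object.
    print(f"loading s3://{bucket}/{key} into the raw staging area")

def handler(event: dict, context: object = None) -> None:
    # Triggered per storage event; no pre-transformation happens here.
    # The data is simply landed, and shaped later, in situ.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        load_raw_file(bucket, key)

# Example invocation with a synthetic event, for illustration only.
handler({"Records": [{"s3": {"bucket": {"name": "raw-landing"},
                             "object": {"key": "orders/2020-02-28.json"}}}]})
```
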

Overall, ELT changes the equation. Rather than needing to apply fixed data manipulation rules according to specific requirements (as was the case with ETL), you can be much more granular about defining and applying transformations according to your needs. At the same time, you will need to consider cloud-based data architecture strategically, taking into account current and future application and analytics needs.

This brings us to a final point: as we all know well, BI and analytics are not standing still. With machine learning increasing its impact, and with new data architectures and use cases moving closer to the infrastructure edge, how we manipulate and use data will also have to evolve. Key to this evolution will be how we can manage data both flexibly and at massive scale, minimizing any bottlenecks to data delivery, however the data is to be used.
