Friday, November 12, 2010

ETL vs ELT

ETL – Extract, Transform and Load
This approach to data warehouse development is the traditional and widely accepted approach. The following diagram illustrates each of the individual stages in the process.

Data is “extracted” from the data sources (line of business applications) using a data extraction tool via whatever data connectivity is available. It is then transformed using a series of transformation routines, a process largely dictated by the data format of the output. Data quality and integrity checking is performed as part of the transformation, with corrective actions built into the process. Transformations and integrity checks are performed in the data staging area. Finally, once the data is in the target format, it is loaded into the data warehouse ready for presentation.
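The three stages can be sketched in a few lines of Python. This is a minimal illustration only, using in-memory lists in place of real source systems and a warehouse; all the names (extract_orders, transform, load) and the sample data are hypothetical.

```python
def extract_orders(source_rows):
    # Extract: pull only the fields the target output needs.
    return [{"id": r["id"], "amount": r["amount"], "region": r["region"]}
            for r in source_rows]

def transform(rows):
    # Transform: data quality and integrity checks happen here, in the
    # staging area, with corrective actions built into the routine.
    cleaned = []
    for r in rows:
        if r["amount"] is None:        # integrity check
            r["amount"] = 0.0          # corrective action
        r["region"] = r["region"].strip().upper()
        cleaned.append(r)
    return cleaned

def load(warehouse, rows):
    # Load: rows arrive already in the target format.
    warehouse.extend(rows)

source = [{"id": 1, "amount": 10.0, "region": " north "},
          {"id": 2, "amount": None, "region": "south"}]
warehouse = []
load(warehouse, transform(extract_orders(source)))
print(warehouse[1]["amount"])   # 0.0 — corrected during the transform stage
```

Note that the warehouse only ever sees data that has passed through the transformation routine, which is exactly why the stages become tightly chained together.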

The process is often designed from the output backwards: the required output is designed first, and this in turn determines exactly what data is required from the sources. The routines designed and developed to implement the process are written specifically to achieve the desired output, and only the data required for that output is included in the extraction process.

In addition, the output design must incorporate all the facts and dimensions required to present both the aggregation levels required by the BI solution and any foreseeable future requirements.
Business rules that define how aggregations are achieved, and the relationships between the various entities in both the source and the target, are designed into, and therefore coded into, the routines that implement the ETL process. This approach leads to tight dependencies between the routines at each stage of the process.
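As a hypothetical illustration of that coupling: below, the aggregation rule (totals per region and month) is baked into the transform routine, and the extract pulls only the columns that rule requires. The column names and sample rows are invented for the sketch.

```python
from collections import defaultdict

# Driven by the output design: the extract fetches only these columns.
REQUIRED_COLUMNS = ("region", "month", "amount")

def extract(rows):
    # Asking later for, say, daily totals would mean changing this stage
    # as well as the transform — the tight dependency described above.
    return [{c: r[c] for c in REQUIRED_COLUMNS} for r in rows]

def transform(rows):
    # Business rule coded directly into the routine: aggregate to the
    # (region, month) grain. A new grain requires redesigning this function.
    totals = defaultdict(float)
    for r in rows:
        totals[(r["region"], r["month"])] += r["amount"]
    return totals

rows = [{"region": "N", "month": "2010-11", "day": 12, "amount": 5.0},
        {"region": "N", "month": "2010-11", "day": 13, "amount": 7.0}]
print(transform(extract(rows))[("N", "2010-11")])   # 12.0
```

Because `extract` discards the `day` column, a future requirement for daily aggregation cannot be met without reworking both routines.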

Although there are effectively three stages, the design is often characterized as a single monolithic process, since the target is predefined and clear.
In addition to the tools provided by the database manufacturers, a large number of tools available on the market implement this approach and provide a solid platform, productivity, and flexibility.

STRENGTHS
Development Time:
Designing from the output backwards ensures that only data relevant to the solution is extracted and processed, potentially reducing development, extraction, and processing overhead, and therefore time.
Targeted Data:
Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
Administration Overhead:
Reduced warehouse content simplifies the security regime to be implemented and hence the administration overhead.
Tools Availability: 
The large number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. However, this proliferation has led to a competitive functionality war, which often results in a loss of maintainability.

WEAKNESSES
Flexibility: 
Targeting only the data relevant to the output means that any future requirement needing data not included in the original design must be added to the ETL routines. Because of the tight dependencies between the routines developed, this often forces a fundamental redesign and redevelopment, increasing the time and costs involved.
Hardware: 
Most third-party tools use their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool’s ETL engine.
Skills Investment:
The use of third-party tools to implement ETL processes compels the learning of new scripting languages and processes.
Learning Curve: 
Implementing a third-party tool that uses unfamiliar processes and languages incurs the learning curve implicit in any technology new to an organization, and lack of experience can often lead down blind alleys in its use.

ELT – Extract, Load and Transform
Whilst this approach to the implementation of a warehouse appears on the surface to be similar to ETL, it differs in a number of significant ways. The following diagram illustrates the process.

Data is “extracted” from the data sources (line of business applications) into the “Staging Database” using a data extraction tool via whatever data connectivity is available. Whilst in the staging area, integrity and business-rule checks can be applied, and relevant corrections can be made. The source data is then loaded into the warehouse. In effect, this provides a validated and cleaned offline copy of the source data in the data warehouse. Once in the warehouse, transformations are performed to re-shape the data into its target output format.
The extract and load process can be isolated from the transformation process. 
This has a number of benefits:

  • Isolating the load process from the transformation process removes an inherent dependency between these stages. In addition to including the data necessary for the transformations, the extract and load process can include elements of data that may be required in the future. Indeed, the load process could take the entire source and load it into the warehouse.
  • Separating the processes enables the project to be broken down into smaller chunks, thus making it more predictable and manageable.
  • Performing the data integrity checks in the staging area enables a further stage in the process to be isolated and dealt with at the most appropriate point in the process. This approach also helps to ensure that only cleaned and checked data is loaded into the warehouse for transformation.
  • Isolating the transformations from the load process helps to encourage a more staged approach to the warehouse design and implementation. This embraces the ongoing changing nature of the warehouse build.
  • Tools are available that inherently implement the ELT process. However, their availability is sparser since this is a more emergent approach to warehouse process design.
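The ELT flow described above can be sketched using sqlite3 from the Python standard library as a stand-in database engine. This is illustrative only: the table names, columns, and sample data are all invented, but the shape of the process — load raw data into staging, check and correct it there, copy it into the warehouse, then transform inside the engine — follows the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Extract & Load: copy the source rows into a staging table as-is.
db.execute("CREATE TABLE staging (id INTEGER, amount REAL, region TEXT)")
source = [(1, 10.0, " north "), (2, None, "south"), (3, 7.5, "north")]
db.executemany("INSERT INTO staging VALUES (?, ?, ?)", source)

# Integrity and business-rule checks run in the staging area; here a
# corrective action replaces missing amounts and normalizes region codes.
db.execute("UPDATE staging SET amount = 0.0 WHERE amount IS NULL")
db.execute("UPDATE staging SET region = UPPER(TRIM(region))")

# The validated, cleaned copy of the source is loaded into the warehouse...
db.execute("CREATE TABLE warehouse AS SELECT * FROM staging")

# ...and the Transform step runs inside the engine afterwards, re-shaping
# the data into its target output format (a regional summary, here).
db.execute("""CREATE TABLE sales_by_region AS
              SELECT region, SUM(amount) AS total
              FROM warehouse GROUP BY region""")

for region, total in db.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"):
    print(region, total)
```

Because the warehouse holds the full cleaned copy of the source, a future summary at a different grain only requires a new transform query, with no change to the extract and load steps.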

STRENGTHS
Project Management: 
Being able to split the warehouse process into specific, isolated tasks enables the project to be designed around smaller units of work, and therefore broken down into manageable chunks.
Flexible & Future Proof: 
In general, in an ELT implementation all data from the sources is loaded into the warehouse as part of the extract and load process. This, combined with the isolation of the transformation process, means that future requirements can easily be incorporated into the warehouse structure.
Risk minimization: 
Removing the close interdependencies between the stages of the warehouse build means that each process can be designed, developed, and maintained in isolation. This provides an excellent platform for change, maintenance, and management.
Utilize Existing Hardware:
When implementing ELT as a warehouse build process, the tools provided with the database engine itself can be used. Alternatively, the vast majority of third-party ELT tools exploit the database engine’s own capabilities, so the ELT process runs on the same hardware that underpins the data warehouse, making use of the equipment already deployed.
Utilize Existing Skill sets: 
By using the functionality provided by the database engine, the existing investments in database skills are re-used to develop the warehouse. No new skills need be learned and the full weight of the experience in developing the engine’s technology is utilized, further reducing the cost and risk in the development process.

WEAKNESSES
Against the Norm:
ELT is an emergent approach to data warehouse design and development. Whilst it has proven itself many times over through abundant use in implementations throughout the world, it does require a change in mentality and design approach compared with traditional methods. Though some principles hold true across all approaches, ELT differs in many ways, and getting the best from it requires an open mind.
Tools Availability:
Being an emergent technology approach, ELT suffers from a limited availability of tools.