Monday, December 22, 2008

Data Migration

Introduction:

Data migration is the process of transferring data between storage types, formats, or computer systems.
Data migration is usually performed programmatically to achieve an automated migration, freeing up human resources from tedious tasks.
It is required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge (such as when the organizations that use them undergo a merger/takeover).

Data Migration Procedure:
To achieve an effective data migration procedure, data on the old system is mapped to the new system, providing a design for data extraction and data loading.
Programmatic data migration may involve many phases, but it minimally includes data extraction, where data is read from the old system, and data loading, where data is written to the new system.
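As a rough illustration, the sketch below reads customer rows from a hypothetical old SQLite database and writes them into a new one with renamed columns. The file names, table, and column mapping are assumptions for the example; a real migration would add transformation and error handling between the two steps.

import sqlite3

def migrate(old_path="legacy.db", new_path="target.db"):
    src = sqlite3.connect(old_path)
    dst = sqlite3.connect(new_path)
    # Extract: read the data from the old system.
    rows = src.execute("SELECT cust_name, cust_tel FROM customers").fetchall()
    # Load: write the data into the new system, mapping the old column names
    # (cust_name, cust_tel) to the new ones (customer_name, phone_number).
    dst.execute("CREATE TABLE IF NOT EXISTS customers "
                "(customer_name TEXT, phone_number TEXT)")
    dst.executemany("INSERT INTO customers (customer_name, phone_number) "
                    "VALUES (?, ?)", rows)
    dst.commit()
    src.close()
    dst.close()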
Using ETL:
Extract, Transform, and Load (ETL) is a process in data warehousing that involves
Extracting data from outside sources,
Transforming it to fit business needs (which can include quality levels), and ultimately
Loading it into the end target, i.e. the data warehouse.
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM).
An intrinsic part of the extraction is parsing the extracted data to check whether it meets an expected pattern or structure. If it does not, the data may be rejected entirely.
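A minimal sketch of such an extraction, assuming a hypothetical flat file "orders.csv" with three fields per row (order_id, quantity, unit_price); rows that do not match the expected structure are collected as rejects rather than passed on.

import csv

def extract(path="orders.csv"):
    good, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Check the row against the expected pattern: 3 fields, numeric quantity/price.
            if len(row) == 3 and row[1].isdigit():
                try:
                    good.append({"order_id": row[0],
                                 "qty": int(row[1]),
                                 "unit_price": float(row[2])})
                    continue
                except ValueError:
                    pass
            rejected.append(row)  # does not meet the expected structure
    return good, rejected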
Transform:
The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded into the end target. Some data sources require very little or even no manipulation of the data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target (a short sketch after the list illustrates a few of them).
Selecting only certain columns to load
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the destination stores M for male and F for female); this is called automated data cleansing, and no manual cleansing occurs during ETL
Encoding free-form values (e.g., mapping "Male" and "Mr" to "M")
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Filtering
Sorting
Aggregation (for example, Rollup - summarizing multiple rows of data - total sales for each store, and for each region, etc.)
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in separate columns)
Applying any form of simple or complex data validation; if validation fails, the data may be rejected in full, in part, or not at all, so that none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations can themselves result in an exception, e.g. when a code translation encounters an unknown code in the extracted data.
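The following sketch illustrates three of the transformation types above: translating coded values, deriving a calculated value, and filtering. The gender codes and field names are illustrative assumptions rather than part of any specific system.

# Source code -> destination code, as in the coded-values example above.
GENDER_CODES = {"1": "M", "2": "F"}

def transform(records):
    out = []
    for rec in records:
        code = rec["gender"]
        if code not in GENDER_CODES:
            # A code translation that meets an unknown code raises an exception.
            raise ValueError(f"unknown gender code: {code!r}")
        out.append({
            "gender": GENDER_CODES[code],                   # translate coded value
            "sale_amount": rec["qty"] * rec["unit_price"],  # derive calculated value
        })
    # Filtering: keep only rows with a non-zero sale amount.
    return [r for r in out if r["sale_amount"] > 0]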
Load:
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses might overwrite existing information weekly with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly. The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded into the DW.
As the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply (e.g. uniqueness, referential integrity, mandatory fields), which also contributes to the overall data quality performance of the ETL process.
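A minimal sketch of an append-style load, using SQLite purely for illustration; the warehouse file and fact table are assumptions. The primary key constraint declared in the schema is enforced by the database during the load, as described above.

import sqlite3

def load(records, db_path="warehouse.db"):
    dw = sqlite3.connect(db_path)
    dw.execute("""CREATE TABLE IF NOT EXISTS fact_sales (
                      order_id TEXT PRIMARY KEY,  -- uniqueness enforced on load
                      gender TEXT,
                      sale_amount REAL)""")
    dw.executemany(
        "INSERT INTO fact_sales VALUES (:order_id, :gender, :sale_amount)",
        records)  # each record is a dict with these three keys
    dw.commit()
    dw.close()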
Real Life ETL Cycle
The typical real-life ETL cycle consists of the following execution steps (a small orchestration sketch follows the list):
Cycle initiation
Build reference data
Extract (from sources)
Validate
Transform (clean, apply business rules, check for data integrity, create aggregates)
Stage (load into staging tables - if they are used)
Audit reports (are business rules met? In case of failure, these also help to diagnose and repair).
Publish (to target tables)
Archive, clean up.
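The sketch below strings such steps together as plain functions tagged with a run_id; the step names mirror the list and the bodies are placeholders, so this is only an orchestration skeleton, not a full implementation.

import datetime
import uuid

def run_cycle(steps):
    run_id = uuid.uuid4().hex  # cycle initiation: a fresh identifier per run
    for name, step in steps:
        started = datetime.datetime.now()
        step(run_id)           # build reference data / extract / validate / ...
        elapsed = datetime.datetime.now() - started
        print(f"[{run_id}] {name} finished in {elapsed}")

if __name__ == "__main__":
    noop = lambda run_id: None  # placeholder step bodies
    run_cycle([("build reference data", noop), ("extract", noop),
               ("validate", noop), ("transform", noop), ("stage", noop),
               ("audit reports", noop), ("publish", noop),
               ("archive / clean up", noop)])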
Challenges
ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications. This will lead to an amendment of validation rules explicitly and implicitly implemented in the ETL process.
A DW is typically fed asynchronously by a variety of sources that serve different purposes, resulting in, for example, different reference data. ETL is a key process for bringing heterogeneous and asynchronous source extracts into a homogeneous environment.
The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture for continuous transformation and update.
Performance
ETL vendors benchmark their record systems at multiple terabytes (TB) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and plenty of memory.
In real life, the slowest part of an ETL process is usually the database load phase. The database is slow because it has to take care of concurrency, integrity maintenance, and indexes. Thus, for better performance, it makes sense to do most of the ETL processing outside of the database and to use bulk load operations whenever possible. Still, even with bulk operations, database access is usually the bottleneck in the ETL process. Here are some common tricks used to increase performance (a small sketch of the index-handling trick appears after them):
Partition tables (and indexes). Try to keep partitions similar in size (watch for "null" values, which can skew the partitioning).
Do all validation in the ETL layer before the load. Disable integrity checking (disable constraint ...) in the target database tables during the load.
Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.
Generate IDs in the ETL layer (not in the database).
Drop the indexes (on a table or partition) before the load - and recreate them after the load (drop index ...; create index ...).
Use parallel bulk load when possible - it works well when the table is partitioned or there are no indexes. Note: attempting parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indexes.
If you need to do inserts, updates, and deletes, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately. You can often do a bulk load for inserts, but updates and deletes commonly go through an API (using SQL).
Whether to do certain operations in the database or outside may involve a tradeoff. For example, removing duplicates using "distinct" may be slow in the database; thus it makes sense to do it outside. On the other hand, if using "distinct" significantly (e.g., 100x) decreases the number of rows to be extracted, then it makes sense to do the de-duplication as early as possible - in the database, before unloading the data.
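The drop-and-recreate index trick from the list can be sketched as follows, using SQLite syntax only for illustration; the table, index name, and row shape are assumptions, and constraint or trigger disabling is vendor-specific, so it is omitted here.

import sqlite3

def bulk_load(rows, db_path="warehouse.db"):
    dw = sqlite3.connect(db_path)
    # Drop the index before the load so the bulk insert does not maintain it.
    dw.execute("DROP INDEX IF EXISTS idx_fact_sales_gender")
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    # Recreate the index after the load.
    dw.execute("CREATE INDEX idx_fact_sales_gender ON fact_sales (gender)")
    dw.commit()
    dw.close()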
A common source of problems in ETL is a large number of interdependencies between ETL jobs. For example, job "B" cannot start while job "A" is not finished. You can usually achieve better performance by visualizing all processes on a graph and trying to reduce the graph, making maximum use of parallelism and making the "chains" of consecutive processing as short as possible. Again, partitioning big tables and their indexes can really help.
Another common example is a situation where the data is spread between several databases and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases, and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers:
Sources
Central ETL layer
Targets
This allows taking maximum advantage of parallel processing. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first and then replicating into the second).
Of course, sometimes sequential processing is required. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for the main "fact" tables.
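A minimal sketch of the parallel-targets idea, loading the same rows into two target databases concurrently rather than loading one and replicating to the other; the target names and the load_into body are placeholders.

from concurrent.futures import ThreadPoolExecutor

def load_into(target, rows):
    # Placeholder for a real load into one target database.
    print(f"loading {len(rows)} rows into {target}")

def publish(rows, targets=("dw_primary", "dw_reporting")):
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(load_into, target, rows) for target in targets]
        for future in futures:
            future.result()  # propagate any load failure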

Parallel processing
A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.
There are three main types of parallelism as implemented in ETL applications:
Data: By splitting a single sequential file into smaller data files to provide parallel access.
Pipeline: Allowing the simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.
Component: The simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a de-duplication on another file would be an example of component parallelism.
All three types of parallelism are usually combined in a single job.
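A minimal sketch of the first type, data parallelism: a single input file is split into chunks and the chunks are transformed concurrently. The per-row transform is a placeholder; pipeline and component parallelism are not shown.

from concurrent.futures import ProcessPoolExecutor

def transform_chunk(lines):
    # Placeholder per-row transform applied to one chunk of the file.
    return [line.upper() for line in lines]

def parallel_transform(path, chunk_size=10_000, workers=4):
    with open(path) as f:
        lines = f.readlines()
    # Split the single sequential file into smaller pieces for parallel access.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_chunk, chunks)
    return [row for chunk in results for row in chunk]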
An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points is necessary.
Rerunnability, Recoverability
A big ETL process is usually subdivided into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece. It is also a good idea to have "checkpoints": states reached when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, etc.
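A minimal sketch of checkpointing under these assumptions: completed phases are recorded in a per-run_id checkpoint file (a hypothetical naming scheme), so a failed run can be restarted from the last checkpoint instead of from the beginning.

import json
import os

def completed_phases(run_id):
    path = f"checkpoint_{run_id}.json"
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def run_with_checkpoints(run_id, phases):
    done = completed_phases(run_id)
    for name, phase in phases:
        if name in done:
            continue  # already completed in an earlier attempt, skip on rerun
        phase(run_id)
        done.add(name)
        with open(f"checkpoint_{run_id}.json", "w") as f:
            json.dump(sorted(done), f)  # persist state at the checkpoint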
Good Practices
Four Layered Approach for ETL Architecture Design
Functional Layer – Core functional ETL processing (Extract, Transform, and Load).
Operational Management Layer – Job Stream Definition & Management, Parameters, Scheduling, Monitoring, Communication & Alerting.
Audit, Balance and Control (ABC) Layer – Job Execution Statistics, Balancing & Controls, Rejects & Error Handling, Codes Management.
Utility Layer – Common components supporting all other layers.
Use file-based ETL processing where possible
Storage is relatively inexpensive
Intermediate files serve multiple purposes
Used for testing and debugging
Used for restart and recover processing
Used to calculate control statistics
Helps to reduce dependencies - enables modular programming.
Allows flexibility for job execution & scheduling
Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.
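A small sketch of file-based processing, assuming newline-delimited JSON intermediate files with hypothetical names: each step reads the previous step's file and writes its own, so any step can be rerun, tested, or inspected in isolation.

import json

def run_step(step, in_path, out_path):
    # Read the intermediate file produced by the previous step.
    with open(in_path) as f:
        records = [json.loads(line) for line in f]
    results = step(records)
    # Write this step's intermediate file for the next step (and for restarts).
    with open(out_path, "w") as f:
        for rec in results:
            f.write(json.dumps(rec) + "\n")
    return out_path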
Use data driven methods and minimize custom ETL coding
Parameter driven jobs, functions, and job control
Code definitions & mapping in database
Consideration for data driven tables to support more complex code mappings and business rule application.
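A minimal sketch of a data-driven code mapping: the translation rules live in a database table rather than in the ETL code, so adding a new mapping requires no code change. The code_map table, its columns, and the domain values are illustrative assumptions.

import sqlite3

def load_code_map(db_path, domain):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT source_code, target_code FROM code_map WHERE domain = ?",
        (domain,)).fetchall()
    con.close()
    return dict(rows)

def apply_code_map(records, field, mapping):
    # Translate the given field using the mapping; unknown codes pass through.
    return [{**rec, field: mapping.get(rec[field], rec[field])} for rec in records]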
Qualities of a good ETL architecture design
Performance
Scalable
Migratable
Recoverable (run_id, ...)
Operable (completion codes for phases, rerunning from checkpoints, etc.)
Auditable (in 2 dimensions: business requirements & technical troubleshooting)
