
Monday, December 22, 2008

Data Migration


Data migration is the process of transferring data between storage types, formats, or computer systems.
Data migration is usually performed programmatically to achieve an automated migration, freeing up human resources from tedious tasks.
It is required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge (such as when the organizations that use them undergo a merger/takeover).

Data Migration Procedure:
To achieve an effective data migration procedure, data on the old system is mapped to the new system providing a design for data extraction and data loading.
Programmatic data migration may involve many phases but it minimally includes data extraction where data is read from the old system and data loading where data is written to the new system.
Using ETL:
Extract, Transform, and Load (ETL) is a process in data warehousing that involves
Extracting data from outside sources,
Transforming it to fit business needs (which can include quality levels), and ultimately
Loading it into the end target, i.e. the data warehouse.
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM).
An intrinsic part of the extraction is parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely.
The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded into the end target. Some data sources require very little or even no manipulation of the data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
Selecting only certain columns to load
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the destination stores M for male and F for female) - this is called automated data cleansing; no manual cleansing occurs during ETL
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to M)
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Aggregation (for example, Rollup - summarizing multiple rows of data - total sales for each store, and for each region, etc.)
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
Applying any form of simple or complex data validation; if validation fails, there may be a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations can themselves result in an exception, e.g. when a code translation encounters an unknown code in the extracted data.
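Two of the transformation types above (code translation and deriving a calculated value) can be sketched in a few lines of Java. The code table and column names here are hypothetical, invented purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class TransformSketch {
    // Hypothetical code table: source stores 1/2, destination wants M/F.
    private static final Map<String, String> GENDER_CODES = new HashMap<>();
    static {
        GENDER_CODES.put("1", "M");
        GENDER_CODES.put("2", "F");
    }

    // Translating coded values; an unknown code raises an exception,
    // mirroring the rejection behaviour described in the text.
    public static String translateGender(String sourceCode) {
        String target = GENDER_CODES.get(sourceCode);
        if (target == null) {
            throw new IllegalArgumentException("Unknown gender code: " + sourceCode);
        }
        return target;
    }

    // Deriving a new calculated value: sale_amount = qty * unit_price.
    public static double saleAmount(int qty, double unitPrice) {
        return qty * unitPrice;
    }

    public static void main(String[] args) {
        System.out.println(translateGender("1"));  // M
        System.out.println(saleAmount(3, 2.50));   // 7.5
    }
}
```

In a real ETL tool these rules would be configured rather than hand-coded, but the shape of the logic is the same.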
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process ranges widely. Some data warehouses might weekly overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in historized form, e.g. hourly. The timing and scope of replacing or appending are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded into the DW.
As the load phase interacts with a database, the constraints defined in the database schema as well as in triggers activated upon data load apply (e.g. uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.
Real Life ETL Cycle
The typical real-life ETL cycle consists of the following execution steps:
Cycle initiation
Build reference data
Extract (from sources)
Transform (clean, apply business rules, check for data integrity, create aggregates)
Stage (load into staging tables - if they are used)
Audit reports (Are business rules met? Also in case of failure - helps to diagnose/repair).
Publish (to target tables)
Archive, clean up.
ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems.
The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications. This will lead to an amendment of validation rules explicitly and implicitly implemented in the ETL process.
A DW is typically fed asynchronously by a variety of sources, which all serve a different purpose, resulting in, for example, different reference data. ETL is a key process for bringing heterogeneous and asynchronous source extracts into a homogeneous environment.
The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture for continuous transformation and update.
ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and lots of memory.
In real life the slowest part of an ETL process is usually the database load phase. The database is slow because it has to take care of concurrency, integrity maintenance, and indexes. Thus, for better performance, it makes sense to do most of the ETL processing outside of the database and to use bulk load operations whenever possible. Still, even with bulk operations, database access is usually the bottleneck in the ETL process. Here are some common tricks used to increase performance:
Partition tables (and indexes). Try to keep partitions similar in size (watch for "null" values, which can skew the partitioning).
Do all validation in ETL layer before the load. Disable integrity checking (disable constraint ...) in the target database tables during the load.
Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.
Generate IDs in the ETL layer (not in the database).
Drop the indexes (on a table or partition) before the load - and recreate them after the load (drop index ...; create index ...).
Use parallel bulk load when possible - it works well when the table is partitioned or there are no indexes. Note: attempting to do parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on indexes.
If you need to do insert/update/delete - find out which rows should be processed in which way in the ETL layer - and then process these 3 operations in the database separately. You often can do bulk load for inserts, but updates and deletes commonly go through API (using SQL).
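The routing step described above - deciding in the ETL layer which rows are inserts, updates, or deletes before touching the database - can be sketched as follows. The key sets and operation names are simplified placeholders, not any particular tool's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ChangeRouter {
    // Compare source keys against keys already present in the target and
    // decide, per row, whether it is an insert, an update, or (for keys
    // that disappeared from the source) a delete.
    public static Map<String, List<String>> route(Set<String> sourceKeys,
                                                  Set<String> targetKeys) {
        Map<String, List<String>> ops = new HashMap<>();
        ops.put("insert", new ArrayList<>());
        ops.put("update", new ArrayList<>());
        ops.put("delete", new ArrayList<>());
        for (String key : sourceKeys) {
            ops.get(targetKeys.contains(key) ? "update" : "insert").add(key);
        }
        for (String key : targetKeys) {
            if (!sourceKeys.contains(key)) {
                ops.get("delete").add(key);
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Set<String> source = new TreeSet<>(Arrays.asList("a", "b", "c"));
        Set<String> target = new TreeSet<>(Arrays.asList("b", "c", "d"));
        // inserts: [a], updates: [b, c], deletes: [d]
        System.out.println(route(source, target));
    }
}
```

With the rows pre-sorted into these three buckets, the inserts can go through a bulk load while updates and deletes go through SQL, as the text suggests.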
Whether to do certain operations in the database or outside may be a tradeoff. For example, removing duplicates using "distinct" may be slow in the database - thus it makes sense to do it outside. On the other hand, if using distinct significantly (say, 100x) decreases the number of rows to be extracted, then it makes sense to do de-duplication as early as possible - in the database, before unloading the data.
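De-duplication outside the database, as suggested above, is a one-liner in most languages; a minimal Java sketch (the row values are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSketch {
    // Remove duplicate rows while preserving the original order -
    // the ETL-layer equivalent of SQL "distinct".
    public static List<String> distinct(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("a", "b", "a", "c")));
        // [a, b, c]
    }
}
```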
A common source of problems in ETL is a large number of interdependencies between ETL jobs. For example, job "B" cannot start while job "A" is not finished. You can usually achieve better performance by visualizing all processes on a graph, trying to reduce the graph by making maximum use of parallelism, and making "chains" of consecutive processing as short as possible. Again, partitioning of big tables and of their indexes can really help.
Another common example is a situation where the data is spread between several databases, and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases - and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only 3 layers:
Sources layer
Central ETL layer
Targets layer
This allows taking maximum advantage of parallel processing. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first and then replicating into the second).
Of course, sometimes sequential processing is required. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for the main "fact" tables.

Parallel processing
A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.
There are 3 main types of parallelisms as implemented in ETL applications:
Data: By splitting a single sequential file into smaller data files to provide parallel access.
Pipeline: Allowing the simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.
Component: The simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a de-duplication on another file would be an example of component parallelism.
All three types of parallelism are usually combined in a single job.
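Component parallelism, for instance, can be sketched with two plain Java threads - one sorts an input while the other de-duplicates a second input, in the same job. The input data here is invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

public class ComponentParallelism {
    // Run two independent "components" of the same job concurrently:
    // sort one input while de-duplicating another.
    public static List<List<String>> run(List<String> fileA, List<String> fileB) {
        List<String> sorted = new ArrayList<>(fileA);
        List<String> deduped = Collections.synchronizedList(new ArrayList<>());

        Thread sorter = new Thread(() -> Collections.sort(sorted));
        Thread deduper = new Thread(() -> deduped.addAll(new LinkedHashSet<>(fileB)));
        sorter.start();
        deduper.start();
        try {
            sorter.join();
            deduper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return Arrays.asList(sorted, new ArrayList<>(deduped));
    }

    public static void main(String[] args) {
        List<List<String>> out = run(Arrays.asList("c", "a", "b"),
                                     Arrays.asList("x", "x", "y"));
        System.out.println(out.get(0)); // [a, b, c]
        System.out.println(out.get(1)); // [x, y]
    }
}
```

Real ETL engines manage this scheduling for you; the point is only that the two components share no data, so they can run at the same time.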
An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points is necessary.
Rerunnability, Recoverability
A big ETL process is usually subdivided into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs helps to roll back and re-run the failed piece. It is also a good idea to have "checkpoints" - states at which certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
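A minimal sketch of this tagging, with invented field names and an invented row layout:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunTagging {
    // Tag every row with the run that produced it and a per-run row
    // counter, so a failed piece can be found, rolled back, and re-run
    // by its run_id.
    public static List<String> tagRows(List<String> rows, int runId) {
        List<String> tagged = new ArrayList<>();
        int rowId = 0;
        for (String row : rows) {
            tagged.add("run_id=" + runId + ",row_id=" + (rowId++) + "," + row);
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(tagRows(Arrays.asList("alice", "bob"), 7));
        // [run_id=7,row_id=0,alice, run_id=7,row_id=1,bob]
    }
}
```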
Good Practices
Four Layered Approach for ETL Architecture Design
Functional Layer – Core functional ETL processing (Extract, Transform, and Load).
Operational Management Layer – Job Stream Definition & Management, Parameters, Scheduling, Monitoring, Communication & Alerting.
Audit, Balance and Control (ABC) Layer – Job Execution Statistics, Balancing & Controls, Rejects & Error Handling, Codes Management.
Utility Layer – Common components supporting all other layers.
Use file-based ETL processing where possible
Storage is relatively inexpensive
Intermediate files serve multiple purposes
Used for testing and debugging
Used for restart and recover processing
Used to calculate control statistics
Helps to reduce dependencies - enables modular programming.
Allows flexibility for job execution & scheduling
Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.
Use data driven methods and minimize custom ETL coding
Parameter driven jobs, functions, and job control
Code definitions & mapping in database
Consideration for data driven tables to support more complex code mappings and business rule application.
Qualities of a good ETL architecture design
Recoverable (run_id, ...)
Operable (completion codes for phases, rerunning from checkpoints, etc.)
Auditable (in 2 dimensions: business requirements & technical troubleshooting)

Friday, December 12, 2008


Ajax Documentation:

Ajax definition:

  1. AJAX stands for Asynchronous JavaScript and XML.
  2. It's a web development technique for creating interactive web-based applications that can asynchronously interchange data between client and server.

Definition 2:

AJAX is a technique rather than a technology: it describes how JavaScript can be used to pull data from the server using the XMLHttpRequest object and then insert this data into the website using the DOM. This is done asynchronously - that is, in the background, without having to refresh the whole page.

DOM (Document object Model):

  1. The DOM is an internal representation of the website.
  2. The DOM is accessible from JavaScript and provides a way to programmatically insert, remove, and modify elements in a website.

Example program with DOM:

1. The example below shows how to manipulate a DOM element from JavaScript.

A div element is also a node in the DOM (Document Object Model).

Code: dom.html

<html>
<head>
<script type="text/javascript">
function replaceText() {
    document.getElementById('foo').innerHTML = 'Hello, AJAX world!';
}
</script>
</head>
<body>
<div id="foo">Hello, world!</div>
<input type="button" value="Replace Text" onclick="replaceText()" />
</body>
</html>

The above example shows how to set the content of the div element with the id "foo" to "Hello, AJAX world!".

Before the button is clicked, the page shows:

Replace Text

Hello, world!

After clicking Replace Text:

Replace Text

Hello, AJAX world!

Example 2:

1. Check whether the username is available in the database. If the username is available, print the message "username available". If it is not available (i.e., the same username already exists in the DB), print the message "username not available".

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>
<html>
<head>
<title>Insert title here</title>
<script type="text/javascript">
var http = false;
if (navigator.appName == "Microsoft Internet Explorer") {
    http = new ActiveXObject("Microsoft.XMLHTTP");
} else {
    http = new XMLHttpRequest();
}
function validate(user) {
    http.open("GET", "validate.jsp?name=" + user, true);
    http.onreadystatechange = function() {
        if (http.readyState == 4) {
            document.getElementById('foo').innerHTML = http.responseText;
        }
    };
    http.send(null);
}
</script>
</head>
<body>
Please choose your username:
<input type="text" onkeyup="validate(this.value)" />
<div id="foo"></div>
</body>
</html>

var http = false; initializes the reference variable that will hold the browser's request object.

The browser check decides which object to create. If the browser is Microsoft Internet Explorer we have to use

http = new ActiveXObject("Microsoft.XMLHTTP");

Otherwise we should use

http = new XMLHttpRequest();

The validate() JavaScript function is responsible for calling another JSP file and checking whether the username is available in the DB.

The call

http.open("GET", "validate.jsp?name=" + user, true);

specifies the request method (POST or GET) and the URL.

In the http.onreadystatechange handler we check whether the JSP has finished processing and the response is ready:

if (http.readyState == 4) {
    document.getElementById('foo').innerHTML = http.responseText;
}

The ready states range from 0 to 4; ready state 4 indicates that all processing has completed on the server, so we can display the result in the corresponding DOM element.

The line

document.getElementById('foo').innerHTML = http.responseText;

sets the response into the foo div tag.


validate.jsp is a normal JSP page:

<%@ page import="java.util.*" %>
<%
String user = "";
user = request.getParameter("name");
System.out.println("user=" + new Date() + user);
if (user.equals("test")) {
    out.println("name is not available");
} else {
    out.println("name is available");
}
%>

Whatever you print with out.println() is returned as the response content on the client side (http.responseText).

Sunday, November 30, 2008

Java Interview Questions.

Hi Friends,
This is my first posting. These questions were all asked in one company's interview. I don't want to mention the name of the company.

If you have come across my blog, read these questions and send me your feedback on them. Mail to me:

This question-and-answer format is entirely different from all other formats:
1. Precise information.
2. Clear format.
3. Explanations of "what?" and "why?" type questions.
4. Differences explained in tabular format.

Thanks & Regards,
Let's enjoy::

1. What are jar, war and ear?


· Jar stands for Java Archive.

· This is a file format that enables us to bundle multiple files into a single archive file.

· A jar file contains a file inside the META-INF folder that describes the version and other features of the jar file.

· War stands for Web Archive.

· It's a compressed version of a web application.

· We can use a war file to deploy a web application.

· Ear stands for Enterprise Archive.

· A J2EE application with all of its modules (war and jar files) is delivered in an EAR file.

2. What are abstract class and interface? Give similarities and differences between abstract class and interface.

Definition - Abstract class:

· An abstract class is an incomplete class.

· An abstract class is defined with the keyword abstract.

· We can't create an object of an abstract class, because it is an incomplete class.

· It sets the behavioral protocol for its child classes.

Definition - Interface:

· An interface is a collection of method declarations and constants.

· In Java, interfaces are used to achieve multiple inheritance.

· It sets the behavioral protocol for its implementing classes.

Similarity:

· Both set a behavioral protocol for the classes that extend or implement them.



Abstract class | Interface

It can have concrete methods. | Concrete methods are not allowed.

It can have private and protected methods and variables. | Only public methods and constants are allowed.

A class can extend only one abstract class. | A class can implement more than one interface.

Tight coupling. | Loose coupling.
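The contrast between abstract classes and interfaces can be seen in a short example; the class and method names here are invented purely for illustration:

```java
// An interface: only constants and abstract methods (pre-Java-8).
interface Payable {
    double RATE = 1.0;          // implicitly public static final
    double pay(int hours);      // implicitly public abstract
}

// An abstract class: may mix concrete and abstract members,
// including protected state - not possible in an interface.
abstract class Worker implements Payable {
    protected String name;
    Worker(String name) { this.name = name; }
    public String describe() {  // a concrete method
        return name + " earns " + pay(8) + " for 8 hours";
    }
}

// Only a concrete subclass can be instantiated;
// `new Worker("x")` would not compile.
public class Clerk extends Worker {
    Clerk(String name) { super(name); }
    public double pay(int hours) { return hours * RATE; }

    public static void main(String[] args) {
        System.out.println(new Clerk("Ann").describe());
        // Ann earns 8.0 for 8 hours
    }
}
```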

3. What are the different access levels for instance variables, methods and classes?

Public: accessible from anywhere.

Protected: accessible from the same package and from subclasses.

Default: accessible from the same package.

Private: accessible only within the class.

4. What is a deployment descriptor?

· It's an XML file.

· It describes a component's deployment settings.

· A J2EE application and each of its modules has its own deployment descriptor.

· Deployment descriptor information is declarative, so it can be changed without modifying the bean source code.

· At run time, the J2EE server reads the deployment descriptor and acts upon the component accordingly.

· For example, an enterprise bean module's deployment descriptor declares transaction attributes and security authorizations for an enterprise bean.

5. What is the difference between forward and sendRedirect?

Forward | sendRedirect

Forward sends the request to another resource on the same server without the user knowing it. | The web container tells the browser that a new URL should be requested.

The process completes entirely within the web container. | The request is treated as a new request.

Objects stored in request attributes are forwarded to the next resource. | A new request object is generated.

Forward is the faster approach. | The extra round trip makes sendRedirect slower.

6. JSP lifecycle methods:

There are three lifecycle methods:

· jspInit() – The container calls this method to initialize the servlet instance. It is called before any other method, and is called only once per servlet instance.

· _jspService() – The container calls _jspService() for each request, passing it the request and response objects.

· jspDestroy() – The container calls this method when it takes the instance out of service. It is the last method called on the servlet instance.

7. What is the difference between <jsp:include page="" /> and <%@ include file="" %>?

<jsp:include page="" /> | <%@ include file="" %>

This is like a function call from one JSP to another JSP. | The content of the included file is textually embedded in the page that has the <%@ include file=".." %> directive.

It is executed each time the page is accessed by the client. | It is executed only once.

If the included file changes, the new content will be included in the output. | If the included file changes, the changed content will not be included in the output.

This approach is useful for modularizing the web application. | This approach is used when the code from one JSP file needs to be included in multiple JSP files.

8. What is the difference between <jsp:forward page="" /> and response.sendRedirect(URL)?

<jsp:forward page="" /> | response.sendRedirect(URL)

Forwards the request object, containing the client request information, from one JSP file to another. | Sends an HTTP temporary redirect response to the browser, and the browser creates a new request to go to the redirected page.

It maintains the request data across the forward. | Because a new request is created, the data stored in the original request is lost.

9. What is the difference between a session bean and an entity bean?

Entity bean | Session bean

An entity bean represents persistent global data from the database. | A session bean is an EJB that represents a single client inside the application server.

Entity bean data is stored in a database. | A session bean is not persistent; it is destroyed once the session terminates.

An entity bean supports transactions. | A session bean does not support transactions.

10. What is ActionServlet?

· The class org.apache.struts.action.ActionServlet is called the ActionServlet.

· In the Jakarta Struts Framework this class plays the role of the controller.

· All requests to the server go through the controller, which is responsible for handling them.

11. What is action class and give example?

· The Action is part of the controller. The purpose of an Action class is to translate the HttpServletRequest to the business logic.

· To use an Action, we need to subclass it and override the execute() method. The ActionServlet (the command) passes the ActionMapping and the ActionForm to the Action's execute() method.

· The return type of the execute() method is ActionForward, which is used by the Struts Framework to forward the request to the resource named by the returned ActionForward object.



import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts.action.Action;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;

public class TestAction extends Action {
    public ActionForward execute(ActionMapping mapping, ActionForm form,
            HttpServletRequest request, HttpServletResponse response) throws Exception {
        return mapping.findForward("testAction");
    }
}

12. What is struts action and mapping?

Struts action Definition:

· A Struts action is an instance of a subclass of an Action class, which implements a portion of a Web application and whose perform or execute method returns a forward.

· An action can perform tasks such as validating a user name and password.

Struts mapping:

· An action mapping is a configuration file entry that, in general, associates an action name with an action.

· An action mapping can contain a reference to a form bean that the action can use, and can additionally define a list of local forwards that is visible only to this action.

13. What is the Struts validator framework? Give details of the XML files needed.

Struts validator framework:

· The Struts Framework provides functionality to validate form data.

· It can be used to validate the data in the user's browser as well as on the server side.

· The Struts Framework emits JavaScript that can be used to validate the form data in the client browser. Server-side validation of a form can be accomplished by subclassing your form bean from the DynaValidatorForm class.

Details of the XML files used in the Validator Framework:

· The Validator Framework uses two XML configuration files: validator-rules.xml and validation.xml.

· validator-rules.xml defines the standard validation routines; these are reusable and are used in validation.xml to define the form-specific validations.

· validation.xml defines the validations applied to a form bean.

14. Is it possible to read two Struts configuration files in a Tomcat container within the same application?