
Saturday, 25 October 2014

Properties of Administration Console

Using the Repository Server Administration Console -


                 Use the Repository Server Administration Console to administer your Repository Servers and repositories. A Repository Server can manage multiple repositories. You use the Repository Server Administration Console to create and administer the repository through the Repository Server.

You can use the Administration Console to perform the following tasks:
  • Add, edit, and remove repository configurations.
  • Export and import repository configurations.
  • Create a repository.
  • Promote a local repository to a global repository.
  • Copy a repository.
  • Delete a repository from the database.
  • Back up and restore a repository.
  • Start, stop, enable, and disable repositories.
  • Send repository notification messages.
  • Register and unregister a repository.
  • Propagate domain connection information for a repository.
  • View repository connections and locks.
  • Close repository connections.
  • Register and remove repository plug-ins.
  • Upgrade a repository. 

Friday, 24 October 2014

Power Center Client

                       The PowerCenter Client consists of the following applications that you use to manage the repository, design mappings and mapplets, and create sessions to load the data:

Repository Server Administration Console. Use the Repository Server Administration console to administer the Repository Servers and repositories.

Repository Manager. Use the Repository Manager to administer the metadata repository. You can create repository users and groups, assign privileges and permissions, and manage folders and locks.

Designer. Use the Designer to create mappings that contain transformation instructions for the PowerCenter Server. Before you can create mappings, you must add source and target definitions to the repository. 

The Designer has five tools that you use to analyze sources, design target schemas, and build source-to-target mappings:

  • Source Analyzer. Import or create source definitions.

  • Warehouse Designer. Import or create target definitions.

  • Transformation Developer. Develop reusable transformations to use in mappings.

  • Mapplet Designer. Create sets of transformations to use in mappings.

  • Mapping Designer. Create mappings that the Power Center Server uses to extract, transform, and load data.
Workflow Manager. Use the Workflow Manager to create, schedule, and run workflows. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data. 

                The Power Center Server runs workflow tasks according to the links connecting the tasks. You can run a task by placing it in a workflow.

Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running workflows for each Power Center Server. You can choose a Gantt Chart or Task view. You can also access details about those workflow runs.

              Install the client tools on a Microsoft Windows machine. For more information about installation requirements, see Minimum System Requirements.

Power Center repository & Repository Server

                The Power Center repository resides on a relational database. The repository database tables contain the instructions required to extract, transform, and load data. Power Center Client applications access the repository database tables through the Repository Server.


            You add metadata to the repository tables when you perform tasks in the PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows. The PowerCenter Server reads metadata created in the Client application when you run a workflow. The PowerCenter Server also creates metadata, such as start and finish times of a session or session status.

You can develop global and local repositories to share metadata:

Global repository. The global repository is the hub of the domain. Use the global repository to store common objects that multiple developers can use through shortcuts. These objects may include operational or Application source definitions, reusable transformations, mapplets, and mappings.

Local repositories. A local repository is within a domain that is not the global repository. Use local repositories for development. From a local repository, you can create shortcuts to objects in shared folders in the global repository. These objects typically include source definitions, common dimensions and lookups, and enterprise standard transformations. You can also create copies of objects in non-shared folders.

Version control. A versioned repository can store multiple copies, or versions, of an object. Each version is a separate object with unique properties. PowerCenter version control features allow you to efficiently develop, test, and deploy metadata into production.

You can connect to a repository, back up, delete, or restore repositories using pmrep, a command line program. For more information on pmrep, see “Using pmrep”.
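As an illustration, here is a minimal Python sketch of scripting a repository backup with pmrep. The repository name, user, and password are made-up values, and the exact pmrep options differ between PowerCenter versions, so treat the flags below as assumptions to be checked against the pmrep command reference.

import subprocess

# Hedged sketch: scripting a repository backup with pmrep.
# Flag names (-r, -n, -x, -o) follow common pmrep usage but should be
# verified against the pmrep reference for your PowerCenter version.
REPO, USER, PASSWORD = "MyRepository", "Administrator", "secret"  # assumed values

def run(args):
    """Run a pmrep command and raise if it fails."""
    print("running:", " ".join(args))
    subprocess.run(args, check=True)

# 1. Connect to the repository (version-dependent: older releases also take
#    host/port options, domain-based releases take a domain option instead).
run(["pmrep", "connect", "-r", REPO, "-n", USER, "-x", PASSWORD])

# 2. Back up the repository to a file on the Repository Server machine.
run(["pmrep", "backup", "-o", "nightly_backup.rep"])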

Repository Server -

                   The Repository Server manages repository connection requests from client applications. For each repository database registered with the Repository Server, it configures and manages a Repository Agent process. The Repository Server also monitors the status of running Repository Agents, and sends repository object notification messages to client applications.

                 The Repository Agent is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. The Repository Agent ensures the consistency of metadata in the repository by employing object locking.

Targets


PowerCenter can load data into the following targets:
  • Relational. Oracle, Sybase, Sybase IQ, Informix, IBM DB2, Microsoft SQL Server, and Teradata
  • File. Fixed and delimited flat file and XML.
  • Application. You can purchase additional PowerConnect products to load data into SAP BW. You can also load data into IBM MQSeries message queues and TIBCO.
  • Other. Microsoft Access.
You can load data into targets using ODBC or native drivers, FTP, or external loaders.
For more information about targets, see “Working with Targets” in the Designer Guide. 

Sources


Power Center accesses the following sources:
  • Relational. Oracle, Sybase, Informix, IBM DB2, Microsoft SQL Server, and Teradata.
  • File. Fixed and delimited flat file, COBOL file, and XML.
  • Application. You can purchase additional PowerConnect products to access business sources, such as PeopleSoft, SAP R/3, Siebel, IBM MQSeries, and TIBCO.
  • Mainframe. You can purchase PowerConnect for Mainframe for faster access to IBM DB2 on MVS.
  • Other. Microsoft Excel and Access.
Note: The Designer imports relational sources, such as Microsoft Excel, Microsoft Access, and Teradata using ODBC and native drivers.

For more information about sources, see “Working with Sources” in the Designer Guide. 

Monday, 20 October 2014

Creation of Mapping

A mapping is created using the Mapping Designer tool in the Designer client component.

PROCEDURE:
  • From the Tools menu
  • Click on Mapping Designer
  • Go to the Mappings menu
  • Select Create
  • Enter the mapping name (m-simplepass)
  • Click OK
  • From the Repository Navigator window, expand the Sources subfolder, drag the source definition (EMP), and drop it on the Mapping Designer workspace.
  • From the Targets subfolder, drag the target definition (t-employee) and drop it on the Mapping Designer workspace.
  • Without a Source Qualifier, the Integration Service cannot extract data from the source definition; the Source Qualifier generates the SQL queries used to extract the data (see the sketch after this list).
  • From the Source Qualifier, connect the columns to the target definition.
  • From the Repository menu, click Save.
  • A Source Qualifier is created automatically when we drag and drop the source definition onto the Mapping Designer workspace.

Creation of Target Definition

The target definition can be created using the Warehouse Designer tool in the Designer client component.


  • The Warehouse Designer can import the structure (metadata) of the tables from the database.
  • Open the Designer client.
        Note:
        Only one folder can be active at a time; the folder shown in bold is the active folder.
  • The Sources subfolder contains source definitions and the Targets subfolder contains target definitions.
  • These subfolders cannot be created manually; they are created automatically.
  • The right-hand window is called the workspace.
  • The left window is the Repository Navigator.

 PROCEDURE:
  • Open the Designer client.
  • From the Repository Navigator window, connect to the repository with the name <Rep_name>.
  • Select the folder (dwh).
  • From the Tools menu, select Warehouse Designer (Target Designer).
  • From the Targets menu
  • Click on Import from Database
  • Connect to the database with the following data:
        1. ODBC data source : dwh-source-oracle
        2. Username         : Scott
        3. Owner name       : Scott
        4. Password         : tiger
  • Click on Connect.
  • Select the tables and click OK.
  • From the Repository menu, click Save.


Creation of Source Definition


  • The source definition can be created using the Source Analyzer tool in the Designer client component.

  • The Source Analyzer can import the structure (metadata) of the tables from the database.
             Open the Designer client.
             Note:
             Only one folder can be active at a time; the folder shown in bold is the active folder.
  • The Sources subfolder contains source definitions and the Targets subfolder contains target definitions.
  • These subfolders cannot be created manually; they are created automatically.
  • The right-hand window is called the workspace.
  • The left window is the Repository Navigator.
PROCEDURE:
  • Open the Designer client.
  • From the Repository Navigator window, connect to the repository with the name <Rep_name>.
  • Select the folder (DWH).
  • From the Tools menu, select Source Analyzer.
  • From the Sources menu
  • Click on Import from Database
  • Connect to the database with the following data (see the ODBC sketch after this procedure):

        1. ODBC data source : dwh-source-oracle
        2. Username         : Scott
        3. Owner name       : Scott
        4. Password         : tiger
  • Click on Connect.
  • Select the tables and click OK.
  • From the Repository menu, click Save.
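As an aside, the same ODBC import can be mimicked outside the Designer. The Python sketch below connects through the same DSN and credentials as the procedure above and lists the columns of a table; it assumes the pyodbc package is installed and that the dwh-source-oracle DSN exists on the machine.

# Hedged sketch: reading source table metadata over the same ODBC DSN
# that the Designer uses (assumes pyodbc is installed and the DSN exists).
import pyodbc

conn = pyodbc.connect("DSN=dwh-source-oracle;UID=Scott;PWD=tiger")
cursor = conn.cursor()

# List the columns of the EMP table, much like the imported source definition.
for col in cursor.columns(table="EMP"):
    print(col.column_name, col.type_name, col.column_size)

conn.close()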



Creation Of Folder in Informatica

Please follow the step-by-step procedure below:
  • Start → Programs
  • Select Informatica PowerCenter X.X
  • Select Client
  • Click on PowerCenter Repository Manager.

POWER CENTER CLIENT

The following are the PowerCenter Client components:
         1) PowerCenter Designer.
         2) Workflow Manager.
         3) Workflow Monitor.
         4) Repository Manager.

A client is a GUI through which we work with metadata.

REPOSITORY MANAGER


                  The Informatica repository is a relational database that stores information, or metadata, used by the Informatica Server and Client tools. Metadata can include information such as mappings describing how to transform source data, sessions indicating when you want the Informatica Server to perform the transformations, and connect strings for sources and targets. The repository also stores administrative information such as usernames and passwords, permissions and privileges, and product version.

Informatica Power Center

                  Informatica PowerCenter provides an environment that allows you to load data into a centralized location, such as a datamart, data warehouse, or operational data store (ODS). 

                  You can extract data from multiple sources, transform the data according to business logic you build in the client application, and load the transformed data into file and relational targets. 

PowerCenter provides the following integrated components:

Load Manager Process and Data Transformation Manager Process


The Load Manager is the primary Informatica Server process. It performs the following tasks:

  • Manages session and batch scheduling. 
  • Locks the session and reads session properties. 

INFORMATICA

Informatica Power Center is a powerful ETL tool from Informatica Corporation.

Informatica Corporation products are:
  • Informatica Power Center
  • Informatica on demand
  • Informatica B2B Data Exchange
  • Informatica Data Quality
  • Informatica Data Explorer
Informatica Power Center is a single, unified enterprise data integration platform.

Saturday, 18 October 2014

Steps to install Informatica Power Center Server 8.6

Pre-requisites:



1) Windows XP SP2 (you can also install on any higher-end OS).

2) At least one database (here we assume Oracle XE is pre-installed).

3) 2 GB RAM recommended.


Informatica 8.6 Server Installation:

1) Select the server folder from the Informatica installation CD.

2) There you will find the setup file; click on it to start the installer.


What is the difference between a primary key and a surrogate key?


primary key -

A primary key is a special constraint on a column or set of columns. A primary key constraint ensures that the column(s) so designated have no NULL values, and that every value is unique. Physically, a primary key is implemented by the database system using a unique index, and all the columns in the primary key must have been declared NOT NULL. A table may have only one primary key, but it may be composite (consist of more than one column).
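A minimal sketch of these rules, using an in-memory SQLite database (the table and column names are made up for illustration): the composite primary key forces both columns to be NOT NULL and the pair of values to be unique.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A composite primary key: both columns must be NOT NULL and the pair unique.
cur.execute("""
    CREATE TABLE order_line (
        order_id  INTEGER NOT NULL,
        line_no   INTEGER NOT NULL,
        product   TEXT,
        PRIMARY KEY (order_id, line_no)
    )
""")

cur.execute("INSERT INTO order_line VALUES (1, 1, 'widget')")
try:
    # Violates the primary key: same (order_id, line_no) pair.
    cur.execute("INSERT INTO order_line VALUES (1, 1, 'gadget')")
except sqlite3.IntegrityError as exc:
    print("rejected duplicate key:", exc)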

ODS (Operational Data Sources)


                  My understanding of an ODS is that it is a replica of the OLTP system; the need for it is to reduce the burden on the production (OLTP) system while fetching data for loading targets. Hence it is a mandatory requirement for every warehouse.

What is the difference between snow flake and star schema



Star Schema vs. Snowflake Schema

Star Schema: The star schema is the simplest data warehouse schema.
Snowflake Schema: The snowflake schema is a more complex data warehouse model than a star schema.

Star Schema: Each dimension is represented in a single table; there are no hierarchies between dimension tables.
Snowflake Schema: At least one hierarchy exists between dimension tables.

Star Schema: It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
Snowflake Schema: It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snowflaked design.

Star Schema: A single join establishes the relationship between the fact table and any one of the dimension tables.
Snowflake Schema: Because there are relationships between the dimension tables, many joins are needed to fetch the data.

Star Schema: A star schema optimizes performance by keeping queries simple and providing fast response times. All the information about each level is stored in one row.
Snowflake Schema: Snowflake schemas normalize dimensions to eliminate redundancy. The result is more complex queries and reduced query performance.

Star Schema: It is called a star schema because the diagram resembles a star.
Snowflake Schema: It is called a snowflake schema because the diagram resembles a snowflake.
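A minimal sketch of the two designs, using an in-memory SQLite database (table and column names are made up for illustration): the star query needs one join between the fact and the denormalized product dimension, while the snowflake query needs an extra join to reach the category table.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    -- Star: the product dimension is denormalized (category held inline).
    CREATE TABLE dim_product_star (product_key INTEGER PRIMARY KEY,
                                   product_name TEXT, category_name TEXT);

    -- Snowflake: the category hierarchy is normalized into its own table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product_snow (product_key INTEGER PRIMARY KEY,
                                   product_name TEXT, category_key INTEGER);

    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

    INSERT INTO dim_product_star VALUES (1, 'Widget', 'Hardware');
    INSERT INTO dim_category     VALUES (10, 'Hardware');
    INSERT INTO dim_product_snow VALUES (1, 'Widget', 10);
    INSERT INTO fact_sales       VALUES (1, 99.0), (1, 1.0);
""")

# Star schema: a single join answers "sales by category".
star = cur.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product_star d ON f.product_key = d.product_key
    GROUP BY d.category_name
""").fetchall()

# Snowflake schema: the same question needs one more join.
snow = cur.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON f.product_key = p.product_key
    JOIN dim_category c     ON p.category_key = c.category_key
    GROUP BY c.category_name
""").fetchall()

print(star, snow)  # both return [('Hardware', 100.0)]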

Difference between data mart and data warehouse



Data Mart vs. Data Warehouse

Data Mart: A data mart is usually sponsored at the department level and developed with a specific issue or subject in mind; a data mart is a data warehouse with a focused objective.
Data Warehouse: A data warehouse is a "subject-oriented, integrated, time-variant, nonvolatile collection of data in support of decision making".

Data Mart: A data mart is used at a business division/department level.
Data Warehouse: A data warehouse is used at an enterprise level.

Data Mart: A data mart is a subset of data from a data warehouse. Data marts are built for specific user groups.
Data Warehouse: A data warehouse is an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.

Data Mart: By providing decision makers with only a subset of data from the data warehouse, privacy, performance, and clarity objectives can be attained.
Data Warehouse: The main objective of a data warehouse is to provide an integrated environment and a coherent picture of the business at a point in time.

Friday, 17 October 2014

DWH Concepts

DataWareHousing - ETL Project Life Cycle ( Simple to understand )


Data Warehousing -

Data warehousing projects are categorized into 4 types:

1) Development Projects
2) Enhancement Projects
3) Migration Projects
4) Production Support Projects

The following are the different phases involved in an ETL project development life cycle:

1) Business Requirement Collection ( BRD )
2) System Requirement Collection ( SRD )
3) Design Phase
a) High Level Design Document ( HLD )
b) Low Level Design Document ( LLD )
c) Mapping Design
4) Code Review
5) Peer Review
6) Testing
a) Unit Testing
b) System Integration Testing
c) User Acceptance Testing ( UAT )
7) Pre-Production
8) Production ( Go-Live )
Business Requirement Collection :-
  • Business requirement gathering is started by the business analyst, onsite technical lead, and client business users.
  • In this phase, a business analyst prepares the Business Requirement Document ( BRD ) (or) Business Requirement Specifications ( BRS ).
  • BR collection takes place at the client location.
  • The outputs from BR analysis are:

            BRS :- The business analyst gathers the business requirements and documents them in the BRS.
            SRS :- Senior technical people (or) the ETL architect prepare the SRS, which contains the s/w and h/w requirements.
The SRS includes:
  • O/S to be used ( Windows or Unix )
  • RDBMS required to build the database ( Oracle, Teradata, etc. )
  • ETL tools required ( Informatica, DataStage )
  • OLAP tools required ( Cognos, BO )

The SRS is also called Technical Requirement Specifications ( TRS ).
Designing and Planning the Solution :-
The outputs from the design and planning phase are:
  • HLD ( High Level Design ) Document
  • LLD ( Low Level Design ) Document
HLD ( High Level Design ) Document :-
  • An ETL architect and a DWH architect participate in designing a solution to build the DWH.
  • The HLD document is prepared based on the business requirements.
LLD ( Low Level Design ) Document :-
  • Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
  • The LLD contains the more technical details of the ETL system.
  • An LLD contains the data flow diagram ( DFD ) and details of the sources and targets of each mapping.
  • An LLD also contains information about full and incremental loads.
  • After the LLD, the development phase starts.
Development Phase ( Coding ) :-
  • Based on the LLD, the ETL team creates the mappings ( ETL code ).
  • After designing the mappings, the code ( mappings ) is reviewed by developers.
Code Review
  • Code review is done by a developer.
  • In code review, the developer reviews the code and the logic, but not the data.
  • The following activities take place in code review:
  • Check the naming standards of transformations, mappings, etc.
  • Check the source-to-target mapping ( whether the correct logic is placed in the mapping or not ).
Peer Review :-
  • The code is reviewed by another team member ( a third-party developer ).
Testing

The following types of testing are carried out in the testing environment:
  • Unit Testing
  • Development Integration Testing
  • System Integration Testing
  • User Acceptance Testing
Unit Testing :-
  • A unit test for the DWH is white-box testing; it checks the ETL procedures and mappings.
  • The following test cases can be executed by an ETL developer (a minimal record-count check is sketched after this list):

  1. Verify data loss
  2. Number of records in the source and target
  3. Data load / insert
  4. Data load / update
  5. Incremental load
  6. Data accuracy
  7. Verify naming standards
  8. Verify column mapping
  • The unit test is carried out by the ETL developer in the development phase.
  • The ETL developer also performs data validations in this phase.
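A minimal sketch (in Python, with made-up connection and table names) of the record-count check mentioned above, comparing the number of rows in the source and target tables:

import sqlite3

def row_count(conn, table):
    """Return the number of rows in a table (table name is trusted/internal)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def test_no_data_loss(source_conn, target_conn):
    """Unit-test style check: source and target row counts must match."""
    src = row_count(source_conn, "EMP")
    tgt = row_count(target_conn, "T_EMPLOYEE")
    assert src == tgt, f"record count mismatch: source={src}, target={tgt}"

# Stand-in databases so the sketch is runnable; in a real unit test these
# would be connections to the source system and the warehouse.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE EMP (empno INTEGER)")
target.execute("CREATE TABLE T_EMPLOYEE (empno INTEGER)")
source.executemany("INSERT INTO EMP VALUES (?)", [(1,), (2,)])
target.executemany("INSERT INTO T_EMPLOYEE VALUES (?)", [(1,), (2,)])

test_no_data_loss(source, target)
print("record counts match")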
Development Integration Testing -
  • Run all the mappings in sequence.
  • First run the source-to-stage mappings.
  • Then run the mappings related to dimensions and facts.
System Integration Testing -
  • After the development phase, the code is moved to the QA environment.
  • In this environment, the testing people are given read-only permission.
  • They test all the workflows.
  • They also test the code according to their standards.
User Acceptance Testing ( UAT ) -
  • This test is carried out in the presence of client-side technical users to verify the data migration from source to destination.
Production Environment -

  • Migrate the code into the Go-Live environment from the test environment ( QA environment ).


Dimension


 A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain the data from which dimensions are created.

Types

Slowly changing Dimension

SCD Type 1 -

 The attribute value is overwritten with the new value, obliterating the historical attribute values. For example, when the product roll-up changes for a given product, the roll-up attribute is merely updated with the current value.

SCD Type 2 -

A new record with the new attributes is added to the dimension table. Historical fact table rows continue to reference the old dimension key with the old roll-up attribute; going forward, the fact table rows will reference the new surrogate key with the new roll-up thereby perfectly partitioning history. 

SCD Type 3 -

            Attributes are added to the dimension table to support two simultaneous roll-ups - perhaps the current product roll-up as well as “current version minus one”, or current version and original.
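A minimal Python sketch of Type 1 versus Type 2 handling for a product dimension (the column and key names are made up for illustration): Type 1 overwrites the roll-up attribute in place, while Type 2 expires the current row and inserts a new row with a new surrogate key.

# Hedged sketch of SCD Type 1 vs Type 2 for a product dimension.
# The dimension is a list of rows keyed by a surrogate key; 'current'
# marks the active version of each natural key (product_id).
dim = [
    {"sk": 1, "product_id": "P100", "rollup": "Snacks", "current": True},
]

def scd_type1(dim, product_id, new_rollup):
    """Type 1: overwrite the attribute, losing history."""
    for row in dim:
        if row["product_id"] == product_id and row["current"]:
            row["rollup"] = new_rollup

def scd_type2(dim, product_id, new_rollup):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    new_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["product_id"] == product_id and row["current"]:
            row["current"] = False          # old version kept for history
    dim.append({"sk": new_sk, "product_id": product_id,
                "rollup": new_rollup, "current": True})

scd_type2(dim, "P100", "Confectionery")
print(dim)  # two versions of P100: the old roll-up and the new one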

Rapidly changing Dimension

A rapidly changing dimension is a dimension whose membership and attribute data may change with either high frequency or with widespread scope, and is typically represented as a separate dimension related to an existing regular or slowly changing dimension. The contents of an RCD are updated far more frequently (monthly, weekly, even daily for some business intelligence applications) than the dimension to which the RCD is associated.

Snowflake Dimension

            There may be a need to take some attributes out of a dimension table, keep them in a separate table, and link that table back to the original dimension table.

Junk Dimension

            While designing the fact table and dimension tables from the operational database, some attributes in the original database cannot be discarded, yet they do not fit into the fact or dimension tables.

In such case, the options are,
  • To discard them, which may result in the loss of information.
  • To put them in separate dimension tables, which unnecessarily increases the number of dimensions.
  • To use a junk dimension, where a junk dimension table is created with the 'junk' attributes (see the sketch below).
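A minimal sketch of the third option, building a junk dimension from all combinations of a few low-cardinality flags (the flag names are made up for illustration):

from itertools import product

# Low-cardinality 'junk' attributes that fit neither the fact nor a real dimension.
payment_flags = ["CASH", "CARD"]
gift_wrap_flags = ["Y", "N"]
rush_order_flags = ["Y", "N"]

# The junk dimension is simply the cross product of the flag values,
# each combination getting its own surrogate key.
junk_dim = [
    {"junk_key": key, "payment": p, "gift_wrap": g, "rush": r}
    for key, (p, g, r) in enumerate(
        product(payment_flags, gift_wrap_flags, rush_order_flags), start=1
    )
]

for row in junk_dim:
    print(row)  # 2 * 2 * 2 = 8 rows; the fact table stores only junk_key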

Conformed Dimension

Conformed dimensions are dimensions that are common to two or more cubes. Say CUBE-1 contains F1, D1, D2, D3 and CUBE-2 contains F2, D1, D2, D4 as its facts and dimensions; here D1 and D2 are the conformed dimensions.


Fact Table

  • A Fact table is a collection of facts and foreign key relations to the dimensions.
               The central table in a star join schema, characterized by a composite key, each of whose elements is a foreign key drawn from a dimension table. Facts are information about the business, typically numeric and additive. A table that contains the measures that the business users wish to analyze to find new trends or to understand the success or failure of the organization.

Types
  • Measures
  • Conformed Facts
 Measures

Measure objects retrieve numeric data, or data that is the result of calculations on data in the database. Measure objects are semantically dynamic: the values they return depend on the objects they are used with.

Types

Additive – Additive facts are facts that can be summed up for the dimensions present in the fact table.

Non-Additive - Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Semi-Additive - Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
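A minimal sketch of the difference (the amounts and dates are made up): a sales amount can be summed across every dimension, while an account balance can be summed across accounts on one day but not across days.

# Additive fact: sales amount can be summed over any dimension (here, over days).
sales = [("2014-10-01", 100.0), ("2014-10-02", 250.0)]
total_sales = sum(amount for _, amount in sales)   # 350.0 is meaningful

# Semi-additive fact: an account balance can be summed across accounts for one
# day, but summing the same account across days would double-count money.
balances = [
    ("2014-10-01", "ACC1", 500.0),
    ("2014-10-02", "ACC1", 500.0),  # same money, next day's snapshot
]
wrong_total = sum(b for _, _, b in balances)          # 1000.0 is NOT meaningful
end_of_period = max(balances, key=lambda r: r[0])[2]  # use the last snapshot instead

print(total_sales, wrong_total, end_of_period)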

Factless fact - A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information.

Common examples of factless fact tables include:
  • Identifying product promotion events (to determine promoted products that didn’t sell)
  • Tracking student attendance or registration events
  • Tracking insurance-related accident events
  • Identifying building, facility, and equipment schedules for a hospital or university

Normalization

The process of decomposing tables to eliminate data redundancy is called Normalization.

1st N.F:- The table should contain scalar or atomic values.

2nd N.F:- Table should be in 1N.F + No partial functional dependencies

3rd N.F:-Table should be in 2 N.F + No transitive dependencies

Normalization can be defined as splitting a table into two or more tables so as to avoid duplication of values.

Normalization is a step-by-step process of removing redundancies and dependencies of attributes in a data structure.

The condition of the data at the completion of each step is described as a "normal form".
Need for normalization:
  • Improves database design.
  • Ensures minimum redundancy of data.
  • Reduces the need to reorganize data when the design is modified or enhanced.
  • Removes anomalies from database activities.

First normal form:
  • A table is in first normal form when it contains no repeating groups.
  • The repeating columns or fields in an unnormalized table are removed from the table and put into tables of their own.
  • Such a table becomes dependent on the parent table from which it is derived.
  • The key to this table is called a concatenated key, with the key of the parent table forming a part of it.

Second normal form:
  • A table is in second normal form if all its non-key fields are fully dependent on the whole key.
  • This means that each field in a table must depend on the entire key.
  • Fields that do not depend on the whole combination key are moved to another table, on whose key they do depend.
  • Structures that do not contain combination keys are automatically in second normal form.
Third normal form:

 A table is said to be in third normal form if all the non-key fields of the table are independent of all other non-key fields of the same table.
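A minimal sketch of removing a transitive dependency (third normal form), using a made-up employee table where department name depends on department number rather than on the employee key:

# Not in 3NF: dept_name depends on dept_no, not on emp_no, so the same
# dept_name is repeated for every employee in the department.
employees = [
    {"emp_no": 1, "name": "Smith", "dept_no": 10, "dept_name": "Accounting"},
    {"emp_no": 2, "name": "Allen", "dept_no": 10, "dept_name": "Accounting"},
    {"emp_no": 3, "name": "Ward",  "dept_no": 20, "dept_name": "Research"},
]

# 3NF decomposition: move the transitively dependent column to its own table.
departments = {row["dept_no"]: row["dept_name"] for row in employees}
employees_3nf = [
    {"emp_no": r["emp_no"], "name": r["name"], "dept_no": r["dept_no"]}
    for r in employees
]

print(departments)     # {10: 'Accounting', 20: 'Research'} -- stored once each
print(employees_3nf)   # employee rows now reference dept_no only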

Surrogate Key

Surrogate Key -
  • A surrogate key is a substitute for the natural primary key.
  • It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
  • Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. You can use the Informatica Sequence Generator, an Oracle sequence, or SQL Server identity values for the surrogate key (a minimal sketch follows this list).
  • It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult.
  • Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
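A minimal Python sketch of surrogate key assignment during a dimension load (the names are made up; conceptually this is the role a sequence generator plays):

from itertools import count

# Surrogate keys are just a monotonically increasing sequence, independent of
# the natural key; the lookup keeps natural key -> surrogate key assignments.
next_key = count(start=1)
key_lookup = {}   # natural key (e.g. AIRPORT_NAME) -> surrogate key (AIRPORT_ID)

def surrogate_key(natural_key):
    """Return the existing surrogate key, or assign the next one in sequence."""
    if natural_key not in key_lookup:
        key_lookup[natural_key] = next(next_key)
    return key_lookup[natural_key]

print(surrogate_key("HEATHROW"))   # 1
print(surrogate_key("GATWICK"))    # 2
print(surrogate_key("HEATHROW"))   # 1 again -- stable for the same natural key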



Different methods of loading Dimension tables

Conventional Load:

Before loading the data, all the table constraints are checked against the data.

Direct Load (faster loading):

All the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data is not indexed.

Which columns go to the fact table and which columns go to the dimension table -

The aggregated or calculated value columns go to the fact table, and the detailed descriptive information goes to the dimension tables.

Degenerate Dimension Table -
  • A column in a fact table that does not map to any dimension and is not a measure column.
  • If a fact table contains values that are neither dimension keys nor measures, they are called degenerate dimensions. Ex: invoice ID, empno.
Grains or Granularity

              Level of granularity means the upper/lower level of the hierarchy up to which we can see/drill the data in the fact table.

                Level of granularity also means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each individual transaction, or aggregate it up to the minute and store that instead; the grain determines how much detail you keep for each transactional fact.
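A minimal sketch of the two grains (the timestamps and amounts are made up): the fact table either stores one row per transaction, or the same data rolled up to one row per product per minute.

from collections import defaultdict

# Transaction-level grain: one fact row per individual sale.
transactions = [
    ("2014-10-20 09:15:02", "P100", 10.0),
    ("2014-10-20 09:15:41", "P100", 15.0),
    ("2014-10-20 09:16:05", "P100", 20.0),
]

# Minute-level grain: the same facts rolled up to one row per product per minute.
per_minute = defaultdict(float)
for ts, product, amount in transactions:
    minute = ts[:16]                      # truncate seconds: 'YYYY-MM-DD HH:MM'
    per_minute[(minute, product)] += amount

print(len(transactions), "transaction-grain rows")
print(dict(per_minute))                   # 2 minute-grain rows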