Abinitio Interview Questions: Data Ware Housing Concepts

Data Ware Housing Concepts at a glance

What is DataWareHouse?

When users are requesting access to a large amount of historical information for reporting purposes, you should strongly consider a warehouse or mart. The user will benefit when the information is organized in an efficient manner for this type of access.

Def:

DWH is a type of relational data base system specially designed for query analysis processing rather than transactional processing.

Characterstics:

► Subject Oriented

► Non Volatile

► Integrated

► Time Variant

Subject Oriented:

Means that the data collected must be “business Oriented”.

Integrated:

The DW contains data from most or all of the organization operational systems and those data are made consistent.

Non_Volatile:

Data in the DWH are never overwritten or deleted-once committed the data are static ,read only, and retained for future reporting.

TimeVariant:

The data of the DW msut be historical.

What is Enterprise DataWareHouse?

Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise.

Granularity:

The granularity is the lowest level of information stored in the fact table. The depth of data level is known as granularity. In date dimension the level could be year, month, quarter, period, week, day of granularity.

The process consists of the following two steps:

- Determining the dimensions that are to be included

- Determining the location to place the hierarchy of each dimension of information

Data miningL Extraction of data from warehouse

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both.

Data Presentation area:

The place where warehouse data is organized,stored and available for direct querying by users,data access tools, and other analytical applications.

Data Staging Area:

A storage area and set of processes that clean, transform, combine, deduplicate , house hold, archive, and prepare source data for use into the data warehouse.

DataMart:

The term “DataMart” refers to a sub entity of data warehouse containing the data of the DW for a particular sector of the company.

Ex: department, division service, product line etc.

Consolidated data mart:

Data marts that combine business measurements from multiple business processes. Some times called a second level data mart.

First level data Mart:

A datamart that is derived from a single primary source system.

Second level data mart:

Same as consolidated datamart.

Architecture:

The Data Warehouse architecture defines the components and the structure of the data warehouse. The architecture also defines the interfaces between the components and the external environment.

All data warehouses applications have many logical layers that are separate. Each layer interacts with the other layers by passing the required data.

http://dwhinfo.com/Images/ETLLayeredArchitecture.gif

Source Layer

This layer contains all the separate OLTP and Legacy systems that provide the business data. These systems are physically separate from each other. This layer could contain data from different operating systems, databases and files.

Landing Area

This is the layer where source data before being transformed is temporarily storedon the ETL Server. We could store the source data in files or load it into database tables. Generally we store the source data in files.

We could use the database to house the source data to provide a sophisticated and fast way of handling reject data reprocessing (one of the DWH Exception Handling processes) requirements for very low tolerance level DWH.

Please Note that adding a database at the landing area adds cost of extra infrastructure and maintenance.

Staging Area

This is the layer where the cleansed and transformed data is temporarily stored.

We generally use files to store the intermediate data while transformation and cleansing. Once the data is ready to be loaded to the warehouse, we load it in the staging database. The advantage of using the staging database is that we add a point in the ETL flow where we can restart the load from. The other advantages of using staging database is that we can directly utilize the bulk load utilities provided by the databases and ETL tools while loading the data in the warehouse/mart, and provide a point in the data flow where we can audit the data.

Warehouse/Mart Layer

This is the layer where we store the transformed data. The data here is used for business analysis.

Presentation Layer

This layer contains the tools and processes used to read the data from the warehouse/mart and present it to the end user in a graphical format.

The DWH architecture can be affected by various factors namely

Requirements,
Geographical location of the source systems and the
Budget

Requirements impact to DWH architecture

In case of business not clear about the overall business requirements for all or multiple subject areas, it would be safer to understand the requirements for a single subject area and develop the data mart for that subject area. This leads to astandalone data mart. In the future once requirements for another subject area is clear we tend to build another standalone data mart and so on.. This creates multiple standalone data marts. This term is called as stove-pipe or siloed structure.

Here multiple data marts would have the same data loaded as many times and, same data may not be consistent across data marts.

Care must be taken when we build the first data mart to avoid a stove-pipe architecture in the future when requirements for other subject areas are clear. We do so by designing dimension tables in the data mart that can be shared with other data marts. These dimension tables are called Conformed Dimension tables. Using the conformed dimensions we can integrate multiple data marts to form a data warehouse. This approach is called the bottom-up approach, suggested by Ralph Kimball.

In the case when the overall business requirements are clear for all subject areas, we first build an Enterprise Data Warehouse (EDW) which is a third normal form (3NF) schema. From the EDW we build the various subject specific data marts. This approach is called the top-down approach, suggested by Bill Inmon.

Two Approaches:

There are two strategies to build a data warehouse namely

Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)

Top Down Approach (Suggested by Bill Inmon)

In the top down approach suggested by Bill Inmon, we build a centralized repository to house corporate wide business data. This repository is called Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.

The central repository for corporate wide data helps us maintain one version of truth of the data.

The data in the EDW is stored at the most detail level. The reason to build the EDW on the most detail level is to leverage

Flexibility to be used by multiple departments.
Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are

The complexity of design increases with increasing level of detail.
It takes large amount of space to store data at detail level, hence increased cost.

Once the EDW is implemented we start building subject area specific data marts which contain data in a denormalized form also called star schema. The data in the marts are usually summarized based on the end users analytical requirements.

The reason to denormalize the data in the mart is to provide faster access to the data for the end users analytics. If we were to have queried a normalized schema for the same analytics, we would end up in a complex multiple level joins that would be much slower as compared to the one on the denormalized schema.

We should implement the top-down approach when

The business has complete clarity on all or multiple subject areas DWH requirements.
The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized repository to cater for one version of truth for business data. This is very important for the data to be reliable, consistent across subject areas and for reconciliation incase of data related contention between subject areas.

The disadvantages of using the Top Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented followed by building the data marts before which they can access their reports.

Bottom Up Approach (Suggested by Ralph Kimball)

The bottom up approach suggested by Ralph Kimball is an incremental approach to build a data warehouse. Here we build the data marts separately at different points of time as and when the specific subject area requirements are clear. The data marts are integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension and a conformed fact is one that can be shared across data marts.

A Conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. The conformed dimension means exact same thing with every fact table it is joined.

A Conformed fact has the same definition of measures, same dimensions joined to it and at the same granularity across data marts.

In order to build conformed dimensions and facts we need to create a Bus Matrix with the rows corresponding to the various data marts and columns corresponding to all the dimension tables as depicted in the diagram above.

The bottom up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements are clear. We dont have to wait for knowing the over all requirements of the warehouse. Care must be taken to build an exhaustive bus matrix from the first data mart itself identifying all possible dimensions, otherwise we would be building a stove-pipes in the organization.

We should implement the bottom up approach when

We have initial cost and time constraints.
The complete warehouse requirements are not clear. We have clarity to only one data mart.

The advantage of using the Bottom Up approach is that they do not require high initial costs and have a faster implementation time, hence the business can start using the marts much earlier as compared to the top-down approach.

The disadvantages of using the Bottom Up approach is that it stores data in the denormalized format, hence there would be high space usage for detailed data. We have a tendency of not keeping detailed data in this approach hence loosing out on advantage of having detail data .i.e. flexibility to easily cater to future requirements.

- - - - - - - - - - - --

Attribute:

A column(field) in a dimension table.

Composite key:

A key in a db table made up of several columns. Same as candidate key.

Foreign key:

A column in a relational data base table whose values are drawn from the values of primary key in another table.

What is Dimensional Modeling?

Dimensional Modelling is a design concept used by many data warehouse desginers to build thier datawarehouse. In this design model all the data is stored in two types of tables - Facts table and Dimension table. Fact table contains the facts/measurements of the business and the dimension table contains the context of measuremnets ie, the dimensions on which the facts are calculated.

Drill Down:

The act of adding a row header or replacing a row header in a report to break down the rows of the answer set more finely.

Drill Up:

The act of removing a row header or replacing a row header in a report to summarize the rows of th answer set.

Flat File:

A simple data structure, often implemented on a mainframe,that relies on nonrelational files, such as IBM VSAM files.

Fact:

A business performance measurement, typically numeric and additive, that is stored in a fact table.

Ex: Branch_key

Fact Table

As explained above the Fact Table forms the center of the star schema. It contains one or more business measure which are numeric values. The fact table is joined to the dimension tables through the foreign keys to the dimension tables. These foreign keys joining the fact table to the dimension tables form the fact tables composite primary key.

Ex:

Monthly Account Fact

Month End Date Key(FK)

Branch Key(FK)

Product key(FK)

Account key(FK)

Account Status(FK)

Types of Measures in a Fact

Based on whether a metric is calculated and or how the measure values can be added we categorize them as follows

Base Measures(Fact) - A metric in the fact table whose value is taken as-is from the source. The ETL process does not change its value based on business rules or other measures.

e.g. The "Sales in $" in the above example is a transaction whose value is loaded as-is from the source system.

Eg: Account_key,Branch_key.

Derived Measures(Fact) - A metric in the fact table whose value is calculated based on business rules and the base measures from source system

.e.g. The "Sales Margin" is a measure that is calculated based on the (Sales in $ - Cost in $)/Sales in $.

Here we use two base measures namely Sales in $ and Cost in $.

Eg: loan Margin

Additive Measures(Facts) - These are measures in the fact table that can be added across all the dimension tables and give meaningful data.

For example the "Sales in $" in the example above can be measured across all the three dimensions attached to the fact table. If we add the "Sales in $" across the time dimension we get the total sales for a period of time, similarly total sales for across all stores, and sales for all products.

Ex: Monthly deposits

Semi Additive Measures(Facts) - These are measures in the fact table that can be added across only some dimensions. When added across the other dimensions it may not produce meaningful results.

For example the Inventory Balance metric in the example, indicates the remaining number of the product in the store at the time of the transaction. Adding it over the time dimension will not result in a meaningful result, but adding it for all the products in the store will give the total inventory count.

Ex: Fees Charged

Non Additive Measures(Facts)- These are measure in the fact table that cannot be added across any fact. If added they would not produce meaningful results. These are generally percentages and ratio metrics.

Example of a non additive fact would be Sales Margin % as shown in the example above.

Ex: percentage of interest.

Types of Fact Tables

Transaction Fact Tables - These are fact tables that contain the value of the business transaction that has occurred at a point of time. Here a row will be inserted for each transaction that has occurred.

Periodic Snapshot Fact Tables - These are fact tables that contain the complete snapshot of the transactions at the end of the business period (day/week/month etc). Take for example that there were 10 sales transactions for a particular product/SKU during the day.

In Transaction Fact table we would have the 10 entries for each of the transaction and the value for inventory balance would reduce with the rows for each transaction. In the case of Periodic Snapshot table we would store the end of day Balance Inventory value only. The Periodic Snapshot fact tables are loaded continuously at the end of every business period(day/week/month etc). This way we build the fact table to provide predictable trends for business measures.

Accumulating Snapshot Fact Tables - These are special type of fact tables that are applied to business processes like order management. Here we create entries for all the phases of the order (start to end of the order process) when an order is created. Once the event to complete a phase is over we update the row corresponding to the event with factual entries and the date of event.

Factless Fact tables - These are multipart key tables with no business measure in them. They are meant to capture business metrics that cannot be determined based on business transactions. These are tables that represent the relationship between the dimensions without the need for a business transaction or event.

The star schema in the figure above would not be able to define the number of products which have not been sold in the last 2 weeks. The Sales Fact table in the figure is based on business transaction event of a sale for a product. To get answer for products not sold, we would create a Factless Fact table containing the the combination of all product key for each store key available in the last 2 weeks. We would compare the product key for the last two week in the Sales Fact table and the Factless Fact table to give us products that have not been sold in the last two weeks.

Conformed Facts:

Facts from multiple fact tables are conformed when the technical definitions of the facts are equivalent conformed facts are allowed to have the same name in separate tables and can be combined and compared mathematically.

Dimension:

An independent entity in a dimensional model that serves as an entity point or as a mechanism for slicing and dicing the additive measures located in the fact table of the dimensional model.

Ex: Customer Name

Dimension Tables:

A dimensional table in a dimensional model is a table with a single-part primary key and descriptive attribute columns.

Ex: Account dimension, Household dimension.

As explained earlier, dimension table is a denormalized table which has a single key and stores business attributes in the form textual information. Dimension tables are joined to the fact table and provide business perspective to the measure in the fact. The key of the dimension table is used to join to the fact table where it becomes a part of the composite primary key. Dimension tables also contain the hierarchy of the business perspective in the denormalized form.

Example is shown below

http://dwhinfo.com/Images/ProductDimExample.gif

The Figure above explains the denormalized structure of the product dimension.

The hierarchy explains how a brand has departments within it and how a department has a classes within it and so on.

Characteristics of Dimension tables

>Dimension tables are denormalized to accommodate simpler navigation and reduce the number of joins in the query.

>This improves the query performance.

>Dimension tables contain business attributes which are generally textual. Some attributes like Size of a product would be numerical dimension attribute.

Dimension table's unique key is generally a meaningless running number called the S urrogate key.

Values of dimension attributes can change over a period of time. These types of dimensions are called Slowly Changing Dimensions.

Types of Dimension tables

Conformed Dimension - These are dimension tables that can be shared across multiple fact tables.

A common example of conformed dimension is the Time dimension.

Degenerate Dimension - A degenerate dimension contains only a key and no attributes. It is also called as an empty dimension as it is stripped of any attributes. It does not exist as a table but the key forms a part of the fact table.

An example would be adding a transaction number in the Sales Fact table.The transaction number will identify all the rows corresponding to a transaction.

Junk dimension:

A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.

The fact would contain several metrics (principal amount, net amount, price per share, commission, margin amount, etc.) and would be related to several dimensions such as

account, date, rep, office, exchange, etc. This fact would also contain several codes and flags that were related to the transaction rather than any of the dimensions ... such as origin code (that indicates whether the trade was initiated with a phone call or via the Web), a reinvest flag (that indicates whether or not this trade as was the result of the reinvestment of a dividend payout) and a comment field for storing special instructions from the customer.

These three attributes would normally be removed from the fact table and stored in a junk dimension ... perhaps called the trade dimension. In this way, the number of indexes on the fact table would be reduced, and performance (not to mention ease of use) would be enhanced. Hope this helps.

Dirty Dimension:

If a Record occurs more than one time in a table by the difference of a non-key attribute then such a dimension is called Dirty Dimemsion.

A "dirty dimension" is one in which data quality cannot be guaranteed. For example, in most banks, account-oriented source applications contain data about the same customer multiple times. Many banks attempt to derive a "customer" by matching names and addresses across account applications, but this process results in more than one entry for each bank customer. Similarly, different attributes must be held for each of a bank's heterogeneous products. Attributes that are meaningful for a loan, such as term, credit risk assessment, and collateral, have no meaning for savings, checking, or investment products.

Slowly Changing Dimension (SCD)- These are dimension tables whose attributes change over a period of time. As the name suggests the changes happen at a slow rate. Based on how the business wants to track the changes to the dimension attributes

we can handle the changes in three ways namely

Type 1 - In the Type 1 SCD tables we are not required to track the previous changes that have occurred to the attribute. Here we overwrite the existing value of the attribute with the new value. The table always contain the latest value for the dimension attribute.

Example would be realignment in the class of SKU 223345 from "sports" to "extreme sport", we update the class of the SKU 22345 from "sports" to "extreme sports"

Type 2 - In the Type 2 SCD tables we add a new row whenever there is a change in an attribute. Here we can track the changes of the attribute as we insert a new row as and when a change occurs. In these tables we require a column differentiating the current value and from the previous values. This can be achieved through a flag or a date. In the type 2 SCD we maintain the complete history of the changes.

In the above example of the realignment in the class of SKU 223345 from sports to extreme sports, we add a new row for the SKU 223345 with a new surrogate key and class = "extreme sports".

All the other columns would remain unchanged.

Type 3 - In the Type 3 SCD tables we add a new column whenever there is a change in an attribute.In SCD 3 tables we can track the changes of the attribute in the same row simultaneously. Here we need to alter the structure of the dimension table by adding the new column and renaming the existing column. In the example of the realignment in the class of SKU 223345 from sports to extreme sports, we rename the class column to old_class add a new column called class with value = "extreme sports" for the SKU 223345.

ETL (Extraction Transformation and Load) involves all the processes needed to make data ready for use by the business users so that they can make business decisions. This includes collecting data from various sources, validating it for accuracy, cleaning and making it consistent, changing it with respect to business rules, and eventually loading it in the warehouse/mart where the end-users can query it.

As depicted in the figure below, the ETL process starts from the source layer and goes all the way till the warehouse/mart layer. ETL can be defined as "collection of all the back-room process involved in preparing and loading the data for end users". The end users access only the data prepared by the ETL and not the processes themselves.

http://dwhinfo.com/Images/ETLComponents.gif

Diff:

DWH database (OLAP)	OLTP database
Designed for analysis of business measures by category and attributes.	Designed for real time business operations.
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.	Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Loaded with consistent, valid data; requires no real time validation.	Optimized for validation of incoming data during transactions; uses validation data tables.
Supports few concurrent users relative to OLTP.	Supports thousands of concurrent users.

Multidimensional Database Structures	Normalized Data Structures
Index - Many	Index - Few
Joins - Few	Joins - Many
Aggregated Data - More	Aggregate Data - Few
No. of users - Few	No. of users - More
Periodic update of data	Data Modification More
Huge volumes of data	Small volumes of data

Fact Constellation Schema:

4 The fact constellation architecture contains multiple fact tables that share many dimension tables

4 This Schema is used mainly for the aggregate fact tables, OR where we want to split a fact table for better comprehension

4 The split of fact table is done only when we want to focus on aggregation over few facts & dimensions.

http://www.bipminstitute.com/dipm_images/DDB3-%20Fact%20Constellation%20Schema.jpg

Slowly changing dimensions (SCD)

4 Entities change over time

4 Customer demographics, product characteristics, classification rules, status of customers etc. lead to changes in the attributes of dimensions

4 In a transaction system, many a times the change is overwritten and track of change is lost

– For example, a source system may have only the latest customer PIN Code, as it is needed to send the marketing and billing statements. However, a data warehouse needs to maintain all the previous PIN Codes as well, because we need to track on how many customers move to new locations over what frequency.

4 A key benefit for Data Warehouse is to provide historical information, which is typically over-written (and thus lost)in the transaction systems.

4 Handling SCD

– Type 1: Not interested in the previous state. Overwrites value.

– Type 2: Adds new row.

– Type 3: Adds new column

Slowly changing dimension type 1 (or SCD 1)

4 The way most of the source systems will handle it- Overwrite the attribute value

4 For example, if a customer’s marital status has moved from 'Unmarried' to 'Married', we over-write ‘Unmarried' to 'Married'

4 Similarly, if an insurance policy status has moved from 'Lapsed' to 'Re-instated' the new status is over written on the old status

4 This is obviously done, when we are not analyzing the historical information

Slowly changing dimension type2 (or SCD 2)

4 This is used, when there is more than one change in the attributes of an entity, and we need to track the date of change of the attribute

4 In this method, a new record is added whereby the new record is given a separate identifier as the primary key

4 We cannot use the production key as the primary key here as it has not changed (Customer ID has remained the same, while the value of its attribute 'marital status' has changed). This new identifier is called the surrogate key

4 Apart from adding a new record and providing a new primary (surrogate) key, the validity period for this new record is also added

4 For example- we have a dimensional table with customer_ID '110002' with marital status as 'single'. Overtime, customer gets married and also moved to a new location. The customer dimension record will be:

Slowly changing dimension type 3 (or SCD 3)

4 This is a mid-way between method 1 and method 2. Here we don’t add an additional record, but add a new field 'old attribute value‘

4 Limitations

– should know from the beginning, what attributes will change

» This is because a new field/attribute has to be added in the design for every attribute, which can change

– attribute can change maximum once in the lifetime of the entity OR at least the lifetime of the data warehouse

4 For the same example that we saw above where the marital status and city of the customer change, the table using type 3 method is as follows:

Fact Table is the table which has facts (measures) and foreign keys to the dimension tables

4 Fact Table

– Is the primary table in a dimensional model

– Facts are numeric measurements (values) that represent a specific business aspect or activity

– Facts can be computed or derived at run-time (metrics).

– Measures are

» quantity, count, amount, percent

» mostly numerical, continuous values

» e.g., price of a product, quantity sold, number of products in inventory, budget value, count of customers, count of sales, account balance

– Have two or more foreign keys(FK) that connect to the dimension table’s primary keys

– Generally has own primary key(called a composite or concatenated key) made up of a subset of the foreign keys

– Express the many-to-many relationships between dimensions

Types of facts and fact tables

4 Facts (or measures) can be classified by…

– Numerical data type

– Aggregation type

– Additive nature

– Granularity

4 Different types of Fact tables

– Transactional: Additive facts tracking events over time

– Snapshot or inventory: Pictures in time of levels or balances

– Factless: Dimensionality relationships

– Combinations of the above

Facts can be classified based on the type of aggregation

4 Aggregation is a summarization of base-level fact table records

4 Aggregations need to account for the additive nature of the measures, created on-the-fly or by pre-aggregation

4 Common aggregation scenarios

– Category product aggs by store by day

– District store aggs by product by day

– Monthly sales aggs by product by store

– Category product aggs by store district by day

4 Common aggregations = Sum, Count, Distinct Count, Max, Min, Average, etc.

semi-additive and non-additive

4 Additive: Facts that can be summed up/aggregated across all of the dimensions in the fact table

– (e.g., discrete numerical measures of activity, i.e., quantity sold, dollars sold)

4 Semi-Additive: Facts that can be summed up for some of the dimensions in the fact table, but not the others

– (e.g., account balances, inventory level, distinct counts)

4 Non-Additive: Facts that cannot be summed up for any of the dimensions present in the fact table

(e.g., measurement of room temperature

Table

4 Fact Table Granularity

4 The level of detail of data contained in the fact table

4 The description of a single instance (a record) of the fact

– Typically includes a time level and a distinct combinations of other dimensions

4 e.g. Daily item totals by product, by store, Weekly snapshot of store inventory by product

Transaction fact tables are the most common types of fact tables

4 Most common type of fact table

4 Track the occurrence of events, each detailed event is captured into a row in the fact table

4 Measures are typically additive across all dimensions

4 Common transactional fact table types

– Sales, Visits, Web-page hits, Account transactions

Star Schema:

StarSchema is a relational db schema for representing multidimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity relationship diagram between dimensions and fact table resembles a star where one fact table is connected to multiple dimensions.

ADV:

Easy to understand data.

Snow Flake schema:

The snow flake schema is a normalized(3NF) star schema . In a snowflake schema, the dimension tables are normalized.

ADV:

In some cases it may improve performance because smaller tables are joined.

>Is easier to maintain.

>Increases flexibility.

Disadv:

Increases the no.of tables an end user must work with.

Makes the queries much more difficult to create because more tables need to be joined.

What is metric attribute?

A)The attributes other than keys.

Diff bw star and snowflake:

	Star Schema	Snowflake Schema
Dimension tables Normalization	2NF	3Nf
Joins	Fewer joins and hence less complex queries easy to understand	Higher no.of joins and hence more complex queries and hence less easy to understand.
Query performance	Less no.of Fkeys and hence lesser query execution time	More Fkeys and hence more query execution time
Easy of maintainence	Has redundant data and hence less easy to maintain/change	No redundancy and hence more easy to maintain and change
Type of DW	Good to use for small DWs/data marts.	Good for large
Dimension tables	Contains only single dimension table for each dimension	It may have more than one dimension table for each dimension

What is 1NF,2NF,3NF?

Normalization is the process of successively reducing relations with anomalies to produce smaller and well structured relations.

First normal form (1NF) is a relation that has a primary key and in which there are no repeating groups.

Second normal form (2NF) a relation in first normal form in which every non key attribute is fully functionally dependent on the primary key.

Third normal form (3NF) a relation that is in second normal form and has no transitive dependencies.

What is functional dependency?

Definition: A functional dependency occurs when one attribute in a relation uniquely determines another attribute. This can be written A -> B which would be the same as stating "B is functionally dependent upon A."

Examples:

In a table listing employee characteristics including Social Security Number (SSN) and name, it can be said that name is functionally dependent upon SSN (or SSN -> name) because an employee's name can be uniquely determined from their SSN. However, the reverse statement (name -> SSN) is not true because more than one employee can have the same name but different SSNs.

What is Transitive Dependency?

A transitive dependency is a type of functional dependency in which the value in a non-key field is determined by the value in another non-key field and that field is not a candidate key.

A Practical Approach

Again, look for repeated values in a non-key field as in the following example.

A table with a single field primary key and repeating values in non-key fields.

*ProjectNum	ProjectTitle	ProjectMgr	Phone
30-452-T3	STAR manual	Garrison	2756
30-457-T3	ISO procedures	Jacanda	2954
30-482-TC	Web site	Friedman	2846
31-124-T3	Employee handbook	Jones	3102
31-238-TC	STAR prototype	Garrison	2756
31-241-TC	New catalog	Jones	3102
35-152-TC	STAR pricing	Vance	3022
36-272-TC	Order system	Jacanda	2954

The phone number is repeated each time a manager name is repeated. This is because the phone number is only a second cousin to the project number. It's dependent on the manager, which is dependent on the project number (a transitive dependency).

The ProjectMgr field is not a candidate key because the same person manages more than one project. Again, the solution is to remove the field with repeating data to a separate table.

What is the difference between ER modeling and dimensional modeling?

ER model:

The entity-relationship model (or ER model) is a way of graphically representing the logical relationships of entities (or objects) in order to create a database.

Dimensional modeling:

Diff:

Basic diff is E-R modeling will have logical and physical model. Dimensional model will have only physical model. E-R modeling is used for normalizing the OLTP database design.Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

In ER modeling the data is in normalised form. So more number of Joins which may adversly affect the system performnace.Whereas in Dimensional Modelling the data is denormalised so less number of joins by which system performance will improve.

What is ODS(operational Date Source)?

ODS is the Operational Data Source which is also called transactional data ODS is the source of a warehouse. Data from ODs is staged transformed and then moved to datawarehouse.

What is BI?

Business intelligence (BI) refers to computer-based techniques used in spotting, digging-out, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes.^[1]

BI technologies provide historical, current, and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics,data mining, business performance management, benchmarking, text mining, and predictive analytics.

Business intelligence aims to support better business decision-making. Thus a BI system can be called a decision support system (DSS).^[2] Though the term business intelligence is sometimes used as a synonym for competitive intelligence, because they both support decision making, BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. Business intelligence understood broadly can include the subset of competitive intelligence.^[

What is multidimensional modeling?

Multi-dimensional model is an integral aspect of the On-line Analytical Processing which also known as OLAP.

Gathered by,

Divi Narayana,

Abinitio Group Of Researchers.

Abinitio Interview Questions

Sunday, 25 September 2016

Data Ware Housing Concepts

1 comment:

Total Pageviews