What
is DataWareHouse?
When users are requesting access to a large amount of historical
information for reporting purposes, you should strongly consider a warehouse or
mart. The user will benefit when the information is organized in an efficient
manner for this type of access.
Def:
- DWH is a
type of relational data base system specially designed for query analysis
processing rather than transactional processing.
Characterstics:
► Subject Oriented
► Non Volatile
►
Integrated
►
Time Variant
Subject Oriented:
Means that the data collected must be “business
Oriented”.
Integrated:
The DW contains data from most or all of the organization operational
systems and those data are made consistent.
Non_Volatile:
Data in the DWH are never overwritten or deleted-once committed the data
are static ,read only, and retained for future reporting.
TimeVariant:
The data of the DW msut be historical.
What is Enterprise
DataWareHouse?
Enterprise Data Warehouse is a centralized warehouse which
provides service for the entire enterprise.
Granularity:
The granularity is the lowest level of
information stored in the fact table.
The depth of data level is known as granularity. In date dimension the level
could be year, month, quarter, period, week, day of granularity.
The process consists of the following two steps:
- Determining the dimensions that are to be
included
- Determining the location to place the
hierarchy of each dimension of information
Data miningL
Extraction of data from warehouse
Generally, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different perspectives and
summarizing it into useful information - information that can be used to
increase revenue, cuts costs, or both.
Data Presentation area:
The place where warehouse data is
organized,stored and available for direct querying by users,data access tools,
and other analytical applications.
Data Staging Area:
A storage area and set of processes that clean, transform, combine, deduplicate
, house hold, archive, and prepare source data for use into the data warehouse.
DataMart:
The term “DataMart” refers to a
sub entity of data warehouse containing the data of the DW for a particular
sector of the company.
Ex: department, division service, product line etc.
Consolidated data mart:
Data marts that combine business measurements from multiple business
processes. Some times called a second level data mart.
First level data Mart:
A datamart that is derived from a single primary source system.
Second level data mart:
Same as consolidated datamart.
Architecture:
The
Data Warehouse architecture defines the components and the structure of
the data warehouse. The architecture also defines the interfaces
between the components and the external environment.
All
data warehouses applications have many logical layers that are separate. Each
layer interacts with the other layers by passing the required data.

Source
Layer
This
layer contains all the separate OLTP and Legacy systems that provide the
business data. These systems are physically separate from each other. This
layer could contain data from different operating systems, databases and files.
Landing
Area
This
is the layer where source data before being transformed is temporarily
storedon the ETL Server. We could store the source data in files or
load it into database tables. Generally we store the source data in files.
We
could use the database to house the source data to provide a sophisticated and
fast way of handling reject data reprocessing (one of the
DWH Exception Handling processes) requirements for very low tolerance level
DWH.
Please
Note that adding a database at the landing area adds cost of extra
infrastructure and maintenance.
Staging
Area
This
is the layer where the cleansed and transformed data is temporarily
stored.
We
generally use files to store the intermediate data while transformation and
cleansing. Once the data is ready to be loaded to the warehouse, we load it in
the staging database. The advantage of using the staging database is that we
add a point in the ETL flow where we can restart the load from.
The other advantages of using staging database is that we can directly utilize
the bulk load utilities provided by the databases and ETL
tools while loading the data in the warehouse/mart, and provide a point in the
data flow where we can audit the data.
Warehouse/Mart
Layer
This
is the layer where we store the transformed data. The data here is used for
business analysis.
Presentation
Layer
This
layer contains the tools and processes used to read the data from the
warehouse/mart and present it to the end user in a graphical format.
The DWH architecture can be affected
by various factors namely
- Requirements,
- Geographical location of the
source systems and the
- Budget
Requirements
impact to DWH architecture
In
case of business not clear about the overall business requirements for
all or multiple subject areas, it would be safer to understand the
requirements for a single subject area and develop the data mart for
that subject area. This leads to astandalone data mart. In the
future once requirements for another subject area is clear we tend to build
another standalone data mart and so on.. This creates multiple standalone data
marts. This term is called as stove-pipe or siloed structure.
Here
multiple data marts would have the same data loaded as many times and, same
data may not be consistent across data marts.
Care
must be taken when we build the first data mart to avoid a stove-pipe
architecture in the future when requirements for other subject areas are clear.
We do so by designing dimension tables in the data mart that can be shared with
other data marts. These dimension tables are called Conformed Dimension tables.
Using the conformed dimensions we can integrate multiple data marts to
form a data warehouse. This approach is called the bottom-up
approach, suggested by Ralph Kimball.
In the case when the overall
business requirements are clear for all subject areas, we first build an Enterprise
Data Warehouse (EDW) which is a third normal form (3NF)
schema. From the EDW we build the various subject specific data marts. This
approach is called the top-down approach, suggested by Bill
Inmon.
Two Approaches:
There are two strategies to build a
data warehouse namely
- Top - Down Approach (Suggested by Bill Inmon)
- Bottom - Up Approach (Suggested by Ralph Kimball)
Top Down Approach (Suggested by Bill
Inmon)

In
the top down approach suggested by Bill Inmon, we build a centralized
repository to house corporate wide business data. This repository is called
Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized
form in order to avoid redundancy.
The
central repository for corporate wide data helps us maintain one version of
truth of the data.
The
data in the EDW is stored at the most detail level. The reason to build
the EDW on the most detail level is to leverage
- Flexibility to be used by
multiple departments.
- Flexibility to cater for future
requirements.
The
disadvantages of storing data at the detail level are
- The complexity of design
increases with increasing level of detail.
- It takes large amount of space
to store data at detail level, hence increased cost.
Once
the EDW is implemented we start building subject area specific data marts
which contain data in a denormalized form also called star schema.
The data in the marts are usually summarized based on the end users
analytical requirements.
The
reason to denormalize the data in the mart is to provide faster access to the
data for the end users analytics. If we were to have queried a normalized
schema for the same analytics, we would end up in a complex multiple level
joins that would be much slower as compared to the one on the denormalized
schema.
We
should implement the top-down approach when
- The business has complete
clarity on all or multiple subject areas DWH requirements.
- The business is ready to invest
considerable time and money.
The
advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very
important for the data to be reliable, consistent across subject areas and for
reconciliation incase of data related contention between subject areas.
The
disadvantages of using the Top Down approach is that it requires more
time and initial investment. The business has to wait for the EDW to be
implemented followed by building the data marts before which they can access
their reports.
Bottom
Up Approach (Suggested by Ralph Kimball)

The
bottom up approach suggested by Ralph Kimball is an incremental
approach to build a data warehouse. Here we build the data marts separately at
different points of time as and when the specific subject area requirements are
clear. The data marts are integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed
dimensions and conformed facts. A conformed dimension and a
conformed fact is one that can be shared across data marts.
A
Conformed dimension has consistent dimension keys, consistent
attribute names and consistent values across separate data marts.
The conformed dimension means exact same thing with every fact table it is
joined.
A
Conformed fact has the same definition of measures, same dimensions
joined to it and at the same granularity across data marts.

In
order to build conformed dimensions and facts we need to create a Bus Matrix
with the rows corresponding to the various data marts and columns corresponding
to all the dimension tables as depicted in the diagram above.
The
bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We dont have to
wait for knowing the over all requirements of the warehouse. Care must be taken
to build an exhaustive bus matrix from the first data mart itself identifying
all possible dimensions, otherwise we would be building a stove-pipes in the
organization.
We
should implement the bottom up approach when
- We have initial cost and
time constraints.
- The complete warehouse
requirements are not clear. We have clarity to only one data mart.
The
advantage of using the Bottom Up approach is that they do not require
high initial costs and have a faster implementation time, hence the business
can start using the marts much earlier as compared to the top-down approach.
The
disadvantages of using the Bottom Up approach is that it stores data in
the denormalized format, hence there would be high space usage for detailed
data. We have a tendency of not keeping detailed data in this approach hence
loosing out on advantage of having detail data .i.e. flexibility to easily cater
to future requirements.
- - - - - - - - - - - --
Attribute:
A
column(field) in a dimension table.
Composite
key:
A key in a db table made up of several columns. Same
as candidate key.
Foreign
key:
A column in a relational data base table whose
values are drawn from the values of primary key in another table.
What
is Dimensional Modeling?
Dimensional
Modelling is a design concept used by many data warehouse desginers to build
thier datawarehouse. In this design model all the data is stored in two types
of tables - Facts table and Dimension table. Fact table contains the
facts/measurements of the business and the dimension table contains the context
of measuremnets ie, the dimensions on which the facts are calculated.
Drill
Down:
The act of adding a row header or replacing a row
header in a report to break down the rows of the answer set more finely.
Drill
Up:
The act of removing a row header or replacing a row
header in a report to summarize the rows of th answer set.
Flat
File:
A simple data structure, often implemented on a
mainframe,that relies on nonrelational files, such as IBM VSAM files.
Fact:
A business performance measurement, typically
numeric and additive, that is stored in a fact table.
Ex:
Branch_key
Fact
Table
As
explained above the Fact Table forms the center of the star schema. It contains
one or more business measure which are numeric values. The fact table is joined
to the dimension tables through the foreign keys to the dimension tables. These
foreign keys joining the fact table to the dimension tables form the fact
tables composite primary key.
Ex:
Monthly
Account Fact
|
Month
End Date Key(FK)
Branch
Key(FK)
Product
key(FK)
Account
key(FK)
Account
Status(FK)
.
.
.
|
Types
of Measures in a Fact
Based
on whether a metric is calculated and or how the measure values can be added we
categorize them as follows
Base
Measures(Fact) - A metric in the fact table
whose value is taken as-is from the source. The ETL process does not change its
value based on business rules or other measures.
e.g. The "Sales in $" in the above example is a
transaction whose value is loaded as-is from the source system.
Eg: Account_key,Branch_key.
Derived
Measures(Fact) - A metric in the fact table
whose value is calculated based on business rules and the base measures from
source system
.e.g. The
"Sales Margin" is a measure that is calculated based on the (Sales in
$ - Cost in $)/Sales in $.
Here we use two base measures namely Sales in
$ and Cost in $.
Eg: loan
Margin
Additive
Measures(Facts) - These are measures in the
fact table that can be added across all the dimension tables and give
meaningful data.
For example
the "Sales in $" in the example above can be measured across all the
three dimensions attached to the fact table. If we add the "Sales in
$" across the time dimension we get the total sales for a period of time,
similarly total sales for across all stores, and sales for all products.
Ex:
Monthly deposits
Semi
Additive Measures(Facts) -
These are measures in the fact table that can be added across only some
dimensions. When added across the
other dimensions it may not produce meaningful results.
For example
the Inventory Balance metric in the example, indicates the remaining number of
the product in the store at the time of the transaction. Adding it over the
time dimension will not result in a meaningful result, but adding it for all
the products in the store will give the total inventory count.
Ex: Fees Charged
Non
Additive Measures(Facts)- These
are measure in the fact table that cannot be added across any fact. If added they would not produce meaningful
results. These are generally percentages and ratio metrics.
Example of a non
additive fact would be Sales Margin % as shown in the example above.
Ex:
percentage of interest.
Types
of Fact Tables
Transaction
Fact Tables - These are fact tables that
contain the value of the business transaction that has occurred at a point of
time. Here a row will be inserted for each transaction that has occurred.
Periodic
Snapshot Fact Tables - These are fact tables that
contain the complete snapshot of the transactions at the end of the business
period (day/week/month etc). Take for example that there were 10 sales
transactions for a particular product/SKU during the day.
In
Transaction Fact table we would have the 10 entries for each of the transaction
and the value for inventory balance would reduce with the rows for each
transaction. In the case of Periodic Snapshot table we would store the end of
day Balance Inventory value only. The Periodic Snapshot fact tables are loaded
continuously at the end of every business period(day/week/month etc). This way
we build the fact table to provide predictable trends for business measures.
Accumulating
Snapshot Fact Tables - These are special type of
fact tables that are applied to business processes like order management. Here
we create entries for all the phases of the order (start to end of the order
process) when an order is created. Once the event to complete a phase is over
we update the row corresponding to the event with factual entries and the date
of event.
Factless
Fact tables - These are multipart key tables
with no business measure in them. They are meant to capture business metrics
that cannot be determined based on business transactions. These are tables that
represent the relationship between the dimensions without the need for a
business transaction or event.
The
star schema in the figure above would not be able to define the number of
products which have not been sold in the last 2 weeks. The Sales Fact table in
the figure is based on business transaction event of a sale for a product. To
get answer for products not sold, we would create a Factless Fact table
containing the the combination of all product key for each store key available
in the last 2 weeks. We would compare the product key for the last two week in
the Sales Fact table and the Factless Fact table to give us products that
have not been sold in the last two weeks.
Conformed Facts:
Facts
from multiple fact tables are conformed when the technical definitions of the
facts are equivalent conformed facts are allowed to have the same name in
separate tables and can be combined and compared mathematically.
Dimension:
An independent entity in a dimensional model that
serves as an entity point or as a mechanism for slicing and dicing the additive
measures located in the fact table of the dimensional model.
Ex:
Customer
Name
Dimension
Tables:
A
dimensional table in a dimensional model is a table with a single-part primary
key and descriptive attribute columns.
Ex: Account
dimension, Household dimension.
As
explained earlier, dimension table is a denormalized table which
has a single key and stores business attributes in the form textual
information. Dimension tables are joined to the fact table and provide business
perspective to the measure in the fact. The key of the dimension table is used
to join to the fact table where it becomes a part of the composite primary key.
Dimension tables also contain the hierarchy of the business perspective in the
denormalized form.
Example
is shown below

The
Figure above explains the denormalized structure of the product dimension.
The hierarchy explains how a brand has
departments within it and how a department has a classes within it and so on.
Characteristics
of Dimension tables
>Dimension
tables are denormalized to accommodate simpler navigation and reduce the number
of joins in the query.
>This
improves the query performance.
>Dimension
tables contain business attributes which are generally textual. Some attributes
like Size of a product would be numerical dimension attribute.
Dimension
table's unique key is generally a meaningless running number called the Surrogate key.
Values
of dimension attributes can change over a period of time. These types of
dimensions are called Slowly Changing Dimensions.
Types
of Dimension tables
Conformed
Dimension - These are dimension tables
that can be shared across multiple fact tables.
A
common example of conformed dimension is the Time dimension.
Degenerate
Dimension - A degenerate dimension
contains only a key and no attributes. It is also called as an empty
dimension as it is stripped of any attributes. It does not exist as a
table but the key forms a part of the fact table.
An
example would be adding a transaction number in the Sales Fact table.The
transaction number will identify all the rows corresponding to a transaction.
Junk dimension:
A "junk" dimension is a collection of
random transactional codes, flags and/or text attributes that are unrelated to
any particular dimension. The junk dimension is simply a structure that
provides a convenient place to store the junk attributes. A good example would
be a trade fact in a company that brokers equity trades.
The fact would contain several metrics
(principal amount, net amount, price per share, commission, margin amount,
etc.) and would be related to several dimensions such as
account, date, rep, office, exchange, etc. This
fact would also contain several codes and flags that were related to the
transaction rather than any of the dimensions ... such as origin code (that
indicates whether the trade was initiated with a phone call or via the Web), a
reinvest flag (that indicates whether or not this trade as was the result of
the reinvestment of a dividend payout) and a comment field for storing special
instructions from the customer.
These three attributes
would normally be removed from the fact table and stored in a junk dimension
... perhaps called the trade dimension. In this way, the number of indexes on
the fact table would be reduced, and performance (not to mention ease of use)
would be enhanced. Hope this helps.
Dirty
Dimension:
If a Record occurs more than one time in a table by the difference of a
non-key attribute then such a dimension is called Dirty Dimemsion.
A "dirty dimension" is one in which data quality cannot be
guaranteed. For example, in most banks, account-oriented source applications
contain data about the same customer multiple times. Many banks attempt to
derive a "customer" by matching names and addresses across account
applications, but this process results in more than one entry for each bank
customer. Similarly, different attributes must be held for each of a bank's
heterogeneous products. Attributes that are meaningful for a loan, such as
term, credit risk assessment, and collateral, have no meaning for savings,
checking, or investment products.
Slowly
Changing Dimension (SCD)- These are
dimension tables whose attributes change over a period of time. As the name
suggests the changes happen at a slow rate. Based on how the business wants to
track the changes to the dimension attributes
we
can handle the changes in three ways namely
Type
1 - In the Type 1 SCD tables we are not
required to track the previous changes that have occurred to the attribute.
Here we overwrite the existing value of the attribute with the new value. The
table always contain the latest value for the dimension attribute.
Example would be realignment
in the class of SKU 223345 from "sports" to "extreme
sport", we update the class of the SKU 22345 from "sports" to
"extreme sports"
Type
2 - In the Type 2 SCD tables we
add a new row whenever there is a change in an attribute. Here we can track the
changes of the attribute as we insert a new row as and when a change occurs. In
these tables we require a column differentiating the current value and from the
previous values. This can be achieved through a flag or a date. In the type 2
SCD we maintain the complete history of the changes.
In the above example of the realignment in the
class of SKU 223345 from sports to extreme sports, we add a new row for the SKU
223345 with a new surrogate key and class = "extreme sports".
All
the other columns would remain unchanged.
Type
3 - In the Type 3 SCD tables we
add a new column whenever there is a change in an attribute.In SCD 3 tables we
can track the changes of the attribute in the same row simultaneously. Here we
need to alter the structure of the dimension table by adding the new column and
renaming the existing column. In the example of the realignment in the class of
SKU 223345 from sports to extreme sports, we rename the class column to
old_class add a new column called class with value = "extreme sports"
for the SKU 223345.
ETL
(Extraction Transformation and Load) involves
all the processes needed to make data ready for use by the business users so
that they can make business decisions. This includes collecting data from
various sources, validating it for accuracy, cleaning and making it consistent,
changing it with respect to business rules, and eventually loading it in the
warehouse/mart where the end-users can query it.
As
depicted in the figure below, the ETL process starts from the source layer and
goes all the way till the warehouse/mart layer. ETL can be defined as
"collection of all the back-room process involved in preparing and loading
the data for end users". The end users access only the data prepared by
the ETL and not the processes themselves.

Diff:
DWH database (OLAP)
|
OLTP database
|
Designed for analysis of business
measures by category and attributes.
|
Designed for real time business
operations.
|
Optimized for bulk loads and large,
complex, unpredictable queries that access many rows per table.
|
Optimized for a common set of
transactions, usually
adding or retrieving a single row at a
time per table.
|
Loaded with consistent, valid data;
requires no real time validation.
|
Optimized for validation of incoming
data
during transactions; uses validation
data tables.
|
Supports few concurrent users relative
to OLTP.
|
Supports thousands of concurrent users.
|
Multidimensional Database Structures
|
Normalized Data Structures
|
Index - Many
|
Index - Few
|
Joins - Few
|
Joins - Many
|
Aggregated Data - More
|
Aggregate Data - Few
|
No. of users - Few
|
No. of users - More
|
Periodic update of data
|
Data Modification More
|
Huge volumes of data
|
Small volumes of data
|
Fact Constellation Schema:
4 The
fact constellation architecture contains multiple fact tables that share many
dimension tables
4 This
Schema is used mainly for the aggregate fact tables, OR where we want to split
a fact table for better comprehension
4 The
split of fact table is done only when we want to focus on aggregation over few
facts & dimensions.

Slowly
changing dimensions (SCD)
4 Entities
change over time
4 Customer demographics, product characteristics,
classification rules, status of customers etc. lead to changes in the
attributes of dimensions
4 In
a transaction system, many a times the change is overwritten and track of
change is lost
– For example, a
source system may have only the latest customer PIN Code, as it is needed to
send the marketing and billing statements. However, a data warehouse needs to
maintain all the previous PIN Codes as well, because we need to track on how
many customers move to new locations over what frequency.
4 A
key benefit for Data Warehouse is to provide historical information, which is
typically over-written (and thus lost)in the transaction systems.
4 Handling
SCD
– Type
1: Not interested in the previous state. Overwrites value.
– Type
2: Adds new row.
– Type
3: Adds new column
Slowly
changing dimension type 1 (or SCD 1)
4 The
way most of the source systems will handle it- Overwrite the attribute value
4 For example, if
a customer’s marital status has moved from 'Unmarried' to 'Married', we
over-write ‘Unmarried' to 'Married'
4 Similarly,
if an insurance policy status has moved from 'Lapsed' to 'Re-instated' the new
status is over written on the old status
4 This
is obviously done, when we are not analyzing the historical information
Slowly
changing dimension type2 (or SCD 2)
4 This
is used, when there is more than one change in the attributes of an entity, and
we need to track the date of change of the attribute
4 In
this method, a new record is added whereby the new record is given a separate
identifier as the primary key
4 We
cannot use the production key as the primary key here as it has not changed
(Customer ID has remained the same, while the value of its attribute 'marital
status' has changed). This new identifier is called the surrogate key
4 Apart
from adding a new record and providing a new primary (surrogate) key, the
validity period for this new record is also added
4 For
example- we have a dimensional table with customer_ID '110002' with marital
status as 'single'. Overtime, customer gets married and also moved to a new
location. The customer dimension record will be:

Slowly
changing dimension type 3 (or SCD 3)
4 This
is a mid-way between method 1 and method 2. Here we don’t add an additional
record, but add a new field 'old attribute value‘
4 Limitations
– should
know from the beginning, what attributes will change
» This is because a new field/attribute has to
be added in the design for every attribute, which can change
– attribute
can change maximum once in the lifetime of the entity OR at least the lifetime
of the data warehouse
4 For
the same example that we saw above where the marital status and city of the
customer change, the table using type 3 method is as follows:

Fact
Table is the table which has facts (measures) and foreign keys to the dimension
tables
4 Fact
Table
– Is
the primary table in a dimensional model
– Facts are numeric measurements (values) that
represent a specific business aspect or activity
– Facts can be computed or derived at run-time
(metrics).
– Measures
are
» quantity,
count, amount, percent
» mostly
numerical, continuous values
» e.g.,
price of a product, quantity sold, number of products in inventory, budget
value, count of customers, count of sales, account balance
– Have two or more foreign keys(FK) that connect
to the dimension table’s primary keys
– Generally
has own primary key(called a composite or concatenated key) made up of a subset
of the foreign keys
– Express the many-to-many relationships between
dimensions

Types
of facts and fact tables
4 Facts
(or measures) can be classified by…
– Numerical
data type
– Aggregation
type
– Additive
nature
– Granularity
4 Different
types of Fact tables
– Transactional:
Additive facts tracking events over time
– Snapshot
or inventory: Pictures in time of levels or balances
– Factless:
Dimensionality relationships
– Combinations
of the above
Facts
can be classified based on the type of aggregation
4 Aggregation
is a summarization of base-level fact table records
4 Aggregations
need to account for the additive nature of the measures, created on-the-fly or
by pre-aggregation
4 Common
aggregation scenarios
– Category
product aggs by store by day
– District
store aggs by product by day
– Monthly
sales aggs by product by store
– Category
product aggs by store district by day
4 Common
aggregations = Sum, Count, Distinct Count, Max, Min, Average, etc.
semi-additive
and non-additive
4 Additive:
Facts that can be summed up/aggregated across all of the dimensions in the fact
table
– (e.g.,
discrete numerical measures of activity, i.e., quantity sold, dollars sold)
4 Semi-Additive:
Facts that can be summed up for some of the dimensions in the fact table, but
not the others
– (e.g.,
account balances, inventory level, distinct counts)
4 Non-Additive:
Facts that cannot be summed up for any of the dimensions present in the fact
table
(e.g.,
measurement of room temperature
Table
4 Fact
Table Granularity
4 The
level of detail of data contained in the fact table
4 The
description of a single instance (a record) of the fact
– Typically
includes a time level and a distinct combinations of other dimensions
4 e.g. Daily item totals by product, by store,
Weekly snapshot of store inventory by product
Transaction
fact tables are the most common types of fact tables
4 Most
common type of fact table
4 Track
the occurrence of events, each detailed event is captured into a row in the
fact table
4 Measures
are typically additive across all dimensions
4 Common
transactional fact table types
– Sales, Visits, Web-page hits, Account
transactions
Star Schema:
StarSchema
is a relational db schema for representing multidimensional data. It is the
simplest form of data warehouse schema that contains one or more dimensions and
fact tables. It is called a star schema because the entity relationship diagram
between dimensions and fact table resembles a star where one fact table is
connected to multiple dimensions.
ADV:
Easy
to understand data.
Snow Flake schema:
The
snow flake schema is a normalized(3NF) star schema . In a snowflake schema, the
dimension tables are normalized.
ADV:
In
some cases it may improve performance because smaller tables are joined.
>Is
easier to maintain.
>Increases
flexibility.
Disadv:
Increases
the no.of tables an end user must work with.
Makes
the queries much more difficult to create because more tables need to be
joined.
What is metric attribute?
A)The
attributes other than keys.
Diff bw star and snowflake:
Star Schema
|
Snowflake Schema
|
|
Dimension tables Normalization
|
2NF
|
3Nf
|
Joins
|
Fewer
joins and hence less complex queries easy to understand
|
Higher
no.of joins and hence more complex queries and hence less easy to understand.
|
Query performance
|
Less
no.of Fkeys and hence lesser query execution time
|
More
Fkeys and hence more query execution time
|
Easy of maintainence
|
Has
redundant data and hence less easy to maintain/change
|
No
redundancy and hence more easy to maintain and change
|
Type of DW
|
Good
to use for small DWs/data marts.
|
Good
for large
|
Dimension tables
|
Contains
only single dimension table for each dimension
|
It
may have more than one dimension table for each dimension
|
What is 1NF,2NF,3NF?
Normalization is the process of
successively reducing relations with anomalies to produce
smaller and well structured relations.
First normal form (1NF)
is a relation that has a primary key and in which there are no repeating
groups.
Second normal form (2NF) a relation in first normal form in which every non key attribute is fully functionally dependent on the primary key.
Third normal form (3NF) a relation that is in second normal form and has no transitive dependencies.
Second normal form (2NF) a relation in first normal form in which every non key attribute is fully functionally dependent on the primary key.
Third normal form (3NF) a relation that is in second normal form and has no transitive dependencies.
What is functional dependency?
Definition: A functional dependency occurs when one attribute in a
relation uniquely determines another attribute. This can be written A -> B
which would be the same as stating "B is functionally dependent upon
A."
Examples:
In a table listing employee
characteristics including Social Security Number (SSN) and name, it can be said
that name is functionally dependent upon SSN (or SSN -> name) because an
employee's name can be uniquely determined from their SSN. However, the reverse
statement (name -> SSN) is not true because more than one employee can have
the same name but different SSNs.
What
is Transitive Dependency?
A
transitive dependency is a type of functional dependency in which the
value in a non-key field is determined by the value in another non-key field
and that field is not a candidate key.
A
Practical Approach
Again,
look for repeated values in a non-key field as in the following example.
A table with a single field
primary key and repeating values in non-key fields.
|
|||
*ProjectNum
|
ProjectTitle
|
ProjectMgr
|
Phone
|
30-452-T3
|
STAR manual
|
Garrison
|
2756
|
30-457-T3
|
ISO procedures
|
Jacanda
|
2954
|
30-482-TC
|
Web site
|
Friedman
|
2846
|
31-124-T3
|
Employee handbook
|
Jones
|
3102
|
31-238-TC
|
STAR prototype
|
Garrison
|
2756
|
31-241-TC
|
New catalog
|
Jones
|
3102
|
35-152-TC
|
STAR pricing
|
Vance
|
3022
|
36-272-TC
|
Order system
|
Jacanda
|
2954
|
The
phone number is repeated each time a manager name is repeated. This is because
the phone number is only a second cousin to the project number. It's dependent
on the manager, which is dependent on the project number (a transitive
dependency).
The
ProjectMgr field is not a candidate key because the same person manages more
than one project. Again, the solution is to remove the field with repeating
data to a separate table.
What is the difference between ER
modeling and dimensional modeling?
ER model:
The
entity-relationship model (or ER model) is a way of graphically representing
the logical relationships of entities (or objects)
in order to create a database.
Dimensional modeling:
Dimensional Modelling is a design concept used by
many data warehouse desginers to build thier datawarehouse. In this design
model all the data is stored in two types of tables - Facts table and Dimension
table. Fact table contains the facts/measurements of the business and the
dimension table contains the context of measuremnets ie, the dimensions on
which the facts are calculated.
Diff:
Basic diff is E-R
modeling will have logical and physical model. Dimensional model will have only
physical model. E-R modeling is used for normalizing the OLTP database
design.Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.
In ER modeling the data is
in normalised form. So more number of Joins which may adversly affect the
system performnace.Whereas in Dimensional Modelling the data is denormalised so
less number of joins by which system performance will improve.
What is ODS(operational Date
Source)?
ODS is the Operational Data Source which is also
called transactional data ODS is the source of a warehouse. Data from ODs is
staged transformed and then moved to datawarehouse.
What is BI?
Business
intelligence (BI)
refers to computer-based techniques used in spotting,
digging-out, and analyzing business data, such as sales revenue by products and/or departments, or by
associated costs and incomes.[1]
BI
technologies provide historical, current, and predictive views of business
operations. Common functions of business intelligence technologies are
reporting, online analytical
processing, analytics,data mining, business
performance management, benchmarking, text mining, and predictive analytics.
Business
intelligence aims to support better business decision-making. Thus a BI system
can be called a decision support
system (DSS).[2] Though
the term business intelligence is sometimes used as a synonym for competitive
intelligence, because they both support decision making, BI uses
technologies, processes, and applications to analyze mostly internal, structured
data and business processes while competitive intelligence gathers, analyzes
and disseminates information with a topical focus on company competitors.
Business intelligence understood broadly can include the subset of competitive
intelligence.[
What is multidimensional modeling?
Multi-dimensional model is an integral
aspect of the On-line Analytical Processing which also known as OLAP.
Gathered
by,
Divi Narayana,
Abinitio Group Of
Researchers.
Thanks for your information. very good article.
ReplyDeleteMicro Strategy Training
Online Micro Strategy Training