Dimensional Modelling | STAR , Snowflake and Fact schema.
What is Dimensional modeling?
- Dimensional modeling is a design approach used in data warehousing to structure data for easier access and analysis.
- This approach organizes data into "dimensions" and "facts" to provide a clear, user-friendly way to analyze and understand data relationships.
Objective
Ease of Understanding
- Dimensional modeling makes data more accessible and understandable
- for end users by organizing it in a way that aligns with their business processes and perspectives.
Improved Query Performance
- By structuring data into a star or snowflake schema, queries can be optimized to access data more quickly, leading to faster performance and response times.
Flexibility in Reporting and Analysis
- The model supports various types of analysis and reporting by allowing users to slice and dice data across different dimensions.
Ease of Data Aggregation
- Dimensional modeling makes it easier to perform calculations and aggregations,
- such as totals, averages, and other statistics, by organizing data in a structured way.
Scalability
Dimensional modeling is designed to accommodate large volumes of data and can be scaled up as the data warehouse grows.
From Requirement to Data Design
1. Gathering Requirements
Business Understanding
- Start by understanding the business requirements and objectives.
- Identify the questions the business wants to answer and the key performance indicators (KPIs) to track.
Data Sources
- Determine the data sources needed to meet the business requirements.
- This includes databases, files, and other data systems.
2. Identifying Dimensions and Facts
Dimensions
Identify the key entities or categories (e.g., products, customers, time) that provide context for analyzing the data. These are the dimensions.
Facts
Identify the measurable data points (e.g., sales revenue, quantities sold) that are important for analysis. These are the facts.
3. Designing the Schema
Star Schema
- Design the data model using a star schema, where a central fact table is surrounded by dimension tables.
- The fact table contains measurable data, while the dimension tables provide context.
Snowflake Schema
- Alternatively, a snowflake schema can be used where dimension tables are further normalized, resulting in a more complex but more space-efficient design.
4. Establishing Relationships
Foreign Keys
- Use foreign keys to create relationships between fact tables and dimension tables.
- This links the data and allows for easy navigation between different tables.
- Define hierarchies within dimensions (e.g., year > month > day) to enable data drill-down and roll-up.
5. Validating the Design
Review with Stakeholders
Validate the design with stakeholders to ensure it meets the business requirements.
Test Queries
Test queries against the model to verify performance and data accuracy.
6. Implementation and Maintenance
ETL Process
Design an ETL (Extract, Transform, Load) process to populate the data warehouse with data from source systems.
Ongoing
Continuously monitor and maintain the data warehouse, including updates to the data model as business needs change.
Multi-Dimensional Data Model
- A multi-dimensional data model is a design approach for organizing data in a data warehouse that allows for easy analysis and reporting.
- This model is structured around dimensions and facts, forming a cube-like structure for data analysis.
Components of a Multi-Dimensional Data Model
Facts
Facts are measurable, quantitative data points such as sales revenue, order quantity, or profit. Facts represent the business events being analyzed.
Dimensions
Dimensions provide the context for facts. They represent entities such as time, products, and customers, offering different perspectives to analyze the facts.
Hierarchies
- Hierarchies are levels within dimensions that allow data to be analyzed at different levels of detail.
- For example, a time dimension may have a hierarchy with levels such as year, quarter, month, and day.
Measures
Measures are the specific metrics within the fact table that are being analyzed, such as sales amount or profit margin.
Attributes
Attributes are descriptive properties of dimensions, such as product color, customer age group, or store location.
Multi-Dimensional Data Structure
The multi-dimensional data model organizes data into a cube-like structure that is easy to navigate and analyze.
Data Cube
The data cube represents a multi-dimensional array of data. Each axis of the cube represents a dimension, and the cells of the cube contain fact data.
Star Schema
- A common multi-dimensional data model design is the star schema, which consists of a central fact table linked to dimension tables.
- The fact table contains measures, while the dimension tables provide attributes and hierarchies.
Snowflake Schema
Another design is the snowflake schema, which normalizes the dimension tables further, resulting in a more complex but more space-efficient model.
Advantages of Multi-Dimensional Data Model
Ease of Analysis
The model supports intuitive data analysis, allowing users to slice and dice data across different dimensions.
Performance
The structure of the data cube allows for optimized querying and fast data retrieval.
Flexibility
The model allows for various types of reporting and analysis, including drill-downs, roll-ups, and aggregations.
User-Friendly
The model aligns with how users think about data and business processes, making it easier to understand and use.
Examples of Multi-Dimensional Analysis
Sales Analysis
- A data cube could include sales data (fact) and dimensions such as time, region, and product.
- This allows for analysis of sales trends over time, by region, or by product category.
Customer Analysis
- A cube could include customer-related facts such as purchases and dimensions such as customer demographics, time, and product categories.
- This supports analysis of customer behavior and preferences.
Inventory Management
- A data cube could track inventory levels (fact) across dimensions such as
- time, warehouse location, and product categories, helping manage stock efficiently.
Star Schema
- A schema is like a blueprint or plan that shows how data is organized in a database.
- It defines the structure of the data and includes information about the types of data stored, how the data is arranged,
- and how different parts of the data relate to each other.
- Just like a blueprint guides the construction of a building, a schema guides how data is stored and used in a database.
- The star schema is a common design for organizing data in a data warehouse.
- It structures data into a central fact table surrounded by dimension tables.
- The star schema is easy to understand and use, making it a popular choice for many data warehousing projects.
Key Components of Star Schema
Fact Table
- It contains measurable, quantitative data such as sales revenue or order quantity.
- It also contains foreign keys that link it to the dimension tables.
Dimension Tables
- The dimension tables surround the fact table and provide context for the data in the fact table.
- They represent entities such as time, products, or customers, and include attributes and hierarchies.
Foreign Keys
- The fact table contains foreign keys that link it to the dimension tables, allowing for relationships between the tables.
Measures
- Measures are the specific metrics within the fact table that are being analyzed, such as sales amount or profit margin.
Attributes
- Attributes are descriptive properties of dimensions, such as product color, customer age group, or store location.
Features of Star Schema
Simple Structure
The star schema has a simple and straightforward structure that is easy to understand and work with.
Denormalization
Dimension tables are typically denormalized, meaning they contain repeated data to improve query performance.
Fast Query Performance
Due to its straightforward design, the star schema allows for optimized querying and fast data retrieval.
Flexibility in Analysis
The schema allows for various types of analysis and reporting by enabling users to slice and dice data across different dimensions.
Example of Star Schema
Let's consider a retail business as an example:
Fact Table
- The fact table could be named sales_fact and contain measures such as
- sales amount, quantity sold, and profit. It also includes foreign keys to link it to the dimension tables.
Dimension Tables
- The schema could have the following dimension tables:
- time_dimension: Represents the time aspect with attributes like year, quarter, month, and day.
- product_dimension: Represents product information with attributes such as product ID, name, category, and price.
- customer_dimension: Represents customer information with attributes such as customer ID, name, age group, and location.
- store_dimension: Represents store information with attributes such as store ID, name, location, and region.
- The fact table would be linked to each dimension table through foreign keys (e.g., time_id, product_id, customer_id, store_id).
Snowflake Schema
- A snowflake schema is a design approach for organizing data in a data warehouse.
- It is a variant of the star schema and is known for its complex, hierarchical structure.
- In a snowflake schema, dimension tables are further normalized,
- meaning they are broken down into smaller, related tables to reduce data redundancy.
Key Components of Snowflake Schema
Fact Table
- The fact table is the central table in the schema, containing measurable data such as sales revenue or order quantity.
- It includes foreign keys to link it to dimension tables.
Dimension Tables
Dimension tables provide context for the data in the fact table, such as time, product, or customer information.
Normalized Dimensions
- Unlike the star schema, the snowflake schema normalizes dimension tables by splitting them into smaller, related tables.
- For example, a "product_dimension" table might be split into separate tables for product categories, suppliers, and specific products.
Foreign Keys
Relationships between the tables are established using foreign keys, linking the fact table to the dimension tables and the dimension tables to each other.
Hierarchical Structure
Dimension tables in a snowflake schema can form a hierarchical structure, with tables linked to other tables that provide additional details.
Features of Snowflake Schema
Complex Structure
The snowflake schema has a more complex structure than the star schema due to the normalization of dimension tables.
Data Normalization
Dimension tables are normalized, reducing data redundancy and saving storage space.
Efficiency
The normalized design can lead to more efficient data storage and maintenance.
Slower Query Performance
Due to the complexity and normalization, querying data in a snowflake schema can be slower compared to a star schema.
Example of Snowflake Schema
Let's use a retail business example to illustrate a snowflake schema:
Fact Table
- The fact table could be named sales_fact and contain measures such as sales amount, quantity sold, and profit.
- It also includes foreign keys linking it to the dimension tables.
Dimension Tables
- In a snowflake schema, dimension tables might be split into smaller tables:
- time_dimension: This table could be split into separate tables for years, months, and days.
- product_dimension: This table might be split into tables for product categories, product subcategories, and individual products.
- customer_dimension: This table could be split into tables for customer demographics, geographic locations, and individual customers.
Relationships
- The fact table and dimension tables are linked using foreign keys.
- Additionally, relationships between dimension tables are established to form a hierarchy.
- the snowflake schema provides a more complex but space-efficient way of organizing data in a data warehouse.
- It reduces data redundancy but may require more complex queries for data analysis.
Fact Constellation Schema
- The fact constellation schema, also known as a galaxy schema, is a data warehouse design
- that consists of multiple fact tables sharing common dimension tables.
- It is more complex than the star schema and provides a way to handle multiple business processes within a data warehouse.
Key Components of Fact Constellation Schema
Fact Tables
- Fact tables contain measurable data and metrics such as sales revenue, quantities sold, and profit.
- A fact constellation schema can have multiple fact tables, each representing a different business process (e.g., sales, inventory, and orders).
Dimension Tables
- Dimension tables provide context for the data in the fact tables, such as time, products, or customers.
- These tables are shared across multiple fact tables.
Common Dimensions
- A distinguishing feature of a fact constellation schema is the sharing of dimension tables across multiple fact tables.
- For example, a time dimension table can be linked to both the sales fact table and the inventory fact table.
Foreign Keys
- Relationships between the fact and dimension tables are established using foreign keys.
- Fact tables are linked to the appropriate dimension tables.
Features of Fact Constellation Schema
Complex Structure
The schema has a complex structure due to the presence of multiple fact tables and shared dimensions.
Multiple Business Processes
The schema can support multiple business processes within a single data warehouse, making it more versatile.
Data Integration
Sharing dimension tables across multiple fact tables allows for better data integration and consistency across different business processes.
Flexibility in Analysis
The schema provides flexibility in analyzing data across different business processes and dimensions.
Example of Fact Constellation Schema
Let's use a retail business example to illustrate a fact constellation schema:
Sales Fact Table
- The sales fact table contains measures such as sales amount, quantity sold, and profit.
- It is linked to dimension tables such as time, product, and customer.
Inventory Fact Table
- The inventory fact table contains measures such as inventory levels and stock values.
- It shares dimension tables such as time and product with the sales fact table.
Orders Fact Table
- The orders fact table contains measures such as order quantity and order value.
- It also shares dimension tables such as time, product, and customer with the sales fact table.
Dimension
- time_dimension: Represents time data and is shared across all fact tables.
- product_dimension: Represents product data and is shared across all fact tables.
- customer_dimension: Represents customer data and is shared across sales and orders fact tables.
Conclusion
So we have covered Principles of Dimensional Modeling: Objectives, From Requirements to data design, Multidimensional Data Model, Schemas: the STAR schema, the Snowflake schema, fact constellation schema.