The optimization is a single index, but if you only nurse indexes in prod, you could waste a great time when developing SCDs in SSIS. The method is simple, when you now the nature of your SCD, then you can create an index right away, and reduce your development waiting time. Specially if you are testing with bigger volumes of data.
Let me show you
Let's say you have following table definitions, and you working in a SSIS project using Visual Studio:
-- Staging
-- Staging
CREATE TABLE Staging.Customers
(
CustomerId UNIQUEIDENTIFIER,
FistName NVARCHAR(200),
MiddleInitials NVARCHAR(200),
LastName NVARCHAR(200),
AccountId INT,
CreationDate DATETIME2
)
GO
-- Dimension
CREATE TABLE dbo.dimCustomers
(
CustomerDwhKey INT IDENTITY(1,1),
[Current] BIT,
CustomerId UNIQUEIDENTIFIER,
FistName NVARCHAR(200),
MiddleInitials NVARCHAR(200),
LastName NVARCHAR(200),
AccountId INT,
CreationDate DATETIME2
CONSTRAINT PK_CustomerId PRIMARY KEY CLUSTERED (CustomerDwhKey)
)
GO
In our example setup CustomerId will be a so-called Business key, And Current will be the indicator for which row are current. It should also be noted, it is possible to have more than one Business keys.
We'll configure attributes as:
Now, the Slowly Changing Dimension component work in following way:
For each entity it recieves, it will search the dimension table for an entity with the same business keys(in plural!!!) and is flagged as current, or in plain SQL:
SELECT attribute[, attribute] FROM dimension_table WHERE current_flag = true AND business_key = input_business_key[, business_key = input_business_key]
Or as it will look like in our example
SELECT AccountId, CreationDate, FirstName, MiddleInititals, LastName FROM dbo.dimCustomers WHERE [Current] = 1 AND CustomerId = some_key
Further, in case we have an historical change, the current entity in the dimension must be expired by setting Current = 0.
UPDATE dbo.dimCustomers SET [Current] = 0 WHERE [Current] = 1 AND CustomerId = some_key
The solution
As you might have realize by now, we can improve performance tremendously by putting an index on the current flag and the business keys(again plural!!!). For each entity passing through the Slowly Changing Dimension component, the will be at least 1, but likely 2 searches in the dimension table. And the by knowing the business keys and current flag. and the nature of the Slowly Changing Dimension component, you can predict the index which will improve performance.
The index for our sample will be
CREATE NONCLUSTERED INDEX IX_Current_CustomerId ON dbo.dimCustomers
(
[Current],
CustomerId --- Remember to include each business key
)
Should index be a filtered? I'll let that be up to you.
Indexes, bulk and loading of dimensions
Some tend to drop indexes when loading dimenson, with the argument: Bulk loading is fastest without indexes, which SQL Server has to maintain while loading. This argument has to be revised when working with the Slowly Changing Dimension component.
Because the component searches the dimensions so heavily, (in general) it will be faster loading with indexes than without. If there is no indexes, each entity going through the component, will require at least one table scan, which is quite expensive, and gets more expensive as your dimension grows.
That's all
No comments:
Post a Comment