Concepts and Techniques on Business Analytics: 2010

Thursday, October 21, 2010

Offline Reporting Solution

Sales representatives frequently travel to places where there is limited or degraded network infrastructure but still need access to up to date information in order to make business decisions. Offline reporting solutions provides reports with customized data which can be viewed anytime, anywhere and are unaffected by the quality of the network infrastructure, thus giving key decision makers actionable information in the field to make timely and informed decisions.

For example a manufacturing sales representative traveling to remote facilities in foreign a country where the network infrastructure is unreliable may need to know the actual amount of products purchased as compared to the planned amount. Pharmaceutical sales representative may need to know the latest prescription trends while waiting in a physician's office inside a large hospital with limited network access. In both these cases, an offline reporting solution can provide information that is detailed and robust enough to make timely and informed decisions, but also provide that information at a time when the network infrastructure is limited or is not reliable or fast enough to support large data transfers.

The diagram below illustrates the process to provide offline reporting to users with limited network infrastructure.

The data for the offline reporting solution is staged in a data mart which will serve as the source of the offline reporting solution. This solution can be adapted to any new or existing data mart.

A SQL Server Integration Services (SSIS) package is used to create SQL Server CE (Compact Edition) databases for each employee that will contain data that is relevant to that employee. The table structure is identical to the source data mart, however the file size of each SQL Server CE database is only a fraction of the source data mart since the data is being filtered to include only data that is relevant to that employee. For example, if a sales representative is assigned to the East region, the sales representative will see only data from the East region while a sales representative assigned to the West region will see data only from the West region.

The SQL Server CE database files can be synchronized to a user's mobile device when a network connection is available using a push or pull method. Users can be directed to download the file from SharePoint or the files can be pushed directly to the users using Sybase's Afaria or Microsoft's ActiveSync when new data is available. Microsoft Excel can then connect to the SQL Server CE database and be used as a front end reporting template. The Excel report templates should be prebuilt to be functional, but also to provide a rich and familiar user experience.

This offline reporting solution gives the users the ability to view customized data anytime, anywhere and are unaffected by the quality of the network infrastructure.

During implementation of this solution I have learned a few important lessons that are worth mentioning.

NLS Sort – SQL Server CE relies on the National Language Support (NLS) collation tables of an operating system. Different operating systems (Windows 2003, Windows 2008, Windows XP, or Windows 7) have different NLS sort orders. Because SQL Server CE is designed to be lightweight and run on a mobile device there are some tradeoffs. NLS Sort is one of them. The server edition of SQL Server uses a different architecture to support different sort orders and therefore has no l dependency on the operating system. The difference in NLS sort comes in to play when a SQL Server CE database is created on Windows 2008 and then sent to a user's mobile device which uses another operating system. It is assumed that the user's mobile device is another operating system such as Window XP, Windows 7, or Windows CE and not Windows 2008. The server operating system, Windows 2008, and the operating system of the user's mobile device would have different NLS Sort orders. This difference requires that SQL Server CE rebuild all the indexes in the database the first time the database is opened. This action only happens when there are different NLS sort orders. This can lead to lengthy opening times (up to 8 minutes when opening a 1GB SQL Server CE database) when an end user first opens the report which is reading the SQL Server CE database. There are viable workarounds and ways to minimize opening times.
64bit vs 32 bit – Most new SQL Servers today are 64bit which is great. SQL Server CE runs is a 64bit environment which is great too. So what is the issue? The issue is that if you are going to use SSIS and connect to SQL Server CE you will need to use the 32bit OLE provider. This means your SSIS package must run in 32bit mode on a 64bit server, which is not ideal. Further complexity is added if your data mart is not SQL Server and need to install a 32bit provider to connect to your Oracle, IBM, or Sybase data mart. Hopefully Microsoft will release a 64bit OLE provider for SQL Server CE.
Multithreaded approach – Consider early on to run to design this process to run multiple SSIS packages simultaneously. I have discovered that running 8 of the same SSIS packages (with different parameters) gave the fastest end to end completion time when creating 2000+ SQL Server CE databases. Running sequentially would have taken 24 hours but running 8 SSIS packages in parallel reduces the end to end turnaround time to 3 hours.
Limitations of SQL Server CE - Keep in mind that SQL Server CE has a 4 GB limit and supports most standard SQL, there are some limitations in the SQL that is supported. For example, the PIVOT statement is not supported.

Friday, April 30, 2010

Allocate a Measure to a Lower Level Using a Many to Many Relationship in Analysis Services 2008

I had a request to allocate a measure down to a lower level based on a percentage. A former colleague of mine called this prorating. I am not sure the proper name, but this is common in the media/advertising industry. In the past I have done this in the database by joining the original row to the lower level percentages and exploding the fact rows. The problem with doing this in the database is that the number of fact table grows quickly and becomes unmanageable. I wanted to use Analysis Service do the allocation down to the lower level and avoid the explosion of fact rows in the database. I found documentation from Microsoft here. It was a good starting point and gave me the confidence that this task could be accomplished by Analysis Services, but it did not include the custom rollup script that is needed in order to aggregate the totals properly. The purpose of this article is to build on the Microsoft documentation by going a bit further in explaining the solution.

The example below shows the basic idea of what I wanted to achieve. I wanted to allocate a single row (shown below as 2000 impressions) by a given percentage of an age demographic. I want to determine the allocated value by multiplying the percentage by the impressions.

Impressions	Age Demographic	Age Demographic Percentage	Allocated Value
2000	2-11	5%	100
	12-17	10%	200
	18-24	25%	500
	25-34	25%	500
	35-49	10%	200
	50-64	10%	200
	65+	15%	300

To start I created a main fact table called Fact Advertiser and a related detail table called Fact Advertiser Detail in order to handle the one to many relationship.

CREATE TABLE [dbo].[FactAdvertiser](

[DisplayScreenID] [int] NULL,

[DateId] [int] NULL,

[DeliveredImpressions] [bigint] NULL

) ON [PRIMARY]

CREATE TABLE [dbo].[FactAdvertiserDetail](

[DisplayScreenID] [int] NOT NULL,

[DateId] [int] NOT NULL,

[AgeId] [int] NULL,

[AdjustedPct] [numeric](4, 0) NULL

) ON [PRIMARY]

The primary key on the Fact Advertiser table is Display Screen ID and Date ID. The primary key on the Fact Advertiser Detail was DisplayScreenId, DateId, and AgeID.

A many to many relationship was defined in the cube. In the diagram below, the FactAdvertiserDetail serves as the intermediate measure group and DimDisplayScreenID is the intermediate dimension. The intermediate dimension serves as a bridge between the two measure groups involved in the many to many relationship.

Once the many to many relationship was defined the relationships looks as follows.

The two scripts were added to the cube in BIDS under Calculations then Script View. The first one creates the calculated measure.

CREATE MEMBER CURRENTCUBE.[Measures].AdjustedDeliveredImpressions

AS [Measures].[Delivered Impressions] * ([Measures].[Adjusted Pct] ),

FORMAT_STRING = "Standard",

VISIBLE = 1 , ASSOCIATED_MEASURE_GROUP = 'Fact Advertiser Detail' ;

The second script is a custom rollup formula and was added just after the script above. The purpose of the script is to override the normal aggregation of our calculated measure (AdjustedDeliveredImpressions) and perform the calculation on the lowest level BEFORE the aggregation occurs. Without the custom rollup formula the calculated member would happen AFTER the aggregation and give us erroneous results, especially when querying on Age Demographic without using the intermediate dimension (Dim Display Screen).

SCOPE([Measures].[AdjustedDeliveredImpressions]);

THIS=IIF(

IsLeaf([Dim Display Screen].[Display Screen Id].CurrentMember)

,[Measures].[AdjustedDeliveredImpressions]

,SUM

([Dim Display Screen].[Display Screen Id].CurrentMember.Children

,[Measures].[AdjustedDeliveredImpressions])

);

END SCOPE;

After reprocessing the cube, I was able to verify the results by connecting to the cube with Excel and create this pivot table.