From Sprawl To Structure

Man Up Utility Pole Surrounded By Messy Wiring — Photo by weyfoto loh on Unsplash

Most BI teams hit this problem eventually. What begins as innocent independence turns into a maintenance time-bomb. This post walks through the strategy I used to get from sprawl to sense.

The Problem

I once inherited a suite of 8-10 Power BI reports which, because they were reporting on different areas of the business, were initially expected to be independent of one another. They had therefore been built as such with a separate semantic model each.

However the business matured and user feedback created a problem: report sprawl. The same visuals, the same KPIs and the same underlying data logic began to appear in multiple reports. A change made in one report had to be manually replicated in other reports. A process that was both tedious and prone to error. Our decentralised approach was creating inconsistency and slowing down output.

So the crucial question was posed:

Could we centralise the data and eliminate this risk and duplication of effort?

The Analysis

My first step was a spike to analyse the trade-offs of moving to a single, universal semantic model.

Advantages Of A Single Dataset

Centralising all the different semantic models into one universal model brings the following advantages:

A single source of truth
Reduced maintenance
Reusability
Performance
Scalability

A Single Source of Truth

All business logic and metrics live in one place and one place only. This ensures consistency of approach across all the reports downstream from the semantic model.

Reduced Maintenance

Any enhancements or fixes to current logic only need to be made in one central place, not across dozens of semantic models.

Reusability

Any of the metrics within the central model are available to be used in all or any (new) report, now or in the future.

Performance

One central and well-optimised model can perform better than lots of smaller disparate models. Especially at scale. Performance is also related to cost so faster performance can also mean lower cost.

Scalability

If more reports are needed in future and the chance of overlap between reports remains high, this central approach will lead to less effort being expended.

Disadvantages Of A Single Dataset

Centralising the existing report and dataset sprawl does come with some disadvantages:

Refactor Effort
Disruption Risk
Testing Effort
Complexity
Historical Inaccuracies

Refactor Effort

Consolidating a multi model setup to a single all-encompassing dataset is no trivial task. The work involved in doing it and doing it accurately needs to be considered.

Disruption Risk

While this work is being done, the velocity of output for existing reports and / or future reporting requirements will drop. Additionally there may also be disruption to the existing suite of reports.

Testing Effort

This is a major change and so will need to be validated / tested to ensure it is accurate.

Complexity

Depending on the scope of any future work, a single model may become large and difficult to manage.

Historical Inaccuracies

Combining all models into one may uncover discrepancies in calculations across reports leading to the uncomfortable truth that a value may have been inaccurate in an existing report.

Assumptions

During the spike investigation I made some assumptions as follows:

Although longer term we were planning to put a SQL backend on this, this was not considered for the initial centralisation task. Instead we wanted to keep the existing ETL approach (using Power Query).

We could have considered a halfway house between the two, Dataflows, but to keep things as simple as possible this was also discounted.

Luckily in this particular vertical of the business, we weren't working with Row Level Security (RLS) so had no concerns or considerations with RLS for this centralisation exercise.

The Approach

After considering the trade-offs the next step was to come up with a proof of concept approach on how the consolidation to a central model could be achieved.

The first step was to identify a small subset of existing reports which we knew had overlapping logic and / or visuals and consider this subset only.

The process was then, in general terms, the following:

Discovery
Design
Consolidate ETL logic
Create Relationships
Add (DAX) Logic
Publish Central Model
Validate Reports
Archive Reports

These steps are described in further detail below.

1. Discovery

Before doing anything, we first need to understand the scope of the existing data. Things like the following:

Are the data sources consistent throughout the reports? Are there multiple sources we need to account for?
Are there any access / permission issues with the existing data source(s)?
What tables from the source system do we use and which tables appear in multiple reports?
What measures and calculated columns do we use and why? (Potentially add comments to the code to make it easier to consolidate?)
Do we have any field parameters or calculated tables that we need to account for?
Are there any inconsistencies between report elements? (In name / logic / data type)

2. Design

Once the discovery is done, it will be easier to see the scope and envisage how the new Central Model would look, how it would work and what needs to be included.

3. Consolidate Power Query Logic

The first thing we do is look to centralise the data import from the source system:

Inventory all Power Query queries
For a given table which is loaded separately into the various semantic models, compare the PQ queries used:
- Does the logic align or is there conflict?
- Can we consolidate the logic?
- If so, can we clean up the logic (e.g. remove duplication) and put it in one master query?
- If so, add this master query to the Central Model
Rinse and repeat this approach for all queries
Make sure the queries are named clearly and organised into folders (Examples of group names include staging, dimension, fact, Extract, Transform, Load)

4. Create Model Relationships

After the data has flowed in using Power Query, we need to connect it all into a star schema or as close as possible:

Set up a date table
Create any relevant calculated tables and field parameters
Define the required relationships between tables

5. Add DAX Logic

At this point the data tables are in our Central Model and are related as required. Now the calculated elements can be added:

Bring in measures and calculated columns
Schema compare between the Central Model and each of the reports to ensure we have all calculated elements (use ALM Toolkit)
Organise measures, columns and calculated columns into folders for organisational purposes.

6. Publish the Central Model

The Central Model is now good to go and should be published to the relevant workspace.

7. Validate Reports

Clone the existing reports (important to clone and keep existing reports as they are) and then change the data connection to the Central Model for the cloned reports.

Open each of the cloned reports and investigate any broken visuals. Replace any measures which may now have a different name.

Compare metrics on the cloned report to its metric in the original version of the report to see if they match or if there is any difference. This is the ideal point for QA to conduct regression testing.

8. Archive Reports

Once we are happy with the cloned reports that are using the Central Model as their source, we can archive the previous reports and semantic models.

Concerns

Identifying Duplicates

Measures which return identical values and have the same intention may not necessarily be defined identically across different reports. The reason for this is that the context where they are used can provide different effects on the calculation.

For example, if we use the measure in a table where the first column is, say User, the measure will automatically be filtered by user without us having to explicitly state this in the measure (due to context transition).

However if the measure is used in another visual without this implicit User filter, we would need to include this explicitly in the measure definition.

On top of this, whether or not two measures are defined identically across multiple reports, if they are named differently (perhaps one of them has a prefix or a suffix) the tooling won't be able to recognise that they are supposed to be the same.

For these reasons, it can be hard and time-consuming to identify duplicate measures across semantic models.

Ad Hoc Transformations

By implementing a Central Model (through Power Query) and a live connection from the thin reports to the model, we would be unable to make any further transformations downstream.

While this meets Best Practice, reality and experience lead me to suspect that there are times when stakeholders' ad hoc requests mean unique transformations are required on a given report.

In this situation, we would have a couple of options. Either include this unique logic in the Central Model thereby increasing its size and complexity or push the transformations upstream to SQL (or Dataflows).

Performance

Depending on how complex the PQ transformations required are, Power Query is not the fastest to transform data.

Having said that, these transformations are done on load and refresh so will not affect the end user and indeed consolidating to a central model with live connection may mean the reports actually run faster for the user. This is because the dataset is abstracted away to another file essentially.

Agility

Currently small changes can be made directly in the report and turned around quickly for the stakeholders. In addition, because each report is siloed from the others, any changes made don't have an impact outside of the current report.

The concern with the Central Model is that it would take longer for such small changes to be made. Also any changes made might impact on other reports and cause them to break or worse provide misleading metrics.

Bloat

Ideally we need to ensure we only migrate objects which are actually used in the current version of the reports. Objects that were created but have subsequently been abandoned for whatever reason should not be brought across since this will increase the size of the model needlessly.

However identifying what is used and what is no longer required is time consuming and potentially could be done as a subsequent post-consolidation task later down the line.

This applies to objects such as:

Columns
Calculated columns
Calculated tables
Measures

Tooling

In order to do the actual consolidation, I would look at the following tooling:

ALM Toolkit
Measure Killer
Tabular Editor version 2 (free) or version 3 (paid)
TMDL view in Power BI

Practical Approach

In migrations such as this, there is always some low hanging fruit that can be attacked with minimal fuss. The ETL process for certain dimension tables is probably the same or at least extremely similar across semantic models. So this can easily be streamlined into one process in a Central Model.

After this, I would take a report at a time and use ALM Toolkit to compare this report to the Central Model. ALM Toolkit will be able to identify differences relating to:

Power Query
Tables and columns
Calculated tables and columns
Relationships
Measures

Important: ALM Toolkit does this comparison on object name so if the same object has a different name across reports, it will not be able to help.

Any differences identified can then be resolved automatically with a couple of clicks in ALM Toolkit. Unfortunately this excludes the creation or deletion of tables and any Power Query logic changes which can't be done with ALM Toolkit.

However I would look at potentially using the new TMDL view in Power BI itself to create / update tables as this can be done quickly in bulk by pasting the code from ALM Toolkit.

Power Query differences would need to be done in the Power Query UI of the Central Model but again the code can be pasted from ALM Toolkit.

The one drawback is that ALM Toolkit compares one report directly with another so we would need to repeat the comparison between the Central Model and an existing report, one report at a time.

Summary

This is a task that should not be underestimated. The quantity and complexity of objects across a suite of reports will mean this consolidation will need appropriate time to tackle.

Having said that, experience has taught me that although at first the task appears daunting, a careful and methodical approach means the end result is achieved more easily than first imagined.

Final Thoughts

This project lives in the triangle between 3 competing forces:

Consistency - the single source of truth
Agility - the ability to make changes quickly
Complexity - having a large model to manage

A leader's job is not to solve this but to consciously manage the trade-offs. We chose consistency and accepted a short-term cost to agility to attain this.

Consolidating models like this can uncover skeletons in the closet: the pre-existing inconsistencies between models.

Tools like ALM Toolkit and Tabular Editor are force multipliers. They are what separate a systematic, methodical approach from a chaotic one.