When Deletion and Archiving Look The Same

Skills demonstrated: Root cause analysis, ambiguity identification, cross-system investigation, data lifecycle assessment, risk-based prioritisation.

Outcome: Discovered that the downstream data warehouse deletion logic treats archived records as deleted, creating a data integrity risk. No solution was implemented because the investigation raised a number of questions that need upstream clarification before a reliable solution can be defined.

Context

In this organisation, “Core” refers to the source transactional system (Dataverse), while “Analytics” refers to the downstream reporting warehouse.

The “storage solution” is a retention mechanism that archives completed activity records into a Cosmos database after a defined period.

This spike investigates whether the current Analytics logic correctly distinguishes between:

The investigation uncovered ambiguity in how deletion and archival states are inferred downstream, creating potential data integrity and reporting risks.

Original Spike - 2 min read (Edited for clarity and minor redactions. Core structure remains faithful to the original.)

Objective

The objective of this spike was to review and understand the storage solution and raise any questions or comments to discuss prior to next steps.

Observations

The current process does not distinguish between archived activity records and deleted activity records. Instead all records no longer in Dataverse are considered as deleted. This creates ambiguity between different types of records:

Therefore there is a very real risk that we are marking activity records as deleted and anonymising them when in fact they are not hard deletions but archived records instead.

It is not clear whether Core itself tracks the difference between activity records that have been archived and records that have been hard deleted. Or whether this distinction is something Analytics will have to infer.

If Cosmos records cannot be altered (immutable), they provide a robust point-in-time log of archived activity. If however they are mutable, they may not be trustworthy to be used in Analytics logic.

Existing Logic

As I understand it, the existing logic is as follows:

Questions

  1. What is the mechanism in Core to distinguish between completed records and deleted records?
    • Is it a change of state?
    • Is it a change of status?
    • Does an audit table get updated?
  2. If so, does Core expose this distinction to downstream systems? (Can we make use of this mechanism in Analytics?)
  3. If there is no mechanism and we are looking to infer record status from timestamps, what logic and which timestamp columns represent a completed record, a deleted record, a copied record?
  4. When the record is deleted from Cosmos before the end of the retention period, does Analytics process this case correctly? i.e. it is treated as a completed record prior to the deletion and then treated as a deleted record after.
  5. When a record is completed (and not deleted), is this treated correctly in Analytics?
  6. Judging by the examples it appears that Cosmos records are immutable but for completeness I would want that clarified.
  7. Are there any other instances / states that lead to activity records being copied to Cosmos e.g. if they are cancelled? And if so, do we need to factor these in?
  8. How many of the Analytics customers have this storage solution? (How big of a risk is this?)
sketch of 3 scenarios regarding deletions
Simplified lifecycle scenarios illustrating where archived and deleted records become ambiguous in downstream Analytics processing.

What Happened Next?

This piece of work is still ongoing and the team is trying to fully understand the deletion and anonymisation process used in Core. Once this is understood we can then investigate what needs to be done in Analytics to mirror this logic.