Change Data Capture detects changes to data. In the case of deleted data, the question is: How should we manage deletions in Data Vault?
FastChangeCoTM's Data Management Center of Excellence (DMCE) team [link] is currently working on connecting a new operational system. This already provides delta records that are created when changes are captured.
The team is currently discussing a problem that affects precisely these deltas. It is not always clear what should occur with data in the event of a delivered delete record in the Data Vault: Delete the data logically (not physically) or not?
Zalika Okoro, a working student on the DMCE team, is struggling with this problem at the moment.
She asks the team questions such as, "What is a change record? And why exactly are we having trouble with this?"
Amal Leyla Kasim, a data modeler on the team, is happy to explain Zalika what Change Data Capture (CDC) is, and what it's good for.
What is a CDC?
A CDC mechanism of an operational system (native CDC) continuously identifies and records changes to data. By this we mean the insertions, updates and deletions of data, hereafter collectively referred to as changes, a ‘change’.
The CDC mechanism detects changes by comparing, in simple terms, the old and the new data set based on a defined key. In the data structures of the operational system, this is usually the physical primary key of a table. If the data of the non-key data elements are modified, the CDC detects a ‘change' and transmits the modified data record complete with all data elements.
In the case of a deleted data record, a 'delete', the native CDC often only transmits the data element of the key on the basis of which the change was detected. The information about what data other than the key was deleted is not relevant to the native CDC. Some CDC systems provide not only the key of the deleted record, but a complete record with all information.
Both variants can lead to problems with a target Data Vault data model. What may or must be deleted? After all, the data modelers have put a lot of effort into organizing the data according to business objects.
"But why are we using a CDC if it then leads to problems?" interjects Zalika. "That's a good question, Zalika!" So Amal explains to her why FastChangeCoTM prefers to use CDCs.
What is the benefit of a CDC?
Several decades ago, Jeff Jones (IT Manager) introduced the first CDC systems at FastChangeCoTM.
For Jeff, it was no longer acceptable that operational systems were slowed down, or sometimes even blocked, by analytical queries. The first data warehouses for data analysis were built at FastChangeCoTM during this time, partly because of this.
However, this required copying data to other platforms.
These full or delta loads and, if required, (full) reloads with the latest data, which were necessary for the analysis systems, were time-consuming for all systems involved. They also consumed a lot of computing power.
The new CDC features introduced by Jeff enabled a much more efficient data transfer. Only the actual changes were transferred between the operational and analytical systems from now on, which significantly reduced the load on the operational systems.
From today's perspective, CDC mechanisms are considered important to enable scalability, efficiency, and satisfy real-time data availability requirements without impacting the systems involved.
From the CDC Interface to the Data Vault
To analyze the problem how data in the Data Vault should be managed in the event of a deleted record, the DMCE team creates an example CDC incoming interface (Fig. 1). This contains data about the employees as well as the department in which they work. By using the 'Employee Number' key, the CDC mechanism compares the data elements.
Fig. 1: Structure of the input interface with column names and data types.
The sample records in Fig. 2 show that the employees Jack, Terence and Amal are assigned to the different departments Sales and DMCE. Since at the time (Inscription Time) all records were newly created in the source system, they have received the Change Data Indicator ‘insert’ by the CDC.
Note: For better understanding I have used 'insert' instead of 'I', 'change' instead of 'C', 'delete' instead of 'D' in the figures. I recommend using 'I', 'C' and 'D' to save storage space.
Fig. 2: CDC incoming interface with sample data sets over a period of time.
Based on this example CDC incoming interface, the DMCE team first develops a business data model. Based on the business data model, a physical Data Vault data model is instantiated.
To show how the data should be distributed to the various Data Vault data objects, Amal visualizes this with the following diagram (Fig. 3).
Fig. 3: Mapping of the CDC incoming interface to Data Vault.
Amal uses Fig. 4. to explain to Zalika the resulting Data Vault data model and how the data is distributed and loaded from the CDC dataset.
Fig. 4: Data Vault data model with sample data.
Amal then explains the two variants of a delete record that the CDC mechanism can generate. On the one hand, an empty data set in which only the key is supplied (Fig. 5) and a data set that contains all deleted values in addition to the key (Fig. 6).
Fig. 5: CDC Incoming Interface with an example of an empty record for deletion.
Fig. 6: CDC incoming interface with an example of a deleted record containing (supposedly) deleted values.
The DMCE team discusses whether it is necessary to delete all the data in the target provided by the delete record in the Data Vault data model. Especially considering the previous distribution when loading the data into the Data Vault. After all, a delete record is generated from the operational system for all attributes (Change Data Indicator = Delete, Fig. 7).
Fig. 7: CDC incoming interface with an example of a data set to be deleted and its mapping to the Data Vault objects.
If, under this assumption, all data in the Data Vault were actually logically deleted, this would mean that not only Jack Jogger would be logically deleted, but also the Sales department (Sales). Amal illustrates the described consequences of this discussion to Zalika with the following diagram (Fig. 8).
Fig. 8: Data Vault data model with sample logically deleted data.
Zalika now understands the delete record problem and the question that arises:
What should be (logically) deleted in the Data Vault? Should data in both satellites 'Sat Employee' and 'Sat Department' be deleted? That is, just like loading the data from the interface into the Data Vault data model?
The 3rd normal form as solution!
To find a solution, the DMCE team turns to their coach Diego Pasión and describes the problem to him. Diego suggests applying the normal form (link to Wikipedia ) to the interface to find out which data/columns actually depend on the key 'Employee Number' used to identify the changes and which do not.
"Each data element must provide a fact about the key (1.NF), a fact about the whole (minimal) key (2.NF), and a fact about nothing but the key (3.NF)." Diego reproduces a well-known quotation from Edgar F. Codd in a slightly modified form.
"The example CDC incoming interface (Figs. 1 and 2) is in the first normal form. This is perfectly fine for loading data into a Data Vault, since the methods and concepts of the Data Vault (separation of business key, relation and descriptive context) can handle exactly this very well. When deleting data, on the other hand, a structure in 1st normal form can mislead you into making incorrect assumptions (Fig. 5 and Fig. 6.), which can result in significant consequences if you don't think about it for a moment!", Diego notes.
Based on the example created by the DMCE team, it is quite easy and understandable to see that deleting the department in the Data Vault is not effective. Certainly, there are more people working in the Sales Department at FastChangeCo than just Jack Jogger. With many other native CDC interfaces, this is not easily discernible.
"To solve this, it is beneficial to transform the CDC interface from 1st normal form to 3rd normal form," Diego recommends to the team. One of the steps to transform the interface from 1st normal form to 3rd normal form is to identify which data element depends on which key. The data elements 'Employee First Name' and 'Employee Last Name' depend on the key 'Employee Number' or describe it as a fact. The data elements 'Department Name' and 'Department Address', on the other hand, depend on the 'Department Number' key and not on the 'Employee Number' key (Fig. 9).
Fig. 9: Analysis of which data elements depend on which key.
With this analysis, the CDC interface can be transformed into an imaginary 3rd normal form. The DMCE team illustrates this by showing the example data directly in the third normal form (Fig. 10).
Fig. 10: Transformation of the CDC incoming interface into an imaginary / assumed data model of the 3rd normal form.
This representation allows the DMCE team to see which data is deleted in the 3NF data model, since only the 'Employee' data elements are dependent on the 'Employee Number' key.
Fig. 11: The delete record only affects the 3rd normal form table 'Employee' and the record with the 'Employee Number' 4711.
From this, the team, together with Diego, deduces exactly what data needs to be deleted from the Data Vault: Namely, only the data of the 'Sat Employee' and not that of all satellites (Fig. 12).
Fig. 12: The delete record only affects satellite 'Sat Employee'.
From the data model of the 3rd normal form (Fig. 11) it can be deduced that the relation (the foreign key) could also be deleted and thus an end dating satellite is needed at the link (Fig. 13). Since the link in the Data Vault corresponds to the foreign key of the table 'Employee' in the 3rd normal form.
Fig. 13: The delete record affects the satellites 'Sat Employee' and 'Sat End-Dating'.
Diego Pasión recommends not to do this. The information about the relationship between 'Employee' and 'Department' is not reliable in case of a delete record (the key that leads to a deletion in the CDC is different from the key pair ('Employee Number', 'Department Number') for the relationship). The information that the employee is no longer assigned to the department or no longer works there is recorded by the 'Sat Employee'.
Of course, there are situations where exactly this satellite is needed to terminate a relation, but that is another discussion and not part of this blog post.
However, there are quite a few other experiences that Diego Pasión has had. More on that in an upcoming article in this series. Be sure to check back for more.