
Centralized Dataset Governance (Buhler, Erl, Khattak)

How can a variety of datasets stored in a Big Data platform be governed efficiently and in a standardized manner?


Problem

Successful, value-bearing analysis of data using Big Data technologies requires continuous governance of that data from acquisition to archival. In a Big Data environment this can become a daunting task because of the high variety of data and the unforeseen ways in which it is used.

Solution

Data governance is centralized, and a system is introduced that automates data governance tasks, including data lifecycle management, data access auditing and data lineage identification.

Application

A component is introduced within the Big Data platform that provides a centralized interface for authoring policies, enforces policies, centralizes audit tracking and maintains lineage.

A data governance manager is used to centralize data governance tasks. Through a graphical user interface, it allows metadata to be viewed and searched, which helps with dataset discovery and with retaining a single copy of each dataset rather than creating duplicates. A lineage viewer shows which data processing operations, such as queries, make use of a dataset and its constituent elements. The data governance manager centralizes the configuration of dataset access logging, such as which details should be recorded when a client tries to access a dataset, and provides a means of viewing the dataset access log. For the enforcement of dataset policies, such as how long a dataset should be kept and when the data should be archived, the data governance manager provides an interface for authoring policies that are then executed automatically, generally through a workflow engine.
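The responsibilities described above (metadata search, lineage, and centrally configured access logging) can be illustrated with a minimal Python sketch. It is not taken from the pattern itself: the class names `DatasetEntry` and `GovernanceCatalog`, the tag-based search, and the `logged_fields` setting are all hypothetical, and a real data governance manager would persist this metadata rather than hold it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetEntry:
    """Metadata held for one registered dataset (hypothetical fields)."""
    name: str
    location: str
    tags: set = field(default_factory=set)
    used_by: list = field(default_factory=list)      # lineage: operations that use the dataset
    access_log: list = field(default_factory=list)   # one dict per access attempt


class GovernanceCatalog:
    """Toy, in-memory stand-in for the data governance manager's metadata store."""

    def __init__(self, logged_fields=("client", "action", "time")):
        self._entries = {}
        # Central configuration of which details each access record keeps.
        self.logged_fields = tuple(logged_fields)

    def register(self, name, location, tags=()):
        # Rejecting a second registration keeps one tracked copy per dataset.
        if name in self._entries:
            existing = self._entries[name].location
            raise ValueError(f"{name!r} is already registered at {existing}")
        self._entries[name] = DatasetEntry(name, location, set(tags))

    def search(self, tag):
        """Return the names of all datasets carrying the given tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def record_usage(self, name, operation):
        """Note that a processing operation (for example a query) uses the dataset."""
        self._entries[name].used_by.append(operation)

    def lineage(self, name):
        """Return the operations recorded as using the dataset."""
        return list(self._entries[name].used_by)

    def log_access(self, name, client, action, time=None):
        """Record an access attempt, keeping only the configured details."""
        details = {"client": client, "action": action,
                   "time": time or datetime.now(timezone.utc)}
        record = {k: v for k, v in details.items() if k in self.logged_fields}
        self._entries[name].access_log.append(record)

    def access_log(self, name):
        """Return a copy of the dataset's access log."""
        return list(self._entries[name].access_log)


# Example usage with made-up dataset, client, and query names.
catalog = GovernanceCatalog(logged_fields=("client", "action"))
catalog.register("Dataset A", "NoSQL A", tags={"sales"})
catalog.record_usage("Dataset A", "query: monthly_totals")
catalog.log_access("Dataset A", client="analyst-1", action="read")
print(catalog.search("sales"))
print(catalog.lineage("Dataset A"))
print(catalog.access_log("Dataset A"))
```

Keeping the logged fields as a single catalog-level setting mirrors the idea that the access logging configuration is defined once, centrally, instead of per client or per dataset store.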

Figure: Centralized Dataset Governance (1.png)
  1. A system administrator needs to implement two different data management policies: a retention policy that dictates that a dataset should only be retained for 180 days and a replication policy that dictates that Dataset A needs to be copied from NoSQL A to NoSQL B every 7 days.
  2. The system administrator uses the interface provided by the data governance manager to add the retention policy and the replication policy.
  3. The data governance manager automatically generates a retention script from the retention policy.
  4. The data governance manager automatically executes the retention script on the specified NoSQL database every 180 days to delete the dataset.
  5. The data governance manager automatically generates a replication script from the replication policy.
  6. The data governance manager then automatically executes the replication script every 7 days to copy Dataset A from NoSQL A to NoSQL B.
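The six steps above can be sketched in Python as policies expressed as data, a script generated from each policy, and a schedule derived from the policy's interval. This is only an illustration under stated assumptions: the `delete` and `copy` command syntax is invented, the names `RetentionPolicy`, `ReplicationPolicy`, `generate_script`, and `run_dates` are hypothetical, and applying the retention policy to Dataset A on NoSQL A is an assumption, since the scenario does not name the dataset or database it targets. In practice a workflow engine would trigger the scripts; here `run_dates` merely lists when those runs would fall.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    dataset: str
    store: str
    retain_days: int   # the dataset is deleted after this many days


@dataclass(frozen=True)
class ReplicationPolicy:
    dataset: str
    source: str
    target: str
    every_days: int    # the dataset is copied at this interval


def generate_script(policy):
    """Render a policy as a one-line script in a made-up command syntax."""
    if isinstance(policy, RetentionPolicy):
        return f"delete --store '{policy.store}' --dataset '{policy.dataset}'"
    if isinstance(policy, ReplicationPolicy):
        return (f"copy --dataset '{policy.dataset}' "
                f"--from '{policy.source}' --to '{policy.target}'")
    raise TypeError(f"unsupported policy type: {type(policy).__name__}")


def interval_days(policy):
    """Number of days between two executions of the policy's script."""
    if isinstance(policy, RetentionPolicy):
        return policy.retain_days
    return policy.every_days


def run_dates(policy, start, horizon_days):
    """Dates within the horizon on which the policy's script would run."""
    step = interval_days(policy)
    return [start + timedelta(days=d) for d in range(step, horizon_days + 1, step)]


# Steps 1-2: the administrator authors the two policies.
retention = RetentionPolicy("Dataset A", "NoSQL A", retain_days=180)   # target is assumed
replication = ReplicationPolicy("Dataset A", "NoSQL A", "NoSQL B", every_days=7)

# Steps 3-6: a script is generated per policy and scheduled at its interval.
for policy in (retention, replication):
    print(generate_script(policy))
    print(run_dates(policy, start=date(2024, 1, 1), horizon_days=30))
```

Treating policies as plain data keeps authoring separate from execution: the same policy record drives both the generated script and its schedule, so changing an interval requires editing only the policy.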