Hello everyone, this is Doktor Wind. In today's data-driven business environment, data governance has become a key factor in an organization's success, and data lineage is in turn one of the keys to successful data governance.
In this article we look in detail at the characteristics of data lineage, then compare and contrast data lineage with the related concepts of data relationships, data classification, data provenance, and knowledge graphs.
This article is a set of reading notes on the book "Data Lineage Analysis Principles and Practices". Some of the ideas are taken from the original book; for a more detailed treatment, please read the book and support the hard work of its author.
The mind map for this article is shown below:
In the field of data governance, data lineage is a core concept that describes the entire lifecycle of data from its source to its end use: where it comes from, how it changes, and where it goes. Understanding the characteristics of data lineage and its relationship to other concepts is essential for data management and data governance. In this article, we detail the five main characteristics of data lineage (stability, attributability, multi-sourcing, traceability, and hierarchy) and explore how data lineage relates to and differs from data relationships, data classification, data provenance, and knowledge graphs.
I. Characteristics of data lineage
- stability
Stability refers to the persistence and consistency of data lineage information throughout data processing. In data governance, stable lineage information helps organizations track how data changes, keeps data processing transparent and visible, and reduces the risk of data being lost or misrouted. This makes data lineage an important tool for compliance and auditing. Stability means that lineage information is not disturbed by frequent system changes or data updates, so it provides a consistent and reliable record of data flow over a long period of time.
- attributability
Attributability refers to the ability of data lineage to clearly indicate where data came from and where it is going, including how the data changes at each stage of processing. This characteristic helps data managers understand how data flows and is transformed throughout its lifecycle, which supports data accuracy and completeness and in turn makes data-driven decisions more reliable. With attributability, each data point can be traced back to its source: how it was generated, what processing it went through, and where it ended up. This transparency is critical for data governance and data analytics.
- multi-sourcing
Multi-sourcing reflects the fact that data lineage can span multiple data sources and systems. In modern enterprises, data usually comes from many heterogeneous systems and sources. By integrating and analyzing lineage across these sources, data lineage provides a comprehensive view and helps enterprises better understand and use their data resources. Multi-sourcing covers not only the diversity of data sources but also the flow and interaction of data between systems, which is important for building a global view of data and for cross-system analysis.
- traceability
Traceability is the ability of data lineage to record and track how data is generated, modified, and used. This characteristic is critical for data quality management, data security, and data compliance. With traceability, organizations can identify and resolve data issues and detect data tampering and misuse. Every data operation is recorded and can be queried, so each processing step can be retraced when needed to understand how data got from its source to its current state.
- hierarchy
Hierarchy means that data lineage information can be presented at different levels, from the macro system level down to the micro field level. This layered view helps data managers analyze and understand data flows at each level and supports flexible querying and analysis. Hierarchy lets data governance drill down progressively from a global view to specific details, so lineage information can serve needs at every level and provide more accurate and comprehensive governance support.
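To make the idea of retracing steps concrete, here is a minimal sketch in Python. It is not taken from the book; the `LineageStep` record, the `trace_back` function, and the dataset names are all hypothetical, and it assumes each dataset is produced by at most one recorded step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    """One recorded processing step (hypothetical record layout)."""
    output: str       # dataset produced by the step
    inputs: tuple     # datasets the step read from
    operation: str    # human-readable description of the processing


def trace_back(steps, target):
    """Return every recorded step that contributed to `target`, nearest first."""
    produced_by = {step.output: step for step in steps}
    history, queue, seen = [], [target], set()
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue
        seen.add(node)
        step = produced_by.get(node)
        if step is None:
            # No recorded producer: treat the node as an original source.
            continue
        history.append(step)
        queue.extend(step.inputs)
    return history


# Illustrative lineage log; all names are made up.
steps = [
    LineageStep("staging.orders", ("crm.orders",), "nightly extract"),
    LineageStep("dw.orders_clean", ("staging.orders",), "deduplicate and cast types"),
    LineageStep("report.daily_sales", ("dw.orders_clean", "dw.calendar"),
                "join and aggregate by day"),
]

for step in trace_back(steps, "report.daily_sales"):
    print(f"{step.output} <- {', '.join(step.inputs)}: {step.operation}")
```

Reading the printed lines from top to bottom walks from the report back toward the source system, which is the kind of question an auditor or a data quality investigation typically asks.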
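The drill-down idea can be sketched as a roll-up over field-level lineage edges. This is only an illustration: the `roll_up` function and the `system.table.field` naming scheme are assumptions made for the example, not something the book prescribes.

```python
def roll_up(edges, level):
    """Collapse field-level lineage edges to a coarser level.

    Names are assumed to look like "system.table.field":
    level=1 keeps the system, level=2 keeps system.table,
    level=3 keeps the full field name.
    """
    def truncate(name):
        return ".".join(name.split(".")[:level])

    rolled = set()
    for source, target in edges:
        s, t = truncate(source), truncate(target)
        if s != t:
            # Skip edges that turn into self-loops after collapsing.
            rolled.add((s, t))
    return sorted(rolled)


# Illustrative field-level edges; all names are made up.
field_edges = [
    ("crm.customers.email", "dw.dim_customer.email"),
    ("crm.customers.name", "dw.dim_customer.full_name"),
    ("dw.dim_customer.email", "bi.churn_report.contact"),
    ("dw.dim_customer.email", "dw.dim_customer.email_domain"),
]

print(roll_up(field_edges, 2))  # table-level view
print(roll_up(field_edges, 1))  # system-level view
```

Storing lineage at the finest level and deriving the coarser views on demand keeps a single source of truth, at the cost of recomputing the roll-up for each query.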
II. Concepts related to data lineage
Data Lineage and Data Relationships
Data relationships describe the associations and interactions between data entities. Data lineage is closely related to data relationships because it documents how data flows and changes between entities and systems. For example, in a data processing chain, data lineage may show the transformation from one database table to another, while a data relationship describes the association between those tables. Data lineage therefore provides a basis for understanding and analyzing data relationships.
Data relationships typically include hierarchical, referential, and dependency relationships between entities, which form the basis for how data flows and interacts within a system. Data lineage refines these relationships by describing the specific paths along which data flows. For example, data lineage can show how a field is derived from one table and ultimately stored in another, and this kind of detailed record helps organizations better understand how data relationships are actually implemented.
Data Lineage and Data Classification
Data classification is the process of organizing and grouping data for ease of management and use. Data lineage intersects with data classification because lineage information helps identify and label the categories and attributes of data. With data lineage, organizations can track the origin and change path of specific categories of data to keep classification accurate and consistent. In turn, classification results provide context for data lineage and make data flows and transformations easier to interpret.
Data is often classified by sensitivity, purpose of use, source, and so on, and this classification information can be reflected in lineage records. For example, the processing path of sensitive data can be specially labeled and tracked to ensure that privacy and security regulations are followed during processing. Classification information in lineage records also helps organizations manage and control different categories of data in a more targeted way during data governance.
Data Lineage and Data Provenance
Data provenance refers to the origin and history of data, including how it was generated, collected, processed, and stored. Data lineage and data provenance are closely related because data lineage records the entire journey of data from source to end use and is a concrete embodiment of data provenance. Through data lineage, organizations can understand in detail how data was generated and how it changed, which supports the reliability and trustworthiness of the data.
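One way to picture classification labels riding along the lineage is a simple downstream propagation. The sketch below is illustrative only: `propagate_labels`, the `PII` tag, and the field names are hypothetical, and it assumes a label always carries over to derived data. A real pipeline might stop propagation at a masking or anonymization step.

```python
from collections import defaultdict, deque


def propagate_labels(edges, labels):
    """Copy each classification label to everything downstream of the labeled node."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    result = defaultdict(set)
    for start, tags in labels.items():
        result[start] |= tags
        queue, seen = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nxt in downstream[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    result[nxt] |= tags
                    queue.append(nxt)
    return dict(result)


# Illustrative lineage edges and a single labeled source field.
edges = [
    ("crm.customers.email", "dw.dim_customer.email"),
    ("dw.dim_customer.email", "bi.churn_report.contact"),
    ("erp.products.sku", "dw.dim_product.sku"),
]
labels = {"crm.customers.email": {"PII"}}

for node, tags in sorted(propagate_labels(edges, labels).items()):
    print(node, sorted(tags))
```

Fields that never appear downstream of a labeled source, such as the product SKU here, stay unlabeled, which is what makes targeted control of sensitive data possible.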
Data provenance focuses on the "past" of the data, i.e., where the data came from and what processing steps it has undergone. Data lineage focuses on the "past" as well as the "present" and "future" of the data, i.e., the current state and future direction of the data. The combination of the two provides a complete view of the data lifecycle, helping organizations to fully understand the history, current status and expected flow of data, and providing a solid foundation for data governance and decision-making.
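The "past" versus "future" distinction maps naturally onto the two directions of a lineage graph. As a rough sketch (the `reachable` function and the dataset names are assumptions for illustration), walking edges backward answers the provenance question, while walking them forward shows where the data goes next:

```python
def reachable(edges, start, direction):
    """Collect nodes reachable from `start`, walking "upstream" or "downstream"."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    neighbors = {}
    for source, target in edges:
        if direction == "downstream":
            neighbors.setdefault(source, []).append(target)
        else:
            neighbors.setdefault(target, []).append(source)

    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    found.discard(start)  # only relevant if the graph contains a cycle
    return found


# Illustrative lineage edges; all names are made up.
edges = [
    ("crm.orders", "staging.orders"),
    ("staging.orders", "dw.orders_clean"),
    ("dw.orders_clean", "report.daily_sales"),
    ("dw.orders_clean", "ml.demand_forecast"),
]

print(sorted(reachable(edges, "dw.orders_clean", "upstream")))    # the "past"
print(sorted(reachable(edges, "dw.orders_clean", "downstream")))  # the "future"
```

The downstream direction is what supports impact analysis: before changing a table, you can list everything that depends on it.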
Data Lineage and Knowledge Graph
A knowledge graph (KG) is a graph structure that represents entities and their interrelationships for organizing and querying knowledge. Data lineage and knowledge graphs are connected but distinct. Both deal with relationships and the flow of data and information, but with different emphases: data lineage focuses on how data is processed and where it flows, while knowledge graphs focus on organizing and representing entities and their relationships. Data lineage information can nonetheless be an important data source for building knowledge graphs, since it helps describe the associations and flows between data entities and so enriches the content and application scenarios of a knowledge graph.
Knowledge graphs usually contain rich semantic information representing complex relationships between entities, such as contextual, associative, and causal relationships. Data lineage information gives a knowledge graph specific records of data flows and changes, so the graph can represent not only static relationships between entities but also the dynamic movement of data along those relationships. For example, by integrating data lineage information, a knowledge graph can show the change path of a data entity across processing stages and its interaction with other entities, providing a more comprehensive and dynamic representation of knowledge.
Now that the concept of data lineage itself is reasonably clear, how does data lineage relate to the other elements of data governance?
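As a small illustration of lineage feeding a knowledge graph, lineage records can be flattened into subject-predicate-object triples, the basic unit most knowledge graphs are built from. Everything in this sketch is hypothetical: the record layout, the predicate names `derivedFrom` and `producedBy`, and the helper functions are chosen for the example only.

```python
def lineage_to_triples(steps):
    """Turn lineage records into (subject, predicate, object) triples."""
    triples = []
    for step in steps:
        for source in step["inputs"]:
            triples.append((step["output"], "derivedFrom", source))
        triples.append((step["output"], "producedBy", step["job"]))
    return triples


def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Illustrative lineage records; all names are made up.
steps = [
    {"output": "dw.orders_clean", "inputs": ["staging.orders"], "job": "clean_orders"},
    {"output": "report.daily_sales", "inputs": ["dw.orders_clean", "dw.calendar"],
     "job": "build_daily_sales"},
]

triples = lineage_to_triples(steps)
# Which entities are derived directly from dw.orders_clean?
print(query(triples, predicate="derivedFrom", obj="dw.orders_clean"))
```

Once lineage is in triple form, it can sit alongside other facts about the same entities, such as owners or business definitions, and be queried with the same pattern matching.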
We begin the next chapter by understanding the connection between data lineage and metadata, master data, business data, and metrics data.
We'll see you in the next chapter!