The importance of the six “W”s in data traceability – the ability to annotate WHO, WHEN, WHERE, WHAT, WHY and HOW
Reflecting on the facets of data management described above, one thing is certain: every data category and its elements can, in some way, be linked to six basic “W” descriptions that are strongly relevant to the data or its processes.
Pick any of the categories above, and its data elements can always be associated with these six basic “W” descriptions. Yet the data solutions currently on the market either do not cater for all six “W”s or cover them only partially. This is especially evident when one attempts full data annotation and discovers that the WHY and HOW components have been omitted entirely (perhaps deliberately). Perhaps linking even the first four “W”s is already a big challenge for most vendors, who have yet to reach the necessary technical maturity.
In a recent conversation, a data domain expert working in the banking sector pointed out that in the final urgent hours of a data audit, when tracing back individual data events at crucial data points, very frequently only the descriptions associated with WHAT, WHEN, WHERE and WHO were retrievable, while the crucial HOW and WHY elements were missing on most occasions. This Data Omission issue, whether intentional or not, has become a key crisis needing urgent attention.
In other words, when a complete data forensic exercise is requested in order to obtain the “reasoning” behind each crucial data point, only the end results are shown; a detailed explanation of HOW or WHY a certain conclusion was reached is seldom (or almost never) provided. When this “reasoning” is missing, it is often quite hard to “connect the dots” and give an acceptable explanation for full traceability, especially when dealing with potential breaches of serious data compliance requirements.
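To make the omission problem concrete, the following is a minimal Python sketch of an annotation record that carries all six facets and reports which ones were never recorded. The class name `SixWAnnotation`, its field names and the `missing_facets` helper are hypothetical illustrations, not part of any product API.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class SixWAnnotation:
    """Hypothetical record carrying all six "W" facets for one data event."""
    who: Optional[str] = None
    when: Optional[str] = None   # ISO 8601 timestamp kept as text for simplicity
    where: Optional[str] = None
    what: Optional[str] = None
    why: Optional[str] = None
    how: Optional[str] = None

    def missing_facets(self) -> list[str]:
        """Return the names of the facets that were never recorded."""
        return [f.name.upper() for f in fields(self) if getattr(self, f.name) is None]


# An event annotated the way the banking example describes: four facets only.
event = SixWAnnotation(
    who="analyst-17",
    when="2023-03-01T09:30:00Z",
    where="finance/ledger",
    what="adjusted Q4 provision figure",
)
print(event.missing_facets())  # ['WHY', 'HOW']
```

A check of this kind could run at write time, so that an event without its WHY and HOW is flagged when it is created rather than discovered during an audit.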
A great data solution must provide not just partial data lineage but a full suite of data annotation capabilities that supports every ingredient of data lineage required by the complexity now seen across every industry. The ability to “relate” and “link” all relevant data points (or data processes) can be a great asset in urgent times of need, especially if it can be done in real time. The extraction of metadata and its subsequent assignment must offer “extended engagement” with past, present and possibly future events so that full data accountability can be realised.
But how can we achieve all of the above without being seen as an intrusive solution? Most companies want to solve this issue but will not allow data vendors to enter the data gate directly, which presents an even greater challenge.
How FST Network’s solution differs
1. Immediate Challenge – Before & After
Besides considering the challenges and issues mentioned in Part-A, we have also examined in detail the current trends of data management in the market and why these practices are no longer suitable as best practice or are facing obsolescence. The following highlights the general case and discusses how a better solution can be considered.
The Before (Current Practice)
Let’s consider a scenario that involves 3 departments in a company:
· The Business Department
· The Finance Department
· The Compliance Department
In the diagram above, a query is raised by the business department (1) to obtain some financial figures from the finance department (2). It may be a simple request, but because of certain regulatory policies in place, the finance department has to consult the compliance department (3) before disclosing the details.
The Central Data Team needs to liaise with each department’s data gatekeeper and obtain the relevant information from the respective data lake inventory. In turn, the data gatekeeper needs to consult the relevant data supervisor before deciding which information to release (or withhold) in the exchange, so that the department bears the minimum risk and the least responsibility.
Each query point takes the time it needs, and the final answer may take too long to arrive, or the information provided may already be outdated or expired and therefore invalid. This scenario is all too familiar in many companies, yet it is still endured with complexity mounting upon complexity, making operations inefficient and ineffective.
The agony in this scenario is that, on many occasions, the information needed could be prepared rather easily; in some cases the answers could simply be repeated, mirrored or updated, because the relevant data gatekeeper already has the knowledge to address the situation immediately.
Sometimes the exchange of information is even done “off the books” through “private channels” that are not subject to any accountability, which can give rise to further hidden risks and complications. The department that urgently needs answers before its deadline either can no longer wait or has already missed the allowed period, mostly because of delays or late attention.
Can the above scenario be reduced to simple Data Ownership and Data Preparation issues? In a way, this traditional practice of data management looks very much like a “headless chicken” searching for food but unable to consume it directly. If a Data Owner were assigned sole responsibility for the different “data functions” behind a specific information exchange and data flow (depending on their specialised domain know-how), things could perhaps be accomplished more accurately, more efficiently and faster.
Furthermore, when it comes to the audit process, the challenge is inevitably more complicated, as discussed previously: much of the crucial “reasoning” behind each data preparation, from the start to the end stage, may have been omitted, or never considered and recorded at all. Addressing this Data Omission crisis could be a key focus in enabling full data lineage and true traceability.
The After (Future Potential Practice and New Framework)
Here, we propose a potential future data management framework that can be self-sustaining. Instead of having a centralised data team deal with every data gatekeeper and their respective data lakes (or other archives), we adopt a comprehensive Decentralised Data Provisioning framework. It focuses on effective Data Preparation and on devising the relevant Function-as-a-Service (FaaS) and Data-as-a-Product (DaaP) offerings to significantly improve data facilitation, whether within or across departments. (FaaS will be explained in the next section.)
The above diagram presents an inventive yet disruptive approach in which each DaaP is prepared and circulated “on shelf” (held in the cloud) for whenever it is needed, with its contents and status constantly updated. In this new approach, we intend to lift the burden from the cumbersome centralised data team and the individual data gatekeepers by re-purposing the entire data engineering architecture, so that data preparation is initiated by assigning rightful Data Ownership with the relevant specialised domain expertise.
The main purpose of Data Ownership is to appoint a high-calibre Data Curator who is sufficiently versatile in all the relevant domain specialisations and know-how.
The Owner of a particular data preparation may establish a full understanding with other close stakeholders such as 1) the Strategist (the C-level chiefs), 2) the Regulator (the compliance and risk management team), 3) the Business Intel (the front-line team that brings in potential revenue) and 4) the Developer (the technical team that builds, compiles and retrieves the necessary information).
It is, however, the responsibility of the Data Owner to supervise, prepare and produce the respective DaaP that can be circulated within the company. We are indeed talking about devising a new decentralised masterplan that can revolutionise how data is managed efficiently and with full automation. (Yes, in future the Data Owner does NOT have to be a human being; it could be an AI or a robot that performs the task with minimal error.)
The important concept of Data Mesh and its application
Next, we describe the concept of Data Mesh, which helps to shape our new approach.
We arrive at the new framework by adopting the powerful concept of Data Mesh, which rests on the four principles (or features) below:
a) Domain-Driven Design (DDD) for Data Provisioning and Ownership
b) Data-as-a-Product (DaaP)
c) Self-Serve Data Infrastructure-as-a-Platform (IaaP)
d) Fully Automated Decentralised Federated Governance
This Data Mesh approach holds the key to unlocking the potential of full data automation and integration. Let us look further at how the Data Mesh concept helps to disrupt current data management.
a) Domain-Driven Design (DDD)
In adopting DDD, we ask the questions:
· “Who will be the best person to consult when a query is raised?”
· “Wouldn’t it be more effective if the answer came straight from the source that has full knowledge of the subject of query?”
· “Will the answer become inaccurate or diluted if the information passes through multiple channels before reaching the intended recipient?”
In a DDD architecture, we focus on “going directly” to the rightful source of domain expertise and specialisation so that data preparation is as complete and precise as it can be. The answer must come from the source that holds the relevant know-how, rather than from a third party, to avoid any possibility of information dilution.
b) Data-as-a-Product (DaaP)
Once the relevant DDD parties have completed a solid data preparation, the data is packaged as a DaaP, which is the outcome or output (it can be kept in the cloud and processed further via cloud computing). This DaaP preparation stage is crucial: it is not just about retrieving the right information but also about how each data process and its related data elements are carefully “referenced” and “packaged” so that very large volumes of data input and output can be streamlined automatically. Accurate referencing is the key capability, and the killer application, of our new framework design.
c) Self-Serve Data Infrastructure-as-a-Platform (IaaP)
Once individual DaaPs have been produced by the various departments or data owners, they should be considered ready to be put “on shelf”, provided that the underlying data referencing architecture is operational. This relies on careful and intricate data pipelining design across the company’s entire data network to produce a fully functioning IaaP.
This is the job of the various key Function-as-a-Service (FaaS) components, which serve as the essential tools for seamless connectivity and for orchestrating the master blueprint of the company’s entire data network. (Data pipelining and referencing will be explained in the next section.)
In brief, each company can now have its own “Data Supermarket” that serves itself whenever a relevant and specific query is triggered. Just as in a normal supermarket, all DaaPs on the shelf are “purchasable” or “accessible” with the given credit or rights. Everyone in the company (given the right access control) can now view the same materials and contents. There are no more hidden issues, private channels or under-the-table dealings, which addresses the data transparency issue directly. The Data Supermarket can also be extended externally with the right gatekeeping.
Another issue is whether each DaaP and its contents are still within their “expiry period” for consumption, that is, whether the information is still up to date. The Data Supermarket must ensure that every on-shelf DaaP is “healthily” usable and carries the right time-frame status. The latest updates must be applied periodically to refresh the contents and keep them reliable, even after a DaaP has already been acquired for use. The ability to perform regular, in-house, real-time updates is the key to making this a successful application.
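The “expiry period” idea can be sketched as a simple freshness check over the shelf. This is a minimal Python illustration under stated assumptions: the `DataProduct` class, its `max_age` field and the `stale_products` helper are hypothetical names, and a real shelf would read refresh times from its own catalogue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DataProduct:
    """Hypothetical DaaP entry sitting on the shelf."""
    name: str
    last_refreshed: datetime
    max_age: timedelta  # how long the contents stay valid after a refresh

    def is_fresh(self, now: datetime) -> bool:
        return now - self.last_refreshed <= self.max_age


def stale_products(shelf, now):
    """Names of products whose contents are past their expiry period."""
    return [p.name for p in shelf if not p.is_fresh(now)]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
shelf = [
    DataProduct("quarterly-figures", datetime(2024, 1, 9, tzinfo=timezone.utc), timedelta(days=7)),
    DataProduct("fx-rates", datetime(2024, 1, 8, tzinfo=timezone.utc), timedelta(hours=24)),
]
print(stale_products(shelf, now))  # ['fx-rates']
```

Giving each product its own `max_age` reflects that different kinds of information go out of date at different rates; the stale list is what a periodic refresh job would act on.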
d) Fully Automated Decentralised Federated Governance (blockchain technology enabled).
One may well wonder whether what has been described so far requires very sophisticated and solid governance (a mission impossible with the conventional centralised data team approach). It does: such a framework must also incorporate a fully trustless approach, meaning that trust is no longer an issue because the formation and flow of information within the system are immutable and verifiable. In other words, I know that what I see is also what you see.
Also, every activity and the potential “reasoning” behind each action can now be recorded and made fully accountable where necessary (without violating any privacy rules), so that all events are fully traceable for lineage, audit or forensic purposes.
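One common way to make recorded events tamper-evident is a hash chain, in which each record stores the hash of the one before it. The Python sketch below is a deliberately simplified, single-machine illustration of that idea; it is not a blockchain and not a description of how any particular product implements governance. The function names `append` and `verify` and the record layout are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(entry: dict, prev_hash: str) -> str:
    """Hash the entry together with the previous record's hash."""
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, entry: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "prev": prev_hash, "hash": _digest(entry, prev_hash)})


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash or record["hash"] != _digest(record["entry"], prev_hash):
            return False
        prev_hash = record["hash"]
    return True


log = []
append(log, {"who": "analyst-17", "what": "released report", "why": "regulator request"})
append(log, {"who": "owner-03", "what": "masked field", "why": "privacy policy"})
print(verify(log))  # True

log[0]["entry"]["why"] = "edited afterwards"  # simulate tampering with the reasoning
print(verify(log))  # False
```

Because the WHY of each action is part of the hashed entry, rewriting the reasoning after the fact is detectable in the same way as rewriting the action itself.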
The fully decentralised federated governance in the new framework serves as the “sky-eye” for the higher purpose of data compliance and monitoring, where alerts may be triggered if any event deviates from its originally intended usage. This is a very powerful way of ensuring data compliance and of smoothly handling any audit or data forensic exercise when the regulator asks for supporting evidence.
The diagram below depicts how all four features of Data Mesh can be integrated and interact to provide a fully sustainable data ecosystem, one that eliminates unnecessary replicas, duplications and repetitions while addressing real-time issues and enabling a true transformation away from the almost obsolete, old-fashioned approach to data management.
Note also that all four features are accomplished autonomously, which is what makes the higher efficiency of our new framework achievable. The four features of Data Mesh are integrated to provide a seamless data network and management framework, which in our solution we name VADO, the Virtual Assistive Data Officer.
2. VADO – The Virtual Assistive Data Officer
FST Network is a leading data preparation and FaaS company in Asia, backed by investment from the National Development Fund of Taiwan and H&Q of Silicon Valley. Our major customers are established financial institutions that are currently undergoing major upgrades and are keen to adopt emerging innovations in data management that can help them improve their data governance and lineage in future business undertakings.
Several Unique Selling Points (USPs) and characteristics of VADO include:
a) Non-intrusive solution with real-time Data Lineage provisioning
b) Precision metadata extraction with an encryption-based labelling technique
c) Smart data pipelining with operative data processes assignment
d) Data granularity referencing capability – business & operational logics assignment
e) Full suite flexible Access Control management – Rule-Based or Attribute-Based
f) Flexible data sharing & information field masking
g) Reliable Data Compliance capability – Privacy & Financial Data Regulatory Monitoring
h) Data Tracing & Data Forensic Capability
i) Data Preparation & Data Readiness Provisioning
j) Data Observatory, Discovery & Discoverability capability
k) Data enhancement capability with business intelligence learning provisioning
l) Data Automation with Decentralised Federated Governance
m) API connectors compatible with other solution vendors
With all the above features in place, our solution is designed to address the current and upcoming on-demand needs, and the sudden increase in data influx, in every organisation.
Next, we bring the key features above together and illustrate them for better explanation and clarification.
While designing every aspect of the new framework architecture, we asked some serious questions:
· “What are the main problems currently ‘clogging’ the data flow within an organisation?”
· “What is the challenge in resolving the inefficiency of data flow?”
· “Can we address these issues while simultaneously introducing new improvements?”
· “What are most clients facing and desperately wanting to resolve, in order of priority?”
· “How can we meet real-time requirements when handling large volumes of data?”
· “How can our solution provide autonomous information updating while retaining the reasoning behind each update and change?”
· “How can our solution cover future demands for extended engagement, not limited to the data management domain?”
· “What do customers expect for what they are willing to pay?”
The Data Pipelining and Data Process
Taking these key questions into account, we took a deep dive into multiple case studies at established institutions (e.g., finance and insurance companies) and attempted to identify the common scenarios and trends by examining and measuring the end-to-end process (from starting point to endpoint), covering the complete journey of how data and events are linked, communicated and brought about.
When reviewing the general data flow events of any organisation, we noticed that the complications that arise mostly stem from an inability to segregate data segments carefully and from weak execution of data connection and communication protocols, which in turn often relates to poor implementation of the data network architecture and data engineering in the company’s data management infrastructure.
We also arrived at two basic conclusions:
· Data is hidden throughout the entire structure, and what is visible is just a corner of the picture.
· Data is not only stored in databases, data lakes or warehouses; the majority of it remains in the data pipelines across different processes.
In a nutshell, the implementation of data pipelining and the assignment of the individual data processes associated with these data segments need to undergo a revolutionary change in order to effectively revive the intended data functionality and “unclog” it.
These hidden pain points of inefficiency or data mishandling (whether detectable or not) have probably been buried for so long that they have been “forgotten” or become too tedious to unwrap, which we term “Data Alzheimer’s Symptom”, or DAS.
Likewise, the unwanted “clogging” in data flow can be termed a “Data Fibrosis” problem: if left unmanaged or unattended, it continuously causes degeneration and thickening of the key data pipes and will sooner or later reach an acute point.
** The term fibrosis describes the development of fibrous connective tissue as a reparative response to injury or damage. When fibrosis occurs in response to injury, the term “scarring” is used. In technical terms, fibrosis means thickening or scarring of the tissue. For example, Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. https://www.lung.org/blog/7-things-know-pulmonary-fibrosis
It is with this scenario in mind that we were inspired to re-engineer our approach and find a suitable yet non-intrusive “cure” for the major data handling problem.
Most data strategies nowadays look at integrating data from everywhere, or from within the so-called “company ecosystem”, as shown in the diagram below. However, this can result in mounting costs (data debts) from many unwanted duplications and replicas and, most painful of all, single-use data, which does not constitute a right, robust and scalable data solution.
So, rather than having a big data management resolution or relying on a traditional pipelining process that may clog up everything, FST’s view is that the catalyst for a solution could lie in assigning precise business or operative “logics” (detailed, “bite-size” descriptions) at the crucial points of data segments or interactions.
In other words, we want to revolutionise and revive data pipelines and their data processes by including these assistive “logics”, referenced through appropriate labelling or tagging techniques, so that events and their reasoning are recorded together.
The “little Mario” represents FST Network’s “Virtual Assistive Data Officer”, which proactively links every detail of “data everywhere”, extracts the necessary key metadata, applies precise referencing to individual logics, and binds them together for every key data process with the associated contents found within the pipeline. (You can picture Mario literally going into every single pipeline to unclog and examine it in detail and then label it, so that these labels can later be revisited virtually or called upon for checking and auditing purposes.)
With these precise labelling mechanisms in place, every aspect of the various complicated data journeys can be precisely “discovered, cleansed, tracked and reasoned” from start to endpoint, potentially yielding a complete data lineage resolution that significantly eases the data management crisis faced by companies across different industries.
The Logic Operation Centre (LOC) within VADO
We can now reveal how VADO’s actions are accomplished using the Logic Operation Centre (LOC). The LOC within VADO is essentially a factory for deploying VADO solutions.
The diagram above highlights how our approach can systematically unwrap and unfold the events of any data journey from start to endpoint. The bottom of the diagram illustrates the long data journey and its data ETL and retrieval actions along the way (e.g., from data preparation to data compilation and business analysis).
The top of the diagram illustrates how VADO is used to precisely create the necessary data segments, determine the individual data pipelines, and define their associated data processes. Within each data process (its respective pipe), we can then determine the necessary logics, apply the necessary “tagging” and appoint the appropriate Agents (persons in charge or key people involved), along with the necessary connectivity assignment via the relevant API routes.
In brief, any data journey can be fully described by joining the relevant “data pipes”, which describe all the data processes involved and together form the respective pipeline. Concatenating these data pipes lays out, in a streamlined and accurate way, all the processes used, completing the full narrative with its individual highlights, points of relevance and emphasis. If there is any “leak” along the way, it is now possible to pinpoint and cross-examine the relevant data process and its ingredients.
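The idea of joining named pipes and pinpointing a leak can be sketched as follows. This is a minimal Python illustration, assuming that a “leak” shows up as a drop in record count between the input and output of a process; the names `run_journey`, `find_leaks` and the three example processes are hypothetical.

```python
from typing import Callable, Iterable

# A pipe is a (process name, transformation) pair; both parts are illustrative.
Pipe = tuple[str, Callable[[list], list]]


def run_journey(records: list, pipes: Iterable[Pipe]) -> list[dict]:
    """Run the pipes in order, recording how many records enter and leave each."""
    trace = []
    for name, step in pipes:
        out = step(records)
        trace.append({"process": name, "in": len(records), "out": len(out)})
        records = out
    return trace


def find_leaks(trace: list[dict]) -> list[str]:
    """Names of processes where fewer records came out than went in."""
    return [t["process"] for t in trace if t["out"] < t["in"]]


pipes = [
    ("prepare", lambda rows: list(rows)),
    ("compile", lambda rows: [r for r in rows if r["amount"] is not None]),
    ("analyse", lambda rows: rows),
]
rows = [{"amount": 10}, {"amount": None}, {"amount": 7}]
trace = run_journey(rows, pipes)
print(find_leaks(trace))  # ['compile']
```

A drop in count may of course be intentional filtering; the point of the trace is only that the process responsible can be named and then examined.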
For data compliance, it is great news that a company can revisit every detail of its data events. One thing we would like to emphasise is that all these data re-engineering actions are done in a strictly non-intrusive manner: we let the company adopting VADO define and set its own rules and logics. The contents of individual processes (i.e., the relevant labelling, agents and API routes) can be set according to the company’s data policy and rule settings. We are only the “enabler” providing the right tools and functions; we leave it to the company to decide how it wants to get things done.
The LOC labelling methodology
Now, with these settings applied precisely within every data pipeline, complete data lineage can be achieved and displayed with its fine-grained details. The diagram below illustrates how the labelling mechanism helps to streamline a specific data lineage requirement precisely.
Event Sourcing with full data lineage via Precision Labelling
Notice that in the example, the various tags are “colour coded” to highlight the different descriptions applied in their associated labelling and reasoning. These coloured tags can then be linked, grouped or combined with each other (whether of the same colour or different colours) during the event sourcing period.
Moreover, these tags can really speed up the search process and come in handy during an urgent audit. For instance, if only the events carrying golden tags are of interest along the data journey, they can be called up instantly and examined.
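Why tags speed up an audit search can be shown with a small inverted index from tag to event. This is a minimal Python sketch; the event layout, the colour names and the `build_tag_index` helper are assumptions made for illustration.

```python
from collections import defaultdict


def build_tag_index(events):
    """Map each tag to the ids of the events that carry it."""
    index = defaultdict(list)
    for event in events:
        for tag in event["tags"]:
            index[tag].append(event["id"])
    return index


events = [
    {"id": "e1", "tags": ["gold", "blue"]},
    {"id": "e2", "tags": ["green"]},
    {"id": "e3", "tags": ["gold"]},
]
index = build_tag_index(events)

# Call up every event carrying a golden tag without scanning the whole journey.
print(index["gold"])  # ['e1', 'e3']

# Combine tags of different colours: events that are both gold and blue.
print(sorted(set(index["gold"]) & set(index["blue"])))  # ['e1']
```

The index is built once as events are labelled, so a lookup during an audit is a dictionary access rather than a scan over every recorded event.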
This intricate yet intuitive design shows how the granularity of individual key processes and specific data annotation can actually be implemented to achieve detailed data transparency and visualisation, should one choose to enhance the view further. It is up to the individual data domain experts to apply and upgrade it as they see fit. With VADO, this can now be done straightforwardly, whereas before it may have been too complicated to accomplish.
Above is a snapshot of how each data process can be described in terms of its related logics and labelling. Note that one data process may consist of multiple logics, and each logic can have multiple labels as well as agents defined. Each data process and individual logic setting can be identified by its Digital Identity (DID) for “title” differentiation.
Within each logic, the respective rules and policy settings can be applied alongside the appropriate labels. The example above shows the “If” operator applied as one of the functions. Note that numbers and values can also be included accordingly.
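The relationship described here (one data process, several logics, each with its own “If” condition, labels and agents) can be sketched as a small data structure. This is a minimal Python illustration, not the LOC’s actual schema: the class names, the `annotate` method, the DID strings and the 10,000 threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Logic:
    did: str                                  # identity used to tell logics apart
    condition: Callable[[dict], bool]         # the "If" part of the rule
    labels: list[str] = field(default_factory=list)
    agents: list[str] = field(default_factory=list)


@dataclass
class DataProcess:
    did: str
    logics: list[Logic] = field(default_factory=list)

    def annotate(self, record: dict) -> list[str]:
        """Collect the labels of every logic whose "If" condition holds."""
        labels = []
        for logic in self.logics:
            if logic.condition(record):
                labels.extend(logic.labels)
        return labels


process = DataProcess(
    did="process:disclosure-check",
    logics=[
        Logic("logic:large-amount", lambda r: r["amount"] > 10_000,
              labels=["needs-compliance-review"], agents=["compliance-officer"]),
        Logic("logic:internal", lambda r: r["requester"] == "business",
              labels=["internal-request"], agents=["data-owner"]),
    ],
)
print(process.annotate({"amount": 25_000, "requester": "business"}))
# ['needs-compliance-review', 'internal-request']
```

Keeping the condition, labels and agents together in one logic means the reason a label was applied, and who is responsible for it, travels with the label.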
The LOC as the data engine
This is an overview of how the LOC can be used as the Data Engine that links various reference databases. A solution can be packed into containers following various schemas and data aggregations, alongside the respective API managers, proxies and routers that provide instant fetching of the necessary ingredients.
The Ultimate Enterprise Architecture
VADO, via its flexible API route settings, can efficiently link with or tap into different solution providers’ connectors, enabling non-intrusive labelling that enhances their referencing methods. In brief, we are agnostic towards all solution formats in the data solution market.
Finally, it is the aspiration of FST Network to introduce VADO as a “collaborator” tool that enables efficient communication and connectivity with existing data solution vendors and helps them amplify or upgrade their existing approach. We endeavour to achieve win-win outcomes with existing and future partners and to help enterprises evolve and migrate with frontier data innovations that can make a difference to their business ventures.