Data models are the core of metadata, and the majority of data modelling is done using entity-relationship (ER) models. Despite many variations in notation, the underlying idea is always the same - name the main entities, their attributes and the cardinality of the relationships between them (1:N, M:N,…). However, when it comes to describing how tables or cubes are calculated and how their content is to be correctly interpreted, this type of modelling has limitations.
There are two basic places where we can look for help:
Using more expressive languages like UML, RDF or OWL - languages which were designed so that machines can process them and generate code from them
Using a semi-formal language like pseudo SQL, derived from natural language and designed to create sentences similar enough to natural language that non-programmers can see the meaning or intent
Formal modelling languages
Metadata-driven development has given birth to a plethora of formal languages. They were designed with code generation in mind: based on a model, you can generate code that can be executed. This approach works up to a point but definitely has its limits. Take for example ETL tools that use graphical interfaces to define transformations. They look great in sales presentations, but when the complexity of the logic and the size of the data volumes kick in, they fall short. The transformations end up scripted in the proprietary code of the database being used, and the visual ETL tool is typically used only for the scheduling and management of ETL jobs - you'll have fun getting a meaningful metadata model out of that (please let me know if you've had a different experience!).
Even when someone succeeds in using something like UML to describe transformation logic, the energy needed to model the process equals the energy needed to actually code it. Let me give you an example: there's a story about a king who wanted a detailed map of his kingdom (map = metadata model). His court cartographers presented him with map after map, each one better than the last, but the king was never satisfied. Finally the oldest and wisest of the cartographers pointed out that if the king wanted a perfect map, its scale would be 1:1 and the map would be as big as the kingdom itself. This is exactly the issue with metadata-driven development - the model is as complex as the logic it describes, and needs as much energy to create, maintain and interpret properly as the logic needs to be implemented.
Formal languages for the modelling of business logic (e.g. ETL) are complex and time-consuming to work with, and business people don't know how to read them.
We encounter a similar situation when we consider the output side. People love to define business glossaries as hierarchical, ontological trees (e.g. you have the word "client" as a general term and "retail" and "corporate" clients as sub-branches of that term). Again, it sounds like a great idea until you dive deeper into the ontological modelling required - questions like "What type of hierarchy should be used?" and "An SME mortgage client has some of the attributes of a retail client and some of a corporate client; where should I place it in the hierarchy?" pop up.
Again, ontology languages like RDF and OWL were designed so that an AI can find solutions through inference - not for ordinary business people to read and write.
Let's give up on the idea of a perfect map of the information landscape for a moment and imagine we want to construct a map that is not perfect (like the 1:1 map), but cheaper and easier to work with.
The basic elements we need to address are:
entities - whether data entities like a table, a cube or a KPI, or complex entities like a business term, a report, or a person
attributes - each entity consists of attributes; a block, in general, has some inner structure
relationships - entities have relationships with each other; not just links but relationships of different qualities and directions *)
versions - current production models vs. historical versions vs. models in development; we will touch on this area in a future blog, so let's skip it for the moment
*) The quality and direction of a relationship matters a lot! Let’s say you have your nose in my armpit. The statement “s/he has a nose in an armpit” would be true for both of us, but I would be far better off :).
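The basic elements above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (all names are invented for the example, not a real tool's API) of entities with inner structure and typed, directed relationships:

```python
# Minimal sketch of the basic metadata elements: entities with attributes
# and relationships that carry a quality (type) and a direction.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                       # e.g. a table, a KPI, a business term
    kind: str                       # "data" or "complex"
    attributes: dict = field(default_factory=dict)  # inner structure

@dataclass
class Relationship:
    source: str                     # direction matters: source -> target
    target: str                     # is not the same as target -> source
    quality: str                    # e.g. "is defined by", "has data source in"

entities = {
    "SME mortgage client": Entity("SME mortgage client", "complex"),
    "dim_party": Entity("dim_party", "data",
                        {"columns": ["client_name", "client_id"]}),
}
relationships = [
    Relationship("SME mortgage client", "dim_party", "has data source in"),
]

# Quality and direction let us ask precise questions, e.g. "what does
# this term depend on?" rather than just "what is it linked to?"
deps = [r.target for r in relationships
        if r.source == "SME mortgage client"
        and r.quality == "has data source in"]
print(deps)   # -> ['dim_party']
```

Storing the quality and direction explicitly is what later makes impact analysis possible.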
Let’s look at that SME mortgage client example again and see what solution to our problem emerges when we use a pseudo-formal language.
We would ask our business analyst a few simple questions in order to define the term, and also give him/her the option to use any of the entities already in our data warehouse (we want to re-use definitions and capture relationships; let's assume for this example that the terms "legal entity", "employees" etc. are already in the Dictionary). We would get a definition something like this:
*Define* SME mortgage client: It’s a legal entity with such and such a number of employees or such and such a turnover which meets this mortgage risk criteria.
When creating this new term definition the analyst is only making hyperlinks to existing definitions. However, an intelligent dictionary would be able to interpret the links as the relationships of the type “term as defined by the link target term”. So when, for example, the term “turnover” changes, the intelligent dictionary would signal to me that I also need to review the definition of SME mortgage client.
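The "intelligent dictionary" behaviour described above can be sketched very simply. This is a hypothetical illustration, not a real product's API: definitions store their links, and a change to a linked term flags every dependent term for review.

```python
# Hypothetical sketch of an intelligent dictionary: definitions carry
# hyperlinks to other terms, interpreted as "term as defined by the
# link target term" relationships.
class Dictionary:
    def __init__(self):
        self.definitions = {}   # term -> (definition text, linked terms)

    def define(self, term, text, links):
        self.definitions[term] = (text, set(links))

    def needs_review_after_change(self, changed_term):
        # Every term whose definition links to the changed term
        # should be reviewed.
        return {term for term, (_, links) in self.definitions.items()
                if changed_term in links}

d = Dictionary()
d.define("turnover", "Annual revenue of a legal entity.", links=[])
d.define("SME mortgage client",
         "A legal entity with such and such a number of employees or "
         "such and such a turnover which meets this mortgage risk criteria.",
         links=["legal entity", "employees", "turnover",
                "mortgage risk criteria"])

# The definition of "turnover" changes - who needs to review?
print(d.needs_review_after_change("turnover"))
# -> {'SME mortgage client'}
```

The analyst only creates hyperlinks; the change-propagation signal falls out of the link structure for free.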
*Calculate* SME mortgage client: SELECT client_name, client_id FROM dim_party WHERE classification_type = ‘segment’ AND classification = ‘SME’
Even this simplified pseudo SQL captures the calculation logic - admittedly it couldn't be run as code, but any analyst could use this information to create a report in minutes. And if the dictionary interprets the links as "term has a data source in the link target table" and the term is used in a request for a monthly report, you get a simplified data lineage with no extra effort.
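That "free" lineage can also be sketched in a few lines. A hypothetical example (term names and tables are invented for illustration): if the dictionary records which table each term is sourced from, a report defined in terms yields its table lineage automatically.

```python
# Illustrative sketch: term -> source tables, recorded when the analyst
# linked each definition to its data source. All names are invented.
term_sources = {
    "SME mortgage client": ["dim_party"],
    "mortgage balance": ["fact_loan_balance"],
}

def report_lineage(report_terms):
    """Simplified data lineage: which tables feed which terms of a report."""
    lineage = {}
    for term in report_terms:
        for table in term_sources.get(term, []):
            lineage.setdefault(table, []).append(term)
    return lineage

monthly_report = ["SME mortgage client", "mortgage balance"]
print(report_lineage(monthly_report))
# -> {'dim_party': ['SME mortgage client'],
#     'fact_loan_balance': ['mortgage balance']}
```

No extra modelling effort was spent: the lineage is just a different reading of the links the analyst already made.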
Wow - an analyst has built a conceptual model of an SME client, complete with business logic and data sources, without any knowledge of conceptual modelling techniques!
Metadata-language best practice
Use ER models to document your tables. They are easy to create and people understand them. Pay extra attention to creating subject-area diagrams - don't use one A0-format diagram to cover all entities and relationships.
Do not try to use advanced formal languages to model business logic or business semantics - the modelling effort is equivalent to the coding effort, few people can model properly anyway, and the outputs are understandable only to a few.
Capture the nuts and bolts of definitions in natural language. An imperfect definition is better than a hard-to-understand one, or none at all. Use links to reference and re-use existing definitions. Sooner or later we will need a tool to define the quality and direction of relationships for impact analyses and change management.
Finally, start simple and small - but do start ;).