A semantic layer is a business-friendly representation of data that expresses complex business logic in simpler terms. In Business Intelligence (BI), it has also been called the metadata layer, semantic model, business view, or BI model.
When the semantic layer was first introduced to BI tools ~30 years ago, it defined table joins, metric aggregation, user-friendly names and more, allowing BI end-users to simply drag-and-drop fields like Product Name and Sales onto a report. Wham, there's your data! Yes, "no-code" BI has been around for at least 30 years. This allowed early data teams to start thinking more strategically about where to put business logic, but it also opened up a lot of complex issues. As a BI consultant for 20+ years and the founder of FlexIt Analytics, I've had these issues on my mind for a long time.
Where to put business logic?
For a very simplified explanation, business logic resides in one of three places in the "BI layers": the data warehouse (via transformations), the BI tool's semantic layer, or the report itself.
In an ideal world, you’re putting as much business logic as possible at the lowest level, in the data warehouse through transformations (the “T” in ETL/ELT). This reduces duplication (DRY — don’t repeat yourself), allows for a “single source of truth”, reduces vendor lock-in, and simplifies consumption at higher levels. However, this isn’t always feasible.
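To make that concrete, here is a minimal sketch (with made-up table and column names) of defining a business rule once in the transformation layer so that every downstream consumer inherits it. In a real warehouse this would be a SQL model, but the principle is the same:

```python
import pandas as pd

# Hypothetical raw extract; in practice this transformation would be written
# as SQL in the warehouse, but the idea is identical.
raw_orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "gross_amount": [120.0, 80.0, 200.0],
    "discount": [20.0, 0.0, 50.0],
    "refunded": [False, False, True],
})

# The business rule, defined once at the lowest layer:
# "net revenue" subtracts discounts and excludes refunded orders.
orders = raw_orders.copy()
orders["net_revenue"] = (orders["gross_amount"] - orders["discount"]).where(
    ~orders["refunded"], 0.0
)

# Every report, BI tool, and notebook now sums the same column instead of
# re-implementing the rule (DRY, single source of truth).
print(orders["net_revenue"].sum())  # 180.0
```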
Data warehouse development can be slow. If a business definition needs to be changed, the business user may not have access to the data engineering team, or that team may be backlogged. So they reach out to the BI team and have them put the logic into the BI semantic layer. What if the BI team is also backlogged, or not willing to make changes? Then, the business user puts the logic in the report. What happens if they don’t have “creator” access to the BI tool? Then they run the report, export to Excel, and put their business logic there (this is another article, or a thousand articles).
We know that stuffing business logic into reports gets messy very quickly. But we also know that there are limits to the amount of business logic you can apply in the data warehouse. For example, you cannot define complex joins very well (dimensional modeling vs. de-normalized "one-big-table" is another article). Also, you generally cannot assign column attributes like label, format, aggregation, or description. Therefore, the natural place to put the majority of your business logic became the BI tool's semantic layer. How much business logic is put in the semantic layer then determines whether it is thin (very little) or thick (a lot).
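As a rough illustration of the kind of metadata a BI semantic layer carries (all names here are invented for the example), think of each warehouse column being wrapped with business-facing attributes:

```python
from dataclasses import dataclass

# Illustrative only: the attributes a BI semantic layer typically attaches to
# a warehouse column, i.e. the metadata that generally cannot live in the
# database itself.
@dataclass
class SemanticColumn:
    source_column: str   # physical column in the warehouse
    label: str           # user-friendly name shown to end users
    description: str     # business definition
    aggregation: str     # default aggregation (sum, avg, count, ...)
    display_format: str  # how values are rendered in reports

net_revenue = SemanticColumn(
    source_column="orders.net_revenue",
    label="Net Revenue",
    description="Gross amount less discounts, excluding refunded orders",
    aggregation="sum",
    display_format="$#,##0.00",
)
```

Roughly speaking, a "thin" layer holds only a handful of definitions like this, while a "thick" one also holds joins, derived metrics, and scripting.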
Phase 1: Thick Semantic Layer
BI tool semantic layers started out medium thick, allowing you to define lots of complex business logic, but they were also somewhat limited in their capabilities. Early tools (Business Objects, Cognos) certainly fell into this category, and even successors like Tableau got into the game with the ability to add joins in the past few years. Looker, however, came in big with LookML to create a very thick semantic layer with advanced code and scripting capabilities. You might say a similar thing of Power BI's DAX.
The thick semantic layer was a huge improvement, allowing for both complex and re-usable business logic. However, it was also isolated to that tool, unavailable to other tools or to users without access. If you're a small organization with one tool and everyone loves that tool, then this may not be a problem. However, what happens when 1) you grow and add more tools to your stack, or 2) you want to move to a different tool (i.e., vendor lock-in)?
To that point, the Business Intelligence Trends 2020 study revealed that 67% of employees have access to more than one BI tool, with an average of 3.8 BI tools per company. There are many other reasons to avoid the thick semantic layer, some of them detailed in the post below:
https://blog.transform.co/data-talks/why-business-metric-logic-shouldnt-live-in-bi-tools/
Thus, enter the next phase of modern BI: the thin semantic layer.
Phase 2: Thin Semantic Layer
The idea of a thin semantic layer in the BI tool is to leverage other tools that build a semantic layer between the BI tool and the database. Some of these are metrics layers (aka headless BI, metrics store), like Transform, Cube, and Metriql. Others, like dbt (data build tool), are data transformation tools that offer support for metrics, as well as other semantic layer functionality.
More on the metrics store: https://towardsdatascience.com/a-brief-history-of-the-metrics-store-28208ec8f6f1
BI tools with a thin semantic layer sync or pull most of their metadata from the headless semantic layer and then define some additional metadata on top of that. A growing number of BI tools are adopting the thin semantic layer approach. Here is an article about Superset, detailing the ideas behind a thin semantic layer:
https://preset.io/blog/understanding-superset-semantic-layer/
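To illustrate the thin approach, here is a sketch of what "sync most metadata, define a little on top" might look like. The endpoint, response shape, and field names are all assumptions for illustration, not any vendor's actual API:

```python
import requests

# Hypothetical headless semantic layer endpoint; nothing here reflects a real
# product's API.
HEADLESS_URL = "https://semantic-layer.example.com/api/metrics"

def sync_metrics() -> dict:
    """Pull metric definitions from the headless layer (the 'sync' step)."""
    response = requests.get(HEADLESS_URL, timeout=10)
    response.raise_for_status()
    return {m["name"]: m for m in response.json()["metrics"]}

# The BI tool keeps only a thin layer of its own metadata on top, for example
# a display label or a default chart type.
local_overrides = {
    "net_revenue": {"label": "Net Revenue", "default_chart": "line"},
}

def resolve(metric_name: str, synced: dict) -> dict:
    metric = dict(synced[metric_name])                   # owned by the headless layer
    metric.update(local_overrides.get(metric_name, {}))  # the BI tool's thin layer
    return metric
```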
The thin semantic layer is clearly a huge step forward for BI. But now, being a go-getter, you’re probably thinking “why not take it further”? In addition to metrics and some other metadata, why not push more metadata like names/labels, descriptions, formatting, and synonyms down to the headless semantic layer?
Phase 3: Semantic-free BI
The concept of semantic-free BI is not new. It dates back to early BI tools, where it was first conceived as a universal (unified) semantic layer, recently termed "headless". The idea is that all consumers of data (BI, ML, and other tools) can "speak the same language" by accessing a "single source of truth" where common metadata semantics are applied. Unlike thin-semantic BI tools that synchronize some metadata from the headless semantic layer, a semantic-free BI tool simply holds a reference to metadata in the headless layer. No metadata detail is held in the BI tool. Technically, you could change an attribute at the report layer and call this a semantic layer, but it's not "the" semantic layer that we're talking about.
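A sketch of the difference, assuming a hypothetical headless client (the method names and reference format are made up): the semantic-free tool stores only a pointer and resolves all metadata at request time, in contrast to the sync step shown earlier.

```python
class SemanticFreeReport:
    """A report that holds only a reference into the headless layer."""

    def __init__(self, headless_client, metric_ref):
        self.client = headless_client
        self.metric_ref = metric_ref  # just a pointer, e.g. "metrics/net_revenue"

    def render(self):
        # Label, description, format, and aggregation are all resolved from the
        # headless layer at render time; nothing is copied into the BI tool.
        metric = self.client.get_metric(self.metric_ref)
        rows = self.client.query(self.metric_ref, dimensions=["order_month"])
        return metric["label"], rows


# Minimal stub standing in for a real headless client, for illustration only.
class StubHeadlessClient:
    def get_metric(self, ref):
        return {"label": "Net Revenue", "format": "$#,##0.00"}

    def query(self, ref, dimensions):
        return [{"order_month": "2022-01", "net_revenue": 180.0}]


report = SemanticFreeReport(StubHeadlessClient(), "metrics/net_revenue")
print(report.render())
```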
Like all great ideas (hoverboards, flying cars, inside-out Oreos), a universal semantic layer that actually works and is worth the investment remains elusive. The metrics layer solutions mentioned earlier (Transform, Cube, dbt, Metriql) are gaining major traction, but are somewhat singularly focused on metrics, and for good reason: metrics are the most important component of the universal semantic layer. On the other end of the spectrum, there are full universal semantic layer offerings like AtScale and Kyligence. But they are not focused metrics layers, and it remains to be seen whether they will gain traction. Will BI and other tools put in the effort to integrate with them? Unlike the open-source metrics solutions, AtScale and Kyligence are neither open source nor transparent. Neither has a public pricing page, and both list only the largest companies as customers, so I think it's fair to say they are not truly "universal" unified semantic layer offerings.
Data teams come in all shapes and sizes, and current offerings likely work very well for only a small percentage of organizations. Perhaps smaller, nimbler companies find that the headless BI offerings fit perfectly; for them, "semantic-free" BI is probably not that critical. On the other end of the spectrum, mega-corporations may be having success with offerings like AtScale or Kyligence. That's great! However, this article is really for the 90% that are in between.
How do we get to semantic-free BI?
To get there, I see a three-step approach:
- Refine and merge concepts of the metrics and universal semantic layer
- Define standards for BI tools to talk to the headless layers
- Reach a critical mass of BI tools that will support semantic-free BI
The metrics and universal semantic layers cover nearly everything, but they need to both come together and mature. Once that happens, it needs to be easy for BI tools to integrate with the headless semantic layer. Without a set of standards for talking to the headless layer, each BI tool will have to create custom connectors for each one, which will likely lead to failure. If the framework is both open and easy, then BI tools will naturally adopt a semantic-free model. Then we can start to reach critical mass for semantic-free BI.
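As one possible shape for such a standard (purely illustrative; no such standard exists today, and these method names are my own), the contract between a BI tool and any headless layer could be as small as:

```python
from typing import Any, Optional, Protocol

class SemanticLayerAPI(Protocol):
    """A minimal, hypothetical contract any headless semantic layer could expose."""

    def list_metrics(self) -> list[dict[str, Any]]:
        """Return all metric definitions (name, label, description, type)."""
        ...

    def list_dimensions(self, metric: str) -> list[dict[str, Any]]:
        """Return the dimensions a given metric can be sliced by."""
        ...

    def query(
        self,
        metric: str,
        dimensions: list[str],
        filters: Optional[dict[str, Any]] = None,
    ) -> list[dict[str, Any]]:
        """Execute the metric query in the warehouse and return rows."""
        ...
```

If every headless layer spoke something like this, a BI tool would need one connector instead of one per vendor.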
Currently, I see dbt in the lead toward enabling semantic-free BI for a handful of reasons:
- It’s not another tool in your stack. As the fastest growing solution for data transformation, it already holds a majority of your complex business logic. Companies are already moving their LookML down a layer, into dbt.
- By default, a lot of metadata is naturally built into dbt models.
- dbt's documentation and support for meta already enable full data cataloging capabilities. Now it's a matter of improving that functionality and making the catalog more active and alive.
- dbt-core is open source and free, with a vibrant community
- Extras like data lineage and data freshness are huge for BI
Although dbt is not a true headless server, dbt Labs is currently working on a headless metrics offering. Additionally, dbt recognizes the need for improvement and has a laser focus on both the metrics layer and the semantic layer:
https://venturebeat.com/2022/02/28/dbt-labs-will-soon-add-a-semantic-layer-in-the-modern-data-stack/
Right now, BI tools can and do integrate with dbt to provide thin and semantic-free BI experiences. FlexIt Analytics and Lightdash already have semantic-free capabilities through integration with dbt. Others, like Superset and Metabase, have sync tools that allow for manual syncing of dbt models to support a thin semantic layer. Given dbt's popularity, many others, like ThoughtSpot (Aug 2022) and Holistics (beta access available now), are coming soon, so we'll see how they integrate sometime in 2022. Lastly, some BI tools like Mode give you a little bit (dbt source freshness), but not much in the way of the semantic model.
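As a concrete (if simplified) example of what a dbt-based thin or semantic-free integration can key off of: dbt writes its model and column metadata to target/manifest.json, which a BI tool can read directly. Exact field names vary across dbt versions, so treat this as a sketch:

```python
import json

# Read the artifact that `dbt compile` / `dbt docs generate` produces.
with open("target/manifest.json") as f:
    manifest = json.load(f)

semantic_models = {}
for unique_id, node in manifest["nodes"].items():
    if node.get("resource_type") != "model":
        continue
    semantic_models[node["name"]] = {
        "description": node.get("description", ""),
        "columns": {
            col_name: {
                "description": col.get("description", ""),
                "meta": col.get("meta", {}),  # e.g. label, format, aggregation hints
            }
            for col_name, col in node.get("columns", {}).items()
        },
    }

print(f"Synced {len(semantic_models)} dbt models into the BI tool")
```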
Things are progressing rapidly, making it very important to get some standards in place. Here is an article that both details how to set standards and provides a working GitHub dbt project showing how to integrate with dbt to enable semantic-free BI:
Closing Thoughts
We've come a long way toward enabling data teams and data consumers of all stripes. In some regards we're just at the beginning, but the current sense of urgency and the rapid pace of both headless semantic offerings and thin or semantic-free BI tools give me hope that we will get there soon.
I'd love to hear your thoughts. Feel free to reach out to Andrew Taft.