Data products made easy: Implementing Data Mesh on Microsoft Fabric

Reading Time: 7 minutes

Microsoft Fabric is an end-to-end analytics and data platform designed to unify various data services into a single, cohesive environment. In this blog post, we explore the capabilities of this platform and look at how it aligns with the key principles of a Data Mesh architecture.

 

Fabric’s SaaS model promises to simplify setup and management and offers automatic integration and optimization, thus reducing the complexity and cost of managing multiple services in a ‘traditional’ Azure-based data and analytics architecture. The focus is therefore likely to shift from managing cloud services to managing data and analytics assets. Experiences can be tailored to different organizational roles (data engineers, data analysts, business users), ensuring that each team member has the tools they need to be effective. With Copilot integration, the productivity of these roles is likely to improve as well.

 

Key capabilities include:

 

  • Data Engineering: Fabric provides robust tools for data movement, processing, and transformation, enabling seamless data integration and preparation for analytics.
  • Data Factory: This capability allows for efficient data ingestion and orchestration, supporting complex data workflows and pipelines.
  • Real-Time Analytics: It offers real-time event routing and analytics, ensuring timely insights and decision-making based on live data streams.
  • Data Science: Fabric integrates advanced data science tools, facilitating machine learning model development, training, and deployment within the same platform.
  • Data Warehouse: Fabric includes a scalable data warehousing solution, centralizing data storage and enabling high-performance querying and reporting.
  • Data Lakehouse: Fabric’s Lakehouse capability seamlessly integrates data lakes and data warehouses, enabling unified data management and analytics.

 

Key technical features underpinning the above capabilities

  • OneLake: Perhaps the single most important feature differentiating Fabric from preceding Microsoft data offerings. All data is stored in Delta format (with Microsoft’s optimization algorithms on top). By adopting OneLake as the store (ADLS Gen2 behind the scenes) and Delta as the common format for all workloads, Microsoft offers customers a data stack that is unified at the most fundamental level: a single copy of the data in OneLake can directly power any workload that processes or queries it, regardless of which compute engine (T-SQL, Spark) wrote it. Data objects are either
     

    • Files – object storage for raw files, reference files, etc. that are not needed in tabular format
    • Tables – Managed Delta Tables created through Spark or T-SQL

     

  • Eventstream and KQL Database enable ingestion of streaming data, which can be seamlessly integrated with OneLake and Power BI to enable real-time insights.
  • The Fabric SQL engine is designed to operate efficiently over any OneLake artifact, providing high performance for queries across data warehouses, lakehouses, and mirrored databases.
  • Microsoft Purview integration provides a unified view of an organization’s data landscape. Metadata is automatically captured and can be viewed in the Purview Data Catalog, facilitating easy discovery, enhanced data lineage, and auditing capabilities.
  • Azure ML integration enables reading data from and writing results back to OneLake, reducing the time to develop and test new models. Azure AI Studio’s Large Language Models (LLMs) can be enhanced through RAG by leveraging integration with OneLake.

 


OneLake: Unified data storage system (Adapted from Microsoft)
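Conceptually, what makes this ‘one copy, many engines’ promise possible is that a Delta table is just Parquet data files plus an ordered JSON transaction log that any engine can read and replay. The toy sketch below is our own simplification (not Fabric’s or Delta Lake’s actual implementation): replaying the log yields the current set of files that make up a table, regardless of which engine wrote each commit.

```python
import json
import tempfile
from pathlib import Path

def commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Append one atomic commit (a JSON file) to the transaction log."""
    (log_dir / f"{version:020d}.json").write_text(
        "\n".join(json.dumps(a) for a in actions)
    )

def current_files(log_dir: Path) -> set[str]:
    """Replay the log in order to find the data files that form the table."""
    files: set[str] = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

log_dir = Path(tempfile.mkdtemp())
# Engine A (say, Spark) appends two Parquet files.
commit(log_dir, 0, [{"add": {"path": "part-000.parquet"}},
                    {"add": {"path": "part-001.parquet"}}])
# Engine B (say, T-SQL) later compacts them into one file.
commit(log_dir, 1, [{"remove": {"path": "part-000.parquet"}},
                    {"remove": {"path": "part-001.parquet"}},
                    {"add": {"path": "part-002.parquet"}}])
print(current_files(log_dir))  # {'part-002.parquet'}
```

Because the log, not the files, defines the table state, a Spark writer and a SQL reader can agree on what the table contains without coordinating directly.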

What are the likely benefits of Fabric adoption?

A Forrester study commissioned by Microsoft detailed the costs and benefits of moving to MS Fabric for a typical global organization with $5 billion in annual revenue and 10,000 employees (including 40 data engineers and 400 business analysts).

 

Improvements envisaged for ‘technical’ users:

 

  • Data Analysts can ‘shift left’ on the data stack and work on data preparation (including cleansing, transformation, and aggregation) without switching between tools. Enhanced data governance gives them greater confidence in data quality and reliability, and the AI features within Power BI automate the detection of trends, anomalies, and key drivers.
  • Data Engineers benefit from a unified platform spanning ingestion (batch and real-time) through to semantic data models. Purview integration provides data lineage and auditing capabilities. Developer productivity and experience are significantly improved with the integration of Copilot and VS Code, enabling deployment of clusters in just seconds.
     

    The study quantifies the following improvements:

    • A 20% improvement in business analyst access and output.
    • A 25% increase in data engineering productivity.
    • Elimination of infrastructure costs, since Fabric is a SaaS offering.
    • A 90% reduction in data engineering time spent searching, integrating, and debugging.

     

 

Advantages for business users start with improved data access through seamless integration of Power BI, Excel, and OneLake data. Collaborative workspaces and communication tools allow teams to work on dashboards and reports in real time and share the related insights, enhancing business performance. Improved alignment between the analytics and business teams results in faster time to market.

 

The associated costs for Fabric adoption over a 3-year period include:

  • Microsoft Fabric-related fees of $1.1 million.
  • Implementation costs of $1.1 million.
  • Ongoing maintenance costs of $352,000.

 

According to the study, the overall NPV of the benefits from adoption is $9.79M, representing an ROI of 379%.
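The headline figures can be sanity-checked with simple arithmetic. This is a sketch using the rounded numbers quoted above; Forrester computes ROI as the NPV of benefits divided by the present value of costs, and the small gap versus the published 379% comes from rounding and risk adjustment in the study’s underlying model.

```python
# Three-year present values from the Forrester study, in $ millions.
fabric_fees = 1.1
implementation = 1.1
maintenance = 0.352

total_costs = fabric_fees + implementation + maintenance  # 2.552
npv_of_benefits = 9.79                                    # net of costs

# ROI = net benefits / costs; the study's published figure is 379%.
roi = npv_of_benefits / total_costs
print(f"ROI = {roi:.0%}")  # ROI = 384%
```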

A real-world perspective on MS Fabric’s potential to transform data mesh implementation

Sigmoid is involved in a multi-year data & analytics transformation that addressed some key issues with the client’s data architecture, such as the use of legacy data tools, lack of scalability, and high costs. The evolution to a Data Mesh architecture for this client was based on the Data as a Product principle, which helped functions such as Marketing & Sales, Supply, and Finance convert their data (internal and external) into assets. This was underpinned by a robust architecture, as illustrated below:

 


Data Mesh | Logical Architecture

 

An Azure-based implementation of the logical architecture had evolved into a standard platform that was centrally managed and could be leveraged by all domains.

For example, the Finance domain leveraged this platform and built data products organized into product streams and data products as follows:

  • Global Financial Services > Record to Report | Procure to Pay | Order to Cash > Productivity & Operations* | Service Quality Indicators* | Benefits Tracking*
  • Corporate Reporting > Management Reporting > Group Operations Leadership Reporting* | Supply Finance Reporting* | Gross Margin Analysis* | Data Products*

 

Each data product had a separate business case, with the corresponding costs and benefits being tracked. A team consisting of product managers, architects, data engineers, data visualization engineers, business analysts, and scrum masters took each product from ideation through deployment and into the evolve phase.

 

Although largely successful, the journey was not without challenges, experienced across all the principles of the Data Mesh architecture. We think MS Fabric, with its unified SaaS platform construct, would mitigate many of the challenges the client experienced.

 

The breakdown below illustrates how MS Fabric’s capabilities align with and address the specific challenges faced during the Data Mesh journey, organized by Data Mesh principle.

 

Data Mesh Architecture Principle: Domain-oriented ownership

Azure architecture limitations:
  • Enabling autonomous domain teams proved difficult to execute, and there was a reliance on a central CoE to ‘delegate’ squads for the implementation of data products.
  • Delivering cross-domain data assets (such as those required for sustainability initiatives) added complexity due to the multiple semantic-layer components and file formats used by data product teams.

Supporting MS Fabric capabilities & features:
  • MS Fabric provides an end-to-end tech stack, which can make establishing, hiring for, and upskilling domain product teams easier.
  • The integration of Power BI with the Lakehouse can also enable ‘citizen’ analysts to emerge without specialized data engineering skillsets.
  • The OneLake Lakehouse has a standardized file format, making it easier to share data and develop cross-domain data products.

Data Mesh Architecture Principle: Data as a product

Azure architecture limitations:
  • Enabling discoverability of data products through tooling proved challenging and often resulted in reliance on data product owners’ expertise.
  • Keeping track of ‘data contracts’ for cross-domain data sharing with specific rules, and preventing their proliferation, required ongoing effort.

Supporting MS Fabric capabilities & features:
  • MS Purview integration provides catalogs for discovering and using data products. Purview gives a full overview of each data product, including ownership, SLAs, data quality, security, and timeliness.
  • The OneLake data hub hosts a repository of shared assets, including semantic models, data marts, and KQL databases, across workspaces and capacities. Endorsements help identify data assets that are fit for purpose.
  • The data format is standardized and data models are reusable, allowing both Spark and SQL engines to read data written to OneLake.
  • The ‘Shortcuts’ feature in OneLake provides access to shared data without data movement. Object- and row-level security enables secure sharing.

Data Mesh Architecture Principle: Self-serve data platform

Azure architecture limitations:
  • Provisioning data infrastructure required a lengthy budget approval process and architecture reviews.
  • Allocating shared resources (such as reserved instances) required diligent FinOps alignment and monitoring.

Supporting MS Fabric capabilities & features:
  • The SaaS model gives the strongest alignment with this principle: platform-level complexities are abstracted into SKUs and capacities, so teams do not have to worry about ‘turning platform knobs’ in the dev, test, and prod environments.

Data Mesh Architecture Principle: Federated computational governance

Azure architecture limitations:
  • Although standards are defined for every component of the stack, it proved challenging to govern them through code, since each component tends to have a different tool or technology for computational governance (think IaC, data quality tools, code quality tools, etc.). This led to audits being executed manually, taking time away from core product development.

Supporting MS Fabric capabilities & features:
  • Fabric allows the creation of domains and sub-domains within a tenant to organize workspaces. The applicable settings can be governed at the domain level or delegated to the sub-domain level.
  • The uniformity of tech stack components makes it easier to evolve applicable standards and eventually govern them through code or configuration options.
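One way to make the ‘data contract’ challenge tractable is to express contracts as machine-checkable artifacts rather than documents. The sketch below is purely illustrative (the DataContract class, its fields, and the gross_margin_analysis example are our own assumptions, not a Fabric or Purview API): a publishing domain validates a batch against the contract before sharing it across domains.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataContract:
    """A minimal, machine-checkable contract for a shared data product."""
    product: str
    owner: str
    sla_hours: int                      # maximum data freshness
    schema: dict[str, type] = field(default_factory=dict)

    def validate(self, rows: list[dict]) -> list[str]:
        """Return a list of violations (empty means the batch conforms)."""
        errors = []
        for i, row in enumerate(rows):
            for column, expected in self.schema.items():
                if column not in row:
                    errors.append(f"row {i}: missing column '{column}'")
                elif not isinstance(row[column], expected):
                    errors.append(f"row {i}: '{column}' is not {expected.__name__}")
        return errors

# Hypothetical contract for one of the Finance data products mentioned above.
contract = DataContract(
    product="gross_margin_analysis",
    owner="finance-domain",
    sla_hours=24,
    schema={"sku": str, "margin_pct": float},
)
good = [{"sku": "A-1", "margin_pct": 0.42}]
bad = [{"sku": "A-2"}]
print(contract.validate(good))  # []
print(contract.validate(bad))   # ["row 0: missing column 'margin_pct'"]
```

Checked-in contracts like this can be enforced in CI or in pipelines, which is one route toward the ‘governance through code’ goal described under federated computational governance.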

Accelerating implementation and engineering productivity through Copilot and its integration points

Apart from enabling a decentralized data mesh architecture, an organization can also benefit from decentralized and shorter implementation cycles thanks to Copilot’s integration with Fabric.

  • Data Science and Engineering: Copilot for Data Science and Data Engineering (in preview) helps generate code in notebooks to work with Lakehouse data, providing insights and visualizations. It supports various data sources and formats, offering answers and code snippets directly in the notebook, streamlining data analysis. You can ask questions in the chat panel, and the AI provides responses or code to copy into your notebook.
  • Data Factory: Copilot offers intelligent code generation for data transformation and explanations that simplify complex tasks. It integrates with Dataflow Gen2 to generate new transformation steps, summarize queries and applied steps, and create new queries with sample data or references to existing queries.
  • Data Warehouse: Copilot can write and explain T-SQL queries, make intelligent suggestions, and provide fixes while coding. Key features include natural language to SQL, AI-powered code completion, quick actions for fixing and explaining queries, and intelligent insights based on your warehouse schema and metadata.
  • Real-Time Intelligence: Copilot can seamlessly translate natural language queries into Kusto Query Language (KQL).
  • Power BI: Copilot can quickly create report pages, natural language summaries, and synonyms. It assists with writing DAX queries, streamlining semantic model documentation, creating narrative visuals, and providing tailored responses to specific questions about visualized data.

 

In essence, Copilot can lower the technical barriers to leveraging data and analytics and, in doing so, enable citizen data analysts and scientists to emerge.

Conclusion

The journey to a fully mature Data Mesh architecture remains a challenge due to technical constraints and organizational obstacles. Moving from an already implemented platform and governance structure to a new one adds another layer of complexity to adoption. MS Fabric is well positioned to help organizations achieve their end-state data and analytics vision. However, we think a phased approach to migration is probably best, as it enables organizations to experiment and course-correct while discovering the benefits of the new SaaS paradigm of MS Fabric.
