<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[LucidBI]]></title><description><![CDATA[LucidBI]]></description><link>https://blog.lucidbi.co</link><generator>RSS for Node</generator><lastBuildDate>Mon, 27 Apr 2026 19:15:11 GMT</lastBuildDate><atom:link href="https://blog.lucidbi.co/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cross Workspace Fabric Notebook Execution]]></title><description><![CDATA[Have you ever tried to run a Fabric notebook in a different workspace using notebookutils? I have, and it didn’t work out so well. Thankfully, persistence wins, and now I get to share the secret sauce with all you lovely people.
The Setup
You’re goin...]]></description><link>https://blog.lucidbi.co/cross-workspace-fabric-notebook-execution</link><guid isPermaLink="true">https://blog.lucidbi.co/cross-workspace-fabric-notebook-execution</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[microsoft fabric]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Thu, 06 Nov 2025 16:32:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762446371993/ac5dc7d6-0e0f-48f2-9198-2d0faebf3c20.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever tried to run a Fabric notebook in a different workspace using notebookutils? I have, and it didn’t work out so well. Thankfully, persistence wins, and now I get to share the secret sauce with all you lovely people.</p>
<h2 id="heading-the-setup">The Setup</h2>
<p>You’re going to need two workspaces, one lakehouse, and two notebooks. If you want to see another little scoop of secret sauce, create two lakehouses.</p>
<p>Workspace 1:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762444674995/c0c2e26d-0b96-429e-8b9f-329d80a604ff.png" alt class="image--center mx-auto" /></p>
<p>Workspace 2:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762444738435/dcdc5d00-b3b0-484e-ae48-bb353d7b3751.png" alt class="image--center mx-auto" /></p>
<p>In this setup, <strong>I Speak Whale</strong> is the target workspace and <strong>lucid_is_awesome</strong> is the target notebook. <strong>just_keep_swimming</strong> is the source notebook that will be attempting to negotiate with the target.</p>
<p><strong>just_keep_swimming</strong>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445053328/97d11b90-0a59-4548-b692-49e89fa136be.png" alt class="image--center mx-auto" /></p>
<p><strong>lucid_is_awesome</strong>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445086776/cfa66f22-3185-4ca9-96ad-b99e06178961.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pressing-the-button">Pressing the Button:</h2>
<p>When you hit the go button, the current configuration tries to execute the target notebook in the same context as the calling notebook. What does that mean, exactly? Well, the calling notebook is attached to a default lakehouse, and that lakehouse belongs to a specific workspace. So, the calling notebook looks for the receiving notebook in the same workspace as the calling notebook’s default lakehouse. This is problematic given the two notebooks are assigned to different default lakehouses in different workspaces.</p>
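<p>For reference, the calling cell at this point boils down to a plain reference run, roughly like the sketch below (the <strong>just_keep_swimming</strong> screenshot above shows the real cell):</p>
<pre><code class="lang-python"># Same-context reference run: Fabric resolves the target notebook relative to the
# calling notebook's context, which is why this fails across workspaces
notebookutils.notebook.run("lucid_is_awesome")
</code></pre>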
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445329763/8a1077fd-168c-43df-82a5-7b8f1b0cfb70.png" alt class="image--center mx-auto" /></p>
<p>It’s basically telling you, “This isn’t where I parked my car!”</p>
<p>So, we have to tell it where the valid parking spot is, and to do this we need to grab the workspace GUID of the target workspace.</p>
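<p>You can lift the GUID straight out of the target workspace URL, or resolve it by name. A quick sketch of the latter, assuming the semantic-link (sempy) package is available in your session:</p>
<pre><code class="lang-python">import sempy.fabric as fabric

# Resolve the target workspace GUID from its display name
target_workspace_id = fabric.resolve_workspace_id("I Speak Whale")
print(target_workspace_id)
</code></pre>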
<h2 id="heading-round-2-fight">Round 2 - FIGHT:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445504528/3640ad7e-bc0a-40f5-a11c-d188cca6df95.png" alt class="image--center mx-auto" /></p>
<p>Well, that didn’t work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445585093/e9dcbbe2-d6c7-4983-b77f-e8c4bf79ccc8.png" alt class="image--center mx-auto" /></p>
<p>But wait, THERE’S MORE! If you take another look at the error, it’s telling you why it tanked AND how to fix it. Because the notebooks are hooked into different default lakehouses, we need to add one more little nugget of goodness.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445852125/471026b7-1830-4713-bf07-debba6cd4e40.png" alt class="image--center mx-auto" /></p>
<p>I remember trying to do this a while back and banging my head against the wall without success. And here we are, running cross-workspace jobs like nobody’s business.</p>
<p>Anyway, a quick one today, but it’s something I ended up circling back to and thought would be helpful to share.</p>
<p>Cheers, and remember to stay Lucid.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762446605874/bd86019a-de78-4df7-bed3-79cef1e8b198.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Variable Libraries, The Fabric Unsung Hero]]></title><description><![CDATA[In case you missed it, the Variable Library team over on the Fabric PG quietly snuck out some pretty big updates relatively recently that are actually quite massive. Before we dive into, it’s important to understand why these updates are so huge.
Dep...]]></description><link>https://blog.lucidbi.co/variable-libraries-the-fabric-unsung-hero</link><guid isPermaLink="true">https://blog.lucidbi.co/variable-libraries-the-fabric-unsung-hero</guid><category><![CDATA[data analysis]]></category><category><![CDATA[data analytics]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[microsoftfabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Wed, 27 Aug 2025 16:16:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756311003343/b3d01118-160a-442d-8062-3de47d38a07c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In case you missed it, the Variable Library team over on the Fabric PG quietly snuck out some updates relatively recently that are actually quite massive. Before we dive in, it’s important to understand why these updates are so huge.</p>
<p>Deployments and CICD have been a real thorn in the side for most teams since launch, specifically with regard to data pipelines. If you’re not familiar, when you build a data pipeline and add an activity, such as a copy or a lookup activity, you have to configure a connection. This is done using a couple of dropdown selections, which is trivial.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756304796697/06a28650-30cb-44c7-9512-72a884efb8ce.png" alt class="image--center mx-auto" /></p>
<p>After getting the connections all sorted, you kickoff your job and everything works wonderfully. All good, right? Well, no, not really.</p>
<p>When you make the dropdown selection in the UI, the JSON of the pipeline is being updated with a hardcoded GUID of the selection made.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756304961938/3cea76a3-8a27-4e50-b7df-5659fb4d63ba.png" alt class="image--center mx-auto" /></p>
<p>Eh, who cares? It’s working now, we’re good to go. Right?</p>
<p><em>Insert our good friend, CICD</em></p>
<p>Assuming you’re following good development practices, you’re likely working in a feature branch or a development environment. So, what happens when you need to check that code in and promote your new pipeline to production? Remember, we now have a hardcoded GUID in the JSON of our pipeline.</p>
<p>Unlike some of the other Fabric items that will remain connected using named references (think notebook and lakehouse references), data pipelines do <strong>not</strong> maintain a link based on named reference. Said differently, when you deploy your code to a new workspace, the GUID reference in your JSON <strong>will not change</strong>. If you deploy a pipeline to a new workspace and immediately trigger it, it will reference the GUIDs from the originating workspace.</p>
<p>Now that we understand the issue, let’s talk about a solution.</p>
<h2 id="heading-variable-library-early-days">Variable Library, Early Days</h2>
<p>Variable Libraries have been in preview for quite a while, so why did it take so long to give them the respect they deserve?! When VLs first launched, they could only be used with data pipelines, and <strong>they did not support usability with activity connections</strong>. Yeah, you read that correctly. We could use them to swap other values in our pipelines, but we couldn’t use them to do the highest priority (personal opinion) function needed: let me hot swap an activity connection.</p>
<p>For that reason, we had solutions popping up like the fabric-cicd repo by one of the Microsoft internal engineering teams. fabric-cicd is a hugely helpful CICD kickstarter that solved the issue of changing connections by essentially doing a find/replace during deployment to replace GUIDs and other references.</p>
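<p>If you haven’t seen it, a deployment run with fabric-cicd is only a handful of lines. Here’s a rough sketch from memory of the project’s getting-started example, so double-check parameter names against the repo:</p>
<pre><code class="lang-python">from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

# Point the deployment at the target workspace and the item definitions in your repo
target_workspace = FabricWorkspace(
    workspace_id="your-target-workspace-guid",
    repository_directory="path/to/your/workspace/items",
    item_type_in_scope=["Notebook", "DataPipeline", "Environment"],
)

# Publish everything in scope, then clean up items no longer present in the repo
publish_all_items(target_workspace)
unpublish_all_orphan_items(target_workspace)
</code></pre>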
<p>Having seen the VLs, I suspected it was only a matter of time before we had full support, so I extended the fabric-cicd repo to dynamically generate a parameter.yml file using a VL. This was obviously not the intended use case, but it allowed me to leverage the VL item in preparation (and hope) that we’d get connection integration at some point.</p>
<p>If you’re interested, I do have an open PR on the fabric-cicd repo (I owe some updates to the docs, sorry for being slow) where you can view the code mentioned above. That said, I need to revisit it to determine its viability going forward, but I suspect it still has a place in the toolbelt.</p>
<p><a target="_blank" href="https://github.com/microsoft/fabric-cicd/pull/264">Added support for dynamic creation of parameter.yml file using integration with VariableLibrary artifact by Lucid-Will · Pull Request #264 · microsoft/fabric-cicd</a></p>
<h2 id="heading-variable-library-all-grown-up">Variable Library, All Grown Up</h2>
<p>Fast forward a few months, and the current state of VL finally warranted some updates to the Lucid Data Platform codebase. I’m going to make a bit of an assumption that people have at least experimented with how to use a VL, so I won’t be going super deep on a “how-to”. If there’s interest in doing so, I’d be happy to put together another tutorial, but I’ll keep this one relatively high level.</p>
<p>Let’s take a quick look under the hood at how these little fellas can make your life a whole lot easier.</p>
<h3 id="heading-setup">Setup:</h3>
<p>Firstly, you’ll want to create a VL item in your workspace. If you do a bit of proper planning, your future self will thank you. E.g., instead of mashing everything into a single library, I <em>try</em> to be a bit more organized, so I spun up multiple libraries for logical grouping.</p>
<p>Here’s a glimpse into how a slightly OCD, very ADHD person who probably got carried away with excitement structured their libraries:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308351915/f016c3c1-27ba-4d33-9c13-c2212429e559.png" alt class="image--center mx-auto" /></p>
<p>And if you crack the door a bit more:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308400840/278b31d0-bb34-4b54-b06e-1c42f6594cee.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308425094/17a10027-3c9e-4824-8b0b-113b41f17e5e.png" alt class="image--center mx-auto" /></p>
<p>You can probably pick up on the logical grouping for the rest of them, so I’ll spare you the screenshots. But keeping things isolated this way tends to reduce my time spent scrolling endlessly looking for that one thing, so it’s helped quite a bit.</p>
<h3 id="heading-implementing-in-data-pipelines">Implementing in Data Pipelines</h3>
<p>Circling back to the previous pipeline example, it’s time to get rid of all those hardcoded references. If you crack open a data pipeline, you’ll find the Variable Library tab at the bottom of your canvas. Here’s where you’ll initialize the variables from your VLs for usage in the pipeline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309267777/e4f7942c-4d73-4a74-8435-6c650adfab3d.png" alt class="image--center mx-auto" /></p>
<p>The next step is to replace all the hardcoded references in your pipeline activities with variable library references for the juicy flexibility.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309341286/046deb85-db70-4f56-9d0f-ec8ba95d81cd.png" alt class="image--center mx-auto" /></p>
<p>You drop them in the same way you would a parameter or a variable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309378672/01b1a283-2181-42e8-8f58-6ccdc4039a3e.png" alt class="image--center mx-auto" /></p>
<p>Now, if you check the pipeline JSON, the hardcoded references are gone and you’re ready to jam.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309495628/6f25d980-fb2a-4de6-af71-44cd7758b4ff.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-this-is-really-really-cool">This Is Really, Really Cool</h2>
<p>So, yeah, this is actually a super duper cool and very welcome enhancement to the Fabric stack. It gets us closer to closing one of the biggest gaps (personal opinion) Fabric pipelines have had compared to legacy ADF/Synapse, namely the lack of Linked Service items. It’s also a huge W with respect to CICD and how we manage connections and other references during deployment.</p>
<p>On a semi-related note, another <em>feature</em> that was quietly rolled out recently - Schedules are now included as part of the metadata being deployed. If you deploy a data pipeline with an active schedule, the schedule will go with it as part of deployment (and will deploy in the same state). So, after spending ~30 minutes trying to figure out why jobs were duplicating and failing over top of each other, I realized the issue was related to doing a test deployment to a feature branch. The variable libraries were deployed in two workspaces, and the schedules deployed as active, so be careful with that, I guess?</p>
<p>Anyway, happy coding, Fabricators!</p>
<p>Oh, if you’re interested in a rock-star pickup to augment your dev team, check us out over here at Lucid BI and feel free to connect on LinkedIn for more cool Fabric stuff.</p>
]]></content:encoded></item><item><title><![CDATA[Data Profiling with Spark and YData]]></title><description><![CDATA[Data analytics often begins with profiling your data. Data profiling is simply the act of examining the raw source data to understand things like the structure and quality of the data.
As an engineer, understanding things such as the distribution of ...]]></description><link>https://blog.lucidbi.co/data-profiling-with-spark-and-ydata</link><guid isPermaLink="true">https://blog.lucidbi.co/data-profiling-with-spark-and-ydata</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[spark]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Tue, 26 Nov 2024 15:40:27 GMT</pubDate><content:encoded><![CDATA[<p>Data analytics often begins with profiling your data. Data profiling is simply the act of examining the raw source data to understand things like the structure and quality of the data.</p>
<p>As an engineer, understanding things such as the distribution of values, min/max, and unique occurrences of values in the data you’re working with is very powerful. It helps you to better understand how to work with your data when considering joining tables, configuring incremental extract/load processes, and identifying the natural key of the table for dimensional modeling.</p>
<p>Data profiling can often be a long, tedious process. Thankfully, there are tools available to help expedite the process.</p>
<p>The YData library is one such tool and should be in everyone’s toolbelt.</p>
<p><a target="_blank" href="https://docs.profiling.ydata.ai/latest/">Welcome - YData Profiling</a></p>
<p>The library itself is very robust; however, it can also be simplified to get a quick profile of your data. In addition, you have several options for what to do with the output once it’s generated, such as rendering the HTML directly in your notebook or writing the output as JSON to a storage location.</p>
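<p>To give a sense of how light that can be, here’s a minimal sketch of those output options against a pandas DataFrame (the full Spark-friendly version is further down):</p>
<pre><code class="lang-python">from ydata_profiling import ProfileReport

# pdf is a pandas DataFrame, e.g. a sampled Spark table converted via toPandas()
profile = ProfileReport(pdf, title="Customers profile", minimal=True)

profile.to_notebook_iframe()       # render the report inline in the notebook
html_report = profile.to_html()    # or keep it as an HTML string
json_report = profile.to_json()    # or serialize the metrics as JSON for storage
</code></pre>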
<p>With only a few lines of code you gain significant visibility into your data. For example, if we wanted to understand the makeup of our customers table:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634326986/e6432fad-9655-4f96-85c7-46da92fcfbd1.png" alt class="image--center mx-auto" /></p>
<p>Variables represent the columns of the table and can be explored in more detail. We can start to probe into the columns (variables) to find things like the uniqueness of that column:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634540559/a5b7a444-11d8-4181-8b59-d6ab9433c836.png" alt class="image--center mx-auto" /></p>
<p>From there we can probe even further to view the statistical breakdown of the column:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634602851/16adcb9d-176d-43a9-8d63-7c4743f894d9.png" alt class="image--center mx-auto" /></p>
<p>Or maybe we wanted to see the distribution of values:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634642847/895a8ace-4636-42fd-be3d-f5039592022a.png" alt class="image--center mx-auto" /></p>
<p>Beyond these simple examples, there are advanced settings that let you customize your exploration through configuration files, with sample configurations available in the public GitHub repo:</p>
<p><a target="_blank" href="https://github.com/ydataai/ydata-profiling">ydataai/ydata-profiling: 1 Line of code data quality profiling &amp; exploratory data analysis for Pandas and Spark DataFrames.</a></p>
<p>Below is a snippet to get you started.</p>
<pre><code class="lang-python">%pip install ydata-profiling --q

<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, when, lit
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timezone
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> DecimalType, DateType, TimestampType, IntegerType, DoubleType, StringType
<span class="hljs-keyword">from</span> ydata_profiling <span class="hljs-keyword">import</span> ProfileReport

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">profile_spark_dataframe</span>(<span class="hljs-params">
    df,
    table_name
</span>):</span>
    <span class="hljs-string">"""
    Profiles a Spark DataFrame by handling null values, transforming the DataFrame, and generating a profiling report.

    This function first processes the DataFrame by setting default values for null entries based on data type:
    - Decimal fields are set to 0.0 if null.
    - Date and Timestamp fields are set to January 1, 1900, if null.

    The transformed Spark DataFrame is then converted to a Pandas DataFrame and passed to `ydata_profiling.ProfileReport`
    to create a profiling report, which is returned as a `ProfileReport` object.

    :param df: The Spark DataFrame to be profiled.
    :param table_name: Name of the table being profiled; used in the report title.

    :return: The report as a `ProfileReport` object.
    """</span>

    <span class="hljs-comment"># Handle nulls by setting defaults before profiling</span>
    <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> df.schema.fields:
        <span class="hljs-keyword">if</span> isinstance(field.dataType, DecimalType):
            df = df.withColumn(field.name, when(col(field.name).isNull(), lit(<span class="hljs-number">0.0</span>)).otherwise(col(field.name).cast(DoubleType())))
        <span class="hljs-keyword">elif</span> isinstance(field.dataType, DateType) <span class="hljs-keyword">or</span> isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, when(col(field.name).isNull(), lit(datetime(<span class="hljs-number">1900</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>))).otherwise(col(field.name)))

    <span class="hljs-comment"># Convert to Pandas dataframe</span>
    df = df.toPandas()

    <span class="hljs-comment"># Generate report</span>
    report = ProfileReport(
        df,
        title=<span class="hljs-string">f'Profiling for <span class="hljs-subst">{table_name}</span>'</span>,
        infer_dtypes=<span class="hljs-literal">False</span>,
        correlations=<span class="hljs-literal">None</span>,
        minimal=<span class="hljs-literal">True</span>
    )

    <span class="hljs-keyword">return</span> report

<span class="hljs-comment"># Set variables</span>
schema_name = <span class="hljs-string">'stage'</span>
table_name = <span class="hljs-string">'customers'</span>

<span class="hljs-comment"># Build sample DataFrame</span>
df = spark.table(<span class="hljs-string">f'<span class="hljs-subst">{schema_name}</span>.<span class="hljs-subst">{table_name}</span>'</span>).limit(<span class="hljs-number">10000</span>)

<span class="hljs-comment"># Generate report</span>
report = profile_spark_dataframe(df, table_name)

<span class="hljs-comment"># Convert report to HTML</span>
report = report.to_html()

<span class="hljs-comment"># View report</span>
displayHTML(report)

<span class="hljs-comment"># # Create a timestamped file name</span>
utc_timestamp = datetime.now(timezone.utc).strftime(<span class="hljs-string">"%Y%m%d_%H%M%S"</span>)
file_name = <span class="hljs-string">f'<span class="hljs-subst">{table_name}</span>_profile_<span class="hljs-subst">{utc_timestamp}</span>.html'</span>

<span class="hljs-comment"># # Set output path to save html file</span>
output_path = <span class="hljs-string">f"abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxx/Files/data_profile/<span class="hljs-subst">{file_name}</span>"</span>

<span class="hljs-comment"># # Write file to lakehouse</span>
mssparkutils.fs.put(
    output_path,
    report
)
</code></pre>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>https://www.linkedin.com/in/willcrayger/</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Using Custom Python Libraries Without Fabric Environment]]></title><description><![CDATA[Fabric Environment artifacts, where to begin…
If you’re not familiar, the environment artifact, in part, is intended to mimic the functionality of the Synapse Workspace by allowing you to do things like install packages for easy reusability across yo...]]></description><link>https://blog.lucidbi.co/using-custom-python-libraries-without-fabric-environment</link><guid isPermaLink="true">https://blog.lucidbi.co/using-custom-python-libraries-without-fabric-environment</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[spark]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Mon, 21 Oct 2024 15:42:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729525148546/f064998e-1531-438a-b82d-1ca5d66751dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fabric Environment artifacts, where to begin…</p>
<p>If you’re not familiar, the environment artifact, in part, is intended to mimic the functionality of the Synapse Workspace by allowing you to do things like install packages for easy reusability across your Spark workloads.</p>
<p>Conceptually, I love the idea of having the ability to pre-define my Spark pool configurations and easily install my custom Python libraries. However, the current state of the environment item leaves a lot to be desired.</p>
<p>During active development cycles, using an environment has actually increased my development time significantly, simply due to how long it takes to publish changes. The UI often shows inaccurate information about the state of your installed libraries, listing libraries that no longer exist or prompting me to publish changes that would delete my library without me asking it to.</p>
<p>I would often open my environment to see the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521558837/81150d2e-dd7d-40c7-9e8e-9068fe1d7131.png" alt class="image--center mx-auto" /></p>
<p>It appears the environment doesn’t recognize the state of the packages installed and often wants to remove them.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521542548/3047a338-ceda-4a7e-abf2-d9371e704168.png" alt class="image--center mx-auto" /></p>
<p>Taking the various bugs into consideration, along with the immediate loss of the starter pools, it’s really challenging to see the benefits. As such, I’ve stopped using them altogether in favor of a more efficient approach that allows me to continue using starter pools while still leveraging my reusable libraries.</p>
<p>Before we start, if you’re not familiar with creating and managing .whl files I recommend checking out my buddy Sandeep’s blog:</p>
<p><a target="_blank" href="https://fabric.guru/installing-custom-python-packages-in-fabric">Installing and Managing Python Packages in Microsoft Fabric</a></p>
<p>When working with custom packages, .whl files are how you package up and deploy your code. The .whl files are what would be installed to your Fabric environment or Synapse Workspace as a prerequisite for installing them in your notebook.</p>
<p>However, we can also use inline installation by running a command such as:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Install lucid python utility library</span>
!pip install /lakehouse/default/Files/python_utility/lucidctrlutils<span class="hljs-number">-0.0</span><span class="hljs-number">.1</span>-py3-none-any.whl
</code></pre>
<p>But, what if you want to install from a location other than the default lakehouse for the notebook, or a notebook without an attached lakehouse at all?</p>
<p>For example, when creating a new feature branch, notebooks retain attachment defaults from the state they were branched from. Said differently, your notebook will remain attached to the default lakehouse in the workspace your feature branch originated from (this warrants an entire blog post on its own). In this scenario, if you’re wanting to test modifications to your library you have to shuffle through the notebook and lakehouse settings, which gets quite repetitive.</p>
<p>Instead of doing the notebook dance, you can do something like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Define the ADLS path and local file path</span>
    install_path = <span class="hljs-string">"abfss://xxxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Files/ctrlPythonLibrary/lucidctrlutils-0.0.1-py3-none-any.whl"</span>
    local_filename = <span class="hljs-string">"/tmp/lucidctrlutils-0.0.1-py3-none-any.whl"</span>

    <span class="hljs-comment"># Use mssparkutils to copy the file from storage account to the local filesystem</span>
    mssparkutils.fs.cp(install_path, <span class="hljs-string">f"file:<span class="hljs-subst">{local_filename}</span>"</span>, <span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Install the .whl file locally using pip</span>
    !pip install {local_filename}

    <span class="hljs-comment"># Import modules for data processing</span>
    <span class="hljs-keyword">from</span> lucid_ctrl_utils <span class="hljs-keyword">import</span> *

    print(<span class="hljs-string">"Successfully installed the utility library."</span>)

<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">f"An error occurred while installing the utility library: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>The above snippet reads the .whl file from your specified path, copies it to a temp directory, and allows you to execute !pip install. However, this code still requires you to manually update the install_path. Surely, we can do better.</p>
<p>In my scenario, I want to use the lakehouse name as my storage identifier instead of the GUID so we’ll have to write a bit more code.</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Install semantic link</span>
    !pip install semantic-link --q
    <span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric

    <span class="hljs-comment"># Set lakehouse name</span>
    lakehouse_name = <span class="hljs-string">'stage'</span>

    <span class="hljs-comment"># Get workspace id and list of items in one step</span>
    notebook_workspace_id = fabric.get_notebook_workspace_id()
    df_items = fabric.list_items(workspace=notebook_workspace_id)

    <span class="hljs-comment"># Filter the dataframe by 'Display Name' and 'Type'</span>
    df_filtered = df_items.query(<span class="hljs-string">f"`Display Name` == '<span class="hljs-subst">{lakehouse_name}</span>' and `Type` == 'Lakehouse'"</span>)

    <span class="hljs-comment"># Ensure there's at least one matching row</span>
    <span class="hljs-keyword">if</span> df_filtered.empty:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"No matching rows found for Display Name = 'stage' and Type = 'Lakehouse'"</span>)

    <span class="hljs-comment"># Get the 'Id' from the filtered row</span>
    stage_lakehouse_id = df_filtered[<span class="hljs-string">'Id'</span>].iloc[<span class="hljs-number">0</span>]

<span class="hljs-keyword">except</span> ValueError <span class="hljs-keyword">as</span> ve:
    <span class="hljs-comment"># Handle specific errors for value-related issues</span>
    <span class="hljs-keyword">raise</span> ve
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    <span class="hljs-comment"># Catch and raise any unexpected exceptions</span>
    <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"An error occurred: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>The above code uses semantic-link to dynamically access the GUIDs for the workspace and lakehouse for my feature branch, removing the need to hardcode these values in install_path. Now, I can construct the path without any manual input.</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Define the ADLS path and local file path</span>
    install_path = <span class="hljs-string">f"abfss://<span class="hljs-subst">{notebook_workspace_id}</span>@onelake.dfs.fabric.microsoft.com/<span class="hljs-subst">{stage_lakehouse_id}</span>/Files/ctrlPythonLibrary/lucidctrlutils-0.0.1-py3-none-any.whl"</span>
    local_filename = <span class="hljs-string">"/tmp/lucidctrlutils-0.0.1-py3-none-any.whl"</span>

    <span class="hljs-comment"># Use mssparkutils to copy the file from storage account to the local filesystem</span>
    mssparkutils.fs.cp(install_path, <span class="hljs-string">f"file:<span class="hljs-subst">{local_filename}</span>"</span>, <span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Install the .whl file locally using pip</span>
    !pip install {local_filename}  --no-cache-dir --q &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>

    <span class="hljs-comment"># Import modules for data processing</span>
    <span class="hljs-keyword">from</span> lucid_ctrl_utils <span class="hljs-keyword">import</span> *

    print(<span class="hljs-string">"Successfully installed the utility library."</span>)

<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">f"An error occurred while installing the utility library: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>Great, we now have the library installed. But, this is a lot of code to move around to all my notebooks. The overhead introduced is going to be a PITA, right? Well, we can optimize one step further by keeping this in an isolated “utility” notebook and activating it using the %run command in subsequent notebooks.</p>
<pre><code class="lang-python">%run nb_lucid_ctrl_utils
</code></pre>
<p>So, why go through all this effort? Why not use the environment item? Well, there are a few reasons beyond avoiding the bugs mentioned earlier in the article.</p>
<p>With this approach I get to abuse the quick spin-up time of the starter pool rather than waiting 90+ seconds for a custom pool to come online.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729524576506/d345d90a-60c0-442a-8760-7ae1f9ea75d1.png" alt class="image--center mx-auto" /></p>
<p>If I gained nothing else, this is a huge W in my opinion. However, this also solves a REALLY frustrating issue with branching, which I’ll address in another blog coming soon.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>linkedin.com/in/willcrayger</strong></a></p>
<p><a target="_blank" href="https://calendly.com/wcrayger">Calendly - Will Crayger</a></p>
]]></content:encoded></item><item><title><![CDATA[There's No Shortcut to Proper Planning]]></title><description><![CDATA[About a month ago, while working on the first iteration of the Fabric capacity monitoring report, I stumbled across a bit of an invisible fence. I posted a quick teaser with a screenshot on LinkedIn to see if anyone else could spot it:
https://www.li...]]></description><link>https://blog.lucidbi.co/theres-no-shortcut-to-proper-planning</link><guid isPermaLink="true">https://blog.lucidbi.co/theres-no-shortcut-to-proper-planning</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[Power BI]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Sun, 09 Jun 2024 17:07:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717952806289/b54b1bf5-0c32-4553-91ec-1c8aa3c611f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About a month ago, while working on the first iteration of the Fabric capacity monitoring report, I stumbled across a bit of an invisible fence. I posted a quick teaser with a screenshot on LinkedIn to see if anyone else could spot it:</p>
<p><a target="_blank" href="https://www.linkedin.com/posts/willcrayger_microsoftfabric-activity-7186074625783574529-cdu3/?utm_source=share&amp;utm_medium=member_desktop">https://www.linkedin.com/posts/willcrayger_microsoftfabric-activity-7186074625783574529-cdu3/?utm_source=share&amp;utm_medium=member_desktop</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717947173075/45debc64-a29d-47c4-90b3-e44281ad63a6.png" alt class="image--center mx-auto" /></p>
<p>It's very subtle so you have to look closely. In case you haven't spotted it yet, the location (region) of your artifacts is directly tied to the location in which your capacity exists.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717947296282/34bf2175-c275-4a24-9ab6-6bbdad40f9c0.png" alt class="image--center mx-auto" /></p>
<p>As I said, it's quite subtle, but this is actually a very big deal for several reasons.</p>
<h3 id="heading-region-lock">Region Lock</h3>
<p>When you create a Fabric workspace you must assign it to a capacity to use the Fabric features. Before assigning to a capacity, you must create a capacity in the Azure portal. When you create the capacity you pick a region for that capacity to exist in.</p>
<p>Let's assume you currently have a Trial Fabric capacity that exists in East US 2. We'll create a new workspace and assign it to the trial capacity. We'll then create a lakehouse in the new workspace.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948252191/3cb8a742-938a-4fc1-9e78-79f92d3d1584.png" alt class="image--center mx-auto" /></p>
<p>After creating the workspace and lakehouse, you decide to reassign the workspace to a different capacity in another region, no big deal, right?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948352378/35601e15-0d72-4eee-88f2-8bd101037212.png" alt class="image--center mx-auto" /></p>
<p>Unfortunately, by adding a lakehouse to the workspace we have region-locked ourselves. To move the workspace we would need to remove the storage account (lakehouse).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948464143/79b9de0e-16a4-46b4-b399-56168418a4cc.png" alt class="image--center mx-auto" /></p>
<p>The limitation here makes sense when you consider storage costs are different depending on region. Now, I know what everyone is going to say, "Who cares, just create a shortcut and be done with it". Well, welcome to the conversation.</p>
<h3 id="heading-the-hidden-cost-of-shortcuts">The Hidden Cost of Shortcuts</h3>
<p>The reason I included "hidden" in the title of this section is due to the lack of transparency in the documentation. When it comes to pricing, we generally know how much storage and compute will cost, but the documentation around network and data transfer costs is almost non-existent. There's a single line buried in the pricing whitepaper:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717952499038/351148a8-2244-4281-8db5-b98c0399e91f.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://azure.microsoft.com/en-us/pricing/details/microsoft-fabric/">https://azure.microsoft.com/en-us/pricing/details/microsoft-fabric/</a></p>
<p>I've been hesitant to write this article because I haven't been able to get a definitive answer, until now.</p>
<p>Let's review the following scenario:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717950508276/2963ae95-8051-405c-aa75-263f97d34a2e.png" alt class="image--center mx-auto" /></p>
<p>In the above diagram, we have a lakehouse in a workspace assigned to a capacity in East US 2. Another team has requested access to our data and created a shortcut from their workspace assigned to a capacity in West US per the recommended approach.</p>
<p>Even when leveraging shortcuts, this scenario still produces a read operation that spans across data centers and therefore will (eventually) incur data transfer fees. I say eventually because we're not currently seeing these charges, but your team should be aware that they are coming.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>When planning your Fabric implementation, capacity planning is going to play a big part, but not just from a cost and sizing perspective. Planning the location of the capacities is equally important if you want to avoid things like region lock and unexpected line items on your monthly bill.</p>
<p>If you haven't already, I encourage you to check out my last article on deployment as many of the broader architectural considerations will also be applicable here.</p>
<p><a target="_blank" href="https://lucidbi.co/fabric-architecture-considerations">https://lucidbi.co/fabric-architecture-considerations</a></p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>linkedin.com/in/willcrayger</strong></a></p>
<p><a target="_blank" href="https://calendly.com/wcrayger/lucid-intro-call">https://calendly.com/wcrayger/lucid-intro-call</a></p>
]]></content:encoded></item><item><title><![CDATA[Microsoft Fabric Enterprise Deployment Considerations]]></title><description><![CDATA[When it comes to architecture strategies there's no one-size-fits-all solution. Every team will have different use cases that will drive requirements such as security, environment isolation, and tools needed.
While working with one of my clients we s...]]></description><link>https://blog.lucidbi.co/fabric-architecture-considerations</link><guid isPermaLink="true">https://blog.lucidbi.co/fabric-architecture-considerations</guid><category><![CDATA[microsoftfabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Mon, 27 May 2024 17:56:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1716837932498/6c049ae6-89e0-4f98-b7fa-74713f9dc31d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When it comes to architecture strategies there's no one-size-fits-all solution. Every team will have different use cases that will drive requirements such as security, environment isolation, and tools needed.</p>
<p>While working with one of my clients we started to dive into what a future architecture in Microsoft Fabric would look like for them. As we discussed their requirements I realized that a few potential issues were waiting for us in the fog.</p>
<p>Before we dive in, I'd like to recognize the awesome work of my friend, the data goblin himself, Kurt Buhler. Kurt helped create the Power BI usage scenario diagrams that can be found here:</p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-planning-usage-scenario-diagrams#enterprise-bi">https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-planning-usage-scenario-diagrams#enterprise-bi</a></p>
<p>A ton of work went into the design of the usage diagrams and I love the stylistic approach the team took. As such, I tried to keep the general format the same while extending some of the patterns to broader architecture.</p>
<p>If you haven't already I highly recommend checking out Kurt's blog:</p>
<p><a target="_blank" href="https://data-goblins.com/">https://data-goblins.com/</a></p>
<h3 id="heading-understanding-medallion-patterns">Understanding Medallion Patterns</h3>
<p>Everyone should have heard of medallion architecture by now. For those that haven't, medallion patterns have been around for years and have had many names (raw, validated, enriched, semi-curated, curated, etc.). While the naming pattern used is less important, the idea behind using zones typically stems from the need to have separation of data by state of readiness and other security or governance requirements. This is of course an oversimplification for conceptual purposes.</p>
<p>Below is an example medallion pattern in Fabric.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20Medallion%20-%20Fabric.svg" alt="Simple Medallion - Fabric" /></p>
<ol>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or on-premise data gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in bronze lakehouse as close to raw form as possible.</p>
</li>
<li><p>Create a shortcut from bronze to silver lakehouse to begin the data enrichment and cleansing process.</p>
</li>
<li><p>Create a shortcut from silver to gold to persist in a dimensional model for enterprise analytics.</p>
</li>
</ol>
<p>We could follow a very similar pattern implemented with Azure data services.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20Medallion%20-%20Azure%20Data%20Services.svg" alt="Simple Medallion - Azure Data Services" /></p>
<ol>
<li><p>Read/copy data from the source system using Spark via notebook or pipeline.</p>
</li>
<li><p>Use managed private endpoints or integration runtime gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in a data lake gen2 bronze container as close to raw form as possible.</p>
</li>
<li><p>Write from bronze container to silver delta lake container to begin data enrichment and cleansing process.</p>
</li>
<li><p>Write from silver delta lake container to gold delta lake container to persist dimensional model for enterprise analytics.</p>
</li>
</ol>
<h3 id="heading-combining-lifecycle-management-and-medallion-patterns">Combining Lifecycle Management and Medallion Patterns</h3>
<p>In addition to medallion patterns, we must also be aware of application lifecycle management (ALM) practices (dev., test, prod.).</p>
<p>An example of a simple ALM strategy for Azure data services is to use prod./non-prod. subscriptions with dev./test/prod. resource groups.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20ALM%20-%20Azure%20Data%20Services.svg" alt="Simple ALM - Azure Data Services" /></p>
<p>Additional steps for ALM include:</p>
<ol start="6">
<li><p>Integrate infrastructure as code (IaC) with source control repo.</p>
</li>
<li><p>Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.</p>
</li>
<li><p>Deploy to test using the release pipeline.</p>
</li>
<li><p>Deploy to prod. using the release pipeline.</p>
</li>
</ol>
<h3 id="heading-fabric-application-lifecycle-management-alm">Fabric Application Lifecycle Management (ALM)</h3>
<p>In Fabric, we have deployment pipelines to enable code promotion to higher environments. The overall flow is quite similar with a few differences in how things are released.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20ALM%20-%20Fabric.svg" alt="Simple ALM - Fabric" /></p>
<p>Additional steps for Fabric ALM:</p>
<ol start="6">
<li><p>Sync dev. workspace with git.</p>
</li>
<li><p>Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.</p>
</li>
<li><p>The release pipeline triggers the Fabric deployment pipeline to deploy to test.</p>
</li>
<li><p>Release pipeline triggers Fabric deployment pipeline to deploy to prod.</p>
</li>
</ol>
<p>The wonderful thing about Fabric deployment pipelines is the ability to sync a workspace directly to your git repo and use built-in pipelines to move artifacts, giving us integrated low-code IaC and CI/CD.</p>
<p><img src="https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/media/intro-to-deployment-pipelines/full-pipeline.gif" alt="A screenshot of a working deployment pipeline with all three stages, development, test, and production, populated." /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/intro-to-deployment-pipelines">https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/intro-to-deployment-pipelines</a></p>
<h3 id="heading-governance-and-security-consideration">Governance and Security Consideration</h3>
<p>You may have picked up on this through the diagrams already, but a core concept for the remainder of the article is that artifact management in Fabric is centralized around a workspace.</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Architecture%20Layers%20-%20Fabric.svg" alt="Architecture Layers - Fabric" /></p>
<p>In the documentation for implementing medallion lakehouse architecture in Fabric, there's one section that sparked my curiosity.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570489846/2c1aad3d-468f-4f1d-8cb3-981ff985353e.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture">https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture</a></p>
<p>Conceptually, the above diagram makes sense per the rule of zone isolation. That said, the diagram shows separation at the lakehouse level, with all lakehouses residing in the same workspace. We'll come back to this in a few minutes, but this is an important piece of the puzzle.</p>
<p>If you continue reading you'll find the following statement:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570469833/03393976-a490-4a08-afe8-93b1e7379d60.png" alt class="image--center mx-auto" /></p>
<p>The statement above contradicts the diagram and suggests that each medallion zone should be broken into its own workspace. Said differently, a bronze lakehouse should be located in a bronze workspace, and so on. The reason behind the recommendation can be found in the lakehouse access control documentation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570556547/79bac5ae-6d32-46a4-aca1-f5902158a1c9.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-roles-lakehouse">https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-roles-lakehouse</a></p>
<p>Circling back to the idea of all lakehouse zones living in the same workspace, we can see that from a security and governance perspective, this presents our first, rather large issue.</p>
<p>To perform any activity other than "read" over a lakehouse, one must have an admin, member, or contributor role in the workspace in which the lakehouse resides. Granting A/M/C over the workspace grants the applicable permissions to all artifacts in the workspace. In other words, users have visibility into all lakehouse zones and the data within them.</p>
<p>Let's consider a scenario in which we have data with personally identifiable information (PII) such as a social security number in the bronze layer. As the data moves through silver we apply masking rules to the PII data. We want to enable our advanced business analysts and developers to perform tasks other than "read" on the data but we do not want them to see the underlying raw data. We would not be able to facilitate this requirement if the lakehouse zones were contained in a single workspace.</p>
<h3 id="heading-workspace-sprawl">Workspace Sprawl</h3>
<p>The next consideration is workspace sprawl. If we adhere to the recommendation of isolating zones by workspace, the number of workspaces needed multiplies quickly. For example, rather than having a single workspace with three separate lakehouses, we will now have three workspaces with one lakehouse each.</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20-%20Fabric.svg" alt="Isolated Medallion - Fabric" /></p>
<p>I'm sure your mind is already drifting in this direction, but what about handling lifecycle management?</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20ALM%20-%20Fabric.svg" alt="Isolated Medallion ALM - Fabric" /></p>
<p>Let's consider a scenario to justify following ALM for each medallion zone. To surface data in the bronze zone you will likely be copying data from a source system using a pipeline or notebook. Once the data is available in bronze you will create a shortcut to silver.</p>
<p>Thinking about the lifecycle:</p>
<ol>
<li><p>Net new request for data is received.</p>
</li>
<li><p>Begin development of the pipeline or notebook to copy the data (dev.).</p>
</li>
<li><p>Test the copy process to ensure it meets requirements (test).</p>
</li>
<li><p>The copy process is considered stable and is placed on a schedule to ensure current data is available to Silver (prod.).</p>
</li>
</ol>
<p>As you can see, by splitting the medallion zones by workspace we've tripled the number of needed workspaces in our pattern. With the increase in workspace count, the question becomes how do we manage deployments?</p>
<h3 id="heading-artifact-deployment-considerations">Artifact Deployment Considerations</h3>
<p>Fabric deployment pipelines have been refactored quite a bit from their Power BI days, with one of the most significant changes being an increase in the number of supported "stages" from three to ten.</p>
<p>Theoretically, if we followed the recommendations above, one deployment pipeline would support our nine workspaces. However, if we wanted to include additional separation of artifacts or add another zone to our medallion pattern we would exceed the allowed number of workspaces for our pipeline.</p>
<p>Another consideration with deployment pipelines is that they're linear, meaning an artifact must be deployed through all stages sequentially.</p>
<p>In addition to the workspace quantity limit, we're also limited in that a workspace can only belong to one pipeline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716651627607/9f953837-3924-46a3-b003-7fbc0ef88269.png" alt class="image--center mx-auto" /></p>
<p>Since a workspace can belong to only one deployment pipeline, the idea of chaining multiple pipelines together in the Fabric UI becomes void. Instead, you would need to integrate with DevOps release pipelines to programmatically trigger a deployment pipeline release.</p>
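<p>For what it's worth, triggering a deployment pipeline stage programmatically boils down to a single REST call. Here's a minimal sketch against the Power BI deployment pipelines API (field names are from memory of the deployAll endpoint, and the token is assumed to come from a service principal or user with access to the pipeline, so verify against the current API reference):</p>
<pre><code class="lang-python">import requests

pipeline_id = "your-deployment-pipeline-guid"
token = "your-bearer-token"  # e.g. acquired via azure-identity / MSAL in the release pipeline

# Deploy all supported items from one stage to the next
response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/pipelines/{pipeline_id}/deployAll",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sourceStageOrder": 0,  # 0 = development stage
        "options": {"allowCreateArtifact": True, "allowOverwriteArtifact": True},
    },
)
response.raise_for_status()  # the call is async; poll the returned operation for completion
</code></pre>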
<h3 id="heading-enterprise-architecture-strategies">Enterprise Architecture Strategies</h3>
<p>As I stated in the opening, every team is going to have its own set of requirements that will drive source code management and deployment strategies. Below are a few examples of potential enterprise patterns.</p>
<p><strong>Single Workspace Medallion Pattern:</strong></p>
<p>If security and governance at the lakehouse level aren't a concern, perhaps it doesn't make sense to split your medallion layers by workspace. In such a scenario, a potential architecture could look something like the following:</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Single%20Workspace%20Medallion.svg" alt="Single Workspace Medallion" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or on-premise data gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in the lakehouse as close to raw form as possible.</p>
<ol>
<li>Copy data to the warehouse using pipeline (4b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (5b. optional).</li>
</ol>
</li>
<li><p>Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse delta tables. Semantic models may be created as Direct Lake, Import, or Direct Query connections.</p>
</li>
<li><p>Content creators create reports and dashboards for consumption.</p>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working code version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request will trigger the Test / Validate pipeline, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>A workspace app is created and serves as the primary entry point for end-user consumption.</p>
</li>
</ol>
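<p>As a rough sketch of what steps 8 through 11 can look like when driven programmatically, Fabric exposes Git REST endpoints for committing pending workspace changes back to the connected repo. The token and workspace ID are placeholders:</p>
<pre><code class="lang-python"># Rough sketch: commit all pending workspace changes to the connected
# Azure DevOps repo via the Fabric Git API. Assumes the workspace is already
# git-connected and an AAD token with Fabric scope was acquired out-of-band.
import requests

access_token = "&lt;aad-access-token&gt;"  # assumption: acquired out-of-band
workspace_id = "&lt;workspace-id&gt;"      # assumption: the dev workspace

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/git/commitToGit",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "mode": "All",                        # commit every changed item
        "comment": "Sync workspace changes",  # commit message
        # "workspaceHead": "&lt;commit-sha&gt;",  # may be required once the branch has commits
    },
)
response.raise_for_status()
</code></pre>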
<p><strong>Isolated Workspace Medallion Pattern:</strong></p>
<p>For most teams, however, I believe the security and governance conversation can't be ignored so easily, and zone isolation will therefore be required. The overall change in architecture is significant, as the entry points for workloads shift.</p>
<p><strong>Bronze key considerations:</strong></p>
<ul>
<li><p>Data will be read from source systems and written to the bronze layer.</p>
</li>
<li><p>The readiness of the data doesn't yet enable enterprise report development.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(bronze).svg" alt="Isolated Medallion (bronze)" /></p>
<p><em>Note: workspaces from each medallion layer are now managed by individual deployment pipelines</em></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or an on-premises data gateway when working with on-premises or firewall-protected source systems.</p>
</li>
<li><p>Store data in the lakehouse as close to raw form as possible.</p>
<ol>
<li>Copy data to the warehouse using pipeline (4b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (5b. optional).</li>
</ol>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working code version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.</p>
</li>
</ol>
<p><strong>Silver key considerations:</strong></p>
<ul>
<li><p>Data will be read from the bronze layer and written to the silver layer.</p>
</li>
<li><p>The readiness of the data doesn't yet enable enterprise report development.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(silver).svg" alt="Isolated Medallion (silver)" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read data from bronze using a shortcut (see the shortcut sketch after this list).</p>
<ol>
<li>Read data from bronze using pipeline (2b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (3b. optional).</li>
</ol>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.</p>
</li>
</ol>
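<p>For step 2, shortcuts can be created through the UI or programmatically. Below is a hedged sketch against the OneLake shortcuts REST API; every ID and the table name are placeholders:</p>
<pre><code class="lang-python"># Hedged sketch: create a OneLake shortcut in the silver lakehouse that
# points at a table folder in the bronze lakehouse. All IDs are placeholders.
import requests

access_token = "&lt;aad-access-token&gt;"
silver_workspace_id = "&lt;silver-workspace-id&gt;"
silver_lakehouse_id = "&lt;silver-lakehouse-id&gt;"

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{silver_workspace_id}"
    f"/items/{silver_lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "path": "Tables",        # where the shortcut lands in silver
        "name": "sales_orders",  # hypothetical shortcut name
        "target": {
            "oneLake": {
                "workspaceId": "&lt;bronze-workspace-id&gt;",
                "itemId": "&lt;bronze-lakehouse-id&gt;",
                "path": "Tables/sales_orders",  # hypothetical bronze table
            }
        },
    },
)
response.raise_for_status()
</code></pre>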
<p><strong>Gold key considerations:</strong></p>
<ul>
<li><p>Data will be read from the silver layer and written to the gold layer.</p>
</li>
<li><p>The readiness of the data now enables enterprise report development.</p>
</li>
<li><p>End-user testing of reports will be needed.</p>
</li>
<li><p>Workspace applications will be used for report consumption.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(gold).svg" alt="Isolated Medallion (gold)" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read data from Silver using a shortcut.</p>
<ol>
<li>Read data from Silver using a pipeline (2b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (3b. optional).</li>
</ol>
</li>
<li><p>Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse delta tables. Semantic models may be created as Direct Lake, Import, or DirectQuery connections.</p>
</li>
<li><p>Content creators create reports and dashboards for consumption.</p>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>A workspace app is created and serves as the primary entry point for end-user consumption.</p>
</li>
</ol>
<p>By isolating the medallion zones into separate deployment pipelines, you gain more control over each environment, but you also introduce overhead: there are now more DevOps artifacts to build and manage.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>Your choice of architecture will come down to several factors, one of which is the balance between security/governance and the overhead of maintaining the system. Like most things when working with data, there's no one-size-fits-all solution.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>https://www.linkedin.com/in/willcrayger/</strong></a></p>
<p>You can find the diagrams in SVG format on GitHub:</p>
<p><a target="_blank" href="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams">https://github.com/Lucid-Will/Fabric-Architecture-Diagrams</a></p>
]]></content:encoded></item><item><title><![CDATA[How To Reduce Data Integration Costs By 98%]]></title><description><![CDATA[One of the amazing things about Microsoft Fabric is the number of options you have for moving data. For example, you can use Dataflow Gen2, Pipelines, Notebooks, or any combination of the three.
However, with all the options available, it can also ma...]]></description><link>https://blog.lucidbi.co/how-to-reduce-data-integration-costs-by-98</link><guid isPermaLink="true">https://blog.lucidbi.co/how-to-reduce-data-integration-costs-by-98</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[microsoft fabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Thu, 11 Apr 2024 16:11:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871009634/6f04cfb1-424b-4a1a-a681-8e687745b362.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the amazing things about Microsoft Fabric is the number of options you have for moving data. For example, you can use Dataflow Gen2, Pipelines, Notebooks, or any combination of the three.</p>
<p>However, with all the options available, it can also make deciding which pattern to use a bit like reading the menu at The Cheesecake Factory. Hopefully, this article will help shed some light on the pros and cons of each approach while also showing you how they impact the elephant in the room: Capacity Units (CUs).</p>
<p>Before we dive in, we need to understand a few things. Let's start with the most important topic: understanding how your capacity consumption is measured.</p>
<p><strong>* Disclosure: Due to the correction of a record-duplication issue, some modifications have been made since this article was originally published. For transparency, I identify the impact throughout the article with comments and (bold) identifiers.</strong></p>
<h3 id="heading-brief-capacity-overview">Brief Capacity Overview</h3>
<p>Premium Capacity has been around for a few years as part of the Power BI stack. However, until the release of Fabric, you only had to worry about how Power BI impacted your capacity. With Fabric, "all" experiences have been standardized on the same serverless compute, meaning pipelines, notebooks, SQL endpoints, and so on all draw from your available capacity pool. Because of this, it's more important than ever to truly understand the cost of each workload.</p>
<p>Like Power BI Premium, Fabric capacities have many SKUs, each giving you a specified amount of compute to work with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712757838861/898bf589-51f5-48ff-8d75-67129aca302d.png" alt class="image--center mx-auto" /></p>
<p>There's an additional SKU that isn't listed, FT1 or Fabric Trial Capacity. The FT1 is equivalent to the F64 SKU, which is also equivalent to the existing Power BI P1 SKU.</p>
<p><strong>All tests performed in this article were done so using an FT1.</strong></p>
<h3 id="heading-capacity-metrics-first-look">Capacity Metrics: First Look</h3>
<p>To begin understanding your capacity consumption, the Microsoft team has created a Fabric Capacity Metrics app that can be deployed from the Power BI app collection. Go to the Apps section of your left-side navigation, click "Get apps" in the top right, and search Fabric.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712758086999/4165a77c-2eeb-457e-8061-6d9c442b37d2.png" alt class="image--center mx-auto" /></p>
<p>Once you have the app installed, you'll need to authenticate and make a few selections regarding the configuration of the timezone and such; it's quite straightforward, though.</p>
<p>At last, you're ready to see the magic! Well, it's not quite magic, but it's a starting point. The app has a few challenges, as you'll soon encounter; perhaps the biggest is that only a rolling 14 days of history are available, which makes historical analysis quite challenging. There are ways around this, which I'll cover in a later article; for now, let's stay on topic.</p>
<p>Upon taking your first look at the metrics, your reaction might be, "What the heck am I looking at?" With all analysis, unless you understand the numbers, they're just numbers. It's a movie without a plot. They exist, but what do they mean? What story are they telling?</p>
<h3 id="heading-capacity-unit-the-numbers">Capacity Unit: The Numbers</h3>
<p>For our analysis, we will home in on a specific data point: CU(s).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712758465800/8e2cbe25-4419-43e6-81b2-91eae53fdbd1.png" alt class="image--center mx-auto" /></p>
<p>These look like big numbers. These activities must be super expensive, right? Like the old Transformers theme song, there's "more than meets the eye."</p>
<p>To understand the true cost of an activity, we have to do some math. Referring to the chart in the previous section, we're using the equivalent of an F64, which means we have 64 Capacity Units at our disposal. To translate that, we must first understand what a Capacity Unit is. Let's break it down.</p>
<p>First and foremost, a Capacity Unit is not the same as a CU(s). A Capacity Unit is a measurement based on hours, as indicated here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712759196209/bf42f614-dd3d-4c52-9fdc-0a0cb0efbee5.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712759240866/d102d8da-57e6-479e-aace-2f2a59ffc230.png" alt class="image--center mx-auto" /></p>
<p>CU(s) is a measurement of seconds, which can be misleading. I initially thought the Duration was somehow used in the conversion, but this is not true. The conversion of Capacity Unit to CU(s) is as follows:</p>
<p><code>1 Capacity Unit (hour) = 60 seconds * 60 minutes = 3,600 (seconds per hour)</code></p>
<p><code>For our F64 (FT1) with 64 Capacity Units:</code></p>
<p><code>64 Capacity Units (hours) = 3,600 CU(s) * 64 = 230,400 CU(s) per hour</code></p>
<p>To translate this to cost, I'll use the first row from the capacity metrics app screenshot above with a CU(s) consumption of 909,747.44.</p>
<p>With our understanding of the conversion between CU(s) and Capacity Units, we can convert CU(s) to cost:</p>
<p><code>Capacity Units = 909,747.44 CU(s) / 3,600 = 252.707622222</code></p>
<p><code>PayGo Cost per Capacity Unit = $11.52 / 64 = $0.18</code></p>
<p><code>PayGo Cost (USD) = 252.71 * $0.18 = ~$45.49</code></p>
<p><code>Reservation Cost per Capacity Unit = $6.853 / 64 = $0.107078125</code></p>
<p><code>Reservation Cost (USD) = ~$27.06</code></p>
<p>If you really wanted to break it down further to determine the cost per CU(s):</p>
<p><code>PayGo Cost per Capacity Unit = $11.52 / 64 = $0.18</code></p>
<p><code>PayGo Cost per CU(s) = $0.18 / 3,600 = $0.00005</code></p>
<p><code>Reservation Cost per Capacity Unit = $6.853 / 64 = ~$0.107078125</code></p>
<p><code>Reservation Cost per CU(s) = $0.107078125 / 3,600 = ~$0.000029744</code></p>
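<p>If you'd rather not do this by hand, the conversions above reduce to a few lines of Python. This is just the arithmetic from this section; the rates are the F64 hourly prices quoted above:</p>
<pre><code class="lang-python"># Encode the CU(s)-to-dollars conversion worked through above.
# Rates are the F64 hourly prices quoted in this article.
SECONDS_PER_CAPACITY_UNIT_HOUR = 3_600
F64_CAPACITY_UNITS = 64

PAYGO_RATE_PER_HOUR = 11.52        # $/hour for an F64, pay-as-you-go
RESERVATION_RATE_PER_HOUR = 6.853  # $/hour for an F64, reserved

def cost_from_cu_seconds(cu_seconds, hourly_rate, capacity_units=F64_CAPACITY_UNITS):
    """Convert a CU(s) reading from the metrics app into dollars."""
    capacity_unit_hours = cu_seconds / SECONDS_PER_CAPACITY_UNIT_HOUR
    rate_per_capacity_unit = hourly_rate / capacity_units
    return capacity_unit_hours * rate_per_capacity_unit

# The 909,747.44 CU(s) example from the screenshot above:
print(round(cost_from_cu_seconds(909_747.44, PAYGO_RATE_PER_HOUR), 2))        # ~45.49
print(round(cost_from_cu_seconds(909_747.44, RESERVATION_RATE_PER_HOUR), 2))  # ~27.06
</code></pre>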
<p>Now that you understand the meaning behind the numbers, let's dig into the real purpose of this article.</p>
<p><strong>For the remainder of the article, any cost-related metrics will be calculated using the Reservation rate.</strong></p>
<h3 id="heading-all-experiences-were-not-created-equal">All Experiences Were Not Created Equal</h3>
<p>To make this a bit easier to digest, I've created a monitoring report to help me tell the story.</p>
<p>Welcome to the Lucid capacity monitoring report!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871192329/ed2d2bb0-b3b2-48e7-8e1e-d61d157a796c.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>The custom monitoring report uses a combination of data elements from the Fabric tenant as well as data points captured directly from the Fabric Capacity Metrics app. The backend of this report is a Fabric Lakehouse that's populated by a Spark notebook at scheduled intervals.</p>
<p>The notebook has been written to perform the following operations:</p>
<ul>
<li><p>Refresh the Fabric Capacity Metrics semantic model</p>
</li>
<li><p>Capture data about the Fabric tenant, such as workspace, capacity, and item details, into a series of stage tables</p>
</li>
<li><p>Create a calendar stage table</p>
</li>
<li><p>Perform a dynamic UPSERT from the stage tables to a set of dimensional tables (sketched below)</p>
</li>
</ul>
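<p>As a rough sketch of that UPSERT step (the table and key names are hypothetical stand-ins for the real framework, and the notebook's ambient <code>spark</code> session is assumed), the merge can be expressed with the Delta Lake API:</p>
<pre><code class="lang-python"># Rough sketch of the stage-to-dimension UPSERT using the Delta Lake merge API.
# Table and key names are hypothetical; spark is the notebook's ambient session.
from delta.tables import DeltaTable

stage_df = spark.read.table("stage_workspace")          # staged tenant snapshot
dim_table = DeltaTable.forName(spark, "dim_workspace")  # target dimension table

(
    dim_table.alias("dim")
    .merge(stage_df.alias("stg"), "dim.workspace_id = stg.workspace_id")
    .whenMatchedUpdateAll()     # refresh attributes for existing workspaces
    .whenNotMatchedInsertAll()  # insert newly discovered workspaces
    .execute()
)
</code></pre>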
<p>The Lucid monitoring report is then connected to a semantic model of the Lakehouse via Direct Lake mode.</p>
<h3 id="heading-setting-the-stage">Setting The Stage</h3>
<p>For our comparisons, we will focus on different processing patterns using Pipelines and Notebooks. The three patterns I'll be reviewing are:</p>
<ol>
<li><p>A traditional pipeline pattern using nested pipelines with a ForEach loop.</p>
</li>
<li><p>A notebook that generates the processing list and passes it to a single pipeline via API.</p>
</li>
<li><p>A notebook that generates the processing list and also "copies" directly from the source.</p>
</li>
</ol>
<p>Each scenario follows the same structure, reading data from the source and writing a parquet file to a designated folder in a Lakehouse.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712845009716/25a18425-7159-4d91-8676-c99f9ee4e69e.png" alt class="image--center mx-auto" /></p>
<p><strong>All tests were scheduled to run hourly and performed at staggered intervals to minimize the potential of a noisy neighbor impacting the test.</strong></p>
<p>I used the WideWorldImporters OLAP database hosted in an Azure SQL database for sample data. To simulate real-world examples, I have a daily pipeline to execute a stored procedure that populates fresh data.</p>
<p>Sample:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712763708345/c2633e61-94bd-42ef-91db-cc571697956e.png" alt class="image--center mx-auto" /></p>
<p>Additionally, each pattern uses a basic metadata-driven approach consisting of a single Azure SQL database containing a few control tables. There are a total of 44 tables that are processed as part of my testing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712763740921/1ccdef1b-573a-4395-86f3-da1dc66d3b94.png" alt class="image--center mx-auto" /></p>
<p>dbo.Copy sample:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712845206125/ced2ece4-2225-46a4-b407-73fcba71a784.png" alt class="image--center mx-auto" /></p>
<p>My testing aimed to understand the efficiency and consumption of each processing pattern with respect to Fabric workloads.</p>
<h3 id="heading-traditional-parent-child-pipeline-pattern">Traditional Parent / Child Pipeline Pattern</h3>
<p>A simple and efficient pattern to dynamically process data using pipelines is to use a parent/child relationship. In this pattern, a parent pipeline typically generates a list of items to process before passing the items in the list to a ForEach loop.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712764034673/6444b411-e44f-4682-9877-bec5979238ef.png" alt class="image--center mx-auto" /></p>
<p>Inside the ForEach loop is usually an activity to execute another pipeline, the child. In this example, the child pipeline sets the path where the parquet file will be written, performs the copy, and logs the path for later retrieval.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712764098120/0f8614dd-706e-4f77-92ec-f6ea59c392b8.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-scenario-1-results">Scenario 1 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871294142/9d660033-37ed-41e3-8dee-4d9b95156140.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>The test for this pattern was quite interesting. Repeating values. Why are there repeating values? I thought maybe I had a bad measure or was missing a relationship. I went back to my Lakehouse to check the data. Interestingly, the capacity units consumed remain static, but the activity duration and other metrics fluctuate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871518743/c8a7a4d6-3d6b-4ba9-a782-4fd57eae4d24.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Looking at the child pipeline, we see the same pattern.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871651768/584a18e7-9490-4ce4-99d7-1dd66a644aa6.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>This looks like capacity "smoothing" kicking in and spreading the total consumption of the runs out over time. That said, I'm not 100% sure and would like to dig in more to confirm my suspicion.</p>
<p>For now, let's continue and focus on the total consumption for all runs in the day. As we can see from our report, this pattern cost us <strong>$4.29</strong> for the day and consumed <strong>2.6%</strong> of the total available daily compute.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871899790/fe38dff2-0fd6-4971-a199-f2c0229198dd.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<h3 id="heading-notebook-orchestration-with-pipeline-copy">Notebook Orchestration with Pipeline Copy</h3>
<p>The next scenario I wanted to test combines the use of both a pipeline and a notebook. In this pattern, we use a notebook to replace the parent pipeline from Scenario 1.</p>
<p>There are several reasons why you might want to do this. Pipelines are efficient at copying data but can often be too rigid for lookups or other configuration requirements, especially when working with metadata frameworks. Because of this, developers begin to layer in nested pipelines, which can become quite expensive.</p>
<p>By combining the use of a notebook to build the configuration and a pipeline to perform the copy, you have much more flexibility and control over your process.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function to be executed in parallel for each row to call API</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_api_with_payload</span>(<span class="hljs-params">row, workspace_id, item_id, job_type, client</span>):</span>
    <span class="hljs-string">"""
    Orchestration pattern consumption analysis
    Tests performed by Will Crayger of Lucid
    """</span>

    <span class="hljs-comment"># Extract parameters for payload from the row</span>
    payload = {
        <span class="hljs-string">"executionData"</span>: {
            <span class="hljs-string">"parameters"</span>: {
                <span class="hljs-string">"schema"</span>: row[<span class="hljs-string">"Schema"</span>],
                <span class="hljs-string">"object"</span>: row[<span class="hljs-string">"Object"</span>],
            }
        }
    }

    <span class="hljs-comment"># Call the Fabric REST API</span>
    <span class="hljs-keyword">try</span>:
        response = client.post(<span class="hljs-string">f"/v1/workspaces/<span class="hljs-subst">{workspace_id}</span>/items/<span class="hljs-subst">{item_id}</span>/jobs/instances?jobType=<span class="hljs-subst">{job_type}</span>"</span>, json=payload)
        <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">202</span>:
            <span class="hljs-keyword">pass</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">return</span> response.json()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Print error</span>
        print(<span class="hljs-string">f"An error occurred while calling API: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Retrieve processing list</span>
process_list_sql = <span class="hljs-string">"SELECT * FROM [dbo].[Copy]"</span>
df_process_list = spark.read.format(<span class="hljs-string">"jdbc"</span>).option(<span class="hljs-string">"url"</span>, key_vault_secret).option(<span class="hljs-string">"query"</span>, process_list_sql).load()

<span class="hljs-comment"># Convert to Pandas DataFrame</span>
df_pandas = df_process_list.toPandas()

<span class="hljs-comment"># Define parameters for scheduler</span>
workspace_id = fabric.get_workspace_id()
item_id = <span class="hljs-string">"&lt;your_item_id&gt;"</span>
job_type = <span class="hljs-string">"Pipeline"</span>

<span class="hljs-comment"># Use ThreadPoolExecutor to call APIs concurrently</span>
<span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=min(len(df_pandas), (os.cpu_count() <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>) * <span class="hljs-number">5</span>)) <span class="hljs-keyword">as</span> executor:
    <span class="hljs-comment"># Submit tasks to the executor</span>
    future_to_row = {executor.submit(call_api_with_payload, row, workspace_id, item_id, job_type, client): index <span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df_pandas.iterrows()}
</code></pre>
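<p>One detail the snippet leaves implicit is draining the futures. A minimal follow-up, reusing the same names, could look like this (recall that the function only returns a payload when a submission doesn't come back with a 202):</p>
<pre><code class="lang-python"># Minimal follow-up: wait for every submitted call and surface any error
# payloads returned by call_api_with_payload (accepted jobs return nothing).
from concurrent.futures import as_completed

for future in as_completed(future_to_row):
    row_index = future_to_row[future]
    result = future.result()
    if result is not None:
        print(f"Row {row_index} returned an error payload: {result}")
</code></pre>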
<h3 id="heading-scenario-2-results">Scenario 2 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871932797/e8f02df2-9b82-40e3-a942-03f11d4073dd.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>There are currently a few challenges with this approach in Fabric. One is that there appears to be a limit on the API itself that allows only 10 concurrent connections. Anything beyond 10 connections is throttled, queued, and executed as previous connections close. I wasn't able to find this in the documentation, though.</p>
<p>Further investigation shows that the CU(s) required for the pipeline execution itself are comparable to those of the Scenario 1 ExecuteCopy activity. The decrease in efficiency is attributed to the notebook remaining active while the API calls are throttled.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712779091957/817a1eab-a7a4-498a-aa8b-99e8c225e028.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/fabric/articles/throttling">https://learn.microsoft.com/en-us/rest/api/fabric/articles/throttling</a></p>
<p><strong>Scenario 1:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712872506925/1c111929-e479-46dc-86de-688015e9db18.png" alt class="image--center mx-auto" /></p>
<p><strong>Scenario 2:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712872534588/14423ce9-75c6-4a80-9e39-b1701a72e134.png" alt class="image--center mx-auto" /></p>
<p>This is also visible in the monitoring hub, as only 10 pipeline executions trigger at once. As you can see, the Notebook_Orchestration activity remains in an "In Progress" status until all pipelines have been executed, thus increasing consumption.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712779355595/ce276aca-be11-43c1-b8e2-b78b0a1e7eea.png" alt class="image--center mx-auto" /></p>
<p>My initial thought was to bypass the Semantic Link API and try a potential workaround, but I soon remembered the next challenge with this approach: there's currently no support for service principal authentication, meaning another API strategy is a no-go.</p>
<p>Scenario 2 was <strong>~18% less efficient</strong>, and correspondingly more expensive, than Scenario 1.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873165474/0b318882-f800-4154-85c0-4398be413948.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<h3 id="heading-notebook-orchestration-and-copy">Notebook Orchestration and Copy</h3>
<p>The final scenario I tested was using a notebook to orchestrate and copy the data.</p>
<p>For years, we've relied on pipelines, and before pipelines, we used tools like SSIS to create orchestration packages. We've used this pattern for so long that it's become muscle memory. There's a reason for this, though. They're easy to use!</p>
<p>Setting up a pipeline is as simple as clicking through a GUI these days, and with the ability to integrate things like Dataflows Gen2, things will only get easier. However, ease comes with a significant cost.</p>
<p>Spark processing in tools like Fabric and Databricks opens the door to more programmatic ETL/ELT patterns like the one below.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_source_data</span>(<span class="hljs-params">row</span>):</span>
    <span class="hljs-string">"""
    Orchestration pattern consumption analysis
    Tests performed by Will Crayger of Lucid
    """</span>
    <span class="hljs-keyword">try</span>:

        <span class="hljs-comment"># Create dynamic SQL using the row values</span>
        dynamic_sql_query = <span class="hljs-string">f"SELECT * FROM [<span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>].[<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>]"</span>

        <span class="hljs-comment"># Read source data to DataFrame and write to Delta        </span>
        df_source = spark.read.format(<span class="hljs-string">"jdbc"</span>) \
                    .option(<span class="hljs-string">"url"</span>, source_connection) \
                    .option(<span class="hljs-string">"query"</span>, dynamic_sql_query) \
                    .load()

        <span class="hljs-comment"># Set staging table name using the row values</span>
        stage_file = <span class="hljs-string">f"Files/WideWorldImporters_Scenario3/<span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>_<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>"</span>

        <span class="hljs-comment"># Write to delta</span>
        df_source.write.format(<span class="hljs-string">"parquet"</span>) \
            .mode(<span class="hljs-string">"overwrite"</span>) \
            .save(stage_file)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error processing <span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>_<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>: <span class="hljs-subst">{e}</span>"</span>)

<span class="hljs-comment"># Retrieve processing list and convert to Pandas DataFrame</span>
process_list_sql = <span class="hljs-string">"SELECT * FROM [dbo].[Copy]"</span>
df_process_list = spark.read.format(<span class="hljs-string">"jdbc"</span>) \
                    .option(<span class="hljs-string">"url"</span>, control_connection) \
                    .option(<span class="hljs-string">"query"</span>, process_list_sql) \
                    .load()
df_process_list_pandas = df_process_list.toPandas()

<span class="hljs-comment"># Use ThreadPoolExecutor to execute the function in parallel for each row</span>
max_workers = min(len(df_process_list_pandas), (os.cpu_count() <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>) * <span class="hljs-number">5</span>)
<span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=max_workers) <span class="hljs-keyword">as</span> executor:
    <span class="hljs-comment"># Submit tasks to the executor for each row in the Pandas DataFrame</span>
    futures = [executor.submit(read_source_data, row) <span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df_process_list_pandas.iterrows()]
</code></pre>
<p>In this example, you'll notice two SQL executions: one to retrieve the list of objects to process and another to execute a SELECT against the source system over a JDBC connection. You'll also notice the use of ThreadPoolExecutor, which enables parallel execution.</p>
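<p>Worth noting, though it wasn't part of my tests: for a single large table, Spark's JDBC reader can also parallelize the read itself. A hedged sketch, with a hypothetical table and bounds:</p>
<pre><code class="lang-python"># Optional tweak (not part of the tested pattern): Spark's JDBC reader can
# split a single table read across partitions. The table name, partition
# column, and bounds are hypothetical and must suit the source table.
df_large = (
    spark.read.format("jdbc")
    .option("url", source_connection)
    .option("dbtable", "[Fact].[Sale]")    # hypothetical large table
    .option("partitionColumn", "SaleKey")  # must be a numeric or date column
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")          # 8 concurrent JDBC reads
    .load()
)
</code></pre>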
<h3 id="heading-scenario-3-results">Scenario 3 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712848110310/2ca7087c-a0ba-45a9-8b38-e5cffa2c0c0a.png" alt class="image--center mx-auto" /></p>
<p><strong>*Scenario 3 was not impacted and remains unchanged</strong></p>
<p>If you're at a loss for words, don't worry; I was right there with you when I saw the results. Absolutely shocking!</p>
<p>Now, this approach isn't without considerations. Traditionally, this scenario can be challenging for on-premises environments or sources behind a firewall. The Fabric team has rolled out several new features, such as VNet and gateway integration, to alleviate some of these concerns. Another consideration is that not all sources support direct reads using Spark.</p>
<p>Let's look at the comparisons by the numbers.</p>
<p>Scenario 2 vs. Scenario 3</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873501532/d5045c41-5227-49ed-b1fd-bb580a408628.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Scenario 1 vs. Scenario 3</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873542513/09366f52-ec36-45ce-b4b7-385d32ffcab5.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Comparisons show an improvement of <strong>~98%</strong> for cost and CU(s) consumption across the board.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>As the numbers show, traditional metadata patterns, while easy, can also be incredibly costly. Every data team should review and potentially redesign their frameworks to address these inefficiencies.</p>
<p>At Lucid, I've begun following a Spark-first approach and will only use Pipelines when required. I've also developed a framework to quickly deploy and integrate within my client environments, allowing them to focus on decision-making and giving them back their most valuable asset, time.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/">https://www.linkedin.com/in/willcrayger/</a></p>
]]></content:encoded></item></channel></rss>