<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[LucidBI]]></title><description><![CDATA[LucidBI]]></description><link>https://blog.lucidbi.co</link><generator>RSS for Node</generator><lastBuildDate>Mon, 27 Apr 2026 19:15:11 GMT</lastBuildDate><atom:link href="https://blog.lucidbi.co/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Cross Workspace Fabric Notebook Execution]]></title><description><![CDATA[Have you ever tried to run a Fabric notebook in a different workspace using notebookutils? I have, and it didn’t work out so well. Thankfully, persistence wins, and now I get to share the secret sauce with all you lovely people.
The Setup
You’re goin...]]></description><link>https://blog.lucidbi.co/cross-workspace-fabric-notebook-execution</link><guid isPermaLink="true">https://blog.lucidbi.co/cross-workspace-fabric-notebook-execution</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[microsoft fabric]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Thu, 06 Nov 2025 16:32:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1762446371993/ac5dc7d6-0e0f-48f2-9198-2d0faebf3c20.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever tried to run a Fabric notebook in a different workspace using notebookutils? I have, and it didn’t work out so well. Thankfully, persistence wins, and now I get to share the secret sauce with all you lovely people.</p>
<h2 id="heading-the-setup">The Setup</h2>
<p>You’re going to need two workspaces, one lakehouse, and two notebooks. If you want to see another little scoop of secret sauce, create two lakehouses.</p>
<p>Workspace 1:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762444674995/c0c2e26d-0b96-429e-8b9f-329d80a604ff.png" alt class="image--center mx-auto" /></p>
<p>Workspace 2:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762444738435/dcdc5d00-b3b0-484e-ae48-bb353d7b3751.png" alt class="image--center mx-auto" /></p>
<p>In this setup, <strong>I Speak Whale</strong> is the target workspace and <strong>lucid_is_awesome</strong> is the target notebook. <strong>just_keep_swimming</strong> is the source notebook that will be attempting to negotiate with the target.</p>
<p><strong>just_keep_swimming</strong>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445053328/97d11b90-0a59-4548-b692-49e89fa136be.png" alt class="image--center mx-auto" /></p>
<p><strong>lucid_is_awesome</strong>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445086776/cfa66f22-3185-4ca9-96ad-b99e06178961.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-pressing-the-button">Pressing the Button:</h2>
<p>When you hit the go button, the current configuration tries to execute the target notebook in the same context as the calling notebook. What does that mean, exactly? Well, the calling notebook is attached to a default lakehouse, and that lakehouse belongs to a specific workspace. So, the calling notebook looks for the receiving notebook in the same workspace as the calling notebook’s default lakehouse. This is problematic given the two notebooks are assigned to different default lakehouses in different workspaces.</p>
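<p>For reference, the calling cell at this point boils down to a plain reference run, roughly like the sketch below (the <strong>just_keep_swimming</strong> screenshot above shows the real cell):</p>
<pre><code class="lang-python"># Same-context reference run: Fabric resolves the target notebook relative to the
# calling notebook's context, which is why this fails across workspaces
notebookutils.notebook.run("lucid_is_awesome")
</code></pre>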
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445329763/8a1077fd-168c-43df-82a5-7b8f1b0cfb70.png" alt class="image--center mx-auto" /></p>
<p>It’s basically telling you, “This isn’t where I parked my car!”</p>
<p>So, we have to tell it where the valid parking spot is, and to do this we need to grab the workspace GUID of the target workspace.</p>
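<p>You can lift the GUID straight out of the target workspace URL, or resolve it by name. A quick sketch of the latter, assuming the semantic-link (sempy) package is available in your session:</p>
<pre><code class="lang-python">import sempy.fabric as fabric

# Resolve the target workspace GUID from its display name
target_workspace_id = fabric.resolve_workspace_id("I Speak Whale")
print(target_workspace_id)
</code></pre>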
<h2 id="heading-round-2-fight">Round 2 - FIGHT:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445504528/3640ad7e-bc0a-40f5-a11c-d188cca6df95.png" alt class="image--center mx-auto" /></p>
<p>Well, that didn’t work.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445585093/e9dcbbe2-d6c7-4983-b77f-e8c4bf79ccc8.png" alt class="image--center mx-auto" /></p>
<p>But wait, THERE’S MORE! If you take another look at the error, it’s telling you why it tanked AND how to fix it. Because the notebooks are hooked into different default lakehouses, we need to add one more little nugget of goodness.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762445852125/471026b7-1830-4713-bf07-debba6cd4e40.png" alt class="image--center mx-auto" /></p>
<p>I remember trying to do this a while back and banging my head against the wall without success. And here we are, running cross-workspace jobs like nobody’s business.</p>
<p>Anyway, a quick one today, but it’s something I ended up circling back to and thought would be helpful to share.</p>
<p>Cheers, and remember to stay Lucid.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762446605874/bd86019a-de78-4df7-bed3-79cef1e8b198.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Variable Libraries, The Fabric Unsung Hero]]></title><description><![CDATA[In case you missed it, the Variable Library team over on the Fabric PG quietly snuck out some pretty big updates relatively recently that are actually quite massive. Before we dive into, it’s important to understand why these updates are so huge.
Dep...]]></description><link>https://blog.lucidbi.co/variable-libraries-the-fabric-unsung-hero</link><guid isPermaLink="true">https://blog.lucidbi.co/variable-libraries-the-fabric-unsung-hero</guid><category><![CDATA[data analysis]]></category><category><![CDATA[data analytics]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[microsoftfabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Wed, 27 Aug 2025 16:16:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756311003343/b3d01118-160a-442d-8062-3de47d38a07c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In case you missed it, the Variable Library team over on the Fabric PG quietly snuck out some updates relatively recently that are actually quite massive. Before we dive in, it’s important to understand why these updates are so huge.</p>
<p>Deployments and CICD have been a real thorn in the side for most teams since launch, specifically with regard to data pipelines. If you’re not familiar, when you build a data pipeline and add an activity, such as a copy or a lookup activity, you have to configure a connection. This is done using a couple of dropdown selections, which is trivial.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756304796697/06a28650-30cb-44c7-9512-72a884efb8ce.png" alt class="image--center mx-auto" /></p>
<p>After getting the connections all sorted, you kickoff your job and everything works wonderfully. All good, right? Well, no, not really.</p>
<p>When you make the dropdown selection in the UI, the JSON of the pipeline is being updated with a hardcoded GUID of the selection made.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756304961938/3cea76a3-8a27-4e50-b7df-5659fb4d63ba.png" alt class="image--center mx-auto" /></p>
<p>Eh, who cares? It’s working now, we’re good to go. Right?</p>
<p><em>Insert our good friend, CICD</em></p>
<p>Assuming you’re following good development practices, you’re likely working in a feature branch or a development environment. So, what happens when you need to check that code in and promote your new pipeline to production? Remember, we now have a hardcoded GUID in the JSON of our pipeline.</p>
<p>Unlike some of the other Fabric items that will remain connected using named references (think notebook and lakehouse references), data pipelines do <strong>not</strong> maintain a link based on named reference. Said differently, when you deploy your code to a new workspace, the GUID reference in your JSON <strong>will not change</strong>. If you deploy a pipeline to a new workspace and immediately trigger it, it will reference the GUIDs from the originating workspace.</p>
<p>Now that we understand the issue, let’s talk about a solution.</p>
<h2 id="heading-variable-library-early-days">Variable Library, Early Days</h2>
<p>Variable Libraries have been in preview for quite a while, so why did it take so long to give them the respect they deserve?! When VLs first launched, they could only be used with data pipelines, and <strong>they did not support usability with activity connections</strong>. Yeah, you read that correctly. We could use them to swap other values in our pipelines, but we couldn’t use them to do the highest priority (personal opinion) function needed: let me hot swap an activity connection.</p>
<p>For that reason, we had solutions popping up like the fabric-cicd repo by one of the Microsoft internal engineering teams. fabric-cicd is a hugely helpful CICD kickstarter that solved the issue of changing connections by essentially doing a find/replace during deployment to replace GUIDs and other references.</p>
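<p>If you haven’t seen it, a deployment run with fabric-cicd is only a handful of lines. Here’s a rough sketch from memory of the project’s getting-started example, so double-check parameter names against the repo:</p>
<pre><code class="lang-python">from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

# Point the deployment at the target workspace and the item definitions in your repo
target_workspace = FabricWorkspace(
    workspace_id="your-target-workspace-guid",
    repository_directory="path/to/your/workspace/items",
    item_type_in_scope=["Notebook", "DataPipeline", "Environment"],
)

# Publish everything in scope, then clean up items no longer present in the repo
publish_all_items(target_workspace)
unpublish_all_orphan_items(target_workspace)
</code></pre>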
<p>Having seen the VLs, I suspected it was only a matter of time before we had full support, so I extended the fabric-cicd repo to dynamically generate a parameter.yml file using a VL. This was obviously not the intended use case, but it allowed me to leverage the VL item in preparation (and hope) that we’d get connection integration at some point.</p>
<p>If you’re interested, I do have an open PR on the fabric-cicd repo (I owe some updates to the docs, sorry for being slow) where you can view the code mentioned above. That said, I need to revisit it to determine its viability going forward, but I suspect it still has a place in the toolbelt.</p>
<p><a target="_blank" href="https://github.com/microsoft/fabric-cicd/pull/264">Added support for dynamic creation of parameter.yml file using integration with VariableLibrary artifact by Lucid-Will · Pull Request #264 · microsoft/fabric-cicd</a></p>
<h2 id="heading-variable-library-all-grown-up">Variable Library, All Grown Up</h2>
<p>Fast forward a few months, and the current state of VL finally warranted some updates to the Lucid Data Platform codebase. I’m going to make a bit of an assumption that people have at least experimented with how to use a VL, so I won’t be going super deep on a “how-to”. If there’s interest in doing so, I’d be happy to put together another tutorial, but I’ll keep this one relatively high level.</p>
<p>Let’s take a quick look under the hood at how these little fellas can make your life a whole lot easier.</p>
<h3 id="heading-setup">Setup:</h3>
<p>Firstly, you’ll want to create a VL item in your workspace. If you do a bit of proper planning, your future self will thank you. E.g., instead of mashing everything into a single library, I <em>try</em> to be a bit more organized, so I spun up multiple libraries for logical grouping.</p>
<p>Here’s a glimpse into how a slightly OCD, very ADHD person who probably got carried away with excitement structured their libraries:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308351915/f016c3c1-27ba-4d33-9c13-c2212429e559.png" alt class="image--center mx-auto" /></p>
<p>And if you crack the door a bit more:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308400840/278b31d0-bb34-4b54-b06e-1c42f6594cee.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756308425094/17a10027-3c9e-4824-8b0b-113b41f17e5e.png" alt class="image--center mx-auto" /></p>
<p>You can probably pick up on the logical grouping for the rest of them, so I’ll spare you the screenshots. But keeping things isolated this way tends to reduce my time spent scrolling endlessly looking for that one thing, so it’s helped quite a bit.</p>
<h3 id="heading-implementing-in-data-pipelines">Implementing in Data Pipelines</h3>
<p>Circling back to the previous pipeline example, it’s time to get rid of all those hardcoded references. If you crack open a data pipeline, you’ll find the Variable Library tab at the bottom of your canvas. Here’s where you’ll initialize the variables from your VLs for usage in the pipeline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309267777/e4f7942c-4d73-4a74-8435-6c650adfab3d.png" alt class="image--center mx-auto" /></p>
<p>The next step is to replace all the hardcoded references in your pipeline activities with variable library references for the juicy flexibility.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309341286/046deb85-db70-4f56-9d0f-ec8ba95d81cd.png" alt class="image--center mx-auto" /></p>
<p>You drop them in the same way you would a parameter or a variable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309378672/01b1a283-2181-42e8-8f58-6ccdc4039a3e.png" alt class="image--center mx-auto" /></p>
<p>Now, if you check the pipeline JSON, the hardcoded references are gone and you’re ready to jam.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756309495628/6f25d980-fb2a-4de6-af71-44cd7758b4ff.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-this-is-really-really-cool">This Is Really, Really Cool</h2>
<p>So, yeah, this is actually a super duper cool and very welcome enhancement to the Fabric stack. It gets us closer to closing one of the biggest gaps (personal opinion) Fabric pipelines have had compared to legacy ADF/Synapse, namely the lack of Linked Service items. It’s also a huge W with respect to CICD and how we manage connections and other references during deployment.</p>
<p>On a semi-related note, another <em>feature</em> that was quietly rolled out recently - Schedules are now included as part of the metadata being deployed. If you deploy a data pipeline with an active schedule, the schedule will go with it as part of deployment (and will deploy in the same state). So, after spending ~30 minutes trying to figure out why jobs were duplicating and failing over top of each other, I realized the issue was related to doing a test deployment to a feature branch. The variable libraries were deployed in two workspaces, and the schedules deployed as active, so be careful with that, I guess?</p>
<p>Anyway, happy coding, Fabricators!</p>
<p>Oh, if you’re interested in a rock-star pickup to augment your dev team, check us out over here at Lucid BI and feel free to connect on LinkedIn for more cool Fabric stuff.</p>
]]></content:encoded></item><item><title><![CDATA[Data Profiling with Spark and YData]]></title><description><![CDATA[Data analytics often begins with profiling your data. Data profiling is simply the act of examining the raw source data to understand things like the structure and quality of the data.
As an engineer, understanding things such as the distribution of ...]]></description><link>https://blog.lucidbi.co/data-profiling-with-spark-and-ydata</link><guid isPermaLink="true">https://blog.lucidbi.co/data-profiling-with-spark-and-ydata</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[spark]]></category><category><![CDATA[data]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Tue, 26 Nov 2024 15:40:27 GMT</pubDate><content:encoded><![CDATA[<p>Data analytics often begins with profiling your data. Data profiling is simply the act of examining the raw source data to understand things like the structure and quality of the data.</p>
<p>As an engineer, understanding things such as the distribution of values, min/max, and unique occurrences of values in the data you’re working with is very powerful. It helps you to better understand how to work with your data when considering joining tables, configuring incremental extract/load processes, and identifying the natural key of the table for dimensional modeling.</p>
<p>Data profiling can often be a long, tedious process. Thankfully, there are tools available to help expedite the process.</p>
<p>The YData library is one such tool and should be in everyone’s toolbelt.</p>
<p><a target="_blank" href="https://docs.profiling.ydata.ai/latest/">Welcome - YData Profiling</a></p>
<p>The library itself is very robust; however, it can also be simplified to get a quick profile of your data. In addition, you have several options for what to do with the output once it’s generated, such as rendering the HTML directly in your notebook or writing the output as JSON to a storage location.</p>
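<p>To give a sense of how light that can be, here’s a minimal sketch of those output options against a pandas DataFrame (the full Spark-friendly version is further down):</p>
<pre><code class="lang-python">from ydata_profiling import ProfileReport

# pdf is a pandas DataFrame, e.g. a sampled Spark table converted via toPandas()
profile = ProfileReport(pdf, title="Customers profile", minimal=True)

profile.to_notebook_iframe()       # render the report inline in the notebook
html_report = profile.to_html()    # or keep it as an HTML string
json_report = profile.to_json()    # or serialize the metrics as JSON for storage
</code></pre>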
<p>With only a few lines of code you gain significant visibility into your data. For example, if we wanted to understand the makeup of our customers table:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634326986/e6432fad-9655-4f96-85c7-46da92fcfbd1.png" alt class="image--center mx-auto" /></p>
<p>Variables represent the columns of the table and can be explored in more detail. We can start to probe into the columns (variables) to find things like the uniqueness of that column:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634540559/a5b7a444-11d8-4181-8b59-d6ab9433c836.png" alt class="image--center mx-auto" /></p>
<p>From there we can probe even further to view the statistical breakdown of the column:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634602851/16adcb9d-176d-43a9-8d63-7c4743f894d9.png" alt class="image--center mx-auto" /></p>
<p>Or maybe we wanted to see the distribution of values:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732634642847/895a8ace-4636-42fd-be3d-f5039592022a.png" alt class="image--center mx-auto" /></p>
<p>Beyond these simple examples, there are advanced settings that let you customize your exploration through configuration files, with sample configurations available in the public GitHub repo:</p>
<p><a target="_blank" href="https://github.com/ydataai/ydata-profiling">ydataai/ydata-profiling: 1 Line of code data quality profiling &amp; exploratory data analysis for Pandas and Spark DataFrames.</a></p>
<p>Below is a snippet to get you started.</p>
<pre><code class="lang-python">%pip install ydata-profiling --q

<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> col, when, lit
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timezone
<span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> DecimalType, DateType, TimestampType, IntegerType, DoubleType, StringType
<span class="hljs-keyword">from</span> ydata_profiling <span class="hljs-keyword">import</span> ProfileReport

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">profile_spark_dataframe</span>(<span class="hljs-params">
    df,
    table_name
</span>):</span>
    <span class="hljs-string">"""
    Profiles a Spark DataFrame by handling null values, transforming the DataFrame, and generating a profiling report.

    This function first processes the DataFrame by setting default values for null entries based on data type:
    - Decimal fields are set to 0.0 if null.
    - Date and Timestamp fields are set to January 1, 1900, if null.

    The transformed Spark DataFrame is then converted to a Pandas DataFrame and passed to `ydata_profiling.ProfileReport`
    to create a profiling report, which is returned as a `ProfileReport` object.

    :param df: The Spark DataFrame to be profiled.
    :param table_name: Name of the table being profiled; used in the report title.

    :return: The report as a `ProfileReport` object.
    """</span>

    <span class="hljs-comment"># Handle nulls by setting defaults before profiling</span>
    <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> df.schema.fields:
        <span class="hljs-keyword">if</span> isinstance(field.dataType, DecimalType):
            df = df.withColumn(field.name, when(col(field.name).isNull(), lit(<span class="hljs-number">0.0</span>)).otherwise(col(field.name).cast(DoubleType())))
        <span class="hljs-keyword">elif</span> isinstance(field.dataType, DateType) <span class="hljs-keyword">or</span> isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, when(col(field.name).isNull(), lit(datetime(<span class="hljs-number">1900</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>))).otherwise(col(field.name)))

    <span class="hljs-comment"># Convert to Pandas dataframe</span>
    df = df.toPandas()

    <span class="hljs-comment"># Generate report</span>
    report = ProfileReport(
        df,
        title=<span class="hljs-string">f'Profiling for <span class="hljs-subst">{table_name}</span>'</span>,
        infer_dtypes=<span class="hljs-literal">False</span>,
        correlations=<span class="hljs-literal">None</span>,
        minimal=<span class="hljs-literal">True</span>
    )

    <span class="hljs-keyword">return</span> report

<span class="hljs-comment"># Set variables</span>
schema_name = <span class="hljs-string">'stage'</span>
table_name = <span class="hljs-string">'customers'</span>

<span class="hljs-comment"># Build sample DataFrame</span>
df = spark.table(<span class="hljs-string">f'<span class="hljs-subst">{schema_name}</span>.<span class="hljs-subst">{table_name}</span>'</span>).limit(<span class="hljs-number">10000</span>)

<span class="hljs-comment"># Generate report</span>
report = profile_spark_dataframe(df, table_name)

<span class="hljs-comment"># Convert report to HTML</span>
report = report.to_html()

<span class="hljs-comment"># View report</span>
displayHTML(report)

<span class="hljs-comment"># # Create a timestamped file name</span>
utc_timestamp = datetime.now(timezone.utc).strftime(<span class="hljs-string">"%Y%m%d_%H%M%S"</span>)
file_name = <span class="hljs-string">f'<span class="hljs-subst">{table_name}</span>_profile_<span class="hljs-subst">{utc_timestamp}</span>.html'</span>

<span class="hljs-comment"># # Set output path to save html file</span>
output_path = <span class="hljs-string">f"abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxx/Files/data_profile/<span class="hljs-subst">{file_name}</span>"</span>

<span class="hljs-comment"># # Write file to lakehouse</span>
mssparkutils.fs.put(
    output_path,
    report
)
</code></pre>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>https://www.linkedin.com/in/willcrayger/</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Using Custom Python Libraries Without Fabric Environment]]></title><description><![CDATA[Fabric Environment artifacts, where to begin…
If you’re not familiar, the environment artifact, in part, is intended to mimic the functionality of the Synapse Workspace by allowing you to do things like install packages for easy reusability across yo...]]></description><link>https://blog.lucidbi.co/using-custom-python-libraries-without-fabric-environment</link><guid isPermaLink="true">https://blog.lucidbi.co/using-custom-python-libraries-without-fabric-environment</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[spark]]></category><category><![CDATA[data-engineering]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Mon, 21 Oct 2024 15:42:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729525148546/f064998e-1531-438a-b82d-1ca5d66751dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fabric Environment artifacts, where to begin…</p>
<p>If you’re not familiar, the environment artifact, in part, is intended to mimic the functionality of the Synapse Workspace by allowing you to do things like install packages for easy reusability across your Spark workloads.</p>
<p>Conceptually, I love the idea of having the ability to pre-define my Spark pool configurations and easily install my custom Python libraries. However, the current state of the environment item leaves a lot to be desired.</p>
<p>During active development cycles, using an environment has actually increased my development time significantly, simply due to how long it takes to publish changes. The UI often shows inaccurate information about the state of your installed libraries, listing libraries that no longer exist or prompting me to publish changes that would delete my library without me asking it to.</p>
<p>I would often open my environment to see the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521558837/81150d2e-dd7d-40c7-9e8e-9068fe1d7131.png" alt class="image--center mx-auto" /></p>
<p>It appears the environment doesn’t recognize the state of the packages installed and often wants to remove them.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729521542548/3047a338-ceda-4a7e-abf2-d9371e704168.png" alt class="image--center mx-auto" /></p>
<p>Taking the various bugs into consideration, along with the immediate loss of the starter pools, it’s really challenging to see the benefits. As such, I’ve stopped using them altogether in favor of a more efficient approach that allows me to continue using starter pools while still leveraging my reusable libraries.</p>
<p>Before we start, if you’re not familiar with creating and managing .whl files I recommend checking out my buddy Sandeep’s blog:</p>
<p><a target="_blank" href="https://fabric.guru/installing-custom-python-packages-in-fabric">Installing and Managing Python Packages in Microsoft Fabric</a></p>
<p>When working with custom packages, .whl files are how you package up and deploy your code. The .whl files are what would be installed to your Fabric environment or Synapse Workspace as a prerequisite for installing them in your notebook.</p>
<p>However, we can also use inline installation by running a command such as:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Install lucid python utility library</span>
!pip install /lakehouse/default/Files/python_utility/lucidctrlutils<span class="hljs-number">-0.0</span><span class="hljs-number">.1</span>-py3-none-any.whl
</code></pre>
<p>But, what if you want to install from a location other than the default lakehouse for the notebook, or a notebook without an attached lakehouse at all?</p>
<p>For example, when creating a new feature branch, notebooks retain attachment defaults from the state they were branched from. Said differently, your notebook will remain attached to the default lakehouse in the workspace your feature branch originated from (this warrants an entire blog post on its own). In this scenario, if you’re wanting to test modifications to your library you have to shuffle through the notebook and lakehouse settings, which gets quite repetitive.</p>
<p>Instead of doing the notebook dance, you can do something like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Define the ADLS path and local file path</span>
    install_path = <span class="hljs-string">"abfss://xxxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Files/ctrlPythonLibrary/lucidctrlutils-0.0.1-py3-none-any.whl"</span>
    local_filename = <span class="hljs-string">"/tmp/lucidctrlutils-0.0.1-py3-none-any.whl"</span>

    <span class="hljs-comment"># Use mssparkutils to copy the file from storage account to the local filesystem</span>
    mssparkutils.fs.cp(install_path, <span class="hljs-string">f"file:<span class="hljs-subst">{local_filename}</span>"</span>, <span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Install the .whl file locally using pip</span>
    !pip install {local_filename}

    <span class="hljs-comment"># Import modules for data processing</span>
    <span class="hljs-keyword">from</span> lucid_ctrl_utils <span class="hljs-keyword">import</span> *

    print(<span class="hljs-string">"Successfully installed the utility library."</span>)

<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">f"An error occurred while installing the utility library: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>The above snippet reads the .whl file from your specified path, copies it to a temp directory, and allows you to execute !pip install. However, this code still requires you to manually update the install_path. Surely, we can do better.</p>
<p>In my scenario, I want to use the lakehouse name as my storage identifier instead of the GUID so we’ll have to write a bit more code.</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Install semantic link</span>
    !pip install semantic-link --q
    <span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric

    <span class="hljs-comment"># Set lakehouse name</span>
    lakehouse_name = <span class="hljs-string">'stage'</span>

    <span class="hljs-comment"># Get workspace id and list of items in one step</span>
    notebook_workspace_id = fabric.get_notebook_workspace_id()
    df_items = fabric.list_items(workspace=notebook_workspace_id)

    <span class="hljs-comment"># Filter the dataframe by 'Display Name' and 'Type'</span>
    df_filtered = df_items.query(<span class="hljs-string">f"`Display Name` == '<span class="hljs-subst">{lakehouse_name}</span>' and `Type` == 'Lakehouse'"</span>)

    <span class="hljs-comment"># Ensure there's at least one matching row</span>
    <span class="hljs-keyword">if</span> df_filtered.empty:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"No matching rows found for Display Name = 'stage' and Type = 'Lakehouse'"</span>)

    <span class="hljs-comment"># Get the 'Id' from the filtered row</span>
    stage_lakehouse_id = df_filtered[<span class="hljs-string">'Id'</span>].iloc[<span class="hljs-number">0</span>]

<span class="hljs-keyword">except</span> ValueError <span class="hljs-keyword">as</span> ve:
    <span class="hljs-comment"># Handle specific errors for value-related issues</span>
    <span class="hljs-keyword">raise</span> ve
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    <span class="hljs-comment"># Catch and raise any unexpected exceptions</span>
    <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"An error occurred: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>The above code uses semantic-link to dynamically access the GUIDs for the workspace and lakehouse for my feature branch, removing the need to hardcode these values in install_path. Now, I can construct the path without any manual input.</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># Define the ADLS path and local file path</span>
    install_path = <span class="hljs-string">f"abfss://<span class="hljs-subst">{notebook_workspace_id}</span>@onelake.dfs.fabric.microsoft.com/<span class="hljs-subst">{stage_lakehouse_id}</span>/Files/ctrlPythonLibrary/lucidctrlutils-0.0.1-py3-none-any.whl"</span>
    local_filename = <span class="hljs-string">"/tmp/lucidctrlutils-0.0.1-py3-none-any.whl"</span>

    <span class="hljs-comment"># Use mssparkutils to copy the file from storage account to the local filesystem</span>
    mssparkutils.fs.cp(install_path, <span class="hljs-string">f"file:<span class="hljs-subst">{local_filename}</span>"</span>, <span class="hljs-literal">True</span>)

    <span class="hljs-comment"># Install the .whl file locally using pip</span>
    !pip install {local_filename}  --no-cache-dir --q &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>

    <span class="hljs-comment"># Import modules for data processing</span>
    <span class="hljs-keyword">from</span> lucid_ctrl_utils <span class="hljs-keyword">import</span> *

    print(<span class="hljs-string">"Successfully installed the utility library."</span>)

<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">f"An error occurred while installing the utility library: <span class="hljs-subst">{str(e)}</span>"</span>)
</code></pre>
<p>Great, we now have the library installed. But, this is a lot of code to move around to all my notebooks. The overhead introduced is going to be a PITA, right? Well, we can optimize one step further by keeping this in an isolated “utility” notebook and activating it using the %run command in subsequent notebooks.</p>
<pre><code class="lang-python">%run nb_lucid_ctrl_utils
</code></pre>
<p>So, why go through all this effort? Why not use the environment item? Well, there are a few reasons beyond avoiding the bugs mentioned earlier in the article.</p>
<p>With this approach I get to abuse the quick spin-up time of the starter pool rather than waiting 90+ seconds for a custom pool to come online.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729524576506/d345d90a-60c0-442a-8760-7ae1f9ea75d1.png" alt class="image--center mx-auto" /></p>
<p>If I gained nothing else, this is a huge W in my opinion. However, this also solves a REALLY frustrating issue with branching, which I’ll address in another blog coming soon.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>linkedin.com/in/willcrayger</strong></a></p>
<p><a target="_blank" href="https://calendly.com/wcrayger">Calendly - Will Crayger</a></p>
]]></content:encoded></item><item><title><![CDATA[There's No Shortcut to Proper Planning]]></title><description><![CDATA[About a month ago, while working on the first iteration of the Fabric capacity monitoring report, I stumbled across a bit of an invisible fence. I posted a quick teaser with a screenshot on LinkedIn to see if anyone else could spot it:
https://www.li...]]></description><link>https://blog.lucidbi.co/theres-no-shortcut-to-proper-planning</link><guid isPermaLink="true">https://blog.lucidbi.co/theres-no-shortcut-to-proper-planning</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[Power BI]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Sun, 09 Jun 2024 17:07:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717952806289/b54b1bf5-0c32-4553-91ec-1c8aa3c611f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About a month ago, while working on the first iteration of the Fabric capacity monitoring report, I stumbled across a bit of an invisible fence. I posted a quick teaser with a screenshot on LinkedIn to see if anyone else could spot it:</p>
<p><a target="_blank" href="https://www.linkedin.com/posts/willcrayger_microsoftfabric-activity-7186074625783574529-cdu3/?utm_source=share&amp;utm_medium=member_desktop">https://www.linkedin.com/posts/willcrayger_microsoftfabric-activity-7186074625783574529-cdu3/?utm_source=share&amp;utm_medium=member_desktop</a></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717947173075/45debc64-a29d-47c4-90b3-e44281ad63a6.png" alt class="image--center mx-auto" /></p>
<p>It's very subtle so you have to look closely. In case you haven't spotted it yet, the location (region) of your artifacts is directly tied to the location in which your capacity exists.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717947296282/34bf2175-c275-4a24-9ab6-6bbdad40f9c0.png" alt class="image--center mx-auto" /></p>
<p>As I said, it's quite subtle, but this is actually a very big deal for several reasons.</p>
<h3 id="heading-region-lock">Region Lock</h3>
<p>When you create a Fabric workspace you must assign it to a capacity to use the Fabric features. Before assigning to a capacity, you must create a capacity in the Azure portal. When you create the capacity you pick a region for that capacity to exist in.</p>
<p>Let's assume you currently have a Trial Fabric capacity that exists in East US 2. We'll create a new workspace and assign it to the trial capacity. We'll then create a lakehouse in the new workspace.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948252191/3cb8a742-938a-4fc1-9e78-79f92d3d1584.png" alt class="image--center mx-auto" /></p>
<p>After creating the workspace and lakehouse, you decide to reassign the workspace to a different capacity in another region, no big deal, right?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948352378/35601e15-0d72-4eee-88f2-8bd101037212.png" alt class="image--center mx-auto" /></p>
<p>Unfortunately, by adding a lakehouse to the workspace we have region-locked ourselves. To move the workspace we would need to remove the storage account (lakehouse).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717948464143/79b9de0e-16a4-46b4-b399-56168418a4cc.png" alt class="image--center mx-auto" /></p>
<p>The limitation here makes sense when you consider storage costs are different depending on region. Now, I know what everyone is going to say, "Who cares, just create a shortcut and be done with it". Well, welcome to the conversation.</p>
<h3 id="heading-the-hidden-cost-of-shortcuts">The Hidden Cost of Shortcuts</h3>
<p>The reason I included "hidden" in the title of this section is due to the lack of transparency in the documentation. When it comes to pricing, we generally know how much storage and compute will cost, but the documentation around network and data transfer costs is almost non-existent. There's a single line buried in the pricing whitepaper:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717952499038/351148a8-2244-4281-8db5-b98c0399e91f.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://azure.microsoft.com/en-us/pricing/details/microsoft-fabric/">https://azure.microsoft.com/en-us/pricing/details/microsoft-fabric/</a></p>
<p>I've been hesitant to write this article because I haven't been able to get a definitive answer, until now.</p>
<p>Let's review the following scenario:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717950508276/2963ae95-8051-405c-aa75-263f97d34a2e.png" alt class="image--center mx-auto" /></p>
<p>In the above diagram, we have a lakehouse in a workspace assigned to a capacity in East US 2. Another team has requested access to our data and created a shortcut from their workspace assigned to a capacity in West US per the recommended approach.</p>
<p>Even when leveraging shortcuts, this scenario still produces a read operation that spans across data centers and therefore will (eventually) incur data transfer fees. I say eventually because we're not currently seeing these charges, but your team should be aware that they are coming.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>When planning your Fabric implementation, capacity planning is going to play a big part, but not just from a cost and sizing perspective. Planning the location of the capacities is equally important if you want to avoid things like region lock and unexpected line items on your monthly bill.</p>
<p>If you haven't already, I encourage you to check out my last article on deployment as many of the broader architectural considerations will also be applicable here.</p>
<p><a target="_blank" href="https://lucidbi.co/fabric-architecture-considerations">https://lucidbi.co/fabric-architecture-considerations</a></p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>linkedin.com/in/willcrayger</strong></a></p>
<p><a target="_blank" href="https://calendly.com/wcrayger/lucid-intro-call">https://calendly.com/wcrayger/lucid-intro-call</a></p>
]]></content:encoded></item><item><title><![CDATA[Microsoft Fabric Enterprise Deployment Considerations]]></title><description><![CDATA[When it comes to architecture strategies there's no one-size-fits-all solution. Every team will have different use cases that will drive requirements such as security, environment isolation, and tools needed.
While working with one of my clients we s...]]></description><link>https://blog.lucidbi.co/fabric-architecture-considerations</link><guid isPermaLink="true">https://blog.lucidbi.co/fabric-architecture-considerations</guid><category><![CDATA[microsoftfabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Mon, 27 May 2024 17:56:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1716837932498/6c049ae6-89e0-4f98-b7fa-74713f9dc31d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When it comes to architecture strategies there's no one-size-fits-all solution. Every team will have different use cases that will drive requirements such as security, environment isolation, and tools needed.</p>
<p>While working with one of my clients we started to dive into what a future architecture in Microsoft Fabric would look like for them. As we discussed their requirements I realized that a few potential issues were waiting for us in the fog.</p>
<p>Before we dive in, I'd like to recognize the awesome work of my friend, the data goblin himself, Kurt Buhler. Kurt helped create the Power BI usage scenario diagrams that can be found here:</p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-planning-usage-scenario-diagrams#enterprise-bi">https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-planning-usage-scenario-diagrams#enterprise-bi</a></p>
<p>A ton of work went into the design of the usage diagrams and I love the stylistic approach the team took. As such, I tried to keep the general format the same while extending some of the patterns to broader architecture.</p>
<p>If you haven't already I highly recommend checking out Kurt's blog:</p>
<p><a target="_blank" href="https://data-goblins.com/">https://data-goblins.com/</a></p>
<h3 id="heading-understanding-medallion-patterns">Understanding Medallion Patterns</h3>
<p>Everyone should have heard of medallion architecture by now. For those that haven't, medallion patterns have been around for years and have had many names (raw, validated, enriched, semi-curated, curated, etc.). While the naming pattern used is less important, the idea behind using zones typically stems from the need to have separation of data by state of readiness and other security or governance requirements. This is of course an oversimplification for conceptual purposes.</p>
<p>Below is an example medallion pattern in Fabric.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20Medallion%20-%20Fabric.svg" alt="Simple Medallion - Fabric" /></p>
<ol>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or on-premise data gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in bronze lakehouse as close to raw form as possible.</p>
</li>
<li><p>Create a shortcut from bronze to silver lakehouse to begin the data enrichment and cleansing process.</p>
</li>
<li><p>Create a shortcut from silver to gold to persist in a dimensional model for enterprise analytics.</p>
</li>
</ol>
<p>We could follow a very similar pattern implemented with Azure data services.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20Medallion%20-%20Azure%20Data%20Services.svg" alt="Simple Medallion - Azure Data Services" /></p>
<ol>
<li><p>Read/copy data from the source system using Spark via notebook or pipeline.</p>
</li>
<li><p>Use managed private endpoints or integration runtime gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in a data lake gen2 bronze container as close to raw form as possible.</p>
</li>
<li><p>Write from bronze container to silver delta lake container to begin data enrichment and cleansing process.</p>
</li>
<li><p>Write from silver delta lake container to gold delta lake container to persist dimensional model for enterprise analytics.</p>
</li>
</ol>
<h3 id="heading-combining-lifecycle-management-and-medallion-patterns">Combining Lifecycle Management and Medallion Patterns</h3>
<p>In addition to medallion patterns, we must also be aware of application lifecycle management (ALM) practices (dev., test, prod.).</p>
<p>An example of a simple ALM strategy for Azure data services is to use prod./non-prod. subscriptions with dev./test/prod. resource groups.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20ALM%20-%20Azure%20Data%20Services.svg" alt="Simple ALM - Azure Data Services" /></p>
<p>Additional steps for ALM include:</p>
<ol start="6">
<li><p>Integrate infrastructure as code (IaC) with source control repo.</p>
</li>
<li><p>Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.</p>
</li>
<li><p>Deploy to test using the release pipeline.</p>
</li>
<li><p>Deploy to prod. using the release pipeline.</p>
</li>
</ol>
<h3 id="heading-fabric-application-lifecycle-management-alm">Fabric Application Lifecycle Management (ALM)</h3>
<p>In Fabric, we have deployment pipelines to enable code promotion to higher environments. The overall flow is quite similar with a few differences in how things are released.</p>
<p><img src="https://raw.githubusercontent.com/Lucid-Will/Fabric-Architecture-Diagrams/main/Sample%20Architecture%20Diagrams/Simple%20ALM%20-%20Fabric.svg" alt="Simple ALM - Fabric" /></p>
<p>Additional steps for Fabric ALM:</p>
<ol start="6">
<li><p>Sync dev. workspace with git.</p>
</li>
<li><p>Build validation, continuous integration (CI), and continuous deployment (CD) pipelines for code deployment.</p>
</li>
<li><p>The release pipeline triggers the Fabric deployment pipeline to deploy to test.</p>
</li>
<li><p>Release pipeline triggers Fabric deployment pipeline to deploy to prod.</p>
</li>
</ol>
<p>The wonderful thing about Fabric deployment pipelines is the ability to sync a workspace directly to your git repo and use built-in pipelines to move artifacts, giving us integrated low-code IaC and CI/CD.</p>
<p><img src="https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/media/intro-to-deployment-pipelines/full-pipeline.gif" alt="A screenshot of a working deployment pipeline with all three stages, development, test, and production, populated." /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/intro-to-deployment-pipelines">https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/intro-to-deployment-pipelines</a></p>
<h3 id="heading-governance-and-security-consideration">Governance and Security Consideration</h3>
<p>You may have picked up on this through the diagrams already, but a core concept for the remainder of the article is that artifact management in Fabric is centralized around a workspace.</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Architecture%20Layers%20-%20Fabric.svg" alt="Architecture Layers - Fabric" /></p>
<p>In the documentation for implementing medallion lakehouse architecture in Fabric, there's one section that sparked my curiosity.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570489846/2c1aad3d-468f-4f1d-8cb3-981ff985353e.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture">https://learn.microsoft.com/en-us/fabric/onelake/onelake-medallion-lakehouse-architecture</a></p>
<p>Conceptually, the above diagram makes sense per the rule of zone isolation. That said, the diagram shows separation at the lakehouse level, with all lakehouses residing in the same workspace. We'll come back to this in a few minutes, but this is an important piece of the puzzle.</p>
<p>If you continue reading you'll find the following statement:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570469833/03393976-a490-4a08-afe8-93b1e7379d60.png" alt class="image--center mx-auto" /></p>
<p>The statement above contradicts the diagram and suggests that each medallion zone should be broken into its own workspace. Said differently, a bronze lakehouse should be located in a bronze workspace, and so on. The reason behind the recommendation can be found in the lakehouse access control documentation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716570556547/79bac5ae-6d32-46a4-aca1-f5902158a1c9.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-roles-lakehouse">https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-roles-lakehouse</a></p>
<p>Circling back to the idea of all lakehouse zones living in the same workspace, we can see that from a security and governance perspective, this presents our first, rather large issue.</p>
<p>To perform any activity other than "read" over a lakehouse, one must have an admin, member, or contributor role in the workspace in which the lakehouse resides. Granting A/M/C over the workspace grants the applicable permissions to all artifacts in the workspace. In other words, users have visibility into all lakehouse zones and the data within them.</p>
<p>Let's consider a scenario in which we have data with personally identifiable information (PII) such as a social security number in the bronze layer. As the data moves through silver we apply masking rules to the PII data. We want to enable our advanced business analysts and developers to perform tasks other than "read" on the data but we do not want them to see the underlying raw data. We would not be able to facilitate this requirement if the lakehouse zones were contained in a single workspace.</p>
<h3 id="heading-workspace-sprawl">Workspace Sprawl</h3>
<p>The next consideration is workspace sprawl. If we adhere to the recommendation of isolating zones by workspace, the number of workspaces needed multiplies quickly. For example, rather than having a single workspace with three separate lakehouses, we will now have three workspaces with one lakehouse each.</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20-%20Fabric.svg" alt="Isolated Medallion - Fabric" /></p>
<p>I'm sure your mind is already drifting in this direction, but what about handling lifecycle management?</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20ALM%20-%20Fabric.svg" alt="Isolated Medallion ALM - Fabric" /></p>
<p>Let's consider a scenario to justify following ALM for each medallion zone. To surface data in the bronze zone you will likely be copying data from a source system using a pipeline or notebook. Once the data is available in bronze you will create a shortcut to silver.</p>
<p>Thinking about the lifecycle:</p>
<ol>
<li><p>Net new request for data is received.</p>
</li>
<li><p>Begin development of the pipeline or notebook to copy the data (dev.).</p>
</li>
<li><p>Test the copy process to ensure it meets requirements (test).</p>
</li>
<li><p>The copy process is considered stable and is placed on a schedule to ensure current data is available to Silver (prod.).</p>
</li>
</ol>
<p>As you can see, by splitting the medallion zones by workspace we've tripled the number of needed workspaces in our pattern. With the increase in workspace count, the question becomes how do we manage deployments?</p>
<h3 id="heading-artifact-deployment-considerations">Artifact Deployment Considerations</h3>
<p>Fabric deployment pipelines have been refactored quite a bit from their Power BI days, with one of the most significant changes being an increase in the number of supported "stages" from three to ten.</p>
<p>Theoretically, if we followed the recommendations above, one deployment pipeline would support our nine workspaces. However, if we wanted to include additional separation of artifacts or add another zone to our medallion pattern we would exceed the allowed number of workspaces for our pipeline.</p>
<p>Another consideration with deployment pipelines is that they're linear, meaning an artifact must be deployed through all stages sequentially.</p>
<p>In addition to the workspace quantity limit, we're also limited in that a workspace can only belong to one pipeline.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716651627607/9f953837-3924-46a3-b003-7fbc0ef88269.png" alt class="image--center mx-auto" /></p>
<p>Since a workspace can belong to only one deployment pipeline, the idea of chaining multiple pipelines together in the Fabric UI becomes void. Instead, you would need to integrate with DevOps release pipelines to programmatically trigger a deployment pipeline release.</p>
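<p>For what it's worth, triggering a deployment pipeline stage programmatically boils down to a single REST call. Here's a minimal sketch against the Power BI deployment pipelines API (field names are from memory of the deployAll endpoint, and the token is assumed to come from a service principal or user with access to the pipeline, so verify against the current API reference):</p>
<pre><code class="lang-python">import requests

pipeline_id = "your-deployment-pipeline-guid"
token = "your-bearer-token"  # e.g. acquired via azure-identity / MSAL in the release pipeline

# Deploy all supported items from one stage to the next
response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/pipelines/{pipeline_id}/deployAll",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sourceStageOrder": 0,  # 0 = development stage
        "options": {"allowCreateArtifact": True, "allowOverwriteArtifact": True},
    },
)
response.raise_for_status()  # the call is async; poll the returned operation for completion
</code></pre>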
<h3 id="heading-enterprise-architecture-strategies">Enterprise Architecture Strategies</h3>
<p>As I stated in the opening, every team is going to have its own set of requirements that will drive source code management and deployment strategies. Below are a few examples of potential enterprise patterns.</p>
<p><strong>Single Workspace Medallion Pattern:</strong></p>
<p>If security and governance at the lakehouse level aren't a concern, perhaps it doesn't make sense to split your medallion layers by workspace. In such a scenario, a potential architecture could look something like the following:</p>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Single%20Workspace%20Medallion.svg" alt="Single Workspace Medallion" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or on-premise data gateway when working with on-premise or firewall-enabled source systems.</p>
</li>
<li><p>Store data in the lakehouse as close to raw form as possible.</p>
<ol>
<li>Copy data to the warehouse using pipeline (4b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (5b. optional).</li>
</ol>
</li>
<li><p>Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse delta tables. Semantic models may be created as Direct Lake, Import, or Direct Query connections.</p>
</li>
<li><p>Content creators create reports and dashboards for consumption.</p>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working code version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request will trigger the Test / Validate pipeline, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>A workspace app is created and serves as the primary entry point for end-user consumption.</p>
</li>
</ol>
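<p>As a rough sketch of what steps 8 through 11 can look like when driven programmatically, Fabric exposes Git REST endpoints for committing pending workspace changes back to the connected repo. The token and workspace ID are placeholders:</p>
<pre><code class="lang-python"># Rough sketch: commit all pending workspace changes to the connected
# Azure DevOps repo via the Fabric Git API. Assumes the workspace is already
# git-connected and an AAD token with Fabric scope was acquired out-of-band.
import requests

access_token = "&lt;aad-access-token&gt;"  # assumption: acquired out-of-band
workspace_id = "&lt;workspace-id&gt;"      # assumption: the dev workspace

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/git/commitToGit",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "mode": "All",                        # commit every changed item
        "comment": "Sync workspace changes",  # commit message
        # "workspaceHead": "&lt;commit-sha&gt;",  # may be required once the branch has commits
    },
)
response.raise_for_status()
</code></pre>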
<p><strong>Isolated Workspace Medallion Pattern:</strong></p>
<p>For most teams, however, I believe the security and governance conversation can't be ignored so easily, and zone isolation will therefore be required. The overall change in architecture is significant, as the entry points for workloads shift.</p>
<p><strong>Bronze key considerations:</strong></p>
<ul>
<li><p>Data will be read from source systems and written to the bronze layer.</p>
</li>
<li><p>The readiness of the data doesn't yet enable enterprise report development.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(bronze).svg" alt="Isolated Medallion (bronze)" /></p>
<p><em>Note: workspaces from each medallion layer are now managed by individual deployment pipelines</em></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read/copy data from the source system using Spark via notebook, pipeline, or shortcut.</p>
</li>
<li><p>Use managed private endpoints or an on-premises data gateway when working with on-premises or firewall-protected source systems.</p>
</li>
<li><p>Store data in the lakehouse as close to raw form as possible.</p>
<ol>
<li>Copy data to the warehouse using pipeline (4b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (5b. optional).</li>
</ol>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working code version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.</p>
</li>
</ol>
<p><strong>Silver key considerations:</strong></p>
<ul>
<li><p>Data will be read from the bronze layer and written to the silver layer.</p>
</li>
<li><p>The readiness of the data doesn't yet enable enterprise report development.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(silver).svg" alt="Isolated Medallion (silver)" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read data from bronze using a shortcut (see the shortcut sketch after this list).</p>
<ol>
<li>Read data from bronze using pipeline (2b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (3b. optional).</li>
</ol>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric deployment pipeline.</p>
</li>
</ol>
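<p>For step 2, shortcuts can be created through the UI or programmatically. Below is a hedged sketch against the OneLake shortcuts REST API; every ID and the table name are placeholders:</p>
<pre><code class="lang-python"># Hedged sketch: create a OneLake shortcut in the silver lakehouse that
# points at a table folder in the bronze lakehouse. All IDs are placeholders.
import requests

access_token = "&lt;aad-access-token&gt;"
silver_workspace_id = "&lt;silver-workspace-id&gt;"
silver_lakehouse_id = "&lt;silver-lakehouse-id&gt;"

response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{silver_workspace_id}"
    f"/items/{silver_lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {access_token}"},
    json={
        "path": "Tables",        # where the shortcut lands in silver
        "name": "sales_orders",  # hypothetical shortcut name
        "target": {
            "oneLake": {
                "workspaceId": "&lt;bronze-workspace-id&gt;",
                "itemId": "&lt;bronze-lakehouse-id&gt;",
                "path": "Tables/sales_orders",  # hypothetical bronze table
            }
        },
    },
)
response.raise_for_status()
</code></pre>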
<p><strong>Gold key considerations:</strong></p>
<ul>
<li><p>Data will be read from the silver layer and written to the gold layer.</p>
</li>
<li><p>The readiness of the data now enables enterprise report development.</p>
</li>
<li><p>End-user testing of reports will be needed.</p>
</li>
<li><p>Workspace applications will be used for report consumption.</p>
</li>
</ul>
<p><img src="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams/raw/main/Sample%20Architecture%20Diagrams/Isolated%20Medallion%20(gold).svg" alt="Isolated Medallion (gold)" /></p>
<ol>
<li><p>Leverage metadata-driven control framework for orchestration, modeling, table maintenance, data refresh, and logging activities.</p>
</li>
<li><p>Read data from Silver using a shortcut.</p>
<ol>
<li>Read data from Silver using a pipeline (2b. optional).</li>
</ol>
</li>
<li><p>Transform and prepare data using Spark notebooks. Notebooks provide flexibility for developers to use the language native to them (PySpark, SparkSQL, etc.).</p>
<ol>
<li>Transform and prepare data using the warehouse SQL analytics endpoint. Create a shortcut from warehouse to lakehouse (3b. optional).</li>
</ol>
</li>
<li><p>Define joins, create measures and calculation groups, and implement additional granular security within semantic models as an extension of lakehouse/warehouse delta tables. Semantic models may be created as Direct Lake, Import, or DirectQuery connections.</p>
</li>
<li><p>Content creators create reports and dashboards for consumption.</p>
</li>
<li><p>Sync development workspace to git repo in Azure DevOps to commit changes and establish source control.</p>
</li>
<li><p>Feature branches can be created for isolated development workspaces, enabling multi-developer workloads.</p>
</li>
<li><p>Content creators clone remote repos to their local development environment to capture the latest working version.</p>
</li>
<li><p>Commit local development changes to the git repo and create a pull request to merge changes from the feature branch to the main branch.</p>
</li>
<li><p>Completing a pull request triggers the Test / Validate stage, which performs automated tests to validate content before publishing.</p>
</li>
<li><p>The build pipeline is then triggered to prepare content for deployment.</p>
</li>
<li><p>Deployment to test and production is facilitated by the release pipeline.</p>
</li>
<li><p>Release to higher environments is gated by a release manager's approval(s).</p>
</li>
<li><p>The release pipeline performs deployment from dev. to test by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>Testing and QA are performed before content is released to prod.</p>
</li>
<li><p>Release pipeline performs deployment to prod. by triggering the Fabric / Power BI deployment pipeline.</p>
</li>
<li><p>A workspace app is created and serves as the primary entry point for end-user consumption.</p>
</li>
</ol>
<p>By isolating the medallion zones into separate deployment pipelines, you gain more control over each environment, but you also introduce overhead: there are now more DevOps artifacts to build and manage.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>Your choice of architecture will come down to several factors, one of which is the balance between security/governance and the overhead of maintaining the system. Like most things when working with data, there's no one-size-fits-all solution.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/"><strong>https://www.linkedin.com/in/willcrayger/</strong></a></p>
<p>You can find the diagrams in SVG format on GitHub:</p>
<p><a target="_blank" href="https://github.com/Lucid-Will/Fabric-Architecture-Diagrams">https://github.com/Lucid-Will/Fabric-Architecture-Diagrams</a></p>
]]></content:encoded></item><item><title><![CDATA[How To Reduce Data Integration Costs By 98%]]></title><description><![CDATA[One of the amazing things about Microsoft Fabric is the number of options you have for moving data. For example, you can use Dataflow Gen2, Pipelines, Notebooks, or any combination of the three.
However, with all the options available, it can also ma...]]></description><link>https://blog.lucidbi.co/how-to-reduce-data-integration-costs-by-98</link><guid isPermaLink="true">https://blog.lucidbi.co/how-to-reduce-data-integration-costs-by-98</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[microsoft fabric]]></category><dc:creator><![CDATA[William Crayger]]></dc:creator><pubDate>Thu, 11 Apr 2024 16:11:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871009634/6f04cfb1-424b-4a1a-a681-8e687745b362.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the amazing things about Microsoft Fabric is the number of options you have for moving data. For example, you can use Dataflow Gen2, Pipelines, Notebooks, or any combination of the three.</p>
<p>However, with all the options available, it can also make deciding which pattern to use a bit like reading the menu at The Cheesecake Factory. Hopefully, this article will help shed some light on the pros and cons of each approach while also showing you how they impact the elephant in the room: Capacity Units (CUs).</p>
<p>Before we dive in, we need to understand a few things. Let's start with the most important topic: understanding how your capacity consumption is measured.</p>
<p><strong>* Disclosure: Due to the correction of a record-duplication issue, some modifications have been made since this article was originally published. For transparency, I identify the impact throughout the article with comments and (bold) identifiers.</strong></p>
<h3 id="heading-brief-capacity-overview">Brief Capacity Overview</h3>
<p>Premium Capacity has been around for a few years as part of the Power BI stack. However, until the release of Fabric, you only had to worry about how Power BI impacted your capacity. With Fabric, "all" experiences have been standardized on the same serverless compute, meaning pipelines, notebooks, SQL endpoints, and so on all draw from your available capacity pool. Because of this, it's more important than ever to truly understand the cost of each workload.</p>
<p>Like Power BI Premium, Fabric capacities have many SKUs, each giving you a specified amount of compute to work with.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712757838861/898bf589-51f5-48ff-8d75-67129aca302d.png" alt class="image--center mx-auto" /></p>
<p>There's an additional SKU that isn't listed, FT1 or Fabric Trial Capacity. The FT1 is equivalent to the F64 SKU, which is also equivalent to the existing Power BI P1 SKU.</p>
<p><strong>All tests performed in this article were done so using an FT1.</strong></p>
<h3 id="heading-capacity-metrics-first-look">Capacity Metrics: First Look</h3>
<p>To begin understanding your capacity consumption, the Microsoft team has created a Fabric Capacity Metrics app that can be deployed from the Power BI app collection. Go to the Apps section of your left-side navigation, click "Get apps" in the top right, and search Fabric.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712758086999/4165a77c-2eeb-457e-8061-6d9c442b37d2.png" alt class="image--center mx-auto" /></p>
<p>Once you have the app installed, you'll need to authenticate and make a few selections regarding the configuration of the timezone and such; it's quite straightforward, though.</p>
<p>At last, you're ready to see the magic! Well, it's not quite magic, but it's a starting point. The app has a few challenges, as you'll soon encounter; perhaps the biggest is that only a rolling 14 days of history are available, which makes historical analysis quite challenging. There are ways around this, which I'll cover in a later article; for now, let's stay on topic.</p>
<p>Upon taking your first look at the metrics, your reaction might be, "What the heck am I looking at?" With all analysis, unless you understand the numbers, they're just numbers. It's a movie without a plot. They exist, but what do they mean? What story are they telling?</p>
<h3 id="heading-capacity-unit-the-numbers">Capacity Unit: The Numbers</h3>
<p>For our analysis, we will home in on a specific data point: CU(s).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712758465800/8e2cbe25-4419-43e6-81b2-91eae53fdbd1.png" alt class="image--center mx-auto" /></p>
<p>These look like big numbers. These activities must be super expensive, right? Like the old Transformers theme song, there's "more than meets the eye."</p>
<p>To understand the true cost of an activity, we have to do some math. Referring to the chart in the previous section, we're using the equivalent of an F64, which means we have 64 Capacity Units at our disposal. To translate that, we must first understand what a Capacity Unit is. Let's break it down.</p>
<p>First and foremost, a Capacity Unit is not the same as a CU(s). A Capacity Unit is a measurement based on hours, as indicated here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712759196209/bf42f614-dd3d-4c52-9fdc-0a0cb0efbee5.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712759240866/d102d8da-57e6-479e-aace-2f2a59ffc230.png" alt class="image--center mx-auto" /></p>
<p>CU(s) is a measurement of seconds, which can be misleading. I initially thought the Duration was somehow used in the conversion, but this is not true. The conversion of Capacity Unit to CU(s) is as follows:</p>
<p><code>1 Capacity Unit (hour) = 60 seconds * 60 minutes = 3,600 (seconds per hour)</code></p>
<p><code>For our F64 (FT1) with 64 Capacity Units:</code></p>
<p><code>64 Capacity Units (hours) = 3,600 CU(s) * 64 = 230,400 CU(s) per hour</code></p>
<p>To translate this to cost, I'll use the first row from the capacity metrics app screenshot above with a CU(s) consumption of 909,747.44.</p>
<p>With our understanding of the conversion between CU(s) and Capacity Units, we can convert CU(s) to cost:</p>
<p><code>Capacity Units = 909,747.44 CU(s) / 3,600 = 252.707622222</code></p>
<p><code>PayGo Cost per Capacity Unit = $11.52 / 64 = $0.18</code></p>
<p><code>PayGo Cost (USD) = 252.71 * $0.18 = ~$45.49</code></p>
<p><code>Reservation Cost per Capacity Unit = $6.853 / 64 = $0.107078125</code></p>
<p><code>Reservation Cost (USD) = ~$27.06</code></p>
<p>If you really wanted to break it down further to determine the cost per CU(s):</p>
<p><code>PayGo Cost per Capacity Unit = $11.52 / 64 = $0.18</code></p>
<p><code>PayGo Cost per CU(s) = $0.18 / 3,600 = $0.00005</code></p>
<p><code>Reservation Cost per Capacity Unit = $6.853 / 64 = ~$0.107078125</code></p>
<p><code>Reservation Cost per CU(s) = $0.107078125 / 3,600 = ~$0.000029744</code></p>
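<p>If you'd rather not do this by hand, the conversions above reduce to a few lines of Python. This is just the arithmetic from this section; the rates are the F64 hourly prices quoted above:</p>
<pre><code class="lang-python"># Encode the CU(s)-to-dollars conversion worked through above.
# Rates are the F64 hourly prices quoted in this article.
SECONDS_PER_CAPACITY_UNIT_HOUR = 3_600
F64_CAPACITY_UNITS = 64

PAYGO_RATE_PER_HOUR = 11.52        # $/hour for an F64, pay-as-you-go
RESERVATION_RATE_PER_HOUR = 6.853  # $/hour for an F64, reserved

def cost_from_cu_seconds(cu_seconds, hourly_rate, capacity_units=F64_CAPACITY_UNITS):
    """Convert a CU(s) reading from the metrics app into dollars."""
    capacity_unit_hours = cu_seconds / SECONDS_PER_CAPACITY_UNIT_HOUR
    rate_per_capacity_unit = hourly_rate / capacity_units
    return capacity_unit_hours * rate_per_capacity_unit

# The 909,747.44 CU(s) example from the screenshot above:
print(round(cost_from_cu_seconds(909_747.44, PAYGO_RATE_PER_HOUR), 2))        # ~45.49
print(round(cost_from_cu_seconds(909_747.44, RESERVATION_RATE_PER_HOUR), 2))  # ~27.06
</code></pre>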
<p>Now that you understand the meaning behind the numbers, let's dig into the real purpose of this article.</p>
<p><strong>For the remainder of the article, any cost-related metrics will be calculated using the Reservation rate.</strong></p>
<h3 id="heading-all-experiences-were-not-created-equal">All Experiences Were Not Created Equal</h3>
<p>To make this a bit easier to digest, I've created a monitoring report to help me tell the story.</p>
<p>Welcome to the Lucid capacity monitoring report!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871192329/ed2d2bb0-b3b2-48e7-8e1e-d61d157a796c.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>The custom monitoring report uses a combination of data elements from the Fabric tenant as well as data points captured directly from the Fabric Capacity Metrics app. The backend of this report is a Fabric Lakehouse that's populated by a Spark notebook at scheduled intervals.</p>
<p>The notebook has been written to perform the following operations:</p>
<ul>
<li><p>Refresh the Fabric Capacity Metrics semantic model</p>
</li>
<li><p>Capture data about the Fabric tenant, such as workspace, capacity, and item details, into a series of stage tables</p>
</li>
<li><p>Create a calendar stage table</p>
</li>
<li><p>Perform a dynamic UPSERT from the stage tables to a set of dimensional tables (sketched below)</p>
</li>
</ul>
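<p>As a rough sketch of that UPSERT step (the table and key names are hypothetical stand-ins for the real framework, and the notebook's ambient <code>spark</code> session is assumed), the merge can be expressed with the Delta Lake API:</p>
<pre><code class="lang-python"># Rough sketch of the stage-to-dimension UPSERT using the Delta Lake merge API.
# Table and key names are hypothetical; spark is the notebook's ambient session.
from delta.tables import DeltaTable

stage_df = spark.read.table("stage_workspace")          # staged tenant snapshot
dim_table = DeltaTable.forName(spark, "dim_workspace")  # target dimension table

(
    dim_table.alias("dim")
    .merge(stage_df.alias("stg"), "dim.workspace_id = stg.workspace_id")
    .whenMatchedUpdateAll()     # refresh attributes for existing workspaces
    .whenNotMatchedInsertAll()  # insert newly discovered workspaces
    .execute()
)
</code></pre>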
<p>The Lucid monitoring report is then connected to a semantic model of the Lakehouse via Direct Lake mode.</p>
<h3 id="heading-setting-the-stage">Setting The Stage</h3>
<p>For our comparisons, we will focus on different processing patterns using Pipelines and Notebooks. The three patterns I'll be reviewing are:</p>
<ol>
<li><p>A traditional pipeline pattern using nested pipelines with a ForEach loop.</p>
</li>
<li><p>A notebook that generates the processing list and passes it to a single pipeline via API.</p>
</li>
<li><p>A notebook that generates the processing list and also "copies" directly from the source.</p>
</li>
</ol>
<p>Each scenario follows the same structure, reading data from the source and writing a parquet file to a designated folder in a Lakehouse.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712845009716/25a18425-7159-4d91-8676-c99f9ee4e69e.png" alt class="image--center mx-auto" /></p>
<p><strong>All tests were scheduled to run hourly and performed at staggered intervals to minimize the potential of a noisy neighbor impacting the test.</strong></p>
<p>I used the WideWorldImporters OLAP database hosted in an Azure SQL database for sample data. To simulate real-world examples, I have a daily pipeline to execute a stored procedure that populates fresh data.</p>
<p>Sample:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712763708345/c2633e61-94bd-42ef-91db-cc571697956e.png" alt class="image--center mx-auto" /></p>
<p>Additionally, each pattern uses a basic metadata-driven approach consisting of a single Azure SQL database containing a few control tables. There are a total of 44 tables that are processed as part of my testing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712763740921/1ccdef1b-573a-4395-86f3-da1dc66d3b94.png" alt class="image--center mx-auto" /></p>
<p>dbo.Copy sample:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712845206125/ced2ece4-2225-46a4-b407-73fcba71a784.png" alt class="image--center mx-auto" /></p>
<p>My testing aimed to understand the efficiency and consumption of each processing pattern with respect to Fabric workloads.</p>
<h3 id="heading-traditional-parent-child-pipeline-pattern">Traditional Parent / Child Pipeline Pattern</h3>
<p>A simple and efficient pattern to dynamically process data using pipelines is to use a parent/child relationship. In this pattern, a parent pipeline typically generates a list of items to process before passing the items in the list to a ForEach loop.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712764034673/6444b411-e44f-4682-9877-bec5979238ef.png" alt class="image--center mx-auto" /></p>
<p>Inside the ForEach loop is usually an activity to execute another pipeline, the child. In this example, the child pipeline sets the path where the parquet file will be written, performs the copy, and logs the path for later retrieval.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712764098120/0f8614dd-706e-4f77-92ec-f6ea59c392b8.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-scenario-1-results">Scenario 1 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871294142/9d660033-37ed-41e3-8dee-4d9b95156140.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>The test for this pattern was quite interesting. Repeating values. Why are there repeating values? I thought maybe I had a bad measure or was missing a relationship. I went back to my Lakehouse to check the data. Interestingly, the capacity units consumed remain static, but the activity duration and other metrics fluctuate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871518743/c8a7a4d6-3d6b-4ba9-a782-4fd57eae4d24.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Looking at the child pipeline, we see the same pattern.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871651768/584a18e7-9490-4ce4-99d7-1dd66a644aa6.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>This looks like capacity "smoothing" kicking in and spreading the total consumption of the runs out over time. That said, I'm not 100% sure and would like to dig in more to confirm my suspicion.</p>
<p>For now, let's continue and focus on the total consumption for all runs in the day. As we can see from our report, this pattern cost us <strong>$4.29</strong> for the day and consumed <strong>2.6%</strong> of the total available daily compute.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871899790/fe38dff2-0fd6-4971-a199-f2c0229198dd.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<h3 id="heading-notebook-orchestration-with-pipeline-copy">Notebook Orchestration with Pipeline Copy</h3>
<p>The next scenario I wanted to test combines the use of both a pipeline and a notebook. In this pattern, we use a notebook to replace the parent pipeline from Scenario 1.</p>
<p>There are several reasons why you might want to do this. Pipelines are efficient at copying data but can often be too rigid for lookups or other configuration requirements, especially when working with metadata frameworks. Because of this, developers begin to layer in nested pipelines, which can become quite expensive.</p>
<p>By combining the use of a notebook to build the configuration and a pipeline to perform the copy, you have much more flexibility and control over your process.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Function to be executed in parallel for each row to call API</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">call_api_with_payload</span>(<span class="hljs-params">row, workspace_id, item_id, job_type, client</span>):</span>
    <span class="hljs-string">"""
    Orchestration pattern consumption analysis
    Tests performed by Will Crayger of Lucid
    """</span>

    <span class="hljs-comment"># Extract parameters for payload from the row</span>
    payload = {
        <span class="hljs-string">"executionData"</span>: {
            <span class="hljs-string">"parameters"</span>: {
                <span class="hljs-string">"schema"</span>: row[<span class="hljs-string">"Schema"</span>],
                <span class="hljs-string">"object"</span>: row[<span class="hljs-string">"Object"</span>],
            }
        }
    }

    <span class="hljs-comment"># Call the Fabric REST API</span>
    <span class="hljs-keyword">try</span>:
        response = client.post(<span class="hljs-string">f"/v1/workspaces/<span class="hljs-subst">{workspace_id}</span>/items/<span class="hljs-subst">{item_id}</span>/jobs/instances?jobType=<span class="hljs-subst">{job_type}</span>"</span>, json=payload)
        <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">202</span>:
            <span class="hljs-keyword">pass</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">return</span> response.json()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Print error</span>
        print(<span class="hljs-string">f"An error occurred while calling API: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Retrieve processing list</span>
process_list_sql = <span class="hljs-string">"SELECT * FROM [dbo].[Copy]"</span>
df_process_list = spark.read.format(<span class="hljs-string">"jdbc"</span>).option(<span class="hljs-string">"url"</span>, key_vault_secret).option(<span class="hljs-string">"query"</span>, process_list_sql).load()

<span class="hljs-comment"># Convert to Pandas DataFrame</span>
df_pandas = df_process_list.toPandas()

<span class="hljs-comment"># Define parameters for scheduler</span>
workspace_id = fabric.get_workspace_id()
item_id = <span class="hljs-string">"&lt;your_item_id&gt;"</span>
job_type = <span class="hljs-string">"Pipeline"</span>

<span class="hljs-comment"># Use ThreadPoolExecutor to call APIs concurrently</span>
<span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=min(len(df_pandas), (os.cpu_count() <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>) * <span class="hljs-number">5</span>)) <span class="hljs-keyword">as</span> executor:
    <span class="hljs-comment"># Submit tasks to the executor</span>
    future_to_row = {executor.submit(call_api_with_payload, row, workspace_id, item_id, job_type, client): index <span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df_pandas.iterrows()}
</code></pre>
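<p>One detail the snippet leaves implicit is draining the futures. A minimal follow-up, reusing the same names, could look like this (recall that the function only returns a payload when a submission doesn't come back with a 202):</p>
<pre><code class="lang-python"># Minimal follow-up: wait for every submitted call and surface any error
# payloads returned by call_api_with_payload (accepted jobs return nothing).
from concurrent.futures import as_completed

for future in as_completed(future_to_row):
    row_index = future_to_row[future]
    result = future.result()
    if result is not None:
        print(f"Row {row_index} returned an error payload: {result}")
</code></pre>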
<h3 id="heading-scenario-2-results">Scenario 2 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712871932797/e8f02df2-9b82-40e3-a942-03f11d4073dd.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>There are currently a few challenges with this approach in Fabric. One is that there appears to be a limit on the API itself that allows only 10 concurrent connections. Anything beyond 10 connections is throttled, queued, and executed as previous connections close. I wasn't able to find this in the documentation, though.</p>
<p>Further investigation shows that the CU(s) required for the pipeline execution itself are comparable to those of the Scenario 1 ExecuteCopy activity. The decrease in efficiency is attributed to the notebook remaining active while the API calls are throttled.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712779091957/817a1eab-a7a4-498a-aa8b-99e8c225e028.png" alt class="image--center mx-auto" /></p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/fabric/articles/throttling">https://learn.microsoft.com/en-us/rest/api/fabric/articles/throttling</a></p>
<p><strong>Scenario 1:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712872506925/1c111929-e479-46dc-86de-688015e9db18.png" alt class="image--center mx-auto" /></p>
<p><strong>Scenario 2:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712872534588/14423ce9-75c6-4a80-9e39-b1701a72e134.png" alt class="image--center mx-auto" /></p>
<p>This is also visible in the monitoring hub, as only 10 pipeline executions trigger at once. As you can see, the Notebook_Orchestration activity remains in an "In Progress" status until all pipelines have been executed, thus increasing consumption.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712779355595/ce276aca-be11-43c1-b8e2-b78b0a1e7eea.png" alt class="image--center mx-auto" /></p>
<p>My initial thought was to bypass the Semantic Link API and try a potential workaround, but I soon remembered the next challenge with this approach: there's currently no support for service principal authentication, meaning another API strategy is a no-go.</p>
<p>Scenario 2 was <strong>~18% less efficient</strong>, and correspondingly more expensive, than Scenario 1.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873165474/0b318882-f800-4154-85c0-4398be413948.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<h3 id="heading-notebook-orchestration-and-copy">Notebook Orchestration and Copy</h3>
<p>The final scenario I tested was using a notebook to orchestrate and copy the data.</p>
<p>For years, we've relied on pipelines, and before pipelines, we used tools like SSIS to create orchestration packages. We've used this pattern for so long that it's become muscle memory. There's a reason for this, though. They're easy to use!</p>
<p>Setting up a pipeline is as simple as clicking through a GUI these days, and with the ability to integrate things like Dataflows Gen2, things will only get easier. However, ease comes with a significant cost.</p>
<p>Spark processing in tools like Fabric and Databricks opens the door to more programmatic ETL/ELT patterns like the one below.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_source_data</span>(<span class="hljs-params">row</span>):</span>
    <span class="hljs-string">"""
    Orchestration pattern consumption analysis
    Tests performed by Will Crayger of Lucid
    """</span>
    <span class="hljs-keyword">try</span>:

        <span class="hljs-comment"># Create dynamic SQL using the row values</span>
        dynamic_sql_query = <span class="hljs-string">f"SELECT * FROM [<span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>].[<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>]"</span>

        <span class="hljs-comment"># Read source data to DataFrame and write to Delta        </span>
        df_source = spark.read.format(<span class="hljs-string">"jdbc"</span>) \
                    .option(<span class="hljs-string">"url"</span>, source_connection) \
                    .option(<span class="hljs-string">"query"</span>, dynamic_sql_query) \
                    .load()

        <span class="hljs-comment"># Set staging table name using the row values</span>
        stage_file = <span class="hljs-string">f"Files/WideWorldImporters_Scenario3/<span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>_<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>"</span>

        <span class="hljs-comment"># Write to delta</span>
        df_source.write.format(<span class="hljs-string">"parquet"</span>) \
            .mode(<span class="hljs-string">"overwrite"</span>) \
            .save(stage_file)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error processing <span class="hljs-subst">{row[<span class="hljs-string">'Schema'</span>]}</span>_<span class="hljs-subst">{row[<span class="hljs-string">'Object'</span>]}</span>: <span class="hljs-subst">{e}</span>"</span>)

<span class="hljs-comment"># Retrieve processing list and convert to Pandas DataFrame</span>
process_list_sql = <span class="hljs-string">"SELECT * FROM [dbo].[Copy]"</span>
df_process_list = spark.read.format(<span class="hljs-string">"jdbc"</span>) \
                    .option(<span class="hljs-string">"url"</span>, control_connection) \
                    .option(<span class="hljs-string">"query"</span>, process_list_sql) \
                    .load()
df_process_list_pandas = df_process_list.toPandas()

<span class="hljs-comment"># Use ThreadPoolExecutor to execute the function in parallel for each row</span>
max_workers = min(len(df_process_list_pandas), (os.cpu_count() <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>) * <span class="hljs-number">5</span>)
<span class="hljs-keyword">with</span> ThreadPoolExecutor(max_workers=max_workers) <span class="hljs-keyword">as</span> executor:
    <span class="hljs-comment"># Submit tasks to the executor for each row in the Pandas DataFrame</span>
    futures = [executor.submit(read_source_data, row) <span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df_process_list_pandas.iterrows()]
</code></pre>
<p>In this example, you'll notice two SQL executions: one to retrieve the list of objects to process and another to execute a SELECT against the source system over a JDBC connection. You'll also notice the use of ThreadPoolExecutor, which enables parallel execution.</p>
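<p>Worth noting, though it wasn't part of my tests: for a single large table, Spark's JDBC reader can also parallelize the read itself. A hedged sketch, with a hypothetical table and bounds:</p>
<pre><code class="lang-python"># Optional tweak (not part of the tested pattern): Spark's JDBC reader can
# split a single table read across partitions. The table name, partition
# column, and bounds are hypothetical and must suit the source table.
df_large = (
    spark.read.format("jdbc")
    .option("url", source_connection)
    .option("dbtable", "[Fact].[Sale]")    # hypothetical large table
    .option("partitionColumn", "SaleKey")  # must be a numeric or date column
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")          # 8 concurrent JDBC reads
    .load()
)
</code></pre>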
<h3 id="heading-scenario-3-results">Scenario 3 Results</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712848110310/2ca7087c-a0ba-45a9-8b38-e5cffa2c0c0a.png" alt class="image--center mx-auto" /></p>
<p><strong>*Scenario 3 was not impacted and remains unchanged</strong></p>
<p>If you're at a loss for words, don't worry; I was right there with you when I saw the results. Absolutely shocking!</p>
<p>Now, this approach isn't without considerations. Traditionally, this scenario can be challenging for on-premises environments or sources behind a firewall. The Fabric team has rolled out several new features, such as VNet and gateway integration, to alleviate some of these concerns. Another consideration is that not all sources support direct reads using Spark.</p>
<p>Let's look at the comparisons by the numbers.</p>
<p>Scenario 2 vs. Scenario 3</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873501532/d5045c41-5227-49ed-b1fd-bb580a408628.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Scenario 1 vs. Scenario 3</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712873542513/09366f52-ec36-45ce-b4b7-385d32ffcab5.png" alt class="image--center mx-auto" /></p>
<p><strong>*Image updated</strong></p>
<p>Comparisons show an improvement of <strong>~98%</strong> for cost and CU(s) consumption across the board.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>As the numbers show, traditional metadata patterns, while easy, can also be incredibly costly. Every data team should review and potentially redesign their frameworks to address these inefficiencies.</p>
<p>At Lucid, I've begun following a Spark-first approach and will only use Pipelines when required. I've also developed a framework to quickly deploy and integrate within my client environments, allowing them to focus on decision-making and giving them back their most valuable asset, time.</p>
<p>If you'd like to learn more about how Lucid can support your team, let's connect on LinkedIn and schedule an intro call.</p>
<p><a target="_blank" href="https://www.linkedin.com/in/willcrayger/">https://www.linkedin.com/in/willcrayger/</a></p>
]]></content:encoded></item></channel></rss>