For quite some time, I’ve wanted to write about semantic model memory optimization in Power BI and Fabric.
It’s a topic that keeps coming up in projects, training, and community conversations – and one where many people struggle once their models start to grow.
Before I could write about optimizing semantic models, I needed a proper lab environment:
- large enough to be interesting
- realistic enough to expose real problems
- flexible enough to support multiple modeling patterns
This blog post is about how I built that lab environment in Microsoft Fabric.
The actual optimization patterns, VertiPaq analysis, and semantic modeling techniques will come later – including a series of posts on the official Tabular Editor blog.
If you’re reading this because of those posts, this is the “how did you build the dataset?” part.
A transparency note (important)
Before we go any further, I want to be completely transparent:
The notebooks and SQL views used in this lab were created with extensive help from ChatGPT.
I am not a Python developer, and the majority of the Spark and SQL code is AI-generated.
My role in this project was not to hand-write Spark code, but to:
- define requirements and constraints
- validate results
- iterate on design decisions
- debug data quality issues
- decide what is correct and what is acceptable
AI helped with syntax and structure.
The responsibility for architecture, correctness, and modeling decisions is still mine.
I’m being explicit about this because I believe this reflects how many of us actually work today – and because the value of this project is not the Python code itself.
Why build a dedicated lab?
If the goal is to talk about semantic model memory, small demo datasets don’t really help.
I wanted a dataset that would:
- span many years
- include bad data
- allow “bad” modeling choices on purpose
- scale to billions of rows
That way, memory behavior becomes visible and measurable – not theoretical.
For this, the NYC TLC Yellow Taxi Trip Records are ideal:
- public
- well-known
- large enough to stress real systems
The dataset spans from 2009 to 2025 and contains more than 1.8 billion rows, with around 40 GB of source parquet files.
High-level pipeline
The lab is built entirely in Microsoft Fabric, using notebooks and Delta tables.
At a high level, the pipeline looks like this:
- Ingest raw parquet files into OneLake
- Unify 17 years of schema evolution into a single structure
- Build a wide Gold fact table (intentionally not optimized)
- Create dimensions for later semantic modeling
Each step is implemented as a Fabric notebook.
Step 1: Ingesting the raw data
The first notebook is intentionally boring – and that's a good thing.
Its only job is to:
- download the public parquet files over HTTPS
- copy them into OneLake
- store them in a Hive-style folder structure by year and month
No transformations, no cleansing, no schema changes.
This preserves the raw data exactly as published and gives a clean starting point.
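For readers who want to see the shape of that notebook, here is a minimal sketch of the ingestion loop. It assumes the default Lakehouse is mounted at /lakehouse/default and uses an illustrative base URL and folder layout – the actual notebook in the GitHub repository is the reference.

```python
# Minimal ingestion sketch.
# Assumptions: the default Lakehouse is mounted at /lakehouse/default,
# and the TLC download URL pattern is unchanged (check the TLC Trip Record Data page).
import os
import requests

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"   # public TLC parquet endpoint
TARGET_ROOT = "/lakehouse/default/Files/raw/yellow_taxi"        # Files area of the Lakehouse (OneLake)

for year in range(2009, 2026):
    for month in range(1, 13):
        file_name = f"yellow_tripdata_{year}-{month:02d}.parquet"
        # Hive-style partition folders: year=YYYY/month=MM
        target_dir = f"{TARGET_ROOT}/year={year}/month={month:02d}"
        os.makedirs(target_dir, exist_ok=True)

        response = requests.get(f"{BASE_URL}/{file_name}", timeout=300)
        if response.status_code == 200:
            with open(f"{target_dir}/{file_name}", "wb") as f:
                f.write(response.content)
        else:
            # Months that are not published yet simply get skipped.
            print(f"Skipped {file_name}: HTTP {response.status_code}")
```

No Spark is needed at this stage – it is just a download-and-copy loop that leaves the files exactly as published.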
Step 2: Unifying the schema (where most of the work happens)
This is the most important notebook in the entire project.
Over the years, the TLC dataset has:
- renamed columns
- changed data types
- added new fields
- removed old ones
- used different naming conventions
If you simply union all the files together, the mismatched column names and types make the result unusable.
In this notebook, I:
- scanned schemas across all months and years
- identified schema variants
- normalized column names
- aligned data types
- introduced missing columns with nulls
- handled legacy field names
The goal was not to clean the data perfectly, but to make it consistent.
At the end of this step, I had a single, unified dataset spanning 2009–2025 that Spark could actually work with.
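To make the conforming logic concrete, here is a heavily simplified sketch. The rename map, the target schema, and the table name yellow_trips_silver are illustrative stand-ins – the real notebook covers far more columns and schema variants.

```python
# Schema-unification sketch. `spark` is the SparkSession provided by the Fabric notebook.
# Assumptions: raw files under Files/raw/yellow_taxi, target Delta table yellow_trips_silver.
from pyspark.sql import functions as F

RAW_ROOT = "Files/raw/yellow_taxi"

# Legacy column names -> unified names (only a few examples shown here).
RENAME_MAP = {
    "Trip_Pickup_DateTime": "tpep_pickup_datetime",
    "Trip_Dropoff_DateTime": "tpep_dropoff_datetime",
    "vendor_name": "VendorID",
}

# Unified column name -> target Spark SQL type (simplified).
TARGET_SCHEMA = {
    "VendorID": "string",
    "tpep_pickup_datetime": "timestamp",
    "tpep_dropoff_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "total_amount": "double",
}

def conform(df):
    # 1. Normalize legacy column names.
    for old, new in RENAME_MAP.items():
        if old in df.columns:
            df = df.withColumnRenamed(old, new)
    # 2. Add missing columns as nulls and cast everything to the target type.
    for col, dtype in TARGET_SCHEMA.items():
        if col not in df.columns:
            df = df.withColumn(col, F.lit(None).cast(dtype))
        else:
            df = df.withColumn(col, F.col(col).cast(dtype))
    # 3. Keep only the unified columns, in a fixed order.
    return df.select(*TARGET_SCHEMA.keys())

# Read each month separately so schema differences never collide,
# conform it, and append to a single Delta table.
for year in range(2009, 2026):
    for month in range(1, 13):
        path = f"{RAW_ROOT}/year={year}/month={month:02d}"
        try:
            df = spark.read.parquet(path)
        except Exception:
            continue  # month not present in the raw layer
        conform(df).write.format("delta").mode("append").saveAsTable("yellow_trips_silver")
```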
Step 3: Creating a deliberately “bad” fact table
This step is counterintuitive if you’re used to building production models.
Instead of optimizing early, I intentionally created a wide fact table that includes:
- high-cardinality columns
- GUID-like identifiers
- string-heavy attributes
- redundant fields
- unnecessary numeric precision
Why?
Because the goal of this lab is to observe memory behavior, not to avoid it.
If the dataset is already perfectly optimized, there’s nothing interesting to analyze later.
This table is huge – hundreds of gigabytes – and that is by design.
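As an illustration of what "bad on purpose" looks like in code, here is a small sketch. The column names and the specific anti-patterns are examples I picked for this post, not a copy of the actual notebook.

```python
# "Bad on purpose" fact-table sketch.
# Assumptions: source table yellow_trips_silver, target table fact_yellow_trips_wide.
from pyspark.sql import functions as F

silver = spark.read.table("yellow_trips_silver")

gold = (
    silver
    # GUID-like, near-unique identifier: maximal dictionary size for VertiPaq later.
    .withColumn("trip_guid", F.expr("uuid()"))
    # String-heavy attribute that repeats what other columns already encode.
    .withColumn(
        "trip_description",
        F.concat_ws(
            " ",
            F.lit("Trip on"),
            F.col("tpep_pickup_datetime").cast("string"),
            F.lit("for"),
            F.format_number("total_amount", 2),
            F.lit("USD"),
        ),
    )
    # Redundant field: the date is already derivable from the timestamp.
    .withColumn("pickup_date_text", F.date_format("tpep_pickup_datetime", "yyyy-MM-dd"))
    # Unnecessary numeric precision: distance stored as a very wide decimal.
    .withColumn("trip_distance_precise", F.col("trip_distance").cast("decimal(38,18)"))
)

gold.write.format("delta").mode("overwrite").saveAsTable("fact_yellow_trips_wide")
```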
Step 4: Building dimensions (and handling bad data)
The final notebook creates the supporting dimensions:
- Date
- Time of day
- Taxi zone (pickup and dropoff)
- Vendor
- Rate code
- Payment type
This is also where real-world data quality issues show up.
For example, the dataset contains a small number of valid but implausible dates:
- years far outside the dataset’s range (e.g. 2001 or 2098)
Rather than filtering rows out or extending the date dimension indefinitely, I chose to:
- introduce a single Unknown Date member
- explicitly map out-of-range values to it
This keeps referential integrity intact and reflects a modeling decision many of us face in real projects.
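In code, the mapping is essentially a conditional key assignment. The table and column names, the surrogate key value, and the year-range check below are assumptions for the sketch.

```python
# Unknown-Date handling sketch.
# Assumptions: date dimension dim_date with an integer DateKey, fact table
# fact_yellow_trips_wide, valid range 2009-2025; all names are illustrative.
from pyspark.sql import functions as F

UNKNOWN_DATE_KEY = 19000101  # surrogate key of the single "Unknown Date" member

# The date dimension gets one extra row for the Unknown member (columns illustrative).
unknown_row = spark.createDataFrame(
    [(UNKNOWN_DATE_KEY, None, "Unknown")],
    "DateKey int, Date date, DateLabel string",
)
unknown_row.write.format("delta").mode("append").saveAsTable("dim_date")

# Out-of-range (or null) pickup dates are mapped to the Unknown member.
fact = spark.read.table("fact_yellow_trips_wide")

fact_with_keys = fact.withColumn(
    "PickupDateKey",
    F.when(
        F.year("tpep_pickup_datetime").between(2009, 2025),
        F.date_format("tpep_pickup_datetime", "yyyyMMdd").cast("int"),
    ).otherwise(F.lit(UNKNOWN_DATE_KEY)),  # 2001, 2098, nulls ... all land here
)

# Written to a separate table here to keep the sketch self-contained.
fact_with_keys.write.format("delta").mode("overwrite").saveAsTable("fact_yellow_trips_keyed")
```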
Why this lab matters
At this point, I have:
- a large, stable dataset
- real schema evolution
- real data quality issues
- intentional modeling anti-patterns
Most importantly: I can now build multiple semantic models on top of the same data and change only the modeling decisions.
That’s the foundation for the next part of the journey.
What comes next
This blog post is about building the lab.
The next step is using that lab to explore:
- Import vs Direct Lake
- model size vs SKU limits
- dictionary sizes and cardinality
- date/time modeling choices
- high-cardinality columns
- refresh behavior and partitioning
Those topics will be covered in a series of posts on the Tabular Editor blog, where the focus shifts from data engineering to semantic modeling.
If you’re mainly interested in how the data was prepared, this post and the accompanying GitHub repository should give you full transparency.
Final thoughts on using AI
One last note.
Using ChatGPT did not remove complexity from this project – it moved it.
Instead of spending time on syntax, I spent time on:
- asking better questions
- validating assumptions
- interpreting results
- deciding what to keep and what to change
That trade-off is exactly why this lab exists.