Principles of Good Data Engineering

One major challenge is the ambiguity in metrics and in human questions. For data to deliver meaningful answers, it must be supported by well-defined semantic models, governance, and clear metric definitions. To address this, organizations must build systems that validate questions and maintain semantic relationships between data entities.

- Andres Vourakis from Data and AI Stockholm Event

For the past few weeks in my Data Visualization course, something has been bothering me in the back of my mind, quietly eating at me as a data engineer in training.

I was one of the data engineers on a team of data engineers and UX designers, and our task was to deliver Power BI dashboards using a dataset of our choice. We decided to use Netflix data from Tudum, and after performing exploratory data analysis, we started to understand the dataset more deeply.

Early on, in sprint one, I suggested that building a star schema would be the best approach based on the structure, grain, and overall schema of the data. However, since it wasn’t required to pass the course, there was pushback. It’s common in teams for people to question something that seems overcomplicated. So I followed the majority, just to be a "team player." But deep down, I wasn't happy with myself. I knew that building a data model for the semantic layer is extra work, but it has to be done. I didn't build it anyway, and that bothered me for weeks, in silence.

In my mind, it wasn’t overcomplication. I kept thinking about a talk I attended in Stockholm about Data and AI, where the importance of semantic models really stood out to me. Still, I ignored my instinct and went with the majority direction. As we built the dashboards, things started to break. Some filters worked, others didn’t, and some insights and KPIs were difficult to extract because of the underlying model. At that point, I knew the issue was the semantic model, but I lacked the confidence to push back further since neither I nor anyone else in the course had taken that approach before.

By sprint two, I started losing motivation. The dashboard kept breaking, and I was frustrated because I knew I SHOULD TAKE ACTION to make it better. I kept asking myself what was stopping me from fixing it.

Around that time, I was reading Fundamentals of Data Engineering by Joe Reis and Matt Housley, and I reached the section on the principles of good data engineering. That chapter really stayed with me. It was the last push I needed to take action, improve the project, and learn as an individual.

One Monday morning, I decided to go back to VS Code and rebuild everything, this time creating the star schema outside of Power BI for more flexibility and control. What really clicked for me were a few principles from the book.

Always Be Architecting

Architecture is never finished. Data systems are living systems. They evolve as requirements change, as new tools emerge, and as scale increases. That means constantly reviewing the architecture, tracking technical debt, questioning assumptions, and simplifying where possible. The mistake to avoid is treating architecture as something you design once and forget. It’s something you continuously refine and improve over time.

Plan for Failure

It’s okay that the first model didn’t work; that’s part of the process. But assess and address the problem straight away, especially when it's the backbone of the project. If Path A doesn't work, create a Path B, a Path C, and so on.

Architect for Scalability

As the project evolved, feature requests came from every direction and feedback kept growing, and the limitations of our initial model became more obvious.

Architecture is Leadership

"Data architects and engineers are responsible for technology decisions and architecture descriptions and disseminating these choices through effective leadership and training. Data architects and engineers should also be highly technically competent, but delegate most individual contributor work to others. Strong leadership skills combined with technical competence are rare and extremely valuable. The best data architects take this duality seriously."

- Fundamentals of Data Engineering

I know that I am not there yet, and I have seen what I still lack to be able to convince and lead a team, but this is a valuable lesson for me. Wisdom comes from experience, I would like to say. Even though I lack experience, I hope this is something I will always remember and learn from.

Make Reversible Decisions

Sometimes you have to go back to the drawing board, even if it means discarding previous work, because forcing a flawed foundation will only make things worse. Jeff Bezos refers to reversible decisions as "two-way doors" and to irreversible ones as one-way doors. Of a one-way door, he says, "If you walk through and don't like what you see on the other side, you can't get back to before." But most decisions aren't like that. They are changeable and reversible; they are two-way doors. Aim for two-way doors whenever possible.

Responsibility and Accountability

Another thing that weighed on me was responsibility. As a data engineer, I knew that how we structure and serve data impacts everyone else. If the semantic model was flawed, it wouldn’t just affect me; it would affect the entire team’s ability to succeed. Also, time was running out and we didn't have a backup plan. Based on past conversations, bringing up "let's build the star schema for the semantic model" always led to a NO. That realization pushed me to act on my own. I didn’t know if rebuilding the model would work perfectly, but I knew I had to try and trust my judgment. When the draft dashboard built on the second semantic model was done, I presented it to the UX designers and data engineers and, I'm happy to say, they wanted to use it.

Most importantly, this decision to act would not have been possible if the whole team hadn't gone through so much trial and error. Every decision, pass or fail, has a purpose. Those decisions helped the team move forward. The past decisions that quietly bothered me helped me take accountability for this second semantic model, whether it works or breaks. I take full ownership of the semantic model, regardless of the results.

The second version of the semantic model is built as a star schema. This is the very first star schema I have built, so I know it's not perfect, but I learned so much from it.

⭐Conceptual Star Schema⭐

From a data engineering perspective, this model is built to track how Netflix shows perform each week across different countries. Each row represents one show in one country for one specific week, which defines the grain of the data. The fact table stores measurable values like views and rankings, while dimension tables provide context such as show details, time, country, and category. A bridge table is used to handle cases where a show has multiple genres, keeping the model clean and flexible. Disconnected tables are added to support comparisons and interactive features in the dashboard. This approach avoids common mistakes like mixing data types or ignoring structure, and follows best practices for building clear and user-friendly models in Power BI.
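To make the split between facts and dimensions concrete, here is a minimal pandas sketch of the idea. It is not the project's actual code: the flat table, its column names (`week`, `country_name`, `show_title`, `category`, `weekly_rank`), the sample rows, and the helper `build_dimension` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical flat extract: one row per show, country, and week.
flat = pd.DataFrame({
    "week": ["2024-01-07", "2024-01-07", "2024-01-14"],
    "country_name": ["Sweden", "Norway", "Sweden"],
    "show_title": ["Show A", "Show A", "Show B"],
    "category": ["TV", "TV", "Films"],
    "weekly_rank": [1, 3, 2],
})


def build_dimension(df, columns, key_name):
    """Return the distinct values of `columns` plus a surrogate integer key."""
    dim = df[columns].drop_duplicates().reset_index(drop=True)
    dim.insert(0, key_name, list(range(1, len(dim) + 1)))
    return dim


dim_show = build_dimension(flat, ["show_title"], "show_key")
dim_country = build_dimension(flat, ["country_name"], "country_key")
dim_category = build_dimension(flat, ["category"], "category_key")

# The fact table keeps only the keys and the measures; the descriptive
# text lives in the dimensions.
fact_weekly_performance = (
    flat.merge(dim_show, on="show_title", how="left")
        .merge(dim_country, on="country_name", how="left")
        .merge(dim_category, on="category", how="left")
        [["week", "show_key", "country_key", "category_key", "weekly_rank"]]
)

print(fact_weekly_performance)
```

Building the tables this way, outside the reporting tool, means the same keys and relationships can be inspected and tested before anything is loaded into Power BI.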

Data Flow Diagram

The data pipeline begins with Netflix data sourced from Tudum, where the main datasets used are global_weekly, country_weekly and global_alltime. The data is then explored using Pandas and Matplotlib to understand its structure, identify patterns, and handle any quality issues. This step is important because it ensures the data is reliable before building any models. Based on this, a star schema is designed in Power BI to organize the data into a structured and scalable format. Finally, the model is used to create a dashboard with KPIs and visuals that support clear and interactive data analysis.
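The exploration step can be partly automated with a few checks. The sketch below is only an illustration: the sample frame stands in for `country_weekly`, the column names are assumed, `quality_report` is a hypothetical helper, and the upper rank limit of 10 is an assumption based on the data being Top 10 lists.

```python
import pandas as pd

# Hypothetical sample standing in for country_weekly; real column names may differ.
country_weekly = pd.DataFrame({
    "week": ["2024-01-07", "2024-01-07", "not a date", "2024-01-14"],
    "country_name": ["Sweden", "Norway", "Sweden", None],
    "show_title": ["Show A", "Show A", "Show B", "Show B"],
    "weekly_rank": [1, 3, 2, 11],
})


def quality_report(df, max_rank=10):
    """Collect a few basic quality signals before any modelling starts."""
    # Values that do not match the expected date format become NaT.
    parsed_week = pd.to_datetime(df["week"], format="%Y-%m-%d", errors="coerce")
    return {
        "rows": len(df),
        "missing_values": df.isna().sum().to_dict(),
        "unparseable_weeks": int(parsed_week.isna().sum()),
        "ranks_out_of_range": int((~df["weekly_rank"].between(1, max_rank)).sum()),
    }


report = quality_report(country_weekly)
print(report)
```

Running the same report on each source file gives a quick, repeatable picture of what needs cleaning before the star schema is built.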

Grain & Metrics

⭐ Primary FACT TABLE | `fact_weekly_performance`

This table is the center of everything, as it answers the question: how well does a Netflix show perform over time and across countries?
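Because the grain is one show, in one country, for one week, it is worth checking that no two rows share that combination. A minimal sketch, assuming key columns named `show_key`, `country_key`, and `week`; the sample data and the helper `grain_violations` are hypothetical:

```python
import pandas as pd

# Assumed grain of fact_weekly_performance: one show, one country, one week.
GRAIN = ["show_key", "country_key", "week"]


def grain_violations(fact, grain):
    """Return every row that shares its grain key with at least one other row."""
    return fact[fact.duplicated(subset=grain, keep=False)]


# Hypothetical sample in which the last two rows break the grain.
fact = pd.DataFrame({
    "show_key": [1, 1, 2, 2],
    "country_key": [1, 2, 1, 1],
    "week": ["2024-01-07", "2024-01-07", "2024-01-07", "2024-01-07"],
    "weekly_rank": [1, 3, 2, 2],
})

violations = grain_violations(fact, GRAIN)
print(f"{len(violations)} rows violate the grain")
```

If this check returns any rows, measures such as views would be double counted in the dashboard, so it is a useful guard to run before loading the table.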

⭐ Secondary fact table | `fact_alltime`

This is the summary table of how a specific show performed overall.

Dimension Table

These tables explain the who, where, when, and what.

dim_show

Shows "what is the show?"

dim_country

Shows "where did it perform?"

dim_date

Shows "when did it happen?"

dim_category

Explains "what type of category?"

dim_genre

Shows the genres
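Of the dimensions above, `dim_date` is the one that is usually derived from the data rather than copied from it. The sketch below shows one possible way to build it from the distinct weeks in the fact data. The attribute columns (`date_key`, `year`, `month`, `iso_week`) and the sample dates are assumptions, not the project's actual design.

```python
import pandas as pd

# Hypothetical week values as they might appear in the weekly data.
weeks = pd.Series(pd.to_datetime(["2024-01-07", "2024-01-14", "2024-01-07"]))

# One row per distinct week, in chronological order.
dim_date = pd.DataFrame(
    {"week": weeks.drop_duplicates().sort_values().reset_index(drop=True)}
)

# A readable integer key in YYYYMMDD form, plus attributes for slicing.
dim_date["date_key"] = dim_date["week"].dt.strftime("%Y%m%d").astype(int)
dim_date["year"] = dim_date["week"].dt.year
dim_date["month"] = dim_date["week"].dt.month
dim_date["iso_week"] = dim_date["week"].dt.isocalendar().week.astype(int)

print(dim_date)
```

Keeping year, month, and week number as columns lets the dashboard filter by period without repeating date logic in every visual.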

Bridge Table | `bridge_show_genre`

This table connects dim_show ←→ dim_genre and resolves the many-to-many relationship between them. Instead of forcing one genre per show, we use a bridge table so that a show can belong to multiple genres.
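As a rough sketch of how such a bridge could be produced, assume the genres arrive as a comma-separated string per show. That `genres` column, the sample shows, and the key names are assumptions for illustration; the real source may store genres differently.

```python
import pandas as pd

# Hypothetical show table with a comma-separated genres column.
shows = pd.DataFrame({
    "show_key": [1, 2],
    "show_title": ["Show A", "Show B"],
    "genres": ["Drama, Thriller", "Comedy"],
})

# One row per (show, genre) pair.
exploded = (
    shows.assign(genre_name=shows["genres"].str.split(","))
         .explode("genre_name")
)
exploded["genre_name"] = exploded["genre_name"].str.strip()

# dim_genre: each genre once, with a surrogate key.
dim_genre = exploded[["genre_name"]].drop_duplicates().reset_index(drop=True)
dim_genre.insert(0, "genre_key", list(range(1, len(dim_genre) + 1)))

# bridge_show_genre: only the two keys, one row per pairing.
bridge_show_genre = (
    exploded.merge(dim_genre, on="genre_name", how="left")
            [["show_key", "genre_key"]]
            .reset_index(drop=True)
)

print(bridge_show_genre)
```

The bridge holds nothing but key pairs, so adding a genre to a show means adding a row rather than changing the structure of `dim_show`.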

So far...

The project is not finished as I write this, but I believe this reflection is just as important as the dashboard itself.

This is a draft / test of using the second version of the semantic model in Power BI. The design comes from the team's UX designers, who are also the product owners. Other features were carried over from the data engineering team's previous model.

Why Semantic Models Need Strong Data Engineering Principles