Skip to main content

Command Palette

Search for a command to run...

Marathos Atlas - Building a data platform in Databricks to learn Medallion Architecture

Updated
16 min read
Marathos Atlas - Building a data platform in Databricks to learn Medallion Architecture
A
For the love of data, someone has to clean up the mess Data Scientists and Machine Learning Engineers make. https://github.com/Akina-Aoki

For my Big Data and Cloud Platform Development Course, I used the Databricks platform to build Marathos Atlas. The whole learning goal was to build a lakehouse workflow that transforms raw global marathon result data inside the medallion architecture: Bronze for raw ingestion, Silver for cleaning and standardization, and Gold for dimensional modeling.

Then, applying the cleaned and standardized data into a dashboard with analytical tables, curated views, dashboard metrics, and be able to get insights from Databricks Genie LLM chatbot, all data from the gold layer and curated tables.

https://github.com/Akina-Aoki/marathos_databricks

Dashboard: Marathos Atlas

The business case is marathon stakeholders need a reliable way to understand participation and performance across events, countries, athlete groups, and race categories:

  • Which countries have the most race result records?

  • How has race activity changed over time?

  • How do athlete demographics differ by gender and age group?

  • How do kilometer, mile, and fixed-time race categories compare?

  • What performance metrics can be analyzed consistently?

  • Which ad hoc questions can stakeholders ask through Genie without waiting for a custom SQL query?

Features

  • The Race Year and Athlete Country filters control the broader dashboard.

    I improved the dashboard iteratively to avoid clutter and focus on participation, trends, demographics, race categories, and country-level comparisons.

  • The Leaderboard Metric filter control is separate and only changes the leaderboard chart. This separation keeps the dashboard easier to understand because global filters and chart-specific controls do not conflict.

  • KPI cards

  • Global leaderboard snapshot (sample: Sweden, 2022, ranking of countries with most runner entries)

  • Athlete's number of entries over time

  • Events held over time

  • Athlete Average speed over time

  • Average athlete age over time

  • Runner participation by gender

  • Number of athletes by age group

  • Average speed by age group

  • Number of competition events held per categories (with country and year filter)

  • Average Athlete Speed per competition category (with country and year filter)

Databricks Genie

A major part of the project was connecting Databricks Genie to curated Gold datasets so stakeholders could ask natural language questions instead of always requesting ad hoc SQL.

I connected Genie to Gold views and tables rather than raw Bronze or technical Silver data. Genie is only as useful as the data model behind it; if connected objects are inconsistent or too raw, answers can become misleading.

Sample questions to Genie

Summarize all the race reults for Sweden in 2022

Results:

Architecture and Flow

This is the high-level project flow.

Summary of Data Quality Checks and Transformations Across All Layers

Raw Data Storage Inside Unity Catalog Raw Volume (default layer)

The main source is a marathon and ultramarathon race result dataset where each row represents one athlete result in one race. Key fields include:

- event name
- event date
- event distance
- number of finishers
- athlete ID
- athlete country
- gender
- age category
- performance
- average speed

Bronze Streaming Tables

bronze ingestion script

The Bronze layer ingests raw CSV files from a Unity Catalog volume into Databricks tables. The raw volume base path is centralized in utils/table_names.py as:

RAW_VOLUME_BASE_DIR = f"/Volumes/{CATALOG}/{DEFAULT_SCHEMA}/raw"

The Bronze ingestion script creates two raw streaming tables:

  • raw_marathon_results

  • raw_country_codes

The implementation uses PySpark and Lakeflow from pyspark.pipelines. Both raw datasets are read using spark.readStream, which allows new files dropped into the monitored source folders to be processed incrementally.

One important streaming detail is that Spark streaming reads require a predefined schema. I inferred the schema once from a sample CSV file and reused it in the streaming read:

marathon_schema = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(MARATHON_SAMPLE_FILE)
    .schema
)

return (
    spark.readStream.format("csv")
    .options(header="true", inferSchema="true")
    .schema(marathon_schema)
    .load(MARATHON_DATA_PATH)

I intentionally kept Bronze close to the original source for auditability and traceability. If something looks wrong in Silver or Gold, I can compare it back to the raw Bronze record before business logic was applied.

Silver Cleaned One Big Table

silver ingestion script

The Silver layer is where most of the data engineering work happens. It creates a cleaned One Big Table (OBT) named marathon_results_obt. This table is still result-level, but the raw fields have been standardized, enriched, and filtered for analytical use.

The first step is standardizing column names to snake_case using DRY principle,utils/column_helpers.py, which avoids source columns with spaces or inconsistent casing.

The main Silver data quality rules are:

  • Event dates must be valid and parseable.

  • Distances must be recognized as kilometers, miles, or fixed hours.

  • Performance format must match the race type.

  • Finisher count must be greater than zero.

  • Athlete age must be realistic.

  • Speed must be within realistic boundaries.

  • Required demographic and country fields must be present.

Event Date Parsing

The raw event date field contains multiple formats, so I had to parse several cases instead of assuming one simple date format:

  • Single date, such as 07.12.1991

  • Same-month ranges, such as 23.-24.11.1991

  • Different-month ranges, such as 30.10.-03.11.1991

  • Different-year ranges, such as 31.12.1992-01.01.1993

The Silver logic classifies the date format, extracts the needed date parts with regular expressions, builds complete date strings where necessary, and converts them into event_start_date and event_end_date. Rows with unparseable dates are removed because valid dates are required for time analysis, the date dimension, dashboard filters, and seasonal trends.

    marathon_df = (
        marathon_df
        .withColumn("event_date_raw", col("event_dates"))  # 
        .withColumn(
            "event_date_format_type",
            when(col("event_dates").rlike(r"^\d{2}\.\d{2}\.\d{4}$"), "single_date")
            .when(col("event_dates").rlike(r"^\d{2}\.-\d{2}\.\d{2}\.\d{4}$"), "date_range_same_month")
            .when(col("event_dates").rlike(r"^\d{2}\.\d{2}\.-\d{2}\.\d{2}\.\d{4}$"), "date_range_different_month")
            .when(col("event_dates").rlike(r"^\d{2}\.\d{2}\.\d{4}-\d{2}\.\d{2}\.\d{4}$"), "date_range_different_year")
            .otherwise("unknown_format")
        )
    )

Distance and Race Type Standardization

The raw event distance field also contains different formats. The Silver transformation recognizes kilometer races, mile races, fixed-hour races, and shorthand k values. For example, 50k is standardized to 50km.

    marathon_df = marathon_df.withColumn(
        "event_distance_type",
        when(lower(trim(col("event_distance_length"))).rlike(r"^[0-9]+(\.[0-9]+)?km$"), "distance_km")
        .when(lower(trim(col("event_distance_length"))).rlike(r"^[0-9]+(\.[0-9]+)?mi$"), "distance_mi")
        .when(lower(trim(col("event_distance_length"))).rlike(r"^[0-9]+h$"), "time_hours")
        .when(lower(trim(col("event_distance_length"))).rlike(r"^[0-9]+(\.[0-9]+)?k$"), "distance_k_needs_standardization")
        .when(lower(trim(col("event_distance_length"))).rlike(r"^[0-9]+d$"), "days_invalid")
        .when(lower(trim(col("event_distance_length"))).contains("etappen"), "multi_stage_invalid")
        .otherwise("invalid_or_unknown")
    )
    # Standardize "k" suffix to "km" (e.g., "50k" → "50km")
    marathon_df = marathon_df.withColumn(
        "event_distance_length",
        when(
            col("event_distance_type") == "distance_k_needs_standardization",
            regexp_replace(lower(trim(col("event_distance_length"))), "k$", "km")
        ).otherwise(lower(trim(col("event_distance_length"))))
    )

It also filters out day-based or multistage events such as 7d or values containing etappen, because those formats do not fit the comparison model used in this project.

    # Filter out invalid event types (multi-day and multi-stage events)
    marathon_df = marathon_df.filter(
        col("event_distance_type").isin(
            "distance_km",
            "distance_mi",
            "time_hours",
            "distance_k_needs_standardization"
        )
    )

After classification, the pipeline extracts event_distance_value, event_distance_unit, and event_distance_type, making race category comparisons more controlled.

Athlete Performance Parsing

Performance data also depends on the race type. For distance races, performance is time-based. For fixed-hour races, performance is distance-based.

The Silver layer handles time formats such as: 2:15:30 h and 1d 05:30:00 h

It also handles fixed-time performance values such as: 150.5 km and 93.2mi.

    # First, classify each performance format
    marathon_df = marathon_df.withColumn(
        "athlete_performance_type",
        when(col("athlete_performance").isNull(), "null")
        .when(trim(col("athlete_performance")).rlike(r"^[0-9]+:[0-9]{2}:[0-9]{2} h$"), "time_hours")
        .when(trim(col("athlete_performance")).rlike(r"^[0-9]+d [0-9]{2}:[0-9]{2}:[0-9]{2} h$"), "time_days_hours")
        .when(trim(col("athlete_performance")).rlike(r"^[0-9]+(\.[0-9]+)? km$"), "distance_km")
        .when(trim(col("athlete_performance")).rlike(r"^[0-9]+(\.[0-9]+)? mi$"), "distance_mi")
        .otherwise(UNKNOWN_VALUE)
    )

The transformation converts performance into comparable fields: athlete_performance_seconds for distance races, athlete_performance_distance for fixed-hour races, and athlete_performance_unit to track meaning.

A key quality rule is that kilometer and mile events need time-based performance, while fixed-hour events need distance-based performance.

Recalculating Average Speed

One important Silver decision was not to blindly trust the raw average speed field. The project recalculates athlete_average_speed_kmh from cleaned event distance and performance values.

For distance races km/h, the logic is: speed_kmh = distance_km / performance_hours

            # Distance events in km:
            # speed = distance in km / performance hours
            (col("event_distance_unit") == "km")
            & col("athlete_performance_seconds").isNotNull(),
            col("event_distance_value") / (col("athlete_performance_seconds") / SECONDS_PER_HOUR)
        ).when(

For mile races, miles are converted using KM_PER_MILE = 1.60934.

            # Distance events in miles:
            # convert miles to km (1 mi = 1.60934 km), then divide by performance hours
            (col("event_distance_unit") == "mi")
            & col("athlete_performance_seconds").isNotNull(),
            (col("event_distance_value") * KM_PER_MILE) / (col("athlete_performance_seconds") / SECONDS_PER_HOUR)

For fixed-hour races, the completed distance is divided by the event duration in hours. This produces a consistent km/h metric across race categories.

The Silver layer then filters unrealistic speeds. In this project, speeds must be greater than 0 and less than or equal to 50 km/h. That boundary keeps obvious errors out of downstream metrics.

    marathon_df = marathon_df.filter(
        col("athlete_average_speed_kmh").isNotNull()
        & (col("athlete_average_speed_kmh") > MIN_VALID_SPEED_KMH)
        & (col("athlete_average_speed_kmh") <= MAX_VALID_SPEED_KMH)
    )

Demographics, Country Enrichment and IDs

The Silver transformation also cleans athlete club, country, birth year, age category, gender, and athlete ID. Athlete age at event is calculated from year_of_event - athlete_year_of_birth, and unrealistic ages outside the configured range are removed.

    # Filter out records with:
    # - Missing speed (calculation failed)
    # - Zero or negative speed (impossible)
    # - Speed > 50 km/h (world record marathon pace is ~20 km/h, so 50 km/h is clearly an error)
    
    marathon_df = marathon_df.filter(
        col("athlete_average_speed_kmh").isNotNull()
        & (col("athlete_average_speed_kmh") > MIN_VALID_SPEED_KMH)
        & (col("athlete_average_speed_kmh") <= MAX_VALID_SPEED_KMH)
    )

Country enrichment is done by joining the cleaned marathon records to raw_country_codes. This adds country_code_iso3 and country_name, which are later used in dim_country, dashboard filters, and country-level views.

    marathon_df = (
        marathon_df
        .join(
            country_df,
            marathon_df["athlete_country"] == country_df["athlete_country_code"],
            "inner"
        )
        .drop("athlete_country_code")  # Remove duplicate column after join
    )

Finally, deterministic IDs are generated using SHA-256 hashing. The project creates stable event_id and result_id values for downstream dimensional modeling. This is especially useful in a streaming pipeline because ranking-style window functions are not a good fit for generating stable IDs in streaming transformations.

 # --------------------------------------------------------
    # Create deterministic IDs for dimensional modelling
    # --------------------------------------------------------
    # Generate stable, deterministic IDs using SHA-256 hashing.
    # This ensures the same event or result always gets the same ID,
    # which is critical for incremental updates and dimensional modeling.
    #
    # Note: We use SHA-256 hashing instead of dense_rank() because this is
    # a streaming table, and window functions with dense_rank() are not
    # supported in streaming queries.

    marathon_df = (
        marathon_df
        .withColumn(
            "event_id",
            sha2(
                concat_ws(
                    HASH_DELIMITER,  # Delimiter to prevent hash collisions
                    col("event_name"),
                    col("event_distance_length")
                ),
                HASH_BITS  # SHA-256 produces a 64-character hex string
            )
        )
        .withColumn(
            "result_id",
            sha2(
                concat_ws(
                    HASH_DELIMITER,
                    col("event_name"),
                    col("event_date_raw"),
                    col("event_distance_length"),
                    col("event_number_of_finishers").cast("string"),
                    col("athlete_id").cast("string"),
                    col("athlete_country"),
                    col("country_code_iso3"),
                    col("country_name"),
                    col("athlete_year_of_birth").cast("string"),
                    col("athlete_gender"),
                    col("athlete_age_category"),
                    col("athlete_club"),
                    col("athlete_performance"),
                    col("athlete_average_speed")
                ),
                HASH_BITS
            )
        )
    )

Dimensional Modelling

Conceptual Model

  • One Event can have many Results. Each Result belongs to one Event.

  • One Athlete can have many Results. Each Result belongs to one Athlete.

  • One Country can have many Results. Each Result is linked to one Country.

  • One Date can have many Results. Each Result is linked to one Date.

Logical Model

  • Result is the fact-like central entity because it contains the measurable race outcome.

  • athlete_age_at_event stays in Result because age depends on the event year.

  • athlete_year_of_birth stays in Athlete because birth year is stable athlete information.

  • date_id is created from event_start_date in the Gold layer.

    # Transform the data into a date dimension:
    # 1. Create a surrogate key (date_id) in integer format
    # 2. Extract useful calendar components from the date
    # 3. Remove duplicate dates to ensure one row per unique date
    dim_date_df = (
        silver_df
        .select(
            # Create date surrogate key - convert date to yyyyMMdd integer format
            # Example: 2023-03-15 becomes integer 20230315
            # Integer keys are more efficient for joins than date strings
            date_format(col("event_start_date"), "yyyyMMdd").cast("int").alias("date_id"),
            
            # Keep the original date value for reference
            col("event_start_date"),
            
            # Extract calendar components for time-based filtering and grouping
            year(col("event_start_date")).alias("year"),          # Extract year (2023)
            month(col("event_start_date")).alias("month"),        # Extract month number (1-12)
            date_format(col("event_start_date"), "MMMM").alias("month_name"),  # Get month name ("March")
            dayofmonth(col("event_start_date")).alias("day")              # Day of month, 1-31
        )
        # Deduplicate on date_id to ensure each date appears only once
        # Multiple races can happen on the same date, but we want one date record
        .dropDuplicates(["date_id"])
    )
  • Then in Gold, create dim_date like this:
    event_start_date = 1991-12-07

  • date_id = 19911207 year = 1991

  • month = 12 month_name = December

  • Country is connected to Result because the country value comes from the recorded athlete result and supports country-based analysis.

Physical Model

Grain One row in fact_results = one athlete result in one event

Gold Dimensional Model and Analytical View Tables

Gold scripts

The Gold layer turns the Silver OBT into a star schema, separating measurable facts from descriptive dimensions.

The Gold tables in the star schema are:

Table Role
fact_results One row per race result, with foreign keys, finisher counts, performance values, speeds, and athlete age at event.
dim_event Event name, distance length, distance value, unit, and race type.
dim_athlete Athlete ID, gender, birth year, age category, and club.
dim_country Athlete country code, ISO3 country code, and country name.
dim_date Date key, event start date, year, month, month name, and day.

Dimensional modeling makes analytical queries easier and gives the dashboard and Genie a cleaner semantic foundation.

Facts answer “what happened?” and dimensions describe “who, where, when, and what type of event?”

Gold Views and Lightweight Data Marts

Curated Tables

On top of the Gold star schema, I created curated analytical views that act like lightweight data marts for specific business questions.

The main gold view tables are:

View Supports
vw_global_leaderboard Country leaderboard by race result records, unique athletes, events participated, average speed, and average athlete age.
vw_seasonal_race_trends Monthly race trends, athlete entries, unique athletes, events held, speed, and age.
vw_country_seasonal_trends Country-level monthly trends for filtering and comparison.
vw_runner_demographics Gender and age group participation and performance.
vw_distance_km_results Kilometer race analysis.
vw_distance_mi_results Mile race analysis.
vw_time_hour_results Fixed-time race analysis.

This is the layer I want dashboards and Genie to use. Bronze is too raw, and Silver is still more technical. Gold views are more reliable, more business-friendly, and easier for stakeholders to query.

Streaming and Scheduled Pipeline

For streaming, I used LLM-generated fake marathon records for Stockholm, Paris, and Tokyo in 2024. I uploaded the CSV into the Unity Catalog raw volume folder monitored by Bronze and ran the Lakeflow pipeline.

I validated that the fake records appeared in Bronze, Silver, and Gold with SQL checks. After refreshing the dashboard, 2024 also appeared in the Race Year filter.

I scheduled the Lakeflow pipeline to run daily at 08:05 UTC+2. For this test, a normal update was better than a full refresh because it demonstrated the incremental behavior: the pipeline detected the new file in the monitored source folder instead of rebuilding everything from scratch.

One practical lesson here is that streaming ingestion is not just about writing readStream code. The new file has to be placed in the exact source folder that the Bronze table monitors.

Maintainability and DRY (Don't Repeat Yourself) Principles

The utils folder is small, but it is important for maintainability.

  • table_names.py centralizes catalog, schema, table, view, and volume paths.

    CATALOG = "marathos_catalog"
    
    BRONZE_SCHEMA = "bronze"
    SILVER_SCHEMA = "silver"
    GOLD_SCHEMA = "gold"
    DEFAULT_SCHEMA = "default"
    
    RAW_VOLUME_BASE_DIR = f"/Volumes/{CATALOG}/{DEFAULT_SCHEMA}/raw"
    
    RAW_MARATHON_RESULTS = f"{CATALOG}.{BRONZE_SCHEMA}.raw_marathon_results"
    RAW_COUNTRY_CODES = f"{CATALOG}.{BRONZE_SCHEMA}.raw_country_codes"
    
    MARATHON_RESULTS_OBT = f"{CATALOG}.{SILVER_SCHEMA}.marathon_results_obt"
    
    DIM_EVENT = f"{CATALOG}.{GOLD_SCHEMA}.dim_event"
    DIM_ATHLETE = f"{CATALOG}.{GOLD_SCHEMA}.dim_athlete"
    DIM_COUNTRY = f"{CATALOG}.{GOLD_SCHEMA}.dim_country"
    DIM_DATE = f"{CATALOG}.{GOLD_SCHEMA}.dim_date"
    FACT_RESULTS = f"{CATALOG}.{GOLD_SCHEMA}.fact_results"
    
    VW_DISTANCE_KM_RESULTS = f"{CATALOG}.{GOLD_SCHEMA}.vw_distance_km_results"
    VW_DISTANCE_MI_RESULTS = f"{CATALOG}.{GOLD_SCHEMA}.vw_distance_mi_results"
    VW_TIME_HOUR_RESULTS = f"{CATALOG}.{GOLD_SCHEMA}.vw_time_hour_results"
    VW_GLOBAL_LEADERBOARD = f"{CATALOG}.{GOLD_SCHEMA}.vw_global_leaderboard"
    VW_SEASONAL_RACE_TRENDS = f"{CATALOG}.{GOLD_SCHEMA}.vw_seasonal_race_trends"
    VW_RUNNER_DEMOGRAPHICS = f"{CATALOG}.{GOLD_SCHEMA}.vw_runner_demographics"
    VW_COUNTRY_SEASONAL_TRENDS = f"{CATALOG}.{GOLD_SCHEMA}.vw_country_seasonal_trends"
    
  • pipeline_config.py stores reusable Delta table properties.

    DELTA_TABLE_PROPERTIES = {
        "delta.columnMapping.mode": "name",
        "delta.minReaderVersion": "2",
        "delta.minWriterVersion": "5",
    }
    
  • column_helpers.py provides reusable snake_case column naming helpers.

    import re
    
    
    def to_snake_case(column_name):
        clean_name = column_name.strip().lower()
        clean_name = re.sub(r"[^a-z0-9]+", "_", clean_name)
        return clean_name.strip("_")
    
    
    def rename_columns_to_snake_case(df):
        new_columns = [to_snake_case(column) for column in df.columns]
        return df.toDF(*new_columns)
    
  • silver_constants.py stores shared constants such as date format, speed limits, age limits, mile conversion, and hashing settings.

    """Shared constants for Silver layer transformations."""
    
    DATE_FORMAT = "dd.MM.yyyy"
    
    UNKNOWN_VALUE = "unknown"
    
    KM_PER_MILE = 1.60934
    
    SECONDS_PER_MINUTE = 60
    SECONDS_PER_HOUR = 3600
    SECONDS_PER_DAY = 86400
    
    MIN_ATHLETE_AGE = 5
    MAX_ATHLETE_AGE = 100
    INVALID_BIRTH_YEAR = 1193
    
    MIN_VALID_SPEED_KMH = 0
    MAX_VALID_SPEED_KMH = 50
    SPEED_ROUND_SCALE = 3
    
    HASH_DELIMITER = "||"
    HASH_BITS = 256
    

    This avoids hardcoding paths and settings across every layer, reduces duplicated logic, and makes future changes safer.

Validations

Validations Notebook Queries

Because the dashboard and Genie are stakeholder-facing outputs, validation is a core part of the project. I used exploration notebooks to check row counts, duplicate keys, null values, schemas, date parsing, invalid event types, performance/race type alignment, speed boundaries, country mapping, and Gold consistency.

Some examples of validation includes:

  • Checking that key IDs were not null.

  • Confirming there were no duplicate result_id values.

  • Validating that event distance units and performance types matched.

  • Checking that recalculated speeds were not null, non-positive, or above the realistic boundary.

  • Confirming that country enrichment did not leave missing country names or ISO codes.

  • Verifying that fake streaming records moved from Bronze to Silver to Gold.

  • Manually checking Genie answers with SQL.

    This validation connects engineering decisions to trust. A dashboard can look polished, but if the underlying tables are not reliable, the insights are not reliable either.

Conclusion

Marathos Atlas is a full Databricks data engineering workflow: raw file ingestion, Bronze streaming tables, Silver cleaning and enrichment, Gold dimensional modeling, curated analytical views, dashboarding, Genie self-service analytics, scheduled pipeline execution, and streaming validation.

  • Medallion architecture makes the pipeline easier to reason about.

  • Bronze should preserve raw data for auditability and traceability.

  • Silver is where most data quality work happens.

  • Gold should be designed around business questions, not just technical convenience.

  • Dashboards become much easier when Gold views are clean and well modelled properly.

  • Genie is powerful, but only when connected to curated and validated datasets.

  • Streaming ingestion depends on both code and correct file placement.

  • Good naming and reusable utilities reduce errors as the project grows.

For me, the project connected data engineering concepts with a realistic analytics use case: transforming raw marathon and ultramarathon result data into a trusted lakehouse model for countries, trends, demographics, race categories, and performance.