Part 3: The 'L' - Loading, Leveraging & Reporting Insights

Scene: With the data extracted and transformed, the final step is to make it accessible and use it to answer María’s policy questions

This is where the fruits of her ETL labor are realized, and she produces a comprehensive Excel report as part of her exploratory data analysis. The “L” in ETL isn’t just about loading data into storage - it’s about making insights accessible and actionable for policy decisions.

Source of generated image on this page: Google Gemini

What We Will Learn

Transform analysis results into consumable outputs and reports

ℹ️ If you have fallen behind, you can git checkout part_2 before starting this section.

For the advanced section, you can the re-run dg dev to restart your Dagster webserver.

Data Architecture Evolution

We will load and leverage our transformed data to create policy insights, completing the “L” in ETL - Load.


%%{
  init: {
    'themeVariables': {
      'lineColor': '#d3d3d3'
    }
  }
}%%

flowchart TD
    P[Population]
    C[Energy Consumption]
    R[Renewable Energy Coverage]
    M[New Regions Mapping]

    F[Energy Consumption Breakdown Fossil vs Renewable]
    C --> F
    R --> F

    A[Energy Consumption Breakdown Fossil vs Renewable With Population]

    F --> A
    P --> A

    MC[New Regions Energy Consumption Breakdown Fossil vs Renewable With Population]
    A --> MC
    M --> MC

    PC[Per capita consumption]

    MC --> PC
    A --> PC

    RE[Iberia vs North America Report]

    PC --> RE

    classDef greyedOut stroke:#d3d3d3,stroke-width:2px,fill:#f9f9f9,color:#d3d3d3
    class P,C,R,M,F,A,MC,PC greyedOut

    classDef highlight stroke:#4CAF50,stroke-width:3px
    class RE highlight

Basic Approach: Manual Output & Ad-hoc Analysis

Loading the data

A basic way to simply export the data is to create an output file.

energy_breakdown_per_capita.to_csv('per_capita_metrics.csv')

From Dagster we can also be done through a materializable asset.

@dg.asset()
def export_energy_breakdown(energy_breakdown_per_capita):
    """Export data as pure CSV"""
    energy_breakdown_per_capita.to_csv('per_capita_metrics.csv')

Create Excel Report

Another common case comprehensive energy report consists of seven tabs:

Tab 1: Top 10 regions by year with highest fossil energy consumption
Tab 2: Top 10 regions by year with highest renewable energy consumption
Tab 3: North America vs Iberia renewable energy consumption per capita comparison
Tab 4: North America vs Iberia total energy consumption comparison
Tab 5: North America vs Iberia energy mix breakdown comparison
Tab 6: North America vs Iberia energy efficiency trends comparison
Tab 7: Raw data and source information

from openpyxl.chart import LineChart, Reference

filename = "energy_analysis_report.xlsx"
with pd.ExcelWriter(filename, engine="openpyxl") as writer:
    latest_year_with_population = per_capita_metrics.loc[
        per_capita_metrics["population"].notna(), "year"
    ].max()

    # Remove invalid data
    valid_energy_data = per_capita_metrics.loc[
        lambda x: (
            (x["renewable_energy_per_capita_gwh"].notna())
            & (x["renewable_energy_per_capita_gwh"] != float("inf"))
            & (x["fossil_energy_per_capita_gwh"].notna())
            & (x["fossil_energy_per_capita_gwh"] != float("inf"))
            & (x["energy_consumption_per_capita_gwh"].notna())
            & (x["energy_consumption_per_capita_gwh"] != float("inf"))
        )
    ].reset_index()

    # Tab 1: Top 10 Renewable Energy Consumers by Max Year
    (
        valid_energy_data.loc[
            valid_energy_data.groupby("entity")[
                "renewable_energy_per_capita_gwh"
            ].idxmax()
        ]
        .nlargest(10, "renewable_energy_per_capita_gwh")[
            ["entity", "year", "renewable_energy_per_capita_gwh"]
        ]
        .to_excel(writer, sheet_name="Top_10_Renewable_Max", index=False)
    )

    # Tab 2: Top 10 Fossil Energy Consumers by Max Year
    (
        valid_energy_data.loc[
            valid_energy_data.groupby("entity")["fossil_energy_per_capita_gwh"].idxmax()
        ]
        .nlargest(10, "fossil_energy_per_capita_gwh")[
            ["entity", "year", "fossil_energy_per_capita_gwh"]
        ]
        .to_excel(writer, sheet_name="Top_10_Fossil_Max", index=False)
    )
    # Tab 3: NAM vs IBERIA Comparison (past 10 years, pivoted by year)
    past_10_years = latest_year_with_population - 9
    nam_iberia_data = valid_energy_data.loc[
        lambda x: (x["year"] >= past_10_years)
        & (x["entity"].isin(["North America", "Iberia"]))
    ][
        [
            "entity",
            "year",
            "renewable_energy_per_capita_gwh",
            "fossil_energy_pct",
            "energy_consumption_per_capita_gwh",
        ]
    ]

    metric_labels = {
        "renewable_energy_per_capita_gwh": "Renewable_Energy_Per_Capita",
        "fossil_energy_pct": "Fossil_Energy_Percent",
        "energy_consumption_per_capita_gwh": "Energy_per_Capita",
    }

    # Pivot data with years as rows and regions as columns
    for metric in [
        "renewable_energy_per_capita_gwh",
        "fossil_energy_pct",
        "energy_consumption_per_capita_gwh",
    ]:
        pivoted = nam_iberia_data.pivot(
            index="year", columns="entity", values=metric
        )
        pivoted.to_excel(writer, sheet_name=metric_labels[metric])

    # Add chart to Renewable Energy sheet as example
    ws_renewable = writer.sheets["Renewable_Energy_Per_Capita"]
    chart = LineChart()
    chart.title = "NAM vs Iberia: Renewable Energy Per Capita Over Time"

    # Get the data range (assuming years are in column A, regions in columns B and C)
    max_row = (
        len(
            nam_iberia_data.pivot(
                index="year",
                columns="entity",
                values="renewable_energy_per_capita_gwh",
            )
        )
        + 1
    )

    # Add data for North America (column B)
    data_nam = Reference(ws_renewable, min_col=2, min_row=1, max_row=max_row)
    chart.add_data(data_nam, titles_from_data=True)

    # Add data for Iberia (column C)
    data_iberia = Reference(ws_renewable, min_col=3, min_row=1, max_row=max_row)
    chart.add_data(data_iberia, titles_from_data=True)

    # Set categories (years from column A)
    categories = Reference(ws_renewable, min_col=1, min_row=2, max_row=max_row)
    chart.set_categories(categories)

    # Enable legend below the chart
    chart.legend.position = "b"  # Position legend at bottom
    chart.legend.overlay = False  # Don't overlay on chart area

    # Add chart to worksheet
    ws_renewable.add_chart(chart, "E2")


    # Tab 4: Full raw data
    per_capita_metrics.to_excel(writer, sheet_name="Raw_Data", index=False)

print(f"Excel report exported to: {filename}")

Problem Highlight

While creating an Excel report works, how do we ensure this report stays updated as data changes? How do we add charts programmatically? And how can we make this analysis repeatable and trackable within our pipeline?

Advanced Approach: Dagster Assets for Comprehensive Reporting

Comprehensive Excel Report Generation

Transform the Excel creation into a structured Dagster asset. Now that we are familiar with Dagster, we should be capable of doing it ourselves.

💡 Click to reveal solution

@dg.asset()
def export_analysis_report(energy_breakdown_per_capita):
    """Export analysis report to Excel with multiple tabs and charts"""

    filename = "energy_analysis_report.xlsx"
    with pd.ExcelWriter(filename, engine="openpyxl") as writer:
        latest_year_with_population = energy_breakdown_per_capita.loc[
            energy_breakdown_per_capita["population"].notna(), "year"
        ].max()

        # Remove invalid data
        valid_energy_data = energy_breakdown_per_capita.dropna(
            subset=["renewable_energy_per_capita", "fossil_energy_per_capita", "energy_consumption_per_capita"]
        ).replace([float("inf"), -float("inf")], pd.NA).dropna().reset_index()

        # Tab 1: Top 10 Renewable Energy Consumers by Max Year
        (
            valid_energy_data.loc[
                valid_energy_data.groupby("entity")[
                    "renewable_energy_per_capita"
                ].idxmax()
            ]
            .nlargest(10, "renewable_energy_per_capita")[
                ["entity", "year", "renewable_energy_per_capita"]
            ]
            .to_excel(writer, sheet_name="Top_10_Renewable_Max", index=False)
        )

        # Tab 2: Top 10 Fossil Energy Consumers by Max Year
        (
            valid_energy_data.loc[
                valid_energy_data.groupby("entity")["fossil_energy_per_capita"].idxmax()
            ]
            .nlargest(10, "fossil_energy_per_capita")[
                ["entity", "year", "fossil_energy_per_capita"]
            ]
            .to_excel(writer, sheet_name="Top_10_Fossil_Max", index=False)
        )
        # Tab 3: NAM vs IBERIA Comparison (past 10 years, pivoted by year)
        past_10_years = latest_year_with_population - 9
        nam_iberia_data = valid_energy_data.loc[
            lambda x: (x["year"] >= past_10_years)
            & (x["entity"].isin(["North America", "Iberia"]))
        ][
            [
                "entity",
                "year",
                "renewable_energy_per_capita",
                "fossil_energy_pct",
                "energy_consumption_per_capita",
            ]
        ]

        metric_labels = {
            "renewable_energy_per_capita": "Renewable_Energy_Per_Capita",
            "fossil_energy_pct": "Fossil_Energy_Percent",
            "energy_consumption_per_capita": "Energy_per_Capita",
        }

        # Pivot data with years as rows and regions as columns
        for metric in [
            "renewable_energy_per_capita",
            "fossil_energy_pct",
            "energy_consumption_per_capita",
        ]:
            pivoted = nam_iberia_data.pivot(
                index="year", columns="entity", values=metric
            )
            pivoted.to_excel(writer, sheet_name=metric_labels[metric])

        # Add chart to Renewable Energy sheet as example
        ws_renewable = writer.sheets["Renewable_Energy_Per_Capita"]
        chart = LineChart()
        chart.title = "NAM vs Iberia: Renewable Energy Per Capita Over Time"

        # Get the data range (assuming years are in column A, regions in columns B and C)
        max_row = (
            len(
                nam_iberia_data.pivot(
                    index="year",
                    columns="entity",
                    values="renewable_energy_per_capita",
                )
            )
            + 1
        )

        # Add data for North America (column B)
        data_nam = Reference(ws_renewable, min_col=2, min_row=1, max_row=max_row)
        chart.add_data(data_nam, titles_from_data=True)

        # Add data for Iberia (column C)
        data_iberia = Reference(ws_renewable, min_col=3, min_row=1, max_row=max_row)
        chart.add_data(data_iberia, titles_from_data=True)

        # Set categories (years from column A)
        categories = Reference(ws_renewable, min_col=1, min_row=2, max_row=max_row)
        chart.set_categories(categories)

        # Enable legend below the chart
        chart.legend.position = "b"  # Position legend at bottom
        chart.legend.overlay = False  # Don't overlay on chart area

        # Add chart to worksheet
        ws_renewable.add_chart(chart, "E2")

        print(f"Excel report exported to: {filename}")

        # Tab 4: Full raw data
        energy_breakdown_per_capita.to_excel(writer, sheet_name="Raw_Data", index=False)

Demonstrate Asset-Based Reporting Benefits

The Dagster UI now shows María’s complete pipeline from raw data through to comprehensive reporting:

Trackable Reports: See when each report was last generated
Automatic Updates: Reports can be regenerated when upstream data changes
Version Control: Historical versions of reports and analysis
Clear Dependencies: Understand what data feeds each report

Moment of Discovery

We’ve completed the “L” in ETL - Load! Our pipeline now automatically transforms raw country-level data into comprehensive regional analysis reports. However, María realizes that while her reports look polished, she hasn’t implemented proper quality controls throughout her pipeline. The journey isn’t complete yet - she needs to ensure data quality at every step.

María now has the reporting infrastructure in place, but she recognizes that reliable policy decisions require bulletproof data quality assurance throughout the entire pipeline.

Next: Part 4 - Quality Control throughout the pipeline to make our analysis bulletproof!

Part 3: The 'L' - Loading, Leveraging & Reporting Insights

Part 3: The 'L' - Loading, Leveraging & Reporting Insights

Scene: With the data extracted and transformed, the final step is to make it accessible and use it to answer María’s policy questions

What We Will Learn

Data Architecture Evolution

Basic Approach: Manual Output & Ad-hoc Analysis

Loading the data

Create Excel Report

Problem Highlight

Advanced Approach: Dagster Assets for Comprehensive Reporting

Comprehensive Excel Report Generation

Demonstrate Asset-Based Reporting Benefits

Moment of Discovery

results matching ""

No results matching ""