Projects

Real systems.
Real data. Real impact.

A collection of projects across analytics engineering, data platforms, AI experimentation, quantitative systems, and production-style automation — designed to show end-to-end technical thinking, not just isolated code.

Flagship Project

Formula One
Analytics Platform

History Covered
75 Years of F1 Data
Architecture
5-Layer Pipeline
Core Stack
Python · Spark · Snowflake

A full analytics platform designed around the complete data lifecycle: raw ingestion, distributed transformation, warehouse-ready modeling, and insight delivery. Rather than focusing only on analysis, this project was built to demonstrate platform thinking — how historical sports data can be transformed into a scalable analytics system with clear separation between ingestion, processing, storage, and BI consumption.

What it does

Processes decades of Formula One race, driver, constructor, lap, and results data into analysis-ready datasets for historical comparison, performance trend evaluation, and dashboard-based storytelling.

End-to-end architecture

Source datasets → Python ingestion layer → Spark/Databricks transformation layer → cleaned analytical tables → Snowflake-style warehouse modeling → BI dashboards / reporting layer
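The layered flow above can be sketched in miniature. Below is a hypothetical cleaning and modeling step in plain Python — the real pipeline runs these transformations as Spark jobs, and the column names here are illustrative, not the project's actual schema:

```python
# Toy ingestion → cleaning → modeling flow; in the project these steps run
# in Spark, and the field names are illustrative stand-ins.
RAW_RESULTS = [
    {"race": " MONACO_1984 ", "driver": "Prost", "position": "1", "points": "4.5"},
    {"race": "monaco_1984", "driver": "Senna", "position": "2", "points": "6"},
]

def clean_result(row):
    """Schema-aware cleaning: normalize text fields, cast numeric fields."""
    return {
        "race": row["race"].strip().lower(),
        "driver": row["driver"].strip(),
        "position": int(row["position"]),
        "points": float(row["points"]),
    }

def total_points_by_driver(rows):
    """Warehouse-style aggregate: total points per driver."""
    totals = {}
    for row in rows:
        totals[row["driver"]] = totals.get(row["driver"], 0.0) + row["points"]
    return totals

cleaned = [clean_result(r) for r in RAW_RESULTS]
```

The same shape — normalize at ingestion, cast in transformation, aggregate in the modeling layer — is what keeps each stage of the pipeline independently testable.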

Engineering focus

Built around modular ETL logic, schema-aware transformations, historical normalization, distributed processing, and analytics-friendly data modeling. The project reflects how a real analytics engineering workflow moves from messy source data toward reliable reporting outputs.

Why it matters

Shows capability across data engineering, transformation design, pipeline structure, warehouse thinking, and stakeholder-ready analytics — not just notebook-level exploration.

Python Apache Spark Databricks Snowflake R Looker Studio ETL Data Modeling
View on GitHub
🏎
Analytics Engineering

GA4 Analytics
Dashboard Pipeline

An end-to-end analytics pipeline built to transform raw GA4 event data into structured, decision-ready dashboards. The project focuses on event collection, metric logic, transformation layers, and reporting outputs that make product and marketing performance easier to interpret.

What it does

Pulls analytics data, organizes key events and user behavior patterns, applies transformation logic, and exposes performance metrics through dashboard layers for easier monitoring and reporting.

End-to-end architecture

GA4 event source → Python extraction / API handling → cleaning + metric transformation → SQL-ready tables / reporting structures → Looker Studio dashboards
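The metric-logic stage in that chain can be illustrated with a small sketch. The event rows and metric below are hypothetical examples; the real pipeline extracts events via the GA4 Data API before applying transformation logic of this kind:

```python
# Hypothetical GA4-style event rows; in the pipeline these arrive from the
# GA4 Data API extraction layer.
EVENTS = [
    {"user": "u1", "event_name": "page_view"},
    {"user": "u1", "event_name": "purchase"},
    {"user": "u2", "event_name": "page_view"},
    {"user": "u3", "event_name": "page_view"},
    {"user": "u3", "event_name": "purchase"},
]

def conversion_rate(events, goal="purchase"):
    """Metric logic: share of distinct users who triggered the goal event."""
    users = {e["user"] for e in events}
    converted = {e["user"] for e in events if e["event_name"] == goal}
    return len(converted) / len(users) if users else 0.0
```

Defining metrics as explicit, testable functions like this — rather than ad hoc dashboard formulas — is what gives the reporting layer its consistency.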

Engineering focus

Emphasizes analytics engineering fundamentals: metric definition, event standardization, reporting consistency, transformation logic, and dashboard usability for non-technical stakeholders.

Why it matters

Demonstrates how raw behavioral data becomes business-facing insights through a repeatable pipeline rather than ad hoc reporting.

GA4 Python Looker Studio SQL Analytics Engineering
View on GitHub
📈
Applied AI

AI Studio —
RL Environment

A reinforcement learning experimentation environment designed for training agents with custom reward structures, configurable environment rules, and iterative training workflows. Built to explore decision-making systems rather than static supervised learning alone.

What it does

Defines an environment, state-action behavior, reward mechanics, and model training loops that allow an agent to learn through interaction and repeated policy improvement.

End-to-end architecture

Environment design → state representation → reward logic → agent training loop → evaluation runs → performance observation and iteration
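That act → observe reward → update loop can be shown at its smallest scale. The sketch below is a deliberately tiny tabular example — a two-action bandit with a deterministic warm-up schedule — standing in for the project's neural agents; only the control flow matches:

```python
# Minimal sketch of the reward → training-loop pattern. The actual project
# trains learned agents; this toy bandit keeps only the loop structure.
REWARDS = {0: 0.0, 1: 1.0}  # action 1 is always better in this toy setup

def train(episodes=100, warmup=20, alpha=0.1):
    q = {0: 0.0, 1: 0.0}  # estimated value per action
    for ep in range(episodes):
        # round-robin exploration during warm-up, then greedy exploitation
        action = ep % 2 if ep < warmup else max(q, key=q.get)
        reward = REWARDS[action]
        q[action] += alpha * (reward - q[action])  # incremental value update
    return q
```

Swapping the warm-up schedule, the learning rate, or the reward table is exactly the kind of configuration comparison the environment is built to support.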

Engineering focus

Focuses on experimentation design, reward shaping, training stability, iterative testing, and comparing behavior across different learning configurations. It reflects an AI-builder mindset rather than blind reliance on prepackaged models.

Why it matters

Shows practical exposure to ML system design, learning dynamics, and the tradeoffs involved when building AI environments from scratch.

PyTorch Reinforcement Learning Python TensorFlow Experimentation
View on GitHub
🧠
Quantitative Systems

High-Frequency
Trading Simulation

A Python-based simulation system for testing trading behavior in synthetic market conditions. The project explores latency-sensitive logic, order behavior, market reactions, and how trading strategies perform under fast-changing inputs.

What it does

Simulates market movement, evaluates rule-based or modeled trading decisions, and measures behavior under different timing and execution assumptions.

End-to-end architecture

Synthetic market generator → pricing / signal logic → trade execution simulation → latency-aware evaluation → strategy performance analysis
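The generator → signal → latency-aware execution chain can be sketched as follows. The random-walk feed, momentum rule, and one-tick fill model here are illustrative stand-ins, not the project's actual strategy or execution logic:

```python
import random

# Sketch of the generator → signal → latency-aware execution loop; the
# strategy, parameters, and fill model are illustrative assumptions.

def synthetic_prices(n=200, seed=7, start=100.0, vol=0.5):
    """Random-walk price series standing in for a market feed."""
    rng = random.Random(seed)
    prices, p = [], start
    for _ in range(n):
        p += rng.gauss(0.0, vol)
        prices.append(p)
    return prices

def momentum_signal(prices, i):
    """+1 (buy) after an up tick, -1 (sell) after a down tick."""
    return 1 if prices[i] > prices[i - 1] else -1

def simulate(prices, latency=1):
    """Fill each signal `latency` ticks late; close the trade one tick after."""
    pnl = 0.0
    for i in range(1, len(prices) - latency - 1):
        side = momentum_signal(prices, i)
        fill = prices[i + latency]        # delayed fill price
        exit_px = prices[i + latency + 1]
        pnl += side * (exit_px - fill)
    return pnl
```

Running the same strategy across several latency values turns timing sensitivity into a directly measurable quantity, which is the core point of the simulation.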

Engineering focus

Centers on simulation design, algorithmic thinking, numerical analysis, and performance evaluation in systems where speed, timing, and sequential decisions influence outcomes.

Why it matters

Demonstrates strong problem-solving ability in quantitative environments and highlights comfort with logic-heavy, performance-oriented Python systems.

Python Quantitative Modeling Algorithm Design NumPy Simulation
View on GitHub
📉
Big Data

Movie Data Platform
on Databricks

A distributed data processing project built on Databricks and Apache Spark to handle ingestion, transformation, cleaning, and large-scale analysis of movie-related datasets. Designed to reflect modern big data processing patterns.

What it does

Ingests large datasets, applies Spark-based transformations, handles cleaning and reshaping, and prepares structured outputs for analytics or downstream querying.

End-to-end architecture

Raw movie datasets → Databricks ingestion → PySpark transformation jobs → cleaned distributed data layers → SQL analysis / reporting outputs
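The cleaning and aggregation stages in that flow look roughly like the sketch below — written here in pure Python as a stand-in for the PySpark jobs (the project expresses the same shape as DataFrame transformations, e.g. a `groupBy("genre")` aggregate); rows and column names are illustrative:

```python
from collections import defaultdict

# Pure-Python stand-in for Spark cleaning + aggregation jobs; the column
# names and sample rows are illustrative, not the project's dataset.
RAW = [
    {"title": "Heat",  "genre": "crime",  "rating": "8.3"},
    {"title": "Alien", "genre": "sci-fi", "rating": "8.5"},
    {"title": "Se7en", "genre": "crime",  "rating": "8.6"},
    {"title": None,    "genre": "crime",  "rating": "n/a"},  # dirty row
]

def clean(rows):
    """Drop rows with missing titles or unparseable ratings; cast types."""
    out = []
    for r in rows:
        if not r["title"]:
            continue
        try:
            out.append({**r, "rating": float(r["rating"])})
        except ValueError:
            continue
    return out

def avg_rating_by_genre(rows):
    """GroupBy-style aggregate over the cleaned rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for r in rows:
        acc[r["genre"]][0] += r["rating"]
        acc[r["genre"]][1] += 1
    return {g: total / n for g, (total, n) in acc.items()}
```

In Spark the same logic distributes across partitions for free, which is what makes the pattern scale past local-only analysis.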

Engineering focus

Focuses on distributed compute, scalable transformations, Spark workflows, notebook-driven processing, and handling larger datasets more efficiently than local-only analysis allows.

Why it matters

Shows readiness for modern data platform environments where large-scale transformation and cloud-style workflows are essential.

Databricks Apache Spark PySpark SQL Big Data
View on GitHub
🎬
Data Ingestion

Automated
Data Mining System

A structured scraping and ingestion pipeline built for collecting, validating, and preparing data from external sources for downstream analytics use. The system emphasizes automation, repeatability, and data readiness.

What it does

Extracts structured information from source pages, applies validation and cleaning steps, and organizes the output into usable formats for analytics, storage, or later transformation.

End-to-end architecture

External web sources → scraping layer → parsing + validation → cleaned structured records → storage / analytics-ready datasets
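The parse → validate → structured-record core of that flow can be sketched with the standard library, run here on an inline HTML snippet; the markup, field names, and validation rule are illustrative, and the real system adds fetching, retries, and rate limiting around this core:

```python
from html.parser import HTMLParser

# Minimal parse → validate → records flow; the HTML snippet and selectors
# are illustrative assumptions, not the project's actual sources.
HTML = """
<ul>
  <li class="item" data-price="19.99">Widget</li>
  <li class="item" data-price="">Gadget</li>
</ul>
"""

class ItemParser(HTMLParser):
    """Extracts (name, price) pairs from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.records, self._price = [], None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "li" and a.get("class") == "item":
            self._price = a.get("data-price")

    def handle_data(self, data):
        if self._price is not None and data.strip():
            self.records.append({"name": data.strip(), "price": self._price})
            self._price = None

def validate(records):
    """Data-quality gate: keep only records with a parseable price."""
    out = []
    for r in records:
        try:
            out.append({"name": r["name"], "price": float(r["price"])})
        except ValueError:
            continue  # drop records that fail validation
    return out

parser = ItemParser()
parser.feed(HTML)
clean_records = validate(parser.records)
```

Keeping validation as its own explicit step — rather than trusting the parser's output — is what makes the downstream datasets dependable.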

Engineering focus

Built around reliability in data collection, parsing logic, automation flow, data quality checks, and making raw extracted information usable for downstream systems.

Why it matters

Demonstrates a strong understanding of ingestion pipelines, source handling, and the early stages of the data engineering lifecycle where reliability often matters most.

Python Web Scraping Data Pipelines Automation Data Validation
View on GitHub
🕸