Projects

Real systems.
Real data. Real impact.

A collection of projects across analytics engineering, data platforms, AI experimentation, quantitative systems, and production-style automation — designed to show end-to-end technical thinking, not just isolated code.

Flagship Project

Formula One
Analytics Platform

History Covered
75 Years of F1 Data
Architecture
5-Layer Pipeline
Core Stack
Python · Spark · Snowflake

A full analytics platform designed around the complete data lifecycle: raw ingestion, distributed transformation, warehouse-ready modeling, and insight delivery. Rather than focusing only on analysis, this project was built to demonstrate platform thinking — how historical sports data can be transformed into a scalable analytics system with clear separation between ingestion, processing, storage, and BI consumption.

What it does

Processes decades of Formula One race, driver, constructor, lap, and results data into analysis-ready datasets for historical comparison, performance trend evaluation, and dashboard-based storytelling.

End-to-end architecture

Source datasets → Python ingestion layer → Spark/Databricks transformation layer → cleaned analytical tables → Snowflake-style warehouse modeling → BI dashboards / reporting layer
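The layered flow above can be sketched in miniature. Below is a hypothetical cleaning and modeling step in plain Python — the real pipeline runs these transformations as Spark jobs, and the column names here are illustrative, not the project's actual schema:

```python
# Toy ingestion → cleaning → modeling flow; in the project these steps run
# in Spark, and the field names are illustrative stand-ins.
RAW_RESULTS = [
    {"race": " MONACO_1984 ", "driver": "Prost", "position": "1", "points": "4.5"},
    {"race": "monaco_1984", "driver": "Senna", "position": "2", "points": "6"},
]

def clean_result(row):
    """Schema-aware cleaning: normalize text fields, cast numeric fields."""
    return {
        "race": row["race"].strip().lower(),
        "driver": row["driver"].strip(),
        "position": int(row["position"]),
        "points": float(row["points"]),
    }

def total_points_by_driver(rows):
    """Warehouse-style aggregate: total points per driver."""
    totals = {}
    for row in rows:
        totals[row["driver"]] = totals.get(row["driver"], 0.0) + row["points"]
    return totals

cleaned = [clean_result(r) for r in RAW_RESULTS]
```

The same shape — normalize at ingestion, cast in transformation, aggregate in the modeling layer — is what keeps each stage of the pipeline independently testable.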

Engineering focus

Built around modular ETL logic, schema-aware transformations, historical normalization, distributed processing, and analytics-friendly data modeling. The project reflects how a real analytics engineering workflow moves from messy source data toward reliable reporting outputs.

Why it matters

Shows capability across data engineering, transformation design, pipeline structure, warehouse thinking, and stakeholder-ready analytics — not just notebook-level exploration.

Python Apache Spark Databricks Snowflake R Looker Studio ETL Data Modeling
View on GitHub
🏎
Analytics Engineering

GA4 Analytics
Dashboard Pipeline

An end-to-end analytics pipeline built to transform raw GA4 event data into structured, decision-ready dashboards. The project focuses on event collection, metric logic, transformation layers, and reporting outputs that make product and marketing performance easier to interpret.

What it does

Pulls analytics data, organizes key events and user behavior patterns, applies transformation logic, and exposes performance metrics through dashboard layers for easier monitoring and reporting.

End-to-end architecture

GA4 event source → Python extraction / API handling → cleaning + metric transformation → SQL-ready tables / reporting structures → Looker Studio dashboards
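The metric-logic stage in that chain can be illustrated with a small sketch. The event rows and metric below are hypothetical examples; the real pipeline extracts events via the GA4 Data API before applying transformation logic of this kind:

```python
# Hypothetical GA4-style event rows; in the pipeline these arrive from the
# GA4 Data API extraction layer.
EVENTS = [
    {"user": "u1", "event_name": "page_view"},
    {"user": "u1", "event_name": "purchase"},
    {"user": "u2", "event_name": "page_view"},
    {"user": "u3", "event_name": "page_view"},
    {"user": "u3", "event_name": "purchase"},
]

def conversion_rate(events, goal="purchase"):
    """Metric logic: share of distinct users who triggered the goal event."""
    users = {e["user"] for e in events}
    converted = {e["user"] for e in events if e["event_name"] == goal}
    return len(converted) / len(users) if users else 0.0
```

Defining metrics as explicit, testable functions like this — rather than ad hoc dashboard formulas — is what gives the reporting layer its consistency.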

Engineering focus

Emphasizes analytics engineering fundamentals: metric definition, event standardization, reporting consistency, transformation logic, and dashboard usability for non-technical stakeholders.

Why it matters

Demonstrates how raw behavioral data becomes business-facing insights through a repeatable pipeline rather than ad hoc reporting.

GA4 Python Looker Studio SQL Analytics Engineering
View on GitHub
📈
Applied AI

AI Studio —
RL Environment

A reinforcement learning experimentation environment designed for training agents with custom reward structures, configurable environment rules, and iterative training workflows. Built to explore decision-making systems rather than static supervised learning alone.

What it does

Defines an environment, state-action behavior, reward mechanics, and model training loops that allow an agent to learn through interaction and repeated policy improvement.

End-to-end architecture

Environment design → state representation → reward logic → agent training loop → evaluation runs → performance observation and iteration
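That act → observe reward → update loop can be shown at its smallest scale. The sketch below is a deliberately tiny tabular example — a two-action bandit with a deterministic warm-up schedule — standing in for the project's neural agents; only the control flow matches:

```python
# Minimal sketch of the reward → training-loop pattern. The actual project
# trains learned agents; this toy bandit keeps only the loop structure.
REWARDS = {0: 0.0, 1: 1.0}  # action 1 is always better in this toy setup

def train(episodes=100, warmup=20, alpha=0.1):
    q = {0: 0.0, 1: 0.0}  # estimated value per action
    for ep in range(episodes):
        # round-robin exploration during warm-up, then greedy exploitation
        action = ep % 2 if ep < warmup else max(q, key=q.get)
        reward = REWARDS[action]
        q[action] += alpha * (reward - q[action])  # incremental value update
    return q
```

Swapping the warm-up schedule, the learning rate, or the reward table is exactly the kind of configuration comparison the environment is built to support.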

Engineering focus

Focuses on experimentation design, reward shaping, training stability, iterative testing, and comparing behavior across different learning configurations. It reflects an AI-builder mindset rather than blind reliance on prepackaged models.

Why it matters

Shows practical exposure to ML system design, learning dynamics, and the tradeoffs involved when building AI environments from scratch.

PyTorch Reinforcement Learning Python TensorFlow Experimentation
View on GitHub
🧠
Quantitative Systems

High-Frequency
Trading Simulation

A Python-based simulation system for testing trading behavior in synthetic market conditions. The project explores latency-sensitive logic, order behavior, market reactions, and how trading strategies perform under fast-changing inputs.

What it does

Simulates market movement, evaluates rule-based or modeled trading decisions, and measures behavior under different timing and execution assumptions.

End-to-end architecture

Synthetic market generator → pricing / signal logic → trade execution simulation → latency-aware evaluation → strategy performance analysis
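The generator → signal → latency-aware execution chain can be sketched as follows. The random-walk feed, momentum rule, and one-tick fill model here are illustrative stand-ins, not the project's actual strategy or execution logic:

```python
import random

# Sketch of the generator → signal → latency-aware execution loop; the
# strategy, parameters, and fill model are illustrative assumptions.

def synthetic_prices(n=200, seed=7, start=100.0, vol=0.5):
    """Random-walk price series standing in for a market feed."""
    rng = random.Random(seed)
    prices, p = [], start
    for _ in range(n):
        p += rng.gauss(0.0, vol)
        prices.append(p)
    return prices

def momentum_signal(prices, i):
    """+1 (buy) after an up tick, -1 (sell) after a down tick."""
    return 1 if prices[i] > prices[i - 1] else -1

def simulate(prices, latency=1):
    """Fill each signal `latency` ticks late; close the trade one tick after."""
    pnl = 0.0
    for i in range(1, len(prices) - latency - 1):
        side = momentum_signal(prices, i)
        fill = prices[i + latency]        # delayed fill price
        exit_px = prices[i + latency + 1]
        pnl += side * (exit_px - fill)
    return pnl
```

Running the same strategy across several latency values turns timing sensitivity into a directly measurable quantity, which is the core point of the simulation.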

Engineering focus

Centers on simulation design, algorithmic thinking, numerical analysis, and performance evaluation in systems where speed, timing, and sequential decisions influence outcomes.

Why it matters

Demonstrates strong problem-solving ability in quantitative environments and highlights comfort with logic-heavy, performance-oriented Python systems.

Python Quantitative Modeling Algorithm Design NumPy Simulation
View on GitHub
📉
Big Data

Movie Data Platform
on Databricks

A distributed data processing project built on Databricks and Apache Spark to handle ingestion, transformation, cleaning, and large-scale analysis of movie-related datasets. Designed to reflect modern big data processing patterns.

What it does

Ingests large datasets, applies Spark-based transformations, handles cleaning and reshaping, and prepares structured outputs for analytics or downstream querying.

End-to-end architecture

Raw movie datasets → Databricks ingestion → PySpark transformation jobs → cleaned distributed data layers → SQL analysis / reporting outputs
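The cleaning and aggregation stages in that flow look roughly like the sketch below — written here in pure Python as a stand-in for the PySpark jobs (the project expresses the same shape as DataFrame transformations, e.g. a `groupBy("genre")` aggregate); rows and column names are illustrative:

```python
from collections import defaultdict

# Pure-Python stand-in for Spark cleaning + aggregation jobs; the column
# names and sample rows are illustrative, not the project's dataset.
RAW = [
    {"title": "Heat",  "genre": "crime",  "rating": "8.3"},
    {"title": "Alien", "genre": "sci-fi", "rating": "8.5"},
    {"title": "Se7en", "genre": "crime",  "rating": "8.6"},
    {"title": None,    "genre": "crime",  "rating": "n/a"},  # dirty row
]

def clean(rows):
    """Drop rows with missing titles or unparseable ratings; cast types."""
    out = []
    for r in rows:
        if not r["title"]:
            continue
        try:
            out.append({**r, "rating": float(r["rating"])})
        except ValueError:
            continue
    return out

def avg_rating_by_genre(rows):
    """GroupBy-style aggregate over the cleaned rows."""
    acc = defaultdict(lambda: [0.0, 0])
    for r in rows:
        acc[r["genre"]][0] += r["rating"]
        acc[r["genre"]][1] += 1
    return {g: total / n for g, (total, n) in acc.items()}
```

In Spark the same logic distributes across partitions for free, which is what makes the pattern scale past local-only analysis.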

Engineering focus

Focuses on distributed compute, scalable transformations, Spark workflows, notebook-driven processing, and handling larger datasets more efficiently than local-only analysis allows.

Why it matters

Shows readiness for modern data platform environments where large-scale transformation and cloud-style workflows are essential.

Databricks Apache Spark PySpark SQL Big Data
View on GitHub
🎬
Data Ingestion

Automated
Data Mining System

A structured scraping and ingestion pipeline built for collecting, validating, and preparing data from external sources for downstream analytics use. The system emphasizes automation, repeatability, and data readiness.

What it does

Extracts structured information from source pages, applies validation and cleaning steps, and organizes the output into usable formats for analytics, storage, or later transformation.

End-to-end architecture

External web sources → scraping layer → parsing + validation → cleaned structured records → storage / analytics-ready datasets
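The parse → validate → structured-record core of that flow can be sketched with the standard library, run here on an inline HTML snippet; the markup, field names, and validation rule are illustrative, and the real system adds fetching, retries, and rate limiting around this core:

```python
from html.parser import HTMLParser

# Minimal parse → validate → records flow; the HTML snippet and selectors
# are illustrative assumptions, not the project's actual sources.
HTML = """
<ul>
  <li class="item" data-price="19.99">Widget</li>
  <li class="item" data-price="">Gadget</li>
</ul>
"""

class ItemParser(HTMLParser):
    """Extracts (name, price) pairs from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.records, self._price = [], None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "li" and a.get("class") == "item":
            self._price = a.get("data-price")

    def handle_data(self, data):
        if self._price is not None and data.strip():
            self.records.append({"name": data.strip(), "price": self._price})
            self._price = None

def validate(records):
    """Data-quality gate: keep only records with a parseable price."""
    out = []
    for r in records:
        try:
            out.append({"name": r["name"], "price": float(r["price"])})
        except ValueError:
            continue  # drop records that fail validation
    return out

parser = ItemParser()
parser.feed(HTML)
clean_records = validate(parser.records)
```

Keeping validation as its own explicit step — rather than trusting the parser's output — is what makes the downstream datasets dependable.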

Engineering focus

Built around reliability in data collection, parsing logic, automation flow, data quality checks, and making raw extracted information usable for downstream systems.

Why it matters

Demonstrates a strong understanding of ingestion pipelines, source handling, and the early stages of the data engineering lifecycle where reliability often matters most.

Python Web Scraping Data Pipelines Automation Data Validation
View on GitHub
🕸