Kafka Streaming Pipeline

The system simulates transaction events (purchases by customers) and processes them through a Kafka topic. A consumer application stores both the raw transactions and aggregated per-customer totals in PostgreSQL.

Project Brief

A streaming data pipeline that demonstrates ingestion, real-time processing, and storage using Java, Apache Kafka, and PostgreSQL. The system simulates transactions flowing through Kafka topics, with producers generating events and consumers persisting raw and aggregated data to a database.

Approach

  • Planning & Objectives
    • Planned the project to simulate a real-time data streaming environment using Kafka as the backbone for event-driven communication
    • Defined core goal: demonstrate ingestion, processing, and storage of continuous data streams with reproducibility and clarity
    • Outlined producer–consumer flow architecture to ensure modularity and ease of testing
  • Architecture Design
    • Designed the pipeline around two main components: Kafka Producer and Kafka Consumer, communicating via a common topic
    • Kafka Producer: responsible for generating synthetic transaction events (UUID, timestamp, customer ID, category, amount)
    • Kafka Consumer: reads each event from the topic, writes raw records into `transactions_raw`, and updates aggregates in `customer_agg`
    • Used Docker Compose to spin up Kafka, Zookeeper, and PostgreSQL for a reproducible local setup
  • Implementation Strategy
    • Developed producer and consumer as standalone Java applications using the Kafka client library
    • Used Gradle for dependency management and build automation
    • Implemented robust serialization/deserialization and ensured consumer idempotency for duplicate event handling
    • Added configuration files for topic names, partition count, and bootstrap servers so environments can be switched without code changes
  • Testing & Validation
    • Tested producer output for consistent event schema and message frequency
    • Verified consumer processing logic by comparing record counts between Kafka topics and PostgreSQL tables
    • Conducted aggregation validation — ensuring customer-level totals matched individual transaction sums
  • Deployment & Containerization
    • Containerized the full environment with Docker Compose for one-command setup and teardown
    • Validated connectivity between containers (Producer → Kafka → Consumer → Postgres)
    • Ensured logs were viewable across services for debugging and performance tuning
  • Project Duration (Estimate)
    • Part-time (evenings/weekends): 4–6 weeks total
    • 1 week — setup & architecture design
    • 2 weeks — implementation of producer, consumer, and database logic
    • 1–2 weeks — testing, debugging, and Docker environment polish
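The producer–consumer flow described above can be sketched in plain Java, with no Kafka broker required, to show the event schema, the duplicate-event guard, and the per-customer fold. All names here (`PipelineSketch`, `generate`, `aggregate`) are illustrative and not taken from the actual repositories; the real applications serialize these events onto a Kafka topic and persist to PostgreSQL instead of an in-memory map.

```java
import java.util.*;

// Illustrative sketch of the pipeline's data flow, assuming the event fields
// named in the write-up (UUID, timestamp, customer ID, category, amount).
public class PipelineSketch {

    // A synthetic transaction event, mirroring the producer's schema.
    record Transaction(String id, long timestamp, String customerId,
                       String category, double amount) {}

    // Producer side: build one synthetic event, as the Kafka producer would
    // before serializing it onto the topic.
    static Transaction generate(Random rng) {
        String[] categories = {"groceries", "electronics", "travel"};
        return new Transaction(
                UUID.randomUUID().toString(),
                System.currentTimeMillis(),
                "customer-" + rng.nextInt(5),
                categories[rng.nextInt(categories.length)],
                Math.round(rng.nextDouble() * 10_000) / 100.0);
    }

    // Consumer side: fold events into per-customer totals, skipping duplicate
    // event IDs — the same idempotency guard the consumer needs before
    // updating aggregates, since Kafka can redeliver messages.
    static Map<String, Double> aggregate(List<Transaction> events) {
        Set<String> seen = new HashSet<>();
        Map<String, Double> totals = new HashMap<>();
        for (Transaction t : events) {
            if (!seen.add(t.id())) continue; // duplicate delivery: ignore
            totals.merge(t.customerId(), t.amount(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Transaction> events = new ArrayList<>();
        for (int i = 0; i < 10; i++) events.add(generate(rng));
        // Redeliver the first event to exercise the duplicate guard.
        events.add(events.get(0));
        System.out.println(aggregate(events));
    }
}
```

The key design point the sketch captures is that aggregation must be keyed on the event UUID, not just appended, so that at-least-once delivery from Kafka does not double-count a customer's total.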
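A minimal `docker-compose.yml` for the three backing services might look like the following. The image tags, ports, and environment variables are assumptions based on the common Confluent and Postgres images, not taken from the repository.

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0   # assumed image/tag
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0       # assumed image/tag
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  postgres:
    image: postgres:16
    ports: ["5432:5432"]
    environment:
      POSTGRES_PASSWORD: example             # local dev only
```

With a file along these lines, `docker compose up -d` brings the stack up and `docker compose down` tears it down, matching the one-command setup described above.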

Features

  • Kafka Producer: synthetic transaction events (UUID, time, amount, category)
  • Kafka Consumer: persistence to PostgreSQL with raw + aggregates
  • Real-time Aggregation: per-customer totals and counts
  • Configurable Topics: bootstrap servers, partitions, and retention
  • Docker Compose: Kafka, Zookeeper, and Postgres for local dev
  • Validation: record counts and aggregation checks across components
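The "raw + aggregates" persistence and the aggregation check map naturally onto a handful of SQL statements. The sketch below assumes the table names from the write-up (`transactions_raw`, `customer_agg`) but invents the column layout for illustration; Postgres's `ON CONFLICT` clause provides the idempotent upsert.

```sql
-- Raw event, keyed by the producer's UUID so a redelivered event is a no-op.
INSERT INTO transactions_raw (id, ts, customer_id, category, amount)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (id) DO NOTHING;

-- Running per-customer totals and counts.
INSERT INTO customer_agg (customer_id, total_amount, tx_count)
VALUES ($3, $5, 1)
ON CONFLICT (customer_id) DO UPDATE
SET total_amount = customer_agg.total_amount + EXCLUDED.total_amount,
    tx_count     = customer_agg.tx_count + 1;

-- Aggregation check: should return zero rows if totals are consistent.
SELECT a.customer_id
FROM customer_agg a
JOIN (SELECT customer_id, SUM(amount) AS total
      FROM transactions_raw
      GROUP BY customer_id) r USING (customer_id)
WHERE a.total_amount <> r.total;
```

In practice the consumer should only run the aggregate upsert when the raw insert actually inserted a row (e.g. by checking the affected-row count), so that a duplicate delivery skips both statements.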

Tools & Technologies

Java · Apache Kafka · PostgreSQL · Docker · Gradle · Git · GitHub