Kafka Streaming Pipeline

The system simulates transaction events (purchases by customers) and processes them through a Kafka topic. A consumer application stores both the raw transactions and aggregated per-customer totals in PostgreSQL.

Project Brief

A streaming data pipeline that demonstrates ingestion, real-time processing, and storage using Java, Apache Kafka, and PostgreSQL. The system simulates transactions flowing through Kafka topics, with producers generating events and consumers persisting raw and aggregated data to a database.

Approach

  • Planning & Objectives
    • Planned the project to simulate a real-time data streaming environment using Kafka as the backbone for event-driven communication
    • Defined core goal: demonstrate ingestion, processing, and storage of continuous data streams with reproducibility and clarity
    • Outlined producer–consumer flow architecture to ensure modularity and ease of testing
  • Architecture Design
    • Designed the pipeline around two main components: Kafka Producer and Kafka Consumer, communicating via a common topic
    • Kafka Producer: responsible for generating synthetic transaction events (UUID, timestamp, customer ID, category, amount)
    • Kafka Consumer: reads each event from the topic, writes raw records into `transactions_raw`, and updates aggregates in `customer_agg`
    • Used Docker Compose to spin up Kafka, Zookeeper, and PostgreSQL for a reproducible local setup
  • Implementation Strategy
    • Developed producer and consumer as standalone Java applications using the Kafka client library
    • Used Gradle for dependency management and build automation
    • Implemented robust serialization/deserialization and ensured consumer idempotency for duplicate event handling
    • Added configuration files for topic names, partition count, and bootstrap servers so environments can be switched without code changes
  • Testing & Validation
    • Tested producer output for consistent event schema and message frequency
    • Verified consumer processing logic by comparing record counts between Kafka topics and PostgreSQL tables
    • Conducted aggregation validation — ensuring customer-level totals matched individual transaction sums
  • Deployment & Containerization
    • Containerized the full environment with Docker Compose for one-command setup and teardown
    • Validated connectivity between containers (Producer → Kafka → Consumer → Postgres)
    • Ensured logs were viewable across services for debugging and performance tuning
  • Project Duration (Estimate)
    • Part-time (evenings/weekends): 4–6 weeks total
    • 1 week — setup & architecture design
    • 2 weeks — implementation of producer, consumer, and database logic
    • 1–2 weeks — testing, debugging, and Docker environment polish
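The producer–consumer flow described above can be sketched in plain Java, with no Kafka broker required, to show the event schema, the duplicate-event guard, and the per-customer fold. All names here (`PipelineSketch`, `generate`, `aggregate`) are illustrative and not taken from the actual repositories; the real applications serialize these events onto a Kafka topic and persist to PostgreSQL instead of an in-memory map.

```java
import java.util.*;

// Illustrative sketch of the pipeline's data flow, assuming the event fields
// named in the write-up (UUID, timestamp, customer ID, category, amount).
public class PipelineSketch {

    // A synthetic transaction event, mirroring the producer's schema.
    record Transaction(String id, long timestamp, String customerId,
                       String category, double amount) {}

    // Producer side: build one synthetic event, as the Kafka producer would
    // before serializing it onto the topic.
    static Transaction generate(Random rng) {
        String[] categories = {"groceries", "electronics", "travel"};
        return new Transaction(
                UUID.randomUUID().toString(),
                System.currentTimeMillis(),
                "customer-" + rng.nextInt(5),
                categories[rng.nextInt(categories.length)],
                Math.round(rng.nextDouble() * 10_000) / 100.0);
    }

    // Consumer side: fold events into per-customer totals, skipping duplicate
    // event IDs — the same idempotency guard the consumer needs before
    // updating aggregates, since Kafka can redeliver messages.
    static Map<String, Double> aggregate(List<Transaction> events) {
        Set<String> seen = new HashSet<>();
        Map<String, Double> totals = new HashMap<>();
        for (Transaction t : events) {
            if (!seen.add(t.id())) continue; // duplicate delivery: ignore
            totals.merge(t.customerId(), t.amount(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Transaction> events = new ArrayList<>();
        for (int i = 0; i < 10; i++) events.add(generate(rng));
        // Redeliver the first event to exercise the duplicate guard.
        events.add(events.get(0));
        System.out.println(aggregate(events));
    }
}
```

The key design point the sketch captures is that aggregation must be keyed on the event UUID, not just appended, so that at-least-once delivery from Kafka does not double-count a customer's total.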
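A minimal `docker-compose.yml` for the three backing services might look like the following. The image tags, ports, and environment variables are assumptions based on the common Confluent and Postgres images, not taken from the repository.

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0   # assumed image/tag
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0       # assumed image/tag
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  postgres:
    image: postgres:16
    ports: ["5432:5432"]
    environment:
      POSTGRES_PASSWORD: example             # local dev only
```

With a file along these lines, `docker compose up -d` brings the stack up and `docker compose down` tears it down, matching the one-command setup described above.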

Features

  • Kafka Producer: synthetic transaction events (UUID, time, amount, category)
  • Kafka Consumer: persistence to PostgreSQL with raw + aggregates
  • Real-time Aggregation: per-customer totals and counts
  • Configurable Topics: bootstrap servers, partitions, and retention
  • Docker Compose: Kafka, Zookeeper, and Postgres for local dev
  • Validation: record counts and aggregation checks across components
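The "raw + aggregates" persistence and the aggregation check map naturally onto a handful of SQL statements. The sketch below assumes the table names from the write-up (`transactions_raw`, `customer_agg`) but invents the column layout for illustration; Postgres's `ON CONFLICT` clause provides the idempotent upsert.

```sql
-- Raw event, keyed by the producer's UUID so a redelivered event is a no-op.
INSERT INTO transactions_raw (id, ts, customer_id, category, amount)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (id) DO NOTHING;

-- Running per-customer totals and counts.
INSERT INTO customer_agg (customer_id, total_amount, tx_count)
VALUES ($3, $5, 1)
ON CONFLICT (customer_id) DO UPDATE
SET total_amount = customer_agg.total_amount + EXCLUDED.total_amount,
    tx_count     = customer_agg.tx_count + 1;

-- Aggregation check: should return zero rows if totals are consistent.
SELECT a.customer_id
FROM customer_agg a
JOIN (SELECT customer_id, SUM(amount) AS total
      FROM transactions_raw
      GROUP BY customer_id) r USING (customer_id)
WHERE a.total_amount <> r.total;
```

In practice the consumer should only run the aggregate upsert when the raw insert actually inserted a row (e.g. by checking the affected-row count), so that a duplicate delivery skips both statements.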

Tools & Technologies

Java · Apache Kafka · PostgreSQL · Docker · Gradle · Git · GitHub