Backend Architecture & Platform Engineering

Engineering philosophy

Built to separate concerns, not to bundle them

Domain-driven service boundaries

Every business domain (catalog, orders, payments, fulfilment, seller, buyer, notifications) owns its own service, its own database connection, and its own deployment lifecycle. No shared state across domain boundaries. Changes in one service cannot cascade into another at the infrastructure level.

Independent scaling per service

Each service scales independently based on its own CPU, memory, and request rate signals. GKE microservices shrink to one or two replicas during low traffic, and Cloud Run services scale down to zero. A traffic spike on the search service does not force the payments service to scale, so resources are not spent on services that are not under load.

Zero trust service mesh with full mutual TLS

Every service-to-service call on the platform is authenticated and encrypted using mutual TLS. Certificates are provisioned by an internal certificate authority and managed via Secret Manager with automatic rotation. No service accepts an unauthenticated connection from any peer, internal or external.

Compute topology

Four compute tiers. One coherent platform.

Traffic flows from the public edge through four distinct compute tiers, each optimised for its workload type. Every tier is designed to scale independently and fail gracefully.

Tier 1: Edge and ingress

Global edge layer

Public traffic entry point

All public traffic enters via Google Cloud's global HTTPS load balancer with TLS termination at the edge. Cloud Armor WAF enforces rate limiting, IP policy, and request inspection before any packet reaches compute. Network Endpoint Groups (NEGs) bind the external load balancer to GKE services inside the shared VPC using private connectivity. No backend service is ever exposed to a public IP. Frontend static assets are served from CDN edge nodes with delivery under 100 ms globally.

Cloud HTTPS LB · Cloud Armor WAF · NEGs · Cloud CDN · Private VPC binding

Tier 2: API gateway layer

API gateway on GKE

Request routing · auth enforcement · rate limiting · circuit breaking

The API gateway is not a managed cloud service. It is a self-hosted Spring Boot application running as a dedicated microservice inside the GKE cluster. This gives full control over routing logic, authentication enforcement, header manipulation, rate limiting, circuit breaking, and observability instrumentation, without vendor lock-in or the limitations of a managed gateway.

The gateway runs active-active in both the Mumbai and Delhi upstream subnets. All inbound traffic hits the gateway first. The gateway validates JWTs, enforces RBAC permissions, applies per-route rate limits, injects correlation IDs for distributed tracing, and routes requests to the correct downstream domain service. If a downstream service is unhealthy, the circuit breaker opens at the gateway and the caller receives a structured error immediately rather than waiting on a timeout.

Spring Boot · Spring Cloud Gateway · JWT validation · RBAC enforcement · Circuit breaker · Correlation ID injection · GKE Deployment · Horizontal Pod Autoscaler
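
The fail-fast behaviour described for the gateway can be illustrated with a minimal circuit breaker in plain Java. This is a sketch under stated assumptions, not the platform's implementation: the class name GatewayCircuitBreaker, the failure threshold, and the open duration are all hypothetical, and a Spring Cloud Gateway deployment would more likely rely on a circuit breaker library than on hand-written code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Illustrative circuit breaker; class name and thresholds are assumptions, not the platform's code. */
class GatewayCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final Duration openDuration; // how long to fail fast before allowing a probe
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt = Instant.EPOCH;

    GatewayCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** State at time `now`; an OPEN breaker becomes HALF_OPEN once openDuration has passed. */
    synchronized State state(Instant now) {
        if (state == State.OPEN && !now.isBefore(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the downstream call unless the breaker is open; failures return the fallback. */
    <T> T call(Supplier<T> downstream, Supplier<T> fallback, Instant now) {
        if (state(now) == State.OPEN) {
            return fallback.get(); // fail fast: the downstream service is not contacted
        }
        try {
            T result = downstream.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure(now);
            return fallback.get();
        }
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        // A failed probe in HALF_OPEN reopens the breaker immediately.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = now;
        }
    }

    public static void main(String[] args) {
        GatewayCircuitBreaker breaker = new GatewayCircuitBreaker(2, Duration.ofSeconds(30));
        Instant now = Instant.now();
        Supplier<String> unhealthy = () -> { throw new IllegalStateException("downstream unavailable"); };
        Supplier<String> structuredError = () -> "{\"error\":\"SERVICE_UNAVAILABLE\"}";
        for (int i = 0; i < 3; i++) {
            System.out.println(breaker.call(unhealthy, structuredError, now) + " state=" + breaker.state(now));
        }
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic. Once the breaker is open, the third call in main returns the structured error without invoking the downstream supplier at all.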

Tier 3: Core domain microservices

Core domain microservices on GKE (both regions active)

Mumbai upstream and Delhi upstream, both live simultaneously

All core business logic runs as independently deployed Spring Boot microservices inside GKE. The Mumbai and Delhi clusters run identical service deployments and both serve live traffic at all times; this is neither a primary/replica nor a warm-standby arrangement. If one region degrades, the other absorbs the full load with no manual intervention.

Each service owns its own deployment configuration, resource requests, HPA policy, and health check behaviour. A traffic spike on any single service triggers its HPA independently; other services are unaffected. Services communicate exclusively over mutual TLS using service mesh routing. No service holds a direct database connection string in code; all secrets are injected at runtime from Secret Manager via workload identity.

Catalog and search service

Product ingestion, indexing, full text search, faceted filtering, and personalised ranking

Cart and checkout service

Cart persistence, inventory reservation, checkout orchestration, and coupon application

Order management service (OMS)

Order lifecycle state machine from creation through fulfilment, cancellation, and returns

Payments service

Payment session creation, gateway routing, webhook verification, refund processing, and settlement triggers

Logistics and fulfilment service

Proximity based carrier selection, AWB generation, shipment tracking aggregation, and reverse logistics

Seller platform service

Seller onboarding, KYC orchestration, dashboard data aggregation, payout scheduling, and performance metrics

Buyer platform service

Buyer profile management, wishlist, order history, loyalty points, and personalisation signals

Notifications service

Transactional email, SMS, WhatsApp, and push dispatch with template management and delivery tracking

Promotions and pricing service

Discount rules engine, coupon validation, flash sale scheduling, and dynamic pricing logic

Reviews and ratings service

Review submission, moderation pipeline, sentiment scoring, and aggregate rating computation

Analytics and reporting service

Event ingestion, GMV aggregation, cohort analysis, seller dashboards, and investor KPI pack

Spring Boot · GKE Deployment · Horizontal Pod Autoscaler · Secret Manager workload identity · Mutual TLS service mesh · Multi region dual active · Independent per service scaling
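
The order lifecycle state machine mentioned for the order management service could be modelled along the following lines. The four states and the transitions between them are hypothetical, inferred only from the phrase "creation through fulfilment, cancellation, and returns"; the real service is likely to have more states and different rules.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical order states; the real service's states and transitions are not given in the text. */
enum OrderState {
    CREATED, FULFILLED, CANCELLED, RETURNED;

    /** States that may legally follow this one. */
    Set<OrderState> allowedNext() {
        switch (this) {
            case CREATED:
                return EnumSet.of(FULFILLED, CANCELLED);
            case FULFILLED:
                return EnumSet.of(RETURNED);
            default:
                return EnumSet.noneOf(OrderState.class); // CANCELLED and RETURNED are terminal here
        }
    }

    /** Returns the next state, or throws if the transition is not allowed. */
    OrderState transitionTo(OrderState next) {
        if (!allowedNext().contains(next)) {
            throw new IllegalStateException(this + " -> " + next + " is not a valid transition");
        }
        return next;
    }
}

class OrderLifecycleDemo {
    public static void main(String[] args) {
        OrderState state = OrderState.CREATED.transitionTo(OrderState.FULFILLED);
        System.out.println(state.transitionTo(OrderState.RETURNED));
    }
}
```

Keeping the allowed transitions in one place means an invalid request, such as fulfilling a cancelled order, is rejected before any side effect runs.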

Tier 4A: Downstream async services

Downstream services on Cloud Run (stateless async)

Event driven · scales to zero · no idle cost

Background processing, event consumers, async pipelines, and other workloads that are not latency-critical run on Cloud Run rather than GKE. This separates concerns at the infrastructure level: synchronous user-facing paths never share compute with background processing. Cloud Run services scale to zero when idle, eliminating cost for workloads that run in bursts rather than continuously. Each Cloud Run service subscribes to a Pub/Sub topic or is triggered on a schedule. All outbound communication to internal services is over mutual TLS.

Background job processor

Order confirmation emails, seller settlement notifications, analytics event flushing

Event bus consumers

Pub/Sub subscribers for inventory updates, payment confirmations, and shipment status changes

Analytics ingestion pipeline

Clickstream events, user behaviour signals, and funnel data written to BigQuery

Audit logging service

Immutable audit trail for all state-changing API calls, written to an append-only store

Scheduled report generator

Daily and weekly seller performance reports, investor KPI packs, and operational summaries

Cloud Run · Pub/Sub · Scheduled triggers · Scale to zero · Direct VPC egress · BigQuery write

Tier 4B: Stateful services

Stateful services on GCE virtual machines

Persistent storage · fixed identity · VM level control

Workloads that require persistent attached storage, fixed network identity, or VM-level process control run on GCE rather than containers. These are not general-purpose services; they are purpose-built for specific infrastructure roles that benefit from the operational characteristics of a virtual machine. Each VM is provisioned inside the shared VPC, never exposed to public IPs, and managed under the same Secret Manager and IAM policy framework as all other compute.

Internal certificate authority

Provisions and rotates mutual TLS client certificates for all enterprise applications

Firezone VPN server

WireGuard-based private admin access with multi-zone support

Long running ML inference

Recommendation model serving for personalisation with persistent model storage

Stateful session store

Redis cluster with persistent RDB snapshots for session and cart state recovery

GCE · Persistent disk · Fixed private IP · VPC native routing · Managed instance groups

Zero trust security

Every service connection is authenticated. No exceptions.

Vishvakta operates a private internal certificate authority hosted on GCE inside the shared VPC. On service startup, each microservice requests a short-lived client certificate from the CA via a provisioning API authenticated by workload identity. The certificate is injected into the service at runtime and is never stored in the container image or in an environment variable.

Every service-to-service call presents its client certificate. The receiving service validates the certificate chain before accepting the connection. A service that cannot present a valid certificate is rejected at the TLS handshake; the connection never reaches application code. Certificates are rotated automatically before expiry via a controller running in the GKE cluster. Revocation is handled by updating the CA's certificate revocation list, which all services poll on a defined interval.
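
The two timing rules in this paragraph, rotating before expiry and rejecting revoked certificates, can be sketched as follows. Everything here is illustrative: the class name, the two-thirds rotation point, and the 24-hour lifetime are assumptions, and in a real deployment chain validation happens inside the TLS stack rather than in application code like this.

```java
import java.math.BigInteger;
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

/** Illustrative lifetime and revocation checks; names and the rotation fraction are assumptions. */
class CertLifecycleSketch {

    /** True once the elapsed share of the validity window reaches rotateAtFraction (for example 2/3). */
    static boolean shouldRotate(Instant notBefore, Instant notAfter, Instant now, double rotateAtFraction) {
        long lifetimeMillis = Duration.between(notBefore, notAfter).toMillis();
        long elapsedMillis = Duration.between(notBefore, now).toMillis();
        return elapsedMillis >= lifetimeMillis * rotateAtFraction;
    }

    /** True if `now` is inside the validity window and the serial is not on the polled revocation list. */
    static boolean isAcceptable(BigInteger serial, Instant notBefore, Instant notAfter,
                                Instant now, Set<BigInteger> revokedSerials) {
        boolean withinValidity = !now.isBefore(notBefore) && !now.isAfter(notAfter);
        return withinValidity && !revokedSerials.contains(serial);
    }

    public static void main(String[] args) {
        Instant issued = Instant.parse("2024-01-01T00:00:00Z");
        Instant expires = issued.plus(Duration.ofHours(24)); // hypothetical 24-hour certificate
        Instant now = issued.plus(Duration.ofHours(17));
        System.out.println("rotate: " + shouldRotate(issued, expires, now, 2.0 / 3.0));
        System.out.println("accept: " + isAcceptable(BigInteger.valueOf(42), issued, expires, now, Set.of()));
    }
}
```

Rotating well before expiry leaves room for a failed renewal to be retried while the old certificate is still valid. Because the revocation list is polled, a revoked certificate can remain usable for up to one polling interval.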

What this means in practice

A compromised service cannot make unauthorised lateral calls to other services. An attacker who gains access to one pod cannot communicate with the payments service, the orders service, or the database tier without a valid, non-revoked certificate signed by Vishvakta's private CA. This is zero-trust networking applied at the service layer, not just at the network perimeter.

Private CA on GCE · Workload identity provisioning · Short lived certificates · Automatic rotation · Certificate revocation list · Mutual TLS on all service mesh paths

Autoscaling model

Scale what is under load. Leave everything else alone.

Horizontal Pod Autoscaler

Each GKE microservice defines its own HPA policy with CPU, memory, and custom request rate thresholds. During off peak hours most services run at one or two replicas. During a product launch or flash sale, the catalog and payments services scale out to tens of replicas within seconds, without touching the notifications service, the reviews service, or any other unrelated service.
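
The per-service scaling decision follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), evaluated per metric with the largest result chosen. The sketch below reproduces only that arithmetic; the class name, the example targets, and the replica bounds are illustrative assumptions, and the autoscaler's tolerance band and stabilisation windows are left out.

```java
/** Arithmetic behind an HPA decision; metric values and bounds used in main are made-up examples. */
class HpaSketch {

    /** ceil(currentReplicas * currentMetric / targetMetric) for a single metric. */
    static int desiredFor(int currentReplicas, double currentMetric, double targetMetric) {
        return (int) Math.ceil(currentReplicas * (currentMetric / targetMetric));
    }

    /** Evaluates every metric, takes the largest proposal, then clamps to the configured bounds. */
    static int decide(int currentReplicas, double[] current, double[] target,
                      int minReplicas, int maxReplicas) {
        int desired = 0;
        for (int i = 0; i < current.length; i++) {
            desired = Math.max(desired, desiredFor(currentReplicas, current[i], target[i]));
        }
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // Hypothetical flash sale: CPU is at 30% against a 60% target,
        // but request rate is 400 rps per pod against a 100 rps target.
        double[] current = {30.0, 400.0};
        double[] target = {60.0, 100.0};
        System.out.println(decide(2, current, target, 1, 50)); // request rate dominates: prints 8
    }
}
```

Because each service carries its own targets and bounds, a spike in one service's request rate changes only that service's replica count.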

Scale to zero

All downstream async services on Cloud Run scale to zero when no events are in flight. A scheduled report job wakes up, processes, and scales back to zero. An event consumer scales out to match Pub/Sub queue depth and scales back when the queue drains. Zero idle cost. Zero over provisioning.

Fixed capacity with managed instance groups

Stateful VMs are managed via instance groups with health-check-based auto-restart. These are not horizontally scaled on demand; they are sized for steady-state capacity with redundancy built into the instance group configuration. Recovery from failure is automatic.

Observability and reliability

You cannot operate what you cannot see.

Distributed tracing

Correlation IDs injected at the API gateway propagate through every downstream service call. Full request trace available in one query.
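
A minimal sketch of that propagation, assuming the ID travels in an HTTP header. The header name X-Correlation-ID and the helper names are hypothetical; the source does not say which header or tracing library the platform uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative correlation ID handling; the header name is an assumption. */
class CorrelationIdSketch {
    static final String HEADER = "X-Correlation-ID"; // hypothetical header name

    /** Gateway side: keep a non-blank inbound ID, otherwise mint a new one. */
    static Map<String, String> ensureCorrelationId(Map<String, String> inboundHeaders) {
        Map<String, String> headers = new HashMap<>(inboundHeaders);
        String existing = headers.get(HEADER);
        if (existing == null || existing.isBlank()) {
            headers.put(HEADER, UUID.randomUUID().toString());
        }
        return headers;
    }

    /** Service side: copy the ID from the inbound request onto an outbound call's headers. */
    static Map<String, String> propagate(Map<String, String> inboundHeaders,
                                         Map<String, String> outboundHeaders) {
        Map<String, String> out = new HashMap<>(outboundHeaders);
        String id = inboundHeaders.get(HEADER);
        if (id != null) {
            out.put(HEADER, id);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> atGateway = ensureCorrelationId(Map.of("Accept", "application/json"));
        Map<String, String> toPayments = propagate(atGateway, Map.of());
        // Both hops carry the same ID, so their log lines and spans can be joined.
        System.out.println(atGateway.get(HEADER).equals(toPayments.get(HEADER)));
    }
}
```

The sketch uses a plain map with case-sensitive keys for brevity; real HTTP header names are case-insensitive, which a framework's header type would normally handle.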

Centralised metrics

Prometheus scrapes all GKE pods and Cloud Run instances. Grafana dashboards show per service latency, error rate, saturation, and throughput in real time.

Log aggregation

Structured JSON logs from all services stream to Cloud Logging. Queryable by service, trace ID, user ID, order ID, or any business identifier.
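
As an illustration of what one structured log line might look like, the sketch below builds a single-line JSON entry by hand. The severity and message keys follow the field names Cloud Logging recognises in JSON payloads; the remaining keys (service, traceId, orderId) and the class itself are hypothetical, and a real service would normally use a logging library's JSON encoder instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hand-rolled one-line JSON log entry; field names other than severity/message are assumptions. */
class JsonLogSketch {

    /** Escapes quotes, backslashes, and control characters so the value is a valid JSON string body. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serialises severity, message, and extra string fields as one JSON object, in insertion order. */
    static String logLine(String severity, String message, Map<String, String> fields) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("severity", severity);
        entry.put("message", message);
        entry.putAll(fields);
        StringJoiner json = new StringJoiner(",", "{", "}");
        entry.forEach((k, v) -> json.add("\"" + escape(k) + "\":\"" + escape(v) + "\""));
        return json.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("service", "orders");   // hypothetical field names and values
        fields.put("traceId", "abc-123");
        fields.put("orderId", "ORD-1001");
        System.out.println(logLine("INFO", "order created", fields));
    }
}
```

Emitting business identifiers as separate keys, rather than embedding them in the message text, is what makes queries by trace ID or order ID possible.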

SLO tracking

Each critical path (checkout, payment, order creation) has a defined SLO. Error budgets are tracked and alerted. On call rotation and runbook documentation are in place for launch.
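
The error budget arithmetic behind an SLO can be sketched as follows. The 99.9% target and the request counts are made-up examples, not the platform's actual SLOs, and the class name is hypothetical.

```java
/** Error budget arithmetic for a request-based SLO; all numbers used in main are examples. */
class ErrorBudgetSketch {

    /** Failures the SLO tolerates over a window: total requests times (1 - target). */
    static double allowedFailures(long totalRequests, double sloTarget) {
        return totalRequests * (1.0 - sloTarget);
    }

    /** Share of the budget still unspent; negative once the SLO has been breached. */
    static double budgetRemaining(long totalRequests, long failedRequests, double sloTarget) {
        double allowed = allowedFailures(totalRequests, sloTarget);
        if (allowed == 0.0) {
            return failedRequests == 0 ? 1.0 : Double.NEGATIVE_INFINITY;
        }
        return 1.0 - failedRequests / allowed;
    }

    public static void main(String[] args) {
        // Hypothetical checkout window: 1,000,000 requests against a 99.9% target
        // tolerates about 1,000 failures; 250 failures leaves roughly 75% of the budget.
        System.out.printf("allowed=%.0f remaining=%.2f%n",
                allowedFailures(1_000_000L, 0.999),
                budgetRemaining(1_000_000L, 250L, 0.999));
    }
}
```

An alert on the rate at which the remaining budget shrinks, rather than on individual errors, is one common way to turn this number into an on-call signal.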

A microservices platform engineered for national scale