# Flink High Availability Cluster Deployment

## Overview
This project uses the Apache Flink Kubernetes Operator to deploy a high-availability Flink cluster with persistent storage and automatic failover.

## Component Architecture
- **JobManager**: 2 replicas with high availability configuration
- **TaskManager**: 3 replicas for distributed processing
- **High Availability**: Kubernetes-based HA with persistent storage
- **Checkpointing**: Persistent checkpoints and savepoints storage

## File Description

### 1. flink-operator-v2.yaml
Flink Kubernetes Operator deployment configuration:
- Operator deployment in `flink-system` namespace
- RBAC configuration for cluster-wide permissions
- Health checks and resource limits
- Enhanced CRD definitions with additional printer columns

### 2. flink-crd.yaml
Custom Resource Definitions for Flink:
- FlinkDeployment CRD
- FlinkSessionJob CRD
- Required for the Flink Operator to function

### 3. ha-flink-cluster-v2.yaml
Production-ready HA Flink cluster configuration:
- 2 JobManager replicas with HA enabled
- 3 TaskManager replicas with anti-affinity rules
- Persistent storage for HA data, checkpoints, and savepoints
- Memory and CPU resource allocation
- Exponential delay restart strategy
- Proper volume mounts and storage configuration

### 4. simple-ha-flink-cluster.yaml
Simplified HA Flink cluster configuration:
- Uses ephemeral storage to avoid PVC binding issues
- Basic HA configuration for testing and development
- Minimal resource requirements
- Recommended for development and testing

### 5. flink-storage.yaml
Storage and RBAC configuration:
- PersistentVolumeClaims for HA data, checkpoints, and savepoints
- ServiceAccount and RBAC permissions for Flink cluster
- Azure Disk storage class configuration with correct access modes

### 6. flink-rbac.yaml
Enhanced RBAC configuration:
- Complete permissions for Flink HA functionality
- Both namespace-level and cluster-level permissions
- Includes watch permissions for HA operations

## Deployment Steps

### 1. Install Flink Operator
```bash
# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system
```

### 2. Create Storage Resources (Required for Production Only)
```bash
# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
```

### 3. Deploy HA Flink Cluster
```bash
# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## High Availability Features
- **JobManager HA**: 2 JobManager replicas with Kubernetes-based leader election
- **Persistent State**: Checkpoints and savepoints stored on persistent volumes
- **Automatic Failover**: Exponential-delay restart strategy, so the wait between restart attempts grows after repeated failures
- **Pod Anti-affinity**: Ensures components are distributed across different nodes
- **Storage Persistence**: HA data, checkpoints, and savepoints persist across restarts
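
To make the exponential delay restart strategy listed above more concrete, the following sketch prints how the wait between restart attempts grows. The initial delay, multiplier, and cap are illustrative values, not settings taken from the manifests in this repository, and any jitter Flink may add is ignored.

```bash
# Print the delay (in seconds) before each restart attempt under a simple
# exponential backoff: start at INITIAL, multiply by FACTOR, never exceed MAX.
# All four arguments are illustrative; read the real values from the
# restart-strategy settings in your FlinkDeployment.
backoff_schedule() {
  initial=$1 factor=$2 max=$3 attempts=$4
  delay=$initial
  i=1
  while [ "$i" -le "$attempts" ]; do
    echo "attempt $i: wait ${delay}s"
    delay=$((delay * factor))
    if [ "$delay" -gt "$max" ]; then delay=$max; fi
    i=$((i + 1))
  done
}

# Hypothetical settings: 1s initial delay, doubling, capped at 60s.
backoff_schedule 1 2 60 8
```

With these example values the delay doubles until it reaches the cap and then stays there, which keeps a persistently failing job from restarting in a tight loop.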

## Network Configuration
- **JobManager**: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
- **TaskManager**: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
- **Service Type**: ClusterIP for internal communication
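
Because the services are ClusterIP, the Web UI on port 8081 is not reachable from outside the cluster. One way to reach it from a workstation is a port-forward, sketched below. The Service name `<cluster-id>-rest` is an assumption based on the naming convention of Flink's native Kubernetes integration; confirm it with `kubectl get svc` before relying on it.

```bash
# Assumption: the REST/Web UI Service is named "<cluster-id>-rest".
rest_service_name() {
  echo "$1-rest"
}

# Forward local port 8081 to the JobManager Web UI (runs until interrupted).
open_web_ui() {
  cluster=$1 namespace=$2
  kubectl port-forward "svc/$(rest_service_name "$cluster")" 8081:8081 -n "$namespace"
}

# Example (needs cluster access):
#   open_web_ui ha-flink-cluster-v2 freeleaps-data-platform
#   curl -s http://localhost:8081/overview
```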

## Storage Configuration
- **HA Data**: 10Gi for high availability metadata
- **Checkpoints**: 20Gi for application checkpoints
- **Savepoints**: 20Gi for manual savepoints
- **Storage Class**: azure-disk-std-ssd-lrs
- **Access Mode**: ReadWriteOnce (Azure Disk limitation)

## Monitoring and Operations
- **Health Checks**: Built-in readiness and liveness probes
- **Web UI**: Accessible through JobManager service
- **Metrics**: Exposed on port 8080 for Prometheus collection
- **Logging**: Centralized logging through Kubernetes

## Configuration Details

### High Availability Settings
- **Type**: kubernetes (native Kubernetes HA)
- **Storage**: Persistent volume for HA metadata
- **Cluster ID**: ha-flink-cluster-v2

### Checkpointing Configuration
- **Interval**: 60 seconds
- **Timeout**: 10 minutes
- **Min Pause**: 5 seconds
- **Backend**: Filesystem with persistent storage
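
A quick way to confirm that these checkpointing settings actually reached the deployed resource is to look for the corresponding Flink options in its manifest. The helper below is a minimal sketch: the option names in the example are the ones I would expect for interval, timeout, and minimum pause, but they are assumptions here and some are renamed in newer Flink releases, so adjust them to match your manifests.

```bash
# Read a manifest (or any "key: value" text) on stdin and print each expected
# key that does not appear at the start of a line (indentation is ignored).
missing_keys() {
  conf=$(cat)
  for key in "$@"; do
    printf '%s\n' "$conf" | awk -v k="$key:" '
      { sub(/^[ \t]+/, "") }
      index($0, k) == 1 { found = 1 }
      END { exit found ? 0 : 1 }
    ' || echo "$key"
  done
}

# Example (needs cluster access; option names are assumptions):
#   kubectl get flinkdeployment <cluster-name> -n freeleaps-data-platform -o yaml |
#     missing_keys execution.checkpointing.interval \
#                  execution.checkpointing.timeout \
#                  execution.checkpointing.min-pause \
#                  state.checkpoints.dir
```

An empty result means every listed option was found; any printed name is an option that is not set explicitly.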

### Resource Allocation
- **JobManager**: 0.5 CPU and 1024 MB memory (HA); 1.0 CPU and 1024 MB memory (Simple)
- **TaskManager**: 0.5 CPU and 2048 MB memory (HA); 2.0 CPU and 2048 MB memory (Simple)
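
To check whether the Kubernetes cluster has room for every replica, it helps to add up the per-pod figures. The sketch below does that for the HA configuration (2 JobManagers, 3 TaskManagers), assuming the values listed above are the per-pod requests; it does not account for operator pods or node overhead.

```bash
# Sum CPU and memory over all replicas.
# Arguments: jm_replicas jm_cpu jm_mem_mb tm_replicas tm_cpu tm_mem_mb
total_resources() {
  awk -v jmr="$1" -v jmc="$2" -v jmm="$3" -v tmr="$4" -v tmc="$5" -v tmm="$6" \
    'BEGIN { printf "%.1f CPU, %d MB\n", jmr * jmc + tmr * tmc, jmr * jmm + tmr * tmm }'
}

# HA configuration: 2 JobManagers and 3 TaskManagers.
total_resources 2 0.5 1024 3 0.5 2048   # prints "2.5 CPU, 8192 MB"
```

Compare the result with the allocatable capacity reported by `kubectl describe nodes`.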

## Troubleshooting

### Common Issues and Solutions

#### 1. PVC Binding Issues
```bash
# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
```
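
When several claims exist, a small filter makes the unbound ones easier to spot. This is a sketch that relies on the default column layout of `kubectl get pvc` (name first, status second).

```bash
# Read `kubectl get pvc --no-headers` output on stdin and print every claim
# whose STATUS column (field 2) is not "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}

# Example (needs cluster access):
#   kubectl get pvc -n freeleaps-data-platform --no-headers | unbound_pvcs
```

For each claim it reports, `kubectl describe pvc <pvc-name> -n freeleaps-data-platform` shows the provisioning events that explain why binding failed.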

#### 2. Pod CrashLoopBackOff
```bash
# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

# Check pod logs
kubectl logs <pod-name> -n freeleaps-data-platform

# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
```
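
For a pod in CrashLoopBackOff, the current container has often just restarted and its log is nearly empty; the output of the crashed instance is usually more useful. The sketch below finds such pods and prints their previous logs using the standard `--previous` flag of `kubectl logs`.

```bash
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods whose STATUS column (field 3) is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Print the last 50 log lines of the previous container instance of each
# crash-looping pod in the given namespace.
previous_logs() {
  namespace=$1
  kubectl get pods -n "$namespace" --no-headers | crashlooping_pods |
    while read -r pod; do
      echo "== $pod =="
      kubectl logs "$pod" -n "$namespace" --previous --tail=50
    done
}

# Example (needs cluster access):
#   previous_logs freeleaps-data-platform
```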

#### 3. ServiceAccount Issues
```bash
# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
```
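
Listing role bindings shows that they exist, not that they grant what Flink needs. Kubernetes-based HA keeps leader information in ConfigMaps, so a more direct check is to ask the API server whether the ServiceAccount may work with them. The sketch below uses `kubectl auth can-i` with impersonation; the ServiceAccount name `flink` in the example is hypothetical, and the verb list is a reasonable guess rather than an exhaustive requirement.

```bash
# Build the impersonation subject for a ServiceAccount.
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Ask the API server, verb by verb, whether the ServiceAccount may act on
# ConfigMaps in its namespace. Prints "yes" or "no" per verb.
check_configmap_access() {
  namespace=$1 account=$2
  for verb in get list watch create update delete; do
    printf '%s configmaps: ' "$verb"
    kubectl auth can-i "$verb" configmaps \
      --as="$(sa_subject "$namespace" "$account")" -n "$namespace"
  done
}

# Example (needs cluster access; "flink" is a hypothetical account name):
#   check_configmap_access freeleaps-data-platform flink
```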

#### 4. Storage Path Issues
```bash
# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
```
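
A mismatch between a configured storage path and the volume mount is easy to overlook when switching between the persistent and ephemeral variants. The helper below is a small sketch that checks whether a configured path (with or without a `file://` prefix) lies at or below a mount path; both values must be copied by hand from the manifest.

```bash
# Succeed if the configured path lies at or below the mount path.
# Arguments: path_or_uri mount_path
is_under_mount() {
  path=${1#file://}   # drop an optional file:// scheme
  mount=${2%/}        # drop an optional trailing slash
  case "$path" in
    "$mount" | "$mount"/*) return 0 ;;
    *) return 1 ;;
  esac
}

is_under_mount file:///opt/flink/checkpoints /opt/flink/checkpoints && echo "ok"
is_under_mount file:///tmp/flink/ha-data /opt/flink/ha-data || echo "mismatch"
```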

### Diagnostic Commands
```bash
# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform

# Check recent events in the namespace (oldest first)
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator
```

## Important Notes
1. **Storage Limitations**: Azure Disk storage class only supports ReadWriteOnce access mode
2. **ServiceAccount**: Ensure the correct ServiceAccount is specified in cluster configuration
3. **Resource Requirements**: Verify cluster has enough CPU/memory for all replicas
4. **Network Policies**: May need adjustment for inter-pod communication
5. **Ephemeral vs Persistent**: Use ephemeral storage for development/testing, persistent for production

## Quick Start (Recommended for Testing)
```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready (kubectl wait gives up after 30s unless --timeout is set)
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system --timeout=300s

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```
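
The monitoring commands above show a snapshot; in a script it is often handier to block until the cluster reports ready. The following is a sketch of a generic polling helper. The status field used in the example, `.status.jobManagerDeploymentStatus`, is an assumption based on the upstream operator's FlinkDeployment CRD and may differ if the CRDs in this repository were customized.

```bash
# Re-run a command until it prints EXPECTED, or give up after ATTEMPTS tries.
# Usage: wait_for_output EXPECTED ATTEMPTS SLEEP_SECONDS command [args...]
wait_for_output() {
  expected=$1 attempts=$2 pause=$3
  shift 3
  n=1
  actual=""
  while [ "$n" -le "$attempts" ]; do
    actual=$("$@" 2>/dev/null) || true
    if [ "$actual" = "$expected" ]; then return 0; fi
    sleep "$pause"
    n=$((n + 1))
  done
  echo "gave up: last value was '${actual}'" >&2
  return 1
}

# Example (needs cluster access; the field name is an assumption):
#   wait_for_output READY 30 10 kubectl get flinkdeployment <cluster-name> \
#     -n freeleaps-data-platform -o jsonpath='{.status.jobManagerDeploymentStatus}'
```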

## Production Deployment
```bash
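# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready before creating Flink resources
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system --timeout=300s

# 3. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 4. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 5. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink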
```