# Flink High Availability Cluster Deployment

## Overview

This project uses the Apache Flink Kubernetes Operator to deploy a high availability Flink cluster with persistent storage and automatic failover.

## Component Architecture

- **JobManager**: 2 replicas with high availability configuration
- **TaskManager**: 3 replicas for distributed processing
- **High Availability**: Kubernetes-based HA with persistent storage
- **Checkpointing**: Persistent checkpoint and savepoint storage

## File Description

### 1. flink-operator-v2.yaml

Flink Kubernetes Operator deployment configuration:

- Operator deployment in the `flink-system` namespace
- RBAC configuration for cluster-wide permissions
- Health checks and resource limits
- Enhanced CRD definitions with additional printer columns

### 2. flink-crd.yaml

Custom Resource Definitions for Flink:

- FlinkDeployment CRD
- FlinkSessionJob CRD
- Required for the Flink Operator to function

### 3. ha-flink-cluster-v2.yaml

Production-ready HA Flink cluster configuration:

- 2 JobManager replicas with HA enabled
- 3 TaskManager replicas with anti-affinity rules
- Persistent storage for HA data, checkpoints, and savepoints
- Memory and CPU resource allocation
- Exponential delay restart strategy
- Volume mounts and storage configuration

### 4. simple-ha-flink-cluster.yaml

Simplified HA Flink cluster configuration:

- Uses ephemeral storage to avoid PVC binding issues
- Basic HA configuration
- Minimal resource requirements
- Recommended for development and testing

### 5. flink-storage.yaml

Storage and RBAC configuration:

- PersistentVolumeClaims for HA data, checkpoints, and savepoints
- ServiceAccount and RBAC permissions for the Flink cluster
- Azure Disk storage class configuration with the correct access modes

### 6. flink-rbac.yaml

Enhanced RBAC configuration:

- Complete permissions for Flink HA functionality
- Both namespace-level and cluster-level permissions
- Includes watch permissions for HA operations

## Deployment Steps

### 1. Install Flink Operator

```bash
# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system
```

### 2. Create Storage Resources (Optional - for production)

```bash
# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
```

### 3. Deploy HA Flink Cluster

```bash
# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## High Availability Features

- **JobManager HA**: 2 JobManager replicas with Kubernetes-based leader election
- **Persistent State**: Checkpoints and savepoints stored on persistent volumes
- **Automatic Failover**: Exponential delay restart strategy with backoff
- **Pod Anti-affinity**: Ensures components are distributed across different nodes
- **Storage Persistence**: HA data, checkpoints, and savepoints persist across restarts

## Network Configuration

- **JobManager**: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
- **TaskManager**: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
- **Service Type**: ClusterIP for internal communication

## Storage Configuration

- **HA Data**: 10Gi for high availability metadata
- **Checkpoints**: 20Gi for application checkpoints
- **Savepoints**: 20Gi for manual savepoints
- **Storage Class**: azure-disk-std-ssd-lrs
- **Access Mode**: ReadWriteOnce (Azure Disk limitation)

## Monitoring and Operations

- **Health Checks**: Built-in readiness and liveness probes
- **Web UI**: Accessible through the JobManager service
- **Metrics**: Exposed on port 8080 for Prometheus collection
- **Logging**: Centralized logging through Kubernetes

## Configuration Details

### High Availability Settings

- **Type**: kubernetes (native Kubernetes HA)
- **Storage**: Persistent volume for HA metadata
- **Cluster ID**: ha-flink-cluster-v2

### Checkpointing Configuration

- **Interval**: 60 seconds
- **Timeout**: 10 minutes
- **Min Pause**: 5 seconds
- **Backend**: Filesystem with persistent storage

### Resource Allocation

- **JobManager**: 0.5 CPU, 1024MB memory (HA); 1.0 CPU, 1024MB memory (Simple)
- **TaskManager**: 0.5 CPU, 2048MB memory (HA); 2.0 CPU, 2048MB memory (Simple)

## Troubleshooting

### Common Issues and Solutions

#### 1. PVC Binding Issues

```bash
# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
```

#### 2. Pod CrashLoopBackOff

```bash
# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

# Check pod logs (replace POD_NAME with the name of the failing pod)
kubectl logs POD_NAME -n freeleaps-data-platform

# Check pod events (replace POD_NAME as above)
kubectl describe pod POD_NAME -n freeleaps-data-platform
```

#### 3. ServiceAccount Issues

```bash
# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
```

#### 4. Storage Path Issues

```bash
# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
```

### Diagnostic Commands

```bash
# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment -n freeleaps-data-platform

# Check pod events
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator
```

## Important Notes

1. **Storage Limitations**: The Azure Disk storage class only supports the ReadWriteOnce access mode
2. **ServiceAccount**: Ensure the correct ServiceAccount is specified in the cluster configuration
3. **Resource Requirements**: Verify the cluster has enough CPU and memory for all replicas
4. **Network Policies**: May need adjustment for inter-pod communication
5. **Ephemeral vs Persistent**: Use ephemeral storage for development/testing, persistent storage for production

## Quick Start (Recommended for Testing)

```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## Production Deployment

```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```
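
## Example Configuration Sketch

For orientation, the values listed under Configuration Details could map onto a `FlinkDeployment` resource roughly as shown here. This is a minimal sketch and not the actual contents of `ha-flink-cluster-v2.yaml`. The Flink version, image tag, and ServiceAccount name are assumptions made for illustration, and the exact configuration key names vary between Flink releases (older releases use `high-availability` and `restart-strategy` without the `.type` suffix), so check them against the version you deploy.

```yaml
# Sketch only: values marked "assumption" do not come from this README.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ha-flink-cluster-v2            # Cluster ID from "High Availability Settings"
  namespace: freeleaps-data-platform
spec:
  flinkVersion: v1_18                  # assumption: version is not stated in this README
  image: flink:1.18                    # assumption: image is not stated in this README
  serviceAccount: flink                # assumption: hypothetical ServiceAccount name
  flinkConfiguration:
    # Kubernetes-based HA with metadata on the persistent HA volume
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha-data
    # Checkpointing: 60 s interval, 10 min timeout, 5 s minimum pause
    execution.checkpointing.interval: 60s
    execution.checkpointing.timeout: 10min
    execution.checkpointing.min-pause: 5s
    state.checkpoints.dir: file:///opt/flink/checkpoints
    # Exponential delay restart strategy
    restart-strategy.type: exponential-delay
  jobManager:
    replicas: 2                        # 2 JobManager replicas for HA
    resource:
      cpu: 0.5
      memory: "1024m"
  taskManager:
    replicas: 3                        # 3 TaskManager replicas
    resource:
      cpu: 0.5
      memory: "2048m"
```

The pod template, anti-affinity rules, and the volume mounts that back `/opt/flink/ha-data` and `/opt/flink/checkpoints` are omitted here. For the ephemeral variant, the same paths would sit under `/tmp/flink` as noted in Storage Path Issues.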