freeleaps-ops/cluster/manifests/freeleaps-data-platform/flink/README.md

# Flink High Availability Cluster Deployment

## Overview
This project uses the Apache Flink Kubernetes Operator to deploy a highly available Flink cluster with persistent storage and automatic failover.

## Component Architecture
- **JobManager**: 2 replicas with high availability configuration
- **TaskManager**: 3 replicas for distributed processing
- **High Availability**: Kubernetes-based HA with persistent storage
- **Checkpointing**: Persistent checkpoints and savepoints storage

## File Description

### 1. flink-operator-v2.yaml
Flink Kubernetes Operator deployment configuration:
- Operator deployment in `flink-system` namespace
- RBAC configuration for cluster-wide permissions
- Health checks and resource limits
- Enhanced CRD definitions with additional printer columns

### 2. flink-crd.yaml
Custom Resource Definitions for Flink:
- FlinkDeployment CRD
- FlinkSessionJob CRD
- Required for Flink Operator to function

### 3. ha-flink-cluster-v2.yaml
Production-ready HA Flink cluster configuration:
- 2 JobManager replicas with HA enabled
- 3 TaskManager replicas with anti-affinity rules
- Persistent storage for HA data, checkpoints, and savepoints
- Memory and CPU resource allocation
- Exponential delay restart strategy
- Proper volume mounts and storage configuration

### 4. simple-ha-flink-cluster.yaml
Simplified HA Flink cluster configuration for development and testing:
- Uses ephemeral storage to avoid PVC binding issues (data is lost when pods are deleted)
- Basic HA configuration
- Minimal resource requirements

### 5. flink-storage.yaml
Storage and RBAC configuration:
- PersistentVolumeClaims for HA data, checkpoints, and savepoints
- ServiceAccount and RBAC permissions for Flink cluster
- Azure Disk storage class configuration with correct access modes

### 6. flink-rbac.yaml
Enhanced RBAC configuration:
- Complete permissions for Flink HA functionality
- Both namespace-level and cluster-level permissions
- Includes watch permissions for HA operations

## Deployment Steps

### 1. Install Flink Operator
```bash
# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system
```

### 2. Create Storage Resources (required only for the persistent-storage deployment)
```bash
# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
```

### 3. Deploy HA Flink Cluster
```bash
# Choose ONE of the two options below; do not apply both.
# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```
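
After applying a cluster manifest, it can help to block until the JobManager deployment reports ready instead of re-running `kubectl get` by hand. The sketch below is one way to do that, under assumptions: `wait_until_ready` and `jm_status` are hypothetical helper names, the FlinkDeployment is assumed to be named `ha-flink-cluster-v2` (taken from the Cluster ID in this README), and the status is read from the operator's `status.jobManagerDeploymentStatus` field.

```bash
# wait_until_ready <attempts> <sleep-seconds> <command...>
# Runs <command...> repeatedly until it prints READY or attempts run out.
wait_until_ready() {
  local attempts="$1" pause="$2" status="" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    status="$("$@" 2>/dev/null)"
    if [[ "$status" == "READY" ]]; then
      echo "ready after ${i} check(s)"
      return 0
    fi
    sleep "$pause"
  done
  echo "not ready after ${attempts} checks (last status: ${status:-unknown})" >&2
  return 1
}

# Reads the JobManager deployment status of a FlinkDeployment.
# Assumption: the resource lives in the freeleaps-data-platform namespace.
jm_status() {
  kubectl get flinkdeployment "$1" -n freeleaps-data-platform \
    -o jsonpath='{.status.jobManagerDeploymentStatus}'
}

# Example (requires cluster access), polling every 10 s for up to 5 minutes:
#   wait_until_ready 30 10 jm_status ha-flink-cluster-v2
```

Passing the status command as arguments keeps the polling loop independent of `kubectl`, so the same helper can wrap any other readiness check.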

## High Availability Features
- **JobManager HA**: 2 JobManager replicas with Kubernetes-based leader election
- **Persistent State**: Checkpoints and savepoints stored on persistent volumes
- **Automatic Failover**: A standby JobManager takes over when the leader fails, and failed jobs restart under an exponential-delay restart strategy
- **Pod Anti-affinity**: Ensures components are distributed across different nodes
- **Storage Persistence**: HA data, checkpoints, and savepoints persist across restarts
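
To make the exponential-delay restart strategy concrete, the sketch below prints the delay before each successive restart attempt. The numbers are illustrative assumptions, not the values from `ha-flink-cluster-v2.yaml`: an initial backoff of 1 s, a multiplier of 2, and a cap of 60 s, which would correspond to the Flink options `restart-strategy.exponential-delay.initial-backoff`, `backoff-multiplier`, and `max-backoff`. Jitter and the reset threshold are ignored, and `backoff_schedule` is a hypothetical helper name.

```bash
# backoff_schedule <initial-seconds> <multiplier> <max-seconds> <attempts>
# Prints one delay per line; integer arithmetic only, no jitter.
backoff_schedule() {
  local delay="$1" multiplier="$2" max="$3" attempts="$4"
  local i
  for ((i = 0; i < attempts; i++)); do
    printf '%s\n' "$delay"
    delay=$((delay * multiplier))
    if ((delay > max)); then
      delay="$max"
    fi
  done
}

# Illustrative values only: 1 s initial backoff, doubling, capped at 60 s.
backoff_schedule 1 2 60 8 | paste -sd' ' -
# prints: 1 2 4 8 16 32 60 60
```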

## Network Configuration
- **JobManager**: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
- **TaskManager**: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
- **Service Type**: ClusterIP for internal communication

## Storage Configuration
- **HA Data**: 10Gi for high availability metadata
- **Checkpoints**: 20Gi for application checkpoints
- **Savepoints**: 20Gi for manual savepoints
- **Storage Class**: azure-disk-std-ssd-lrs
- **Access Mode**: ReadWriteOnce (Azure Disk limitation)
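
The sizes, storage class, and access mode above map onto PersistentVolumeClaim manifests roughly as follows. This is a sketch, not the contents of `flink-storage.yaml`: the claim names (`flink-ha-data`, `flink-checkpoints`, `flink-savepoints`) and the helper `render_pvc` are hypothetical.

```bash
# render_pvc <claim-name> <size>
# Prints a PVC manifest using the storage class and access mode listed above.
render_pvc() {
  local name="$1" size="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: freeleaps-data-platform
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azure-disk-std-ssd-lrs
  resources:
    requests:
      storage: ${size}
EOF
}

# Hypothetical claim names; sizes come from the list above.
render_pvc flink-ha-data 10Gi
echo '---'
render_pvc flink-checkpoints 20Gi
echo '---'
render_pvc flink-savepoints 20Gi
```

The output can be reviewed as-is or piped to `kubectl apply -f -` once the names match what the cluster manifest mounts.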

## Monitoring and Operations
- **Health Checks**: Built-in readiness and liveness probes
- **Web UI**: Accessible through JobManager service
- **Metrics**: Exposed on port 8080 for Prometheus collection
- **Logging**: Centralized logging through Kubernetes

## Configuration Details

### High Availability Settings
- **Type**: kubernetes (native Kubernetes HA)
- **Storage**: Persistent volume for HA metadata
- **Cluster ID**: ha-flink-cluster-v2
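
In Flink configuration terms, these settings correspond to a handful of keys. The sketch below prints them, assuming a recent Flink release (older releases use `high-availability` rather than `high-availability.type`) and assuming the HA volume is mounted at `/opt/flink/ha-data`, the persistent path mentioned under Storage Path Issues. `print_ha_config` is a hypothetical helper; the authoritative values are in `ha-flink-cluster-v2.yaml`.

```bash
# print_ha_config [cluster-id] [ha-mount-path]
# Prints flinkConfiguration entries implied by the HA settings above.
print_ha_config() {
  local cluster_id="${1:-ha-flink-cluster-v2}"
  local ha_dir="${2:-/opt/flink/ha-data}"   # assumption: HA volume mount path
  cat <<EOF
high-availability.type: kubernetes
high-availability.storageDir: file://${ha_dir}
kubernetes.cluster-id: ${cluster_id}
EOF
}

print_ha_config
```

When the cluster is managed by the Flink Kubernetes Operator, `kubernetes.cluster-id` is typically derived from the FlinkDeployment name, so it may not need to be set explicitly.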

### Checkpointing Configuration
- **Interval**: 60 seconds
- **Timeout**: 10 minutes
- **Min Pause**: 5 seconds
- **Backend**: Filesystem with persistent storage
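
The interval, timeout, and minimum pause correspond to the Flink options `execution.checkpointing.interval`, `execution.checkpointing.timeout`, and `execution.checkpointing.min-pause`. The sketch below illustrates how the interval and the minimum pause interact: a checkpoint that finishes quickly is followed by the next one after the full interval, while a slow checkpoint pushes the next one out so that the pause is still respected. It is a rough model that assumes at most one checkpoint in flight; `next_checkpoint_gap` is a hypothetical helper.

```bash
# next_checkpoint_gap <checkpoint-duration-s> [interval-s] [min-pause-s]
# Approximate seconds between the start of one checkpoint and the next.
# Defaults mirror the settings above: 60 s interval, 5 s minimum pause.
next_checkpoint_gap() {
  local duration_s="$1" interval_s="${2:-60}" min_pause_s="${3:-5}"
  local earliest=$((duration_s + min_pause_s))
  if ((earliest > interval_s)); then
    echo "$earliest"
  else
    echo "$interval_s"
  fi
}

next_checkpoint_gap 10   # fast checkpoint: the 60 s interval dominates
next_checkpoint_gap 58   # slow checkpoint: 58 s + 5 s pause = 63 s
```

A checkpoint that runs longer than the 10 minute timeout is aborted rather than delaying the schedule further.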

### Resource Allocation
- **JobManager**: 0.5 CPU and 1024 MB memory (HA); 1.0 CPU and 1024 MB memory (Simple)
- **TaskManager**: 0.5 CPU and 2048 MB memory (HA); 2.0 CPU and 2048 MB memory (Simple)
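
To check whether the Kubernetes cluster has room for every replica, the per-pod figures can be multiplied out. The sketch below does this for the HA profile, assuming the replica counts from Component Architecture (2 JobManagers, 3 TaskManagers); `total_request` is a hypothetical helper, and the totals exclude the operator and any per-pod overhead.

```bash
# total_request <amount> <replicas> [<amount> <replicas> ...]
# Sums amount * replicas over all pairs (integers only).
total_request() {
  local total=0
  while (($# >= 2)); do
    total=$((total + $1 * $2))
    shift 2
  done
  printf '%s\n' "$total"
}

# HA profile: 2 JobManagers and 3 TaskManagers.
total_request 500 2 500 3     # CPU in millicores  -> 2500 (2.5 CPU)
total_request 1024 2 2048 3   # memory in MB       -> 8192
```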

## Troubleshooting

### Common Issues and Solutions

#### 1. PVC Binding Issues
```bash
# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
```
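
When several claims exist, it can be quicker to list only the ones that are stuck. The sketch below filters `kubectl get pvc --no-headers` output for claims whose STATUS column is `Pending`; `pending_pvcs` is a hypothetical helper and the sample rows are made up for illustration.

```bash
# Reads `kubectl get pvc --no-headers` output on stdin and prints the names
# of claims whose second column (STATUS) is Pending.
pending_pvcs() {
  awk '$2 == "Pending" { print $1 }'
}

# Made-up sample rows standing in for real kubectl output.
printf '%s\n' \
  'flink-ha-data      Bound    pvc-0001  10Gi  RWO  azure-disk-std-ssd-lrs  5m' \
  'flink-checkpoints  Pending                       azure-disk-std-ssd-lrs  5m' \
  | pending_pvcs
# prints: flink-checkpoints

# Against a live cluster:
#   kubectl get pvc -n freeleaps-data-platform --no-headers | pending_pvcs
```

If the storage class uses `volumeBindingMode: WaitForFirstConsumer`, a claim stays Pending until a pod that mounts it is scheduled, which is expected rather than an error. `kubectl describe pvc <pvc-name>` shows the events that explain the state.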

#### 2. Pod CrashLoopBackOff
```bash
# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

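# Check pod logs (use --previous to see the logs of the last crashed container)
kubectl logs <pod-name> -n freeleaps-data-platform
kubectl logs <pod-name> -n freeleaps-data-platform --previous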

# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
```

#### 3. ServiceAccount Issues
```bash
# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
```
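
Listing role bindings shows that they exist but not whether they grant what Flink needs. Kubernetes-based HA keeps its leader and job metadata in ConfigMaps, so one way to verify the permissions is to impersonate the ServiceAccount with `kubectl auth can-i`. In the sketch below the ServiceAccount name `flink` is an assumption (use the name from `flink-storage.yaml` or `flink-rbac.yaml`), and `sa_subject` and `check_ha_permissions` are hypothetical helpers.

```bash
# sa_subject <namespace> <serviceaccount>
# Prints the user name Kubernetes uses for a ServiceAccount.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# check_ha_permissions <namespace> <serviceaccount>
# Asks the API server whether the ServiceAccount may manage ConfigMaps.
# Requires cluster access and permission to impersonate.
check_ha_permissions() {
  local ns="$1" sa="$2" verb
  for verb in get list watch create update delete; do
    printf '%-7s configmaps: ' "$verb"
    kubectl auth can-i "$verb" configmaps \
      --as="$(sa_subject "$ns" "$sa")" -n "$ns"
  done
}

sa_subject freeleaps-data-platform flink
# prints: system:serviceaccount:freeleaps-data-platform:flink

# Against a live cluster:
#   check_ha_permissions freeleaps-data-platform flink
```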

#### 4. Storage Path Issues
```bash
# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
```
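
A mismatch between a configured storage directory and the volume mount path is easy to miss by eye because the configuration uses `file://` URIs while mounts use plain paths. The sketch below compares the two; `uri_to_path` and `is_under_mount` are hypothetical helpers, and the paths are the ones listed above.

```bash
# uri_to_path <uri>
# Strips a leading file:// scheme, leaving the filesystem path.
uri_to_path() {
  printf '%s\n' "${1#file://}"
}

# is_under_mount <storage-uri> <mount-path>
# Succeeds if the storage directory is the mount path or lies below it.
is_under_mount() {
  local path mount="${2%/}"
  path="$(uri_to_path "$1")"
  [[ "$path" == "$mount" || "$path" == "$mount"/* ]]
}

is_under_mount file:///opt/flink/ha-data /opt/flink/ha-data && echo match
is_under_mount file:///tmp/flink/ha-data /opt/flink/ha-data || echo mismatch
```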

### Diagnostic Commands
```bash
# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform

# List recent events in the namespace, oldest first
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator
```

## Important Notes
1. **Storage Limitations**: The Azure Disk storage class used here supports only the ReadWriteOnce access mode, so each volume can be mounted read-write by a single node at a time
2. **ServiceAccount**: Ensure the correct ServiceAccount is specified in cluster configuration
3. **Resource Requirements**: Verify that the cluster has enough CPU and memory for all replicas
4. **Network Policies**: Network policies may need to allow traffic between JobManager and TaskManager pods on the ports listed under Network Configuration
5. **Ephemeral vs Persistent**: Use ephemeral storage for development/testing, persistent for production

## Quick Start (Recommended for Testing)
```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready (kubectl wait times out after 30s unless --timeout is set)
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system --timeout=180s

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## Production Deployment
```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

deploy flink 2025-08-21 02:50:07 +00:00