# Flink High Availability Cluster Deployment

## Overview

This project uses the Apache Flink Kubernetes Operator to deploy a high-availability Flink cluster with persistent storage and automatic failover.

## Component Architecture

- **JobManager**: 2 replicas with high availability configuration
- **TaskManager**: 3 replicas for distributed processing
- **High Availability**: Kubernetes-based HA with persistent storage
- **Checkpointing**: Persistent checkpoints and savepoints storage

## File Description

### 1. flink-operator-v2.yaml

Flink Kubernetes Operator deployment configuration:

- Operator deployment in `flink-system` namespace
- RBAC configuration for cluster-wide permissions
- Health checks and resource limits
- Enhanced CRD definitions with additional printer columns

### 2. flink-crd.yaml

Custom Resource Definitions for Flink:

- FlinkDeployment CRD
- FlinkSessionJob CRD
- Required for Flink Operator to function

### 3. ha-flink-cluster-v2.yaml

Production-ready HA Flink cluster configuration:

- 2 JobManager replicas with HA enabled
- 3 TaskManager replicas with anti-affinity rules
- Persistent storage for HA data, checkpoints, and savepoints
- Memory and CPU resource allocation
- Exponential delay restart strategy
- Proper volume mounts and storage configuration

### 4. simple-ha-flink-cluster.yaml

Simplified HA Flink cluster configuration:

- Uses ephemeral storage to avoid PVC binding issues
- Basic HA configuration for testing and development
- Minimal resource requirements
- Recommended for development and testing

### 5. flink-storage.yaml

Storage and RBAC configuration:

- PersistentVolumeClaims for HA data, checkpoints, and savepoints
- ServiceAccount and RBAC permissions for Flink cluster
- Azure Disk storage class configuration with correct access modes

### 6. flink-rbac.yaml

Enhanced RBAC configuration:

- Complete permissions for Flink HA functionality
- Both namespace-level and cluster-level permissions
- Includes watch permissions for HA operations

## Deployment Steps

### 1. Install Flink Operator

```bash
# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system
```

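
Listing pods shows a snapshot only. For scripted installs it can help to block until the operator has finished rolling out. The function below is a minimal sketch of that idea; it assumes the Deployment is named `flink-kubernetes-operator` (the name used in the diagnostic commands of this README), and the 180-second timeout is an arbitrary illustrative value.

```bash
# Block until the operator Deployment has finished rolling out.
# Optional arguments: namespace, deployment name, timeout.
wait_for_operator() {
  local namespace="${1:-flink-system}"
  local deployment="${2:-flink-kubernetes-operator}"
  local timeout="${3:-180s}"   # assumed value, tune for your cluster

  if kubectl rollout status "deployment/${deployment}" \
       -n "${namespace}" --timeout="${timeout}"; then
    echo "operator ready"
  else
    echo "operator not ready after ${timeout}" >&2
    return 1
  fi
}
```

Call `wait_for_operator` after the `kubectl apply` step; a non-zero return code means the rollout did not complete within the timeout.
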
### 2. Create Storage Resources (Optional - for production)

```bash
# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
```

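
To turn the PVC verification into a check that a script can act on, the following sketch prints only the claims that are not yet `Bound`. The function name is hypothetical; the namespace default is the one used throughout this README.

```bash
# Print every PVC in the namespace that is not in the Bound phase.
# Returns non-zero if at least one such PVC exists.
list_unbound_pvcs() {
  local namespace="${1:-freeleaps-data-platform}"
  local unbound
  unbound="$(kubectl get pvc -n "${namespace}" --no-headers \
      -o custom-columns=NAME:.metadata.name,STATUS:.status.phase \
    | awk '$2 != "Bound" { print $1 " " $2 }')"
  if [ -n "${unbound}" ]; then
    echo "${unbound}"
    return 1
  fi
}
```

Note that a storage class with `volumeBindingMode: WaitForFirstConsumer` keeps claims in `Pending` until a pod mounts them, so a `Pending` claim before the cluster is deployed is not necessarily an error.
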
### 3. Deploy HA Flink Cluster

```bash
# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## High Availability Features

- **JobManager HA**: 2 JobManager replicas with Kubernetes-based leader election
- **Persistent State**: Checkpoints and savepoints stored on persistent volumes
- **Automatic Failover**: Exponential delay restart strategy with backoff
- **Pod Anti-affinity**: Ensures components are distributed across different nodes
- **Storage Persistence**: HA data, checkpoints, and savepoints persist across restarts

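
To make the exponential delay behaviour concrete, the sketch below prints the wait before each restart attempt. In Flink this strategy is tuned through options such as `restart-strategy.exponential-delay.initial-backoff`, `restart-strategy.exponential-delay.backoff-multiplier`, and `restart-strategy.exponential-delay.max-backoff`. The numbers used here (1 s initial, multiplier 2, 60 s cap) are illustrative assumptions, not the values from this project's manifests, and the sketch ignores jitter and the reset threshold.

```bash
# Print the delay before each of the first N restart attempts.
# Arguments: initial delay (s), integer multiplier, max delay (s), attempts.
backoff_schedule() {
  local delay="$1" multiplier="$2" max="$3" attempts="$4"
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    echo "attempt ${i}: ${delay}s"
    # Grow the delay geometrically, then clamp it to the configured cap.
    delay=$(( delay * multiplier ))
    if [ "${delay}" -gt "${max}" ]; then delay="${max}"; fi
    i=$(( i + 1 ))
  done
}

backoff_schedule 1 2 60 8
```

With these inputs the delays are 1, 2, 4, 8, 16, 32, 60, and 60 seconds: the seventh attempt would have waited 64 s but is capped at 60 s.
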
## Network Configuration

- **JobManager**: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
- **TaskManager**: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
- **Service Type**: ClusterIP for internal communication

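
Because the services are ClusterIP, the Web UI on port 8081 is not reachable from outside the cluster. One common way to reach it from a workstation is a port-forward, sketched below. The service name `<cluster-name>-rest` and the default cluster name `ha-flink-cluster-v2` are assumptions; confirm the actual name with `kubectl get svc -n freeleaps-data-platform`.

```bash
# Forward the JobManager Web UI / REST port to localhost.
# Assumes the REST service is named "<cluster-name>-rest".
forward_web_ui() {
  local cluster="${1:-ha-flink-cluster-v2}"
  local namespace="${2:-freeleaps-data-platform}"
  local local_port="${3:-8081}"
  kubectl port-forward "svc/${cluster}-rest" "${local_port}:8081" -n "${namespace}"
}
```

Running `forward_web_ui` keeps the forward open in the foreground; the UI is then available at `http://localhost:8081` until the command is interrupted.
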
## Storage Configuration

- **HA Data**: 10Gi for high availability metadata
- **Checkpoints**: 20Gi for application checkpoints
- **Savepoints**: 20Gi for manual savepoints
- **Storage Class**: azure-disk-std-ssd-lrs
- **Access Mode**: ReadWriteOnce (Azure Disk limitation)

## Monitoring and Operations

- **Health Checks**: Built-in readiness and liveness probes
- **Web UI**: Accessible through JobManager service
- **Metrics**: Exposed on port 8080 for Prometheus collection
- **Logging**: Centralized logging through Kubernetes

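
The JobManager service that hosts the Web UI also serves Flink's REST API, which can be used for a simple scripted health check. The sketch below reads the `taskmanagers` field from the `/overview` endpoint and compares it with the 3 TaskManagers this README describes. It assumes the endpoint is reachable at `http://localhost:8081` (for example through a port-forward) and parses the JSON with `sed` to avoid extra dependencies; the function names are hypothetical.

```bash
# Read the number of registered TaskManagers from the Flink REST API.
taskmanager_count() {
  local base_url="${1:-http://localhost:8081}"
  curl -fsS "${base_url}/overview" \
    | sed -n 's/.*"taskmanagers":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Compare the registered count with the expected number of replicas.
check_taskmanagers() {
  local expected="${1:-3}" actual
  actual="$(taskmanager_count "${2:-http://localhost:8081}")"
  if [ "${actual}" = "${expected}" ]; then
    echo "ok: ${actual} TaskManagers registered"
  else
    echo "mismatch: expected ${expected}, got ${actual:-none}" >&2
    return 1
  fi
}
```

`check_taskmanagers 3` returns non-zero when the count differs or the endpoint cannot be reached, which makes it usable as a gate in a deployment script.
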
## Configuration Details

### High Availability Settings

- **Type**: kubernetes (native Kubernetes HA)
- **Storage**: Persistent volume for HA metadata
- **Cluster ID**: ha-flink-cluster-v2

### Checkpointing Configuration

- **Interval**: 60 seconds
- **Timeout**: 10 minutes
- **Min Pause**: 5 seconds
- **Backend**: Filesystem with persistent storage

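
To confirm that a deployed cluster actually carries these checkpointing values, one option is to read them back from the FlinkDeployment resource. The sketch below assumes the manifests set standard Flink keys such as `execution.checkpointing.interval`, `execution.checkpointing.timeout`, and `execution.checkpointing.min-pause` under `spec.flinkConfiguration`, and that the FlinkDeployment is named `ha-flink-cluster-v2`; neither is confirmed by this README.

```bash
# Read one key from spec.flinkConfiguration of a FlinkDeployment.
# Dots inside the key are escaped so jsonpath treats it as a single field.
flink_config_value() {
  local key="$1"
  local cluster="${2:-ha-flink-cluster-v2}"
  local namespace="${3:-freeleaps-data-platform}"
  local escaped
  escaped="$(printf '%s' "${key}" | sed 's/\./\\./g')"
  kubectl get flinkdeployment "${cluster}" -n "${namespace}" \
    -o "jsonpath={.spec.flinkConfiguration.${escaped}}"
}
```

For example, `flink_config_value execution.checkpointing.interval` prints whatever value the manifest sets for the interval, or nothing if the key is absent.
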
### Resource Allocation

- **JobManager**: 0.5 CPU and 1024 MB memory (HA); 1.0 CPU and 1024 MB memory (Simple)
- **TaskManager**: 0.5 CPU and 2048 MB memory (HA); 2.0 CPU and 2048 MB memory (Simple)

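
As a quick capacity check, the per-replica figures can be multiplied out. The sketch below does this for the HA profile, using the replica counts stated in this README (2 JobManagers, 3 TaskManagers). CPU is expressed in millicores so that shell integer arithmetic is sufficient.

```bash
# Sum CPU (millicores) and memory (MB) across all replicas.
# Arguments: JM replicas, JM CPU, JM memory, TM replicas, TM CPU, TM memory.
total_resources() {
  local jm_replicas="$1" jm_cpu_m="$2" jm_mem_mb="$3"
  local tm_replicas="$4" tm_cpu_m="$5" tm_mem_mb="$6"
  local cpu_m=$(( jm_replicas * jm_cpu_m + tm_replicas * tm_cpu_m ))
  local mem_mb=$(( jm_replicas * jm_mem_mb + tm_replicas * tm_mem_mb ))
  echo "cpu=${cpu_m}m memory=${mem_mb}MB"
}

# HA profile: 2 x (0.5 CPU, 1024 MB) + 3 x (0.5 CPU, 2048 MB)
total_resources 2 500 1024 3 500 2048   # prints "cpu=2500m memory=8192MB"
```

The HA profile therefore needs about 2.5 CPU and 8 GiB of memory for Flink alone; the nodes also need headroom for the operator and system pods.
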
## Troubleshooting

### Common Issues and Solutions

#### 1. PVC Binding Issues

```bash
# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
```

#### 2. Pod CrashLoopBackOff

```bash
# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

# Check pod logs
kubectl logs <pod-name> -n freeleaps-data-platform

# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
```

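
For a pod in CrashLoopBackOff, the current container has often just restarted and logged very little; the useful output is usually in the container that crashed. The sketch below tries `kubectl logs --previous` first and falls back to the current container. The function name and the 100-line tail are illustrative choices.

```bash
# Print logs from the previous (crashed) container of a pod,
# falling back to the current container if no previous one exists.
crash_logs() {
  local pod="$1"
  local namespace="${2:-freeleaps-data-platform}"
  kubectl logs "${pod}" -n "${namespace}" --previous --tail=100 2>/dev/null \
    || kubectl logs "${pod}" -n "${namespace}" --tail=100
}
```

Call it as `crash_logs <pod-name>`, using a pod name taken from the `kubectl get pods` output.
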
#### 3. ServiceAccount Issues

```bash
# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
```

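
Listing role bindings shows that bindings exist, not what they allow. Since Kubernetes-based HA stores leader election and HA metadata in ConfigMaps, a more direct check is to ask the API server whether the ServiceAccount may act on them. The sketch below uses `kubectl auth can-i` with impersonation; the ServiceAccount name `flink` is an assumption, so substitute the name used in the cluster configuration. The caller needs permission to impersonate ServiceAccounts.

```bash
# Check that a ServiceAccount may perform the ConfigMap verbs that
# Kubernetes-based HA relies on. The default account name is an assumption.
check_ha_permissions() {
  local namespace="${1:-freeleaps-data-platform}"
  local account="${2:-flink}"
  local verb failed=0
  for verb in get list watch create update delete; do
    if kubectl auth can-i "${verb}" configmaps -n "${namespace}" \
         --as="system:serviceaccount:${namespace}:${account}" >/dev/null 2>&1; then
      echo "${verb} configmaps: allowed"
    else
      echo "${verb} configmaps: DENIED"
      failed=1
    fi
  done
  return "${failed}"
}
```

Any `DENIED` line points at a verb missing from the Role or ClusterRole bound to the account.
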
#### 4. Storage Path Issues

```bash
# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
```

### Diagnostic Commands

```bash
# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform

# Check pod events
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator
```

## Important Notes

1. **Storage Limitations**: Azure Disk storage class only supports ReadWriteOnce access mode
2. **ServiceAccount**: Ensure the correct ServiceAccount is specified in cluster configuration
3. **Resource Requirements**: Verify cluster has enough CPU/memory for all replicas
4. **Network Policies**: May need adjustment for inter-pod communication
5. **Ephemeral vs Persistent**: Use ephemeral storage for development/testing, persistent for production

## Quick Start (Recommended for Testing)

```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```

## Production Deployment

```bash
# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
```