171 changes: 171 additions & 0 deletions AGENTS.md
@@ -0,0 +1,171 @@
# AGENTS.md

## Project Identity

HyperFleet Sentinel is a **Kubernetes resource watcher** that polls the HyperFleet API for cluster/nodepool updates, makes orchestration decisions via CEL-based decision logic, and publishes CloudEvents to message brokers. It is stateless, horizontally scalable via label-based sharding, and delegates all state persistence to the API.

- **Language**: Go 1.25 (see `go.mod`)
- **Messaging**: Broker abstraction (RabbitMQ, GCP Pub/Sub, Stub)
- **API Client**: Generated from [hyperfleet-api-spec](https://github.com/openshift-hyperfleet/hyperfleet-api-spec) — see [openapi/README.md](openapi/README.md)
- **Deployment**: Helm chart in `charts/`

Sentinel is one component in the HyperFleet control plane:
- **API** — persists cluster/nodepool state (source of truth)
- **Sentinel** — watches API, decides when resources need reconciliation, publishes events
- **Adapters** — consume events, execute provisioning/deprovisioning, report back to API
- **Broker** (RabbitMQ or Pub/Sub) — decouples Sentinel from adapters

## Critical First Steps

**Generated OpenAPI client is NOT committed to git.** Before any build, test, or development task:

```bash
make generate # Extracts OpenAPI spec from hyperfleet-api-spec module and generates Go client
```

Setup sequence for a fresh clone:
1. `make generate` — generate OpenAPI client in `pkg/api/openapi/`
2. `make download` — fetch Go dependencies
3. `make build` — build `bin/sentinel` binary
4. `make test` — verify the full test suite (`./...`) passes

## Verification

| Command | What it does |
|---|---|
| `make verify` | go vet + format check (fast) |
| `make lint` | golangci-lint (comprehensive) |
| `make test` | all tests (`./...`), writes `coverage.out` profile |
| `make test-unit` | unit tests only — specific internal/ and pkg/ packages |
| `make test-integration` | integration tests with testcontainers (Docker required) |
| `make test-coverage` | runs `make test` then opens HTML coverage report |
| `make test-helm` | Helm chart lint + template validation (10 scenarios) |
| `make test-all` | test + test-integration + test-helm + lint |

Quick feedback: `make verify && make test-unit`. Full pre-push: `make test-all`.

**PR pre-flight order:**
1. `make generate`
2. `make fmt`
3. `make lint`
4. `make test-unit`
5. `make test-integration` — if broker/API changes
6. `make test-helm` — if chart changes
7. Update CHANGELOG.md if the change is user-visible

## Source of Truth

| Topic | Where to look |
|---|---|
| Configuration reference | [docs/config.md](docs/config.md) |
| Metrics definitions | [docs/metrics.md](docs/metrics.md), `internal/metrics/` |
| Local/GKE deployment | [docs/running-sentinel.md](docs/running-sentinel.md) |
| Multi-instance sharding | [docs/multi-instance-deployment.md](docs/multi-instance-deployment.md) |
| Alerts and runbooks | [docs/alerts.md](docs/alerts.md), [docs/runbook.md](docs/runbook.md) |
| Helm values | [charts/values.yaml](charts/values.yaml) |
| Contributing and setup | [CONTRIBUTING.md](CONTRIBUTING.md) |
| OpenAPI client generation | [openapi/README.md](openapi/README.md) |
| Example configs | `configs/dev-example.yaml`, `configs/rabbitmq-example.yaml`, `configs/gcp-pubsub-example.yaml` |
| Broker configuration | `broker.yaml` (loaded by hyperfleet-broker; override path via `BROKER_CONFIG_FILE` env var) |
| CloudEvents / CEL payloads | `internal/payload/` |
| Resource profiling | [docs/resource-profiling.md](docs/resource-profiling.md) |

## Architecture Context

Sentinel's job: **decide when**, not **execute how**. It can be killed and restarted at any time without data loss — this is what makes label-based sharding safe. The `message_decision` config uses CEL expressions to decide when to publish — see `DefaultMessageDecision()` in `internal/config/config.go` for default expressions.

### Key Internal Patterns
- **Config validation fails fast** — `Validate()` returns error at startup, `LoadConfig()` propagates to main which exits non-zero
- **Context propagation** — `context.Context` threaded through all calls with correlation keys (OpID, TraceID, SpanID, DecisionReason)
- **Health probes** — `/healthz` (liveness: stale poll detection), `/readyz` (readiness: broker + first successful poll)

## Code Conventions

### Commit Messages
Format: `HYPERFLEET-### - type: description`

Example:
```
HYPERFLEET-427 - feat: add standard metrics labels

Adds resource_type and resource_selector labels to all
Prometheus metrics for consistent querying.

Co-Authored-By: Claude <noreply@anthropic.com>
```
The `Co-Authored-By` trailer is required on all Claude-assisted commits.

### Configuration
- Config struct in `internal/config/config.go` — YAML struct tags, validation via `Validate()`
- All durations use `time.Duration` with YAML `duration` format (e.g., `5s`, `30m`)
- Config precedence (highest wins): CLI flags > env vars (`HYPERFLEET_*`) > YAML file > defaults
- Broker credentials handled separately via `broker.yaml` (or `BROKER_CONFIG_FILE` env var)

### CLI Commands
- `sentinel serve --config config.yaml` — run the service
- `sentinel config-dump --config config.yaml` — print merged config (debug precedence issues)
- `sentinel version` — print version, commit, build date
- Run `sentinel serve --help` for full flag list

### Error Handling
- Log at boundaries (main service loop), not deep in call stack

### Logging
- Custom structured logger in `pkg/logger/` — stdlib only, no external deps
- Interface: `logger.HyperFleetLogger` with `Info()`, `Error()`, `Warn()`, `Debug()`, `V(level)` (verbosity), `Extra()`
- Create via `logger.NewHyperFleetLogger()` — uses global config
- Chaining: `logger.Extra("key", val).Extra("key2", val2).Info("msg")`
- **IMPORTANT: always use `pkg/logger`, never `log/slog` directly**

### CloudEvents Payloads
`message_data` config uses CEL expressions, not static values:
```yaml
message_data:
  id: resource.id
  kind: resource.kind
  href: resource.href
```
CEL context:
- `resource` — cluster/nodepool object from API (id, kind, href, generation, status, labels, etc.)
- `reason` — decision reason string from engine (e.g., `"message decision matched"`, `"message decision result is false"`)
- `condition("Type")` — custom function to look up resource status condition by type name
- `now` — current timestamp
- `timestamp()`, `duration()` — standard CEL time functions

### Testing
- Table-driven tests with plain `if` assertions — no testify
- Mocking via simple interface implementations (e.g., MockPublisher), no gomock
- Unit tests live alongside code: `foo_test.go` next to `foo.go`
- Integration tests in `test/integration/` with `//go:build integration` tag
- Prometheus metrics verified with `prometheus/testutil`
- Run single test: `go test -run TestDecisionEngine ./internal/engine/...`
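
The conventions above (table-driven cases, plain `if` checks, hand-written mocks) can be sketched as follows. Everything here is hypothetical: the `Publisher` interface, the `publishIfNeeded` function, and the topic name are invented for illustration. To keep the sketch runnable as a standalone program, failures are collected in a slice; in a real `foo_test.go` the loop would live in a `func TestXxx(t *testing.T)` and call `t.Errorf` instead.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is a hypothetical stand-in for the broker abstraction.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// MockPublisher is a hand-written mock: it counts calls and returns Err.
type MockPublisher struct {
	Calls int
	Err   error
}

func (m *MockPublisher) Publish(topic string, payload []byte) error {
	m.Calls++
	return m.Err
}

// publishIfNeeded is the hypothetical function under test.
func publishIfNeeded(p Publisher, decision bool) error {
	if !decision {
		return nil
	}
	return p.Publish("clusters", []byte(`{"id":"c1"}`))
}

// runTable executes the cases and returns one message per failed check.
func runTable() []string {
	errBroker := errors.New("broker down")

	tests := []struct {
		name       string
		decision   bool
		publishErr error
		wantCalls  int
		wantErr    error
	}{
		{name: "decision false skips publish", decision: false, wantCalls: 0},
		{name: "decision true publishes", decision: true, wantCalls: 1},
		{name: "publish error is returned", decision: true, publishErr: errBroker, wantCalls: 1, wantErr: errBroker},
	}

	var failures []string
	for _, tt := range tests {
		mock := &MockPublisher{Err: tt.publishErr}
		err := publishIfNeeded(mock, tt.decision)

		// Plain if assertions, no assertion library.
		if !errors.Is(err, tt.wantErr) {
			failures = append(failures, fmt.Sprintf("%s: err = %v, want %v", tt.name, err, tt.wantErr))
		}
		if mock.Calls != tt.wantCalls {
			failures = append(failures, fmt.Sprintf("%s: calls = %d, want %d", tt.name, mock.Calls, tt.wantCalls))
		}
	}
	return failures
}

func main() {
	failures := runTable()
	for _, f := range failures {
		fmt.Println("FAIL:", f)
	}
	fmt.Println("failed checks:", len(failures))
}
```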

## Git Workflow

- Branch from `main`, PR back to `main`
- Branch naming: `HYPERFLEET-###-short-description`
- Pre-commit hooks: run `make install-hooks` to install — enforces commit message format (`hyperfleet-commitlint`), Go formatting, linting, and vet

## Project Boundaries

**DO NOT**:
- Add business logic to Sentinel — orchestration decisions only, execution belongs in adapters
- Store state in Sentinel — it is stateless, API is source of truth
- Hardcode the resource polling interval — always use `poll_interval` from config for the main sentinel loop; adding a second resource polling loop bypasses the single-ticker backpressure model

**DO**:
- Update `hyperfleet-api-spec` version in `go.mod` and run `make generate` when API spec changes
- New exported functions require unit tests; new broker/API interactions require integration tests
- Add metrics when adding observable behavior — see [docs/metrics.md](docs/metrics.md) for conventions
- Convention: `message_data` should include `id`, `kind`, `href` fields (not enforced by validation, but expected by downstream adapters) — see `configs/dev-example.yaml`
- Use broker abstraction (`hyperfleet-broker`) — never import RabbitMQ/Pub/Sub clients directly

## Gotchas

- **`make generate` is mandatory** — build and tests fail without it; generated code is gitignored
- **`pkg/api/openapi/` is read-only** — never hand-edit, always regenerate
- **Broker config comes from `broker.yaml`** (or `BROKER_CONFIG_FILE` env var), not sentinel YAML config — handled by hyperfleet-broker library
- **CEL expressions in `message_data` are compiled at startup** — syntax errors fail fast, but semantic errors (wrong field names on resource) surface at evaluation time
- **Metrics labels must include `resource_type` and `resource_selector`** — see [docs/metrics.md](docs/metrics.md) for naming conventions
- **Metrics use `sync.Once` registration** — call `ResetSentinelMetrics()` in tests to avoid duplicate registration panics
- **No testify** — project uses plain Go assertions and table-driven tests; don't introduce testify
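
The `sync.Once` gotcha can be illustrated with a toy registry that, like Prometheus's `MustRegister`, panics on a duplicate name. This is only a sketch of the general pattern: `registry`, `initMetrics`, and `resetMetrics` are hypothetical, and how `ResetSentinelMetrics()` is actually implemented is not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a toy stand-in for a metrics registry.
type registry struct {
	names map[string]bool
}

func newRegistry() *registry {
	return &registry{names: map[string]bool{}}
}

// mustRegister panics on a duplicate name.
func (r *registry) mustRegister(name string) {
	if r.names[name] {
		panic("duplicate metric registration: " + name)
	}
	r.names[name] = true
}

var (
	registerOnce  sync.Once
	registrations int // counts how often registration actually ran
)

// initMetrics registers at most once until resetMetrics is called, so
// repeated construction in the same process does not register twice.
func initMetrics(r *registry) {
	registerOnce.Do(func() {
		r.mustRegister("hyperfleet_sentinel_events_published_total")
		registrations++
	})
}

// resetMetrics re-arms the guard so the next initMetrics call registers
// again, typically against a fresh registry created by the test.
func resetMetrics() {
	registerOnce = sync.Once{}
}

func main() {
	r := newRegistry()
	initMetrics(r)
	initMetrics(r) // no-op: guarded by sync.Once

	// A test that needs a clean slate resets the guard and uses a new registry.
	resetMetrics()
	r = newRegistry()
	initMetrics(r)

	fmt.Println("registrations:", registrations)
}
```

In this sketch, re-arming the guard while reusing the old registry would hit the duplicate panic, which is why the reset and the fresh registry go together.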
178 changes: 1 addition & 177 deletions CLAUDE.md
@@ -1,177 +1 @@
# CLAUDE.md

## Project Identity

HyperFleet Sentinel is a **Kubernetes resource watcher** that polls the HyperFleet API for cluster/nodepool updates, makes orchestration decisions based on max age intervals, and publishes CloudEvents to message brokers. It is stateless, horizontally scalable via label-based sharding, and delegates all state persistence to the API.

- **Language**: Go 1.25+
- **Messaging**: Broker abstraction supporting RabbitMQ, GCP Pub/Sub, and Stub implementations
- **API Client**: Generated from the [hyperfleet-api-spec](https://github.com/openshift-hyperfleet/hyperfleet-api-spec) Go module — see [openapi/README.md](openapi/README.md)
- **Deployment**: Helm chart with PodMonitoring (GKE) and ServiceMonitor (Prometheus Operator)

## Critical First Steps

**Generated OpenAPI client is NOT committed to git.** Before any build, test, or development task:

```bash
make generate # Extracts OpenAPI spec from hyperfleet-api-spec module and generates Go client
```

Setup sequence for a fresh clone:
1. `make generate` — generate OpenAPI client in `pkg/api/openapi/`
2. `make download` — fetch Go dependencies
3. `make build` — build `bin/sentinel` binary
4. `make test` — verify unit tests pass

## Verification Commands

| Command | What it does |
|---|---|
| `make verify` | go vet + format check (fast) |
| `make lint` | golangci-lint (comprehensive) |
| `make test` | unit tests only (no external deps) |
| `make test-integration` | integration tests with testcontainers (RabbitMQ, Pub/Sub) |
| `make test-helm` | Helm chart lint and validation |
| `make test-all` | lint + unit + integration + helm tests |

Use `make verify && make test` for fast local feedback. Use `make test-all` before pushing.

## Code Conventions

### Commit Messages
Format: `HYPERFLEET-### - type: description`

Example:
```
HYPERFLEET-427 - feat: add standard metrics labels

Adds resource_type and resource_selector labels to all
Prometheus metrics for consistent querying.

Co-Authored-By: Claude <noreply@anthropic.com>
```

### Import Ordering
1. Standard library
2. External packages (`github.com/google/cel-go`, `github.com/prometheus/client_golang`)
3. HyperFleet packages (`github.com/openshift-hyperfleet/hyperfleet-broker`, etc.)
4. Internal packages (`github.com/openshift-hyperfleet/hyperfleet-sentinel/internal/...`)

### Configuration
- Config lives in `internal/config/config.go` — struct tags for YAML, validation via `Validate()`
- All durations use `time.Duration` with YAML `duration` format (e.g., `5s`, `30m`)
- Environment variables override YAML only for broker credentials (via hyperfleet-broker library)
- Config validation fails fast at startup — never run with invalid config

### Error Handling
- Errors propagate with context: `fmt.Errorf("failed to poll API: %w", err)`
- Log errors at the boundary (main service loop), not deep in call stack
- Use structured logging: `logger.Error("msg", "key", value, "error", err)`

### Metrics
- All metrics defined in `pkg/metrics/metrics.go` — use Prometheus client conventions
- Standard labels on all metrics: `resource_type`, `resource_selector`
- Counter: `_total` suffix (e.g., `hyperfleet_sentinel_events_published_total`)
- Gauge: no suffix (e.g., `hyperfleet_sentinel_pending_resources`)
- Histogram: `_seconds` suffix (e.g., `hyperfleet_sentinel_poll_duration_seconds`)

### Testing
- Unit tests: mock external dependencies (API client, broker), fast, deterministic
- Integration tests: testcontainers for real RabbitMQ/Pub/Sub, slower, covers end-to-end flows
- Test file naming: `*_test.go` alongside implementation
- Integration tests: `test/integration/*_test.go` with build tag `//go:build integration`

### CloudEvents Structure
Events use CEL expressions from `message_data` config to build payloads:
```yaml
message_data:
  id: resource.id  # CEL expressions, not static values
  kind: resource.kind
  href: resource.href
  generation: resource.generation
```

CEL context includes:
- `resource` — the cluster/nodepool object from API
- `reason` — decision string ("not_reconciled", "reconciled_stale", "reconciled_fresh")

## Project Boundaries

**DO NOT**:
- Modify generated code in `pkg/api/openapi/` — regenerate via `make generate` instead
- Add dependencies without checking licenses (`go-licenses` reports in CI)
- Commit broker credentials or GCP service account keys
- Add business logic to Sentinel — orchestration decisions only, execution belongs in adapters
- Store state in Sentinel — it is stateless, API is the source of truth
- Poll faster than API can handle — respect backpressure and rate limits

**DO**:
- Update `hyperfleet-api-spec` version in `go.mod` and run `make generate` when the API spec changes
- Add tests for new features (unit + integration if broker/API interaction)
- Update Prometheus metrics when adding observable behaviors
- Update CHANGELOG.md for user-visible changes
- Follow the ObjectReference pattern for CloudEvents payloads (id, kind, href)
- Use broker abstraction (`hyperfleet-broker`) — never import RabbitMQ/Pub/Sub clients directly

## Architecture Context

Sentinel is one component in the HyperFleet control plane:
- **API** persists cluster/nodepool state (source of truth)
- **Sentinel** watches API, decides when resources need reconciliation, publishes events
- **Adapters** consume events, execute provisioning/deprovisioning, report status back to API
- **Broker** (RabbitMQ or Pub/Sub) decouples Sentinel from adapters

Sentinel's job: **decide when**, not **execute how**. Max age intervals define "when":
- `max_age_not_reconciled`: poll frequently for unstable resources
- `max_age_reconciled`: poll infrequently for stable resources

## Local Development

```bash
# 1. Start HyperFleet API (see hyperfleet-api repo) and RabbitMQ
docker run -d -p 5672:5672 rabbitmq:3-management

# 2. Configure (see configs/dev-example.yaml and broker.yaml for templates)
# 3. Run Sentinel
./bin/sentinel serve --config config.yaml

# Watch events at http://localhost:15672 (guest/guest)
```

For detailed local/GKE deployment, see [docs/running-sentinel.md](docs/running-sentinel.md).

## Helm Chart

Chart lives in `charts/` with values for:
- Multiple Sentinel instances with different `resource_selector` (sharding)
- Monitoring: PodMonitoring (GKE/GMP) or ServiceMonitor (Prometheus Operator)
- Broker config via ConfigMap (type, topic) + Secret (credentials)

Example: deploy 2 Sentinels watching different shards:
```bash
helm install sentinel-shard-1 ./charts \
  --set config.resourceSelector[0].label=shard \
  --set config.resourceSelector[0].value=1 \
  --set broker.topic=hyperfleet-prod-clusters

helm install sentinel-shard-2 ./charts \
  --set config.resourceSelector[0].label=shard \
  --set config.resourceSelector[0].value=2 \
  --set broker.topic=hyperfleet-prod-clusters
```

Both read from the same API and publish to the same topic, but watch different label-filtered subsets.

## Validation Checklist

Before submitting a PR:
1. `make generate` — ensure OpenAPI client is current
2. `make fmt` — format code
3. `make verify` — vet and format check
4. `make lint` — pass golangci-lint
5. `make test` — pass unit tests
6. `make test-integration` — pass integration tests (if broker/API changes)
7. `make test-helm` — validate Helm chart
8. Update CHANGELOG.md for user-visible changes
9. Add metrics if new observable behavior
10. Commit message follows `HYPERFLEET-### - type: description` format
@AGENTS.md