feat(sandbox): integrate openshell-ocsf crate with dual-file output, OCSF toggle config, and log site migration #393

@johntmyers

Problem Statement

With the openshell-ocsf crate built and tested (see #392), the next step is to wire it into the sandbox supervisor. That means replacing all 93 file-level log sites with typed OCSF events, setting up dual-file output (openshell.log shorthand + openshell-ocsf.log JSONL), and adding the SandboxConfig proto mechanism for a gateway-controlled OCSF toggle with hot-reload.

Scope update (2026-03-17): generalized gateway -> sandbox arbitrary settings are tracked separately in #405 and are out of scope for this issue. #393 only includes the minimal SandboxConfig needed for OCSF toggle hot-reload (config_revision + logging.ocsf_enabled).

Proposed Design

This issue covers all integration work: proto changes, gateway config plumbing, sandbox subscriber wiring, config poll hot-reload, migration of every log site from ad-hoc info!()/warn!() to builder → ocsf_emit!(), OCSF profile enrichment, and E2E verification.

The full design is documented in .opencode/plans/ocsf-log-export.md. Key sections: "SandboxConfig: Gateway -> Sandbox OCSF Toggle Config", "Tracing Layer Integration", "Implementation Plan", "Delivery Plan — Part 2".

Architecture

tracing event
    │
    ▼
OcsfEvent struct (built in openshell-sandbox using openshell-ocsf builders)
    │
    ├──► OcsfShorthandLayer ──► /var/log/openshell.log  (always on)
    │    (openshell-ocsf)         └──► gRPC log push to gateway
    │
    └──► OcsfJsonlLayer     ──► /var/log/openshell-ocsf.log  (toggle via SandboxConfig)
         (openshell-ocsf)        wrapped in reload::Layer for hot-reload

SandboxConfig Proto

message SandboxConfig {
  uint64 config_revision = 1;
  LoggingConfig logging = 2;
}

message LoggingConfig {
  bool ocsf_enabled = 1;
}

SandboxConfig is added as optional SandboxConfig sandbox_config = 4 on GetSandboxPolicyResponse. Its config_revision is tracked independently of the policy version. In this issue, the channel is scoped to the OCSF logging toggle only; generalized arbitrary settings are tracked in #405.

Log Site Migration Scope

93 file-level log sites across 18 source files:

Event Class                        | Count | Primary Files
Network Activity [4001]            | 19    | proxy.rs, bypass_monitor.rs
HTTP Activity [4002]               | 7     | proxy.rs, l7/relay.rs
SSH Activity [4007]                | 10    | ssh.rs
Process Activity [1007]            | 4     | lib.rs, process.rs
Detection Finding [2004]           | 9     | ssh.rs, opa.rs, l7/relay.rs, proxy.rs, bypass_monitor.rs
Application Lifecycle [6002]       | 18    | main.rs, lib.rs, netns.rs
Device Config State Change [5019]  | 24    | lib.rs
Base Event [0]                     | 20    | netns.rs, mechanistic_mapper.rs, proxy.rs, bypass_monitor.rs

Dual-emit events: BYPASS_DETECT (Network Activity + Detection Finding), NSSH1 nonce replay (SSH Activity + Detection Finding).

Dependencies

Out of Scope

Order of Battle

Each step depends on prior steps unless noted.

Step 1: Proto changes (~0.5 day)

  • Add SandboxConfig and LoggingConfig messages to proto/sandbox.proto:
    message SandboxConfig {
      uint64 config_revision = 1;
      LoggingConfig logging = 2;
    }
    
    message LoggingConfig {
      bool ocsf_enabled = 1;
    }
  • Add optional SandboxConfig sandbox_config = 4 to GetSandboxPolicyResponse
  • Regenerate Rust proto bindings (mise run proto:gen or equivalent)
  • Verify proto compilation succeeds and generated Rust types are accessible
  • Verify existing proto tests still pass

Done when: Proto compiles. SandboxConfig and LoggingConfig types exist in generated Rust code. GetSandboxPolicyResponse has an optional sandbox_config field. Existing tests pass.

Step 2: Gateway config plumbing (~1-1.5 days)

  • In openshell-server, read ocsf_logging_enabled from gateway config sources (priority: YAML > CLI flag > env var OPENSHELL_OCSF_LOGGING)
  • Maintain config_revision: u64 counter in gateway, initialized to 1 on startup, incremented on any config change
  • Populate SandboxConfig in every GetSandboxPolicyResponse with current config_revision and ocsf_enabled value
  • When ocsf_logging_enabled is not configured, default to false
  • Add unit tests: config parsing from each source, revision counter increments, response population

Done when: Gateway populates SandboxConfig in every GetSandboxPolicyResponse. Config revision starts at 1 and increments correctly. Tests verify config source priority and default behavior.

Depends on: Step 1.
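
The gateway-side behavior in Step 2 can be sketched roughly as follows. This is a minimal std-only illustration, not the openshell-server implementation: `SandboxConfigState`, `resolve_ocsf_enabled`, and `parse_bool` are hypothetical names, and the accepted env var spellings are an assumption. The source priority follows the order stated in the step.

```rust
/// Hypothetical stand-in for the gateway-side state described in Step 2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SandboxConfigState {
    pub config_revision: u64,
    pub ocsf_enabled: bool,
}

/// Parse an env-var style boolean. The accepted spellings are an assumption;
/// unrecognized values are treated as "not configured".
fn parse_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "1" | "true" | "yes" | "on" => Some(true),
        "0" | "false" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// Resolve the toggle using the priority stated in Step 2
/// (YAML, then CLI flag, then env var), defaulting to false.
pub fn resolve_ocsf_enabled(yaml: Option<bool>, cli: Option<bool>, env: Option<&str>) -> bool {
    yaml.or(cli)
        .or_else(|| env.and_then(parse_bool))
        .unwrap_or(false)
}

impl SandboxConfigState {
    /// The revision starts at 1 on gateway startup.
    pub fn new(ocsf_enabled: bool) -> Self {
        Self { config_revision: 1, ocsf_enabled }
    }

    /// Apply a new value and bump the revision only if something changed.
    /// Returns true when the revision was incremented.
    pub fn set_ocsf_enabled(&mut self, enabled: bool) -> bool {
        if self.ocsf_enabled == enabled {
            return false;
        }
        self.ocsf_enabled = enabled;
        self.config_revision += 1;
        true
    }
}
```

Starting the gateway at revision 1 while the sandbox starts at 0 (Step 4) means the first poll response is always treated as newer, so the initial toggle value is applied without a special case.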

Step 3: Sandbox subscriber wiring (~1-1.5 days)

  • Add openshell-ocsf as dependency of openshell-sandbox in Cargo.toml
  • In main.rs, replace existing fmt::Full file layer for /var/log/openshell.log with OcsfShorthandLayer
  • Set up OcsfJsonlLayer for /var/log/openshell-ocsf.log, wrapped in tracing_subscriber::reload::Layer
  • Initialize JSONL layer as None (disabled) or Some(...) (enabled) based on initial SandboxConfig from first GetSandboxPolicy response
  • Create SandboxContext from sandbox config values (sandbox ID, name, image, hostname, proxy address) and store for use by all log sites
  • Store reload::Handle for JSONL layer for use in config poll loop
  • Wire subscriber: registry().with(shorthand_layer).with(jsonl_reload_layer).with(stdout_layer).with(log_push_layer)
  • Add integration tests: subscriber setup with OCSF on/off, verify shorthand always active, JSONL conditional

Done when: Sandbox starts with new subscriber stack. OcsfShorthandLayer writes to openshell.log. OcsfJsonlLayer writes to openshell-ocsf.log only when enabled. SandboxContext created and accessible. Existing stdout and log push layers remain functional.

Depends on: Steps 1, 2.
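
To make the dual-file fan-out in Step 3 concrete, here is a small std-only model of it. It is only an analogy: the actual plan uses tracing layers, with the JSONL layer wrapped in tracing_subscriber::reload::Layer. `RenderedEvent` and `DualOutput` are hypothetical names, and the `Arc<Mutex<Option<..>>>` merely plays the role of the reload handle.

```rust
use std::io::Write;
use std::sync::{Arc, Mutex};

/// Hypothetical event carrying both renderings. In the real design the
/// layers in openshell-ocsf format the OcsfEvent themselves.
pub struct RenderedEvent {
    pub shorthand: String,
    pub jsonl: String,
}

/// Always-on shorthand sink plus an optional JSONL sink that can be swapped
/// at runtime through a shared handle.
pub struct DualOutput<S: Write, J: Write> {
    shorthand: S,
    jsonl: Arc<Mutex<Option<J>>>,
}

impl<S: Write, J: Write> DualOutput<S, J> {
    /// `jsonl` is None when OCSF output is disabled at startup.
    pub fn new(shorthand: S, jsonl: Option<J>) -> Self {
        Self { shorthand, jsonl: Arc::new(Mutex::new(jsonl)) }
    }

    /// Handle kept by the config poll loop to enable or disable JSONL output.
    pub fn jsonl_handle(&self) -> Arc<Mutex<Option<J>>> {
        Arc::clone(&self.jsonl)
    }

    /// Read access to the shorthand sink (useful for inspection in tests).
    pub fn shorthand(&self) -> &S {
        &self.shorthand
    }

    /// Write one event: shorthand always, JSONL only while a sink is present.
    pub fn emit(&mut self, ev: &RenderedEvent) -> std::io::Result<()> {
        writeln!(self.shorthand, "{}", ev.shorthand)?;
        if let Some(w) = self.jsonl.lock().unwrap().as_mut() {
            writeln!(w, "{}", ev.jsonl)?;
        }
        Ok(())
    }
}
```

Modeling the disabled state as `None` rather than a no-op writer mirrors the `None` / `Some(OcsfJsonlLayer)` choice in the step, so disabling the toggle also releases the file handle.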

Step 4: Config poll integration (~0.5-1 day)

  • Add SandboxConfig handling to existing policy poll loop in lib.rs
  • Track current_config_revision: u64 in sandbox state (initialized to 0)
  • On each poll response: if sandbox_config is present and config_revision > current_config_revision, apply the changes and record the new revision
  • For OCSF toggle: use jsonl_reload_handle.modify(|layer| ...) to hot-reload
    • Enable: create JSONL file + non-blocking writer + Some(OcsfJsonlLayer)
    • Disable: None
  • Emit CONFIG:UPDATED event (via ConfigStateChangeBuilder) recording toggle change
  • Handle absent sandbox_config (older gateway) gracefully: no-op, keep current config
  • Add unit tests: revision tracking, toggle on/off via reload handle, backward compat with absent config

Done when: Config poll loop correctly tracks revisions and hot-reloads the JSONL layer. Toggling ocsf_enabled from gateway config creates/removes JSONL file at runtime without sandbox restart. CONFIG:UPDATED event emitted on change. Older gateway (no sandbox_config) doesn't break sandbox.

Depends on: Steps 2, 3.
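
The revision check in Step 4 could look something like the sketch below. The structs are hand-written mirrors of the proto messages, not the generated bindings, and `ConfigTracker` and `PollOutcome` are hypothetical names. Treating a missing `logging` sub-message as `ocsf_enabled = false` is an assumption based on proto3 defaults.

```rust
/// Hand-written mirror of the LoggingConfig proto message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LoggingConfig {
    pub ocsf_enabled: bool,
}

/// Hand-written mirror of the SandboxConfig proto message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SandboxConfig {
    pub config_revision: u64,
    pub logging: Option<LoggingConfig>,
}

/// What the poll loop should do after inspecting one response.
#[derive(Debug, PartialEq, Eq)]
pub enum PollOutcome {
    /// sandbox_config absent (older gateway) or revision not newer: no-op.
    Unchanged,
    /// Newer revision recorded, but the toggle value is the same.
    RevisionOnly,
    /// Newer revision recorded and the JSONL layer must be switched.
    Toggle { enabled: bool },
}

pub struct ConfigTracker {
    current_config_revision: u64,
    ocsf_enabled: bool,
}

impl ConfigTracker {
    /// Sandbox state starts at revision 0 with JSONL output disabled.
    pub fn new() -> Self {
        Self { current_config_revision: 0, ocsf_enabled: false }
    }

    pub fn on_poll(&mut self, sandbox_config: Option<&SandboxConfig>) -> PollOutcome {
        let cfg = match sandbox_config {
            Some(cfg) => cfg,
            None => return PollOutcome::Unchanged,
        };
        if cfg.config_revision <= self.current_config_revision {
            return PollOutcome::Unchanged;
        }
        self.current_config_revision = cfg.config_revision;
        // Assumption: an absent `logging` message means the proto3 default.
        let enabled = cfg.logging.as_ref().map_or(false, |l| l.ocsf_enabled);
        if enabled == self.ocsf_enabled {
            return PollOutcome::RevisionOnly;
        }
        self.ocsf_enabled = enabled;
        PollOutcome::Toggle { enabled }
    }
}
```

A caller would react to `Toggle { enabled }` by modifying the reload handle and emitting the CONFIG:UPDATED event. Keeping that side effect outside the tracker makes the revision logic unit-testable without a subscriber.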

Step 5: Log site migration — Network + HTTP events (25 sites) (~1.5-2 days)

  • Refactor all 19 Network Activity [4001] log sites in proxy.rs and bypass_monitor.rs:
    • CONNECT allow (L4-only), CONNECT_L7 (allow, L7 follows), CONNECT deny → NetworkActivityBuilder
    • BYPASS_DETECT network event → NetworkActivityBuilder (with observation_point_id=3)
    • Proxy listen, connection errors, relay errors → NetworkActivityBuilder
    • FORWARD parse/reject/upstream errors → NetworkActivityBuilder
    • SSRF blocks (allowed_ips failed, invalid config, internal IP) → NetworkActivityBuilder
  • Refactor all 7 HTTP Activity [4002] log sites in proxy.rs and l7/relay.rs:
    • FORWARD allow/deny → HttpActivityBuilder with HTTP method → activity_id mapping
    • L7_REQUEST → HttpActivityBuilder
    • SSRF blocks → HttpActivityBuilder
    • Non-inference request at inference.local → HttpActivityBuilder
  • Each refactored site: construct builder → ocsf_emit!(). Remove old info!()/warn!() call
  • Verify shorthand output matches expected patterns for proxy events

Done when: All 25 Network + HTTP log sites use builder → ocsf_emit!(). No ad-hoc info!()/warn!() calls remain for these event types. cargo test passes.

Depends on: Steps 3, 4 (subscriber wired, SandboxContext available).
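
For the "HTTP method → activity_id mapping" mentioned in Step 5, a lookup along these lines may be what is needed. The function name is hypothetical, and the numeric IDs reflect the OCSF HTTP Activity [4002] enum as best understood here. They should be checked against the schema version that openshell-ocsf targets before being relied on.

```rust
/// Map an HTTP method to an OCSF HTTP Activity [4002] activity_id.
/// IDs are assumed from the OCSF schema; verify against the targeted version.
pub fn http_activity_id(method: &str) -> u32 {
    match method.to_ascii_uppercase().as_str() {
        "CONNECT" => 1,
        "DELETE" => 2,
        "GET" => 3,
        "HEAD" => 4,
        "OPTIONS" => 5,
        "POST" => 6,
        "PUT" => 7,
        "TRACE" => 8,
        "PATCH" => 9,
        // No method available at the log site: Unknown.
        "" => 0,
        // Anything else (extension methods): Other.
        _ => 99,
    }
}
```

Keeping the mapping in one function avoids each FORWARD and L7_REQUEST site re-deriving the ID. Per the cleanup step, it arguably belongs in openshell-ocsf rather than in the sandbox.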

Step 6: Log site migration — SSH + Process + Finding events (22 sites) (~1.5-2 days)

  • Refactor 10 SSH Activity [4007] log sites in ssh.rs:
    • SSH listen, handshake read/verify/accepted/failed → SshActivityBuilder
    • NSSH1 nonce replay (SSH side of dual-emit) → SshActivityBuilder
    • direct-tcpip refuse/fail, unsupported subsystem → SshActivityBuilder
  • Refactor 4 Process Activity [1007] log sites in lib.rs and process.rs:
    • Process started → ProcessActivityBuilder with launch_type_id=1 (Spawn)
    • Process exited → ProcessActivityBuilder with exit_code
    • Process timed out → ProcessActivityBuilder with forced kill
    • SIGTERM failed → ProcessActivityBuilder
  • Refactor 9 Detection Finding [2004] log sites:
    • NSSH1 nonce replay finding (dual-emit with SSH) → DetectionFindingBuilder with MITRE T1550/TA0008
    • BYPASS_DETECT finding (dual-emit with Network) → DetectionFindingBuilder with MITRE T1090.003/TA0011, remediation.desc from hint
    • Unsafe disk policy → DetectionFindingBuilder
    • L7 policy validation warnings (2x) → DetectionFindingBuilder
    • SQL L7 not implemented, HTTP parse error → DetectionFindingBuilder
    • Inference interception, upstream chunk error → DetectionFindingBuilder
  • Verify dual-emit: NSSH1 replay produces 1 SSH Activity + 1 Detection Finding
  • Verify dual-emit: BYPASS_DETECT produces 1 Network Activity (step 5) + 1 Detection Finding

Done when: All 22 log sites use builder → ocsf_emit!(). Dual-emit sites produce exactly 2 events each. No ad-hoc calls remain for these event types. cargo test passes.

Depends on: Step 5 (BYPASS_DETECT network event already migrated; finding side added here).
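
One way to keep the "exactly 2 events" guarantee for dual-emit sites honest is to encode it in the return type. The sketch below does that for the NSSH1 nonce replay case. `OcsfEvent` here is a simplified, hypothetical enum and not the struct from openshell-ocsf; the message text is illustrative only.

```rust
/// Simplified, hypothetical event type covering only the two classes
/// involved in the NSSH1 dual-emit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OcsfEvent {
    SshActivity { message: String },
    DetectionFinding { message: String, technique: &'static str, tactic: &'static str },
}

impl OcsfEvent {
    /// OCSF class_uid as listed in the migration scope table.
    pub fn class_uid(&self) -> u32 {
        match self {
            OcsfEvent::SshActivity { .. } => 4007,
            OcsfEvent::DetectionFinding { .. } => 2004,
        }
    }
}

/// Build both events for one nonce replay. The fixed-size array makes
/// "exactly two events" a compile-time property of the call site.
pub fn nonce_replay_events(peer: &str) -> [OcsfEvent; 2] {
    let message = format!("NSSH1 nonce replay from {}", peer);
    [
        OcsfEvent::SshActivity { message: message.clone() },
        OcsfEvent::DetectionFinding {
            message,
            // MITRE ATT&CK IDs as given in Step 6.
            technique: "T1550",
            tactic: "TA0008",
        },
    ]
}
```

A call site would then iterate the array and pass each element to ocsf_emit!(), which keeps the two emissions adjacent in both log files.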

Step 7: Log site migration — Lifecycle + Config + Base events (46 sites) (~1.5-2 days)

  • Refactor 18 Application Lifecycle [6002] log sites in main.rs, lib.rs, netns.rs:
    • Sandbox start, log file fallback → AppLifecycleBuilder
    • SSH server ready/failed → AppLifecycleBuilder
    • Provider env fetch success/failure → AppLifecycleBuilder
    • TLS init success/failure → AppLifecycleBuilder
    • SIGCHLD handler, image validation, platform warning → AppLifecycleBuilder
    • Config validation (zero/invalid interval) → AppLifecycleBuilder
    • Bypass detection setup: installing rules, rules installed, iptables not found, install failed → AppLifecycleBuilder
  • Refactor 24+ Device Config State Change [5019] log sites in lib.rs:
    • Policy: load/fetch/reload/fallback/enrichment/poll/report → ConfigStateChangeBuilder
    • Inference routes: file load/gateway fetch/update/bundle status → ConfigStateChangeBuilder
  • Refactor 20 Base Event [0] log sites in netns.rs, mechanistic_mapper.rs, proxy.rs, bypass_monitor.rs:
    • Netns create/cleanup (6 events) → BaseEventBuilder
    • Denial flush events → BaseEventBuilder
    • DNS resolution failures → BaseEventBuilder
    • Proxy operational errors → BaseEventBuilder
    • Bypass detection operational events (rule failures, dmesg failures) → BaseEventBuilder

Done when: All remaining 46 log sites use builder → ocsf_emit!(). Every file-level log statement in the sandbox (93 total) now goes through OCSF builders. No ad-hoc info!()/warn!() calls remain for events that reach the file layer. cargo test passes.

Depends on: Steps 5, 6.

Step 8: Profile enrichment (~1 day)

  • Verify and apply OCSF profiles to all events:
    • Container profile: All events include container from SandboxContext::container()
    • Network Proxy profile: Network Activity and HTTP Activity include proxy_endpoint from SandboxContext::proxy_endpoint()
    • Security Control profile: Network, HTTP, SSH, Detection Finding include action_id, disposition_id, firewall_rule where applicable
    • Host profile: All events include device from SandboxContext::device()
  • Populate v1.6.0+ fields:
    • observation_point_id=2 (Destination) on proxy network events
    • observation_point_id=3 (Neither) on BYPASS_DETECT network events
    • is_src_dst_assignment_known=true on all network events
  • Verify metadata.profiles array is correctly populated per event class
  • Schema validation tests confirm profile fields are present

Done when: Every event includes correct profile fields. metadata.profiles lists profiles applied. Schema validation confirms profile fields present.

Depends on: Steps 5, 6, 7.
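
The profile rules in Step 8 reduce to a small function of the event class, which could also serve as the oracle for the metadata.profiles tests. This is a sketch: `profiles_for` is a hypothetical helper, and the profile name strings are assumed to match the OCSF profile identifiers.

```rust
/// Expected metadata.profiles for a given OCSF class_uid, following the
/// rules in Step 8. Profile name strings are assumptions.
pub fn profiles_for(class_uid: u32) -> Vec<&'static str> {
    // Container and Host profiles apply to every event.
    let mut profiles = vec!["container", "host"];
    // Network Proxy: Network Activity [4001] and HTTP Activity [4002].
    if matches!(class_uid, 4001 | 4002) {
        profiles.push("network_proxy");
    }
    // Security Control: Network, HTTP, SSH [4007], Detection Finding [2004].
    if matches!(class_uid, 4001 | 4002 | 4007 | 2004) {
        profiles.push("security_control");
    }
    profiles
}
```

For example, an Application Lifecycle [6002] event would carry only the Container and Host profiles, while a Network Activity [4001] event would carry all four.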

Step 9: Log push channel update (~0.5 day)

  • Ensure gRPC log push layer receives shorthand text from OcsfShorthandLayer output
  • Verify message field in SandboxLogLine proto contains the shorthand line
  • Verify openshell sandbox logs CLI command displays shorthand output correctly
  • Verify TUI log panel renders shorthand lines
  • No proto changes needed — same SandboxLogLine message, better formatted content

Done when: Log push delivers shorthand text to gateway. openshell sandbox logs shows shorthand lines. TUI displays correctly. No regressions.

Depends on: Steps 3, 5-7.

Step 10: E2E verification (~1 day)

  • Run mise run e2e and verify all existing tests pass (update assertions as needed for shorthand format)
  • Verify dual-file output:
    • /var/log/openshell.log contains shorthand text lines (not JSON, not old format)
    • /var/log/openshell-ocsf.log (when enabled) contains OCSF JSONL with correct schema structure
    • Line counts match between files (accounting for dual-emit events)
  • Verify config toggle: enable/disable OCSF via gateway config, observe JSONL file creation/cessation without sandbox restart
  • Update test_sandbox_policy.py assertions for shorthand patterns
  • Verify log push and TUI end-to-end

Done when: mise run e2e passes. Manual verification of dual-file output, log push, TUI display, and config toggle all succeed.

Depends on: All prior steps.
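
The dual-file checks in Step 10 could be partly automated with a structural comparison like the one below. It is a rough sketch with hypothetical names. It only checks line shape (it does not parse JSON or validate the OCSF schema), and it assumes JSONL output was enabled for the whole run, since otherwise the line counts legitimately differ.

```rust
/// Result of a structural comparison between the two log files.
#[derive(Debug, PartialEq, Eq)]
pub struct DualFileReport {
    pub shorthand_lines: usize,
    pub jsonl_lines: usize,
    /// True if no shorthand line looks like a JSON object.
    pub shorthand_is_text: bool,
    /// True if every JSONL line is wrapped in braces (shape check only).
    pub jsonl_is_objects: bool,
}

impl DualFileReport {
    /// All shape checks pass and both files have the same number of lines.
    pub fn ok(&self) -> bool {
        self.shorthand_is_text
            && self.jsonl_is_objects
            && self.shorthand_lines == self.jsonl_lines
    }
}

/// Trimmed, non-empty lines of a log file's contents.
fn non_empty_lines(contents: &str) -> Vec<&str> {
    contents.lines().map(str::trim).filter(|l| !l.is_empty()).collect()
}

/// Compare the contents of openshell.log and openshell-ocsf.log.
pub fn check_dual_files(shorthand: &str, jsonl: &str) -> DualFileReport {
    let s = non_empty_lines(shorthand);
    let j = non_empty_lines(jsonl);
    DualFileReport {
        shorthand_lines: s.len(),
        jsonl_lines: j.len(),
        shorthand_is_text: s.iter().all(|l| !l.starts_with('{')),
        jsonl_is_objects: j.iter().all(|l| l.starts_with('{') && l.ends_with('}')),
    }
}
```

Because a dual-emit site produces two events and each event reaches both layers, dual-emit events add two lines to each file, so the counts should still match.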

Step 11: Cleanup (~0.5 day)

  • Remove all replaced ad-hoc info!()/warn!() calls superseded by ocsf_emit!()
  • Verify no OCSF formatting or serialization code exists in openshell-sandbox — all lives in openshell-ocsf
  • Verify openshell-sandbox only contains: builder call sites, subscriber wiring in main.rs, SandboxContext creation, config poll integration
  • Run mise run pre-commit and fix any issues
  • Run cargo clippy across the workspace — zero warnings
  • Update architecture/ docs to describe new logging architecture (dual-file output, OCSF adoption, SandboxConfig)
  • Update docs/ for any user-facing log format changes

Done when: No dead code remains. mise run pre-commit passes. cargo clippy zero warnings. Architecture docs updated. No OCSF logic in sandbox outside call sites and wiring.

Depends on: Steps 5-10.

Acceptance Criteria

  1. All 93 file-level log sites in openshell-sandbox emit typed OCSF events via builder → ocsf_emit!()
  2. /var/log/openshell.log contains shorthand-formatted text (not the old fmt::Full output)
  3. /var/log/openshell-ocsf.log is created only when SandboxConfig.logging.ocsf_enabled=true and contains valid OCSF JSONL
  4. OCSF JSONL toggle is hot-reloadable via SandboxConfig.config_revision — no sandbox restart needed
  5. Backward compatible: sandbox works correctly with a gateway that does not populate sandbox_config
  6. gRPC log push delivers shorthand text. TUI and openshell sandbox logs display correctly
  7. Dual-emit events (BYPASS_DETECT, NSSH1 replay) produce exactly 2 OCSF events each
  8. All OCSF profiles (Container, Network Proxy, Security Control, Host) correctly applied
  9. mise run e2e passes with updated assertions
  10. mise run pre-commit and cargo clippy pass with zero issues
  11. Architecture docs updated to describe dual-file logging and OCSF adoption

Estimated Effort

~10.5-13.5 days, summing the per-step estimates above (after Part 1 is complete)

References
