switch envs, resources are shared on multiple envs and are exclusive #1303
Conversation
f67b833 to f6422f2
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This Pull Request introduces a significant architectural change to the Compute-to-Data (C2D) environment configuration. It refactors the system to support multiple, distinct compute environments per Docker engine, providing more granular control over resource allocation, fees, and access policies. The DOCKER_COMPUTE_ENVIRONMENTS configuration now expects an array of environments within each Docker cluster definition. Resource management has been enhanced with per-environment resource tracking (CPU, RAM, Disk), global resource availability checks to prevent over-allocation, and explicit handling for shared-exclusive resources like GPUs. Type definitions (C2D.ts) and configuration schemas (schemas.ts) have been updated to reflect this new hierarchical structure. All example and test configurations are adjusted accordingly. This is a major improvement for C2D flexibility and scalability.
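The nested structure this summary describes might look roughly like the sketch below. Field names (`socketPath`, `storageExpiry`, resource `id`/`total`) are illustrative guesses based on the review comments, not a verbatim copy of the actual DOCKER_COMPUTE_ENVIRONMENTS schema.

```typescript
// Hedged illustration only: one Docker cluster now carries an array of
// environments, each with its own resources, fees/free tier, and durations.
const dockerComputeEnvironments = [
  {
    socketPath: '/var/run/docker.sock',
    environments: [
      {
        id: 'small',
        storageExpiry: 604800,
        maxJobDuration: 3600,
        fees: {}, // per-environment fee structure (elided)
        resources: [
          { id: 'cpu', total: 2 },
          { id: 'ram', total: 4_000_000_000 },
          { id: 'disk', total: 10_000_000_000 }
        ]
      },
      {
        id: 'gpu-large',
        free: { maxJobDuration: 600, maxJobs: 2 },
        resources: [
          { id: 'cpu', total: 8 },
          { id: 'gpu', total: 1 }
        ]
      }
    ]
  }
]

console.log(dockerComputeEnvironments[0].environments.length) // 2
```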
Comments:
• [ERROR][bug] The ComputeEnvironmentFreeOptions interface in C2D.ts (which C2DEnvironmentConfig.free uses) does not define minJobDuration or maxJobDuration. However, the ci.yml, config.json, and various integration tests define minJobDuration and maxJobDuration within the free configuration. Furthermore, compute_engine_docker.ts in the start() method explicitly copies these properties if found in envDef.free. This creates a type and schema inconsistency. Please add minJobDuration?: number and maxJobDuration?: number to ComputeEnvironmentFreeOptions in src/@types/C2D/C2D.ts and ComputeEnvironmentFreeOptionsSchema in src/utils/config/schemas.ts.
• [ERROR][bug] The C2DDockerConfigSchema no longer enforces the presence of a 'disk' resource at the top level, as resources are now per-environment. However, C2DEnvironmentConfigSchema also does not enforce it, and C2DEngineDocker.start() does not automatically add a 'disk' resource to env.resources if not explicitly provided (unlike CPU and RAM). A 'disk' resource is fundamental for almost all compute jobs. This could lead to environments being configured without essential storage. Please consider adding a .refine rule to C2DEnvironmentConfigSchema to ensure a 'disk' resource is always present, or adding default 'disk' resources in C2DEngineDocker.start().
• [WARNING][other] This PR introduces a breaking change to the DOCKER_COMPUTE_ENVIRONMENTS configuration structure. The documentation (e.g., README.md) and release notes should clearly communicate this change, provide migration steps for existing deployments, and explain the benefits of the new multi-environment setup.
• [INFO][performance] The envResourceMap creation for (env.resources || []).map((r) => [r.id, r]) within the _getComputeEnvironmentInfo method is done inside the loop for each env. This map is then used inside the for (const job of jobs) loop. While likely negligible for typical scenarios, creating this map repeatedly might be slightly inefficient if there are many environments or jobs. Could envResourceMap be created once per env when the environments are initialized in C2DEngineDocker.start() and stored on the env object or a separate map for quicker lookup? This is a minor optimization suggestion.
• [INFO][style] Consider adding JSDoc comments to the new processFeesForEnvironment method and updating existing JSDocs for methods like checkIfResourcesAreAvailable and allocateCpus to reflect their new parameters and responsibilities. This will improve code clarity and maintainability.
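The first ERROR above asks for `minJobDuration`/`maxJobDuration` on the free-tier options. A minimal sketch of the suggested shape, with a plain-TypeScript stand-in for the schema-side duration check (the real interface in src/@types/C2D/C2D.ts and the Zod schema in schemas.ts may carry other fields):

```typescript
// Sketch of the suggested addition; existing fields are omitted.
interface ComputeEnvironmentFreeOptions {
  minJobDuration?: number // seconds, suggested new field
  maxJobDuration?: number // seconds, suggested new field
}

// Illustrative helper (not from the codebase): clamp a requested job
// duration into the free tier's [minJobDuration, maxJobDuration] window.
function clampFreeJobDuration(
  requested: number,
  free: ComputeEnvironmentFreeOptions
): number {
  const min = free.minJobDuration ?? 60 // 60 s default mentioned in a later review
  const max = free.maxJobDuration ?? requested
  return Math.min(Math.max(requested, min), max)
}

console.log(clampFreeJobDuration(10, { minJobDuration: 60, maxJobDuration: 3600 })) // 60
```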
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request introduces a significant architectural change to the Compute-to-Data (C2D) component, allowing for multiple, distinct compute environments within a single Docker cluster. The configuration for compute environments is now hierarchical, enabling granular control over resources, fees, access, and job durations for each environment. It also implements global resource availability checks to prevent over-allocation across environments and refactors CPU core allocation to be environment-specific. This is a substantial improvement in flexibility and resource management, but it introduces a breaking change in the configuration structure.
Comments:
• [ERROR][other] This PR introduces a breaking change to the dockerComputeEnvironments configuration structure. Previously, parameters like storageExpiry, maxJobDuration, fees, and resources were directly under the C2DDockerConfig object. Now, they must be nested within an environments array. While the .env.example and config.json files are updated, existing deployments will require manual configuration migration. Please ensure this breaking change is clearly documented in the release notes with migration instructions.
• [INFO][style] Good catch correcting the typo from '1 hours' to '1 hour' in the paymentClaimInterval comment.
• [INFO][performance] The addition of envResourceMap for efficient resource lookup is a good optimization, especially with the introduction of multiple environments. This avoids repeated array iteration.
• [WARNING][bug] The physicalLimits map correctly initializes for 'cpu' and 'ram' based on sysinfo. However, there's no corresponding initialization for 'disk'. While the checkGlobalResourceAvailability method for disk will sum total values from configured environments, it won't necessarily reflect the actual physical disk capacity of the host unless manually configured to match. Is there a plan to automatically detect physical disk capacity or enforce a configuration check to ensure the sum of disk.total across environments does not exceed the physical limit?
• [INFO][style] The ComputeEnvironmentFreeOptionsSchema now includes minJobDuration. This is consistent with the paid environment options. Good to ensure consistency here.
• [INFO][other] The validation rule environments: z.array(C2DEnvironmentConfigSchema).min(1) ensures that at least one compute environment is configured within a Docker cluster. This is a sensible default and prevents misconfigurations.
• [INFO][other] The environment ID fallback to String(envIdx) if not explicitly provided is functional for ensuring uniqueness. For user-friendliness and traceability in logs/monitoring, it might be beneficial to encourage explicit id definitions in the configuration, or generate a more descriptive ID based on other environment properties if id is missing.
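The disk question raised in the WARNING above boils down to comparing the sum of configured `total` values against a detected physical limit. A hedged stand-in for that check (function and field names are illustrative, not the actual checkGlobalResourceAvailability implementation):

```typescript
interface EnvResource { id: string; total: number }
interface EnvConfig { id: string; resources: EnvResource[] }

// Illustrative global-availability check: sum a resource's `total`
// across all configured environments and flag configurations that
// exceed the host's physical capacity for that resource.
function exceedsPhysicalLimit(
  envs: EnvConfig[],
  resourceId: string,
  physicalLimit: number
): boolean {
  const configuredTotal = envs.reduce((sum, env) => {
    const r = env.resources.find((res) => res.id === resourceId)
    return sum + (r ? r.total : 0)
  }, 0)
  return configuredTotal > physicalLimit
}
```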
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request introduces a major architectural refactoring of the compute environment configuration, allowing a single compute node to host multiple distinct compute environments with individual resource definitions, fee structures, and access controls. It moves core environment settings into a nested environments array within the Docker compute configuration. Key improvements include granular per-environment resource management (CPU, RAM, disk), a new global resource availability check to prevent over-allocation, and robust Zod schema validation for the new configuration structure. This is a significant enhancement for compute providers offering varied services.
Comments:
• [INFO][other] The updated example for DOCKER_COMPUTE_ENVIRONMENTS clearly demonstrates the new nested environments array structure. This is a breaking change for existing deployments, so clear migration notes will be essential.
• [INFO][other] The introduction of C2DEnvironmentConfig and the modification of C2DDockerConfig correctly reflect the new multi-environment architecture. This provides much-needed flexibility.
• [INFO][other] The fix from '1 hours' to '1 hour' for paymentClaimInterval comment is a minor but welcome detail correction.
• [INFO][performance] The envResourceMap creation is a good optimization for resource lookups within an environment, improving readability and potentially performance in resource-intensive loops.
• [INFO][other] The new logic distinguishing between shared-exclusive (GPU) and per-env exclusive (CPU, RAM, disk) resource tracking is crucial for accurate multi-environment resource management. This addresses a common challenge in container orchestration.
• [INFO][security] The checkGlobalResourceAvailability method is a critical addition for preventing over-allocation of physical resources across multiple logical compute environments. This enhances the stability and security of the compute node.
• [INFO][other] The processFeesForEnvironment method extracts and centralizes fee processing logic, which improves code organization and maintainability.
• [INFO][other] The start() method's refactoring to iterate over envConfig.environments and dynamically create ComputeEnvironment objects is the core of the multi-environment support. This is a robust implementation of the new design.
• [WARNING][performance] The cpuOffset was removed from the constructor, and CPU allocation logic is now per-environment. While the new per-environment CPU allocation (using envCpuCoresMap) is correct, please confirm that the cpuOffset functionality for multiple physical Docker clusters (if that was ever a use case) hasn't been implicitly removed without replacement. Currently, C2DEngines iterates over c2dClusters, but C2DEngineDocker is created for only a single cluster. If there is a use case for multiple DOCKER clusters on one node that need different CPU affinities, that may need re-evaluation.
• [INFO][performance] Passing envId to allocateCpus and using envCpuCoresMap for environment-specific CPU pinning is a good approach to ensure fair and isolated CPU resource allocation for jobs running in different configured environments.
• [INFO][other] The C2DEnvironmentConfigSchema with its refine clauses provides excellent validation for the new environment configurations. The mandatory disk resource and storageExpiry vs maxJobDuration checks are important for operational stability. The requirement for either fees or free configuration per environment is also sensible.
• [INFO][other] Adding minJobDuration to ComputeEnvironmentFreeOptionsSchema with a default of 60 seconds is a good consistency improvement. This ensures free jobs also have a minimum duration constraint.
• [WARNING][test] In some integration tests, paymentClaimInterval is now set on the DOCKER_COMPUTE_ENVIRONMENTS string at the top level of the Docker config, not within the nested environments array. This is consistent with the C2DDockerConfig definition, but it's important to ensure that this global paymentClaimInterval is respected and applies correctly to all nested environments, or if it should be an environment-specific setting.
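The per-environment CPU pinning discussed in the comments above can be sketched as a map from environment ID to that environment's pool of core indices; `allocateCpus` then hands out cores only from the caller's pool. This is an illustrative sketch, not the actual envCpuCoresMap/allocateCpus code:

```typescript
// Each environment owns a disjoint set of core indices (assumed layout).
const envCpuCoresMap = new Map<string, number[]>([
  ['env-0', [0, 1, 2, 3]],
  ['env-1', [4, 5, 6, 7]]
])

function allocateCpus(envId: string, count: number): number[] {
  const pool = envCpuCoresMap.get(envId)
  if (!pool || pool.length < count) {
    throw new Error(`env ${envId}: not enough free cores`)
  }
  // splice removes the allocated cores from the pool, so concurrent jobs
  // in the same environment cannot double-book them
  return pool.splice(0, count)
}
```

Returning cores on job completion would push them back onto the environment's pool, keeping allocations isolated per environment.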
d4cb256 to a34a4f3
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request introduces a significant architectural change by enabling support for multiple, distinct compute environments within a single C2D Docker cluster. Each environment can now define its own resources (CPU, RAM, disk, GPU), fee structures, access controls, and free tier configurations. Key changes include a revised configuration structure, updated TypeScript types, granular resource management logic distinguishing between shared (GPU) and exclusive (CPU, RAM, disk) resources, and the introduction of global physical resource limits. CPU affinity is now managed per environment.
This is a major feature enhancement that improves the flexibility and scalability of compute offerings.
Comments:
• [INFO][other] The C2DDockerConfig interface has been significantly refactored. Previously, it contained properties like storageExpiry, maxJobDuration, fees, and resources directly. These are now moved into the new C2DEnvironmentConfig interface, and C2DDockerConfig now primarily holds connection details and an array of environments.
This is a breaking change to the configuration structure, which needs to be clearly communicated to users and administrators during upgrade.
• [INFO][other] The resource usage tracking logic has been enhanced to differentiate between gpu resources (shared-exclusive, tracked globally) and other resource types (cpu, ram, disk) which are per-environment exclusive. This is a crucial improvement for supporting heterogeneous environments and ensuring accurate resource allocation across multiple defined compute environments.
• [INFO][other] Introduction of physicalLimits and checkGlobalResourceAvailability is a critical addition. This ensures that even with multiple compute environments, the aggregated resource demands do not exceed the actual physical capacity of the host machine. This adds a robust layer of safety and prevents over-provisioning at a global level.
• [WARNING][performance] In the start() method, if statfsSync(this.getC2DConfig().tempFolder) fails to detect the physical disk size, physicalDiskGB defaults to 0. This could lead to diskResources.total being 0 for all environments, effectively making disk resources unavailable or severely constrained if the detection fails. While the CORE_LOGGER.warn helps, consider if a default fallback (e.g., a minimum reasonable disk size) or a more explicit failure during startup is preferable if disk detection is critical and might fail in some environments.
• [INFO][other] The CPU affinity logic now uses envCpuCoresMap to manage core allocations per environment, rather than a single envCpuCores array. This correctly isolates and assigns CPU cores based on the configuration of each specific compute environment, which is vital for multi-environment support.
• [INFO][style] The C2DDockerConfigSchema now requires environments to be an array with a minimum length of 1 (.min(1)). This implies that a Docker compute cluster must define at least one environment. This is a sensible design choice for clarity and preventing misconfigurations where a cluster might be defined but offers no compute environments.
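The statfsSync fallback suggested in the WARNING above could look like the sketch below: fall back to a conservative default instead of 0 GB when detection fails. `DEFAULT_DISK_GB` and `detectDiskGB` are illustrative names, not from the codebase:

```typescript
import { statfsSync } from 'node:fs'

// Assumed fallback constant; a real deployment might make this configurable.
const DEFAULT_DISK_GB = 10

function detectDiskGB(tempFolder: string): number {
  try {
    const stats = statfsSync(tempFolder)
    const gb = (stats.bsize * stats.blocks) / 1e9
    // Guard against a 0 result, which would make disk unavailable everywhere.
    return gb > 0 ? gb : DEFAULT_DISK_GB
  } catch (err) {
    console.warn(`disk detection failed for ${tempFolder}, using default`)
    return DEFAULT_DISK_GB
  }
}
```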
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request introduces a significant refactoring of the compute-to-data (C2D) environment configuration, allowing for multiple compute environments per Docker cluster. Each environment can now define its own resources (CPU, RAM, disk, GPU), fees, and access controls. It also implements global physical resource limits and environment-specific CPU allocation. A new feature to automatically create a benchmark environment is added. Configuration schemas, environment variable parsing, and core compute logic have been updated to support these changes.
Comments:
• [INFO][other] The DOCKER_COMPUTE_ENVIRONMENTS configuration has undergone a major breaking change, now requiring an environments array. This needs thorough documentation for node operators, including migration steps if applicable, to avoid confusion and misconfigurations during upgrades.
• [WARNING][performance] The start method in C2DEngineDocker has grown quite complex, handling physical limit detection, benchmark environment setup, and iteration over multiple environment configurations. Consider breaking it down into smaller, more focused private methods to improve readability and maintainability, e.g. _initializePhysicalLimits() and _createEnvironment(envDef, sysinfo, platform, consumerAddress, supportedChains).
• [INFO][bug] The statfsSync call for disk size is wrapped in a try/catch. If this fails for some reason, the disk physical limit will not be set, potentially leading to incorrect disk resource management if a fallback isn't provided. While a warning is logged, for critical resources like disk, a more robust fallback or an explicit error might be considered if disk detection is essential for operation.
• [WARNING][style] The sepoliaChainId and usdcToken for the benchmark environment are hardcoded here. It would be better practice to define these as constants in src/utils/config/constants.ts or make them configurable via environment variables, similar to BENCHMARK_MONITORING_ADDRESS, for better maintainability and flexibility across different network configurations.
• [INFO][bug] The .refine rule for mandatory 'disk' resource: if (!data.resources) return false causes the schema to fail if the resources array is omitted entirely from an environment configuration. This is likely intended behavior to ensure disk is always specified, but it's worth confirming that environments without an explicit resources array (relying on implicit defaults or auto-detection) are not meant to be valid.
• [INFO][style] The C2DDockerConfigSchema has a .min(1) requirement for the environments array. This ensures that at least one compute environment must be defined for each Docker cluster. This is a good constraint for operational stability.
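The validation rules discussed above (mandatory 'disk' resource, fees-or-free requirement, at least one environment per cluster) can be expressed as a plain-TypeScript stand-in for the Zod refine clauses. Names and shapes here are illustrative, not the actual C2DEnvironmentConfigSchema:

```typescript
interface EnvDef {
  id?: string
  resources?: { id: string }[]
  fees?: object
  free?: object
}

// Illustrative mirror of the schema constraints the review describes.
function validateCluster(environments: EnvDef[]): string[] {
  const errors: string[] = []
  if (environments.length < 1) {
    errors.push('at least one environment is required') // .min(1)
  }
  environments.forEach((env, i) => {
    const hasDisk = env.resources?.some((r) => r.id === 'disk') ?? false
    if (!hasDisk) errors.push(`environment ${i}: missing mandatory 'disk' resource`)
    if (!env.fees && !env.free) errors.push(`environment ${i}: needs 'fees' or 'free'`)
  })
  return errors
}
```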
Fixes #1280.
Changes proposed in this PR:
environment-specific settings.