Conversation
@greptileai review and scan for regressions
**Greptile Summary**

This PR is a major infrastructure migration, moving heartbeat storage from PostgreSQL to ClickHouse. Key changes include a ClickHouse-backed `heartbeats` table with application-generated IDs, a `heartbeat_user_daily_summary` materialized view refreshed every 10 minutes, and dashboard cache invalidation via `HeartbeatCacheInvalidator`.

Outstanding concerns (some carried from previous review rounds):
Confidence Score: 1/5
Important Files Changed
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Client as WakaTime Client
    participant API as HackatimeController
    participant CH as ClickHouse (heartbeats)
    participant Cache as Rails.cache
    participant PG as PostgreSQL (users)
    participant MV as heartbeat_user_daily_summary_mv
    Client->>API: POST /heartbeats (batch)
    API->>API: prepare_heartbeat_attrs()
    API->>API: assign IDs (timestamp << 10 | rand)
    API->>CH: insert_all(records)
    API->>Cache: HeartbeatCacheInvalidator.bump_for(user_id)
    Note over MV,CH: Every 10 minutes (async)
    CH-->>MV: REFRESH MATERIALIZED VIEW
    MV->>CH: heartbeat_user_daily_summary updated
    Client->>API: GET /dashboard
    API->>Cache: fetch(dashboard_cache_version)
    Cache-->>API: miss
    API->>PG: User.where(id:).pluck(:id, :timezone)
    API->>CH: SELECT groupUniqArray(project), ... FROM heartbeats
    API->>CH: SELECT weekly project durations (batched)
    API->>CH: SELECT duration_seconds (lagInFrame window)
    API-->>Client: dashboard JSON
    Note over API,CH: Home stats path
    API->>CH: SELECT uniq(user_id), sum(duration_s) FROM heartbeat_user_daily_summary FINAL
```
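The ID-assignment step in the diagram (`timestamp << 10 | rand`) can be sketched in plain Ruby. The constant names mirror `Heartbeat::CLICKHOUSE_ID_RANDOM_BITS` / `CLICKHOUSE_ID_RANDOM_MAX` from the PR, but the 10-bit value here is an assumption read off the diagram, not the confirmed production value:

```ruby
require "securerandom"

# Assumed values: the diagram says "timestamp << 10 | rand".
CLICKHOUSE_ID_RANDOM_BITS = 10
CLICKHOUSE_ID_RANDOM_MAX  = 1 << CLICKHOUSE_ID_RANDOM_BITS

# Build an ID from a microsecond timestamp in the high bits and a
# random suffix in the low bits; `offset` keeps IDs within one batch
# strictly increasing, as flush_batch does with each_with_index.
def generate_heartbeat_id(now = Time.now, offset = 0)
  base_us = (now.to_r * 1_000_000).to_i # microseconds since epoch, exact
  ((base_us + offset) << CLICKHOUSE_ID_RANDOM_BITS) |
    SecureRandom.random_number(CLICKHOUSE_ID_RANDOM_MAX)
end

id = generate_heartbeat_id
# The insert timestamp is recoverable from the high bits:
ts_us = id >> CLICKHOUSE_ID_RANDOM_BITS
```

Because the random part occupies only the low bits, IDs sort by insertion time, which matters for the `ORDER BY` of a ReplacingMergeTree table.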
**Prompt To Fix All With AI**

This is a comment left during a code review.
Path: app/controllers/concerns/dashboard_data.rb
Line: 101-103
Comment:
**Unquoted timezone interpolation — SQL injection risk**
`current_user.timezone` is interpolated directly into the ClickHouse SQL string at lines 103 (twice in the `select`) without any quoting or sanitization. A crafted timezone value stored in the user's profile (e.g. `UTC') UNION SELECT ...--`) would be executed verbatim.
The identical vulnerability was already identified and fixed in `WeeklySummaryMailer` (using `connection.quote`), but the same pattern here was not addressed:
```suggestion
tz_quoted = Heartbeat.connection.quote(current_user.timezone)
weekly_sql = weekly_hb
.select("toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
```
How can I resolve this? If you propose a fix, please make it concise.
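To see why quoting closes the hole, here is a minimal, self-contained sketch. The `sql_quote` helper imitates the single-quote doubling that ActiveRecord's `connection.quote` performs for string values; it is an illustration, not the ActiveRecord implementation, and in the app you would call `Heartbeat.connection.quote` directly:

```ruby
# Minimal imitation of connection.quote for strings: wrap in single
# quotes and double any embedded single quote so it stays a literal.
def sql_quote(value)
  "'#{value.to_s.gsub("'", "''")}'"
end

malicious_tz = "UTC') UNION SELECT ...--"

# Unsafe: the payload closes the string literal and injects SQL.
unsafe = "toDateTime(toUInt32(time), '#{malicious_tz}')"

# Safe: the embedded quote is doubled, so the whole payload remains
# inside one string literal and is never parsed as SQL.
safe = "toDateTime(toUInt32(time), #{sql_quote(malicious_tz)})"
```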
---
This is a comment left during a code review.
Path: db/migrate_clickhouse/20260324000003_create_heartbeat_user_daily_summary_refreshable_view.rb
Line: 17-35
Comment:
**Window partitions by `user_id` only, bleeds across day boundaries**
The inner window function partitions by `user_id` alone:
```sql
lagInFrame(time, 1, time) OVER (
PARTITION BY user_id
ORDER BY time ASC
...
)
```
This means the lag for the *first heartbeat of a new day* is computed against the *last heartbeat of the previous day*. If the user was coding right up to midnight, that cross-midnight gap (which can be up to 120 seconds) is added to the new day's `duration_s`, inflating it incorrectly.
The ad-hoc query path in `heartbeatable.rb` correctly partitions by day:
```sql
PARTITION BY toDate(toDateTime(toUInt32(time), '#{timezone}'))
```
The materialized view should use the same approach, partitioning by `(user_id, toDate(toDateTime(toUInt32(time))))`, to stay consistent with individual user stat calculations.
Note: the current structure also partitions by UTC day while the per-user queries use the user's local timezone, so the totals in `heartbeat_user_daily_summary` will diverge from those shown on user dashboards for non-UTC users.
How can I resolve this? If you propose a fix, please make it concise.
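The inflation is easy to reproduce. This Ruby sketch simulates the capped `lagInFrame` diff over made-up timestamps, comparing the view's user-only partition with a per-day partition (the 120-second cap matches the view's `least(..., 120)`):

```ruby
CAP = 120 # seconds, per the view's least(..., 120)

# Capped gap to the previous heartbeat within one partition. The
# first heartbeat of a partition contributes 0 (lagInFrame defaults
# to the row's own time), which each_cons drops implicitly.
def diffs(times)
  times.each_cons(2).map { |prev, t| [[t - prev, 0].max, CAP].min }
end

midnight = 1_700_006_400 # an arbitrary UTC day boundary (made up)
day1 = [midnight - 300, midnight - 200, midnight - 100] # coding up to midnight
day2 = [midnight + 10, midnight + 70]                   # resumes just after

# User-only partition: day 2's first diff is computed against the
# last heartbeat of day 1, so a 110 s cross-midnight gap leaks in.
day2_user_only = diffs(day1 + day2).last(2).sum # 110 + 60 = 170

# (user, day) partition: day 2 opens a fresh window, so its first
# heartbeat contributes 0 and only real same-day gaps count.
day2_per_day = diffs(day2).sum # 0 + 60 = 60
```

With denser heartbeats the leak is bounded by the cap, but it still adds up to 120 spurious seconds per user per day boundary.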
---
This is a comment left during a code review.
Path: app/services/heartbeat_import_service.rb
Line: 94-119
Comment:
**`fields_hash` no longer computed on import**
`generate_fields_hash` was removed from `Heartbeat` and `before_save :set_fields_hash!` was also removed. The import service previously set `attrs[:fields_hash]` before inserting, but that line is now gone. All imported heartbeats will have `fields_hash = ''` (the ClickHouse column default).
This means `existing_hashes` deduplication in any future code path, the `find_or_create_by(fields_hash:)` guard added in the controller, and the `UNIQUE` semantics of `ReplacingMergeTree` on `id`—none of these will work correctly for imported data. Two re-imports of the same file will produce duplicate rows that the engine won't merge (since `id` is part of `ORDER BY` and IDs are freshly generated each time).
Consider either restoring `fields_hash` computation in `flush_batch`, or tracking seen external timestamps/entities per-user to prevent re-import duplicates.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "Claude-slop"
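A hypothetical sketch of the first suggested option, restoring a `fields_hash` computation in `flush_batch`: hash a canonical subset of the heartbeat attributes so that re-importing the same data produces identical hashes. The field list and the MD5 choice are assumptions for illustration, not the removed implementation:

```ruby
require "digest"

# Assumed canonical subset; the real removed generate_fields_hash
# may have covered different attributes.
HASHED_FIELDS = %i[user_id time entity project language].freeze

# Deterministic content hash: same attributes in, same hash out,
# regardless of freshly generated IDs or created_at timestamps.
def fields_hash(attrs)
  Digest::MD5.hexdigest(HASHED_FIELDS.map { |f| attrs[f].to_s }.join("|"))
end

a = { user_id: 1, time: 1_700_000_000, entity: "app.rb",
      project: "demo", language: "Ruby" }
b = a.dup

fields_hash(a) == fields_hash(b) # identical content collides, so a
                                 # re-import can be deduplicated
```

If `fields_hash` (rather than a fresh `id`) participates in the table's dedup key, two imports of the same file merge instead of doubling.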
@greptileai re-review and scan for regressions
```ruby
user_id = conn.quote(current_user.id)
filters = %i[project language operating_system editor category]
sql = filters.map { |f| "groupUniqArray(#{f}) AS #{f}_values" }.join(", ")
row = conn.select_one("SELECT #{sql} FROM heartbeats WHERE user_id = #{user_id}")
```

Check warning — Code scanning / Brakeman: Possible SQL injection.
```ruby
      ORDER BY user_id ASC, time ASC
    SQL

    rows = connection.select_all(raw_sql)
```

Check warning — Code scanning / Brakeman: Possible SQL injection.
@greptileai review and scan for regressions. review the time calculation code and all ClickHouse code, then review whether or not it makes sense, has regressions from the main branch, duplicated clickhouse-activerecord features, can be made more efficient, etc. Also check to see whether or not a piece of code can be shortened WITHOUT LOSING PERFORMANCE OR FUNCTIONALITY. PAY CLOSE ATTENTION TO TIME TRACKING CODE. ALSO LOOK FOR POTENTIAL PERFORMANCE IMPROVEMENTS
```ruby
tz = current_user.timezone
weekly_sql = weekly_hb
  .select("toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
```
**Unquoted timezone interpolation — SQL injection risk**

`current_user.timezone` is interpolated directly into the ClickHouse SQL string at line 103 (twice in the `select`) without any quoting or sanitization. A crafted timezone value stored in the user's profile (e.g. `UTC') UNION SELECT ...--`) would be executed verbatim.

The identical vulnerability was already identified and fixed in `WeeklySummaryMailer` (using `connection.quote`), but the same pattern here was not addressed:

```suggestion
tz_quoted = Heartbeat.connection.quote(current_user.timezone)
weekly_sql = weekly_hb
  .select("toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
```
```sql
SELECT
    user_id,
    time,
    least(
        greatest(
            time - lagInFrame(time, 1, time) OVER (
                PARTITION BY user_id
                ORDER BY time ASC
                ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
            ),
            0
        ),
        120
    ) AS diff
FROM heartbeats
WHERE time IS NOT NULL
    AND time >= 0
    AND time <= 253402300799
)
```
**Window partitions by `user_id` only, bleeds across day boundaries**

The inner window function partitions by `user_id` alone:

```sql
lagInFrame(time, 1, time) OVER (
    PARTITION BY user_id
    ORDER BY time ASC
    ...
)
```

This means the lag for the *first heartbeat of a new day* is computed against the *last heartbeat of the previous day*. If the user was coding right up to midnight, that cross-midnight gap (which can be up to 120 seconds) is added to the new day's `duration_s`, inflating it incorrectly.

The ad-hoc query path in `heartbeatable.rb` correctly partitions by day:

```sql
PARTITION BY toDate(toDateTime(toUInt32(time), '#{timezone}'))
```

The materialized view should use the same approach, partitioning by `(user_id, toDate(toDateTime(toUInt32(time))))`, to stay consistent with individual user stat calculations.

Note: the current structure also partitions by UTC day while the per-user queries use the user's local timezone, so the totals in `heartbeat_user_daily_summary` will diverge from those shown on user dashboards for non-UTC users.
```diff
   total_count
 end

-def self.flush_batch(seen_hashes)
-  return 0 if seen_hashes.empty?
+def self.flush_batch(user_id, records)
+  return 0 if records.empty?

-  records = seen_hashes.values
-  records.each do |r|
-    timestamp = Time.current
-    r[:created_at] = timestamp
-    r[:updated_at] = timestamp
+  now = Time.current
+  base_us = (now.to_r * 1_000_000).to_i
+  records.each_with_index do |r, i|
+    r[:id] = ((base_us + i) << Heartbeat::CLICKHOUSE_ID_RANDOM_BITS) | SecureRandom.random_number(Heartbeat::CLICKHOUSE_ID_RANDOM_MAX)
+    r[:created_at] = now
+    r[:updated_at] = now
   end

   ActiveRecord::Base.logger.silence do
-    result = Heartbeat.upsert_all(records, unique_by: [ :fields_hash ])
-    result.length
+    Heartbeat.connection.with_settings(async_insert: 0, wait_for_async_insert: 1) do
+      Heartbeat.insert_all(records)
+    end
+    Heartbeat.connection.clear_query_cache
   end

+  HeartbeatCacheInvalidator.bump_for(user_id)
+  records.length
 end

 class HeartbeatSaxHandler < Oj::Saj
```
**`fields_hash` no longer computed on import**

`generate_fields_hash` was removed from `Heartbeat` and `before_save :set_fields_hash!` was also removed. The import service previously set `attrs[:fields_hash]` before inserting, but that line is now gone. All imported heartbeats will have `fields_hash = ''` (the ClickHouse column default).

This means `existing_hashes` deduplication in any future code path, the `find_or_create_by(fields_hash:)` guard added in the controller, and the `UNIQUE` semantics of `ReplacingMergeTree` on `id`—none of these will work correctly for imported data. Two re-imports of the same file will produce duplicate rows that the engine won't merge (since `id` is part of `ORDER BY` and IDs are freshly generated each time).

Consider either restoring `fields_hash` computation in `flush_batch`, or tracking seen external timestamps/entities per-user to prevent re-import duplicates.
DO NOT MERGE UNTIL FURTHER TESTING IS DONE.