…al-time metrics tracking Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
… metrics collector code formatting Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
Pull Request Overview
This PR adds a Prometheus-based metrics system to STF, including real-time hooks, periodic collection, and a new /metrics endpoint.
- Integrates `prom-client` with custom gauges and helper functions (`lib/util/metrics.js`)
- Adds `MetricsCollector` service for periodic DB metric collection and lifecycle integration (`lib/util/metrics-collector.js`, `lib/units/api/index.js`)
- Implements real-time update hooks (`lib/util/metrics-hooks.js`), frontend reporting (`group-list-controller.js`), and exposes `/metrics` via a controller and Swagger docs
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| res/app/group-list/group-list-controller.js | Sends group stats from the UI to backend for metrics |
| package.json | Adds prom-client dependency |
| lib/util/metrics.js | Defines Prometheus metrics and update helpers |
| lib/util/metrics-hooks.js | Implements real-time hooks for entity changes |
| lib/util/metrics-collector.js | Collects metrics periodically from DB |
| lib/units/api/swagger/api_v1.yaml | Documents the /metrics endpoint |
| lib/units/api/index.js | Integrates MetricsCollector into the API lifecycle |
| lib/units/api/controllers/metrics.js | Serves Prometheus metrics via /metrics |
| lib/db/api.js | Exports getDevices() for metrics collection |
| .eslintrc | Updates ESLint config to ES6/2017 |
I have to review this PR seriously (I saw some problems, e.g. you can't export
…cs collector Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
…improved performance Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
…etter compatibility Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
You are right. Fixed.
Pull Request Overview
This PR integrates Prometheus-based metrics into the STF platform, adding real-time hooks, a periodic collector service, and an exposed /metrics endpoint.
- Adds `prom-client` and defines custom gauges in `lib/util/metrics.js`
- Implements `MetricsCollector` for scheduled metric aggregation and `MetricsHooks` for real-time updates
- Exposes a new `/metrics` API endpoint and updates frontend to emit group metrics
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| res/app/group-list/group-list-controller.js | Sends group counts to backend for metrics |
| package.json | Adds prom-client dependency |
| lib/util/metrics.js | Defines Prometheus registry and custom gauges |
| lib/util/metrics-hooks.js | Updates metrics on entity events in real time |
| lib/util/metrics-collector.js | Periodically gathers and updates metrics from DB |
| lib/units/api/swagger/api_v1.yaml | Defines /metrics path in Swagger spec |
| lib/units/api/index.js | Starts/stops MetricsCollector in app lifecycle |
| lib/units/api/controllers/metrics.js | Implements the /metrics controller |
| lib/db/api.js | Adds getDeviceMetrics aggregation function |
| .eslintrc | Updates parser options for ES6 |
Comments suppressed due to low confidence (2)
lib/util/metrics-collector.js:13
- There are no tests covering the `MetricsCollector` class or its `collectMetrics` and lifecycle behavior. Adding unit tests for these methods will help ensure reliability of the new metrics feature.

`class MetricsCollector {`

lib/db/api.js:1705
- The `log` variable is not defined in this scope. Please `require` the logger and create a `log` instance (e.g., `const logger = require('../util/logger'); const log = logger.createLogger('dbapi');`) at the top of the file.

`log.error('Error getting device metrics:', error)`
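
The first suppressed comment asks for tests of the collector's lifecycle. One way to make that behavior easy to test is to inject the collection function and the interval. The following is a minimal sketch under that assumption; `PeriodicCollector`, its constructor arguments, and the 30-second default are hypothetical and are not the PR's actual `MetricsCollector`.

```javascript
// Hypothetical sketch: names and defaults are assumptions, not the PR's code.
class PeriodicCollector {
  constructor(collectFn, intervalMs = 30000) {
    this.collectFn = collectFn
    this.intervalMs = intervalMs
    this.timer = null
  }

  // Run one collection; failures are logged and reported as `false`
  // instead of being thrown at the caller.
  async collectOnce() {
    try {
      await this.collectFn()
      return true
    }
    catch (err) {
      console.error('Metrics collection failed:', err.message)
      return false
    }
  }

  // Returns false if the collector is already running.
  start() {
    if (this.timer) {
      return false
    }
    this.timer = setInterval(() => this.collectOnce(), this.intervalMs)
    // Do not keep the process alive only for metrics collection.
    this.timer.unref()
    return true
  }

  // Returns false if the collector was not running.
  stop() {
    if (!this.timer) {
      return false
    }
    clearInterval(this.timer)
    this.timer = null
    return true
  }
}
```

With this shape, a unit test can pass a stub as `collectFn` and assert on the return values of `start()` and `stop()` without touching the database.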
Signed-off-by: matanbaruch <matan.baruch@unity3d.com>
@matanbaruch, your PR raises some concerns, and I would need more time to respond, which I currently lack; I should have more time later this summer or after. To quickly summarize, certain elements make it unacceptable in its current state. Here are some comments in bulk:
If you agree with all my comments, would it not be better to close this PR and propose a more mature one later, taking all my observations into account?
Thank you for the detailed feedback. I appreciate you taking the time despite your limited availability. Below are my responses to your comments.
I see your point regarding
Understood. I’ll refactor it to reside in a more appropriate location as per your suggestion.
You're right. I will clean up the debug logs accordingly.
I understand. I'll remove the unused code and ensure the PR reflects only production-ready functionality.
Thanks for the note. While I typically rely on squash-and-merge to keep history clean, I understand the importance of using dedicated branches for clarity. I’ll adopt that approach going forward.
That makes sense. I’ll include a
The goal here is to expose basic device-level metrics in a standardized way using Prometheus, which is a common tool in many monitoring stacks. It provides a strong starting point, and being open-source and extensible, it can accommodate further integrations if needed.
Understood, I’ll update the copyright accordingly. (Since this is open source, I didn't know that contributors' names are kept inside the code.)
Prometheus has a specific plain text exposition format designed for scraping, as outlined in their [data model documentation]. That said, I’ll double-check to ensure the output conforms strictly to their expected format.
Got it. To clarify, the Metrics API is not meant for the STF UI at all—it's purely backend-facing and intended to be scraped by Prometheus. I’ll ensure there are no UI hooks or unrelated changes.
I agree with most of the feedback and will incorporate all necessary changes. However, I’m not sure how closing and reopening the PR would help in this case, especially as the ongoing discussion and review context would be lost. I’d prefer to revise this PR directly, clean up the commit history, and continue reviewing here. Once the work is complete, we can squash and merge as needed.
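
For readers unfamiliar with the text exposition format mentioned in the reply above: it is line-oriented, with optional `# HELP` and `# TYPE` lines per metric followed by one sample per line. The sketch below renders a gauge by hand purely for illustration. In the PR this output would come from `prom-client`; `renderGauge` and `escapeLabelValue` are hypothetical helpers, and escaping of the help text is left out for brevity.

```javascript
// Hypothetical helpers for illustration only; prom-client generates this
// output itself.
function escapeLabelValue(value) {
  // Label values escape backslash, double quote, and newline.
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
}

function renderGauge(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`]
  for (const {labels = {}, value} of samples) {
    const pairs = Object.keys(labels)
      .map(key => `${key}="${escapeLabelValue(labels[key])}"`)
      .join(',')
    // Samples without labels omit the braces entirely.
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`)
  }
  return lines.join('\n') + '\n'
}
```

For example, `renderGauge('stf_devices_by_status', 'Devices by status', [{labels: {status: 'online'}, value: 3}])` produces a HELP line, a TYPE line, and the sample `stf_devices_by_status{status="online"} 3`. The metric name in this example is made up.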
@matanbaruch, if you’d like, feel free to open the same PR on our fork of OpenSTF: https://github.com/VKCOM/devicehub.
@denis99999 Can this get reviewed?
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

    // Send group metrics to backend if available (for metrics collection)
    if (typeof $window !== 'undefined' && $window.stfMetrics) {
      $window.stfMetrics.updateGroupMetrics({
        total: $scope.groups.length,
        active: $scope.activeGroups,

$window.stfMetrics is only referenced here; there’s no initialization/definition elsewhere in the repo, so this code path will never run and adds an extra DI dependency ($window) for no functional gain. Either wire up stfMetrics (e.g., via an Angular service / explicit script inclusion) or remove this block; also the comment implies a backend send, but this only calls a window hook.

    if (deviceData.byStatus) {
      Object.keys(deviceData.byStatus).forEach(status => {
        metrics.devicesByStatus.set({status}, deviceData.byStatus[status])
      })
    }

updateDeviceMetrics() sets devicesByStatus for keys present in deviceData.byStatus but never clears labels that disappeared since the previous collection. This can leave stale non-zero series in Prometheus when a status count drops to 0 or when a status no longer appears in the DB group result. Reset/remove old label values before re-setting the current snapshot.
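
One way to apply this is to clear the gauge before writing each snapshot. The sketch below assumes a gauge object that exposes `reset()` and `set(labels, value)`, which is the shape of a `prom-client` `Gauge`. `FakeGauge` is a hypothetical in-memory stand-in so the example runs without dependencies, and `setStatusSnapshot` is an illustrative name, not a function from the PR.

```javascript
// Hypothetical stand-in with the same reset()/set(labels, value) shape as a
// prom-client Gauge; used only so this sketch runs without dependencies.
class FakeGauge {
  constructor() {
    this.series = new Map()
  }

  reset() {
    this.series.clear()
  }

  set(labels, value) {
    this.series.set(JSON.stringify(labels), value)
  }
}

// Replace the whole labeled snapshot so statuses that vanished from the
// latest DB result do not linger as stale series.
function setStatusSnapshot(gauge, byStatus) {
  gauge.reset()
  for (const status of Object.keys(byStatus || {})) {
    gauge.set({status}, byStatus[status])
  }
}
```

Because `reset()` and the following `set()` calls run synchronously, a scrape served by the same Node.js process should not observe the empty intermediate state.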

    if (allocated > 0) {
      const percentage = (consumed / allocated) * 100
      metrics.userQuotaUsage.set({user, quota_type: quotaType}, percentage)
    }

updateUserQuota() leaves any previously-exported series untouched when allocated <= 0, so quota metrics can become stale (e.g., if allocation is removed/unknown). Consider explicitly setting 0 or removing/resetting the labeled series when allocated is not a positive number.
Suggested change:

    -}
    +}
    +else {
    +  // Allocation is zero, negative, or unknown: remove the stale metric series
    +  metrics.userQuotaUsage.remove({user, quota_type: quotaType})
    +}


    // Fix for Node.js versions where util.isError was removed
    const util = require('util')
    if (!util.isError) {
      util.isError = function(e) {
        return e && typeof e === 'object' && e instanceof Error
      }
    }

This controller mutates Node’s built-in util module by polyfilling util.isError, but nothing in this repo uses util.isError. Global monkey-patches like this are hard to reason about and can have unexpected side effects; please remove it unless a concrete dependency requires it.
Suggested change:

    -// Fix for Node.js versions where util.isError was removed
    -const util = require('util')
    -if (!util.isError) {
    -  util.isError = function(e) {
    -    return e && typeof e === 'object' && e instanceof Error
    -  }
    -}


    // Update the metrics
    metrics.updateDeviceMetrics(deviceData)
    metrics.updateUserMetrics(userData)
    metrics.updateGroupMetrics(groupData)

This PR introduces new aggregation/mapping logic for Prometheus metrics, but it isn’t covered by tests and the repo already has util-level tests (test/util/*). Adding unit tests around the collector/metric update behavior (e.g., status mapping, group active/ready/pending counts, devicesByStatus reset) would help prevent silent metric drift/regressions.

    static updateUserQuota(userEmail, quotaType, consumed, allocated) {
      // Update user quota metrics
      try {
        metrics.updateUserQuota(userEmail, quotaType, consumed, allocated)

The userQuotaUsage metric uses a per-user label (currently passed an email via MetricsHooks.updateUserQuota), which will expose PII in /metrics and can create unbounded/high-cardinality time series. Prefer aggregating (no per-user label), using a non-PII stable ID with strict cardinality controls, or removing this metric entirely unless you can guarantee safe/low-cardinality labels.

      name: 'stf_user_quota_usage_percent'
    , help: 'User quota usage percentage'
    , labelNames: ['user', 'quota_type']
    , registers: [register]
    })

userQuotaUsage uses a per-user label (currently passed an email via MetricsHooks.updateUserQuota), which exposes PII in /metrics and can create unbounded/high-cardinality time series. Prefer aggregating (no per-user label), using a non-PII stable ID with strict cardinality controls, or removing this metric unless you can guarantee safe/low-cardinality labels.
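
One way to act on this is to export only per-quota-type aggregates, so that neither emails nor per-user series reach `/metrics`. The sketch below assumes quota entries arrive as plain objects with hypothetical `quotaType`, `consumed`, and `allocated` fields; the function name and the choice of aggregates (user count, maximum, average) are illustrative.

```javascript
// Hypothetical aggregation: field names and the chosen aggregates are
// assumptions for illustration, not the PR's data model.
function summarizeQuotaUsage(entries) {
  const byType = {}
  for (const {quotaType, consumed, allocated} of entries) {
    if (!(allocated > 0)) {
      // Skip zero, negative, or unknown allocations rather than exporting
      // a misleading percentage.
      continue
    }
    const percent = (consumed / allocated) * 100
    const summary = byType[quotaType] ||
      (byType[quotaType] = {users: 0, maxPercent: 0, sumPercent: 0})
    summary.users += 1
    summary.sumPercent += percent
    summary.maxPercent = Math.max(summary.maxPercent, percent)
  }
  // Turn the running sum into an average and drop the helper field.
  for (const summary of Object.values(byType)) {
    summary.avgPercent = summary.sumPercent / summary.users
    delete summary.sumPercent
  }
  return byType
}
```

The result has one entry per quota type, so the label set stays bounded by the number of quota types rather than the number of users.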

    ])
    .then(function(results) {
      const totalCount = results[0]
      const statusCounts = results[1] || []
      const providerCount = results[2] || 0

      const stats = {
        total: totalCount
      , usable: 0
      , busy: 0
      , providers: providerCount
      , byStatus: {}
      }

      statusCounts.forEach(function(item) {
        const status = item.group || 'unknown'
        const count = item.reduction
        stats.byStatus[status] = count

        if (status === 'available' || status === 'busy') {
          stats.usable += count
        }
        if (status === 'busy') {
          stats.busy += count
        }

getDeviceMetrics() computes usable/busy by comparing the DB 'status' field to strings 'available'/'busy'. In STF, devices.status is an enum value (ONLINE/OFFLINE/UNAUTHORIZED/…) from wireutil.toDeviceStatus(), so these comparisons will never match and usable/busy will always be 0. Rework the aggregation to use the actual schema (e.g., busy based on owner != null / usable based on present+ready+owner==null, and map enum values to readable label names).
Suggested change:

    -])
    -.then(function(results) {
    -  const totalCount = results[0]
    -  const statusCounts = results[1] || []
    -  const providerCount = results[2] || 0
    -  const stats = {
    -    total: totalCount
    -  , usable: 0
    -  , busy: 0
    -  , providers: providerCount
    -  , byStatus: {}
    -  }
    -  statusCounts.forEach(function(item) {
    -    const status = item.group || 'unknown'
    -    const count = item.reduction
    -    stats.byStatus[status] = count
    -    if (status === 'available' || status === 'busy') {
    -      stats.usable += count
    -    }
    -    if (status === 'busy') {
    -      stats.busy += count
    -    }
    +// Get busy device count: any device that has an owner
    +, db.run(r.table('devices').filter(r.row('owner').ne(null)).count())
    +// Get usable device count: present + ready and no owner
    +, db.run(
    +    r.table('devices')
    +      .filter(
    +        r.row('present').eq(true)
    +          .and(r.row('ready').eq(true))
    +          .and(r.row('owner').eq(null))
    +      )
    +      .count()
    +  )
    +])
    +.then(function(results) {
    +  const totalCount = results[0]
    +  const statusCounts = results[1] || []
    +  const providerCount = results[2] || 0
    +  const busyCount = results[3] || 0
    +  const usableCount = results[4] || 0
    +  const stats = {
    +    total: totalCount
    +  , usable: usableCount
    +  , busy: busyCount
    +  , providers: providerCount
    +  , byStatus: {}
    +  }
    +  statusCounts.forEach(function(item) {
    +    const rawStatus = item.group
    +    // Normalize enum status to a readable label
    +    const statusLabel = rawStatus == null
    +      ? 'unknown'
    +      : String(rawStatus).toLowerCase()
    +    const count = item.reduction
    +    stats.byStatus[statusLabel] = count

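
If the stored status is a numeric enum value, `String(rawStatus).toLowerCase()` would produce labels such as `3` rather than a readable name. The sketch below shows an explicit mapping together with an in-memory version of the busy/usable rules from the review comment. The numeric values are an assumption and should be checked against STF's actual `DeviceStatus` definition; `statusLabel` and `summarizeDevices` are hypothetical helpers, not code from the PR.

```javascript
// ASSUMPTION: these numeric values are illustrative and must be verified
// against the real DeviceStatus enum before use.
const STATUS_LABELS = {
  1: 'offline',
  2: 'unauthorized',
  3: 'online',
  4: 'connecting',
  5: 'authorizing'
}

function statusLabel(rawStatus) {
  return STATUS_LABELS[rawStatus] || 'unknown'
}

// In-memory version of the rules described in the review comment:
// busy = has an owner; usable = present, ready, and without an owner.
function summarizeDevices(devices) {
  const stats = {total: devices.length, usable: 0, busy: 0, byStatus: {}}
  for (const device of devices) {
    const label = statusLabel(device.status)
    stats.byStatus[label] = (stats.byStatus[label] || 0) + 1
    if (device.owner) {
      stats.busy += 1
    }
    else if (device.present && device.ready) {
      stats.usable += 1
    }
  }
  return stats
}
```

The in-memory loop is only meant to make the rules concrete. The suggested change above performs the equivalent counts in the database.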

    const groupStats = {
      total: groups.length
    , active: groups.filter(g => g.state === 'active').length

collectGroupMetrics() counts “active” groups via g.state === 'active', but groups use an isActive boolean and state values like 'ready'/'pending'/'waiting'. This will report 0 active groups and mislead dashboards. Count active via g.isActive (and decide how to interpret state vs isActive for the other buckets).
Suggested change:

    -, active: groups.filter(g => g.state === 'active').length
    +, active: groups.filter(g => g.isActive).length


    const groupStats = {
      total: groups.length
    , active: groups.filter(g => g.state === 'active').length
    , ready: groups.filter(g => g.state === 'ready').length
    , pending: groups.filter(g => g.state === 'pending').length
    }

collectGroupMetrics() similarly loads the entire groups table via dbapi.getGroups() just to compute counts. Consider doing DB-side aggregations (count, group-by state/isActive) to avoid scanning and transferring all group rows every interval.
Suggested change:

    -const groupStats = {
    -  total: groups.length
    -, active: groups.filter(g => g.state === 'active').length
    -, ready: groups.filter(g => g.state === 'ready').length
    -, pending: groups.filter(g => g.state === 'pending').length
    -}
    +const groupStats = {
    +  total: groups.length
    +, active: 0
    +, ready: 0
    +, pending: 0
    +}
    +for (const g of groups) {
    +  if (!g || typeof g.state !== 'string') {
    +    continue
    +  }
    +  switch (g.state) {
    +    case 'active':
    +      groupStats.active++
    +      break
    +    case 'ready':
    +      groupStats.ready++
    +      break
    +    case 'pending':
    +      groupStats.pending++
    +      break
    +  }
    +}

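
The suggested loop still tests `g.state === 'active'`, which the earlier review comment says never matches because groups use an `isActive` boolean. The sketch below is a single pass that follows that earlier comment instead. It assumes groups carry `isActive` and a `state` string as described in the review; `countGroups` is a hypothetical helper name.

```javascript
// Hypothetical helper: counts active groups via the isActive flag and
// ready/pending groups via the state string, in one pass.
function countGroups(groups) {
  const stats = {total: groups.length, active: 0, ready: 0, pending: 0}
  for (const g of groups) {
    if (!g) {
      continue
    }
    if (g.isActive === true) {
      stats.active += 1
    }
    if (g.state === 'ready') {
      stats.ready += 1
    }
    else if (g.state === 'pending') {
      stats.pending += 1
    }
  }
  return stats
}
```

Here `active` is independent of `state`, so a group can count as both active and ready. Whether that is the intended interpretation is the open question raised in the review.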
This pull request introduces a comprehensive metrics collection and monitoring system for the STF (Smartphone Test Farm) application. The changes include adding Prometheus metrics support, implementing a metrics collector, and integrating real-time hooks for updating metrics. Additionally, the ESLint configuration has been updated for modern JavaScript support.
Metrics Collection and Monitoring:

- Prometheus Metrics Integration:
  - `prom-client` library for metrics collection (`package.json`, package.jsonR87).
  - `metrics.js` to define and manage custom metrics, including devices, users, and groups (`lib/util/metrics.js`, lib/util/metrics.jsR1-R161).
  - `/metrics` endpoint to serve Prometheus-compatible metrics (`lib/units/api/controllers/metrics.js`, [1]; `lib/units/api/swagger/api_v1.yaml`, [2]).
- Metrics Collection Service:
  - `MetricsCollector` class to periodically gather metrics from the database and update Prometheus metrics (`lib/util/metrics-collector.js`, lib/util/metrics-collector.jsR1-R148).
  - … (`lib/units/api/index.js`, [1] [2]).
- Real-Time Metrics Hooks:
  - `MetricsHooks` to update metrics in response to changes in devices, users, and groups (`lib/util/metrics-hooks.js`, lib/util/metrics-hooks.jsR1-R115).
  - … (`res/app/group-list/group-list-controller.js`, res/app/group-list/group-list-controller.jsR67-R75).

Codebase Enhancements:

- ESLint Configuration Update:
  - `.eslintrc` to support ES6+ features by enabling the `es6` environment and setting `ecmaVersion` to 2017 (`.eslintrc`, .eslintrcL4-R8).
- Database API Enhancements:
  - `getDeviceMetrics` function to securely aggregate device statistics without exposing sensitive data (`lib/db/api.js`, lib/db/api.jsR1664-R1715).

These changes collectively enhance the observability and maintainability of the STF system by providing detailed metrics for monitoring system health and usage.