
Conversation

@MarceloRobert
Collaborator

@MarceloRobert MarceloRobert commented Dec 2, 2025

Adds a metrics summary notification and refactors some functions from the notifications command

Changes

  • Refactors the action matching so that the handle method's complexity is lower
  • Moves some functions that are shared by multiple actions out of notifications.py
  • Adds the queries, models, logic and cron job for the metrics summary

How to test

Run poetry run python3 manage.py notifications --action metrics_summary and check the result. You can also change the interval in the code to see how that affects the query time and data.

Example output:

> KernelCI Metrics Report - 2025-12-12 18:04 UTC


KernelCI Metrics Summary
========================
Period: 2025-12-05 18:04 UTC to 2025-12-12 18:04 UTC


ACTIVITY
--------
    32 issues
   420 incidents
   100 checkouts
  1942 builds
291453 tests


BUILD REGRESSIONS
-----------------
Incidents are any occurrences of an issue.
New regressions are the first incident of an issue.

Origin      Total Incidents    New Regressions
──────────────────────────────────────────────
maestro     354                26


LAB ACTIVITY
------------
There were 12 labs registered. Incidents reported per lab:

Origin      Lab                    Builds    Boots     Tests
────────────────────────────────────────────────────────────
maestro     k8s-all                1826      0         273
maestro     k8s-gke-eu-west4       0         0         22788
maestro     lava-baylibre          0         1         0
maestro     lava-broonie           0         506       126589
maestro     lava-cip               0         209       0
maestro     lava-clabbe            0         3         0
maestro     lava-collabora         0         2058      99918
maestro     lava-foundriesio       0         53        0
maestro     lava-kci-qualcomm      0         61        20258
maestro     lava-kontron           0         96        0
maestro     lava-pengutronix       0         154       0
ti          opentest-ti            0         7         0
────────────────────────────────────────────────────────────
Total                              1826      3148      269826



--
This is an experimental report format. Please send feedback in!
Talk to us at [email protected]

Made with love by the KernelCI team - https://kernelci.org

==============================================

Closes #1623

@MarceloRobert MarceloRobert self-assigned this Dec 2, 2025
@padovan
Contributor

padovan commented Dec 2, 2025

Issues and incidents are the complicated part, I believe. For issues we could start with only the build ones and deploy the report with that info. And then be clear in the language that incidents refer to issues.

Frequency could be weekly at first to iterate faster? And for a limited audience I guess.

@bhcopeland
Member

bhcopeland commented Dec 3, 2025

Looks good. A few things confuse me: the use of 'incidents' and 'regressions'.

  • Regression = a new issue (first time this problem was seen)
  • Incident = any occurrence of an issue (including recurring ones)

So "5 new regressions" + "25 recurring incidents" = "30 total incidents"

The total is then a bit confusing: I see "X new incidents" listed both in the breakdown and in the total.

I had a little play with the report and changed the formatting:

  KernelCI Metrics Summary
  ========================
  Period: 2025-11-25 18:26 to 2025-12-02 18:26


  ACTIVITY
  --------
      12 checkouts
      40 builds
    6000 tests (boot + non-boot)


  REGRESSIONS
  -----------
  5 new regressions detected (issues with their first incident this period)

  30 incidents total, broken down by type:
       2 build failures
       5 boot failures
      23 test failures

  53 issues tracked (includes recurring issues from previous periods)


  LAB ACTIVITY
  ------------
  Incidents reported per lab:

      Lab                   Build    Boot    Test
      ─────────────────────────────────────────────
      lab1 (origin1)            5       2       0
      lab2 (origin2)            0       0      10
      ─────────────────────────────────────────────
      Total                     5       2      10

I know it's just a draft, but I see there is already a setup_jinja_template() in the codebase; maybe it could be reused? Or, if not, maybe generate_metrics_report() could be moved into notifications.py

@MarceloRobert
Collaborator Author

> I know it's just a draft, but I see there is already a setup_jinja_template() in the codebase; maybe it could be reused? Or, if not, maybe generate_metrics_report() could be moved into notifications.py

@bhcopeland

I didn't want to add this code to notifications.py, since it's already a pretty big file with functions for multiple actions. But then I couldn't use setup_jinja_template() in this new file, because it would create a circular dependency, and the subfolder that contains the template is different. I'm already looking into a small refactor to move that helper function out of notifications.py so that I can reuse it.

@MarceloRobert
Collaborator Author

MarceloRobert commented Dec 3, 2025

> Issues and incidents are the complicated part, I believe. For issues we could start with only the build ones and deploy the report with that info. And then be clear in the language that incidents refer to issues.

@padovan

> Looks good. A few things confuse me: the use of 'incidents' and 'regressions'.
>
>   • Regression = a new issue (first time this problem was seen)
>   • Incident = any occurrence of an issue (including recurring ones)
>     So "5 new regressions" + "25 recurring incidents" = "30 total incidents"
>
> The total is then a bit confusing: I see "X new incidents" listed both in the breakdown and in the total.

@bhcopeland

What about something like this:

- 5 new regressions (issues that had their first incident in the given interval).
-
-
- 30 total incidents in this interval, being:
-     2 build incidents
-     5 boot incidents
-     23 test incidents
+ 20 build incidents in total, 3 of which are new regressions (the first incident of an issue)

(made-up numbers, unrelated to the prior example)

I don't want to just say "failures" because I'm not gathering builds that failed without any related issue. And is it useful to separate these incidents/regressions by origin?
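As an illustration of the definition being discussed (a minimal sketch with a made-up incidents table, not the actual KernelCI schema): an issue counts as a new regression in a given interval if its earliest incident falls inside that interval.

```python
import sqlite3

# Illustrative schema, not the real KCIDB tables: incidents(issue_id, ts).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (issue_id TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [
        ("issue-a", "2025-11-20"),  # first seen before the interval
        ("issue-a", "2025-11-28"),  # recurring incident inside the interval
        ("issue-b", "2025-11-27"),  # first seen inside the interval
    ],
)

# An issue is a *new regression* in the interval if its earliest
# incident timestamp falls inside that interval.
new_regressions = conn.execute(
    """
    SELECT issue_id FROM incidents
    GROUP BY issue_id
    HAVING MIN(ts) BETWEEN '2025-11-25' AND '2025-12-02'
    """
).fetchall()
```

Here issue-b is a new regression while issue-a only contributes a recurring incident; the real query would presumably also join against builds/tests to split incidents by type and origin.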

Btw thanks @bhcopeland for the formatting suggestion. I'll make some modifications and push changes today

@MarceloRobert MarceloRobert force-pushed the feat/metrics-summary branch 2 times, most recently from a099c20 to 158ce66 Compare December 3, 2025 17:11
@MarceloRobert
Collaborator Author

MarceloRobert commented Dec 10, 2025

While working on the ingester, @gustavobtflores and I added some Prometheus metrics to it. Once it is in production, we will be able to send metrics (and notifications) using Grafana directly, meaning this PR would no longer be used. Would you say it's OK to wait for that, or should we move forward with the DB queries for now?

cc @tales-aparecida @bhcopeland @padovan

@bhcopeland
Member

> While working on the ingester, @gustavobtflores and I added some Prometheus metrics to it. Once it is in production, we will be able to send metrics using Grafana directly, meaning this PR would no longer be used. Would you say it's OK to wait for that, or should we move forward with the DB queries for now?
>
> cc @tales-aparecida @bhcopeland @padovan

Do you have a link to this work? Will it support all the same metrics as the report? I personally see both as useful: Prometheus is a "moving target", i.e. monitoring, while this is reporting. Reporting, to me, captures a moment in time (or between two dates), which serves a slightly different purpose. Prometheus can do this, but we have to create dashboards and filter by time; it depends on the implementation. I still see value in both approaches.

@MarceloRobert
Collaborator Author

> Do you have a link to this work?

We added the metrics in this PR: #1660

> Will it support all the same metrics as the report?

It'll support most, if not all, of the metrics; only the new regressions might be tricky.

Also, I mean that Grafana can actually send email notifications, not just provide a dashboard for visualization.

@tales-aparecida

Keep in mind that counters get reset on restart, so depending on which metrics you want to show, you'll need to make sure their initialization is handled properly.
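For context, monitoring systems typically cope with counter resets by detecting a decrease between consecutive samples and treating it as a restart; a stdlib-only sketch of that logic (illustrative, not tied to any real Prometheus client code):

```python
def increase(samples: list[float]) -> float:
    """Total increase of a monotonic counter over a series of samples,
    tolerating resets to zero (similar to what PromQL's increase() does).
    Assumes a restarted counter resumes counting from 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset detected: the process restarted at 0,
            # so the whole current value is new increase.
            total += cur
    return total
```

For example, increase([0, 5, 10, 2, 4]) yields 14: ten increments before the restart at the 10→2 drop, then four more after it.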

@MarceloRobert
Collaborator Author

@tales-aparecida that's true. It is possible to work around these problems, but given this issue, the fact that we would have to wait for the ingester integration for the Grafana metrics, and that using SQL allows querying metrics from much earlier intervals, I'll keep working on this.
Having the Grafana metrics will be good too, but then it will basically just be an addition to this feature.

cc @AmadeusK525

Using a match case instead of if-elif won't trigger the complexity warning later on
@AmadeusK525
Contributor

Since it's always using a time frame (taken from your snippet: Period: 2025-11-26 17:11 UTC to 2025-12-03 17:11 UTC), counter restarts will never be a problem, since we can still get the total amount of things being reported (and the rate, if necessary; Grafana is pretty flexible in that regard).

What I wouldn't know how to handle, though, is "regressions". I haven't thought too much about it, but I'm assuming it will have to be a direct DB query, yeah.

@AmadeusK525
Contributor

Can we not use an object with function pointers instead of a match case? The file is very inconvenient to read as it is, with the gigantic match case.

@MarceloRobert
Collaborator Author

> Can we not use an object with function pointers instead of a match case? Very inconvenient to read that file as it is, with the gigantic match case

We could use subparsers to make it even better, but I think that's for a future PR; for now I just changed the if/else to a match/case to lower the complexity.
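For reference, the dispatch-table idea could look roughly like this (a minimal sketch; the action names and handler functions are hypothetical placeholders, not the actual notifications command API):

```python
from typing import Callable

# Hypothetical handlers standing in for the real per-action functions.
def _handle_metrics_summary() -> str:
    return "metrics_summary handled"

def _handle_issue_report() -> str:
    return "issue_report handled"

# One lookup table replaces the large match/case: adding an action
# becomes a one-line entry instead of another branch in handle().
ACTION_HANDLERS: dict[str, Callable[[], str]] = {
    "metrics_summary": _handle_metrics_summary,
    "issue_report": _handle_issue_report,
}

def handle(action: str) -> str:
    try:
        return ACTION_HANDLERS[action]()
    except KeyError:
        raise ValueError(f"Unknown action: {action}") from None
```

This keeps the cyclomatic complexity of handle() constant regardless of how many actions exist, at the cost of the handlers needing a uniform signature.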

if not recipients:
    recipients = _get_default_tree_recipients(
        signup_folder=signup_folder,
        search_url=git_url,
Contributor

The type of git_url in the function definition is Optional[str], while _get_default_tree_recipients expects str. Don't we need to check if git_url is None?

Collaborator Author

@MarceloRobert MarceloRobert Dec 15, 2025

I think the problem is just the typing of _get_default_tree_recipients, because we already check is not None inside it.

Collaborator Author

Fixed. It didn't really make sense to search for a git_url if it is None or empty, so I added a new validation at the beginning of the function.

Moves setup_jinja_template, ask_confirmation and send_email_report outside of notifications.py so that the command file is not too big

Also fixes None validation in _get_default_tree_recipients
Adds the new action, queries, classes and cron job

Closes kernelci#1623
Contributor

@gustavobtflores gustavobtflores left a comment

LGTM

@MarceloRobert MarceloRobert added this pull request to the merge queue Dec 15, 2025
Merged via the queue into kernelci:main with commit 992981f Dec 15, 2025
7 checks passed