
Conversation

@MarceloRobert
Collaborator

@MarceloRobert MarceloRobert commented Dec 2, 2025

Adds a metrics summary notification and refactors some functions from the notifications command

Changes

  • Refactors the action matching so that the handle method's complexity is lower
  • Moves some functions that are shared by multiple actions out of notifications.py
  • Adds the queries, models, logic and cron job for the metrics summary

How to test

Run poetry run python3 manage.py notifications --action metrics_summary and check the result. You can also change the interval in the code to see how that affects the query time and data.

Example output:

> KernelCI Metrics Report - 2025-12-12 18:04 UTC


KernelCI Metrics Summary
========================
Period: 2025-12-05 18:04 UTC to 2025-12-12 18:04 UTC


ACTIVITY
--------
    32 issues
   420 incidents
   100 checkouts
  1942 builds
291453 tests


BUILD REGRESSIONS
-----------------
Incidents are any occurrences of an issue.
New regressions are the first incident of an issue.

Origin      Total Incidents    New Regressions
──────────────────────────────────────────────
maestro     354                26


LAB ACTIVITY
------------
There were 12 labs registered. Incidents reported per lab:

Origin      Lab                    Builds    Boots     Tests
────────────────────────────────────────────────────────────
maestro     k8s-all                1826      0         273
maestro     k8s-gke-eu-west4       0         0         22788
maestro     lava-baylibre          0         1         0
maestro     lava-broonie           0         506       126589
maestro     lava-cip               0         209       0
maestro     lava-clabbe            0         3         0
maestro     lava-collabora         0         2058      99918
maestro     lava-foundriesio       0         53        0
maestro     lava-kci-qualcomm      0         61        20258
maestro     lava-kontron           0         96        0
maestro     lava-pengutronix       0         154       0
ti          opentest-ti            0         7         0
────────────────────────────────────────────────────────────
Total                              1826      3148      269826



--
This is an experimental report format. Please send feedback in!
Talk to us at [email protected]

Made with love by the KernelCI team - https://kernelci.org

==============================================

Closes #1623

@MarceloRobert MarceloRobert self-assigned this Dec 2, 2025
@padovan
Contributor

padovan commented Dec 2, 2025

Issues and incidents are the complicated part, I believe. For issues we could start with only the build ones and deploy the report with that info. And then be clear in the language that incidents refer to issues.

Frequency could be weekly at first to iterate faster? And for a limited audience I guess.

@bhcopeland
Member

bhcopeland commented Dec 3, 2025

Looks good. A few things confuse me: the use of 'incidents' and 'regressions'.

  • Regression = a new issue (first time this problem was seen)
  • Incident = any occurrence of an issue (including recurring ones)

So "5 new regressions" + "25 recurring incidents" = "30 total incidents"

The total is then a bit confusing: I see "X new incidents" listed both in the breakdown and in the total.

I had a little play with the report and changed the formatting:

  KernelCI Metrics Summary
  ========================
  Period: 2025-11-25 18:26 to 2025-12-02 18:26


  ACTIVITY
  --------
      12 checkouts
      40 builds
    6000 tests (boot + non-boot)


  REGRESSIONS
  -----------
  5 new regressions detected (issues with their first incident this period)

  30 incidents total, broken down by type:
       2 build failures
       5 boot failures
      23 test failures

  53 issues tracked (includes recurring issues from previous periods)


  LAB ACTIVITY
  ------------
  Incidents reported per lab:

      Lab                   Build    Boot    Test
      ─────────────────────────────────────────────
      lab1 (origin1)            5       2       0
      lab2 (origin2)            0       0      10
      ─────────────────────────────────────────────
      Total                     5       2      10

I know it's just a draft, but I see there is already a setup_jinja_template() in the codebase; maybe it could be reused? Or, if not, maybe generate_metrics_report() could be moved into notifications.py

@MarceloRobert
Collaborator Author

> I know it's just a draft, but I see there is already a setup_jinja_template() in the codebase; maybe it could be reused? Or, if not, maybe generate_metrics_report() could be moved into notifications.py

@bhcopeland

I didn't want to add this code to notifications.py, since it's already a pretty big file with functions for multiple actions. But then I couldn't use setup_jinja_template() in this new file, because it would create a circular dependency, and the subfolder that contains the template is different. I'm already looking into a small refactor to move that helper function out of notifications.py so that I can reuse it.

@MarceloRobert
Collaborator Author

MarceloRobert commented Dec 3, 2025

> Issues and incidents are the complicated part, I believe. For issues we could start with only the build ones and deploy the report with that info. And then be clear in the language that incidents refer to issues.

@padovan

> Looks good. A few things confuse me: the use of 'incidents' and 'regressions'.
>
>   • Regression = a new issue (first time this problem was seen)
>   • Incident = any occurrence of an issue (including recurring ones)
>     So "5 new regressions" + "25 recurring incidents" = "30 total incidents"
>
> The total is then a bit confusing: I see "X new incidents" listed both in the breakdown and in the total.

@bhcopeland

What about something like this:

- 5 new regressions (issues that had their first incident in the given interval).
-
-
- 30 total incidents in this interval, being:
-     2 build incidents
-     5 boot incidents
-     23 test incidents
+ 20 build incidents in total, 3 of which are new regressions (the first incident of an issue)

(made-up numbers, unrelated to the prior example)

I don't want to just say "failures" because I'm not gathering builds that failed without any related issue. And is it useful to separate these incidents/regressions by origin?
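As an illustration of the definition being discussed (a minimal sketch with a made-up incidents table, not the actual KernelCI schema): an issue counts as a new regression in a given interval if its earliest incident falls inside that interval.

```python
import sqlite3

# Illustrative schema, not the real KCIDB tables: incidents(issue_id, ts).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (issue_id TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [
        ("issue-a", "2025-11-20"),  # first seen before the interval
        ("issue-a", "2025-11-28"),  # recurring incident inside the interval
        ("issue-b", "2025-11-27"),  # first seen inside the interval
    ],
)

# An issue is a *new regression* in the interval if its earliest
# incident timestamp falls inside that interval.
new_regressions = conn.execute(
    """
    SELECT issue_id FROM incidents
    GROUP BY issue_id
    HAVING MIN(ts) BETWEEN '2025-11-25' AND '2025-12-02'
    """
).fetchall()
```

Here issue-b is a new regression while issue-a only contributes a recurring incident; the real query would presumably also join against builds/tests to split incidents by type and origin.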

Btw thanks @bhcopeland for the formatting suggestion. I'll make some modifications and push changes today

@MarceloRobert MarceloRobert force-pushed the feat/metrics-summary branch 2 times, most recently from a099c20 to 158ce66 Compare December 3, 2025 17:11
@MarceloRobert
Collaborator Author

MarceloRobert commented Dec 10, 2025

While working on the ingester, @gustavobtflores and I added some Prometheus metrics to it. Once it is in production, we will be able to send metrics (and notifications) using Grafana directly, meaning this PR would no longer be used. Would you say it's OK to wait for that, or should we move forward with the DB queries for now?

cc @tales-aparecida @bhcopeland @padovan

@bhcopeland
Member

> While working on the ingester, @gustavobtflores and I added some Prometheus metrics to it. Once it is in production, we will be able to send metrics using Grafana directly, meaning this PR would no longer be used. Would you say it's OK to wait for that, or should we move forward with the DB queries for now?
>
> cc @tales-aparecida @bhcopeland @padovan

Do you have a link to this work? Will it support all the same metrics as the report? I personally see both as useful: Prometheus is a "moving target", i.e. monitoring, while this is reporting. Reporting, to me, captures a moment in time (or between two dates), which serves a slightly different purpose. Prometheus can do this, but we have to create dashboards and filter by time; it depends on the implementation. I still see value in both approaches.

@MarceloRobert
Collaborator Author

> Do you have a link to this work?

We added the metrics in this PR: #1660

> Will it support all the same metrics as the report?

It'll support most, if not all, of the metrics; only the new regressions might be tricky.

Also, I mean that Grafana can actually send email notifications, not just provide a dashboard for visualization.

@tales-aparecida

Keep in mind that counters get reset on restart, so depending on which metrics you want to show, you'll need to make sure their initialization is handled properly.
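For context, monitoring systems typically cope with counter resets by detecting a decrease between consecutive samples and treating it as a restart; a stdlib-only sketch of that logic (illustrative, not tied to any real Prometheus client code):

```python
def increase(samples: list[float]) -> float:
    """Total increase of a monotonic counter over a series of samples,
    tolerating resets to zero (similar to what PromQL's increase() does).
    Assumes a restarted counter resumes counting from 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset detected: the process restarted at 0,
            # so the whole current value is new increase.
            total += cur
    return total
```

For example, increase([0, 5, 10, 2, 4]) yields 14: ten increments before the restart at the 10→2 drop, then four more after it.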

@MarceloRobert
Collaborator Author

@tales-aparecida that's true. It is possible to work around these problems, but given this issue, the fact that we would have to wait for the ingester integration for the Grafana metrics, and that using SQL allows querying metrics from much earlier intervals, I'll keep working on this.
Having the Grafana metrics will be good too, but then it will basically just be an addition to this feature.

cc @AmadeusK525

Using a match case instead of if-elif won't trigger the complexity warning later on
@AmadeusK525
Contributor

Since it's always using a time frame (taken from your snippet: Period: 2025-11-26 17:11 UTC to 2025-12-03 17:11 UTC), counter restarts will never be a problem, since we can still get the total amount of things being reported (and the rate, if necessary; Grafana is pretty flexible in that regard).

What I wouldn't know how to handle, though, is "regressions". I haven't thought too much about it, but I'm assuming it will have to be a direct DB query, yeah.

@AmadeusK525
Contributor

Can we not use an object with function pointers instead of a match case? The file is very inconvenient to read as it is, with the gigantic match case.

@MarceloRobert
Collaborator Author

> Can we not use an object with function pointers instead of a match case? Very inconvenient to read that file as it is, with the gigantic match case

We could use subparsers to make it even better, but I think that's for a future PR; for now I just changed the if/else to a match/case to lower the complexity.
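For reference, the dispatch-table idea could look roughly like this (a minimal sketch; the action names and handler functions are hypothetical placeholders, not the actual notifications command API):

```python
from typing import Callable

# Hypothetical handlers standing in for the real per-action functions.
def _handle_metrics_summary() -> str:
    return "metrics_summary handled"

def _handle_issue_report() -> str:
    return "issue_report handled"

# One lookup table replaces the large match/case: adding an action
# becomes a one-line entry instead of another branch in handle().
ACTION_HANDLERS: dict[str, Callable[[], str]] = {
    "metrics_summary": _handle_metrics_summary,
    "issue_report": _handle_issue_report,
}

def handle(action: str) -> str:
    try:
        return ACTION_HANDLERS[action]()
    except KeyError:
        raise ValueError(f"Unknown action: {action}") from None
```

This keeps the cyclomatic complexity of handle() constant regardless of how many actions exist, at the cost of the handlers needing a uniform signature.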

if not recipients:
    recipients = _get_default_tree_recipients(
        signup_folder=signup_folder,
        search_url=git_url,
Contributor

The type of git_url in the function definition is Optional[str], while _get_default_tree_recipients expects str. Don't we need to check if git_url is None?

Collaborator Author

@MarceloRobert MarceloRobert Dec 15, 2025

I think the problem is just the typing of _get_default_tree_recipients, because we already check is not None inside it.

Collaborator Author

Fixed. It didn't really make sense to search for a git_url if it is None or empty, so I added a new validation at the beginning of the function.

Moves setup_jinja_template, ask_confirmation and send_email_report outside of notifications.py so that the command file is not too big

Also fixes None validation in _get_default_tree_recipients
Adds the new action, queries, classes and cron job

Closes kernelci#1623
Contributor

@gustavobtflores gustavobtflores left a comment

LGTM

@MarceloRobert MarceloRobert added this pull request to the merge queue Dec 15, 2025
Merged via the queue into kernelci:main with commit 992981f Dec 15, 2025
7 checks passed