-
Notifications
You must be signed in to change notification settings - Fork 21
Feat: metrics summary #1646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: metrics summary #1646
Conversation
backend/kernelCI_app/management/commands/templates/metrics_report.txt.j2
Outdated
Show resolved
Hide resolved
6eb81ec to
a46ae4d
Compare
|
Issues and incidents are the complicated part I believe. For issues we could start only with the build ones and deploy the report with that info. And then be clear on the language that incidents refers to issues. Frequency could be weekly at first to iterate faster? And for a limited audience I guess. |
|
Looks good. A few things that confuse me, the use of 'incidents' and 'regressions'.
So "5 new regressions" + "25 recurring incidents" = "30 total incidents" The total then is a bit confusing, I see "X new incidents" listed in the breakdown and the total. I had a little play with the report and changed the formatting: I know its just draft, but I see there is setup_jinja_template() already in the codebase, maybe this could be reused? Or, if not maybe the generate_metrics_report() can be moved into notifications.py |
I didn't want to add this code to |
What about something like this: - 5 new regressions (issues that had their first incident in the given interval).
-
-
- 30 total incidents in this interval, being:
- 2 build incidents
- 5 boot incidents
- 23 test incidents
+ 20 build incidents in total, 3 of which are new regressions (the first incident of an issue)(made up numbers unrelated to the prior example) I don't want to just say "failures" because I'm not gathering builds that failed without any related issue. And is it useful to separate these incidents/regressions by origin? Btw thanks @bhcopeland for the formatting suggestion. I'll make some modifications and push changes today |
a099c20 to
158ce66
Compare
|
As me and @gustavobtflores are working on the ingester, we added some prometheus metrics to it. When we get it on production, we will be able to send metrics (and notifications) using Grafana directly, meaning that this PR would not be used then. Would you guys say it's ok to wait for that or should we move forward with the db queries for now? |
Do you have a link to this work? Will it support all the same metrics as the report? I personally see them as both useful, Prometheus is a "moving target", i.e. monitoring, this is reporting. Reporting to me is a capture a moment in time (or between two dates) which serves a slightly different purpose. Prom can do this, but we have to create dashboards and filter by time. It depends on its implementation. I still see value in both ways. |
We added the metrics in this PR: #1660
It'll support most if not all the metrics, only the new regressions that might be tricky. Also, I mean that Grafana can send email notifications actually, not just provide a dashboard for visualization |
|
Keep in mind counters get reset on restart, so depending on which metrics you want to show, you'll need to make sure you have handled their initialization properly |
|
@tales-aparecida that's true. It is possible to get around these problems but given this issue and also the fact that we would have to wait for the ingester integration for the grafana metrics, and also that using SQL would allow for querying metrics from much earlier intervals, I'll keep working on this. cc @AmadeusK525 |
Using a match case instead of if-elif won't trigger the complexity warning later on
158ce66 to
328381b
Compare
|
Since is always using a time frame (taken from your snippet, What I wouldn't know how to handle, though, is "regressions", I haven't thought too much about it, but I'm assuming that it will have to be a direct DB query, yeah. |
|
Can we not use an object with function pointers instead of a match case? Very inconvenient to read that file as it is, with the gigantic match case |
We could use subparsers to make it even better, but I think this is for a next PR, for now I just changed the if/else for a match/case to lower the complexity |
328381b to
751f571
Compare
| if not recipients: | ||
| recipients = _get_default_tree_recipients( | ||
| signup_folder=signup_folder, | ||
| search_url=git_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type for git_url in the function definition is Optional[str] and _get_default_tree_recipients expects str, we don't need to check if git_url is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is just the typing for _get_default_tree_recipients, because we already check for is not None inside of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, it didn't really made sense to search for a git_url if it is None or empty so I added a new validation in the beginning of the function
Moves setup_jinja_template, ask_confirmation and send_email_report outside of notifications.py so that the command file is not too big Also fixes None validation in _get_default_tree_recipients
Adds the new action, queries, classes and cron job Closes kernelci#1623
751f571 to
81393ee
Compare
gustavobtflores
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Adds a metrics summary notification and refactors some functions from the notifications command
Changes
handlemethod is lowerHow to test
Run
poetry run python3 manage.py notifications --action metrics_summaryand check the result, you can also change the interval in the code to check how that reflects in the query time and data.Example output:
Closes #1623