Implement multiple flagstores per service#87
Implement multiple flagstores per service#87FlorianKothmeier wants to merge 1 commit intofausecteam:masterfrom
Conversation
| sla_ok_by_servicegroup_tick AS ( | ||
| SELECT team_id, | ||
| scoring_service.service_group_id AS service_group_id, | ||
| 0.5 * servicegroup.services_count * (COUNT(*) = servicegroup.services_count)::int AS sla_ok |
There was a problem hiding this comment.
This is the SLA formula for each service group, which should be easy to replace in case we want to use another formula.
There was a problem hiding this comment.
This actually evaluates to 0.5 * len(services[group]) for each check result instead of 0.5 * len(services[group])**2 as I initially thought.
But, because it gets evaluated once for each service in the group this still results in 0.5 * len(services[group])**2. I think this was unintended, as their scoring provides the formula
sla = (count(ticks_with_status['up'] + 0.5 * ticks_with_status['recovering'])) * sqrt(count(teams)) * count(flag_stores_in_service)
Also note that this formula is the same as the formula for recovering sla points, so this formula should have had a factor of 1.0 instead.
To get the documented behavior, the formulas should have been:
sla: 1.0 * (COUNT(*) = servicegroup.services_count)::int
sla_recover: 0.5 * (COUNT(*) = servicegroup.services_count)::int
EDIT: their formula does work as intended actually
Some additional thoughts:
- Do we want sla to scale linearly with the number of flagstores? Other options include sla that doesn't depend on the number of flagstores or some non-linear scaling (maybe
sqrt(len(services[group])?) - Currently no partial sla points are awarded for having some flagstores available, when others are down. Maybe we should award
y/x * slaif y of x services in the group are up?
This PR copies the flagstore implementation from ECSC 2022.
To support multiple flagstores, multiple checker scripts have to be provided by the service authors. Each flagstore is then registered as a seperate service. Additionally, service categories group the flagstores to provide the "real" services.
Tested locally and the python checkerlib is confirmed working. I have tested multiple flagstore result combinations and they all behave as expected (
OK+OK -> OK,OK+FAULTY -> FAULTY,OK+NOT CHECKED -> NOT CHECKED, etc.)I suspect that the go checkerlib will need refactoring, because each (python) checker function additionally returns a message that is shown on the scoreboard. However, the go checkerlib has not been adapted to support this. I don't know whether we want to keep that feature. Furthermore, I know next to nothing about go, therefore someone else would need to take a look at it if we decide to keep it.
Additionally, I'm not sure about the scoring formula. AFAIK, each service group gets SLA points equal to
COUNT(FLAGSTORE)iff all of them are OK. This should be easy to adapt though, once we decide on what scoring formula to use.There is still some commented out SQL, which should be removed before merging.
Resolves #77