When aggregating results over multiple problems or problem instances, the number of runs per algorithm can end up imbalanced. *Idea*: Drop the results that cause the imbalance and emit a warning listing what was excluded.
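
A minimal Python sketch of the drop-and-warn idea, under assumptions that are not part of the note above: results are `(problem, algorithm, value)` tuples, balancing happens per problem, and surplus runs are cut from the end of each algorithm's list. The function name `balance_runs` is hypothetical.

```python
import warnings
from collections import defaultdict


def balance_runs(results):
    """Keep the same number of runs per algorithm within each problem.

    `results` is a list of (problem, algorithm, value) tuples. This layout
    is an assumption made for the sketch, not an existing data format.
    Returns (kept, dropped).
    """
    # Group run values by problem, then by algorithm, preserving input order.
    grouped = defaultdict(lambda: defaultdict(list))
    for problem, algorithm, value in results:
        grouped[problem][algorithm].append(value)

    kept, dropped = [], []
    for problem, by_algorithm in grouped.items():
        # The smallest run count among the algorithms that have at least one
        # run on this problem sets the quota. Algorithms with no runs on the
        # problem are not detected here.
        quota = min(len(values) for values in by_algorithm.values())
        for algorithm, values in by_algorithm.items():
            kept.extend((problem, algorithm, v) for v in values[:quota])
            dropped.extend((problem, algorithm, v) for v in values[quota:])

    if dropped:
        warnings.warn(
            f"Excluded {len(dropped)} result(s) to balance run counts: {dropped}"
        )
    return kept, dropped
```

Called with three runs of algorithm `"A"` and two runs of `"B"` on the same problem, this keeps two runs of each, returns the third `"A"` run in `dropped`, and raises a single `UserWarning`. Which runs to drop (last, random, worst) is a design choice the note leaves open; cutting from the end is only the simplest option.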