
Competition Design and Mechanisms Feedback and Ideas #30

@davidgasquez


Hey there! I wanted to open this issue to share and discuss some of the issues I've found as a competition participant and as someone who got nerd-sniped into thinking about the mechanism design.

Before jumping into the issues and ideas, some context: I'm looking at the deepfunding problem as decomposable into two independent mechanisms.

  1. Creating an up-to-date and flexible Graph of Dependencies.
  2. Assigning "accurate" weights to an arbitrary Graph of Dependencies. This covers all the scaling of human judgment, preference aggregation, ...

Juror Evaluations Statistical Significance

Currently, most Juror ratings lack consistency and statistical significance. This makes any model trained on them very noisy. Collecting edges randomly results in very few of them ending up with a solid weight.

If future implementations require data gathering, it should be done in more adaptive ways that extract the maximum amount of information from each juror evaluation. There might also be a chance to use consensus-building algorithms from Community Notes or similar projects that aggregate information in multiple ways.

The two main open questions for me here are: how do we balance juror diversity with the need for consistency, and should consistently deviating jurors be penalized or weighted differently? We probably want very diverse jurors, but also ones that are "consistent enough".
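To make the "weighted differently" option concrete, here is a minimal sketch of one possible consensus-building loop: compute a weighted consensus per edge, then downweight jurors by their average deviation from it, and iterate. This is a toy version of the reputation-weighting idea, not what Community Notes actually does, and the data shapes are hypothetical.

```python
from collections import defaultdict

def weighted_consensus(ratings, n_rounds=5):
    """Iteratively downweight jurors who deviate from the consensus.

    ratings: dict mapping (juror, edge) -> score in [0, 1].
    Returns (edge_consensus, juror_weights).
    """
    jurors = {juror for juror, _ in ratings}
    weights = {juror: 1.0 for juror in jurors}
    consensus = {}
    for _ in range(n_rounds):
        # Weighted mean score per edge under the current juror weights.
        totals, norms = defaultdict(float), defaultdict(float)
        for (juror, edge), score in ratings.items():
            totals[edge] += weights[juror] * score
            norms[edge] += weights[juror]
        consensus = {edge: totals[edge] / norms[edge] for edge in totals}
        # Re-weight each juror inversely to their mean absolute deviation.
        for juror in jurors:
            devs = [abs(score - consensus[edge])
                    for (j, edge), score in ratings.items() if j == juror]
            weights[juror] = 1.0 / (1e-6 + sum(devs) / len(devs))
    return consensus, weights
```

One nice property: a juror who is consistently off-consensus still contributes, just with less influence, instead of being dropped outright.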

Dependency Graph Collection Accuracy and Completeness

Since the real dependency graph is not static, any "snapshot" of it quickly becomes outdated. If collected programmatically, as it is now, the process can also produce incomplete data. The result is graphs with mismatches between the competition data and the actual project dependencies. There is also an ongoing question of how to allow other, more abstract kinds of dependencies (e.g., papers, references, ...).

A simple solution would be to have the programmatic script that currently generates the CSVs/JSON also generate a set of YAML files in a repository that maintainers and other community members could inspect, edit, and discuss. In theory this is possible in the current setup, but editing CSV files is not as simple as editing small, project-focused YAML files. The complex solution other folks are thinking about is Prediction Markets, but that probably deserves its own issue.

The YAML-on-a-repo solution seems like a great compromise to me: it doesn't add much overhead to the current process and invites anyone to collaborate, while keeping the "admin decides" approach of the current setup.
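For illustration, a per-project file could look something like this (all field names here are hypothetical, a sketch rather than a proposed schema):

```yaml
# deps/my-project.yaml (hypothetical layout)
project: my-project
repo: https://github.com/example/my-project
dependencies:
  - name: some-library
    kind: runtime       # runtime | dev | abstract (paper, reference, ...)
  - name: a-foundational-paper
    kind: abstract
```

Small files like this keep diffs per-project, so a maintainer PR touching one dependency is easy to review and discuss.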

There is also one big question in this area: how do we prevent projects from behaving adversarially (hiding dependencies, declaring fake ones)?

Intensity Scoring in Pairwise Comparisons

Asking jurors for intensity in pairwise comparisons introduces noise and bias. More importantly, it also breaks desirable pairwise properties such as order independence. Some of the issues with asking for and using intensity are:

  • Order dependence: the first comparison becomes an anchor point affecting all subsequent ratings.
  • Jurors provide wildly different intensity values (e.g., 999x vs 100x for the same comparison). This can be dealt with later via log or other transformations, but each juror will still have their own scale.
  • Jurors lack a global view. They only see a tiny percentage of the entries, which makes a consistent scale impossible.

Intensity measurements in pairwise comparisons introduce noise and a temporal dependency between evaluations.

The cleanest solution in my mind is to remove intensity entirely and use only binary comparisons. These simpler binary comparisons, layered with an aggregation step (Elo / Bradley-Terry), should provide more robust results.
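As a sketch of what that aggregation step could look like, here is a minimal Bradley-Terry fit (the standard MM iteration) over binary win/loss comparisons. The input shape, a list of (winner, loser) pairs, is an assumption about how the juror data would be stored.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_iters=100):
    """Fit Bradley-Terry strengths from binary pairwise comparisons.

    comparisons: list of (winner, loser) pairs.
    Returns a dict item -> strength, normalized to sum to 1.
    """
    wins = defaultdict(int)          # total wins per item
    pair_counts = defaultdict(int)   # matches played per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strengths = {item: 1.0 for item in items}
    for _ in range(n_iters):
        updated = {}
        for i in items:
            # MM update: p_i = W_i / sum_j (n_ij / (p_i + p_j))
            denom = sum(
                n / (strengths[i] + strengths[next(iter(pair - {i}))])
                for pair, n in pair_counts.items() if i in pair
            )
            updated[i] = wins[i] / denom if denom else strengths[i]
        strengths = updated

    total = sum(strengths.values())
    return {item: s / total for item, s in strengths.items()}
```

Unlike Elo, which updates ratings sequentially, this fit uses all comparisons at once, so the result doesn't depend on the order in which jurors answered.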

Model Training Approach for Graph Completion

Currently, participants' models are expected to scale the human judgment. After spending some time with the data and training models on it, I see a few issues with that approach.

  • Models will overfit to the limited training data.
  • Models waste the signal held back in the test set. The model that wins and decides the final weights won't be trained on all the juror data, only a subset! This is useful to avoid overfitting during the competition, but it makes models less powerful than they could be.
  • A small number of L1 repositories have disproportionate influence on the entire graph, since they are the entrypoints.
  • The train/test split is sensitive: shuffling the data has a significant impact on the results.
  • L1 and L2 nodes have few samples and different semantic meanings.
  • Static rewards don't incentivize edge-case discoveries or model diversity.

These are issues usually also present in Kaggle competitions and hard to deal with. That said, I'd like to propose a potential alternative approach. The idea is to reverse the order of the mechanisms. It could look like this:

  1. Organizers create a canonical graph without weights.
  2. Participants submit the same graph with weights. They can use any techniques they want: LLMs, classic ML on GitHub data, ...
  3. All jurors' evaluations can be used to score submissions or to rank participants' graphs by alignment. Here, models could also be rewarded on edge-level prediction accuracy, not just the overall score. Another alternative would be to let jurors vote on model results directly.

Sybil Resistance and Competition Integrity

Another issue, made more relevant by the small number of samples in L1 and L2, is that there is no effective sybil resistance against multiple accounts or coordinated groups. Moreover, even without a sybil attack, since the platform allows 3 submissions per day, you could train a Gaussian process to overfit and nail the current L1 and L2 training weights. That gives an amazing score that is not useful at all.

As for the solution, I've written a bit about an idea of using Git to run these sorts of competitions. There might be other interesting solutions to explore like using stake requirements.


Would love to hear if anything doesn't make sense or you have any feedback on these ideas. I'm sure they're flawed in ways I'd love to learn more about! 😉
