Improve data quality by zverianskii · Pull Request #15 · withmartian/code-review-benchmark

zverianskii · 2026-03-05T08:54:14Z

No description provided.

Copilot

Pull request overview

This PR improves dataset quality by (1) extracting and backfilling missing PR author metadata and (2) adding “quality” filters (exclude self-authored PRs, require non-empty reviews) to both the Streamlit dashboard and the Rust API service.

Changes:

Extend PR metadata extraction to derive title/author/created-at from IssueCommentEvent payloads when needed.
Add an async backfill job + CLI command to populate prs.pr_author from the GitHub REST API and persist raw PR API responses.
Add “Exclude self-authored” + “Require reviews” filters end-to-end (DB query → API snapshot → UI controls).
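
As a rough illustration of how the two quality filters might combine, here is a minimal sketch in Python. It assumes a PR counts as self-authored when `pr_author` equals the username of the chatbot under evaluation; the `Record` and `Filters` types and the `chatbot_username` field are hypothetical, and the actual `record_matches` in `compute.rs` may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    chatbot_username: str      # hypothetical field: the bot that reviewed the PR
    pr_author: Optional[str]   # None while the author is still unknown
    has_reviews: bool


@dataclass
class Filters:
    exclude_self_authored: bool = False
    require_reviews: bool = False


def is_self_authored(rec: Record) -> bool:
    # Assumption: "self-authored" means the bot under evaluation opened the PR.
    if rec.pr_author is None:
        return False
    return rec.pr_author.lower() == rec.chatbot_username.lower()


def record_matches(rec: Record, f: Filters) -> bool:
    # Each filter only rejects records; with both disabled everything passes.
    if f.exclude_self_authored and is_self_authored(rec):
        return False
    if f.require_reviews and not rec.has_reviews:
        return False
    return True
```

In this sketch a record with an unknown author passes the self-authored filter, which is one reason the backfill of `pr_author` matters for filter accuracy.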

Reviewed changes

Copilot reviewed 15 out of 21 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
online/etl/pipeline/discover.py	Populate PR metadata from `IssueCommentEvent` payloads.
online/etl/pipeline/assemble.py	Same metadata fallback during assembly.
online/etl/pipeline/backfill_pr_author.py	New async backfill worker pool to fetch `pr_author` and store `pr_api_raw`.
online/etl/main.py	Wire new `backfill-pr-author` subcommand.
online/etl/db/schema.py	Add `pr_api_raw` column migration.
online/etl/dashboard/data.py	Include `pr_author` + `has_reviews` in analyses query.
online/etl/dashboard/app.py	Add Streamlit quality filter controls and apply them.
online/etl/config.py	Add an additional default chatbot username.
online/api_service/static/index.html	Add quality filter checkboxes and propagate query params.
online/api_service/src/model.rs	Add `self_authored`/`has_reviews` fields + filter params.
online/api_service/src/handlers.rs	Parse new query params into `FilterParams`.
online/api_service/src/db.rs	Load `pr_author` + `has_reviews`; compute `self_authored`.
online/api_service/src/compute.rs	Enforce new filters in `record_matches`.
offline/results/openai_gpt-5.2/candidates.json	Update offline results data (new model key/entries).
offline/results/anthropic_claude-opus-4-5-20251101/candidates.json	Update offline results data (new model key/entries).
.gitignore	Ignore additional local/editor files.

Copilot · 2026-03-05T08:59:35Z

online/etl/main.py

+    cfg = DBConfig(verbose=args.verbose)
+    if args.database_url:
+        cfg.database_url = args.database_url
+
+    db = DBAdapter(cfg.database_url)


cmd_backfill_pr_author creates DBConfig but never builds/validates the GitHub token list (unlike cmd_enrich). If env vars are missing, this can run with an empty token and then write incorrect placeholders. Reuse the token-selection + validation logic from cmd_enrich before calling the backfill.

Copilot · 2026-03-05T08:59:35Z

online/etl/pipeline/backfill_pr_author.py

+    tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
+    pool = TokenPool(tokens)
+    n_tokens = pool.size
+    n_workers = n_tokens * 10
+    logger.info(f"Using {n_tokens} GitHub token(s), {n_workers} workers")


tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token] will produce [""] when neither GITHUB_TOKENS nor GITHUB_TOKEN is set (DBConfig defaults to empty string). That leads to unauthenticated API calls and, with the current logic, can mass-update rows incorrectly. Validate tokens here (non-empty strings) and fail fast if none are configured.
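
One possible shape for the fail-fast check is sketched below. The helper name `resolve_github_tokens` is hypothetical, and unlike the original expression it merges the list and the single token rather than preferring one over the other; that merge is an assumption, not the project's behavior.

```python
from typing import Iterable, List, Optional


def resolve_github_tokens(
    tokens: Optional[Iterable[str]], single_token: Optional[str]
) -> List[str]:
    """Return non-empty, de-duplicated tokens, or raise if none are configured."""
    candidates = list(tokens or [])
    if single_token:
        candidates.append(single_token)

    cleaned: List[str] = []
    for tok in candidates:
        tok = (tok or "").strip()
        # Drop blanks (e.g. the "" default) and repeated tokens.
        if tok and tok not in cleaned:
            cleaned.append(tok)

    if not cleaned:
        # Fail before any worker starts, so no row is touched without credentials.
        raise ValueError("No GitHub token configured; set GITHUB_TOKEN or GITHUB_TOKENS")
    return cleaned
```

The call site could then read something like `TokenPool(resolve_github_tokens(cfg.github_tokens, cfg.github_token))`, assuming `TokenPool` accepts a plain list as it does in the diff above.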

Copilot · 2026-03-05T08:59:35Z

online/etl/pipeline/backfill_pr_author.py

+                                # API returned but no user — mark as empty string
+                                # so we don't re-fetch
+                                await db.execute(
+                                    *db._translate_params(
+                                        "UPDATE prs SET pr_author = '' WHERE id = $1",


The “no author” path updates pr_author to an empty string to avoid re-fetching. However _fetch_pr_author returns None for transient errors and bad credentials too (because rest_get returns None in those cases), so this can permanently mask recoverable failures. Prefer leaving pr_author as NULL (or marking status/error) unless you can distinguish a non-retriable condition (e.g., 404).
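
To make the distinction concrete, the following sketch classifies a fetch result into three outcomes. It assumes the caller can see the HTTP status code, which the current `rest_get` reportedly hides by returning None; the `Outcome` enum and `classify` function are hypothetical names.

```python
from enum import Enum
from typing import Any, Mapping, Optional, Tuple


class Outcome(Enum):
    FOUND = "found"          # author present: store it
    NO_AUTHOR = "no_author"  # definitive: PR gone or response has no user
    RETRY = "retry"          # transient or credential problem: leave pr_author NULL


def classify(
    status: int, body: Optional[Mapping[str, Any]]
) -> Tuple[Outcome, Optional[str]]:
    if status == 200 and body is not None:
        user = body.get("user") or {}
        login = user.get("login")
        return (Outcome.FOUND, login) if login else (Outcome.NO_AUTHOR, None)
    if status in (404, 410):
        # Non-retriable: the PR does not exist (any more).
        return Outcome.NO_AUTHOR, None
    # 401/403 (bad credentials, rate limits), 5xx, and anything unexpected
    # are treated as retriable so the row is picked up again later.
    return Outcome.RETRY, None
```

Only `NO_AUTHOR` would justify writing the empty-string placeholder; `RETRY` rows would stay NULL so a later run can fetch them.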

Copilot · 2026-03-05T08:59:35Z

online/etl/pipeline/backfill_pr_author.py

+                        try:
+                            pr_author, raw_data = await _fetch_pr_author(
+                                gh, owner, repo, pr_number
+                            )
+                            if pr_author is not None:


Inside the worker, only RateLimitExhaustedError is handled. Any other exception (DB errors, unexpected response, JSON decode) will bubble out and likely abort the whole batch, while batch_errors remains unused. Add a broader per-item exception handler that increments batch_errors (and optionally marks the row as error) so one bad PR doesn’t stop the run.
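
A per-item handler along these lines could look like the sketch below. `process_batch` and `handle` are hypothetical names, and `RateLimitExhaustedError` is redefined here only as a stand-in for the project's exception so that the example is self-contained.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Iterable, Tuple

logger = logging.getLogger(__name__)


class RateLimitExhaustedError(Exception):
    """Stand-in for the project's exception of the same name."""


async def process_batch(
    items: Iterable[int],
    handle: Callable[[int], Awaitable[None]],
) -> Tuple[int, int]:
    """Run `handle` on every item; return (ok_count, error_count)."""
    batch_ok = 0
    batch_errors = 0

    async def worker(item: int) -> None:
        nonlocal batch_ok, batch_errors
        try:
            await handle(item)
            batch_ok += 1
        except RateLimitExhaustedError:
            raise  # let the caller pause or stop the whole run
        except Exception:
            # Any other failure is counted and logged; the batch keeps going.
            batch_errors += 1
            logger.exception("backfill failed for item %s", item)

    await asyncio.gather(*(worker(i) for i in items))
    return batch_ok, batch_errors
```

Because `asyncio.CancelledError` derives from `BaseException`, the broad `except Exception` does not swallow task cancellation.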

Copilot · 2026-03-05T08:59:36Z

online/etl/pipeline/discover.py

                meta["pr_created_at"] = pr_obj.get("created_at")
+        elif event["type"] == "IssueCommentEvent":
+            issue_obj = payload.get("issue", {})
+            pr_obj = issue_obj.get("pull_request", {})


In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but never used. Either remove it or use it to guard that this IssueCommentEvent is actually tied to a PR (e.g., skip if issue_obj.get('pull_request') is missing).

Suggested change

-            pr_obj = issue_obj.get("pull_request", {})

Copilot · 2026-03-05T08:59:36Z

online/etl/pipeline/assemble.py

                meta["pr_created_at"] = pr_obj.get("created_at")
+        elif event["type"] == "IssueCommentEvent":
+            issue_obj = payload.get("issue", {})
+            pr_obj = issue_obj.get("pull_request", {})


In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but unused. Consider removing it, or use it as an explicit guard that the comment belongs to a PR before copying title/author/created_at from the issue object.

Suggested change

-            pr_obj = issue_obj.get("pull_request", {})
+            pr_obj = issue_obj.get("pull_request")
+            # Only treat IssueCommentEvent as PR-related if the issue has a pull_request field
+            if not pr_obj:
+                continue
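
The same guard can be expressed as a standalone helper, sketched here under assumptions: the function name and the `pr_title`/`pr_author` keys are hypothetical (only `pr_created_at` appears in the diff), and the event is assumed to be a dict shaped like a GitHub event with a `payload.issue` object. GitHub issue objects carry a `pull_request` key only when the issue is a pull request, which is what the guard relies on.

```python
from typing import Any, Dict, Optional


def pr_meta_from_issue_comment(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return PR metadata from an IssueCommentEvent, or None if it is not PR-related."""
    if event.get("type") != "IssueCommentEvent":
        return None
    issue_obj = (event.get("payload") or {}).get("issue") or {}
    # Comments on plain issues have no "pull_request" key; skip those.
    if not issue_obj.get("pull_request"):
        return None
    return {
        "pr_title": issue_obj.get("title"),
        "pr_author": (issue_obj.get("user") or {}).get("login"),
        "pr_created_at": issue_obj.get("created_at"),
    }
```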

Copilot · 2026-03-05T08:59:36Z

online/etl/main.py

+    p_bfa = sub.add_parser("backfill-pr-author", help="Backfill pr_author from GitHub API")
+    p_bfa.add_argument("--database-url")
+    p_bfa.add_argument("--batch-size", type=int, default=5000)
+    p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")


The backfill-pr-author CLI doesn’t accept --github-token/--github-tokens, and the command implementation doesn’t validate that any token is configured. This makes the command hard to use and risks running unauthenticated. Consider adding token flags (matching enrich) and failing fast if no usable token(s) are present.

Suggested change

-    p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
+    p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
+    p_bfa.add_argument(
+        "--github-token",
+        help="GitHub token to use for API requests (single token; may be specified multiple times)",
+        action="append",
+        dest="github_tokens",
+    )
+    p_bfa.add_argument(
+        "--github-tokens",
+        help="Comma-separated list of GitHub tokens to use for API requests",
+    )
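
One detail worth checking before adopting the suggestion: argparse derives the destination of `--github-tokens` from its option string, so both flags would write to `args.github_tokens`, one as an appended list and the other as a plain string. The sketch below avoids that by giving the repeatable flag its own destination and merging afterwards. The function name `parse_tokens` and the `github_token_list` destination are hypothetical, and the flags used by `enrich` may be defined differently.

```python
import argparse
from typing import List, Optional, Sequence


def parse_tokens(argv: Optional[Sequence[str]] = None) -> List[str]:
    parser = argparse.ArgumentParser(prog="backfill-pr-author")
    parser.add_argument(
        "--github-token",
        action="append",
        dest="github_token_list",  # separate dest to avoid clashing with --github-tokens
        default=[],
        help="GitHub token (may be specified multiple times)",
    )
    parser.add_argument(
        "--github-tokens",
        default="",
        help="Comma-separated list of GitHub tokens",
    )
    args = parser.parse_args(argv)

    # Repeated flags first, then the comma-separated list; blanks are dropped.
    tokens = list(args.github_token_list)
    tokens += [t.strip() for t in args.github_tokens.split(",")]
    tokens = [t for t in tokens if t.strip()]
    if not tokens:
        parser.error("no GitHub token given; pass --github-token or --github-tokens")
    return tokens
```

`parser.error` prints a usage message and exits with status 2, which gives the fail-fast behavior the comment asks for.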

zverianskii added 2 commits March 3, 2026 13:00

add qodo new version and devin

03c6c6a

add filter for self authored prs

6758d56

Copilot AI review requested due to automatic review settings March 5, 2026 08:54

Copilot started reviewing on behalf of zverianskii March 5, 2026 08:54

zverianskii wants to merge 2 commits into main from improve-data-quality
