Conversation
Pull request overview
This PR improves dataset quality by (1) extracting/mending missing PR author metadata and (2) adding “quality” filtering (self-authored PRs, non-empty reviews) to both the Streamlit dashboard and the Rust API service.
Changes:
- Extend PR metadata extraction to derive title/author/created-at from `IssueCommentEvent` payloads when needed.
- Add an async backfill job + CLI command to populate `prs.pr_author` from the GitHub REST API and persist raw PR API responses.
- Add “Exclude self-authored” + “Require reviews” filters end-to-end (DB query → API snapshot → UI controls).
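The quality filtering described above amounts to a simple predicate per record. A hypothetical Python sketch (the real enforcement lives in `compute.rs`'s `record_matches`; the field names `self_authored` and `has_reviews` are taken from this PR, the rest is assumed):

```python
def record_matches(record, exclude_self_authored=False, require_reviews=False):
    """Quality-filter predicate mirroring the PR's record_matches logic (sketch)."""
    if exclude_self_authored and record.get("self_authored"):
        return False  # drop PRs reviewed by their own author
    if require_reviews and not record.get("has_reviews"):
        return False  # drop PRs with no review activity
    return True
```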
Reviewed changes
Copilot reviewed 15 out of 21 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| online/etl/pipeline/discover.py | Populate PR metadata from IssueCommentEvent payloads. |
| online/etl/pipeline/assemble.py | Same metadata fallback during assembly. |
| online/etl/pipeline/backfill_pr_author.py | New async backfill worker pool to fetch pr_author and store pr_api_raw. |
| online/etl/main.py | Wire new backfill-pr-author subcommand. |
| online/etl/db/schema.py | Add pr_api_raw column migration. |
| online/etl/dashboard/data.py | Include pr_author + has_reviews in analyses query. |
| online/etl/dashboard/app.py | Add Streamlit quality filter controls and apply them. |
| online/etl/config.py | Add an additional default chatbot username. |
| online/api_service/static/index.html | Add quality filter checkboxes and propagate query params. |
| online/api_service/src/model.rs | Add self_authored/has_reviews fields + filter params. |
| online/api_service/src/handlers.rs | Parse new query params into FilterParams. |
| online/api_service/src/db.rs | Load pr_author + has_reviews; compute self_authored. |
| online/api_service/src/compute.rs | Enforce new filters in record_matches. |
| offline/results/openai_gpt-5.2/candidates.json | Update offline results data (new model key/entries). |
| offline/results/anthropic_claude-opus-4-5-20251101/candidates.json | Update offline results data (new model key/entries). |
| .gitignore | Ignore additional local/editor files. |
```python
cfg = DBConfig(verbose=args.verbose)
if args.database_url:
    cfg.database_url = args.database_url

db = DBAdapter(cfg.database_url)
```
`cmd_backfill_pr_author` creates `DBConfig` but never builds/validates the GitHub token list (unlike `cmd_enrich`). If env vars are missing, it can run with an empty token and then write incorrect placeholders. Reuse the token-selection and validation logic from `cmd_enrich` before calling the backfill.
```python
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
pool = TokenPool(tokens)
n_tokens = pool.size
n_workers = n_tokens * 10
logger.info(f"Using {n_tokens} GitHub token(s), {n_workers} workers")
```
`tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]` will produce `[""]` when neither `GITHUB_TOKENS` nor `GITHUB_TOKEN` is set (`DBConfig` defaults to an empty string). That leads to unauthenticated API calls and, with the current logic, can mass-update rows incorrectly. Validate tokens here (non-empty strings) and fail fast if none are configured.
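One way to fail fast, as a sketch (the `Cfg` stand-in and its field names mirror `DBConfig` as described in this review; the actual attributes may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Cfg:
    # Stand-in for DBConfig's token fields (names assumed from the review)
    github_tokens: list = field(default_factory=list)
    github_token: str = ""

def select_tokens(cfg):
    """Return only non-empty tokens; raise if none are configured."""
    candidates = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
    tokens = [t for t in candidates if t and t.strip()]
    if not tokens:
        raise ValueError("No GitHub token configured: set GITHUB_TOKENS or GITHUB_TOKEN")
    return tokens
```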
```python
# API returned but no user — mark as empty string
# so we don't re-fetch
await db.execute(
    *db._translate_params(
        "UPDATE prs SET pr_author = '' WHERE id = $1",
```
The “no author” path updates `pr_author` to an empty string to avoid re-fetching. However, `_fetch_pr_author` also returns `None` for transient errors and bad credentials (because `rest_get` returns `None` in those cases), so this can permanently mask recoverable failures. Prefer leaving `pr_author` as `NULL` (or marking a status/error) unless you can distinguish a non-retriable condition (e.g., 404).
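A sketch of that distinction (the status codes and three-way return are assumptions; `rest_get`'s real error reporting may differ):

```python
def author_value(status, user_login):
    """Decide what to persist in prs.pr_author for one fetch result (sketch)."""
    if status == 200:
        return user_login or ""  # definitive: "" means the PR truly has no user
    if status == 404:
        return ""                # PR is gone: safe to mark so we never re-fetch
    return None                  # transient/auth failure: keep NULL so a later run retries
```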
```python
try:
    pr_author, raw_data = await _fetch_pr_author(
        gh, owner, repo, pr_number
    )
    if pr_author is not None:
```
Inside the worker, only `RateLimitExhaustedError` is handled. Any other exception (DB errors, unexpected responses, JSON decode) will bubble out and likely abort the whole batch, while `batch_errors` remains unused. Add a broader per-item exception handler that increments `batch_errors` (and optionally marks the row as an error) so one bad PR doesn't stop the run.
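A per-item guard along these lines would do (a sketch; `RateLimitExhaustedError` and `batch_errors` are names from this review, and the real worker's signature will differ):

```python
import asyncio

class RateLimitExhaustedError(Exception):
    """Stand-in for the ETL's rate-limit error (name taken from the review)."""

async def process_item(handle, item, batch_errors):
    """Run one PR through `handle`, recording non-rate-limit failures instead of raising."""
    try:
        await handle(item)
    except RateLimitExhaustedError:
        raise  # still propagate: the whole batch must back off
    except Exception as exc:
        batch_errors.append((item, repr(exc)))  # one bad PR no longer aborts the run
```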
```python
meta["pr_created_at"] = pr_obj.get("created_at")
elif event["type"] == "IssueCommentEvent":
    issue_obj = payload.get("issue", {})
    pr_obj = issue_obj.get("pull_request", {})
```
In the `IssueCommentEvent` branch, `pr_obj = issue_obj.get('pull_request', {})` is assigned but never used. Either remove it or use it to guard that this `IssueCommentEvent` is actually tied to a PR (e.g., skip if `issue_obj.get('pull_request')` is missing).
Suggested change:

```python
pr_obj = issue_obj.get("pull_request")
# Only treat IssueCommentEvent as PR-related if the issue has a pull_request field
if not pr_obj:
    continue
```
```python
p_bfa = sub.add_parser("backfill-pr-author", help="Backfill pr_author from GitHub API")
p_bfa.add_argument("--database-url")
p_bfa.add_argument("--batch-size", type=int, default=5000)
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
```
The `backfill-pr-author` CLI doesn't accept `--github-token`/`--github-tokens`, and the command implementation doesn't validate that any token is configured. This makes the command hard to use and risks running unauthenticated. Consider adding token flags (matching `enrich`) and failing fast if no usable token(s) are present.
Suggested change:

```python
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
p_bfa.add_argument(
    "--github-token",
    help="GitHub token to use for API requests (single token; may be specified multiple times)",
    action="append",
    dest="github_tokens",
)
p_bfa.add_argument(
    "--github-tokens",
    help="Comma-separated list of GitHub tokens to use for API requests",
)
```
No description provided.