Skip to content

Improve data quality#15

Open
zverianskii wants to merge 2 commits intomainfrom
improve-data-quality
Open

Improve data quality#15
zverianskii wants to merge 2 commits intomainfrom
improve-data-quality

Conversation

@zverianskii
Copy link
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings March 5, 2026 08:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves dataset quality by (1) extracting/mending missing PR author metadata and (2) adding “quality” filtering (self-authored PRs, non-empty reviews) to both the Streamlit dashboard and the Rust API service.

Changes:

  • Extend PR metadata extraction to derive title/author/created-at from IssueCommentEvent payloads when needed.
  • Add an async backfill job + CLI command to populate prs.pr_author from the GitHub REST API and persist raw PR API responses.
  • Add “Exclude self-authored” + “Require reviews” filters end-to-end (DB query → API snapshot → UI controls).

Reviewed changes

Copilot reviewed 15 out of 21 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
online/etl/pipeline/discover.py Populate PR metadata from IssueCommentEvent payloads.
online/etl/pipeline/assemble.py Same metadata fallback during assembly.
online/etl/pipeline/backfill_pr_author.py New async backfill worker pool to fetch pr_author and store pr_api_raw.
online/etl/main.py Wire new backfill-pr-author subcommand.
online/etl/db/schema.py Add pr_api_raw column migration.
online/etl/dashboard/data.py Include pr_author + has_reviews in analyses query.
online/etl/dashboard/app.py Add Streamlit quality filter controls and apply them.
online/etl/config.py Add an additional default chatbot username.
online/api_service/static/index.html Add quality filter checkboxes and propagate query params.
online/api_service/src/model.rs Add self_authored/has_reviews fields + filter params.
online/api_service/src/handlers.rs Parse new query params into FilterParams.
online/api_service/src/db.rs Load pr_author + has_reviews; compute self_authored.
online/api_service/src/compute.rs Enforce new filters in record_matches.
offline/results/openai_gpt-5.2/candidates.json Update offline results data (new model key/entries).
offline/results/anthropic_claude-opus-4-5-20251101/candidates.json Update offline results data (new model key/entries).
.gitignore Ignore additional local/editor files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +556 to +560
cfg = DBConfig(verbose=args.verbose)
if args.database_url:
cfg.database_url = args.database_url

db = DBAdapter(cfg.database_url)
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmd_backfill_pr_author creates DBConfig but never builds/validates the GitHub token list (unlike cmd_enrich). If env vars are missing, this can run with an empty token and then write incorrect placeholders. Reuse the token-selection + validation logic from cmd_enrich before calling the backfill.

Copilot uses AI. Check for mistakes.
Comment on lines +53 to +57
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
pool = TokenPool(tokens)
n_tokens = pool.size
n_workers = n_tokens * 10
logger.info(f"Using {n_tokens} GitHub token(s), {n_workers} workers")
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token] will produce [""] when neither GITHUB_TOKENS nor GITHUB_TOKEN is set (DBConfig defaults to empty string). That leads to unauthenticated API calls and, with the current logic, can mass-update rows incorrectly. Validate tokens here (non-empty strings) and fail fast if none are configured.

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +145
# API returned but no user — mark as empty string
# so we don't re-fetch
await db.execute(
*db._translate_params(
"UPDATE prs SET pr_author = '' WHERE id = $1",
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “no author” path updates pr_author to an empty string to avoid re-fetching. However _fetch_pr_author returns None for transient errors and bad credentials too (because rest_get returns None in those cases), so this can permanently mask recoverable failures. Prefer leaving pr_author as NULL (or marking status/error) unless you can distinguish a non-retriable condition (e.g., 404).

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +130
try:
pr_author, raw_data = await _fetch_pr_author(
gh, owner, repo, pr_number
)
if pr_author is not None:
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the worker, only RateLimitExhaustedError is handled. Any other exception (DB errors, unexpected response, JSON decode) will bubble out and likely abort the whole batch, while batch_errors remains unused. Add a broader per-item exception handler that increments batch_errors (and optionally marks the row as error) so one bad PR doesn’t stop the run.

Copilot uses AI. Check for mistakes.
meta["pr_created_at"] = pr_obj.get("created_at")
elif event["type"] == "IssueCommentEvent":
issue_obj = payload.get("issue", {})
pr_obj = issue_obj.get("pull_request", {})
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but never used. Either remove it or use it to guard that this IssueCommentEvent is actually tied to a PR (e.g., skip if issue_obj.get('pull_request') is missing).

Suggested change
pr_obj = issue_obj.get("pull_request", {})

Copilot uses AI. Check for mistakes.
meta["pr_created_at"] = pr_obj.get("created_at")
elif event["type"] == "IssueCommentEvent":
issue_obj = payload.get("issue", {})
pr_obj = issue_obj.get("pull_request", {})
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but unused. Consider removing it, or use it as an explicit guard that the comment belongs to a PR before copying title/author/created_at from the issue object.

Suggested change
pr_obj = issue_obj.get("pull_request", {})
pr_obj = issue_obj.get("pull_request")
# Only treat IssueCommentEvent as PR-related if the issue has a pull_request field
if not pr_obj:
continue

Copilot uses AI. Check for mistakes.
p_bfa = sub.add_parser("backfill-pr-author", help="Backfill pr_author from GitHub API")
p_bfa.add_argument("--database-url")
p_bfa.add_argument("--batch-size", type=int, default=5000)
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backfill-pr-author CLI doesn’t accept --github-token/--github-tokens, and the command implementation doesn’t validate that any token is configured. This makes the command hard to use and risks running unauthenticated. Consider adding token flags (matching enrich) and failing fast if no usable token(s) are present.

Suggested change
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
p_bfa.add_argument(
"--github-token",
help="GitHub token to use for API requests (single token; may be specified multiple times)",
action="append",
dest="github_tokens",
)
p_bfa.add_argument(
"--github-tokens",
help="Comma-separated list of GitHub tokens to use for API requests",
)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants