Purpose: Prepared answers for demo/handover call questions.
Question: What's the expected message volume? Do we need persistence guarantees beyond Redis?
Answer: Redis Streams was chosen for the POC because:
- Simplicity: Single dependency for both storage and message queue
- Low latency: In-memory with optional persistence (AOF enabled in docker-compose.yaml)
- Consumer groups: Built-in support for competing consumers (XREADGROUP)
- Lightweight: No separate cluster management (vs. Kafka/RabbitMQ)
Expected volume: Low-to-medium (PR events, not high-frequency). For a typical org, expect 10-100 PRs/day.
Production considerations:
- If volume exceeds 1000s of messages/sec or need cross-region replication → consider Kafka
- If need guaranteed delivery with complex routing → consider RabbitMQ
- Current setup has AOF persistence enabled (redis-server --appendonly yes)
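For reference, a minimal sketch of the produce/consume pattern described above using redis-py (stream, group, and consumer names are illustrative):
```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Producer (API side): append a PR event to a worker stream
r.xadd("worker-kics", {"data": '{"id": "1", "repo": "example"}'})

# Consumer (worker side): create the group once, then read as a competing consumer
try:
    r.xgroup_create("worker-kics", "workers", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

entries = r.xreadgroup("workers", "consumer-1", {"worker-kics": ">"}, count=10, block=5000)
for stream_name, messages in entries:
    for message_id, fields in messages:
        # ... process fields ...
        r.xack("worker-kics", "workers", message_id)
```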
Question: How do we add new scanner types? Is there shared code we should extract?
Answer: Rationale for separate workers:
- Isolation: Different dependencies (KICS needs Go binary, Blackduck needs JRE + 200MB JAR)
- Independent scaling: Blackduck scans take longer → scale differently than KICS
- Failure isolation: One scanner crashing doesn't affect others
- Deployment flexibility: Update KICS without touching Blackduck
Adding a new scanner:
- Create services/workers/<scanner-name>/ directory
- Copy worker template (app.py, model.py, Dockerfile)
- Add infra Dockerfile if special dependencies are needed
- Add stream name to API fan-out list
- Add service to docker-compose.yaml
Shared code to extract:
- model.py (MessagePayload, StoragePayload) - duplicated across workers
- stream_listener() pattern - nearly identical in blackduck/kics
- send_callback() function - identical
Consider creating a services/workers/common/ package.
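A sketch of one piece such a package could hold, based on the duplicated send_callback() noted above (the CALLBACK_URL env var and its default are assumptions, not the actual config):
```python
# services/workers/common/callback.py (hypothetical shared module)
import os

import requests


def send_callback(payload: dict) -> None:
    """POST scan results back to the API callback endpoint."""
    url = os.getenv("CALLBACK_URL", "http://api:8080/callback")  # assumed env var
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()  # surface callback failures to the worker
```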
Question: What if we want selective scanning (only KICS for IaC repos)?
Answer:
Current design: API pushes to multiple streams (worker-1, worker-kics), each worker reads its own stream.
Why this approach:
- Explicit control: API decides which scanners run
- Visibility: Easy to see which streams received messages
- Simple workers: Workers don't need filtering logic
For selective scanning (future):
- API can check repo contents/labels before fan-out
- Example: Only push to the worker-kics stream if the repo contains .tf, .yaml, or Dockerfile files (sketched below)
- Could add a scan_types field to the webhook config per repo
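A minimal sketch of that fan-out check (the marker list, stream names, and helper are illustrative, not existing code):
```python
# Hypothetical pre-fan-out filter in the API
IAC_MARKERS = (".tf", ".yaml", ".yml", "Dockerfile")


def streams_for_pr(changed_files: list[str]) -> list[str]:
    """Decide which worker streams should receive this PR event."""
    streams = ["worker-1"]  # e.g. Blackduck runs on every PR
    if any(path.endswith(IAC_MARKERS) for path in changed_files):
        streams.append("worker-kics")  # only IaC-relevant changes hit KICS
    return streams
```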
Alternative (single stream):
- All workers read from one stream, filter locally
- Pro: Simpler API
- Con: All workers receive all messages, wasted processing
Question: Are messages re-delivered? Is there a visibility timeout?
Answer: Current behavior:
- Workers use Redis consumer groups with XREADGROUP
- Messages are tracked per-consumer until XACK is called
- If a worker crashes before XACK, the message stays in the Pending Entries List (PEL)
Recovery mechanism:
- Currently NOT implemented - pending messages are orphaned
- Need to add: XAUTOCLAIM or XPENDING + XCLAIM logic to reclaim old messages
Recommended fix:
```python
# Add to worker startup: claim messages pending for more than 60 seconds
messages = Redis.xautoclaim(STREAM_NAME, CONSUMER_GROUP, CONSUMER_NAME,
                            min_idle_time=60000, start_id="0-0", count=10)
```
Current safeguard: Messages are deleted after ACK (XDEL) to prevent reprocessing.
Question: Can we run multiple instances of the same worker? How does consumer group balancing work?
Answer: Yes, multiple instances are supported. Redis consumer groups handle this automatically.
How it works:
- All workers in the same CONSUMER_GROUP (default: "workers") share the stream
- Each message is delivered to only ONE consumer in the group
- Redis round-robins between available consumers
To scale:
```yaml
# docker-compose.yaml
worker-1:
  deploy:
    replicas: 3
```
Or in Kubernetes:
```yaml
spec:
  replicas: 3
```
Important: Each instance needs a unique CONSUMER_NAME (currently uses APP_NAME; would need APP_NAME-{instance_id}).
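A minimal sketch of deriving a unique per-replica consumer name (using the container hostname is an assumption; Docker and Kubernetes assign each replica a unique hostname):
```python
import os
import socket

# Append the container hostname so each replica registers as a distinct
# consumer within the shared consumer group
CONSUMER_NAME = f"{os.getenv('APP_NAME', 'worker')}-{socket.gethostname()}"
```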
Question: Old messages never expire currently - is there a cleanup job?
Answer:
Current state: No TTL configured. Messages deleted immediately after processing (XDEL).
Storage keys (storage:{id}): No expiration - will accumulate forever.
Recommended fixes:
- Add TTL to storage keys:
  ```python
  Redis.setex(f"storage:{id}", 86400, json.dumps(storage))  # 24hr TTL
  ```
- Add stream max length:
  ```python
  Redis.xadd(stream, {"data": msg}, maxlen=10000)  # Keep last 10k entries
  ```
- Or use XTRIM in a cleanup job:
  ```bash
  redis-cli XTRIM worker-1 MAXLEN ~ 1000
  ```
Question: What metrics should we track? Where are the dashboards?
Answer: Currently available:
- /status endpoint on API and each worker (health + Redis connectivity)
- Basic logging to stdout (Docker logs)
Recommended metrics to add:
| Metric | Source | Purpose |
|---|---|---|
| webhook_events_received | API | Ingest rate |
| messages_published{stream} | API | Fan-out health |
| messages_processed{worker} | Workers | Throughput |
| processing_duration_seconds | Workers | Scan performance |
| callback_success/failure | Workers | Integration health |
| redis_stream_length{stream} | Redis | Backlog monitoring |
| redis_pending_messages{stream,consumer} | Redis | Stuck messages |
Integration options:
- Prometheus + Grafana (add a /metrics endpoint with prometheus-fastapi-instrumentator)
- Datadog (add ddtrace instrumentation)
- CloudWatch (if on AWS)
No dashboards exist currently.
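A minimal sketch of the Prometheus option for the API service (this wires up the library's default HTTP metrics only; the custom counters in the table above would need to be registered separately):
```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Instrument all routes and expose a /metrics endpoint for Prometheus to scrape
Instrumentator().instrument(app).expose(app)
```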
Question: Base64 in env vars isn't production-ready - what's the plan? Who rotates the GitHub App private key?
Answer: Current state:
- GITHUB_APP_PRIVATE_KEY: Base64-encoded PEM in environment variable
- GITHUB_WEBHOOK_SECRET: Plain text in environment variable
- Loaded via shell: export GITHUB_APP_PRIVATE_KEY=$(base64 -i private-key.pem)
Why Base64: Avoids newline issues in env vars (PEM files have multiple lines).
Production recommendations:
| Option | Complexity | Notes |
|---|---|---|
| Kubernetes Secrets | Low | Mount as files, not env vars |
| HashiCorp Vault | Medium | Dynamic secrets, audit logging |
| AWS Secrets Manager | Medium | Native if on AWS/EKS |
| Azure Key Vault | Medium | Native if on Azure/AKS |
Key rotation:
- GitHub App private keys can be rotated in GitHub App settings
- Generate new key → update secret store → restart services
- No automated rotation implemented
Code change needed: Support reading key from file path, not just env var.
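A sketch of that change (GITHUB_APP_PRIVATE_KEY_PATH is a hypothetical new variable, not part of the current config):
```python
import base64
import os


def load_private_key() -> str:
    """Read the GitHub App key from a file path if set, else the Base64 env var."""
    key_path = os.getenv("GITHUB_APP_PRIVATE_KEY_PATH")
    if key_path:
        with open(key_path) as f:  # e.g. a mounted Kubernetes Secret
            return f.read()
    return base64.b64decode(os.environ["GITHUB_APP_PRIVATE_KEY"]).decode()
```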
Question: Can anyone POST to /callback? Should workers use a shared secret?
Answer:
Current state: NO AUTHENTICATION. Anyone can POST to /callback.
```python
# services/api/app.py:86
@app.post("/callback")
def callback(payload: dict):  # No auth check
    ...
```
Risk: An attacker could inject fake scan results.
Recommended fixes:
- Shared secret (simple):
  ```python
  from fastapi import Header, HTTPException

  CALLBACK_SECRET = os.getenv("CALLBACK_SECRET")

  @app.post("/callback")
  def callback(payload: dict, x_callback_secret: str = Header(...)):
      if x_callback_secret != CALLBACK_SECRET:
          raise HTTPException(401, "Invalid callback secret")
  ```
- Network isolation (defense in depth):
  - Keep the callback endpoint internal-only (not exposed via ingress)
  - Currently OK in the Docker network, but verify in production
- Message signing (see the sketch after this list):
  - Workers sign the payload with HMAC
  - API verifies the signature
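A minimal sketch of the HMAC option, assuming a shared secret distributed via the secret store (function names are illustrative):
```python
import hashlib
import hmac
import json

SECRET = b"shared-secret"  # load from the secret store, never hard-code

def sign(payload: dict) -> str:
    """Worker side: HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def verify(payload: dict, signature: str) -> bool:
    """API side: constant-time comparison against the recomputed signature."""
    return hmac.compare_digest(sign(payload), signature)
```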
Question: Currently no auth - what's the production config?
Answer: Current state: NO AUTHENTICATION.
```yaml
# docker-compose.yaml
redis:
  image: "redis:8"
  command: redis-server --appendonly yes  # No --requirepass
```
Risk: Any container on the network can read/write Redis data.
Recommended fixes:
- Add a password:
  ```yaml
  redis:
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
  ```
- Update clients:
  ```python
  Redis = redis.Redis(
      host=os.getenv("REDIS_HOST"),
      port=int(os.getenv("REDIS_PORT", 6379)),
      password=os.getenv("REDIS_PASSWORD"),  # Add this
      decode_responses=True,
  )
  ```
- Production: Use managed Redis (ElastiCache, Azure Cache, etc.) with:
- TLS encryption
- IAM authentication (AWS) or Azure AD
- VPC/private networking
Question: Clone the repo? Use GitHub API for file contents? Full repo scan vs. changed files only? Where do scan results get stored?
Answer:
Current state: Workers receive PR metadata but scanning is stubbed (time.sleep(5)).
Recommended implementation:
| Scanner | Approach | Notes |
|---|---|---|
| KICS | Clone repo → run kics scan | Needs full repo for context |
| Blackduck | Clone repo → run Detect JAR | Full dependency analysis |
Clone approach (recommended):
```python
from git import Repo

def clone_repo(owner: str, name: str, branch: str, token: str) -> str:
    url = f"https://x-access-token:{token}@github.com/{owner}/{name}.git"
    path = f"/tmp/repos/{owner}-{name}-{branch}"
    Repo.clone_from(url, path, branch=branch, depth=1)  # shallow clone
    return path
```
GitHub API approach (alternative):
- Use GET /repos/{owner}/{repo}/pulls/{pull_number}/files for changed-files-only scanning
- Less resource-intensive but limited context
Results storage:
- Short-term: Redis (scan_result:{id}) with TTL
- Long-term: PostgreSQL or S3 for audit/compliance
- Currently: Only logged, not persisted
Question: PR comments? Check runs? Status checks? What format should results be in?
Answer: Options:
| Method | Visibility | Best For |
|---|---|---|
| PR Comment | In PR thread | Human-readable summaries |
| Check Run | Checks tab, PR status | Detailed results with annotations |
| Status Check | PR status badge | Simple pass/fail |
Recommended: Check Runs (best UX)
```python
# In callback handler, after receiving results
from githubapp import GitHubApp
async def post_check_run(installation_id, owner, repo, head_sha, results):
gh = await github_app.get_installation_client(installation_id)
await gh.post(f"/repos/{owner}/{repo}/check-runs", data={
"name": "KICS Security Scan",
"head_sha": head_sha,
"status": "completed",
"conclusion": "success" if results.passed else "failure",
"output": {
"title": "KICS Scan Results",
"summary": f"Found {results.issue_count} issues",
"annotations": [...] # Line-level findings
}
    })
```
Required GitHub App permissions:
- checks: write
- pull_requests: read
Question: Blackduck can take 10+ minutes - how does that affect the flow?
Answer: Expected durations:
| Scanner | Typical Duration | Factors |
|---|---|---|
| KICS | 30s - 2min | Repo size, file count |
| Blackduck | 5 - 15min | Dependencies, network speed |
Current handling:
- Workers block during scan (synchronous)
- block=5000 in XREADGROUP - 5-second timeout waiting for new messages
- No overall timeout on scan execution
Concerns:
- Long scans block the worker thread
- No way to cancel in-progress scans
- User sees no progress on PR
Recommendations:
- Add a scan timeout:
  ```python
  import signal

  signal.alarm(600)  # 10-minute timeout (SIGALRM interrupts the scan)
  ```
- Post an "in progress" check run immediately; update it when complete
- Consider async scanning (see the sketch after this list):
- Worker spawns subprocess
- Polls for completion
- Can handle multiple scans concurrently
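A minimal sketch of the subprocess-with-polling approach (the command and helper are illustrative, not existing code):
```python
import subprocess
import time

def run_scan(cmd: list[str], timeout_s: int = 600) -> int:
    """Spawn a scanner CLI and poll for completion instead of blocking on it.

    cmd is hypothetical, e.g. ["kics", "scan", "-p", repo_path].
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while proc.poll() is None:  # still running
        if time.monotonic() > deadline:
            proc.kill()
            raise TimeoutError(f"Scan exceeded {timeout_s}s")
        time.sleep(5)  # worker stays free to do other work between polls
    return proc.returncode
```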
Question: Kubernetes? ECS? Cloud Run?
Answer: Current state: Docker Compose only. No production deployment configs.
Recommended path:
| Platform | Effort | Best For |
|---|---|---|
| Kubernetes | Medium | Existing K8s clusters, complex scaling |
| ECS Fargate | Low | AWS shops, simpler ops |
| Cloud Run | Low | GCP shops, auto-scaling |
What's needed:
- Kubernetes manifests or Helm chart
- Health/readiness probes (already have /status)
- Resource limits (CPU/memory)
- Horizontal Pod Autoscaler configs
- Ingress/load balancer for API
- Managed Redis (ElastiCache/Memorystore)
Smee is not needed in production; GitHub webhooks hit the API directly via a public endpoint.
Question: Is there a circuit breaker? Do we need Redis clustering/sentinel?
Answer: Current behavior:
- API: /status returns redis: "error", but the webhook handler would crash
- Workers: Catch redis.ConnectionError, sleep 5s, retry forever
```python
# services/workers/blackduck/app.py:101
except redis.ConnectionError as e:
    logger.error(f"[{APP_NAME}] Redis connection error: {e}")
    time.sleep(5)  # Retry loop
```
No circuit breaker is implemented.
Recommendations:
- Circuit breaker (e.g., pybreaker):
  ```python
  from pybreaker import CircuitBreaker

  redis_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

  @redis_breaker
  def publish_message(stream, data):
      Redis.xadd(stream, data)
  ```
- Redis high availability:
- Development: Single node is fine
- Production: Redis Sentinel or Redis Cluster
- Managed: ElastiCache Multi-AZ, Azure Cache Premium
- Graceful degradation:
  - API could queue webhooks locally if Redis is down (with risk of loss)
  - Return 503 to GitHub (it will retry)
Question: The decorators exist but aren't implemented.
Answer: Current state: Decorator exists but is a passthrough.
```python
# services/api/app.py:106
@with_rate_limit_handling(github_app)
def handle_pr():
    ...
```
The @with_rate_limit_handling decorator comes from the fastapi-githubapp library and should handle rate limits automatically by:
- Checking the X-RateLimit-Remaining header
- Sleeping until X-RateLimit-Reset if exhausted
Rate limits:
| Resource | Limit | Reset |
|---|---|---|
| GitHub App installation | 5,000/hr | Rolling |
| Search API | 30/min | Rolling |
Recommendations:
- Verify decorator actually works (test with high volume)
- Add monitoring for rate limit headers (see the sketch after this list)
- For posting results, batch if possible
- Consider caching GitHub API responses (repo metadata doesn't change often)
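A minimal sketch of checking the headers, assuming a valid token in a GITHUB_TOKEN env var (the /rate_limit endpoint does not count against the quota):
```python
import os

import requests

resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
remaining = int(resp.headers["X-RateLimit-Remaining"])
reset_epoch = int(resp.headers["X-RateLimit-Reset"])  # Unix time when the quota resets
print(f"{remaining} requests remaining; resets at {reset_epoch}")
```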
Question: Are the fixture scripts sufficient? Is there a mock GitHub webhook generator?
Answer: Current testing tools:
- Fixture files:
  - scripts/fixtures/pr-opened.json - sample webhook payload
  - scripts/post-pr-opened.sh - sends the fixture to the API
- Manual fan-out:
  ```bash
  curl -X POST http://localhost:8080/fanout
  ```
  This bypasses GitHub entirely and tests Redis → Workers → Callback.
- Smee.io for real webhooks:
  - Create a test repo with the GitHub App installed
  - Open a PR → webhook flows through Smee → local API
Recommended additions:
- Webhook simulator:
  ```python
  # tests/conftest.py
  import pytest

  @pytest.fixture
  def mock_webhook():
      return {
          "action": "opened",
          "pull_request": {"number": 1, "head": {"sha": "abc123"}},
          "repository": {"name": "test", "owner": {"login": "testorg"}},
      }
  ```
- Integration test with testcontainers:
  ```python
  from testcontainers.redis import RedisContainer

  def test_full_flow():
      with RedisContainer() as redis_container:
          ...  # Start API and workers, send a webhook, verify the callback
  ```
Question: pytest is configured but no tests exist.
Answer: Current state:
- pytest.ini exists
- pytest-cov, pytest-mock, pytest-asyncio are in dev dependencies
- Zero tests implemented
Recommended coverage targets:
| Component | Priority | Target |
|---|---|---|
| utils.py (base64, datetime) | High | 100% |
| model.py (data classes) | High | 100% |
| Webhook signature validation | High | 100% |
| Stream consumer logic | Medium | 80% |
| Callback handling | Medium | 80% |
| Integration (API → Redis → Worker) | Medium | Key paths |
Quick wins:
```python
# tests/test_utils.py
import base64

from utils import decode_base64_key  # project module

def test_decode_base64_key():
    encoded = base64.b64encode(b"test-key").decode()
    assert decode_base64_key(encoded) == "test-key"


# tests/test_model.py
from model import StoragePayload  # project module

def test_storage_payload_from_json():
    json_str = '{"id": "1", "name": "repo", "owner": "org", "branch": "main"}'
    payload = StoragePayload.from_json(json_str)
    assert payload.name == "repo"
```
Question: Redis only? Need a database? Audit/compliance requirements?
Answer: Current state:
- PR metadata: Redis (storage:{id}) - no TTL
- Scan results: Logged only, not persisted
- No audit trail
Recommendations:
| Data | Short-term | Long-term |
|---|---|---|
| PR metadata | Redis (24hr TTL) | PostgreSQL or skip |
| Scan results | Redis (7 days) | PostgreSQL + S3 |
| Audit log | - | PostgreSQL (immutable) |
For compliance (SOC2, etc.):
- Who triggered scan
- What was scanned (commit SHA)
- When scan ran
- What was found
- What action was taken
Schema suggestion:
```sql
CREATE TABLE scan_results (
id UUID PRIMARY KEY,
pr_id INT,
repo_owner TEXT,
repo_name TEXT,
commit_sha TEXT,
scanner TEXT,
started_at TIMESTAMP,
completed_at TIMESTAMP,
status TEXT,
findings JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
```
Question: GDPR/data retention policies?
Answer: Current state: Retained forever (no TTL on Redis keys).
What's stored:
- storage:{id}: repo name, owner, branch, PR ID
- No PII directly (usernames are GitHub handles, not personal data)
Recommendations:
| Data Type | Retention | Justification |
|---|---|---|
| PR metadata | 30 days | Operational debugging |
| Scan results | 1 year | Compliance, trend analysis |
| Audit logs | 7 years | Regulatory (SOC2, etc.) |
Implementation:
```python
# Add a TTL when storing
Redis.setex(f"storage:{id}", 30 * 86400, json.dumps(storage))  # 30 days
```
GDPR considerations:
- GitHub usernames may be considered pseudonymous data
- Ensure data deletion process exists if requested
- Document data flows in privacy policy
These are most likely to come up first - ensure answers are prepared:
| # | Topic | Why It's Critical |
|---|---|---|
| 1 | Redis vs alternatives | Architecture justification |
| 8 | Secrets management | Security concern |
| 9 | Callback authentication | Security concern |
| 11 | Scanner implementation | Core functionality gap |
| 12 | GitHub integration | Core functionality gap |
| 14 | Deployment platform | Ops planning |
Last updated: February 2026