Q&A - Anticipated Engineer Questions

Purpose: Prepared answers for demo/handover call questions.


Architecture & Design Decisions

1. Why Redis Streams over Kafka/RabbitMQ/SQS?

Question: What's the expected message volume? Do we need persistence guarantees beyond Redis?

Answer: Redis Streams was chosen for the POC because:

  • Simplicity: Single dependency for both storage and message queue
  • Low latency: In-memory with optional persistence (AOF enabled in docker-compose.yaml)
  • Consumer groups: Built-in support for competing consumers (XREADGROUP)
  • Lightweight: No separate cluster management (vs. Kafka/RabbitMQ)

Expected volume: Low-to-medium (PR events, not high-frequency). For a typical org, expect 10-100 PRs/day.

Production considerations:

  • If volume exceeds thousands of messages/sec or cross-region replication is needed → consider Kafka
  • If guaranteed delivery with complex routing is needed → consider RabbitMQ
  • Current setup has AOF persistence enabled (redis-server --appendonly yes)

2. Why separate workers per scanner vs. one worker with plugins?

Question: How do we add new scanner types? Is there shared code we should extract?

Answer: Rationale for separate workers:

  • Isolation: Different dependencies (KICS needs Go binary, Blackduck needs JRE + 200MB JAR)
  • Independent scaling: Blackduck scans take longer → scale differently than KICS
  • Failure isolation: One scanner crashing doesn't affect others
  • Deployment flexibility: Update KICS without touching Blackduck

Adding a new scanner:

  1. Create services/workers/<scanner-name>/ directory
  2. Copy worker template (app.py, model.py, Dockerfile)
  3. Add infra Dockerfile if special dependencies needed
  4. Add stream name to API fan-out list
  5. Add service to docker-compose.yaml

Shared code to extract:

  • model.py (MessagePayload, StoragePayload) - duplicated across workers
  • stream_listener() pattern - nearly identical in blackduck/kics
  • send_callback() function - identical

Consider creating a services/workers/common/ package.
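
A minimal sketch of what that package could expose (the module layout and the send_callback body are illustrative, not copied from the existing workers):

# services/workers/common/messaging.py (hypothetical shared module)
import logging

import requests

logger = logging.getLogger(__name__)

def send_callback(url: str, payload: dict) -> bool:
    """POST scan results back to the API; this logic is currently duplicated per worker."""
    try:
        resp = requests.post(url, json=payload, timeout=10)
        resp.raise_for_status()
        return True
    except requests.RequestException as e:
        logger.error(f"Callback failed: {e}")
        return False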


3. Why fan-out at the API vs. letting workers subscribe to a single stream?

Question: What if we want selective scanning (only KICS for IaC repos)?

Answer: Current design: API pushes to multiple streams (worker-1, worker-kics), each worker reads its own stream.

Why this approach:

  • Explicit control: API decides which scanners run
  • Visibility: Easy to see which streams received messages
  • Simple workers: Workers don't need filtering logic

For selective scanning (future):

  • API can check repo contents/labels before fan-out
  • Example: Only push to the worker-kics stream if the repo contains .tf, .yaml, or Dockerfile files (see the sketch below)
  • Could add scan_types field to webhook config per repo

Alternative (single stream):

  • All workers read from one stream, filter locally
  • Pro: Simpler API
  • Con: All workers receive all messages, wasted processing
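
Back to selective scanning: a sketch of what the fan-out could look like in the API, using the module-level Redis client the services already define (treating worker-1/Blackduck as always-on, and the source of repo_files, are assumptions):

IAC_MARKERS = (".tf", ".yaml", "Dockerfile")

def fan_out(message: dict, repo_files: list[str]):
    # Blackduck (dependency scan) runs for every PR in this sketch
    Redis.xadd("worker-1", message)
    # KICS runs only when IaC-looking files are present
    if any(f.endswith(IAC_MARKERS) for f in repo_files):
        Redis.xadd("worker-kics", message)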

Operational Concerns

4. What happens when a worker crashes mid-scan?

Question: Are messages re-delivered? Is there a visibility timeout?

Answer: Current behavior:

  • Workers use Redis consumer groups with XREADGROUP
  • Messages are tracked per-consumer until XACK is called
  • If worker crashes before XACK, message stays in Pending Entries List (PEL)

Recovery mechanism:

  • Currently NOT implemented - pending messages are orphaned
  • Need to add: XAUTOCLAIM or XPENDING + XCLAIM logic to reclaim old messages

Recommended fix:

# Add to worker startup - claim messages pending > 60 seconds.
# Returns a cursor for the next sweep plus the reclaimed entries, which
# should then be processed and XACKed like freshly delivered messages.
messages = Redis.xautoclaim(STREAM_NAME, CONSUMER_GROUP, CONSUMER_NAME,
                            min_idle_time=60000, start_id='0-0', count=10)

Current safeguard: Messages are deleted after ACK (XDEL) to prevent reprocessing.


5. How do we scale workers?

Question: Can we run multiple instances of the same worker? How does consumer group balancing work?

Answer: Yes, multiple instances are supported. Redis consumer groups handle this automatically.

How it works:

  • All workers in same CONSUMER_GROUP (default: "workers") share the stream
  • Each message delivered to only ONE consumer in the group
  • Redis delivers each new message to the next consumer that asks for one (roughly round-robin)

To scale:

# docker-compose.yaml
worker-1:
  deploy:
    replicas: 3

Or in Kubernetes:

spec:
  replicas: 3

Important: Each instance needs unique CONSUMER_NAME (currently uses APP_NAME, would need APP_NAME-{instance_id}).
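
One common pattern for that (a sketch, not current code) is to suffix the container hostname, which is unique per replica under both Docker Compose and Kubernetes:

import os
import socket

# Hypothetical: derive a unique consumer name per replica
CONSUMER_NAME = f"{os.getenv('APP_NAME', 'worker')}-{socket.gethostname()}"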


6. What's the message retention/TTL policy?

Question: Old messages never expire currently - is there a cleanup job?

Answer: Current state: No TTL configured. Messages deleted immediately after processing (XDEL).

Storage keys (storage:{id}): No expiration - will accumulate forever.

Recommended fixes:

  1. Add TTL to storage keys:

    Redis.setex(f"storage:{id}", 86400, json.dumps(storage))  # 24hr TTL
  2. Add stream max length:

    Redis.xadd(stream, {"data": msg}, maxlen=10000)  # Keep last 10k
  3. Or use XTRIM in a cleanup job:

    redis-cli XTRIM worker-1 MAXLEN ~ 1000

7. How do we monitor this in production?

Question: What metrics should we track? Where are the dashboards?

Answer: Currently available:

  • /status endpoint on API and each worker (health + Redis connectivity)
  • Basic logging to stdout (Docker logs)

Recommended metrics to add:

| Metric | Source | Purpose |
| --- | --- | --- |
| webhook_events_received | API | Ingest rate |
| messages_published{stream} | API | Fan-out health |
| messages_processed{worker} | Workers | Throughput |
| processing_duration_seconds | Workers | Scan performance |
| callback_success/failure | Workers | Integration health |
| redis_stream_length{stream} | Redis | Backlog monitoring |
| redis_pending_messages{stream,consumer} | Redis | Stuck messages |

Integration options:

  • Prometheus + Grafana (add /metrics endpoint with prometheus-fastapi-instrumentator; see the sketch below)
  • Datadog (add ddtrace instrumentation)
  • CloudWatch (if on AWS)

No dashboards exist currently.
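
If Prometheus is chosen, the FastAPI wiring is small. A minimal sketch using prometheus-fastapi-instrumentator (shown against a fresh app object; in practice it would be the existing API app):

from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # records default HTTP metrics and serves /metrics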


Security Questions

8. How are secrets managed?

Question: Base64 in env vars isn't production-ready - what's the plan? Who rotates the GitHub App private key?

Answer: Current state:

  • GITHUB_APP_PRIVATE_KEY: Base64-encoded PEM in environment variable
  • GITHUB_WEBHOOK_SECRET: Plain text in environment variable
  • Loaded via shell: export GITHUB_APP_PRIVATE_KEY=$(base64 -i private-key.pem)

Why Base64: Avoids newline issues in env vars (PEM files have multiple lines).

Production recommendations:

| Option | Complexity | Notes |
| --- | --- | --- |
| Kubernetes Secrets | Low | Mount as files, not env vars |
| HashiCorp Vault | Medium | Dynamic secrets, audit logging |
| AWS Secrets Manager | Medium | Native if on AWS/EKS |
| Azure Key Vault | Medium | Native if on Azure/AKS |

Key rotation:

  • GitHub App private keys can be rotated in GitHub App settings
  • Generate new key → update secret store → restart services
  • No automated rotation implemented

Code change needed: Support reading key from file path, not just env var.
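
A sketch of that change (GITHUB_APP_PRIVATE_KEY_FILE is an assumed new variable; the Base64 env var stays as a fallback):

import base64
import os

def load_private_key() -> str:
    path = os.getenv("GITHUB_APP_PRIVATE_KEY_FILE")
    if path:
        with open(path) as f:  # e.g. a Kubernetes Secret mounted as a file
            return f.read()
    return base64.b64decode(os.environ["GITHUB_APP_PRIVATE_KEY"]).decode()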


9. Is the callback endpoint authenticated?

Question: Can anyone POST to /callback? Should workers use a shared secret?

Answer: Current state: NO AUTHENTICATION. Anyone can POST to /callback.

# services/api/app.py:86
@app.post("/callback")
def callback(payload: dict):  # No auth check

Risk: Attacker could inject fake scan results.

Recommended fixes:

  1. Shared secret (simple):

    import hmac

    from fastapi import Header, HTTPException

    CALLBACK_SECRET = os.getenv("CALLBACK_SECRET")

    @app.post("/callback")
    def callback(payload: dict, x_callback_secret: str = Header(...)):
        # compare_digest avoids leaking the secret through timing differences
        if not hmac.compare_digest(x_callback_secret, CALLBACK_SECRET):
            raise HTTPException(401, "Invalid callback secret")
  2. Network isolation (defense in depth):

    • Keep callback endpoint internal-only (not exposed via ingress)
    • Currently OK in Docker network, but verify in production
  3. Message signing:

    • Workers sign payload with HMAC
    • API verifies signature
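
A sketch of option 3 (the canonicalization and where the signature travels, e.g. an X-Signature header, are assumptions):

import hashlib
import hmac
import json

def sign_payload(payload: dict, secret: bytes) -> str:
    # Worker side: canonical JSON keeps the signature stable across dict ordering
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

# API side: recompute sign_payload() over the received body and compare
# against the transmitted signature with hmac.compare_digest()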

10. Is Redis authenticated?

Question: Currently no auth - what's the production config?

Answer: Current state: NO AUTHENTICATION.

# docker-compose.yaml
redis:
  image: "redis:8"
  command: redis-server --appendonly yes  # No --requirepass

Risk: Any container on the network can read/write Redis data.

Recommended fixes:

  1. Add password:

    redis:
      command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
  2. Update clients:

    Redis = redis.Redis(
        host=os.getenv("REDIS_HOST"),
        port=int(os.getenv("REDIS_PORT", 6379)),
        password=os.getenv("REDIS_PASSWORD"),  # Add this
        decode_responses=True
    )
  3. Production: Use managed Redis (ElastiCache, Azure Cache, etc.) with:

    • TLS encryption
    • IAM authentication (AWS) or Azure AD
    • VPC/private networking

Implementation Gaps

11. What exactly should the scanners do?

Question: Clone the repo? Use GitHub API for file contents? Full repo scan vs. changed files only? Where do scan results get stored?

Answer: Current state: Workers receive PR metadata but scanning is stubbed (time.sleep(5)).

Recommended implementation:

| Scanner | Approach | Notes |
| --- | --- | --- |
| KICS | Clone repo → run kics scan | Needs full repo for context |
| Blackduck | Clone repo → run Detect JAR | Full dependency analysis |

Clone approach (recommended):

from git import Repo  # GitPython

def clone_repo(owner: str, name: str, branch: str, token: str) -> str:
    # GitHub App installation token embedded in the clone URL
    url = f"https://x-access-token:{token}@github.com/{owner}/{name}.git"
    path = f"/tmp/repos/{owner}-{name}-{branch}"
    Repo.clone_from(url, path, branch=branch, depth=1)  # shallow, single-branch clone
    return path

GitHub API approach (alternative):

  • Use for changed-files-only scanning
  • GET /repos/{owner}/{repo}/pulls/{pull_number}/files
  • Less resource-intensive but limited context
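
A sketch of the changed-files lookup (a direct requests call for illustration; the app may route this through its GitHub App client instead):

import requests

def changed_files(owner: str, repo: str, pr_number: int, token: str) -> list[str]:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/files",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [f["filename"] for f in resp.json()]  # paginated: 30 files per page by default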

Results storage:

  • Short-term: Redis (scan_result:{id}) with TTL
  • Long-term: PostgreSQL or S3 for audit/compliance
  • Currently: Only logged, not persisted

12. How do we post results back to GitHub?

Question: PR comments? Check runs? Status checks? What format should results be in?

Answer: Options:

| Method | Visibility | Best For |
| --- | --- | --- |
| PR Comment | In PR thread | Human-readable summaries |
| Check Run | Checks tab, PR status | Detailed results with annotations |
| Status Check | PR status badge | Simple pass/fail |

Recommended: Check Runs (best UX)

# In callback handler, after receiving results.
# Sketch: how the installation-scoped client is obtained depends on the
# GitHub App library in use (fastapi-githubapp here).

async def post_check_run(installation_id, owner, repo, head_sha, results):
    gh = await github_app.get_installation_client(installation_id)
    await gh.post(f"/repos/{owner}/{repo}/check-runs", data={
        "name": "KICS Security Scan",
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": "success" if results.passed else "failure",
        "output": {
            "title": "KICS Scan Results",
            "summary": f"Found {results.issue_count} issues",
            "annotations": [...]  # Line-level findings
        }
    })

Required GitHub App permissions:

  • checks: write
  • pull_requests: read

13. What's the expected scan duration?

Question: Blackduck can take 10+ minutes - how does that affect the flow?

Answer: Expected durations:

| Scanner | Typical Duration | Factors |
| --- | --- | --- |
| KICS | 30s - 2min | Repo size, file count |
| Blackduck | 5 - 15min | Dependencies, network speed |

Current handling:

  • Workers block during scan (synchronous)
  • block=5000 in XREADGROUP - 5 second timeout for new messages
  • No overall timeout on scan execution

Concerns:

  • Long scans block the worker thread
  • No way to cancel in-progress scans
  • User sees no progress on PR

Recommendations:

  1. Add scan timeout:

    import signal
    signal.alarm(600)  # 10 minute timeout; install a SIGALRM handler that raises,
                       # otherwise the signal's default action terminates the whole worker
  2. Post "in progress" check run immediately, update when complete

  3. Consider async scanning:

    • Worker spawns subprocess
    • Polls for completion
    • Can handle multiple scans concurrently
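
A sketch combining recommendations 1 and 3: running the scanner CLI as a subprocess gives both a hard timeout and a natural cancellation point (the command list is whatever the scanner binary expects):

import subprocess

def run_scan(cmd: list[str], timeout_s: int = 600) -> subprocess.CompletedProcess:
    # raises subprocess.TimeoutExpired and kills the child if the limit is hit
    return subprocess.run(cmd, capture_output=True, timeout=timeout_s)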

Deployment & Reliability

14. What's the target deployment platform?

Question: Kubernetes? ECS? Cloud Run?

Answer: Current state: Docker Compose only. No production deployment configs.

Recommended path:

| Platform | Effort | Best For |
| --- | --- | --- |
| Kubernetes | Medium | Existing K8s clusters, complex scaling |
| ECS Fargate | Low | AWS shops, simpler ops |
| Cloud Run | Low | GCP shops, auto-scaling |

What's needed:

  • Kubernetes manifests or Helm chart
  • Health/readiness probes (already have /status)
  • Resource limits (CPU/memory)
  • Horizontal Pod Autoscaler configs
  • Ingress/load balancer for API
  • Managed Redis (ElastiCache/Memorystore)
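
A probe/limits sketch for a worker Deployment (container name, port, and limit values are assumptions; the /status endpoint already exists on each service):

# Kubernetes manifest fragment (illustrative)
containers:
  - name: worker-kics
    livenessProbe:
      httpGet: { path: /status, port: 8080 }
      periodSeconds: 30
    readinessProbe:
      httpGet: { path: /status, port: 8080 }
      initialDelaySeconds: 5
    resources:
      limits: { cpu: "1", memory: 1Gi }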

Smee not needed in production - GitHub webhooks hit API directly via public endpoint.


15. What happens if Redis goes down?

Question: Is there a circuit breaker? Do we need Redis clustering/sentinel?

Answer: Current behavior:

  • API: /status returns redis: "error", but webhook handler would crash
  • Workers: Catch redis.ConnectionError, sleep 5s, retry forever

# services/workers/blackduck/app.py:101
except redis.ConnectionError as e:
    logger.error(f"[{APP_NAME}] Redis connection error: {e}")
    time.sleep(5)  # Retry loop

No circuit breaker implemented.

Recommendations:

  1. Circuit breaker (e.g., pybreaker):

    from pybreaker import CircuitBreaker
    redis_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)
    
    @redis_breaker
    def publish_message(stream, data):
        Redis.xadd(stream, data)
  2. Redis high availability:

    • Development: Single node is fine
    • Production: Redis Sentinel or Redis Cluster
    • Managed: ElastiCache Multi-AZ, Azure Cache Premium
  3. Graceful degradation:

    • API could queue webhooks locally if Redis down (with risk of loss)
    • Return 503 to GitHub (note: GitHub does not retry automatically; failed deliveries can be redelivered from the App settings)

16. How do we handle GitHub API rate limits?

Question: The decorators exist but aren't implemented.

Answer: Current state: Decorator exists but is a passthrough.

# services/api/app.py:106
@with_rate_limit_handling(github_app)
def handle_pr():
    ...

The @with_rate_limit_handling decorator comes from the fastapi-githubapp library and is intended to handle rate limits automatically by:

  • Checking the X-RateLimit-Remaining header
  • Sleeping until X-RateLimit-Reset if exhausted

Rate limits:

| Resource | Limit | Reset |
| --- | --- | --- |
| GitHub App installation | 5,000/hr | Rolling |
| Search API | 30/min | Rolling |

Recommendations:

  1. Verify decorator actually works (test with high volume)
  2. Add monitoring for rate limit headers (see the sketch after this list)
  3. For posting results, batch if possible
  4. Consider caching GitHub API responses (repo metadata doesn't change often)
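
A sketch for recommendation 2: a helper to call after each GitHub API response (the function name is illustrative; the header names are GitHub's documented rate-limit headers):

import logging

logger = logging.getLogger(__name__)

def log_rate_limit(response) -> None:
    remaining = int(response.headers.get("X-RateLimit-Remaining", -1))
    reset_at = int(response.headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    if 0 <= remaining < 100:
        logger.warning(f"GitHub rate limit low: {remaining} left, resets at {reset_at}")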

Testing & Validation

17. How do we test this end-to-end without real PRs?

Question: Are the fixture scripts sufficient? Is there a mock GitHub webhook generator?

Answer: Current testing tools:

  1. Fixture files:

    • scripts/fixtures/pr-opened.json - Sample webhook payload
    • scripts/post-pr-opened.sh - Sends fixture to API
  2. Manual fan-out:

    curl -X POST http://localhost:8080/fanout

    This bypasses GitHub entirely and tests Redis → Workers → Callback.

  3. Smee.io for real webhooks:

    • Create test repo with GitHub App installed
    • Open PR → webhook flows through Smee → local API

Recommended additions:

  1. Webhook simulator:

    # tests/conftest.py
    @pytest.fixture
    def mock_webhook():
        return {
            "action": "opened",
            "pull_request": {"number": 1, "head": {"sha": "abc123"}},
            "repository": {"name": "test", "owner": {"login": "testorg"}}
        }
  2. Integration test with testcontainers:

    from testcontainers.redis import RedisContainer

    def test_full_flow():
        with RedisContainer() as redis_container:
            # Start API, workers, send webhook, verify callback
            ...

18. What's the test coverage expectation?

Question: pytest is configured but no tests exist.

Answer: Current state:

  • pytest.ini exists
  • pytest-cov, pytest-mock, pytest-asyncio in dev dependencies
  • Zero tests implemented

Recommended coverage targets:

| Component | Priority | Target |
| --- | --- | --- |
| utils.py (base64, datetime) | High | 100% |
| model.py (data classes) | High | 100% |
| Webhook signature validation | High | 100% |
| Stream consumer logic | Medium | 80% |
| Callback handling | Medium | 80% |
| Integration (API → Redis → Worker) | Medium | Key paths |

Quick wins:

# tests/test_utils.py
import base64

from utils import decode_base64_key  # adjust import to the actual module path

def test_decode_base64_key():
    encoded = base64.b64encode(b"test-key").decode()
    assert decode_base64_key(encoded) == "test-key"

# tests/test_model.py
from model import StoragePayload  # adjust import to the actual module path

def test_storage_payload_from_json():
    json_str = '{"id": "1", "name": "repo", "owner": "org", "branch": "main"}'
    payload = StoragePayload.from_json(json_str)
    assert payload.name == "repo"

Data & Compliance

19. Where do historical scan results live?

Question: Redis only? Need a database? Audit/compliance requirements?

Answer: Current state:

  • PR metadata: Redis (storage:{id}) - no TTL
  • Scan results: Logged only, not persisted
  • No audit trail

Recommendations:

| Data | Short-term | Long-term |
| --- | --- | --- |
| PR metadata | Redis (24hr TTL) | PostgreSQL or skip |
| Scan results | Redis (7 days) | PostgreSQL + S3 |
| Audit log | - | PostgreSQL (immutable) |

For compliance (SOC2, etc.):

  • Who triggered scan
  • What was scanned (commit SHA)
  • When scan ran
  • What was found
  • What action was taken

Schema suggestion:

CREATE TABLE scan_results (
    id UUID PRIMARY KEY,
    pr_id INT,
    repo_owner TEXT,
    repo_name TEXT,
    commit_sha TEXT,
    scanner TEXT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    status TEXT,
    findings JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

20. How long do we retain PR metadata?

Question: GDPR/data retention policies?

Answer: Current state: Retained forever (no TTL on Redis keys).

What's stored:

  • storage:{id}: repo name, owner, branch, PR ID
  • No PII directly (usernames are GitHub handles, not personal data)

Recommendations:

| Data Type | Retention | Justification |
| --- | --- | --- |
| PR metadata | 30 days | Operational debugging |
| Scan results | 1 year | Compliance, trend analysis |
| Audit logs | 7 years | Regulatory (SOC2, etc.) |

Implementation:

# Add TTL when storing
Redis.setex(f"storage:{id}", 30 * 86400, json.dumps(storage))  # 30 days

GDPR considerations:

  • GitHub usernames may be considered pseudonymous data
  • Ensure data deletion process exists if requested
  • Document data flows in privacy policy

Priority Questions

These are most likely to come up first - ensure answers are prepared:

| # | Topic | Why It's Critical |
| --- | --- | --- |
| 1 | Redis vs alternatives | Architecture justification |
| 8 | Secrets management | Security concern |
| 9 | Callback authentication | Security concern |
| 11 | Scanner implementation | Core functionality gap |
| 12 | GitHub integration | Core functionality gap |
| 14 | Deployment platform | Ops planning |

Last updated: February 2026