After several discussions with @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse, we collectively decided that the time to have Keylime easily deployed on Kubernetes/Openshift has come. I propose we use this issue to concentrate all the relevant discussion on this topic.
I will start by listing some common relevant points, and I do thank Marcus Hesse for starting the discussion on the keylime-operator on CNCF's Slack. I believe I have addressed most of your questions in this writeup.
The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.
I am not a Kubernetes/Openshift expert by any means, so my proposal here is bound to be incomplete or incorrect; additions and corrections are welcome. That being said, I see the following set of intermediate steps, in increasing order of complexity, as a good way to achieve our goal.
1. Ensure that all `keylime` components can be fully executed in a containerized manner. For this, the following requirements should be satisfied:
   a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), already offering the "latest" `verifier`, `registrar` and `tenant`, to also include the rust `agent` image (@ansasaki is pursuing this)
   b. Carefully determine the least amount of (container) privileges required to run the `agent`
   c. Provide some tool to perform containerized `keylime` deployments (@maugustosilva and @galmasi have a tool, about to be released as open source, to perform this task).
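To make the fully containerized execution concrete, a stack of the three images could be smoke-tested outside Kubernetes with a compose file along these lines. This is only a sketch: the image tags, port numbers and `KEYLIME_*` environment variable names are assumptions that should be checked against the actual images and the keylime configuration documentation.

```yaml
# Sketch of a containerized keylime stack; image tags, ports and
# environment variable names are assumptions to be verified.
version: "3"
services:
  registrar:
    image: quay.io/keylime/keylime_registrar:latest
    ports:
      - "8890:8890"   # assumed default registrar port
  verifier:
    image: quay.io/keylime/keylime_verifier:latest
    ports:
      - "8881:8881"   # assumed default verifier port
    environment:
      KEYLIME_VERIFIER_REGISTRAR_IP: registrar
  agent:
    image: quay.io/keylime/keylime_agent:latest   # hypothetical rust agent tag
    # least-privilege goal: pass through only the TPM device nodes,
    # rather than running the container as privileged
    devices:
      - /dev/tpm0
      - /dev/tpmrm0
    environment:
      KEYLIME_AGENT_REGISTRAR_IP: registrar
```

Such a file doubles as documentation of the minimal privilege set (point b above): if the agent runs with only the TPM device nodes mapped in, no `--privileged` flag should be needed.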
2. Create a simple Kubernetes application for `keylime`. At this point, we should be able to start by writing progressively more `yaml` files.
   a. The idea is to start with a very simple deployment with the following objects:
      * A `StatefulSet` (initially of 1) for the `Registrar`
      * A `StatefulSet` (initially of 1) for the `Verifier`
      * A `DaemonSet` for the `Agents`
      * Both `StatefulSets` exposed as `Service` (type=NodePort)
      * mTLS certificates stored as `Secrets`
      * Given the fact that `keylime` can be fully configured via environment variables, we shall use environment-dependent variables in our yaml.
   b. Initially, I propose we adopt the following simplifying boundary conditions:
      * Given the use of `sqlite`, we could start without any DB deployment
      * mTLS certificates are pre-generated (with `keylime_ca` commands) and added to the Kubernetes cluster
      * Environment variables will also be set and maintained by some external tool
      * The `tenant` will NOT be part of the initial deployment.
      * Make use of "Node Feature Discovery" to mark all the nodes with `tpm` devices (and make it part of the `DaemonSet` node selector)
   c. From this point we should expand to a "scale-out" deployment:
      * Multiple `Registrars` and `Verifiers`
      * A pre-packaged `helm` deployment of some SQL database server will be used.
      * A `Service` (type=LoadBalancer)
   d. At this point, the following technical considerations should be made:
      * I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
      * `Verifiers` are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
      * The load-balancing algorithm will have to use the URI (which contains the `agent` UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single `tenant` will add all the `agents` to the set of `verifiers`)
      * The `tenant` is still considered a component outside of the whole deployment
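To make the initial "simple Kubernetes application" tangible, the first objects could look roughly like the sketch below. All names, the namespace, the image tags and the node-feature-discovery label are illustrative assumptions, not tested manifests:

```yaml
# Illustrative sketch only: object names, namespace, image tags and the
# TPM node label are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keylime-registrar
  namespace: keylime
spec:
  serviceName: keylime-registrar
  replicas: 1                       # start with 1, scale out later
  selector:
    matchLabels:
      app: keylime-registrar
  template:
    metadata:
      labels:
        app: keylime-registrar
    spec:
      containers:
        - name: registrar
          image: quay.io/keylime/keylime_registrar:latest
          ports:
            - containerPort: 8890   # assumed registrar port
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: keylime-agent
  namespace: keylime
spec:
  selector:
    matchLabels:
      app: keylime-agent
  template:
    metadata:
      labels:
        app: keylime-agent
    spec:
      # run only on nodes where Node Feature Discovery found a TPM;
      # the exact label name is an assumption
      nodeSelector:
        feature.node.kubernetes.io/tpm: "true"
      containers:
        - name: agent
          image: quay.io/keylime/keylime_agent:latest
---
apiVersion: v1
kind: Service
metadata:
  name: keylime-registrar
  namespace: keylime
spec:
  type: NodePort
  selector:
    app: keylime-registrar
  ports:
    - port: 8890
      targetPort: 8890
```

A matching `StatefulSet` and `Service` for the verifier would follow the same pattern; switching the `Service` to type=LoadBalancer is then the main change needed for the scale-out stage.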
3. Create an `Operator` for `keylime`. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:
   - Ability to automatically generate all pertinent certificates
   - Ability to deal with environment variables
   - Ability to automatically add `agents` to `verifiers`
   - Ability to react to administrative tasks on a node, such as reboot, drainage, or cordoning off.
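One way to think about the operator's user-facing API is a single custom resource that captures the whole attestation deployment. Everything below (API group, kind, and all field names) is purely hypothetical, offered only to make the desired characteristics tangible:

```yaml
# Purely hypothetical custom resource for the proposed attestation operator;
# the group, kind and every field name are invented for illustration.
apiVersion: attestation.keylime.example/v1alpha1
kind: KeylimeCluster
metadata:
  name: example
spec:
  registrar:
    replicas: 1
  verifier:
    replicas: 2                     # operator adds agents to verifiers automatically
  agent:
    nodeSelector:
      feature.node.kubernetes.io/tpm: "true"   # assumed NFD label
  tls:
    autoGenerate: true              # operator generates all mTLS material as Secrets
```

The operator's reconcile loop would then own certificate generation, environment variable wiring, and agent-to-verifier assignment, and could watch node conditions to react to reboots or cordoning.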
4. Make the `Operator` more "production-ready":
   - How to deal with (`measured boot` and `runtime`/IMA) policies?
   - How to deal with "scale-out" operations (i.e., if the number of `verifier` pods increases, should we perform "rebalancing")?
   - How to integrate "durable attestation" into this scenario?
5. The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the `keylime` project. I will create such a repository.