After several discussions with @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse, we collectively decided that the time to have Keylime easily deployed on Kubernetes/Openshift has come. I propose we use this issue to concentrate all the relevant discussion on this topic.
I will start by listing some common relevant points, and I do thank Marcus Hesse for starting the discussion on the keylime-operator on CNCF's Slack. I believe I have addressed most of your questions in this writeup.
The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.
I am not a Kubernetes/Openshift expert by any means, so my proposal here is bound to be incomplete or incorrect; additions and corrections are welcome. That being said, I see the following set of intermediate steps, in increasing order of complexity, as a good way to achieve our goal.
1. Ensure that all `keylime` components can be fully executed in a containerized manner. For this, the following requirements should be satisfied:
   a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), already offering the "latest" `verifier`, `registrar` and `tenant`, to also include the rust `agent` image (@ansasaki is pursuing this)
   b. Carefully determine the least amount of (container) privileges required to run the `agent`
   c. Provide some tool to perform containerized `keylime` deployments (@maugustosilva and @galmasi have a tool, about to be released as open source, to perform this task).
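To make the fully containerized execution concrete, a stack of the three images could be smoke-tested outside Kubernetes with a compose file along these lines. This is only a sketch: the image tags, port numbers and `KEYLIME_*` environment variable names are assumptions that should be checked against the actual images and the keylime configuration documentation.

```yaml
# Sketch of a containerized keylime stack; image tags, ports and
# environment variable names are assumptions to be verified.
version: "3"
services:
  registrar:
    image: quay.io/keylime/keylime_registrar:latest
    ports:
      - "8890:8890"   # assumed default registrar port
  verifier:
    image: quay.io/keylime/keylime_verifier:latest
    ports:
      - "8881:8881"   # assumed default verifier port
    environment:
      KEYLIME_VERIFIER_REGISTRAR_IP: registrar
  agent:
    image: quay.io/keylime/keylime_agent:latest   # hypothetical rust agent tag
    # least-privilege goal: pass through only the TPM device nodes,
    # rather than running the container as privileged
    devices:
      - /dev/tpm0
      - /dev/tpmrm0
    environment:
      KEYLIME_AGENT_REGISTRAR_IP: registrar
```

Such a file doubles as documentation of the minimal privilege set (point b above): if the agent runs with only the TPM device nodes mapped in, no `--privileged` flag should be needed.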
2. Create a simple Kubernetes application for `keylime`. At this point, we should be able to start by writing progressively more `yaml` files.
   a. The idea is to start with a very simple deployment with the following objects:
      * A `StatefulSet` (initially of 1) for the `Registrar`
      * A `StatefulSet` (initially of 1) for the `Verifier`
      * A `DaemonSet` for the `Agents`
      * Both `StatefulSets` exposed as `Service` (type=NodePort)
      * mTLS certificates stored as `Secrets`
      * Given the fact that `keylime` can be fully configured via environment variables, we shall use environment-dependent variables in our yaml.
   b. Initially, I propose we adopt the following simplifying boundary conditions:
      * Given the use of `sqlite`, we could start without any DB deployment
      * mTLS certificates are pre-generated (with `keylime_ca` commands) and added to the Kubernetes cluster
      * Environment variables will also be set and maintained by some external tool
      * The `tenant` will NOT be part of the initial deployment.
      * Make use of "Node Feature Discovery" to mark all the nodes with `tpm` devices (and make it part of the `DaemonSet` node selector)
   c. From this point we should expand to a "scale-out" deployment:
      * Multiple `Registrars` and `Verifiers`
      * A pre-packaged `helm` deployment of some SQL database server will be used.
      * A `Service` (type=LoadBalancer)
   d. At this point, the following technical considerations should be made:
      * I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
      * `Verifiers` are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
      * The load-balancing algorithm will have to use the URI (which contains the `agent` UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single `tenant` will add all the `agents` to the set of `verifiers`)
      * The `tenant` is still considered a component outside of the whole deployment
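To make the initial "simple Kubernetes application" tangible, the first objects could look roughly like the sketch below. All names, the namespace, the image tags and the node-feature-discovery label are illustrative assumptions, not tested manifests:

```yaml
# Illustrative sketch only: object names, namespace, image tags and the
# TPM node label are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keylime-registrar
  namespace: keylime
spec:
  serviceName: keylime-registrar
  replicas: 1                       # start with 1, scale out later
  selector:
    matchLabels:
      app: keylime-registrar
  template:
    metadata:
      labels:
        app: keylime-registrar
    spec:
      containers:
        - name: registrar
          image: quay.io/keylime/keylime_registrar:latest
          ports:
            - containerPort: 8890   # assumed registrar port
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: keylime-agent
  namespace: keylime
spec:
  selector:
    matchLabels:
      app: keylime-agent
  template:
    metadata:
      labels:
        app: keylime-agent
    spec:
      # run only on nodes where Node Feature Discovery found a TPM;
      # the exact label name is an assumption
      nodeSelector:
        feature.node.kubernetes.io/tpm: "true"
      containers:
        - name: agent
          image: quay.io/keylime/keylime_agent:latest
---
apiVersion: v1
kind: Service
metadata:
  name: keylime-registrar
  namespace: keylime
spec:
  type: NodePort
  selector:
    app: keylime-registrar
  ports:
    - port: 8890
      targetPort: 8890
```

A matching `StatefulSet` and `Service` for the verifier would follow the same pattern; switching the `Service` to type=LoadBalancer is then the main change needed for the scale-out stage.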
3. Create an `Operator` for `keylime`. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:
   - Ability to automatically generate all pertinent certificates
   - Ability to deal with environment variables
   - Ability to automatically add `agents` to `verifiers`
   - Ability to react to administrative tasks on a node, such as reboot, drainage, or cordoning off.
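One way to think about the operator's user-facing API is a single custom resource that captures the whole attestation deployment. Everything below (API group, kind, and all field names) is purely hypothetical, offered only to make the desired characteristics tangible:

```yaml
# Purely hypothetical custom resource for the proposed attestation operator;
# the group, kind and every field name are invented for illustration.
apiVersion: attestation.keylime.example/v1alpha1
kind: KeylimeCluster
metadata:
  name: example
spec:
  registrar:
    replicas: 1
  verifier:
    replicas: 2                     # operator adds agents to verifiers automatically
  agent:
    nodeSelector:
      feature.node.kubernetes.io/tpm: "true"   # assumed NFD label
  tls:
    autoGenerate: true              # operator generates all mTLS material as Secrets
```

The operator's reconcile loop would then own certificate generation, environment variable wiring, and agent-to-verifier assignment, and could watch node conditions to react to reboots or cordoning.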
4. Make the `Operator` more "production-ready":
   - How to deal with (`measured boot` and `runtime`/IMA) policies?
   - How to deal with "scale-out" operations (i.e., if the number of `verifier` pods increases, should we perform "rebalancing")?
   - How to integrate "durable attestation" into this scenario?
5. The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the `keylime` project. I will create such a repository.