Add support for Preservation of Machines and Backing nodes#1059
Add support for Preservation of Machines and Backing nodes#1059thiyyakat wants to merge 79 commits intogardener:masterfrom
Conversation
|
@thiyyakat You need rebase this pull request with latest master branch. Please check. |
06ecf58 to
89f2900
Compare
|
Questions that remain unanswered:
|
There was a problem hiding this comment.
Note: A review meeting was held today for this PR. The comments were given during the meeting.
During the meeting, we revisited the decision to move drain to Failed state for preserved machine. The reason discussed previously was that it didn't make sense semantically to move the machine to Terminating and then do the drain, because there is a possibility that the machine may recover. Since Terminating is a final state, the drain (separate from the drain in triggerDeletionFlow) will be performed in Failed phase. There was no change proposed during the meeting. This design decision was only reconfirmed.
takoverflow
left a comment
There was a problem hiding this comment.
Have only gone through half of the PR, have some suggestions PTAL.
22c646e to
7c062b5
Compare
e2a7ea7 to
74603a4
Compare
a487a18 to
508b1ba
Compare
aaronfern
left a comment
There was a problem hiding this comment.
Thanks for the PR @thiyyakat!
A few questions/nits from me, please address them
| UpdateFailed string = "UpdateFailed" | ||
| ) | ||
|
|
||
| const ( |
There was a problem hiding this comment.
These condition constants feel like they are in the wrong place as we already have conditions at pkg/apis/machine/types.go. Also, I don't think the Node prefix should be used for the condition constant names as they are used in Machine objects too. @unmarshall should these even be exposed in API ?
There was a problem hiding this comment.
I've added them here after seeing the constants for InPlaceUpdates added just above:
The NodeCondition for InPlace is named NodeInPlaceUpdate, and I've followed the same.
@elankath , @unmarshall , please let me know what change you would like me to make.
There was a problem hiding this comment.
@thiyyakat Ok, but the the reason constants like NodePreservedByMCM, etc should just be PreservedByMCM - that is also the convention followed by in-place update constants.
PreservedNodeDrainSuccessful -> DrainSuccessful
There was a problem hiding this comment.
Will make the change to the other constant names and shorten them.
This one: PreservedNodeDrainSuccessful -> DrainSuccessful I am unsure of what to do. DrainSuccessful is used as a Reason for InPlaceUpdate, and the comment indicates the same. Is it okay to re-use it for a Message?
Ref:
…e accessing maps.
- Modify sort function to de-prioritize preserve machines - Add test for the same - Improve logging - Fix bug in stopMachinePreservationIfPreserved when node is not found - Update default MachinePreserveTimeout to 3 days as per doc
- Reuse function to write annotation on machine - Minor refactoring
- Make changes to add auto-preserve-stopped on recovered, auto-preserved previously failed machines. - Change stopMachinePreservationIfPreserved to removeCA annotation when preserve=false on a recovered failed, preserved machine
* remove stop annotation value * remove CA scale-down annotation when preservation stops * change preservation annotation handling semantics for machine and node * remove auto-preserve-stopped annotation value * Add preserveExpiryTime to NodeCondition.Message * modify test cases
…eserved machines if autoPreservedFailedMachineMax is decreased in the shoot spec.
…liedNodePreserveValue for persisting node annotation values that have been applied.
…th bugfix on master
gagan16k
aaronfern
takoverflow
left a comment
There was a problem hiding this comment.
Have requested a couple of minor changes, PTAL
| for _, m := range machinesWithoutUpdateSuccessfulLabel { | ||
| if machineutils.IsMachineFailed(m) { | ||
| staleMachines = append(staleMachines, m) | ||
| if c.shouldFailedMachineBeTerminated(m) { |
There was a problem hiding this comment.
Should this function be renamed to isFailedMachineBeingPreserved? To clarify the intent better (This would require reversing the semantics of the bool returned though, and the check here needs to be inverted). WDYT
There was a problem hiding this comment.
There was a bit of a discussion on this: #1059 (comment).
Please let me know if you still feel the change is required.
There was a problem hiding this comment.
I still see value in explicitly conveying what the function is checking. Rather than making it ambiguous and then needing to look at the function definition to realise that the criteria being used for terminating failed machine is just preservation.
takoverflow - part 1
4e8c12f to
1538501
Compare
gagan16k - part 2
What this PR does / why we need it:
This PR introduces a feature that allows operators and endusers to preserve a machine/node and the backing VM for diagnostic purposes.
The expected behaviour, use cases and usage are detailed in the proposal that can be found here
Which issue(s) this PR fixes:
Fixes #1008
Special notes for your reviewer:
The following tests were carried out serially with the machine-controller-manager-provider-virtual: #1059 (comment)
Please also take a look at the questions asked here.
Release note: