Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Whenever we have maintenance in OpenStack we do a graceful shutdown of the node, which triggers draining of its pods. However, the Cinder volume detach reaches the controller plugin too late in the shutdown sequence:
I0917 15:21:34.010628 1 csi_handler.go:234] Error processing "csi-045061d9a5678eed8e0d57a59e01420bb5e13899386cc6581378d5558301b39e": failed to detach: rpc error: code = Internal desc = ControllerUnpublishVolume Detach Volume failed with error failed to detach volume cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 from compute e7eda096-1267-45a1-9164-f7a7d85c7f4e : Expected HTTP response code [202 204] when accessing [DELETE https://api.ouropenst.ack:8774/v2.1/servers/e7eda096-1267-45a1-9164-f7a7d85c7f4e/os-volume_attachments/cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498], but got 409 instead: {"conflictingRequest": {"code": 409, "message": "Cannot 'detach_volume' instance e7eda096-1267-45a1-9164-f7a7d85c7f4e while it is in task_state reboot_started"}}
which means that while the server is in maintenance we cannot re-attach its volumes to another node:
I0917 15:21:34.429151 1 csi_handler.go:251] Attaching "csi-ae19a426cd7dbc8de74de71d711bc63ae0e71e6c28a0238785e84338ad3beb6c"
I0917 15:21:34.599180 1 csi_handler.go:234] Error processing "csi-ae19a426cd7dbc8de74de71d711bc63ae0e71e6c28a0238785e84338ad3beb6c": failed to attach: rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 volume to b3d51121-6c84-4670-a4a7-e83729d2004a compute: Expected HTTP response code [200] when accessing [POST https://api.ouropenst.ack:8774/v2.1/servers/b3d51121-6c84-4670-a4a7-e83729d2004a/os-volume_attachments], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: volume cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498 is already attached to instances: e7eda096-1267-45a1-9164-f7a7d85c7f4e"}}
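For reference, a pre-reboot hook along these lines would avoid the race: drain the node, then block until the csi-attacher has removed every VolumeAttachment for it before asking Nova to reboot. This is only a sketch against client-go, not anything cinder-csi ships; the node name, kubeconfig source, and timeout are placeholders.

```go
// Sketch of a maintenance pre-reboot hook (assumed workflow): after
// "kubectl drain <node>", block until Kubernetes reports no
// VolumeAttachment objects for that node, i.e. until the csi-attacher
// has finished ControllerUnpublishVolume, before rebooting in Nova.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForDetach polls the VolumeAttachment API until no attachment
// references nodeName, or the context expires.
func waitForDetach(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	for {
		list, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		remaining := 0
		for _, va := range list.Items {
			if va.Spec.NodeName == nodeName {
				remaining++
			}
		}
		if remaining == 0 {
			return nil // nothing left to detach; safe to reboot
		}
		fmt.Printf("still %d attachment(s) on %s, waiting...\n", remaining, nodeName)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// "node-1" is a placeholder: drain it first, wait here, and only
	// then trigger the shutdown/reboot through the OpenStack API.
	if err := waitForDetach(ctx, cs, "node-1"); err != nil {
		panic(err)
	}
	fmt.Println("all volumes detached; node can be rebooted")
}
```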
What you expected to happen:
The volumes should be unmounted and detached before the node goes into reboot.
How to reproduce it:
Do a graceful shutdown of a node in the cluster.
Anything else we need to know?:
Don't think so.
Environment:
openstack-cloud-controller-manager (or other related binary) version: cinder-csi 1.31.0, whole package (Helm chart) 2.31.0
OpenStack version: 20.3.1 (cinder module)
Others:
Actually it might be a problem with the whole idea of ACPI shutdown: OpenStack seems to set the reboot_started task_state right away when a user presses ctrl-alt-del / shutdown / whatever button in Horizon (or via the API), so by the time the pods get drained off the node (and their volumes unmounted), OpenStack no longer accepts any volume detachments.
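If the 409 is unavoidable, the controller plugin could at least poll the instance's task_state and back off while the reboot is in flight instead of failing outright. A rough sketch with gophercloud (assuming v1; serverWithTaskState and detachWhenIdle are made-up names and the retry budget is arbitrary, the IDs are taken from the logs above):

```go
// Sketch of a controller-side mitigation: before calling the Nova
// detach endpoint, read the server's OS-EXT-STS:task_state and wait
// while a reboot is in flight, since Nova answers 409 for
// detach_volume whenever task_state is set (e.g. "reboot_started").
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/extendedstatus"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// serverWithTaskState pulls the OS-EXT-STS fields alongside the
// standard server attributes.
type serverWithTaskState struct {
	servers.Server
	extendedstatus.ServerExtendedStatusExt
}

// detachWhenIdle retries the detach until the instance has no pending
// task_state, instead of surfacing the 409 straight away.
func detachWhenIdle(compute *gophercloud.ServiceClient, serverID, volumeID string) error {
	for attempt := 0; attempt < 30; attempt++ {
		var srv serverWithTaskState
		if err := servers.Get(compute, serverID).ExtractInto(&srv); err != nil {
			return err
		}
		if srv.TaskState != "" {
			fmt.Printf("instance busy (task_state=%q), retrying...\n", srv.TaskState)
			time.Sleep(10 * time.Second)
			continue
		}
		// For Nova os-volume_attachments the attachment ID is the volume ID.
		return volumeattach.Delete(compute, serverID, volumeID).ExtractErr()
	}
	return fmt.Errorf("gave up waiting for instance %s to become idle", serverID)
}

func main() {
	opts, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		panic(err)
	}
	provider, err := openstack.AuthenticatedClient(opts)
	if err != nil {
		panic(err)
	}
	compute, err := openstack.NewComputeV2(provider, gophercloud.EndpointOpts{
		Region: os.Getenv("OS_REGION_NAME"),
	})
	if err != nil {
		panic(err)
	}
	err = detachWhenIdle(compute,
		"e7eda096-1267-45a1-9164-f7a7d85c7f4e",  // compute instance from the logs
		"cbb2daa9-6cf7-4bd9-aa0b-1902a1b30498") // volume from the logs
	if err != nil {
		panic(err)
	}
}
```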