A cancelled run may leave you with a broken .NET installation #501

Open
2 of 5 tasks
prplecake opened this issue Jan 24, 2024 · 8 comments
Labels
feature request New feature or request to improve the current logic

Comments

@prplecake

prplecake commented Jan 24, 2024

Description:
Cancelling a workflow, manually or because of concurrency group rules, while it is trying to set up a .NET installation may leave the runner in a broken state.

Task version:
v4

Platform:

  • Ubuntu
  • macOS
  • Windows

Runner type:

  • Hosted
  • Self-hosted

Repro steps:
Cancel a workflow while it's running the setup-dotnet action, possibly specifically while it's extracting an archive.

setup-dotnet log
Run actions/setup-dotnet@v4
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoLogo -Sta -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command & 'D:\actions-runner\_work\_actions\actions\setup-dotnet\v4\externals\install-dotnet.ps1' -SkipNonVersionedFiles -Runtime dotnet -Channel LTS
dotnet-install: .NET Core Runtime with version '8.0.1' is already installed.
dotnet-install: Adding to current process PATH: "C:\Program Files\dotnet\". Note: This change will not be visible if PowerShell was run as a child process.
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoLogo -Sta -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command & 'D:\actions-runner\_work\_actions\actions\setup-dotnet\v4\externals\install-dotnet.ps1' -SkipNonVersionedFiles -Channel 6.0
dotnet-install: Downloaded file https://dotnetcli.azureedge.net/dotnet/Sdk/6.0.418/dotnet-sdk-6.0.418-win-x64.zip size is 262714741 bytes.
dotnet-install: Either downloaded or local package size can not be measured. One of them may be corrupted.
dotnet-install: Extracting the archive.
Error: The operation was canceled.

Specifically, this left me with a half-extracted directory at C:\Program Files\dotnet\packs\Microsoft.NETCore.App.Ref\6.0.26. Deleting the 6.0.26 directory, then re-running the workflow was successful.

Expected behavior:
Either finish the archive extraction, or roll back unfinished changes.

Actual behavior:
Workflow stops immediately, leaving runner in broken state.
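
One way to get the "roll back unfinished changes" behaviour is to extract into a staging directory and rename it into place only on success. The following is a minimal Python sketch of that idea, not how setup-dotnet or dotnet-install actually works. `extract_atomically` and the `.partial-` prefix are hypothetical names, and the sketch assumes one archive maps to one target directory, whereas the real SDK archive is unpacked into a shared dotnet root.

```python
import os
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_atomically(archive: Path, final_dir: Path) -> None:
    """Extract `archive` so that `final_dir` only appears once extraction finished."""
    if final_dir.exists():
        return  # a previous run completed; nothing to do
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=final_dir.name + ".partial-", dir=final_dir.parent)
    )
    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(staging)
        os.replace(staging, final_dir)  # a single rename publishes the result
    except BaseException:
        # Covers ordinary errors and KeyboardInterrupt. A hard kill would still
        # leave the staging directory behind, but never a half-filled final_dir.
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

With this layout an "is the directory present?" check stays truthful, because the versioned directory does not exist until extraction has completed. Leftover `.partial-` directories from a hard kill would still need to be swept on a later run.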

@prplecake prplecake added bug Something isn't working needs triage labels Jan 24, 2024
@HarithaVattikuti
Contributor

Hello @prplecake
Thank you for creating this issue. We will investigate it and get back to you as soon as we have some feedback.

@shaanmugapriya

Hello @prplecake

Thank you for reaching out to us and providing the information about your issue. I'm unable to reproduce it. To help us understand it better, could you kindly share a link to a repo with minimal code (or a workflow file) that reproduces the issue?

Please feel free to reach out to us in case of any further queries!

@prplecake
Author

It shouldn't matter what the repo contents are. The workflow was cancelled in the middle of an archive extraction.

Then subsequent attempts to run the workflow would fail because setup-dotnet thinks .NET finished installing since the directory is present, I assume.

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -NoLogo -Sta -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command & 'D:\actions-runner\_work\_actions\actions\setup-dotnet\v4\externals\install-dotnet.ps1' -SkipNonVersionedFiles -Channel 6.0
dotnet-install: .NET Core SDK with version '6.0.418' is already installed.
dotnet-install: Adding to current process PATH: "C:\Program Files\dotnet\". Note: This change will not be visible if PowerShell was run as a child process.

However, since the directory is only half-extracted, dotnet did not finish installing, so trying to use it threw a bunch of CS0006 errors.

I really can't provide more information than that since that's all of it! Get a workflow to cancel in the middle of a dotnet extraction - that's the reproduction. It's possible the chances of this happening are low enough I was just extremely unlucky to have been affected. There needs to be a better "cleanup process" for cancelled workflows.
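
A possible shape for a stricter "is it really installed?" check is to compare the extracted tree against the archive's own entry list instead of trusting that a directory exists. The Python sketch below is illustrative only: `missing_entries` is a hypothetical helper, and it assumes the downloaded zip is still available at check time, which may not hold for dotnet-install.

```python
import zipfile
from pathlib import Path


def missing_entries(archive: Path, install_root: Path) -> list[str]:
    """Return archive file entries that are absent or wrong-sized under install_root."""
    missing = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # only files are compared
            on_disk = install_root / info.filename
            if not on_disk.is_file() or on_disk.stat().st_size != info.file_size:
                missing.append(info.filename)
    return missing
```

A non-empty result would indicate a partial extraction. Files that are shared between versions in the install root could legitimately differ in size, so a real check would probably need to restrict itself to versioned paths.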

@shaanmugapriya

Hello @prplecake,

The issue you encountered where cancelling a workflow during the setup-dotnet action left the runner in a broken state appears to be a transient one, specific to that particular run. The reason is that GitHub Actions are designed to ensure each job starts in a clean state. This is achieved by automatically cleaning up runners between jobs, which includes removing any changes made to the runner's environment during the job execution.

This means the issue you experienced should not persist across different runs, and a new run should start with a clean, operational runner. In this case, however, the cancellation appears to have happened at a critical point during the archive extraction, which led to the unexpected state. We could not reproduce the issue on our side: when we cancelled a run in the middle of the archive extraction, subsequent runs installed the respective .NET version afresh, regardless of the earlier cancellation. In this URL, we tried cancelling the setup-dotnet step in the middle of the archive extraction, and the .NET installation succeeded immediately on the next run.

Please feel free to reach out to us in case of any other concerns. Thank you!

@prplecake
Author

Got it. I appreciate the explanation. It's probably mostly a result of my self-hosted runner being a bit less ephemeral than normal runners.

@shaanmugapriya

shaanmugapriya commented Feb 16, 2024

Hello @prplecake,

Thank you for the confirmation! For now we are closing this issue, as it is not recurrent: subsequent runs successfully install the files even when the cancelled run left them only partially installed.

Please feel free to reach out to us in case of any further concerns.

@cliffchapmanrbx

Hello @shaanmugapriya,

Today we encountered this exact same error. A workflow was cancelled at exactly the right moment, and we ended up with an 8.0.204 installation directory that was missing some of the SDKs. Our build logs were full of error messages like

"C:\Users\Administrator\AppData\Local\Microsoft\dotnet\sdk\8.0.204\Sdks\Microsoft.SourceLink.Common\build\Microsoft.SourceLink.Common.props" was not found. Confirm that the expression in the Import declaration

When we inspected this runner manually we confirmed the \sdk\8.0.204\Sdks\ directory was incomplete compared to other runners. We then located a build that had been cancelled during the setup-dotnet step. This is the log from the setup-dotnet step including timestamps.

Tue, 09 Apr 2024 19:08:34 GMT Run actions/setup-dotnet@v2
Tue, 09 Apr 2024 19:08:34 GMT "C:\Program Files\PowerShell\7\pwsh.exe" -NoLogo -Sta -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -Command "& 'C:\w\_actions\actions\setup-dotnet\v2\externals\install-dotnet.ps1' -Version 8.0.204"
Tue, 09 Apr 2024 19:08:35 GMT dotnet-install: Note that the intended use of this script is for Continuous Integration (CI) scenarios, where:
Tue, 09 Apr 2024 19:08:35 GMT dotnet-install: - The SDK needs to be installed without user interaction and without admin rights.
Tue, 09 Apr 2024 19:08:35 GMT dotnet-install: - The SDK installation doesn't need to persist across multiple CI runs.
Tue, 09 Apr 2024 19:08:35 GMT dotnet-install: To set up a development environment or to run apps, use installers rather than this script. Visit https://dotnet.microsoft.com/download to get the installer.
Tue, 09 Apr 2024 19:08:35 GMT 
Tue, 09 Apr 2024 19:08:40 GMT dotnet-install: Extracting the archive.
Tue, 09 Apr 2024 19:08:52 GMT Error: The operation was canceled.

Note the 12 seconds between "Extracting the archive" and "The operation was canceled". We suspect this allowed most of the SDK extraction to complete, except for these additional SDK packages. That was enough for the build to proceed until it could not find the expected SDKs and failed.

All subsequent executions of setup-dotnet on that same runner logged

dotnet-install: .NET Core SDK with version '8.0.204' is already installed.

Had it attempted to repair the missing SDKs we would not have observed this issue.

We use GitHub Enterprise Server and, for a variety of performance reasons, we have long-lived GHA runners that only rotate out on failed health checks or a 24 hour time limit. Once we identified the problematic runner and the specific issue of the incomplete \sdk\8.0.204\Sdks\ directory, we rotated the runner manually.
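
The manual comparison against other runners could be scripted roughly as follows. This is a hedged Python sketch rather than anything we ran: `relative_files` and `missing_files` are hypothetical helpers, and it assumes a copy of a healthy runner's SDK directory is reachable at a local path.

```python
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Collect every file under root as a POSIX-style path relative to root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def missing_files(reference: Path, suspect: Path) -> list[str]:
    """List files present in the healthy reference tree but absent from the suspect one."""
    return sorted(relative_files(reference) - relative_files(suspect))
```

A health check could call `missing_files` on the two `sdk\8.0.204` trees and flag the runner for rotation whenever the result is non-empty.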

Possible repro steps

We have not tested this, though we believe it would simulate the problem we encountered.

  1. Allow setup-dotnet to complete on a static GHA runner.
  2. Manually delete the contents of AppData\Local\Microsoft\dotnet\sdk\8.0.204\Sdks\, such as Microsoft.SourceLink.Common.
  3. Re-run the workflow, observing that setup-dotnet takes no action and the build fails.

Suggested low-tech solution: installation lockfiles

If setup-dotnet attempts to install an SDK and does not run to successful completion, it should not mark the SDK as actually installed. Preferably it should not appear in dotnet --list-sdks, but at least setup-dotnet should detect the failed installation on re-run and start the installation again.

A straightforward fix would be a lockfile in the 8.0.204 directory, added immediately after directory creation and only removed as the very last action setup-dotnet takes. The setup-dotnet action could look for this lockfile on next execution and react appropriately.
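
To make the lockfile idea concrete, here is a minimal Python sketch of the control flow. It is an illustration under stated assumptions, not a proposal for the actual implementation: `ensure_installed`, the `install` callback, and the `.install-in-progress` marker name are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable

LOCK_NAME = ".install-in-progress"  # hypothetical marker file name


def ensure_installed(version_dir: Path, install: Callable[[Path], None]) -> str:
    """Install into version_dir unless a previous install completed cleanly."""
    lock = version_dir / LOCK_NAME
    if version_dir.exists():
        if not lock.exists():
            return "already-installed"
        # A leftover lock means an earlier run never reached the final step.
        shutil.rmtree(version_dir)
        status = "reinstalled"
    else:
        status = "installed"
    version_dir.mkdir(parents=True)
    lock.touch()          # added immediately after directory creation
    install(version_dir)  # may be interrupted at any point
    lock.unlink()         # removed only as the very last action
    return status
```

If `install` raises or the process is killed, the lock stays in place and the next call wipes the directory and starts over. Two caveats of this sketch: there is a small window between creating the directory and creating the lock, and a leftover lock cannot be distinguished from an installation that is still running in a concurrent job.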

@aparnajyothi-y
Contributor

Hello everyone, we are reopening this issue to implement a lockfile mechanism for handling incomplete .NET SDK installations caused by cancelled jobs.
We believe this could effectively prevent setup-dotnet from incorrectly recognizing incomplete SDK installations as fully installed, keeping Windows self-hosted runners in a healthier state. We'll try to schedule this feature request in the future.

@aparnajyothi-y aparnajyothi-y reopened this May 2, 2024
@aparnajyothi-y aparnajyothi-y added the feature request New feature or request to improve the current logic label May 2, 2024
@aparnajyothi-y aparnajyothi-y removed the bug Something isn't working label May 2, 2024

5 participants