Geth can't recover after a crash #30867

Open
nuliknol opened this issue Dec 6, 2024 · 3 comments

nuliknol commented Dec 6, 2024

One of the machines I run geth on had a kernel panic, so I had to reboot. After this, geth failed to launch with the following messages:

INFO [12-06|14:47:40.430] 
INFO [12-06|14:47:40.430] Post-Merge hard forks (timestamp based):
INFO [12-06|14:47:40.430]  - Shanghai:                    @1681338455 (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/shanghai.md)
INFO [12-06|14:47:40.430]  - Cancun:                      @1710338135 (https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/cancun.md)
INFO [12-06|14:47:40.430] 
INFO [12-06|14:47:40.430] ---------------------------------------------------------------------------------------------------------------------------------------------------------
INFO [12-06|14:47:40.430] 
INFO [12-06|14:47:40.442] Loaded most recent local block           number=20,845,315 hash=0d4416..63386a td=58,750,003,716,598,352,816,469 age=2mo1w2d
INFO [12-06|14:47:40.446] Loaded most recent local finalized block number=20,845,232 hash=cc902a..c47210 td=58,750,003,716,598,352,816,469 age=2mo1w2d
INFO [12-06|14:47:40.446] Loaded last snap-sync pivot marker       number=17,672,411
WARN [12-06|14:47:40.482] Head state missing, repairing            number=20,845,315 hash=0d4416..63386a snaproot=d78370..0157e3
INFO [12-06|14:47:48.484] Block state missing, rewinding further   number=20,845,109 hash=04abd1..ca24d8 elapsed=8.001s
INFO [12-06|14:47:56.519] Block state missing, rewinding further   number=20,844,802 hash=9439ff..0100f0 elapsed=16.036s
INFO [12-06|14:48:04.538] Block state missing, rewinding further   number=20,844,488 hash=5cd690..6cb55e elapsed=24.055s
INFO [12-06|14:48:12.542] Block state missing, rewinding further   number=20,844,215 hash=eeb732..456849 elapsed=32.059s
INFO [12-06|14:48:20.551] Block state missing, rewinding further   number=20,843,869 hash=1a351e..0fd79f elapsed=40.069s
INFO [12-06|14:48:28.552] Block state missing, rewinding further   number=20,843,578 hash=7a8504..28343c elapsed=48.069s
INFO [12-06|14:48:36.563] Block state missing, rewinding further   number=20,843,153 hash=52c938..4bd7f4 elapsed=56.080s
INFO [12-06|14:48:44.571] Block state missing, rewinding further   number=20,842,859 hash=ee2784..d11cfd elapsed=1m4.088s
INFO [12-06|14:48:52.588] Block state missing, rewinding further   number=20,842,456 hash=8f66cc..041ea5 elapsed=1m12.105s
INFO [12-06|14:49:00.594] Block state missing, rewinding further   number=20,841,817 hash=098351..5cb6fd elapsed=1m20.111s
INFO [12-06|14:49:03.827] Rewound to block with state              number=20,841,568 hash=80be06..0ed10b
ERROR[12-06|14:49:38.302] Error in block freeze operation          err="canonical hash missing, can't freeze block 20845001"
ERROR[12-06|14:50:38.329] Error in block freeze operation          err="canonical hash missing, can't freeze block 20844539"
ERROR[12-06|14:51:38.582] Error in block freeze operation          err="canonical hash missing, can't freeze block 20844190"
ERROR[12-06|14:52:38.339] Error in block freeze operation          err="canonical hash missing, can't freeze block 20843785"
ERROR[12-06|14:53:39.458] Error in block freeze operation          err="canonical hash missing, can't freeze block 20843422"
ERROR[12-06|14:54:38.581] Error in block freeze operation          err="canonical hash missing, can't freeze block 20843029"

Version: geth 1.14.8

I think you need to run tests in which you hard-reset the machine or disconnect the disk drive to simulate an unexpected shutdown.


holiman commented Dec 6, 2024

There are two distinct variants of "unexpected shutdown".

  1. The classic crash, due to some application error, the OOM-reaper killing the process, or a docker shutdown that takes too long and results in the process being killed. In this scenario, anything written from the application to the operating system survives. The OS/filesystem may not yet have flushed it to disk, but eventually it will.
  2. The harder crash, where the OS itself crashes, or the disk is yanked (physically or virtually). In this case, whatever the OS/filesystem had in its caches is gone.

In order to protect against 1), we just need to ensure that the writes happen in proper order, primarily so that the relation between freezer vs leveldb is consistent.
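As a rough illustration of the ordering point, here is a minimal sketch, not geth's actual code. It assumes a toy in-memory `store` type standing in for both the freezer and the key-value database, and the hypothetical helpers `migrate` and `lookup`. The copy into the freezer happens before the delete from the key-value store, so a process crash between the two steps leaves the block in both places rather than in neither.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a toy in-memory stand-in for either the freezer or the
// key-value database (hypothetical, for illustration only).
type store map[uint64]string

// migrate moves block n from the key-value store to the freezer.
// The copy happens before the delete, so a process crash between the
// two steps leaves the block in both stores rather than in neither.
// failAfterCopy simulates such a crash.
func migrate(kv, freezer store, n uint64, failAfterCopy bool) error {
	data, ok := kv[n]
	if !ok {
		return fmt.Errorf("block %d missing from key-value store", n)
	}
	freezer[n] = data // step 1: append to the freezer
	if failAfterCopy {
		return errors.New("simulated crash after copy")
	}
	delete(kv, n) // step 2: only now drop the original
	return nil
}

// lookup reads block n from whichever store still has it.
func lookup(kv, freezer store, n uint64) (string, bool) {
	if d, ok := freezer[n]; ok {
		return d, true
	}
	d, ok := kv[n]
	return d, ok
}

func main() {
	kv := store{1: "block-1"}
	freezer := store{}

	// Crash between copy and delete: the block is duplicated, not lost.
	err := migrate(kv, freezer, 1, true)
	d, ok := lookup(kv, freezer, 1)
	fmt.Println(err, d, ok)

	// Re-running the migration after "restart" completes it.
	fmt.Println(migrate(kv, freezer, 1, false), len(kv), len(freezer))
}
```

Reversing the two steps would open a window in which the block exists in neither store, which is the kind of inconsistency the ordering is meant to rule out.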

In order to protect against 2), we need to ensure that fsync is performed at certain times. However, fsync is a very expensive operation. If we do it too often, it has a sizeable impact on performance.
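The cost side of that tradeoff can be sketched with Go's standard `os.File.Sync`, which asks the OS to flush a file's data to stable storage. This is an illustrative sketch only: `appendRecords`, `demo`, and the `syncEvery` parameter are hypothetical names, not part of geth. With a larger `syncEvery`, fewer fsync calls are issued, but more recently written records sit only in OS caches and would be lost in a crash of the second kind.

```go
package main

import (
	"fmt"
	"os"
)

// appendRecords writes each record to f and calls Sync after every
// syncEvery records, plus once at the end if records are still pending.
// It returns the number of Sync calls made.
func appendRecords(f *os.File, records [][]byte, syncEvery int) (int, error) {
	if syncEvery < 1 {
		return 0, fmt.Errorf("syncEvery must be at least 1, got %d", syncEvery)
	}
	syncs, pending := 0, 0
	for _, r := range records {
		// Write hands the data to the OS; it survives a process crash
		// but may still be only in the page cache.
		if _, err := f.Write(r); err != nil {
			return syncs, err
		}
		pending++
		if pending == syncEvery {
			// Sync makes the pending records durable against an OS crash.
			if err := f.Sync(); err != nil {
				return syncs, err
			}
			syncs++
			pending = 0
		}
	}
	if pending > 0 {
		if err := f.Sync(); err != nil {
			return syncs, err
		}
		syncs++
	}
	return syncs, nil
}

// demo writes numRecords small records to a temporary file and reports
// how many Sync calls the chosen syncEvery setting required.
func demo(numRecords, syncEvery int) (int, error) {
	f, err := os.CreateTemp("", "fsync-demo-*.log")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	records := make([][]byte, numRecords)
	for i := range records {
		records[i] = []byte(fmt.Sprintf("record %d\n", i))
	}
	return appendRecords(f, records, syncEvery)
}

func main() {
	for _, every := range []int{1, 5} {
		n, err := demo(10, every)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("syncEvery=%d -> %d fsync calls\n", every, n)
	}
}
```

Syncing after every record bounds the loss to at most one unsynced record at the price of one fsync per write; batching trades a larger window of possible loss for fewer expensive flushes.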

Your version of geth is a few releases old; it's from August. If I recall correctly, @rjl493456442 did some work related to recovery of freezer data somewhat recently.

@MariusVanDerWijden

@nuliknol could you try updating your geth version and try to recover again?
Or did you already resync your node?

@walkerlala

@holiman Hi, are there any ways or parameters to control the tradeoff between performance and safety? Tuned to one side, the geth node would have the best performance but might lose data on machine failure; tuned to the other side, it would be safe against machine failure but less performant.
