Ethereum merge testnet Kintsugi split by bug, here’s why
Kintsugi, the testnet for testing the Ethereum 2.0 merge, has suffered a chain split, leaving the Proof-of-Stake blockchain running in several parallel versions of the “truth.” This is the first major incident for Kintsugi since its start in December.
The merge event on the Ethereum network is the transition to the Proof-of-Stake consensus model from the currently employed Proof-of-Work model. The merge means that the current Ethereum mainnet and the new Beacon chain, often referred to as Ethereum 2.0, will be combined into one blockchain.
To test the merge, the Kintsugi testnet was deployed in December. The purpose of the testnet is to run different edge cases and observe how the system behaves. One of the developers involved in running tests on Kintsugi is Marius van der Wijden, an Ethereum core developer working with the Geth (Go-Ethereum) client team.
“The testnet ran flawlessly for a couple of weeks. Last week I created a fuzzer which would send invalid blocks. A block contains a lot of information, like the transactions, the hash of the previous block, the gas limit, et cetera,” Marius van der Wijden says.
Some implementations did not execute and verify the block
A fuzzer is a common type of testing tool used among developers to generate random inputs to functions or other pieces of code, and try to make them break in some way or another. It’s about generating malformed and unexpected inputs and watching what happens to the system.
The fuzzer created by van der Wijden produces a valid block and then changes one element of it to make it invalid. One technique it uses is to overwrite one field with the value of another. In this case, the fuzzer replaced the block hash with the parent hash.
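The mutation step described above can be illustrated with a minimal Python sketch. It is not van der Wijden's fuzzer: the `Block` type, its fields, and the SHA-256 based `compute_hash` are simplified assumptions made for illustration, and a real Ethereum block header carries many more fields and uses a different hashing scheme.

```python
import hashlib
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    # Hypothetical, heavily simplified block; not the real header layout.
    parent_hash: str
    block_hash: str
    gas_limit: int
    payload: str


def compute_hash(parent_hash: str, gas_limit: int, payload: str) -> str:
    # Stand-in hash over the block contents (assumption: SHA-256 for brevity).
    data = f"{parent_hash}|{gas_limit}|{payload}".encode()
    return hashlib.sha256(data).hexdigest()


def make_valid_block(parent_hash: str, gas_limit: int, payload: str) -> Block:
    # A block is "valid" here when its hash matches its contents.
    block_hash = compute_hash(parent_hash, gas_limit, payload)
    return Block(parent_hash, block_hash, gas_limit, payload)


def is_self_consistent(block: Block) -> bool:
    expected = compute_hash(block.parent_hash, block.gas_limit, block.payload)
    return block.block_hash == expected


HASH_FIELDS = ("parent_hash", "block_hash")


def mutate(block: Block, rng: random.Random) -> Block:
    # Overwrite one hash field with the value of the other, leaving
    # everything else untouched, so the block becomes invalid.
    target, source = rng.sample(HASH_FIELDS, 2)
    return replace(block, **{target: getattr(block, source)})


genesis_hash = "00" * 32
valid = make_valid_block(genesis_hash, 30_000_000, "tx1,tx2")
fuzzed = mutate(valid, random.Random(0))
print(is_self_consistent(valid), is_self_consistent(fuzzed))
```

Whichever direction the random choice takes, the fuzzed block ends up with identical block hash and parent hash, which no longer matches the hash of its own contents.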
“Nodes should reject such a changed block. However, since the parent hash pointed to a valid block itself, some implementations did not actually execute and verify the block but looked it up in a cache instead. Since the previous block was valid and in the cache, they assumed the new block to be valid as well,” van der Wijden explains.
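A rough sketch of how such a cache shortcut can go wrong is given below. It is not code from Nethermind, Besu, or Geth: the `Node` class, the `check_hash_first` flag, and the simplified `Block` and `compute_hash` are all hypothetical, and the real clients do far more work when they execute and verify a block.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Hypothetical, heavily simplified block; not the real header layout.
    parent_hash: str
    block_hash: str
    payload: str


def compute_hash(parent_hash: str, payload: str) -> str:
    # Stand-in hash over the block contents (assumption: SHA-256 for brevity).
    return hashlib.sha256(f"{parent_hash}|{payload}".encode()).hexdigest()


class Node:
    def __init__(self, genesis_hash: str, check_hash_first: bool):
        # Cache of hashes belonging to blocks already judged valid.
        self.valid_hashes = {genesis_hash}
        self.check_hash_first = check_hash_first

    def accept(self, block: Block) -> bool:
        # A careful node confirms the claimed hash matches the contents
        # before it trusts anything stored under that hash.
        claimed_ok = block.block_hash == compute_hash(block.parent_hash, block.payload)
        if self.check_hash_first and not claimed_ok:
            return False
        # Shortcut: a cache hit is taken as proof of validity.
        if block.block_hash in self.valid_hashes:
            return True
        # Stand-in for "execute and verify the block".
        ok = claimed_ok and block.parent_hash in self.valid_hashes
        if ok:
            self.valid_hashes.add(block.block_hash)
        return ok


genesis_hash = compute_hash("", "genesis")
block1 = Block(genesis_hash, compute_hash(genesis_hash, "tx1"), "tx1")
# Fuzzed block: its block hash was replaced with its parent hash.
fuzzed = Block(block1.block_hash, block1.block_hash, "tx2")

lenient = Node(genesis_hash, check_hash_first=False)
strict = Node(genesis_hash, check_hash_first=True)
for node in (lenient, strict):
    node.accept(block1)

print(lenient.accept(fuzzed), strict.accept(fuzzed))
```

In this toy model both nodes accept the honest block, but only the lenient node accepts the fuzzed one, because the hash it claims is already in the cache. That disagreement is the kind of divergence that splits a chain.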
Network split twice
The result was that half the network, the Geth clients, rejected the block, while the other half, the Nethermind and Besu clients, accepted it. The chain split, since there were now two different views of the correct state. To make things worse, a second issue surfaced on top of the first.
According to van der Wijden, the nodes on the Geth chain, which consist of Lighthouse-Geth, Prysm-Geth, Lodestar-Geth, Nimbus-Geth and Teku-Geth pairings, also split among themselves.
“This split is still being investigated, but it looks like Teku might also have some caching mechanism that failed,” van der Wijden says.
Since several different forks of the Kintsugi testnet exist at the time of writing, and every node thinks that it is on the correct fork, the network is no longer finalizing.
“We’ll figure something out to get the network back together. We have updated the Nethermind client already and those nodes are on the correct chain now. We do still need the fix to Teku, since more than 33 percent of nodes are Teku, otherwise the chain won’t finalize,” van der Wijden says.
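The arithmetic behind the 33 percent remark can be sketched in a few lines of Python. Proof-of-Stake finality on the Beacon chain requires a two-thirds supermajority, so once more than a third of the weight sits on another fork, no fork can reach it. The fork names and shares below are hypothetical numbers chosen for illustration, and the sketch assumes each share stands for a fraction of total validator weight rather than a simple node count.

```python
from fractions import Fraction
from typing import Mapping

# Finality needs at least two thirds of the total validator weight.
FINALITY_THRESHOLD = Fraction(2, 3)


def can_finalize(fork_shares: Mapping[str, Fraction]) -> bool:
    # The network finalizes only if a single fork holds a supermajority.
    return any(share >= FINALITY_THRESHOLD for share in fork_shares.values())


# Hypothetical split: just over a third of the weight is on a diverging fork.
during_split = {"main_fork": Fraction(65, 100), "diverging_fork": Fraction(35, 100)}
# Hypothetical state once the diverging clients are fixed and rejoin.
after_fix = {"main_fork": Fraction(1)}

print(can_finalize(during_split), can_finalize(after_fix))
```

Under these assumed shares the split network cannot finalize, since 65 percent falls short of two thirds, while the reunited network can.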
Incident brings some good
According to van der Wijden, this incident does not prevent or delay further testing of the Ethereum merge, nor does it delay the merge itself. In fact, van der Wijden says the incident actually helps to test edge cases that would have been difficult to test if the network were running properly.
“Long periods of non-finalization are challenging for the nodes and it’s very important for us to see how they behave right now. We think that the testnet will eventually get back together again, but I don’t think that we will try to manually fix it, as it gives us the opportunity to test interesting edge cases.”
“I don’t think that this will delay the merge, since the merge is not scheduled yet. But it shows how important testing is. I think the merge is progressing really well. We need a couple more weeks to get the software in an acceptable state and then we need a couple of months for testing it,” van der Wijden says.
What if this happens on mainnet?
An interesting question is what would have happened if a bug like this had occurred on mainnet.
“We’ve started testing pretty early, so we expected a couple of bugs like this. Such a bug on mainnet would be pretty nasty though, since we would need to find and fix the bug, which we’re pretty good at, release the code and then let all stakers know that they should update their nodes. The last part is the hard part in my opinion, since some users are not following the development too closely,” van der Wijden says.
For more details, the interested reader is encouraged to read Marius van der Wijden’s tweets on the incident.