Skip to content

Fixes for migration reliability#1691

Draft
kevin-montrose wants to merge 2 commits intomainfrom
users/kmontrose/vectorSetMigrationReliablity
Draft

Fixes for migration reliability#1691
kevin-montrose wants to merge 2 commits intomainfrom
users/kmontrose/vectorSetMigrationReliablity

Conversation

@kevin-montrose
Copy link
Copy Markdown
Contributor

@kevin-montrose kevin-montrose commented Apr 10, 2026

Four separate fixes here, aimed at improving Vector Set migration test reliability.

1. Async all migration tasks

Rarely (but not so rarely it doesn't happen in CIs) migrations can deadlock on this .Result.

I suspect, but cannot prove, this is because we block an IOCP thread that would serve the read side of the request (where MIGRATE ... would be executed) on the destination node. If I'm right, this means we only suffer this in tests where we host all the Garnet nodes in the same process.

The fix is to punch async/await all the way down so we don't have to .Result there.

2. Handle TransmitSlot failures

There are a number of places where we ignored the result of a Task<bool> or Task<bool>[] in migrations. This is unlikely to cause test flakiness directly (they'd still fail if the task failed) but made debugging very painful.

3. Pulled in a separate fix to SessionParseState

I never this directly cause an issue, but it's a issue reported separately in the same general area of migration/replication.

4. Removed pointers from LightEpoch.InstanceIndexBuffer

If you'd asked me last week if the code in InstanceIndexBuffer.GetRef is safe I'd have said yes. I would expect [ThreadStatic] struct to be implicitly pinned. But, rarely, I have seen some evidence that it was relocated. Nothing definitive but out of an abundance of caution I've redone it in equivalent but safe-ish code.

Safe-ish because I'm still doing a Unsafe.Add, but we never have a pointer in a local so any movement between ref field0 and Unsafe.Add will be correctly updated by the GC.

TODOs

  • Stress harder
  • Remove debugger/logging/etc.
  • Audit async/await changes for places to remove allocations

Other thoughts

Because it's Friday, writing this down for Monday:

  • If the IOCP concern is correct, the spins in CanOperateOnKey and WaitForSlotToStabalize are also risky
  • If the thread identity is important, we have to signal back to RespServerSession.ProcessMessages we need more "data" (actually we need other sessions to make progress)
  • If the thread identity is unimportant, we need to yield for real time - perhaps a proper sleep
    • I suspect thread identity is important though

/// (2) in Metadata, to track per-thread epoch table entries for each LightEpoch instance.
/// </summary>
[StructLayout(LayoutKind.Explicit, Size = MaxInstances * sizeof(int))]
[InlineArray(MaxInstances)]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is not proven to be necessary, we can retain the prior version. I initially did InlineArray but noted some inefficiencies which led to the current struct design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants