Preload scope classes to prevent virtual thread deadlock#11111
Preload scope classes to prevent virtual thread deadlock#11111gh-worker-dd-mergequeue-cf854d[bot] merged 3 commits intomasterfrom
Conversation
3cba921 to
7b8e62f
Compare
BenchmarksStartupParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 60 metrics, 11 unstable metrics. Startup time reports for insecure-bankgantt
title insecure-bank - global startup overhead: candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.053 s) : 0, 1053494
Total [baseline] (8.808 s) : 0, 8807676
Agent [candidate] (1.056 s) : 0, 1055528
Total [candidate] (8.824 s) : 0, 8823761
section iast
Agent [baseline] (1.237 s) : 0, 1237427
Total [baseline] (9.657 s) : 0, 9656913
Agent [candidate] (1.227 s) : 0, 1227254
Total [candidate] (9.565 s) : 0, 9565392
gantt
title insecure-bank - break down per module: candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.225 ms) : 0, 1225
crashtracking [candidate] (1.264 ms) : 0, 1264
BytebuddyAgent [baseline] (632.322 ms) : 0, 632322
BytebuddyAgent [candidate] (632.183 ms) : 0, 632183
AgentMeter [baseline] (29.383 ms) : 0, 29383
AgentMeter [candidate] (29.3 ms) : 0, 29300
GlobalTracer [baseline] (248.063 ms) : 0, 248063
GlobalTracer [candidate] (248.752 ms) : 0, 248752
AppSec [baseline] (32.306 ms) : 0, 32306
AppSec [candidate] (32.26 ms) : 0, 32260
Debugger [baseline] (58.783 ms) : 0, 58783
Debugger [candidate] (58.903 ms) : 0, 58903
Remote Config [baseline] (593.817 µs) : 0, 594
Remote Config [candidate] (587.179 µs) : 0, 587
Telemetry [baseline] (7.973 ms) : 0, 7973
Telemetry [candidate] (8.019 ms) : 0, 8019
Flare Poller [baseline] (6.616 ms) : 0, 6616
Flare Poller [candidate] (8.096 ms) : 0, 8096
section iast
crashtracking [baseline] (1.254 ms) : 0, 1254
crashtracking [candidate] (1.227 ms) : 0, 1227
BytebuddyAgent [baseline] (810.084 ms) : 0, 810084
BytebuddyAgent [candidate] (804.466 ms) : 0, 804466
AgentMeter [baseline] (11.596 ms) : 0, 11596
AgentMeter [candidate] (11.483 ms) : 0, 11483
GlobalTracer [baseline] (242.02 ms) : 0, 242020
GlobalTracer [candidate] (238.876 ms) : 0, 238876
IAST [baseline] (26.103 ms) : 0, 26103
IAST [candidate] (25.789 ms) : 0, 25789
AppSec [baseline] (34.255 ms) : 0, 34255
AppSec [candidate] (29.1 ms) : 0, 29100
Debugger [baseline] (57.909 ms) : 0, 57909
Debugger [candidate] (62.916 ms) : 0, 62916
Remote Config [baseline] (540.218 µs) : 0, 540
Remote Config [candidate] (1.143 ms) : 0, 1143
Telemetry [baseline] (13.482 ms) : 0, 13482
Telemetry [candidate] (12.281 ms) : 0, 12281
Flare Poller [baseline] (3.533 ms) : 0, 3533
Flare Poller [candidate] (3.395 ms) : 0, 3395
Startup time reports for petclinicgantt
title petclinic - global startup overhead: candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.062 s) : 0, 1062051
Total [baseline] (11.141 s) : 0, 11140948
Agent [candidate] (1.062 s) : 0, 1062211
Total [candidate] (11.071 s) : 0, 11071010
section appsec
Agent [baseline] (1.252 s) : 0, 1252484
Total [baseline] (11.109 s) : 0, 11108914
Agent [candidate] (1.249 s) : 0, 1248687
Total [candidate] (11.127 s) : 0, 11127294
section iast
Agent [baseline] (1.233 s) : 0, 1232882
Total [baseline] (11.345 s) : 0, 11344833
Agent [candidate] (1.234 s) : 0, 1233914
Total [candidate] (11.236 s) : 0, 11236210
section profiling
Agent [baseline] (1.193 s) : 0, 1192784
Total [baseline] (11.048 s) : 0, 11047904
Agent [candidate] (1.186 s) : 0, 1185979
Total [candidate] (11.18 s) : 0, 11179991
gantt
title petclinic - break down per module: candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.244 ms) : 0, 1244
crashtracking [candidate] (1.211 ms) : 0, 1211
BytebuddyAgent [baseline] (636.235 ms) : 0, 636235
BytebuddyAgent [candidate] (634.432 ms) : 0, 634432
AgentMeter [baseline] (29.583 ms) : 0, 29583
AgentMeter [candidate] (29.402 ms) : 0, 29402
GlobalTracer [baseline] (249.807 ms) : 0, 249807
GlobalTracer [candidate] (249.194 ms) : 0, 249194
AppSec [baseline] (32.42 ms) : 0, 32420
AppSec [candidate] (32.412 ms) : 0, 32412
Debugger [baseline] (60.192 ms) : 0, 60192
Debugger [candidate] (60.075 ms) : 0, 60075
Remote Config [baseline] (593.424 µs) : 0, 593
Remote Config [candidate] (597.691 µs) : 0, 598
Telemetry [baseline] (8.135 ms) : 0, 8135
Telemetry [candidate] (8.092 ms) : 0, 8092
Flare Poller [baseline] (7.507 ms) : 0, 7507
Flare Poller [candidate] (10.687 ms) : 0, 10687
section appsec
crashtracking [baseline] (1.245 ms) : 0, 1245
crashtracking [candidate] (1.218 ms) : 0, 1218
BytebuddyAgent [baseline] (664.253 ms) : 0, 664253
BytebuddyAgent [candidate] (661.887 ms) : 0, 661887
AgentMeter [baseline] (12.124 ms) : 0, 12124
AgentMeter [candidate] (12.052 ms) : 0, 12052
GlobalTracer [baseline] (249.85 ms) : 0, 249850
GlobalTracer [candidate] (249.177 ms) : 0, 249177
IAST [baseline] (24.603 ms) : 0, 24603
IAST [candidate] (24.46 ms) : 0, 24460
AppSec [baseline] (185.515 ms) : 0, 185515
AppSec [candidate] (184.97 ms) : 0, 184970
Debugger [baseline] (66.001 ms) : 0, 66001
Debugger [candidate] (65.959 ms) : 0, 65959
Remote Config [baseline] (607.891 µs) : 0, 608
Remote Config [candidate] (602.209 µs) : 0, 602
Telemetry [baseline] (8.377 ms) : 0, 8377
Telemetry [candidate] (8.371 ms) : 0, 8371
Flare Poller [baseline] (3.55 ms) : 0, 3550
Flare Poller [candidate] (3.528 ms) : 0, 3528
section iast
crashtracking [baseline] (1.235 ms) : 0, 1235
crashtracking [candidate] (1.23 ms) : 0, 1230
BytebuddyAgent [baseline] (808.571 ms) : 0, 808571
BytebuddyAgent [candidate] (808.582 ms) : 0, 808582
AgentMeter [baseline] (11.496 ms) : 0, 11496
AgentMeter [candidate] (11.549 ms) : 0, 11549
GlobalTracer [baseline] (240.429 ms) : 0, 240429
GlobalTracer [candidate] (240.672 ms) : 0, 240672
IAST [baseline] (25.994 ms) : 0, 25994
IAST [candidate] (25.953 ms) : 0, 25953
AppSec [baseline] (32.642 ms) : 0, 32642
AppSec [candidate] (33.076 ms) : 0, 33076
Debugger [baseline] (59.355 ms) : 0, 59355
Debugger [candidate] (59.23 ms) : 0, 59230
Remote Config [baseline] (542.431 µs) : 0, 542
Remote Config [candidate] (540.846 µs) : 0, 541
Telemetry [baseline] (12.648 ms) : 0, 12648
Telemetry [candidate] (13.32 ms) : 0, 13320
Flare Poller [baseline] (3.596 ms) : 0, 3596
Flare Poller [candidate] (3.418 ms) : 0, 3418
section profiling
crashtracking [baseline] (1.204 ms) : 0, 1204
crashtracking [candidate] (1.173 ms) : 0, 1173
BytebuddyAgent [baseline] (696.321 ms) : 0, 696321
BytebuddyAgent [candidate] (691.984 ms) : 0, 691984
AgentMeter [baseline] (9.138 ms) : 0, 9138
AgentMeter [candidate] (9.115 ms) : 0, 9115
GlobalTracer [baseline] (208.789 ms) : 0, 208789
GlobalTracer [candidate] (207.015 ms) : 0, 207015
AppSec [baseline] (33.005 ms) : 0, 33005
AppSec [candidate] (32.993 ms) : 0, 32993
Debugger [baseline] (66.175 ms) : 0, 66175
Debugger [candidate] (65.932 ms) : 0, 65932
Remote Config [baseline] (575.249 µs) : 0, 575
Remote Config [candidate] (570.458 µs) : 0, 570
Telemetry [baseline] (7.816 ms) : 0, 7816
Telemetry [candidate] (7.823 ms) : 0, 7823
Flare Poller [baseline] (3.585 ms) : 0, 3585
Flare Poller [candidate] (3.609 ms) : 0, 3609
ProfilingAgent [baseline] (94.086 ms) : 0, 94086
ProfilingAgent [candidate] (94.541 ms) : 0, 94541
Profiling [baseline] (94.669 ms) : 0, 94669
Profiling [candidate] (95.12 ms) : 0, 95120
LoadParameters
See matching parameters
SummaryFound 1 performance improvements and 1 performance regressions! Performance is the same for 19 metrics, 15 unstable metrics.
Request duration reports for petclinicgantt
title petclinic - request duration [CI 0.99] : candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section baseline
no_agent (17.943 ms) : 17763, 18123
. : milestone, 17943,
appsec (18.893 ms) : 18704, 19083
. : milestone, 18893,
code_origins (17.931 ms) : 17751, 18111
. : milestone, 17931,
iast (17.852 ms) : 17677, 18027
. : milestone, 17852,
profiling (18.263 ms) : 18080, 18446
. : milestone, 18263,
tracing (17.834 ms) : 17662, 18007
. : milestone, 17834,
section candidate
no_agent (19.162 ms) : 18968, 19356
. : milestone, 19162,
appsec (18.698 ms) : 18513, 18884
. : milestone, 18698,
code_origins (17.75 ms) : 17576, 17923
. : milestone, 17750,
iast (18.093 ms) : 17913, 18274
. : milestone, 18093,
profiling (18.303 ms) : 18122, 18484
. : milestone, 18303,
tracing (17.958 ms) : 17780, 18136
. : milestone, 17958,
Request duration reports for insecure-bankgantt
title insecure-bank - request duration [CI 0.99] : candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section baseline
no_agent (1.247 ms) : 1235, 1259
. : milestone, 1247,
iast (3.385 ms) : 3335, 3436
. : milestone, 3385,
iast_FULL (6.182 ms) : 6119, 6245
. : milestone, 6182,
iast_GLOBAL (3.757 ms) : 3695, 3819
. : milestone, 3757,
profiling (2.186 ms) : 2165, 2206
. : milestone, 2186,
tracing (1.829 ms) : 1814, 1845
. : milestone, 1829,
section candidate
no_agent (1.254 ms) : 1241, 1267
. : milestone, 1254,
iast (3.25 ms) : 3203, 3297
. : milestone, 3250,
iast_FULL (6.122 ms) : 6059, 6185
. : milestone, 6122,
iast_GLOBAL (3.577 ms) : 3518, 3636
. : milestone, 3577,
profiling (2.396 ms) : 2373, 2420
. : milestone, 2396,
tracing (1.909 ms) : 1892, 1926
. : milestone, 1909,
DacapoParameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics. Execution time for biojavagantt
title biojava - execution time [CI 0.99] : candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section baseline
no_agent (15.322 s) : 15322000, 15322000
. : milestone, 15322000,
appsec (15.049 s) : 15049000, 15049000
. : milestone, 15049000,
iast (18.393 s) : 18393000, 18393000
. : milestone, 18393000,
iast_GLOBAL (17.902 s) : 17902000, 17902000
. : milestone, 17902000,
profiling (14.72 s) : 14720000, 14720000
. : milestone, 14720000,
tracing (14.774 s) : 14774000, 14774000
. : milestone, 14774000,
section candidate
no_agent (15.009 s) : 15009000, 15009000
. : milestone, 15009000,
appsec (14.613 s) : 14613000, 14613000
. : milestone, 14613000,
iast (18.171 s) : 18171000, 18171000
. : milestone, 18171000,
iast_GLOBAL (17.599 s) : 17599000, 17599000
. : milestone, 17599000,
profiling (14.793 s) : 14793000, 14793000
. : milestone, 14793000,
tracing (15.115 s) : 15115000, 15115000
. : milestone, 15115000,
Execution time for tomcatgantt
title tomcat - execution time [CI 0.99] : candidate=1.62.0-SNAPSHOT~d223f7924b, baseline=1.62.0-SNAPSHOT~1ee99fb92c
dateFormat X
axisFormat %s
section baseline
no_agent (1.496 ms) : 1485, 1508
. : milestone, 1496,
appsec (2.562 ms) : 2507, 2617
. : milestone, 2562,
iast (2.287 ms) : 2218, 2357
. : milestone, 2287,
iast_GLOBAL (2.343 ms) : 2273, 2414
. : milestone, 2343,
profiling (2.121 ms) : 2066, 2175
. : milestone, 2121,
tracing (2.092 ms) : 2038, 2145
. : milestone, 2092,
section candidate
no_agent (1.497 ms) : 1485, 1508
. : milestone, 1497,
appsec (2.561 ms) : 2506, 2616
. : milestone, 2561,
iast (2.303 ms) : 2233, 2373
. : milestone, 2303,
iast_GLOBAL (2.335 ms) : 2265, 2404
. : milestone, 2335,
profiling (2.116 ms) : 2061, 2172
. : milestone, 2116,
tracing (2.097 ms) : 2043, 2150
. : milestone, 2097,
|
|
|
||
| // Preload classes used by swap() to avoid class loading on the virtual thread mount path. | ||
| // DatadogClassLoader loads these from a JarFile using synchronized I/O, which pins | ||
| // virtual thread carrier threads and can deadlock the application. | ||
| ScopeContext.class.getName(); | ||
| ScopeStack.class.getName(); |
There was a problem hiding this comment.
Would it make sense to also have a test reproducer for such deadlock as part of PR?
There was a problem hiding this comment.
I have a fork test that checks the loaded classes before and after context swap but that’s very brittle.
I would like to figure out if this change fully fixes the issue first before trying to build testing for.
But yes, we need it to avoid the regression from #10273 (which does not have test 😞 )
There was a problem hiding this comment.
I'm wondering if we should have a more general preload mechanism. Although, even if we do, I think we should leave that for another PR.
There was a problem hiding this comment.
Also, ScopeStack and ScopeContext (a wrapper for ScopeStack) should be removed at some point when we will move away from scope stack to context pointers...
So I don’t think this fix will be future proof and testing is definitely needed as it might break later with @mcculls changes on the scope manager.
There was a problem hiding this comment.
EDIT: currently this is specific to IAST instrumentations, but we can make this a generic feature...
You can list classes to be preloaded in the instrumentation class:
@Override
public String[] getClassNamesToBePreloaded() {
return SOME_CLASSES_TO_BE_PRELOADED;
}
at the moment you'd need to maintain this list manually, but we could add a build plugin that scanned all the classes used by the instrumentation (and their dependent classes) and construct the list at build time - much like we're thinking of building the list of helper classes.
There was a problem hiding this comment.
I was coming to comment about the IAST support only.
If you’re telling me that moving it to a more generic feature (for InstrumenterModule), I can do the refactoring and base the patch on the feature.
There was a problem hiding this comment.
Yes, it should just be a matter of moving that feature up to InstrumenterModule from InstrumenterModule.Iast
(This could be done as a follow-up PR if you want to prioritise merging this fix)
There was a problem hiding this comment.
Alright, on it!
Done. Rewrote the history as the original fix implementation was scrapped.
There was a problem hiding this comment.
Opened for review. About testing, I'm still waiting for a reproducer so I could integrate it as part of our testing suite.
353b58e to
5115362
Compare
AlexeyKuznetsov-DD
left a comment
There was a problem hiding this comment.
LGTM, left very minor comments.
Hope it will be possible to cover with tests.
| } | ||
|
|
||
| public List<Instrumenter> typeInstrumentations() { | ||
| preloadClassNames(); |
There was a problem hiding this comment.
nit: I think method name should be preloadClasses()? As preloadClassNames() sounds like - preload some list of class names to me.
There was a problem hiding this comment.
makes sense to improve the name while refactoring - preloadClasses is a good choice
| } | ||
|
|
||
| /** Get classes to force load */ | ||
| protected String[] getClassNamesToBePreloaded() { |
There was a problem hiding this comment.
nit: same here, I think classesToPreload() or similar would be a better name and a bit more consistent with already existing names in this class (like adviceShading() for example).
There was a problem hiding this comment.
I will go with preloadClassNames() to keep it consistent with the existing helperClassNames() and muzzleIgnoredClassNames()` that returns the same type (class names String array).
| // Preload classes used by Context.swap() to avoid class loading on the virtual thread mount path. | ||
| // DatadogClassLoader loads these from a JarFile using synchronized I/O, which pins | ||
| // virtual thread carrier threads and can deadlock the application. | ||
| private static final String[] PRELOAD_CLASSES = { |
There was a problem hiding this comment.
nit: CLASSES_TO_PRELOAD ?
There was a problem hiding this comment.
The original name was FORCE_LOADING from
I will redo a pass for better naming / consistency then.
mcculls
left a comment
There was a problem hiding this comment.
Similar comments to Alexey around improving on the old names.
Wrt. testing - I'm ok with this being merged based on dogfooding feedback (working on a test can continue in parallel)
Great, I will update it then. I did not want to change it 1. as part of a refactoring for a fix, 2. not as the original author. |
…deadlock Preload ScopeContext and ScopeStack classes in ContinuableScopeManager constructor to avoid class loading on virtual thread mount path. DatadogClassLoader's synchronized I/O from JarFile can pin carrier threads and deadlock the application.
…te method Support both afterDone() and afterTerminate() cleanup methods to handle different JDK 21 virtual thread implementations. Both methods cancel the help continuation and release the context scope when the virtual thread completes.
5115362 to
d223f79
Compare
|
/merge |
1 similar comment
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
|
|
View all feedbacks in Devflow UI.
PR already in the queue with status queued |
What Does This Do
This PR preloads ScopeContext and ScopeStack classes in ContinuableScopeManager constructor to avoid class loading on virtual thread mount path. DatadogClassLoader's synchronized I/O from JarFile can pin carrier threads and deadlock the application.
It additionally adds support for early versions of virtual threads internals.
Motivation
Related to #10273 but the API (so the classes loaded on hot path) changed when re-implementing the virtual thread instrumentation in #11009
Some versions of the JDK 21.0.1 to 21.0.4 hang, 21.0.5 crashes. 21.0.6 and above seem working fine (as CI is running with latest JDK versions, this wasn't caught in testing).
Additional Notes
Sharing the CI dev build to help with investigation
Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: APMS-19222 [APMS-19230]
GitHub ticket: #11103
Note: Once your PR is ready to merge, add it to the merge queue by commenting
/merge./merge -ccancels the queue request./merge -f --reason "reason"skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.