[2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache #325

samikshya-db wants to merge 27 commits into main from
Conversation
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
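The CLOSED → OPEN → HALF_OPEN cycle described above can be sketched in a few dozen lines. Thresholds mirror the stated defaults (5 failures, 60s timeout, 2 successes); the class, method, and field names are illustrative, not the PR's actual API:

```typescript
// Minimal sketch of the circuit breaker state machine from the description.
// A clock function is injected so the timeout transition is testable.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreakerSketch {
  private state: State = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private timeoutMs = 60000,
    private successThreshold = 2,
    private now: () => number = Date.now,
  ) {}

  getState(): State {
    // OPEN transitions to HALF_OPEN once the recovery timeout elapses.
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.timeoutMs) {
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }
    return this.state;
  }

  recordFailure(): void {
    this.failures += 1;
    // Any failure while probing, or hitting the threshold, (re)opens the circuit.
    if (this.getState() === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.failures = 0;
    }
  }

  recordSuccess(): void {
    if (this.getState() === 'HALF_OPEN') {
      this.successes += 1;
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
        this.failures = 0;
      }
    } else {
      this.failures = 0;
    }
  }
}
```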
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
Force-pushed from 06b1f7a to 1cdb716.
Fix TypeScript compilation error by implementing getAuthHeaders method required by IClientContext interface.
Force-pushed from 7ec20b2 to 689e561.
Added osArch, runtimeVendor, localeName, charSetEncoding, and processName fields to DriverConfiguration to match JDBC implementation.
```ts
const authHeaders = authenticatedExport ? await this.context.getAuthHeaders() : {};

// Make HTTP POST request with authentication
const response: Response = await this.fetchFn(endpoint, {
```
Great point! This is a miss on the PR: included this now. [I have merged this as part of PR #330, as it had other changes around calling the telemetry endpoint too]
Yes. Both the exporter and FeatureFlagCache route HTTP through connectionProvider.getAgent(), which picks up proxy config from ConnectionOptions.proxy via HttpConnection (same plumbing CloudFetchResultHandler and DBSQLSession use). See sendRequest/fetchWithRetry in both files.
```ts
 * Prevents rate limiting by caching feature flag responses.
 * Instance-based, stored in DBSQLClient.
 */
export default class FeatureFlagCache {
```
[F6] FeatureFlagCache landed without any consumer (Severity: High)
FeatureFlagCache has no non-test import anywhere in this PR — grep confirms it's only loaded by its own unit test. ~500 LOC of ref-counted caching, thundering-herd dedup, and server-provided TTL clamping ship speculatively without a caller to validate the API shape. TelemetryEventEmitter and large parts of MetricsAggregator have the same problem: no subscription wiring exists in this PR, so event-shape bugs stay invisible until PR N/7.
Suggested fix: Either land a minimal consumer in DBSQLClient here (even feature-flagged-off) so the contract is exercised end-to-end, or drop FeatureFlagCache from this PR and re-add it with its consumer. If that's not feasible, call out in the PR description that Emitter/Aggregator/FFC are intentionally dead code until PR N/7.
Posted by code-review-squad • flagged by maintainability, architecture.
```ts
 * Used by telemetry and feature flag fetching to authenticate REST API calls.
 * @returns Promise resolving to headers object with authentication, or empty object if no auth
 */
getAuthHeaders(): Promise<HeadersInit>;
```
[F7] IClientContext bloat; HeadersInit leaks node-fetch into the public contract (Severity: High)
Two concerns:

- Scope creep. `IClientContext` was a narrow logger/config/driver/connection facade. Adding `getAuthHeaders()` forces every implementer and mock (including all external subclasses) to learn about auth, and couples telemetry/FeatureFlagCache to a layer that shouldn't carry auth concerns.
- Unsafe type. `HeadersInit` from `node-fetch` is the union `Headers | string[][] | Record<string, string>`. The exporter casts to `Record<string, string>` with an apologetic comment; object-spread over a `Headers` or tuple-array input produces garbage silently.
Suggested fix: Pass IAuthentication (or () => Promise<Record<string, string>>) directly into the DatabricksTelemetryExporter and FeatureFlagCache constructors; leave IClientContext narrow. At minimum, narrow the return type to Promise<Record<string, string>> and drop the cast.
Posted by code-review-squad • flagged by architecture, language, agent-compat.
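A hedged sketch of the suggested narrowing — inject a header supplier into the consumer instead of widening `IClientContext`. The type and class names here are illustrative:

```typescript
// A supplier returning a concrete Record<string, string> makes object spread
// safe, unlike node-fetch's HeadersInit union.
type AuthHeadersSupplier = () => Promise<Record<string, string>>;

// Illustrative consumer: receives the supplier via its constructor, so the
// client-context facade never learns about auth.
class FeatureFlagCacheSketch {
  constructor(private getAuthHeaders: AuthHeadersSupplier) {}

  async buildRequestHeaders(): Promise<Record<string, string>> {
    // Spreading a plain record is well-defined; spreading a Headers instance is not.
    return { 'Content-Type': 'application/json', ...(await this.getAuthHeaders()) };
  }
}
```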
```ts
// Check if circuit is open
if (this.state === CircuitBreakerState.OPEN) {
  if (this.nextAttempt && Date.now() < this.nextAttempt.getTime()) {
    throw new Error('Circuit breaker OPEN');
```
[F8] Breaker-open identified by string message — brittle (Severity: High)
The thrown new Error('Circuit breaker OPEN') is identified downstream by error.message === 'Circuit breaker OPEN'. Any future edit to include host context, timestamp, or a localization key silently breaks the branch — no compiler signal.
Suggested fix: Export a named error class:
```ts
export class CircuitBreakerOpenError extends Error {
  readonly code = 'CIRCUIT_BREAKER_OPEN';

  constructor(msg = 'Circuit breaker OPEN') {
    super(msg);
  }
}
```

Consumers use `instanceof CircuitBreakerOpenError` or `err.code === 'CIRCUIT_BREAKER_OPEN'`.
Posted by code-review-squad • flagged by agent-compat, language.
```ts
const protoLogs = telemetryLogs.map((log) => JSON.stringify(log));

const payload: DatabricksTelemetryPayload = {
  uploadTime: Date.now(),
```
[F9] Unauthenticated endpoint still carries correlation IDs (Severity: High)
When telemetryAuthenticatedExport === false, the payload shipped to /telemetry-unauth still contains workspace_id, session_id, statement_id, and errorMessage (via toTelemetryLog). An on-path observer can link sessions → workspaces → user activity without any auth needed.
Suggested fix: When authenticatedExport is false, either (a) omit session_id, statement_id, workspace_id and redact error detail in toTelemetryLog, or (b) refuse to export unless the caller explicitly sets a separate telemetryAllowUnauthenticated flag.
Posted by code-review-squad • flagged by security.
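Option (a) can be sketched as a redaction pass over the outgoing log. The field names follow the ones listed above; the function name and the `<redacted>` placeholder are illustrative:

```typescript
// Strip correlation identifiers and error detail before unauthenticated export.
interface TelemetryLogLike {
  workspace_id?: string;
  session_id?: string;
  statement_id?: string;
  errorMessage?: string;
  [key: string]: unknown;
}

function redactForUnauthExport(log: TelemetryLogLike): TelemetryLogLike {
  const out: TelemetryLogLike = { ...log };
  delete out.workspace_id;
  delete out.session_id;
  delete out.statement_id;
  // Keep a coarse signal that an error occurred without shipping its detail.
  if (out.errorMessage !== undefined) out.errorMessage = '<redacted>';
  return out;
}
```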
```ts
// Cap the buffer to avoid unbounded memory growth when exports keep failing
if (this.pendingMetrics.length > this.maxPendingMetrics) {
  const dropped = this.pendingMetrics.length - this.maxPendingMetrics;
  this.pendingMetrics = this.pendingMetrics.slice(dropped);
```
[F10] O(n) slice() on every overflow push (Severity: Medium)
this.pendingMetrics = this.pendingMetrics.slice(dropped) runs on every overflow push. Under sustained overflow (e.g., breaker open), the array is recopied O(n) per push — quadratic total work and extra GC pressure exactly when the process is already unhealthy. Also, since the cap is checked per-insertion, dropped is always 1; the log message "Dropped ${dropped} oldest telemetry metrics" is misleading in incident review.
Suggested fix: Use a ring buffer with a head index, or a single this.pendingMetrics.shift() when over cap. Refine the log: "Dropped 1 oldest metric (buffer full at ${maxPendingMetrics})".
Posted by code-review-squad • flagged by performance, ops.
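The single-`shift()` variant of that fix can be sketched as below. Because the cap is enforced on every push, the buffer is never more than one element over, so dropping exactly one oldest entry suffices (names are illustrative):

```typescript
// Bounded FIFO buffer: enforces the cap per insertion, dropping one oldest
// element per overflowing push instead of recopying the whole array.
class BoundedBuffer<T> {
  private items: T[] = [];

  public droppedCount = 0;

  constructor(private max: number) {}

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.max) {
      this.items.shift(); // drop exactly one oldest entry
      this.droppedCount += 1;
    }
  }

  toArray(): T[] {
    return [...this.items];
  }
}
```

A ring buffer with a head index would avoid even the `shift()` move cost, at the price of more bookkeeping.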
```ts
 * Ensures each host has its own isolated circuit breaker to prevent
 * failures on one host from affecting telemetry to other hosts.
 */
export class CircuitBreakerRegistry {
```
[F11] CircuitBreakerRegistry never shrinks — per-host leak (Severity: Medium)
Only removeCircuitBreaker / clear shrink this.breakers; nothing invokes them on session close. A long-lived multi-tenant process leaks one entry per distinct host forever.
Suggested fix: Have DBSQLClient.close() call registry.removeCircuitBreaker(host); or put an LRU cap on the map; ensure the registry is client-scoped (not a module singleton).
Posted by code-review-squad • flagged by ops.
```ts
 * Follows JDBC TelemetryCollector.java:29-30 pattern.
 */
export default class MetricsAggregator {
  private statementMetrics: Map<string, StatementTelemetryDetails> = new Map();
```
[F12] Unbounded statementMetrics and per-statement error buffer (Severity: Medium)
maxPendingMetrics guards only pendingMetrics. statementMetrics: Map<string, StatementDetails> and details.errors: TelemetryEvent[] are unbounded. Abandoned statements (process crash, operation timeout without completeStatement()) or a retry storm that generates many retryable errors on one statement produce unbounded growth.
Suggested fix: Cap details.errors at ~50 with drop-oldest; evict statementMetrics entries whose startTime is older than a configurable TTL (e.g. 1 hour) on every flush/close.
Posted by code-review-squad • flagged by security.
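The TTL half of that fix can be sketched as a sweep run on each flush/close. The 1-hour TTL and all names are illustrative; the error-buffer cap would be a separate, equally small guard:

```typescript
// Evict statement entries whose startTime is older than ttlMs.
// Deleting from a Map during for...of iteration is safe in JavaScript.
interface StatementEntry {
  startTime: number;
}

function evictExpired(
  map: Map<string, StatementEntry>,
  ttlMs: number,
  now: number = Date.now(),
): string[] {
  const evicted: string[] = [];
  for (const [id, entry] of map) {
    if (now - entry.startTime > ttlMs) {
      map.delete(id);
      evicted.push(id);
    }
  }
  return evicted;
}
```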
```ts
this.pendingMetrics.push(metric);

// Cap the buffer to avoid unbounded memory growth when exports keep failing
if (this.pendingMetrics.length > this.maxPendingMetrics) {
```
[F13] FIFO overflow drops the most valuable events (Severity: Medium)
Under sustained failure the first-failure error event — usually the most valuable signal for debugging — is exactly what gets dropped.
Suggested fix: On overflow, preferentially drop metricType === 'connection' / 'statement' and preserve metricType === 'error' until an error-specific cap is reached.
Posted by code-review-squad • flagged by ops.
```ts
let lastError: Error | null = null;

/* eslint-disable no-await-in-loop */
for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
```
[F14] maxRetries=3 actually issues 4 HTTP calls (Severity: Medium)
for (let attempt = 0; attempt <= maxRetries; attempt += 1) with default maxRetries=3 performs attempts 0, 1, 2, 3 — four requests. Conventionally, "max retries" is retries after the first attempt.
Suggested fix: Change the bound to attempt < maxRetries, or rename the config field to telemetryMaxAttempts and update docs so users know what they're setting.
Posted by code-review-squad • flagged by devils-advocate.
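The "max attempts" framing can be sketched as below: with `maxAttempts = 3` the loop issues exactly 3 calls, avoiding the off-by-one. Names are illustrative:

```typescript
// Retry wrapper counting total attempts, not retries-after-first-attempt.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown;
  /* eslint-disable no-await-in-loop */
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // remember the most recent failure for the final throw
    }
  }
  /* eslint-enable no-await-in-loop */
  throw lastError;
}
```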
```ts
}

// Calculate backoff with exponential + jitter (100ms - 1000ms)
const baseDelay = Math.min(100 * 2 ** attempt, 1000);
```
[F15] Backoff constants hardcoded; total can overshoot cap (Severity: Medium)
baseDelay = Math.min(100 * 2 ** attempt, 1000) then delay = baseDelay + jitter. Total delay reaches 1100ms — a 10% overshoot past the intended 1000ms cap. The 100/1000/100 constants are hardcoded while the rest of the exporter is config-driven via ClientConfig.
Suggested fix: Expose telemetryBackoffBaseMs, telemetryBackoffMaxMs, telemetryBackoffJitterMs on ClientConfig with defaults in DEFAULT_TELEMETRY_CONFIG; clamp delay to maxMs after adding jitter.
Posted by code-review-squad • flagged by devils-advocate.
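A sketch of the config-driven variant where the cap is applied after adding jitter, so the total delay can never overshoot the maximum. The option names echo the ones suggested above but are illustrative:

```typescript
// Exponential backoff with jitter, clamped to maxMs AFTER jitter is added.
interface BackoffConfig {
  baseMs: number;   // e.g. telemetryBackoffBaseMs
  maxMs: number;    // e.g. telemetryBackoffMaxMs
  jitterMs: number; // e.g. telemetryBackoffJitterMs
}

function backoffDelay(
  attempt: number,
  cfg: BackoffConfig,
  rand: () => number = Math.random, // injectable for deterministic tests
): number {
  const exp = cfg.baseMs * 2 ** attempt;
  const jitter = rand() * cfg.jitterMs;
  return Math.min(exp + jitter, cfg.maxMs); // clamp after jitter, not before
}
```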
```ts
// Skip export if authenticated mode is requested but no auth headers available.
// Note: all auth providers in this codebase return plain objects (Record<string, string>).
const headersObj = authHeaders as Record<string, string>;
if (authenticatedExport && (!headersObj || !headersObj.Authorization)) {
```
[F16] Auth-missing silently counts as a circuit-breaker success (Severity: Medium)
Three overlapping problems in this return path:
- The `CircuitBreaker.execute()` wrapper sees the missing-auth early-return as a SUCCESS — it can close a HALF_OPEN breaker that never actually talked to the endpoint.
- Operators get no signal that telemetry is silently dark during PAT misconfig or startup races between `authProvider` and the exporter.
- `headersObj.Authorization` is a case-sensitive, plain-object-only check. `HeadersInit` permits `Headers` and `string[][]`; custom providers returning those silently skip as well. `Authorization: Basic ...` passes through unvalidated.
Suggested fix: Throw a dedicated AuthenticationError when auth is missing so the breaker records it as terminal (not a success). Normalize with new Headers(authHeaders as HeadersInit).has('authorization'). Warn once per session on first skip so operators discover the misconfig.
Posted by code-review-squad • flagged by security, ops, devils-advocate.
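The normalization piece of that fix can be sketched without depending on any particular `Headers` implementation, by handling all three `HeadersInit` shapes case-insensitively (function name illustrative):

```typescript
// Case-insensitive Authorization check across the HeadersInit union:
// Headers-like objects, string[][] tuple arrays, and plain records.
function hasAuthorization(init: unknown): boolean {
  if (!init) return false;
  if (Array.isArray(init)) {
    // string[][] tuple form
    return init.some(
      (pair) => Array.isArray(pair) && String(pair[0]).toLowerCase() === 'authorization',
    );
  }
  if (typeof (init as { has?: unknown }).has === 'function') {
    // Headers-like object: has() is case-insensitive per the Fetch spec
    return (init as { has(name: string): boolean }).has('authorization');
  }
  if (typeof init === 'object') {
    // Plain record form
    return Object.keys(init as Record<string, unknown>).some(
      (k) => k.toLowerCase() === 'authorization',
    );
  }
  return false;
}
```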
```ts
charSetEncoding: string;

/** Process name */
processName: string;
```
[F17] processName exports raw absolute paths → PII (Severity: Medium)
DriverConfiguration.processName is mapped directly to process_name in the wire log. Typical population sources (process.argv[1], process.title) contain absolute paths like /home/<username>/app/worker.js — stable per-operator PII.
Suggested fix: Send only path.basename(processName) or a SHA-256 hash of the full path; drop the field when it looks absolute and is outside node_modules. Document the sanitation contract on the DriverConfiguration field so PR 3+ producers comply.
Posted by code-review-squad • flagged by security.
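The basename-or-hash part of that fix can be sketched as below (function name illustrative; the 16-hex-character truncation is an assumption, not specified above):

```typescript
import * as path from 'path';
import { createHash } from 'crypto';

// Ship only the basename; hash values that look like absolute paths, since
// those can embed usernames (/home/<username>/...).
function sanitizeProcessName(name: string): string {
  if (path.isAbsolute(name)) {
    return createHash('sha256').update(name).digest('hex').slice(0, 16);
  }
  return path.basename(name);
}
```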
```ts
char_set_encoding?: string;
process_name?: string;
};
driver_connection_params?: any;
```
[F19] driver_connection_params?: any — dead scaffolding (Severity: Medium)
Declared on DatabricksTelemetryLog but never populated anywhere (grep of toTelemetryLog and producers confirms). An untyped any slot with no producer is exactly the kind of "we'll fill it in later" shape that decays.
Suggested fix: Delete the field. Re-add it alongside its producer when connection-parameter shipping actually lands, and give it a concrete type at that point.
Posted by code-review-squad • flagged by maintainability, language.
```ts
: buildTelemetryUrl(this.host, '/telemetry-unauth');

// Format payload - each log is JSON-stringified to match JDBC format
const telemetryLogs = metrics.map((m) => this.toTelemetryLog(m));
```
[F20] Double JSON.stringify inflates body and CPU (Severity: Medium)
Each log is stringified (logs.map(log => JSON.stringify(log))), then embedded as string elements in the outer payload which is re-stringified — every " becomes \". At large batch sizes this is measurable CPU and body-size overhead, and it roughly doubles the wire bytes for a payload that is already mostly JSON escape sequences.
Suggested fix: Build the body by hand:
```ts
const body = `{"uploadTime":${Date.now()},"items":[],"protoLogs":[${logs.map((l) => JSON.stringify(l)).join(',')}]}`;
```

Or emit `protoLogs` as a JSON array directly and have the server parse it as such.
Posted by code-review-squad • flagged by performance.
Code Review Squad Report

Merge Safety Score: 54/100 — HIGH RISK

9 parallel reviewers (security, architecture, language, ops-safety, performance, tests, maintainability, agent-compat, devil's advocate) reviewed this PR.

Summary: good scaffolding and happy-path test coverage, but blocking issues around data loss on shutdown, host-validation-based SSRF, PII leakage, and a concurrency race in the circuit breaker. Findings posted as inline comments.

High severity (9)

Medium severity (11)

Low severity (12, not posted inline)

Runtime-name hardcoded (

Verified & dismissed

Ship recommendation

Address F1–F9 before merge. F10–F21 can land as a follow-up before PR 3/7 wires this into

Feedback? Drop it in #code-review-squad-feedback.
Addressing Vikrant's review. Summary of changes in the latest push:

Direct comments
F1–F20 (posted via code-review-squad)
F6 — FFC without consumer: this is a 7-PR stack; the consumer lands in [5/7]. To de-risk the "shape drifts before it's exercised" concern, FFC and the exporter now share the HTTP/auth/user-agent plumbing, so any drift would break both at once. Happy to drop FFC from this PR and re-add it with its consumer if you'd rather. All 145 telemetry unit tests pass, plus the full unit suite (632 passing) — excluding two OAuth test files whose pre-existing
Security (Critical):
- buildTelemetryUrl now rejects protocol-relative //prefix, zero-width and BOM codepoints, CR/LF/tab, userinfo, path/query/fragment, and loopback/RFC1918/IMDS/localhost/GCP+Azure-metadata hosts. Defeats the SSRF-shaped Bearer-token exfil vector an attacker-influenced host (env var, tampered config) could trigger.
- redactSensitive now covers real Databricks secret shapes: dapi/dkea/dskea/dsapi/dose PATs, JWT triplets, JSON-quoted access_token/client_secret/refresh_token/id_token/password/api_key, Basic auth, URL-embedded credentials. Re-applies after truncation.
- Unauthenticated export now omits system_configuration entirely, strips userAgentEntry from User-Agent, and blanks stack_trace, so on-path observers cannot re-identify clients on the unauth path.
- sanitizeProcessName drops argv tail (handles node --db-password=X app.js shape).

Correctness (High):
- MetricsAggregator gained a closed flag; close() no longer races with batch-triggered flushes that would resurrect the interval.
- evictExpiredStatements now runs from the periodic flush timer so idle clients actually reclaim orphan statement entries.
- Evicted statements emit their buffered error events as standalone metrics before being dropped — first-failure signal survives.
- Batch-size and terminal-error flush paths pass resetTimer=false so sustained overflow can't starve the periodic tail drain.
- TelemetryTerminalError introduced for host-validation refusals, separating that taxonomy from AuthenticationError.
- authMissingWarned re-arms after a successful export so operators see a fresh signal the next time auth breaks.
- Retry log denominator uses totalAttempts (not maxRetries); negative maxRetries falls back to default; retry log includes the redacted failing error so ops can see what's being retried.

API / hygiene:
- CircuitBreakerOpenError, CIRCUIT_BREAKER_OPEN_CODE, and TelemetryTerminalError re-exported from lib/index.ts so consumers can instanceof-check.
- DBSQLClient.getAuthProvider marked @internal.
- DEFAULT_TELEMETRY_CONFIG / DEFAULT_CIRCUIT_BREAKER_CONFIG frozen.
- pushBoundedError uses if instead of while.
- CIRCUIT_BREAKER_OPEN_CODE typed as const.
- Default export on buildTelemetryUrl removed (no callers).
- Dropped wasted new Error allocation in processErrorEvent.

Tests:
- New telemetryUtils.test.ts (53 tests): URL-rejection table covering every known bypass, redactor shapes, sanitize process name.
- DatabricksTelemetryExporter: 13 new tests covering Authorization on-the-wire, authMissingWarned idempotency + re-arm, unauth correlation/system_configuration/userAgentEntry/stack_trace stripping, malformed-host drop, loopback drop, dispose idempotency, errorStack to redacted stack_trace flow.
- MetricsAggregator: 2 new tests for async close() awaiting the exporter promise (prevents process.exit truncation) and no timer resurrection after close.

700 unit tests pass, ESLint clean.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Clarify that the JWT string in the redactor test is intentionally fake and is built from parts so the assembled token never appears as a source literal (to satisfy pre-commit secret scanners). Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (
CI's TypeScript was stricter than the local version and rejected the untyped `c.args[1]` passed to `RegExp.test()`. Wrap in `String(...)` so the tests compile on Node 14/16/18/20 runners. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (
```ts
// In HALF_OPEN state, allow only one probe request at a time
if (this.state === CircuitBreakerState.HALF_OPEN && this.halfOpenInflight > 0) {
  throw new Error('Circuit breaker OPEN');
```
Fixed: HALF_OPEN now admits only one probe at a time via atomic tryAdmit(). Any failure there immediately reopens the circuit; successThreshold consecutive successes close it. New concurrent-probe test at CircuitBreaker.test.ts:~411 verifies the invariant.
```ts
  return out;
}
const out: Record<string, string> = {};
for (const [k, v] of Object.entries(raw as Record<string, unknown>)) {
```
Added — normalizeHeaders now guards with typeof raw === 'object' before Object.entries, and filters out non-string values and non-string keys. Lives in telemetryUtils.ts and is shared between exporter and FFC (addresses your duplication comment below as well).
```ts
const errorCode = (error as any).code;
if (
  errorCode === 'ECONNREFUSED' ||
  errorCode === 'ENOTFOUND' ||
```
How about other codes like:
- ECONNRESET
- ETIMEDOUT
- EHOSTUNREACH
- EAI_AGAIN
Done. Removed ENOTFOUND from retryable with a comment explaining why (DNS "not found" is deterministic — retrying just pushes load at the resolver). Kept ECONNRESET and EHOSTUNREACH; added ETIMEDOUT and EAI_AGAIN. Tests updated in ExceptionClassifier.test.ts (inverted ENOTFOUND expectation, added cases for the two new codes).
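The resulting classification can be sketched as a simple set-membership check — ENOTFOUND excluded as deterministic, ETIMEDOUT and EAI_AGAIN included per the comment above (helper and constant names are illustrative):

```typescript
// Retryable transient network error codes after the change described above.
// ENOTFOUND is deliberately absent: a DNS "not found" is deterministic.
const RETRYABLE_CODES = new Set([
  'ECONNREFUSED',
  'ECONNRESET',
  'ETIMEDOUT',
  'EHOSTUNREACH',
  'EAI_AGAIN',
]);

function isRetryableNetworkError(error: unknown): boolean {
  const code = (error as { code?: unknown } | null)?.code;
  return typeof code === 'string' && RETRYABLE_CODES.has(code);
}
```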
```ts
  this.userAgent = buildUserAgentString(this.context.getConfig().userAgentEntry);
}

getOrCreateContext(host: string): FeatureFlagContext {
```
may not apply to node, please double check
Confirmed per your follow-up — Node is single-threaded, all Map accesses happen in sequential event-loop ticks. No concurrent mutation is possible. In-flight fetchPromises dedup handles the thundering-herd case where two refCount-bumps race with an expired cache.
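The in-flight dedup pattern referenced here can be sketched as a promise map keyed by host: concurrent callers share one pending fetch, so an expired cache triggers exactly one refetch (class and method names are illustrative):

```typescript
// Deduplicate concurrent fetches for the same key by sharing the in-flight promise.
class InflightDedup<T> {
  private inflight = new Map<string, Promise<T>>();

  fetch(key: string, doFetch: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) return existing;
    const p = doFetch();
    this.inflight.set(key, p);
    // Clear the slot on settle (success or failure) so the next call refetches.
    const cleanup = () => {
      this.inflight.delete(key);
    };
    p.then(cleanup, cleanup);
    return p;
  }
}
```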
```ts
  return ctx;
}

releaseContext(host: string): void {
```
Same as above — single event loop, no concurrent access needed.
```ts
const isExpired = !ctx.lastFetched || Date.now() - ctx.lastFetched.getTime() > ctx.cacheDuration;

if (isExpired) {
```
I know this call is async, but can we move this call out of this loop? Find some way to trigger the fetch automatically so that we don't see delay here.
The blocking fetch only happens once per host per 15 minutes (cache TTL), and concurrent callers are deduped via fetchPromises so only one in-flight request per host. First call after cache expiry does take the round-trip latency. Proactive refresh is feasible but needs a trigger signal from the consumer (currently none exists — the consumer-wiring lands in #327). Happy to add a background refresher in that PR once we know the call pattern.
```ts
logger.log(LogLevel.debug, `Fetching feature flags from ${endpoint}`);

const response = await this.sendRequest(endpoint, {
```
Added. fetchWithRetry now retries once on transient errors (ECONNRESET / ETIMEDOUT / EHOSTUNREACH / EAI_AGAIN / 5xx) via the shared ExceptionClassifier. One retry is intentional: without it, a single DNS hiccup disables telemetry for the full 15-minute TTL; more than one retry starts hammering a broken endpoint.
```ts
  return fetch(url, { ...init, agent });
}

private async getAuthHeaders(): Promise<Record<string, string>> {
```
Extracted. normalizeHeaders and hasAuthorization moved to telemetryUtils.ts; both exporter and FFC now import them. ~40 lines of drift risk gone.
- ExceptionClassifier: remove ENOTFOUND from retryable (DNS "not found"
is deterministic — retrying just pushes load at the resolver without
any expectation of success). Add ETIMEDOUT and EAI_AGAIN per Jade's
follow-up list.
- Extract shared `normalizeHeaders` + `hasAuthorization` helpers into
telemetryUtils; DatabricksTelemetryExporter and FeatureFlagCache both
use them — eliminates the ~40-line duplication Jade flagged.
- normalizeHeaders now guards `typeof raw === 'object'` before
Object.entries, preventing Object.entries('some-string') index entries
from leaking in as headers (Jade: "should we do type check here?").
- FeatureFlagCache.fetchFeatureFlag: add single-retry on transient errors
(classified via ExceptionClassifier). Without a retry, one DNS hiccup
would disable telemetry for the full 15-minute cache TTL; one retry
gives an ephemeral failure a second chance without pushing sustained
load at a broken endpoint.
- Drop the private hasAuthorization/normalizeHeaders on the exporter;
drop the inlined branching in FFC getAuthHeaders.
- Update ExceptionClassifier tests: invert ENOTFOUND expectation with
a comment explaining why; add cases for ETIMEDOUT and EAI_AGAIN.
702 unit tests pass, ESLint clean.
Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (
Part 2 of 7-part Telemetry Implementation Stack
This layer adds infrastructure components for resilience and feature flag management.
Summary
Implements CircuitBreaker for per-host endpoint protection and FeatureFlagCache for efficient feature flag management with reference counting.
Components
CircuitBreaker (`lib/telemetry/CircuitBreaker.ts`)

State machine for endpoint protection:
Configuration:
CircuitBreakerRegistry
Manages circuit breakers per host for isolation:
FeatureFlagCache (`lib/telemetry/FeatureFlagCache.ts`)

Per-host feature flag caching with lifecycle management:
Testing
Next Steps
This PR is followed by:
Dependencies
Depends on: [1/7] Telemetry Foundation (#324)