Skip to content

[2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache#325

Open
samikshya-db wants to merge 27 commits intomainfrom
telemetry-2-infrastructure
Open

[2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache#325
samikshya-db wants to merge 27 commits intomainfrom
telemetry-2-infrastructure

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

Part 2 of 7-part Telemetry Implementation Stack

This layer adds infrastructure components for resilience and feature flag management.

Summary

Implements CircuitBreaker for per-host endpoint protection and FeatureFlagCache for efficient feature flag management with reference counting.

Components

CircuitBreaker (lib/telemetry/CircuitBreaker.ts)

State machine for endpoint protection:

  • CLOSED (normal operation) → OPEN (failing) after threshold failures
  • OPENHALF_OPEN (testing) after timeout period
  • HALF_OPENCLOSED (recovered) after consecutive successes
  • HALF_OPENOPEN (still failing) on any failure

Configuration:

  • Failure threshold: 5 (default)
  • Timeout: 60s (default)
  • Success threshold: 2 (default)

CircuitBreakerRegistry

Manages circuit breakers per host for isolation:

  • Each host gets independent circuit breaker
  • Prevents cascade failures across hosts
  • Thread-safe per-host state management

FeatureFlagCache (lib/telemetry/FeatureFlagCache.ts)

Per-host feature flag caching with lifecycle management:

  • 15-minute TTL with automatic refresh
  • Reference counting tracks connection lifecycle
  • Context removed when refCount reaches zero
  • Reduces API calls for feature flag checks

Testing

  • 32 unit tests for CircuitBreaker (100% function coverage)
  • 29 unit tests for FeatureFlagCache (100% function coverage)
  • CircuitBreakerStub for testing dependent components
  • All state transitions and edge cases covered

Next Steps

This PR is followed by:

  • [3/7] Client Management: TelemetryClient and Provider
  • [4/7] Event & Aggregation: EventEmitter and MetricsAggregator
  • [5/7] Export: DatabricksTelemetryExporter
  • [6/7] Integration: Wire into DBSQLClient
  • [7/7] Testing & Documentation

Dependencies

Depends on: [1/7] Telemetry Foundation (#324)

samikshya-db and others added 5 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.
Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.
Comment thread lib/telemetry/CircuitBreaker.ts
const authHeaders = authenticatedExport ? await this.context.getAuthHeaders() : {};

// Make HTTP POST request with authentication
const response: Response = await this.fetchFn(endpoint, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this support proxy?

Copy link
Copy Markdown
Collaborator Author

@samikshya-db samikshya-db Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! This is a miss on the PR : included this now. [I have merged this as part of PR #330 : as it had other changes around calling telemetry endpoint too]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Both the exporter and FeatureFlagCache route HTTP through connectionProvider.getAgent(), which picks up proxy config from ConnectionOptions.proxy via HttpConnection (same plumbing CloudFetchResultHandler and DBSQLSession use). See sendRequest/fetchWithRetry in both files.

Comment thread lib/telemetry/TelemetryEventEmitter.ts
Comment thread lib/telemetry/DatabricksTelemetryExporter.ts Outdated
Comment thread lib/telemetry/DatabricksTelemetryExporter.ts Outdated
* Prevents rate limiting by caching feature flag responses.
* Instance-based, stored in DBSQLClient.
*/
export default class FeatureFlagCache {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F6] FeatureFlagCache landed without any consumer (Severity: High)

FeatureFlagCache has no non-test import anywhere in this PR — grep confirms it's only loaded by its own unit test. ~500 LOC of ref-counted caching, thundering-herd dedup, and server-provided TTL clamping ship speculatively without a caller to validate the API shape. TelemetryEventEmitter and large parts of MetricsAggregator have the same problem: no subscription wiring exists in this PR, so event-shape bugs stay invisible until PR N/7.

Suggested fix: Either land a minimal consumer in DBSQLClient here (even feature-flagged-off) so the contract is exercised end-to-end, or drop FeatureFlagCache from this PR and re-add it with its consumer. If that's not feasible, call out in the PR description that Emitter/Aggregator/FFC are intentionally dead code until PR N/7.

Posted by code-review-squad • flagged by maintainability, architecture.

Comment thread lib/contracts/IClientContext.ts Outdated
* Used by telemetry and feature flag fetching to authenticate REST API calls.
* @returns Promise resolving to headers object with authentication, or empty object if no auth
*/
getAuthHeaders(): Promise<HeadersInit>;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F7] IClientContext bloat; HeadersInit leaks node-fetch into the public contract (Severity: High)

Two concerns:

  1. Scope creep. IClientContext was a narrow logger/config/driver/connection facade. Adding getAuthHeaders() forces every implementer and mock (including all external subclasses) to learn about auth, and couples telemetry/FeatureFlagCache to a layer that shouldn't carry auth concerns.
  2. Unsafe type. HeadersInit from node-fetch is the union Headers | string[][] | Record<string,string>. The exporter casts to Record<string,string> with an apologetic comment; object-spread over a Headers or tuple-array input produces garbage silently.

Suggested fix: Pass IAuthentication (or () => Promise<Record<string, string>>) directly into the DatabricksTelemetryExporter and FeatureFlagCache constructors; leave IClientContext narrow. At minimum, narrow the return type to Promise<Record<string, string>> and drop the cast.

Posted by code-review-squad • flagged by architecture, language, agent-compat.

Comment thread lib/telemetry/CircuitBreaker.ts Outdated
// Check if circuit is open
if (this.state === CircuitBreakerState.OPEN) {
if (this.nextAttempt && Date.now() < this.nextAttempt.getTime()) {
throw new Error('Circuit breaker OPEN');
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F8] Breaker-open identified by string message — brittle (Severity: High)

The thrown new Error('Circuit breaker OPEN') is identified downstream by error.message === 'Circuit breaker OPEN'. Any future edit to include host context, timestamp, or a localization key silently breaks the branch — no compiler signal.

Suggested fix: Export a named error class:

export class CircuitBreakerOpenError extends Error {
  readonly code = 'CIRCUIT_BREAKER_OPEN';
  constructor(msg = 'Circuit breaker OPEN') { super(msg); }
}

Consumers use instanceof CircuitBreakerOpenError or err.code === 'CIRCUIT_BREAKER_OPEN'.

Posted by code-review-squad • flagged by agent-compat, language.

const protoLogs = telemetryLogs.map((log) => JSON.stringify(log));

const payload: DatabricksTelemetryPayload = {
uploadTime: Date.now(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F9] Unauthenticated endpoint still carries correlation IDs (Severity: High)

When telemetryAuthenticatedExport === false, the payload shipped to /telemetry-unauth still contains workspace_id, session_id, statement_id, and errorMessage (via toTelemetryLog). An on-path observer can link sessions → workspaces → user activity without any auth needed.

Suggested fix: When authenticatedExport is false, either (a) omit session_id, statement_id, workspace_id and redact error detail in toTelemetryLog, or (b) refuse to export unless the caller explicitly sets a separate telemetryAllowUnauthenticated flag.

Posted by code-review-squad • flagged by security.

Comment thread lib/telemetry/MetricsAggregator.ts Outdated
// Cap the buffer to avoid unbounded memory growth when exports keep failing
if (this.pendingMetrics.length > this.maxPendingMetrics) {
const dropped = this.pendingMetrics.length - this.maxPendingMetrics;
this.pendingMetrics = this.pendingMetrics.slice(dropped);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F10] O(n) slice() on every overflow push (Severity: Medium)

this.pendingMetrics = this.pendingMetrics.slice(dropped) runs on every overflow push. Under sustained overflow (e.g., breaker open), the array is recopied O(n) per push — quadratic total work and extra GC pressure exactly when the process is already unhealthy. Also, since the cap is checked per-insertion, dropped is always 1; the log message "Dropped ${dropped} oldest telemetry metrics" is misleading in incident review.

Suggested fix: Use a ring buffer with a head index, or a single this.pendingMetrics.shift() when over cap. Refine the log: "Dropped 1 oldest metric (buffer full at ${maxPendingMetrics})".

Posted by code-review-squad • flagged by performance, ops.

* Ensures each host has its own isolated circuit breaker to prevent
* failures on one host from affecting telemetry to other hosts.
*/
export class CircuitBreakerRegistry {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F11] CircuitBreakerRegistry never shrinks — per-host leak (Severity: Medium)

Only removeCircuitBreaker / clear shrink this.breakers; nothing invokes them on session close. A long-lived multi-tenant process leaks one entry per distinct host forever.

Suggested fix: Have DBSQLClient.close() call registry.removeCircuitBreaker(host); or put an LRU cap on the map; ensure the registry is client-scoped (not a module singleton).

Posted by code-review-squad • flagged by ops.

* Follows JDBC TelemetryCollector.java:29-30 pattern.
*/
export default class MetricsAggregator {
private statementMetrics: Map<string, StatementTelemetryDetails> = new Map();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F12] Unbounded statementMetrics and per-statement error buffer (Severity: Medium)

maxPendingMetrics guards only pendingMetrics. statementMetrics: Map<string, StatementDetails> and details.errors: TelemetryEvent[] are unbounded. Abandoned statements (process crash, operation timeout without completeStatement()) or a retry storm that generates many retryable errors on one statement produce unbounded growth.

Suggested fix: Cap details.errors at ~50 with drop-oldest; evict statementMetrics entries whose startTime is older than a configurable TTL (e.g. 1 hour) on every flush/close.

Posted by code-review-squad • flagged by security.

this.pendingMetrics.push(metric);

// Cap the buffer to avoid unbounded memory growth when exports keep failing
if (this.pendingMetrics.length > this.maxPendingMetrics) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F13] FIFO overflow drops the most valuable events (Severity: Medium)

Under sustained failure the first-failure error event — usually the most valuable signal for debugging — is exactly what gets dropped.

Suggested fix: On overflow, preferentially drop metricType === 'connection' / 'statement' and preserve metricType === 'error' until an error-specific cap is reached.

Posted by code-review-squad • flagged by ops.

let lastError: Error | null = null;

/* eslint-disable no-await-in-loop */
for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F14] maxRetries=3 actually issues 4 HTTP calls (Severity: Medium)

for (let attempt = 0; attempt <= maxRetries; attempt += 1) with default maxRetries=3 performs attempts 0, 1, 2, 3 — four requests. Conventionally, "max retries" is retries after the first attempt.

Suggested fix: Change the bound to attempt < maxRetries, or rename the config field to telemetryMaxAttempts and update docs so users know what they're setting.

Posted by code-review-squad • flagged by devils-advocate.

}

// Calculate backoff with exponential + jitter (100ms - 1000ms)
const baseDelay = Math.min(100 * 2 ** attempt, 1000);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F15] Backoff constants hardcoded; total can overshoot cap (Severity: Medium)

baseDelay = Math.min(100 * 2 ** attempt, 1000) then delay = baseDelay + jitter. Total delay reaches 1100ms — a 10% overshoot past the intended 1000ms cap. The 100/1000/100 constants are hardcoded while the rest of the exporter is config-driven via ClientConfig.

Suggested fix: Expose telemetryBackoffBaseMs, telemetryBackoffMaxMs, telemetryBackoffJitterMs on ClientConfig with defaults in DEFAULT_TELEMETRY_CONFIG; clamp delay to maxMs after adding jitter.

Posted by code-review-squad • flagged by devils-advocate.

// Skip export if authenticated mode is requested but no auth headers available.
// Note: all auth providers in this codebase return plain objects (Record<string, string>).
const headersObj = authHeaders as Record<string, string>;
if (authenticatedExport && (!headersObj || !headersObj.Authorization)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F16] Auth-missing silently counts as a circuit-breaker success (Severity: Medium)

Three overlapping problems in this return path:

  1. The CircuitBreaker.execute() wrapper sees the missing-auth early-return as a SUCCESS — it can close a HALF_OPEN breaker that never actually talked to the endpoint.
  2. Operators get no signal that telemetry is silently dark during PAT misconfig or startup races between authProvider and the exporter.
  3. headersObj.Authorization is a case-sensitive, plain-object-only check. HeadersInit permits Headers and string[][]; custom providers returning those silently skip as well. Authorization: Basic ... passes through unvalidated.

Suggested fix: Throw a dedicated AuthenticationError when auth is missing so the breaker records it as terminal (not a success). Normalize with new Headers(authHeaders as HeadersInit).has('authorization'). Warn once per session on first skip so operators discover the misconfig.

Posted by code-review-squad • flagged by security, ops, devils-advocate.

Comment thread lib/telemetry/types.ts
charSetEncoding: string;

/** Process name */
processName: string;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F17] processName exports raw absolute paths → PII (Severity: Medium)

DriverConfiguration.processName is mapped directly to process_name in the wire log. Typical population sources (process.argv[1], process.title) contain absolute paths like /home/<username>/app/worker.js — stable per-operator PII.

Suggested fix: Send only path.basename(processName) or a SHA-256 hash of the full path; drop the field when it looks absolute and is outside node_modules. Document the sanitation contract on the DriverConfiguration field so PR 3+ producers comply.

Posted by code-review-squad • flagged by security.

char_set_encoding?: string;
process_name?: string;
};
driver_connection_params?: any;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F19] driver_connection_params?: any — dead scaffolding (Severity: Medium)

Declared on DatabricksTelemetryLog but never populated anywhere (grep of toTelemetryLog and producers confirms). An untyped any slot with no producer is exactly the kind of "we'll fill it in later" shape that decays.

Suggested fix: Delete the field. Re-add it alongside its producer when connection-parameter shipping actually lands, and give it a concrete type at that point.

Posted by code-review-squad • flagged by maintainability, language.

: buildTelemetryUrl(this.host, '/telemetry-unauth');

// Format payload - each log is JSON-stringified to match JDBC format
const telemetryLogs = metrics.map((m) => this.toTelemetryLog(m));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F20] Double JSON.stringify inflates body and CPU (Severity: Medium)

Each log is stringified (logs.map(log => JSON.stringify(log))), then embedded as string elements in the outer payload which is re-stringified — every " becomes \". At large batch sizes this is measurable CPU and body-size overhead, and it roughly doubles the wire bytes for a payload that is already mostly JSON escape sequences.

Suggested fix: Build the body by hand:

const body = `{"uploadTime":${Date.now()},"items":[],"protoLogs":[${logs.map(l => JSON.stringify(l)).join(',')}]}`;

Or emit protoLogs as a JSON array directly and have the server parse it as such.

Posted by code-review-squad • flagged by performance.

@vikrantpuppala
Copy link
Copy Markdown
Collaborator

Code Review Squad Report

Merge Safety Score: 54/100 — HIGH RISK

9 parallel reviewers (security, architecture, language, ops-safety, performance, tests, maintainability, agent-compat, devil's advocate) reviewed this PR. Summary: good scaffolding and happy-path test coverage, but blocking issues around data loss on shutdown, host-validation-based SSRF, PII leakage, and a concurrency race in the circuit breaker.

Findings posted as inline comments

High severity (9)

  • F1CircuitBreaker.ts:94 — HALF_OPEN concurrent-probe race
  • F2MetricsAggregator.ts:319 — silent final-metrics loss on flush/close
  • F3telemetryUtils.ts:21 — weak host validation (SSRF / bearer exfiltration)
  • F4DatabricksTelemetryExporter.ts:329 — PII leak + stack_trace is actually the error message
  • F5DatabricksTelemetryExporter.ts:147 — retry/backoff duplicates HttpRetryPolicy; bypasses HttpConnection
  • F6FeatureFlagCache.ts:38 — whole class landed with no in-PR consumer
  • F7IClientContext.ts:49getAuthHeaders bloats the contract; HeadersInit leaks node-fetch
  • F8CircuitBreaker.ts:96 — breaker-open identified by string-match; add a typed error
  • F9DatabricksTelemetryExporter.ts:209/telemetry-unauth payload still carries workspace/session/statement IDs

Medium severity (11)

  • F10MetricsAggregator.ts:294 — O(n) slice() on every overflow push
  • F11CircuitBreaker.ts:205 — registry never shrinks; per-host leak
  • F12MetricsAggregator.ts:57statementMetrics / per-statement errors unbounded
  • F13MetricsAggregator.ts:298 — FIFO drops first-failure errors (most valuable signal)
  • F14DatabricksTelemetryExporter.ts:155 — off-by-one: maxRetries=3 → 4 HTTP calls
  • F15DatabricksTelemetryExporter.ts:181 — backoff hardcoded; total can overshoot cap
  • F16DatabricksTelemetryExporter.ts:237 — missing-auth silently counts as breaker success
  • F17types.ts:215processName exports absolute paths (PII)
  • F18tests/unit/telemetry/* — missing concurrent-probe, auth-header merge, thrown-error-retry, real-fetch tests
  • F19DatabricksTelemetryExporter.ts:57 — dead driver_connection_params?: any scaffolding
  • F20DatabricksTelemetryExporter.ts:215 — double JSON.stringify inflates body and CPU
  • F21 — multiple — catch (error: any) crashes on non-Error throws inside "never-throws" code

Low severity (12, not posted inline)

Runtime-name hardcoded (F22); over-exposed test accessors getFailureCount/getSuccessCount (F23); timer reset starves periodic flush (F24); single-function util file (F25); one-line delegate methods (F26); misleading success log (F27); Date vs epoch-ms (F28); unseeded Math.random in retry jitter (F29); telemetry surface not exported from lib/index.ts (F30); ClientConfig.telemetry* keys lack TSDoc + defaults (F31); agent rejectUnauthorized: false widens bearer exposure — repo-wide posture, escalate separately (F32); DatabricksTelemetryLog permits illegal field combinations — use a discriminated union (F33).

Verified & dismissed

  • Double-scheme bug (https://https://...) — false alarm; regex strips it.
  • Proxy support — resolved in THIS PR via connectionProvider.getAgent() in both exporter and FeatureFlagCache. The "deferred to PR [7/7] Telemetry Testing and Documentation - FINAL LAYER #330" note in prior review is now stale.
  • Timer keepalive — correctly handled via flushTimer.unref() at MetricsAggregator.ts:366.
  • Custom CircuitBreaker (not opossum/cockatiel) — accepted per Node >=14 constraint.
  • urlUtils → telemetryUtils rename — done.

Ship recommendation

Address F1–F9 before merge. F10–F21 can land as a follow-up before PR 3/7 wires this into DBSQLClient. F32 (shared-agent rejectUnauthorized: false) is a repo-wide issue that pre-dates this PR but is worth a separate ticket since telemetry now widens bearer exposure.


Feedback? Drop it in #code-review-squad-feedback.

@samikshya-db
Copy link
Copy Markdown
Collaborator Author

Addressing Vikrant's review. Summary of changes in the latest push:

Direct comments

  • CircuitBreaker.ts (jade/vikrant +1): fixed the HALF_OPEN concurrent-probe race — admission is now a single synchronous check-and-set (tryAdmit), and added a test that launches two execute()s concurrently after the OPEN→HALF_OPEN transition and asserts exactly one probe runs.
  • DatabricksTelemetryExporter user-agent: now uses lib/utils/buildUserAgentString.ts instead of constructing ad-hoc; same helper applied in FeatureFlagCache. One place to update.
  • FeatureFlagCache / exporter HTTP client: both now use the same connectionProvider.getAgent() + fetch pattern already used by CloudFetchResultHandler/DBSQLSession, picking up proxy support.
  • urlUtils file: renamed telemetryUtils and now carries three related helpers (URL validation, secret redactor, process-name sanitizer), so it's no longer a single-function file.
  • MetricsAggregator bound: statementMetrics now has a TTL (statementTtlMs, default 1h) and per-statement error buffer is capped (maxErrorsPerStatement, default 50). Overflow on pendingMetrics prefers dropping non-error entries.

F1–F20 (posted via code-review-squad)

  • F1 HALF_OPEN race: fixed (see above); new test.
  • F2 no-data-loss close: flush() is now async and returns the export promise; close() awaits it.
  • F3 host validation: buildTelemetryUrl parses with new URL, rejects non-root path, query, fragment, userinfo, CR/LF. Malformed host throws AuthenticationError so the batch is dropped.
  • F4 stack-trace PII: added errorStack to TelemetryEvent/TelemetryMetric/emitError; redactSensitive strips Bearer/token/password/api_key and caps at 2 KiB.
  • F5 / F14 / F15: retry loop is now config-driven (telemetryBackoffBaseMs/MaxMs/JitterMs), maxRetries matches HttpRetryPolicy semantics (retries after initial), classification goes through ExceptionClassifier. Full reuse of HttpRetryPolicy would need it to model thrown errors (not just response codes) — happy to factor into a shared TelemetryRetryPolicy if preferred.
  • F6: see below.
  • F7 IClientContext bloat: removed getAuthHeaders() from IClientContext. Exporter and FFC now take IAuthentication directly. HeadersInit no longer leaks into the contract.
  • F8 named error: CircuitBreakerOpenError / CIRCUIT_BREAKER_OPEN_CODE exported; call sites use instanceof.
  • F9 unauth correlation: unauthenticated export now scrubs session_id, statement_id, workspace_id and drops raw error detail.
  • F10 quadratic overflow: replaced slice() with a single splice(); log message corrected.
  • F11 registry leak: added DatabricksTelemetryExporter.dispose() which calls registry.removeCircuitBreaker(host); the consumer in a later PR will call this from DBSQLClient.close().
  • F12 unbounded maps: statementTtlMs + maxErrorsPerStatement + eviction in getOrCreateStatementDetails.
  • F13 overflow policy: preserves error metrics over connection/statement.
  • F16 auth-missing as failure: now throws AuthenticationError so the breaker counts it; header presence is checked case-insensitively after normalising all three HeadersInit shapes; warns once per exporter instance on first skip.
  • F17 processName PII: sanitizeProcessName returns the basename; applied at export time.
  • F19 dead field: driver_connection_params?: any removed.
  • F20 double stringify: the outer payload stays stringified (server contract), but inner logs are JSON objects so only one escape layer remains.

F6 — FFC without consumer: this is a 7-PR stack; the consumer lands in [5/7]. To de-risk the "shape drifts before it's exercised" concern, FFC and the exporter now share the HTTP/auth/user-agent plumbing, so any drift would break both at once. Happy to drop FFC from this PR and re-add it with its consumer if you'd rather.

All 145 telemetry unit tests pass, plus the full unit suite (632 passing) — excluding two OAuth test files whose pre-existing [Symbol.asyncDispose] TS errors were there before this branch. ESLint clean.

Security (Critical):
- buildTelemetryUrl now rejects protocol-relative //prefix, zero-width
  and BOM codepoints, CR/LF/tab, userinfo, path/query/fragment, and
  loopback/RFC1918/IMDS/localhost/GCP+Azure-metadata hosts. Defeats
  the SSRF-shaped Bearer-token exfil vector an attacker-influenced
  host (env var, tampered config) could trigger.
- redactSensitive now covers real Databricks secret shapes: dapi/
  dkea/dskea/dsapi/dose PATs, JWT triplets, JSON-quoted access_token/
  client_secret/refresh_token/id_token/password/api_key, Basic auth,
  URL-embedded credentials. Re-applies after truncation.
- Unauthenticated export now omits system_configuration entirely,
  strips userAgentEntry from User-Agent, and blanks stack_trace, so
  on-path observers cannot re-identify clients on the unauth path.
- sanitizeProcessName drops argv tail (handles node --db-password=X
  app.js shape).

Correctness (High):
- MetricsAggregator gained a closed flag; close() no longer races
  with batch-triggered flushes that would resurrect the interval.
- evictExpiredStatements now runs from the periodic flush timer so
  idle clients actually reclaim orphan statement entries.
- Evicted statements emit their buffered error events as standalone
  metrics before being dropped — first-failure signal survives.
- Batch-size and terminal-error flush paths pass resetTimer=false so
  sustained overflow cant starve the periodic tail drain.
- TelemetryTerminalError introduced for host-validation refusals,
  separating that taxonomy from AuthenticationError.
- authMissingWarned re-arms after a successful export so operators
  see a fresh signal the next time auth breaks.
- Retry log denominator uses totalAttempts (not maxRetries); negative
  maxRetries falls back to default; retry log includes the redacted
  failing error so ops can see whats being retried.

API / hygiene:
- CircuitBreakerOpenError, CIRCUIT_BREAKER_OPEN_CODE, and
  TelemetryTerminalError re-exported from lib/index.ts so consumers
  can instanceof-check.
- DBSQLClient.getAuthProvider marked @internal.
- DEFAULT_TELEMETRY_CONFIG / DEFAULT_CIRCUIT_BREAKER_CONFIG frozen.
- pushBoundedError uses if instead of while.
- CIRCUIT_BREAKER_OPEN_CODE typed as const.
- Default export on buildTelemetryUrl removed (no callers).
- Dropped wasted new Error allocation in processErrorEvent.

Tests:
- New telemetryUtils.test.ts (53 tests): URL-rejection table covering
  every known bypass, redactor shapes, sanitize process name.
- DatabricksTelemetryExporter: 13 new tests covering Authorization
  on-the-wire, authMissingWarned idempotency + re-arm, unauth
  correlation/system_configuration/userAgentEntry/stack_trace
  stripping, malformed-host drop, loopback drop, dispose idempotency,
  errorStack to redacted stack_trace flow.
- MetricsAggregator: 2 new tests for async close() awaiting the
  exporter promise (prevents process.exit truncation) and no timer
  resurrection after close.

700 unit tests pass, ESLint clean.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Clarify that the JWT string in the redactor test is intentionally fake
and is built from parts so the assembled token never appears as a
source literal (to satisfy pre-commit secret scanners).

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

CI's TypeScript was stricter than the local version and rejected the
untyped `c.args[1]` passed to `RegExp.test()`. Wrap in `String(...)`
so the tests compile on Node 14/16/18/20 runners.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Comment thread lib/telemetry/CircuitBreaker.ts Outdated

// In HALF_OPEN state, allow only one probe request at a time
if (this.state === CircuitBreakerState.HALF_OPEN && this.halfOpenInflight > 0) {
throw new Error('Circuit breaker OPEN');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HALF_OPEN?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: HALF_OPEN now admits only one probe at a time via atomic tryAdmit(). Any failure there immediately reopens the circuit; successThreshold consecutive successes close it. New concurrent-probe test at CircuitBreaker.test.ts:~411 verifies the invariant.

return out;
}
const out: Record<string, string> = {};
for (const [k, v] of Object.entries(raw as Record<string, unknown>)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do type check here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added — normalizeHeaders now guards with typeof raw === 'object' before Object.entries, and filters out non-string values and non-string keys. Lives in telemetryUtils.ts and is shared between exporter and FFC (addresses your duplication comment below as well).

Comment thread lib/telemetry/ExceptionClassifier.ts Outdated
const errorCode = (error as any).code;
if (
errorCode === 'ECONNREFUSED' ||
errorCode === 'ENOTFOUND' ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notfound should not be transient

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about other code like,

  • ECONNRESET
  • ETIMEDOUT
  • EHOSTUNREACH
  • EAI_AGAIN

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed ENOTFOUND from retryable with a comment explaining why (DNS "not found" is deterministic — retrying just pushes load at the resolver). Kept ECONNRESET and EHOSTUNREACH; added ETIMEDOUT and EAI_AGAIN. Tests updated in ExceptionClassifier.test.ts (inverted ENOTFOUND expectation, added cases for the two new codes).

this.userAgent = buildUserAgentString(this.context.getConfig().userAgentEntry);
}

getOrCreateContext(host: string): FeatureFlagContext {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this thread safe?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may not apply to node, please double check

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed per your follow-up — Node is single-threaded, all Map accesses happen in sequential event-loop ticks. No concurrent mutation is possible. In-flight fetchPromises dedup handles the thundering-herd case where two refCount-bumps race with an expired cache.

return ctx;
}

releaseContext(host: string): void {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this thread safe

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — single event loop, no concurrent access needed.


const isExpired = !ctx.lastFetched || Date.now() - ctx.lastFetched.getTime() > ctx.cacheDuration;

if (isExpired) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this call is async, but can we make this call out of this loop, find some way that we can trigger the fetch automatically so that we don't see delay here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking fetch only happens once per host per 15 minutes (cache TTL), and concurrent callers are deduped via fetchPromises so only one in-flight request per host. First call after cache expiry does take the round-trip latency. Proactive refresh is feasible but needs a trigger signal from the consumer (currently none exists — the consumer-wiring lands in #327). Happy to add a background refresher in that PR once we know the call pattern.

Comment thread lib/telemetry/FeatureFlagCache.ts Outdated

logger.log(LogLevel.debug, `Fetching feature flags from ${endpoint}`);

const response = await this.sendRequest(endpoint, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have retry on this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. fetchWithRetry now retries once on transient errors (ECONNRESET / ETIMEDOUT / EHOSTUNREACH / EAI_AGAIN / 5xx) via the shared ExceptionClassifier. One retry is intentional: without it, a single DNS hiccup disables telemetry for the full 15-minute TTL; more than one retry starts hammering a broken endpoint.

return fetch(url, { ...init, agent });
}

private async getAuthHeaders(): Promise<Record<string, string>> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code seem duplicated

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted. normalizeHeaders and hasAuthorization moved to telemetryUtils.ts; both exporter and FFC now import them. ~40 lines of drift risk gone.

- ExceptionClassifier: remove ENOTFOUND from retryable (DNS "not found"
  is deterministic — retrying just pushes load at the resolver without
  any expectation of success). Add ETIMEDOUT and EAI_AGAIN per Jade's
  follow-up list.
- Extract shared `normalizeHeaders` + `hasAuthorization` helpers into
  telemetryUtils; DatabricksTelemetryExporter and FeatureFlagCache both
  use them — eliminates the ~40-line duplication Jade flagged.
- normalizeHeaders now guards `typeof raw === 'object'` before
  Object.entries, preventing Object.entries('some-string') index entries
  from leaking in as headers (Jade: "should we do type check here?").
- FeatureFlagCache.fetchFeatureFlag: add single-retry on transient errors
  (classified via ExceptionClassifier). Without a retry, one DNS hiccup
  would disable telemetry for the full 15-minute cache TTL; one retry
  gives an ephemeral failure a second chance without pushing sustained
  load at a broken endpoint.
- Drop the private hasAuthorization/normalizeHeaders on the exporter;
  drop the inlined branching in FFC getAuthHeaders.
- Update ExceptionClassifier tests: invert ENOTFOUND expectation with
  a comment explaining why; add cases for ETIMEDOUT and EAI_AGAIN.

702 unit tests pass, ESLint clean.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants