
Track compressed resource state sizes in deploy telemetry (direct engine) #5608

Open
shreyas-goenka wants to merge 1 commit into main from shreyas-goenka/telemetry-compressed-resource-sizes

@shreyas-goenka shreyas-goenka commented Jun 15, 2026

Contributor

What

Bundle deploy telemetry already reports per-resource-type raw state-size statistics (state_size_{max,mean,median}_bytes in ResourceMetadata). The same per-resource state is stored compressed downstream, so this adds the compressed-size counterparts to gauge how much resource state shrinks under compression, not just the raw sizes:

  • state_compressed_size_max_bytes
  • state_compressed_size_mean_bytes
  • state_compressed_size_median_bytes
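
As an illustration of how the three aggregates could be derived from per-resource compressed sizes, here is a minimal sketch. The `sizeStats` type and `summarize` function are hypothetical names for this example, not identifiers from the PR, and the rounding choices (truncated mean, averaged middle pair for even counts) are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// sizeStats holds the three aggregates reported per resource type.
// Hypothetical type for this sketch; the PR's own struct is ResourceMetadata.
type sizeStats struct {
	Max    int64
	Mean   int64
	Median int64
}

// summarize computes max, mean and median over per-resource sizes in bytes.
// Assumption: the mean is truncated by integer division and an even-length
// input uses the average of the two middle values.
func summarize(sizes []int64) sizeStats {
	if len(sizes) == 0 {
		return sizeStats{}
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]int64(nil), sizes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var sum int64
	for _, s := range sorted {
		sum += s
	}
	n := len(sorted)
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return sizeStats{Max: sorted[n-1], Mean: sum / int64(n), Median: median}
}

func main() {
	// Compressed sizes, in bytes, of three made-up resources of one type.
	fmt.Printf("%+v\n", summarize([]int64{503, 961, 1439}))
	// prints {Max:1439 Mean:967 Median:961}
}
```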

Performance

flate runs at state-export time over individually small resource states (each well under the server's per-resource limit), not in a tight loop, so even large bundles compress in a few milliseconds — negligible next to a deploy's network I/O. No background goroutine is warranted.
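
For reference, measuring a compressed length with the standard library looks roughly like the sketch below. `flateSize` is a hypothetical name and the sample payload is made up; only `compress/flate` and its default level come from the description above.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// flateSize returns the DEFLATE-compressed length of raw at the default level
// (flate.DefaultCompression, which Go maps to level 6).
// Hypothetical helper for this sketch, not the PR's function.
func flateSize(raw []byte) (int, error) {
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return 0, err
	}
	if _, err := zw.Write(raw); err != nil {
		return 0, err
	}
	// Close flushes the final block; measuring before Close would undercount.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// A made-up, highly repetitive JSON-like payload.
	raw := bytes.Repeat([]byte(`{"name":"job","max_concurrent_runs":1}`), 100)
	n, err := flateSize(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw=%d bytes, compressed=%d bytes\n", len(raw), n)
}
```

Only the length is kept, so the compressed bytes can be discarded as soon as the writer is closed.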

Flate vs Zstd

The server uses Zstd.compress(raw), the single-argument call in luben's zstd-jni, which compresses at zstd's default level (3). So the right comparison is flate-L6 (what the CLI uses) vs zstd-L3, and the data suggests flate is a good proxy:

┌───────────────────────┬─────────┬─────────────────┬──────────────────┬──────────────────┐
│        Sample         │   raw   │ flate L6 (CLI)  │ zstd L3 (server) │ flate vs zstd-L3 │
├───────────────────────┼─────────┼─────────────────┼──────────────────┼──────────────────┤
│ varied JSON, 64 KB    │ 64 KB   │ 11.6 KB (18.1%) │ 11.1 KB (17.3%)  │ +4.6%            │
├───────────────────────┼─────────┼─────────────────┼──────────────────┼──────────────────┤
│ varied JSON, 1 MB     │ 1024 KB │ 179 KB (17.6%)  │ 183 KB (17.9%)   │ −2.1%            │
├───────────────────────┼─────────┼─────────────────┼──────────────────┼──────────────────┤
│ realistic JSON, 64 KB │ 64 KB   │ 2.2 KB (3.4%)   │ 2.0 KB (3.1%)    │ +11.2%           │
├───────────────────────┼─────────┼─────────────────┼──────────────────┼──────────────────┤
│ realistic JSON, 1 MB  │ 1024 KB │ 27 KB (2.6%)    │ 29 KB (2.8%)     │ −6.9%            │
└───────────────────────┴─────────┴─────────────────┴──────────────────┴──────────────────┘

Takeaways:
- flate-L6 tracks the server's zstd-L3 within ~±10%, with no consistent bias — sometimes a touch larger (small blobs), sometimes a touch smaller (at ~1 MB flate actually beats zstd-L3). For the intended purpose — understanding how much state shrinks and rough server-storage sizing — that's a faithful proxy. And since the error isn't one-directional, it largely washes out in the aggregate max/mean/median.
- The one important caveat: the proxy is good because the server compresses at zstd's default level (3). zstd's real edge over DEFLATE only shows at higher levels. For example, on realistic/1 MB, zstd --best got 21 KB vs flate's 27 KB (~22% smaller; equivalently, flate's output is ~28% larger). So if the server ever raises its zstd level, flate would start systematically over-estimating stored size by ~10–30%. Worth keeping in mind, but at the current default level the proxy is within noise.

Net: for what this telemetry is for, flate is a solidly good stand-in for the server's zstd — within ~10% today. (These are synthetic JSON samples; real resource state will vary in absolute ratio, but the flate-vs-zstd relationship is stable for JSON/text.)

Against some real-ish data:
Pulled 6 real .lvdash.json dashboards from public GitHub repos (databrickslabs/dqx, databrickslabs/sandbox's DBR-monitor, andre-salvati/databricks-template, etc.), 1.3 KB–240 KB, and compressed each with flate-L6 (what the CLI does) vs zstd-L3 (the server's confirmed default):

┌──────────────────────┬─────────┬──────────┬─────────┬───────────┬──────────────────┐
│      Dashboard       │   raw   │ flate L6 │ zstd L3 │ zstd best │ flate vs zstd-L3 │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ mtest                │ 1.3 KB  │ 503 B    │ 538 B   │ 521 B     │ −6.5%            │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ airflow              │ 5.4 KB  │ 961 B    │ 1060 B  │ 996 B     │ −9.3%            │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ worldcup             │ 8.8 KB  │ 1439 B   │ 1621 B  │ 1481 B    │ −11.2%           │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ orders               │ 16.6 KB │ 1571 B   │ 1787 B  │ 1601 B    │ −12.1%           │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ dbrmon (DBR monitor) │ 155 KB  │ 9298 B   │ 8690 B  │ 7388 B    │ +7.0%            │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ dqx                  │ 240 KB  │ 14907 B  │ 14869 B │ 12563 B   │ +0.3%            │
├──────────────────────┼─────────┼──────────┼─────────┼───────────┼──────────────────┤
│ TOTAL                │ 428 KB  │ 28679 B  │ 28565 B │ —         │ +0.4%            │
└──────────────────────┴─────────┴──────────┴─────────┴───────────┴──────────────────┘

eng-dev-ecosystem-bot commented Jun 15, 2026

Collaborator

Integration test report

Commit: c9ac30f

Run: 27685801795

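| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | | | 15 | 264 | 1009 | 7:17 |
| 🟨 aws windows | 7 | | | 15 | 266 | 1007 | 11:49 |
| 💚 aws-ucws linux | | | 7 | 15 | 360 | 923 | 8:13 |
| 💚 aws-ucws windows | | | 7 | 15 | 362 | 921 | 10:29 |
| 💚 azure linux | | | 1 | 17 | 267 | 1007 | 6:32 |
| 💚 azure windows | | | 1 | 17 | 269 | 1005 | 7:54 |
| 🔄 azure-ucws linux | | 3 | | 17 | 363 | 919 | 11:00 |
| 💚 azure-ucws windows | | | 1 | 17 | 367 | 917 | 9:45 |
| 🔄 gcp linux | | 3 | 1 | 17 | 260 | 1010 | 10:01 |
| 🔄 gcp windows | | 2 | 1 | 17 | 263 | 1008 | 13:07 |
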
27 interesting tests: 15 SKIP, 7 KNOWN, 5 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨K | 🟨K | 💚R | 💚R | 💚R | 💚R | 🔄f | 💚R | 💚R | 💚R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🔄 TestAccept/bundle/resources/apps/inline_config | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p | ✅p |
| 🔄 TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p | ✅p |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/grants/select | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/ssh/connection | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/root | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | 🔄f |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/subdir | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | 🔄f |

Top 24 slowest tests (at least 2 minutes):
duration env testname
5:15 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:40 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:27 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:26 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:50 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:44 gcp windows TestAccept
3:30 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:28 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:21 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:16 azure-ucws windows TestAccept
3:02 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:02 azure windows TestAccept
2:56 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:55 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:55 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:48 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:45 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:45 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:41 aws-ucws windows TestAccept
2:38 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:33 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:32 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:32 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:27 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/telemetry-compressed-resource-sizes branch 4 times, most recently from 1aacd8b to 8c625fa Compare June 17, 2026 02:58
@shreyas-goenka shreyas-goenka requested a review from pietern June 17, 2026 03:01
@shreyas-goenka shreyas-goenka marked this pull request as ready for review June 17, 2026 03:01
@github-actions

Contributor

Approval status: pending

/acceptance/bundle/ - needs approval

Files: acceptance/bundle/telemetry/deploy/out.resources_metadata.direct.txt
Suggested: @denik
Also eligible: @pietern, @janniklasrose, @anton-107, @andrewnester, @lennartkats-db

/bundle/ - needs approval

6 files changed
Suggested: @denik
Also eligible: @pietern, @janniklasrose, @anton-107, @andrewnester, @lennartkats-db

/libs/telemetry/ - needs approval

Files: libs/telemetry/protos/bundle_deploy.go
Eligible: @simonfaltum, @renaudhartert-db, @hectorcast-db, @parthban-db, @tanmay-db, @Divyansh-db, @tejaskochar-db, @mihaimitrea-db, @chrisst, @rauchy

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/telemetry-compressed-resource-sizes branch 2 times, most recently from 505c536 to b0b017d Compare June 17, 2026 09:31
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/telemetry-compressed-resource-sizes branch from b0b017d to ca58e77 Compare June 17, 2026 09:41
Comment thread bundle/direct/dstate/state.go Outdated

result := make(map[string]int, len(keys))
for i, key := range keys {
result[key] = sizes[i]

Contributor

The canonical pattern here is to run a goroutine for each element in the map and have it return a {key, int} on a channel. The main loop then drains that channel to collect the results and stores them in a map. There is no need to deal with GOMAXPROCS or "workers". The Go runtime takes care of scheduling.

Contributor Author

Done — switched to the canonical pattern: one goroutine per resource sending {key, size} on a buffered channel, drained into the map. Dropped the GOMAXPROCS/worker-pool machinery entirely.
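
A minimal sketch of the pattern discussed in this thread, assuming states are available as a map from resource key to raw bytes. The names `sizeResult`, `flateLen`, and `compressedSizes` are hypothetical and do not mirror the PR's code; errors are collapsed to -1 for brevity.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// sizeResult pairs a resource key with its compressed size.
type sizeResult struct {
	key  string
	size int
}

// flateLen returns the flate-compressed length of raw, or -1 on error.
func flateLen(raw []byte) int {
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.DefaultCompression)
	if err != nil {
		return -1
	}
	if _, err := zw.Write(raw); err != nil {
		return -1
	}
	if err := zw.Close(); err != nil {
		return -1
	}
	return buf.Len()
}

// compressedSizes starts one goroutine per resource and drains the results
// from a channel into a map; scheduling is left to the Go runtime.
func compressedSizes(states map[string][]byte) map[string]int {
	// A buffer of len(states) lets every goroutine send without blocking.
	ch := make(chan sizeResult, len(states))
	for key, raw := range states {
		go func(key string, raw []byte) {
			ch <- sizeResult{key: key, size: flateLen(raw)}
		}(key, raw)
	}
	// Receive exactly one result per resource.
	out := make(map[string]int, len(states))
	for range states {
		r := <-ch
		out[r.key] = r.size
	}
	return out
}

func main() {
	// Made-up resource keys and payloads.
	states := map[string][]byte{
		"resources.jobs.a":      bytes.Repeat([]byte(`{"task":"x"}`), 50),
		"resources.pipelines.b": []byte(`{"name":"p"}`),
	}
	for key, size := range compressedSizes(states) {
		fmt.Println(key, size)
	}
}
```

Because the main goroutine receives exactly `len(states)` values, no `sync.WaitGroup` or channel close is needed in this shape.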

Comment thread bundle/direct/dstate/state.go Outdated
}
return buf.Len()
}

Contributor

This code belongs in a separate file where we have all the compression related stuff.

Contributor Author

Done — moved the compression code (compressedStateSize + compressStateSizes) into bundle/direct/dstate/compress.go.

}
})
}
}

Contributor

This doesn't have anything to do with state, only with compression.

Contributor Author

Done — the compression test and benchmarks now live in compress_test.go alongside the compression code.

…ine)

Deploy telemetry already reports per-resource-type raw state-size statistics
(state_size_{max,mean,median}_bytes). The deployment metadata service stores
that same per-resource state compressed, so this adds compressed-size
counterparts to gauge how much resource state shrinks under compression rather
than just the raw sizes:

  - state_compressed_size_max_bytes
  - state_compressed_size_mean_bytes
  - state_compressed_size_median_bytes

The compressed length is computed per resource at state-export time (alongside
the existing raw length) using the standard library's compress/flate -- a
deliberately rough proxy for the server side (which uses zstd) that keeps the
dependency/supply-chain surface small while still giving useful signal on
compressibility. Since the largest resource states (~1 MB, ~20 ms to compress)
dominate the cost, the per-resource compression is fanned out across workers,
keeping multi-resource bundles cheap. Only the direct engine is measured,
matching the existing raw-size behavior.

Co-authored-by: Isaac
@shreyas-goenka shreyas-goenka force-pushed the shreyas-goenka/telemetry-compressed-resource-sizes branch from ca58e77 to c9ac30f Compare June 17, 2026 11:30
3 participants