docs: move TopK user defined operator example into extending-operators guide#22871
docs: move TopK user defined operator example into extending-operators guide#22871sjhddh wants to merge 3 commits into
Conversation
…s guide The `extending-operators` user guide previously only covered the µWheel optimizer at a high level, while a full worked example of building a custom operator lived in `user_defined_plan.rs` as a test. The test module's own header noted the code would be better placed in the docs. This moves the narrative walkthrough into the user guide: defining a custom `UserDefinedLogicalNodeCore`, an `OptimizerRule` that rewrites `Limit` + `Sort` into the node, a matching `ExecutionPlan`, the `ExtensionPlanner`/`QueryPlanner` wiring, and how to register and run it. `user_defined_plan.rs` keeps the implementation as a test because it also exercises user defined plan invariants, which the documentation example omits for clarity. Its header comment now points at the guide instead of duplicating the walkthrough. Closes apache#15774 Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
Would you be willing to move them, as a follow on PR, to unit tests (if they aren't already covered)? I agree that piggy backing invariant tests in other examples makes them harder to follow |
| Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`: | ||
|
|
||
| ```text | ||
| > EXPLAIN SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3; |
There was a problem hiding this comment.
This is out of date as DataFusion now has a special TopK mode in its sort operator. I recommend we just point this out and then move on with the example
> EXPLAIN SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ SortExec(TopK) │ |
| | │ -------------------- │ |
| | │ limit: 3 │ |
| | │ │ |
| | │ revenue@1 DESC │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ EmptyExec │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.020 seconds.
> EXPLAIN FORMAT INDENT SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;
+---------------+-------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------------------------------+
| logical_plan | Sort: sales.revenue DESC NULLS FIRST, fetch=3 |
| | TableScan: sales projection=[customer_id, revenue] |
| physical_plan | SortExec: TopK(fetch=3), expr=[revenue@1 DESC], preserve_partitioning=[false] |
| | EmptyExec |
| | |
+---------------+-------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.
| SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3; | ||
| ``` | ||
|
|
||
| Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`: |
There was a problem hiding this comment.
I think we should also note that DataFusion has a much more sophisticated topk implementation built in, but this is just for an example
Maybe we can update the example so it disables the limit pushdown optimizer pass 🤔
| Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`: | |
| Out of the box, DataFusion already contains an optimized TopK implementation and our example here | |
| is just for demonstration purposes. If we disable the LimitPushdown optimization, we see the original plan is a `Sort` feeding a `Limit`: |
| let df = ctx | ||
| .sql("SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3") | ||
| .await?; | ||
| df.show().await?; |
There was a problem hiding this comment.
it woudl be great here to update the example to use assert_batches_eq! so that the actual output is captured in the example too
…hes_eq!) - Note that DataFusion already ships an optimized TopK and frame this operator as a demonstration; the Sort -> Limit plan shown is the original plan with LimitPushdown disabled. - Capture the example query output with assert_batches_eq! instead of df.show() so the expected result is visible in the docs. Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
|
Thanks for the review @alamb! Pushed an update:
On the follow-on: agreed the invariant tests piggybacking on the example hurts readability. I'll move |
Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
Which issue does this PR close?
user_defined_plan.rsto theextending-operatorsdoc #15774.Rationale for this change
The
extending-operatorsuser guide only documented the µWheel optimizer at a high level. The full worked example of building a custom operator lived indatafusion/core/tests/user_defined/user_defined_plan.rs, whose own module header noted the code "is better to put ... in examples". #15774 asks to move that example into the user guide, usingcustom-table-providers.mdas the format reference.What changes are included in this PR?
docs/source/library-user-guide/extending-operators.mdwith a completeTopKwalkthrough:Sort+Limitplan it improves on,UserDefinedLogicalNodeCore),OptimizerRulethat rewritesLimit+Sortinto the node,ExecutionPlan) and its streaming reader,ExtensionPlanner/QueryPlannerwiring, andSessionStateand run a query.user_defined_plan.rsheader so the guide is the single source of the walkthrough; the header now links to the guide.This addresses the two follow-ups alamb raised on #15832:
On the first point: I kept the implementation in
user_defined_plan.rsrather than deleting the file, because the module has grown to also test user defined plan invariants (InvariantMock, thetopk_invariants*tests). Those tests are not documentation and would be lost on a full delete. Happy to move them elsewhere or delete more aggressively if you'd prefer.Following
custom-table-providers.md, the code blocks userust,ignore: they reference the surrounding types and a test-only schema, so they are illustrative rather than standalone-compilable.Are these changes tested?
The migrated code is the existing, tested
TopKimplementation;cargo test --test user_defined_integration -p datafusion topkstill passes (4 tests, including the invariant tests). The guide is rendered docs only.Are there any user-facing changes?
Documentation only. No API changes.