docs: move TopK user defined operator example into extending-operators guide #22871

Open
sjhddh wants to merge 3 commits into apache:main from sjhddh:docs/15774-extending-operators-custom-operator

@sjhddh

@sjhddh sjhddh commented Jun 10, 2026

Contributor

Which issue does this PR close?

Closes #15774.

Rationale for this change

The extending-operators user guide only documented the µWheel optimizer at a high level. The full worked example of building a custom operator lived in datafusion/core/tests/user_defined/user_defined_plan.rs, whose own module header noted the code "is better to put ... in examples". #15774 asks to move that example into the user guide, using custom-table-providers.md as the format reference.

What changes are included in this PR?

  • Expand docs/source/library-user-guide/extending-operators.md with a complete TopK walkthrough:
    • the problem and the naive Sort + Limit plan it improves on,
    • the logical node (UserDefinedLogicalNodeCore),
    • the OptimizerRule that rewrites Limit + Sort into the node,
    • the physical operator (ExecutionPlan) and its streaming reader,
    • the ExtensionPlanner / QueryPlanner wiring, and
    • how to register everything on a SessionState and run a query.
  • Trim the now-redundant narrative from the user_defined_plan.rs header so the guide is the single source of the walkthrough; the header now links to the guide.
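
To make the shape of the walkthrough concrete, here is a minimal, standard-library-only sketch of the same idea: an optimizer rewrite that turns a `Limit` over a `Sort` into a single TopK node, and an executor that serves that node from a bounded heap instead of a full sort. The `Plan` enum, `rewrite`, and `execute` are hypothetical stand-ins invented for this illustration. They are not DataFusion's `LogicalPlan`, `OptimizerRule`, or `ExecutionPlan`, and the guide's real code differs.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy plan type (hypothetical; not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(Vec<i64>),
    SortDesc(Box<Plan>),
    Limit(usize, Box<Plan>),
    TopK(usize, Box<Plan>),
}

/// Rewrite `Limit(n, SortDesc(input))` into `TopK(n, input)`,
/// recursing into the children of every other node.
fn rewrite(plan: Plan) -> Plan {
    match plan {
        Plan::Limit(n, child) => match *child {
            Plan::SortDesc(input) => Plan::TopK(n, Box::new(rewrite(*input))),
            other => Plan::Limit(n, Box::new(rewrite(other))),
        },
        Plan::SortDesc(input) => Plan::SortDesc(Box::new(rewrite(*input))),
        Plan::TopK(k, input) => Plan::TopK(k, Box::new(rewrite(*input))),
        scan @ Plan::Scan(_) => scan,
    }
}

/// Evaluate a plan eagerly. A real operator would stream batches instead.
fn execute(plan: &Plan) -> Vec<i64> {
    match plan {
        Plan::Scan(rows) => rows.clone(),
        Plan::SortDesc(input) => {
            // Naive path: sort every input row in descending order.
            let mut rows = execute(input);
            rows.sort_unstable_by(|a, b| b.cmp(a));
            rows
        }
        Plan::Limit(n, input) => execute(input).into_iter().take(*n).collect(),
        Plan::TopK(k, input) => {
            // Min-heap holding at most `k` values: the smallest retained
            // value is evicted whenever the heap grows past `k`.
            let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(*k + 1);
            for value in execute(input) {
                heap.push(Reverse(value));
                if heap.len() > *k {
                    heap.pop();
                }
            }
            // Only the `k` survivors need sorting for the final output.
            let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
            out.sort_unstable_by(|a, b| b.cmp(a));
            out
        }
    }
}
```

Building `Plan::Limit(3, SortDesc(Scan(..)))`, passing it through `rewrite`, and calling `execute` on both the original and rewritten plans should return the same rows. The difference is that the TopK path keeps at most `k` values in memory at a time, which is the motivation for fusing the two nodes.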

This addresses the two follow-ups alamb raised on #15832:

  1. Remove the redundant example. The explanatory walkthrough is removed from the test and now lives only in the guide.
  2. Add more detail. Each component has prose explaining what the trait methods are for, not just the code.

On the first point: I kept the implementation in user_defined_plan.rs rather than deleting the file, because the module has grown to also test user defined plan invariants (InvariantMock, the topk_invariants* tests). Those tests are not documentation and would be lost on a full delete. Happy to move them elsewhere or delete more aggressively if you'd prefer.

Following custom-table-providers.md, the code blocks use rust,ignore: they reference the surrounding types and a test-only schema, so they are illustrative rather than standalone-compilable.

Are these changes tested?

The migrated code is the existing, tested TopK implementation; cargo test --test user_defined_integration -p datafusion topk still passes (4 tests, including the invariant tests). The guide is rendered docs only.

Are there any user-facing changes?

Documentation only. No API changes.

…s guide

The `extending-operators` user guide previously only covered the µWheel
optimizer at a high level, while a full worked example of building a
custom operator lived in `user_defined_plan.rs` as a test. The test
module's own header noted the code would be better placed in the docs.

This moves the narrative walkthrough into the user guide: defining a
custom `UserDefinedLogicalNodeCore`, an `OptimizerRule` that rewrites
`Limit` + `Sort` into the node, a matching `ExecutionPlan`, the
`ExtensionPlanner`/`QueryPlanner` wiring, and how to register and run it.

`user_defined_plan.rs` keeps the implementation as a test because it also
exercises user defined plan invariants, which the documentation example
omits for clarity. Its header comment now points at the guide instead of
duplicating the walkthrough.

Closes apache#15774

Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) and core (Core DataFusion crate) labels Jun 10, 2026
@alamb

alamb commented Jun 10, 2026

Contributor

On the first point: I kept the implementation in user_defined_plan.rs rather than deleting the file, because the module has grown to also test user defined plan invariants (InvariantMock, the topk_invariants* tests). Those tests are not documentation and would be lost on a full delete. Happy to move them elsewhere or delete more aggressively if you'd prefer.

Would you be willing to move them, as a follow-on PR, to unit tests (if they aren't already covered)? I agree that piggybacking invariant tests on other examples makes them harder to follow.

@alamb alamb left a comment

Contributor

Thank you @sjhddh -- I kicked off the tests and left some comments about how to frame the example. Let me know what you think.

Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`:

```text
> EXPLAIN SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;

This is out of date as DataFusion now has a special TopK mode in its sort operator. I recommend we just point this out and then move on with the example

> EXPLAIN SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       SortExec(TopK)      │ |
|               | │    --------------------   │ |
|               | │          limit: 3         │ |
|               | │                           │ |
|               | │       revenue@1 DESC      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         EmptyExec         │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.020 seconds.

> EXPLAIN FORMAT INDENT SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;
+---------------+-------------------------------------------------------------------------------+
| plan_type     | plan                                                                          |
+---------------+-------------------------------------------------------------------------------+
| logical_plan  | Sort: sales.revenue DESC NULLS FIRST, fetch=3                                 |
|               |   TableScan: sales projection=[customer_id, revenue]                          |
| physical_plan | SortExec: TopK(fetch=3), expr=[revenue@1 DESC], preserve_partitioning=[false] |
|               |   EmptyExec                                                                   |
|               |                                                                               |
+---------------+-------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.

SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3;
```

Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`:

I think we should also note that DataFusion has a much more sophisticated TopK implementation built in, and that this one is just an example.

Maybe we can update the example so it disables the limit pushdown optimizer pass 🤔

Suggested change
Out of the box, DataFusion plans this as a `Sort` feeding a `Limit`:
Out of the box, DataFusion already contains an optimized TopK implementation and our example here
is just for demonstration purposes. If we disable the LimitPushdown optimization, we see the original plan is a `Sort` feeding a `Limit`:

let df = ctx
.sql("SELECT customer_id, revenue FROM sales ORDER BY revenue DESC LIMIT 3")
.await?;
df.show().await?;

It would be great to update the example here to use assert_batches_eq! so that the actual output is captured in the example too.

…hes_eq!)

- Note that DataFusion already ships an optimized TopK and frame this
  operator as a demonstration; the Sort -> Limit plan shown is the
  original plan with LimitPushdown disabled.
- Capture the example query output with assert_batches_eq! instead of
  df.show() so the expected result is visible in the docs.

Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
@sjhddh

sjhddh commented Jun 10, 2026

Contributor Author

Thanks for the review @alamb! Pushed an update:

  1. Reframed the intro to point out that DataFusion already ships an optimized TopK and that this operator is just for demonstration. Took your suggested wording verbatim. The Sort -> Limit EXPLAIN below it is now described as the original plan you get with LimitPushdown disabled, rather than the default. I left it as an illustrative block since there isn't a clean SQL knob to disable a single logical rule mid-session - happy to wire the example up to actually drop the pass from the optimizer if you'd prefer that over the prose note.

  2. Swapped df.show() for assert_batches_eq! in the "Putting It Together" block so the expected output is captured inline. Rows are taken from the runnable user_defined_plan.rs test.

On the follow-on: agreed the invariant tests piggybacking on the example hurts readability. I'll move InvariantMock / the topk_invariants* tests into proper unit tests in a separate PR so this one stays focused on the docs example.

Signed-off-by: sjhddh <151469562+sjhddh@users.noreply.github.com>
Labels

core Core DataFusion crate documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move code in user_defined_plan.rs to the extending-operators doc

3 participants