fix: Omit NULL values from build side of hash joins #22893

Open
neilconway wants to merge 1 commit into apache:main from neilconway:neilc/fix-hashjoin-nulls
@neilconway

Contributor

Which issue does this PR close?

Rationale for this change

Previously, HashMap-backed hash joins included NULLs but ArrayMap-backed hash joins omitted them. Under NullEqualsNothing, we can safely omit rows that have a NULL in any of their join keys, because they will never contribute to the output of the join. Omitting NULLs reduces the size of the build-side hash table.
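
The build-side rule above can be illustrated with a minimal sketch. This is not DataFusion's actual code: it uses a plain `std::collections::HashMap`, models SQL NULL as `None`, and the function name `build_side_map` is hypothetical. A build row is inserted only if every join-key column is non-NULL.

```rust
use std::collections::HashMap;

/// Hypothetical helper: map each complete join key to the build-side row
/// indices that carry it. `None` stands in for SQL NULL.
/// Rows with a NULL in any key column are never inserted.
fn build_side_map(rows: &[Vec<Option<i64>>]) -> HashMap<Vec<i64>, Vec<usize>> {
    let mut map: HashMap<Vec<i64>, Vec<usize>> = HashMap::new();
    for (idx, row) in rows.iter().enumerate() {
        // Collecting into Option<Vec<_>> yields None as soon as one
        // column is None, i.e. when any join key is NULL.
        if let Some(key) = row.iter().copied().collect::<Option<Vec<i64>>>() {
            map.entry(key).or_insert_with(Vec::new).push(idx);
        }
    }
    map
}
```

With four build rows of which two contain a NULL key column, only the two complete rows end up in the map, so the table shrinks in proportion to the number of NULL-keyed rows.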

Previously, the probe side also searched the hash table for probe rows with NULLs in their join keys. This was wasted work; indeed, because all NULL build rows end up in the same hash chain, it could be very expensive for joins over NULL-heavy data sets. For example, joining two 10k-row tables on all-NULL join keys took ~6 seconds (!). That drops to a few milliseconds after this PR.
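
A rough sketch of the probe-side change, under the same simplifying assumptions (a `std::collections::HashMap` stand-in, `None` for NULL, and a hypothetical function name `probe_skip_nulls`; none of this is DataFusion's API). The returned lookup counter is only there to make the skipped work visible. If every NULL probe row had to walk one chain holding all 10k NULL build rows, that would be on the order of 10^8 key comparisons, which is the cost this check avoids.

```rust
use std::collections::HashMap;

/// Hypothetical helper: probe `build` (join key -> build row indices) with
/// `probe_keys`. Returns the matched (build_row, probe_row) pairs and the
/// number of hash-table lookups that were actually performed.
fn probe_skip_nulls(
    build: &HashMap<i64, Vec<usize>>,
    probe_keys: &[Option<i64>],
) -> (Vec<(usize, usize)>, usize) {
    let mut matches = Vec::new();
    let mut lookups = 0;
    for (probe_row, key) in probe_keys.iter().enumerate() {
        // Under NullEqualsNothing a NULL key can never match, so skip the
        // row without touching the hash table at all.
        let Some(k) = key else { continue };
        lookups += 1;
        if let Some(build_rows) = build.get(k) {
            for &build_row in build_rows {
                matches.push((build_row, probe_row));
            }
        }
    }
    (matches, lookups)
}
```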

What changes are included in this PR?

  • Omit build rows with one or more NULLs in their join keys from HashMap
  • Don't probe the map for probe rows with NULLs in their join keys
  • Fix a few places that assumed an empty build-side hash table meant the build input was empty
  • Add unit tests
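
The third item follows from the first: once NULL-keyed rows are left out, the hash table can be empty while the build input is not. The sketch below is an illustration only, with a hypothetical `BuildSide` struct and method names that are not taken from DataFusion. It tracks the build row count separately from the map, so any logic that must still account for build rows (for example, a join type that emits unmatched build rows) can ask the right question.

```rust
use std::collections::HashMap;

/// Hypothetical build-side state: the hash table plus the total number of
/// build rows seen, including rows skipped because of a NULL key.
struct BuildSide {
    map: HashMap<i64, Vec<usize>>,
    num_rows: usize,
}

impl BuildSide {
    fn new(keys: &[Option<i64>]) -> Self {
        let mut map: HashMap<i64, Vec<usize>> = HashMap::new();
        for (idx, key) in keys.iter().enumerate() {
            // NULL keys (None) are not inserted into the table.
            if let Some(k) = key {
                map.entry(*k).or_insert_with(Vec::new).push(idx);
            }
        }
        BuildSide { map, num_rows: keys.len() }
    }

    /// True when no build row can ever match a probe row.
    fn map_is_empty(&self) -> bool {
        self.map.is_empty()
    }

    /// True only when the build input itself produced no rows.
    fn input_is_empty(&self) -> bool {
        self.num_rows == 0
    }
}
```

For an all-NULL build input, `map_is_empty()` is true while `input_is_empty()` is false, which is exactly the case the old assumption conflated.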

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 10, 2026
@github-actions

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  35.969s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.127s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.305s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.123s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.662s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_parameter_count_changed.ron

Failed in:
  datafusion_physical_plan::joins::join_hash_map::get_matched_indices_with_limit_offset now takes 8 parameters instead of 7, in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/joins/join_hash_map.rs:389
  datafusion_physical_plan::joins::utils::update_hash now takes 9 parameters instead of 8, in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/joins/utils.rs:2113

--- failure trait_method_parameter_count_changed: pub trait method parameter count changed ---

Description:
A trait method now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-item-signature
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_method_parameter_count_changed.ron

Failed in:
  JoinHashMapType::get_matched_indices_with_limit_offset now takes 6 instead of 5 parameters, in file /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/joins/join_hash_map.rs:124
  JoinHashMapType::get_matched_indices_with_limit_offset now takes 6 instead of 5 parameters, in file /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/joins/join_hash_map.rs:124

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  74.076s] datafusion-physical-plan

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 10, 2026
Development

Successfully merging this pull request may close these issues.

Hash join should omit NULLs from build side under NullEqualsNothing