Commit e65018a
[SPARK-56511][CORE] Fix NPE in ShuffleInMemorySorter.getMemoryUsage after failed reset
### What changes were proposed in this pull request?
Null-check `array` in `ShuffleInMemorySorter.getMemoryUsage()`.
When `ShuffleExternalSorter.spill()` → `ShuffleInMemorySorter.reset()` → `MemoryConsumer.allocateArray()` throws an OOM, `ShuffleInMemorySorter.array` is left null. The OOM propagates to `UnsafeShuffleWriter.write()`'s finally block, which calls `ShuffleExternalSorter.cleanupResources()` → `freeMemory()` → `updatePeakMemoryUsed()` → `ShuffleInMemorySorter.getMemoryUsage()`, which throws an NPE on `array.size()`.
Example stack trace we see in prod:
```
java.lang.NullPointerException
at org.apache.spark.shuffle.sort.ShuffleInMemorySorter.getMemoryUsage(ShuffleInMemorySorter.java:131)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.getMemoryUsage(ShuffleExternalSorter.java:349)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:472)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:297)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:213)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:87)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:82)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
...
```
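The failure chain described above can be modelled in a few lines. This is a minimal sketch under stated assumptions, not the Spark code: `FailedResetSketch`, its nested `Sorter`, and the `allocationFails` flag are hypothetical stand-ins, a plain `long[]` replaces `LongArray`, and an `IllegalStateException` simulates the allocation OOM.

```java
// Simplified stand-in for the failure chain; none of these names are Spark classes.
final class FailedResetSketch {
  /** Hypothetical stand-in for the in-memory sorter's pointer array handling. */
  static final class Sorter {
    private long[] array = new long[16];

    /** Releases the old array first, then tries to allocate a new one (may fail). */
    void reset(boolean allocationFails) {
      array = null; // old array is released before reallocation
      if (allocationFails) {
        throw new IllegalStateException("simulated OOM during allocation");
      }
      array = new long[16];
    }

    /** Unguarded version: dereferences array without a null check. */
    long getMemoryUsage() {
      return array.length * 8L;
    }
  }

  /** Returns the simple name of the exception that escapes the write path. */
  static String run() {
    Sorter sorter = new Sorter();
    try {
      try {
        sorter.reset(true);      // spill -> reset -> allocation failure
      } finally {
        sorter.getMemoryUsage(); // cleanup path reads usage -> NPE
      }
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
    return "none";
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this model, the exception thrown from the `finally` block replaces the original allocation failure, so the caller only sees the `NullPointerException`.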
Returning 0 from `ShuffleInMemorySorter.getMemoryUsage()` is correct: when `array` is null, the pointer array was already freed by `ShuffleInMemorySorter.reset()` and never reallocated, so the actual memory usage is zero. The value is only consumed by `ShuffleExternalSorter.updatePeakMemoryUsed()` for bookkeeping.
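A minimal sketch of the guard and its effect on peak bookkeeping, assuming a simplified model: `GuardedUsageSketch`, `Sorter`, and `PeakTracker` are hypothetical names for illustration, and a `long[]` stands in for the pointer array. It is not the patch itself, only an illustration of why reporting 0 leaves the recorded peak unchanged.

```java
// Illustrative sketch only; Sorter and PeakTracker are hypothetical stand-ins,
// not the Spark classes.
final class GuardedUsageSketch {
  static final class Sorter {
    private long[] array = new long[16];

    /** Releases the pointer array, leaving the field null. */
    void free() {
      array = null;
    }

    /** Reports 0 bytes when no array is held instead of dereferencing null. */
    long getMemoryUsage() {
      if (array == null) {
        return 0L;
      }
      return array.length * 8L; // 8 bytes per long entry
    }
  }

  /** Tracks the largest usage value seen so far. */
  static final class PeakTracker {
    private long peak;

    void update(long current) {
      if (current > peak) {
        peak = current;
      }
    }

    long peak() {
      return peak;
    }
  }

  public static void main(String[] args) {
    Sorter sorter = new Sorter();
    PeakTracker tracker = new PeakTracker();
    tracker.update(sorter.getMemoryUsage()); // records usage while the array is held
    sorter.free();
    tracker.update(sorter.getMemoryUsage()); // 0 never lowers the recorded peak
    System.out.println(sorter.getMemoryUsage() + " " + tracker.peak());
  }
}
```

Because the tracker only ever raises its peak, a usage of 0 from a freed sorter is harmless to the bookkeeping.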
### Why are the changes needed?
Without the fix, the NPE in `cleanupResources()` prevents `ShuffleInMemorySorter.free()` and page cleanup from running, causing a memory leak on top of the original OOM.
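To illustrate why an exception early in cleanup causes a leak, here is a hedged sketch in which a bookkeeping step runs before page release. All names (`CleanupLeakSketch`, `memoryUsage`, `pagesLeftAfterCleanup`) are hypothetical, and the list of `long[]` pages is only a stand-in for allocated memory pages.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the names below are hypothetical, not Spark APIs.
final class CleanupLeakSketch {
  /** Memory usage of a pointer array that may be null after a failed reset. */
  static long memoryUsage(long[] array, boolean guarded) {
    if (guarded && array == null) {
      return 0L;
    }
    return array.length * 8L; // throws NullPointerException when array is null
  }

  /** Runs cleanup against a sorter whose array is null; returns pages still held. */
  static int pagesLeftAfterCleanup(boolean guarded) {
    List<long[]> pages = new ArrayList<>();
    pages.add(new long[4]);
    pages.add(new long[4]);
    long[] array = null; // state left behind by the failed reset

    try {
      memoryUsage(array, guarded); // bookkeeping step runs first
      pages.clear();               // page release runs only if bookkeeping succeeds
    } catch (NullPointerException e) {
      // In the real failure the exception escapes; it is caught here only
      // so the leftover page count can be observed.
    }
    return pages.size();
  }

  public static void main(String[] args) {
    System.out.println("unguarded: " + pagesLeftAfterCleanup(false));
    System.out.println("guarded: " + pagesLeftAfterCleanup(true));
  }
}
```

In the unguarded case the pages are never released because the bookkeeping call throws first; with the guard, cleanup runs to completion.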
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Unit test (`testGetMemoryUsageAfterFree`) verifying `ShuffleInMemorySorter.getMemoryUsage()` returns 0 after `free()`
- Integration test in `ShuffleExternalSorterSuite`: constrains memory so `ShuffleInMemorySorter.reset()` → `allocateArray()` fails with OOM, then verifies `ShuffleExternalSorter.cleanupResources()` does not throw
Before fix: `NullPointerException: Cannot invoke "LongArray.size()" because "this.array" is null at ShuffleExternalSorter.cleanupResources`
After fix: test passes.
### Was this patch authored or co-authored using generative AI tooling?
Yes.
Closes #55373 from timlee0119/fix-shuffle-sorter-npe.
Authored-by: Tim Lee <tim.lee@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
1 parent 57b6522 commit e65018a
File tree
3 files changed: +101 -0 lines changed
- core/src
  - main/java/org/apache/spark/shuffle/sort
  - test
    - java/org/apache/spark/shuffle/sort
    - scala/org/apache/spark/shuffle/sort
Lines changed: 3 additions & 0 deletions (new lines 131-133)
Lines changed: 9 additions & 0 deletions (new lines 123-131)
Lines changed: 89 additions & 0 deletions (new lines 117-205)