81 lines
3.4 KiB
Diff
81 lines
3.4 KiB
Diff
From ea3ec10cacc746176a25dbd74c8d168e1c096a62 Mon Sep 17 00:00:00 2001
|
|
From: Omar Sandoval <osandov@fb.com>
|
|
Date: Fri, 25 Apr 2025 01:51:24 -0700
|
|
Subject: sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
|
|
|
|
There is a code path in dequeue_entities() that can set the slice of a
|
|
sched_entity to U64_MAX, which sometimes results in a crash.
|
|
|
|
The offending case is when dequeue_entities() is called to dequeue a
|
|
delayed group entity, and then the entity's parent's dequeue is delayed.
|
|
In that case:
|
|
|
|
1. In the if (entity_is_task(se)) else block at the beginning of
|
|
dequeue_entities(), slice is set to
|
|
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
|
|
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
|
|
2. The first for_each_sched_entity() loop dequeues the entity.
|
|
3. If the entity was its parent's only child, then the next iteration
|
|
tries to dequeue the parent.
|
|
4. If the parent's dequeue needs to be delayed, then it breaks from the
|
|
first for_each_sched_entity() loop _without updating slice_.
|
|
5. The second for_each_sched_entity() loop sets the parent's ->slice to
|
|
the saved slice, which is still U64_MAX.
|
|
|
|
This throws off subsequent calculations with potentially catastrophic
|
|
results. A manifestation we saw in production was:
|
|
|
|
6. In update_entity_lag(), se->slice is used to calculate limit, which
|
|
ends up as a huge negative number.
|
|
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
|
|
is negative, vlag > limit, so se->vlag is set to the same huge
|
|
negative number.
|
|
8. In place_entity(), se->vlag is scaled, which overflows and results in
|
|
another huge (positive or negative) number.
|
|
9. The adjusted lag is subtracted from se->vruntime, which increases or
|
|
decreases se->vruntime by a huge number.
|
|
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
|
|
incorrectly returns false because the vruntime is so far from the
|
|
other vruntimes on the queue, causing the
|
|
(vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
|
|
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
|
|
12. pick_next_entity() tries to dereference the return value of
|
|
pick_eevdf() and crashes.
|
|
|
|
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
|
|
huge vruntime ranges and bogus vlag values, and I also traced se->slice
|
|
being set to U64_MAX on live systems (which was usually "benign" since
|
|
the rest of the runqueue needed to be in a particular state to crash).
|
|
|
|
Fix it in dequeue_entities() by always setting slice from the first
|
|
non-empty cfs_rq.
|
|
|
|
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
|
|
Signed-off-by: Omar Sandoval <osandov@fb.com>
|
|
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
|
|
---
|
|
kernel/sched/fair.c | 4 +---
|
|
1 file changed, 1 insertion(+), 3 deletions(-)
|
|
|
|
--- a/kernel/sched/fair.c
|
|
+++ b/kernel/sched/fair.c
|
|
@@ -7096,9 +7096,6 @@ static int dequeue_entities(struct rq *r
|
|
h_nr_idle = task_has_idle_policy(p);
|
|
if (task_sleep || task_delayed || !se->sched_delayed)
|
|
h_nr_runnable = 1;
|
|
- } else {
|
|
- cfs_rq = group_cfs_rq(se);
|
|
- slice = cfs_rq_min_slice(cfs_rq);
|
|
}
|
|
|
|
for_each_sched_entity(se) {
|
|
@@ -7108,6 +7105,7 @@ static int dequeue_entities(struct rq *r
|
|
if (p && &p->se == se)
|
|
return -1;
|
|
|
|
+ slice = cfs_rq_min_slice(cfs_rq);
|
|
break;
|
|
}
|
|
|