Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751199AbdHAI7D (ORCPT ); Tue, 1 Aug 2017 04:59:03 -0400 Received: from mail-it0-f67.google.com ([209.85.214.67]:34769 "EHLO mail-it0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751012AbdHAI7B (ORCPT ); Tue, 1 Aug 2017 04:59:01 -0400 MIME-Version: 1.0 In-Reply-To: <20170801083812.GH6524@worktop.programming.kicks-ass.net> References: <1501581716-8608-1-git-send-email-laoar.shao@gmail.com> <20170801083812.GH6524@worktop.programming.kicks-ass.net> From: Yafang Shao Date: Tue, 1 Aug 2017 16:58:59 +0800 Message-ID: Subject: Re: [PATCH] sched: fix NULL pointer issue in pick_next_entity() To: Peter Zijlstra Cc: mingo@redhat.com, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2325 Lines: 65 Hi Peter, Thanks for your response! > CFS, CFQ is a block IO scheduler. Sorry for the mistake. It should be CFS. > And how would that happen? We only call pick_next_entity(.curr=NULL) > when we _know_ cfs_rq->nr_running. It crashed my machine when I did hadoop test, and after I made this change it works now. On SMP system, cfs_rq->nr_running isn't protected well, although we _know_ cfs_rq->nr_running, but it is modified by other thread running on other CPU and the sched_entity is set NULL as well. Then this thread broken here as accessed the NULL pointer here. > You diff is broken, this is very much not the clear_buddies() function. You're right. it is in pick_next_entity(). 2017-08-01 16:38 GMT+08:00 Peter Zijlstra : > On Tue, Aug 01, 2017 at 06:01:56PM +0800, Yafang Shao wrote: >> When we select CFQ as the scheduler, in function pick_next_task_fair > > CFS, CFQ is a block IO scheduler. > >> it will pass NULL as the 2nd argument to pick_next_entity: >> pick_next_entity(cfs_rq, NULL); >> >> And once __pick_first_entity() is called, it could return NULL as well. >> >> So in function pick_next_entity(), the local variable 'left' and 'curr' >> could both be NULL, then this will cause NULL pointer issue. >> >> In order to fix this issue, we just need return NULL under the condition >> that both 'left' and 'curr' are NULL, meaning that no entity available. > > And how would that happen? We only call pick_next_entity(.curr=NULL) > when we _know_ cfs_rq->nr_running. > >> Signed-off-by: Yafang Shao >> --- >> kernel/sched/fair.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index c95880e..e64c359 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -3903,6 +3903,8 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se) > > You diff is broken, this is very much not the clear_buddies() function. > >> struct sched_entity *left = __pick_first_entity(cfs_rq); >> struct sched_entity *se; >> >> + if (!left && !curr) >> + return NULL; >> /* >> * If curr is set we have to see if its left of the leftmost entity >> * still in the tree, provided there was anything in the tree at all. >> -- >> 1.8.3.1 >>