Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp6275552ybl; Wed, 15 Jan 2020 01:46:37 -0800 (PST) X-Google-Smtp-Source: APXvYqxH0tGXy9Sd4fdIywYL/IoKuAwWVIUUCFePLlH/yDFFvdeOF2J1cX3SjAv+J2F/c9baY47w X-Received: by 2002:aca:b286:: with SMTP id b128mr20740143oif.147.1579081596849; Wed, 15 Jan 2020 01:46:36 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1579081596; cv=none; d=google.com; s=arc-20160816; b=VE7QCis25zdoi1ottKlYcaY6TRqqEupRHbb+fit6Gi6sa5nBMSSvUnVjEDfHeRuv5R XEER2pBzmtOyT9E80F3rlBIZlom7mmEdJdTrvFtotuv0WvrHA9lHED/h4nhYGAWQGGf+ 9e96MxuWpqf3IkmdFIAojYjooZGtAAiVbdZ16W7p4ZrpeiJZ68PlGqyzJW6ANKGZ7mDW aULtxLUjNPEx2+43frGsGdHX2jKUTtfBK0w6Ag36FOjPZ8q8RbJ48JT4ZdeKl/AQ3CgQ RFn/SzOlanfSzhuZcG3YcliPpsn31r90FE2/vkLc0jJXS9MMn9lQVssW34uClCdMwrCn TfyQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=1ROfISiH4ooExFgjHNAdgfYubJgODcwW3R9XcNZ3seo=; b=OVfNjffD082bMjrhIO1YWlvc11xudbHCDbLjgsM93Whnfx2rMa4QiJyyswN5b8pVQP KhOM6UqtGGFkYpvPPpyJdjzsc42WO389xmFMTpVYXW1xPF6IYjokUwiBhAs1DTrKagEt 09pZmVV97qhY60v2cBBBjhyJcGiO8FdG6BTQ4dSfoFloS7fKalGJX6NOUvxRRqECdQz4 3iOxb+hrZCM2yly2eRmVaiPoSg4xiEygP2JfEJJAIN/3baGSCGD+rA50kCyKhsUXpI02 bLwJmThnF6ojNuQiT8mpKMeFd9jEBAzah/g6kMAegw9fVDZM5Ur2iVq9o8ivuD48+3Pp QWqA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=virtuozzo.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l11si9287892oie.231.2020.01.15.01.46.24; Wed, 15 Jan 2020 01:46:36 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=virtuozzo.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729536AbgAOJpY (ORCPT + 99 others); Wed, 15 Jan 2020 04:45:24 -0500 Received: from relay.sw.ru ([185.231.240.75]:43806 "EHLO relay.sw.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729459AbgAOJpX (ORCPT ); Wed, 15 Jan 2020 04:45:23 -0500 Received: from dhcp-172-16-24-104.sw.ru ([172.16.24.104]) by relay.sw.ru with esmtp (Exim 4.92.3) (envelope-from ) id 1irfEj-0002LN-LS; Wed, 15 Jan 2020 12:45:17 +0300 Subject: Re: [PATCH RESEND] mm: fix tick_sched timer blocked by pgdat_resize_lock To: Shile Zhang , Andrew Morton , Pavel Tatashin Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20200110082510.172517-2-shile.zhang@linux.alibaba.com> <20200110093053.34777-1-shile.zhang@linux.alibaba.com> <1ee6088c-9e72-8824-3a9a-fc099d196faf@virtuozzo.com> <9420eab3-5e5e-150f-53c9-6cd40bacf859@virtuozzo.com> From: Kirill Tkhai Message-ID: Date: Wed, 15 Jan 2020 12:45:17 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.3.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 14.01.2020 11:54, Shile Zhang wrote: > > > On 2020/1/13 16:11, Kirill Tkhai wrote: >> On 13.01.2020 03:54, Shile Zhang wrote: >>> >>> On 2020/1/10 19:42, Kirill Tkhai wrote: >>>> On 10.01.2020 12:30, Shile Zhang wrote: >>>>> When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdat_resize_lock' >>>>> will be called inside 'pgdatinit' kthread to initialise the deferred >>>>> pages with local interrupts disabled. Which is introduced by >>>>> commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred >>>>> pages"). >>>>> >>>>> But 'pgdatinit' kthread is possible be pined on the boot CPU (CPU#0 by >>>>> default), especially in small system with NRCPUS <= 2. In this case, the >>>>> interrupts are disabled on boot CPU during memory initialising, which >>>>> caused the tick_sched timer be blocked, leading to wall clock stuck. >>>>> >>>>> Fixes: commit 3a2d7fa8a3d5 ("mm: disable interrupts while initializing >>>>> deferred pages") >>>>> >>>>> Signed-off-by: Shile Zhang >>>>> --- >>>>>    include/linux/memory_hotplug.h | 16 ++++++++++++++-- >>>>>    1 file changed, 14 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h >>>>> index ba0dca6aac6e..be69a6dc4fee 100644 >>>>> --- a/include/linux/memory_hotplug.h >>>>> +++ b/include/linux/memory_hotplug.h >>>>> @@ -6,6 +6,8 @@ >>>>>    #include >>>>>    #include >>>>>    #include >>>>> +#include >>>>> +#include >>>>>      struct page; >>>>>    struct zone; >>>>> @@ -282,12 +284,22 @@ static inline bool movable_node_is_enabled(void) >>>>>    static inline >>>>>    void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags) >>>>>    { >>>>> -    spin_lock_irqsave(&pgdat->node_size_lock, *flags); >>>>> +    /* >>>>> +     * Disable local interrupts on boot CPU will stop the tick_sched >>>>> +     * timer, which will block jiffies(wall clock) update. >>>>> +     */ >>>>> +    if (current->cpu != get_boot_cpu_id()) >>>>> +        spin_lock_irqsave(&pgdat->node_size_lock, *flags); >>>>> +    else >>>>> +        spin_lock(&pgdat->node_size_lock); >>>>>    } >>>>>    static inline >>>>>    void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags) >>>>>    { >>>>> -    spin_unlock_irqrestore(&pgdat->node_size_lock, *flags); >>>>> +    if (current->cpu != get_boot_cpu_id()) >>>>> +        spin_unlock_irqrestore(&pgdat->node_size_lock, *flags); >>>>> +    else >>>>> +        spin_unlock(&pgdat->node_size_lock); >>>>>    } >>>>>    static inline >>>>>    void pgdat_resize_init(struct pglist_data *pgdat) >>>> 1)Linux kernel is *preemptible*. Kernel with CONFIG_PREEMPT_RT option even may preempt >>>> *kernel* code in the middle of function. When you are executing a code containing >>>> pgdat_resize_lock() and pgdat_resize_unlock(), the process may migrate to another cpu >>>> between them. >>>> >>>> bool cpu               another cpu >>>> ---------------------------------- >>>> pgdat_resize_lock() >>>>     spin_lock() >>>>     --> migrate to another cpu >>>>                         pgdat_resize_unlock() >>>>                         spin_unlock_irqrestore() >>>> >>>> (Yes, in case of CONFIG_PREEMPT_RT, process is preemptible even after spin_lock() call). >>>> >>>> This looks like a bad helpers, and we should not introduce such the design. >>> Hi Kirill, >>> >>> Thanks for your comments! >>> Sorry for I'm not very clear about this lock/unlock, but I encountered this issue >>> with "CONFIG_PREEMPT is not set". >> The thing is we simply shouldn't introduce such the primitives since the thread >> may migrate to another cpu, while you own the lock. This looks like a buggy design. >> >>>> 2)I think there is no the problem this patch solves. Do we really this statistics? >>>> Can't we simple remove print message from deferred_init_memmap() and solve this? >>> Sorry for I've not put this issue very clearly. It's *not* just one statistics log >>> with wrong time calculate, but the wall clock is stuck. >>> So the 'systemd-analyze' command also give a wrong time as I mentioned in the cover >>> letter. I don't think is OK just remove the log, it cannot solve the wall clock latency. >> Have you tried temporary enabling interrupts in the middle of cycle after a huge enough >> memory block is initialized? Something like: >> >> deferred_init_memmap() >> { >>     while (spfn < epfn) { >>         nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn); >>         local_irq_enable(); >>         local_irq_disable(); >>     } >> } > > Yes, I'd tried this for issue confirm, before I sent this patch. Likes I mentioned the debug log in cover letter, I also add mdelay between local_irq_enable/disable, this system jiffies is stuck without update. > So I think there must be problem to use spin_lock_irqsave in early boot path on boot CPU. This time SMP is enabled. You have many threads are running. Interrupts are enabled and they occur. So, it's OK to disable interrupts for some time. My opinion is we should try to enable interrupts in the cycle after some fixed amount of memory is initialized. Say, every 1GB. This should resolve two problems: handling timer interrupt with update jiffies at time, and keeping the fix for the issue, that Pavel fixed in 3a2d7fa8a3d5. >> Or, maybe, enable/disable interrupts somewhere inside deferred_init_maxorder(). >> >>>> Also, you may try to check that sched_clock() gives better results with interrupts >>>> disabled (on x86 it uses rdtsc, when it's possible. But it also may fallback to >>>> jiffies-based clock in some hardware cases, and they also won't go with interrupts >>>> disabled). >