From: Vlastimil Babka
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Vlastimil Babka, Joonsoo Kim,
    David Rientjes, Pekka Enberg, Christoph Lameter, Tejun Heo,
    Lai Jiangshan, John Stultz, Thomas Gleixner, Stephen Boyd
Subject: [RFC] mm, slab: reschedule cache_reap() on the same CPU
Date: Tue, 10 Apr 2018 10:15:31 +0200
Message-Id: <20180410081531.18053-1-vbabka@suse.cz>

cache_reap() is initially scheduled in start_cpu_timer() via
schedule_delayed_work_on(), but subsequent iterations are scheduled via
schedule_delayed_work(), i.e. with WORK_CPU_UNBOUND. AFAIU there is
therefore no guarantee that future iterations will run on the originally
intended cpu, although it is still preferred.
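
Roughly speaking (abridged from include/linux/workqueue.h; the exact
helpers may differ slightly between kernel versions), the plain
schedule_delayed_work() path boils down to queueing on the system
workqueue with WORK_CPU_UNBOUND:

	static inline bool schedule_delayed_work(struct delayed_work *dwork,
						 unsigned long delay)
	{
		return queue_delayed_work(system_wq, dwork, delay);
	}

	static inline bool queue_delayed_work(struct workqueue_struct *wq,
					      struct delayed_work *dwork,
					      unsigned long delay)
	{
		/* no cpu is pinned here, unlike queue_delayed_work_on(cpu, ...) */
		return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
	}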

I was able to demonstrate this with
/sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC the timer code,
it may also happen due to migrating timers in nohz context. As a result,
some cpus would call cache_reap() more frequently while others never
would. Even worse would be a potential scenario where WORK_CPU_UNBOUND
results in the work being run via a kworker thread that is not pinned to
any single CPU (although I haven't observed that in my simple tests).

Migration to another CPU during cache_reap(), e.g. between cpu_cache_get()
and drain_array(), would result in operating on a non-local cpu array
cache and might race with the other cpu. Migration to a NUMA node other
than the one obtained with numa_mem_id() could result in slabs being moved
to a list on the wrong node, which would then be modified under the wrong
lock, again potentially racing.

This patch makes sure schedule_delayed_work_on() is used with the proper
cpu when scheduling the next iteration. The cpu is stored together with
the delayed_work in a new slab_reap_work_struct super-structure.

Signed-off-by: Vlastimil Babka
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Pekka Enberg
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Lai Jiangshan
Cc: John Stultz
Cc: Thomas Gleixner
Cc: Stephen Boyd
---
Hi,

this patch is the result of hunting down some rare crashes in our
(4.4-based) kernel, where slabs misplaced on the wrong node were
identified in the crash dumps. I don't yet know whether cache_reap() is
the culprit and whether this patch will fix it, but the problem seems real
to me nevertheless.

I've CC'd the workqueue and timer maintainers and would like to check that
my assumptions in the changelog are correct, and especially whether there
is a guarantee that work scheduled with schedule_delayed_work_on(cpu) will
never migrate to another cpu. If that's not guaranteed (including in past
stable kernel versions), we will have to be even more careful and e.g.
disable interrupts sooner.

Thanks,
Vlastimil

 mm/slab.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)
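
To make the approach easier to review in isolation, here is a minimal,
self-contained sketch (illustrative names, not part of the patch) of the
pattern the diff below applies to slab_reap_work: embed the delayed_work
in a per-cpu structure together with the cpu number, recover the structure
in the handler via container_of(), and reschedule explicitly on the stored
cpu instead of relying on WORK_CPU_UNBOUND:

	#include <linux/workqueue.h>
	#include <linux/percpu.h>
	#include <linux/smp.h>
	#include <linux/timer.h>

	struct my_percpu_work {
		struct delayed_work dwork;
		int cpu;
	};

	static DEFINE_PER_CPU(struct my_percpu_work, my_work);

	static void my_work_fn(struct work_struct *w)
	{
		struct delayed_work *dwork = to_delayed_work(w);
		struct my_percpu_work *pw =
			container_of(dwork, struct my_percpu_work, dwork);

		/* should hold as long as the work is never migrated */
		WARN_ON_ONCE(pw->cpu != smp_processor_id());

		/* ... per-cpu processing ... */

		/* reschedule on the original cpu, not WORK_CPU_UNBOUND */
		schedule_delayed_work_on(pw->cpu, dwork,
					 round_jiffies_relative(HZ));
	}

	static void my_work_start(int cpu)
	{
		struct my_percpu_work *pw = &per_cpu(my_work, cpu);

		pw->cpu = cpu;
		INIT_DEFERRABLE_WORK(&pw->dwork, my_work_fn);
		schedule_delayed_work_on(cpu, &pw->dwork,
					 round_jiffies_relative(HZ));
	}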

diff --git a/mm/slab.c b/mm/slab.c
index 9095c3945425..b3e3d082099c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -429,7 +429,12 @@ static struct kmem_cache kmem_cache_boot = {
 	.name = "kmem_cache",
 };
 
-static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);
+struct slab_reap_work_struct {
+	struct delayed_work dwork;
+	int cpu;
+};
+
+static DEFINE_PER_CPU(struct slab_reap_work_struct, slab_reap_work);
 
 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
 {
@@ -551,12 +556,15 @@ static void next_reap_node(void)
  */
 static void start_cpu_timer(int cpu)
 {
-	struct delayed_work *reap_work = &per_cpu(slab_reap_work, cpu);
-
-	if (reap_work->work.func == NULL) {
+	struct slab_reap_work_struct *reap_work = &per_cpu(slab_reap_work, cpu);
+	struct delayed_work *dwork = &reap_work->dwork;
+
+	if (dwork->work.func == NULL) {
+		reap_work->cpu = cpu;
 		init_reap_node(cpu);
-		INIT_DEFERRABLE_WORK(reap_work, cache_reap);
-		schedule_delayed_work_on(cpu, reap_work,
+		INIT_DEFERRABLE_WORK(dwork, cache_reap);
+		schedule_delayed_work_on(cpu, dwork,
 					__round_jiffies_relative(HZ, cpu));
 	}
 }
@@ -1120,9 +1128,9 @@ static int slab_offline_cpu(unsigned int cpu)
 	 * expensive but will only modify reap_work and reschedule the
 	 * timer.
 	 */
-	cancel_delayed_work_sync(&per_cpu(slab_reap_work, cpu));
+	cancel_delayed_work_sync(&per_cpu(slab_reap_work, cpu).dwork);
 	/* Now the cache_reaper is guaranteed to be not running. */
-	per_cpu(slab_reap_work, cpu).work.func = NULL;
+	per_cpu(slab_reap_work, cpu).dwork.work.func = NULL;
 	return 0;
 }
 
@@ -4027,11 +4035,15 @@ static void cache_reap(struct work_struct *w)
 	struct kmem_cache_node *n;
 	int node = numa_mem_id();
 	struct delayed_work *work = to_delayed_work(w);
+	struct slab_reap_work_struct *reap_work =
+		container_of(work, struct slab_reap_work_struct, dwork);
 
 	if (!mutex_trylock(&slab_mutex))
 		/* Give up. Setup the next iteration. */
 		goto out;
 
+	WARN_ON_ONCE(reap_work->cpu != smp_processor_id());
+
 	list_for_each_entry(searchp, &slab_caches, list) {
 		check_irq_on();
 
@@ -4074,7 +4086,8 @@ static void cache_reap(struct work_struct *w)
 	next_reap_node();
 out:
 	/* Set up the next iteration */
-	schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_AC));
+	schedule_delayed_work_on(reap_work->cpu, work,
+			round_jiffies_relative(REAPTIMEOUT_AC));
 }
 
 void get_slabinfo(struct kmem_cache *cachep, struct slabinfo *sinfo)
-- 
2.16.3