Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752709Ab2FXQzn (ORCPT ); Sun, 24 Jun 2012 12:55:43 -0400 Received: from mail-lb0-f174.google.com ([209.85.217.174]:53112 "EHLO mail-lb0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751244Ab2FXQzm convert rfc822-to-8bit (ORCPT ); Sun, 24 Jun 2012 12:55:42 -0400 MIME-Version: 1.0 In-Reply-To: References: Date: Sun, 24 Jun 2012 22:25:40 +0530 Message-ID: Subject: Re: [RFC][RT][PATCH] mm: Do not use stop_machine() for __zone_pcp_udpate() for CONFIG_PREEMPT_RT_FULL From: Aaditya Kumar To: KOSAKI Motohiro Cc: Thomas Gleixner , linux-kernel@vger.kernel.org, rostedt@goodmis.org, mingo@kernel.org, C.Emde@osadl.org, jkacur@redhat.com, frank.rowand@am.sony.com, tim.bird@am.sony.com, takuzo.ohara@ap.sony.com, kan.iibuchi@jp.sony.com, aaditya.kumar@ap.sony.com, KAMEZAWA Hiroyuki Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5742 Lines: 192 On Sat, Jun 23, 2012 at 9:07 AM, KOSAKI Motohiro wrote: > (6/19/12 2:32 PM), Aaditya Kumar wrote: >> The code path of __zone_pcp_update() has following locks, which in >> CONFIG_PREEMPT_RT_FULL=y are rt-mutex. >> ? - pa_lock locked by cpu_lock_irqsave() >> ? - zone->lock locked by free_pcppages_bulk() >> >> Since __zone_pcp_update() is called from stop_machine(), so with >> CONFIG_PREEMPT_RT_FULL=y >> we get following backtrace when __zone_pcp_update() is called during >> memory hot plugging while >> doing heavy file I/O. >> >> I think stop_machine() may not be required for calling __zone_pcp_update() >> in case of CONFIG_PREEMPT_RT_FULL=y as acquiring pa_lock in __zone_pcp_update() >> should be sufficient to isolate pcp pages and to setup per cpu pagesets. >> >> Can someone please let me know if am missing anything here? > Hello Kosaki-san, > First off, you should cc memory hotplug experts when discussing memory > hotplug topic. Sorry for that. > Second, stop_machine() is required because usually zone->pageset is > per-cpu variable. > the regular access rule is, 1) owner cpu can always access their own > pcp, 2) offlined cpu's > pcp can be accessed from any cpus because is has no race chance 3) > otherwise caller must > use stop_machine for preventing owner cpu accesses pcp. Thank you very much for your explanation, yes, my approach was not correct. Since "mm: page_alloc: rt-friendly per-cpu pages" from RT patch set introduces a preemptible lock (pa_lock which becomes an rt-mutex) for accessing pcp, (http://www.kernel.org/pub/linux/kernel/projects/rt/3.4/patch-3.4.3-rt12-rc1.patch.xz) So while calling zone_pcp_update() (with RT-patch set applied and with CONFIG_PREEMPT_RT_FULL=y), we have possibly two options to fix the BUG caused by taking a sleeping lock in stop_machine. Option 1. Revert the patch which introduces the sleeping(pa_lock) lock. Option 2. Fix the calling frame work.(Use another framework) Since usually memory hot plug is not that frequent an activity in the system, So a little more overhead occurred while taking option 2, I think is acceptable. My approach in my below patch for zone_pcp_update() is: 1. For each online cpu, setup pageset of a cpu by scheduling work on that cpu. 2. For each offline cpu, setup pageset of a cpu from current cpu. 3. Flush the all the work spawned in step1. I will re-send this as a formal patch if there are no objections to this approach. --- mm/page_alloc.c | 74 74 + 0 - 0 ! 1 file changed, 74 insertions(+) Index: b/mm/page_alloc.c =================================================================== --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -111,6 +111,16 @@ unsigned long totalreserve_pages __read_ int percpu_pagelist_fraction; gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; +#ifdef CONFIG_PREEMPT_RT_FULL +struct zone_pcp_work { + int cpu; + struct zone *zone; + struct work_struct work; +}; + +static DEFINE_PER_CPU(struct zone_pcp_work, zone_pcp_update_work); +#endif + #ifdef CONFIG_PM_SLEEP /* * The following functions are used by the suspend/hibernate code to temporarily @@ -3926,6 +3936,7 @@ int zone_wait_table_init(struct zone *zo return 0; } +#ifndef CONFIG_PREEMPT_RT_FULL static int __zone_pcp_update(void *data) { struct zone *zone = data; @@ -3949,13 +3960,69 @@ static int __zone_pcp_update(void *data) return 0; } +#else + +static void __zone_cpu_pcp_update(struct work_struct *work) +{ + struct zone_pcp_work *work_data = + container_of(work, struct zone_pcp_work, work); + struct zone *zone = work_data->zone; + int cpu = work_data->cpu; + unsigned long batch = zone_batchsize(zone), flags; + struct per_cpu_pageset *pset; + struct per_cpu_pages *pcp; + LIST_HEAD(dst); + + pset = per_cpu_ptr(zone->pageset, cpu); + pcp = &pset->pcp; + + cpu_lock_irqsave(cpu, flags); + isolate_pcp_pages(pcp->count, pcp, &dst); + free_pcppages_bulk(zone, pcp->count, &dst); + setup_pageset(pset, batch); + cpu_unlock_irqrestore(cpu, flags); + +} +#endif + void zone_pcp_update(struct zone *zone) { + +#ifdef CONFIG_PREEMPT_RT_FULL + int cpu; + + get_online_cpus(); + for_each_possible_cpu(cpu) { + struct zone_pcp_work *zone_pcp_work = + &per_cpu(zone_pcp_update_work, cpu); + zone_pcp_work->zone = zone; + zone_pcp_work->cpu = cpu; + + if (cpu_online(cpu)) + schedule_work_on(cpu, &zone_pcp_work->work); + else + __zone_cpu_pcp_update(&zone_pcp_work->work); + } + + for_each_online_cpu(cpu) { + struct zone_pcp_work *zone_pcp_work = + &per_cpu(zone_pcp_update_work, cpu); + + flush_work(&zone_pcp_work->work); + } + put_online_cpus(); + +#else + stop_machine(__zone_pcp_update, zone, NULL); +#endif } static __meminit void zone_pcp_init(struct zone *zone) { +#ifdef CONFIG_PREEMPT_RT_FULL + int cpu; +#endif /* * per cpu subsystem is not up at this point. The following code * relies on the ability of the linker to provide the @@ -3967,6 +4034,13 @@ static __meminit void zone_pcp_init(stru printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n", zone->name, zone->present_pages, zone_batchsize(zone)); +#ifdef CONFIG_PREEMPT_RT_FULL + for_each_possible_cpu(cpu) { + struct zone_pcp_work *zone_pcp_work = + &per_cpu(zone_pcp_update_work, cpu); + INIT_WORK(&zone_pcp_work->work, __zone_cpu_pcp_update); + } +#endif } __meminit int init_currently_empty_zone(struct zone *zone, > > stop_machine and send ipi are common technique for per-cpu area hack. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/