Date: Wed, 22 Aug 2012 23:24:59 +0200
From: Andrea Arcangeli
To: Andi Kleen
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 19/36] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection
Message-ID: <20120822212459.GC8107@redhat.com>
References: <1345647560-30387-1-git-send-email-aarcange@redhat.com> <1345647560-30387-20-git-send-email-aarcange@redhat.com>

Hi Andi,

On Wed, Aug 22, 2012 at 01:19:04PM -0700, Andi Kleen wrote:
> Andrea Arcangeli writes:
>
> > +/*
> > + * In this function we build a temporal CPU_node<->page relation by
> > + * using a two-stage autonuma_last_nid filter to remove short/unlikely
> > + * relations.
> > + *
> > + * Using P(p) ~ n_p / n_t as per frequentist probability, we can
> > + * equate a node's CPU usage of a particular page (n_p) per total
> > + * usage of this page (n_t) (in a given time-span) to a probability.
> > + *
> > + * Our periodic faults will then sample this probability and getting
> > + * the same result twice in a row, given these samples are fully
> > + * independent, is then given by P(n)^2, provided our sample period
> > + * is sufficiently short compared to the usage pattern.
> > + *
> > + * This quadratic squishes small probabilities, making it less likely
> > + * we act on an unlikely CPU_node<->page relation.
> > + */
>
> The code does not seem to do what the comment describes.

This comment seems quite accurate to me (btw, I took it from the
sched-numa rewrite with minor changes). By requiring confirmation
through periodic samples that the memory accesses come from the same
node twice in a row, we increase the probability of doing worthwhile
memory migrations and we diminish the risk of worthless migrations
caused by false relations/sharing.
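
Purely as an illustration of the numbers involved (a throwaway
userspace snippet, not patch code, with made-up probabilities):
requiring the same node twice in a row squares the per-node access
frequency, which is the squashing effect the comment describes and
that last_nid_set() below applies through page->autonuma_last_nid.

    #include <stdio.h>

    int main(void)
    {
            /* P(p) ~ n_p/n_t: example fractions of a page's accesses from one node */
            double p[] = { 0.1, 0.3, 0.5, 0.9 };
            int i;

            for (i = 0; i < 4; i++)
                    /*
                     * Two independent samples must agree, so a relation
                     * seen with per-sample probability P is confirmed
                     * with probability P^2: 0.30 -> 0.09, 0.90 -> 0.81.
                     * Weak relations are squashed much harder than
                     * strong ones.
                     */
                    printf("P=%.2f -> P^2=%.2f\n", p[i], p[i] * p[i]);
            return 0;
    }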

> > +static inline bool last_nid_set(struct page *page, int this_nid)
> > +{
> > +	bool ret = true;
> > +	int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
> > +	VM_BUG_ON(this_nid < 0);
> > +	VM_BUG_ON(this_nid >= MAX_NUMNODES);
> > +	if (autonuma_last_nid >= 0 && autonuma_last_nid != this_nid) {
> > +		int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
> > +		if (migrate_nid >= 0)
> > +			__autonuma_migrate_page_remove(page);
> > +		ret = false;
> > +	}
> > +	if (autonuma_last_nid != this_nid)
> > +		ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
> > +	return ret;
> > +}
> > +
> > +	/*
> > +	 * Take the lock with irqs disabled to avoid a lock
> > +	 * inversion with the lru_lock. The lru_lock is taken
> > +	 * before the autonuma_migrate_lock in
> > +	 * split_huge_page. If we didn't disable irqs, the
> > +	 * lru_lock could be taken by interrupts after we have
> > +	 * obtained the autonuma_migrate_lock here.
> > +	 */
>
> Which interrupt code takes the lru_lock? That sounds like a bug.

Disabling irqs around the lru_lock started out as an optimization, to
keep interrupts from lengthening the hold time of the lock back when
all of its critical sections were short, after the isolation code. Now
the lru_lock is also taken from interrupt context to rotate lrus at
I/O completion:

    end_page_writeback -> rotate_reclaimable_page -> pagevec_move_tail

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.6.0-rc2+ #46 Not tainted
---------------------------------------------------------
numa01/7725 just changed the state of lock:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [] pagevec_lru_move_fn+0x9e/0x110
but this lock took another, SOFTIRQ-unsafe lock in the past:
 (&(&pgdat->autonuma_lock)->rlock){+.+.-.}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:
 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&pgdat->autonuma_lock)->rlock);
                               local_irq_disable();
                               lock(&(&zone->lru_lock)->rlock);
                               lock(&(&pgdat->autonuma_lock)->rlock);
  <Interrupt>
    lock(&(&zone->lru_lock)->rlock);

 *** DEADLOCK ***

2 locks held by numa01/7725:
 #0:  (&mm->mmap_sem){++++++}, at: [] do_page_fault+0x121/0x520
 #1:  (rcu_read_lock){.+.+..}, at: [] __mem_cgroup_try_charge+0x348/0xbb0

the shortest dependencies between 2nd lock and 1st lock:
 -> (&(&pgdat->autonuma_lock)->rlock){+.+.-.} ops: 7031259 {
    HARDIRQ-ON-W at:
      [] mark_held_locks+0x5f/0x140
      [] trace_hardirqs_on_caller+0xb2/0x1a0
      [] trace_hardirqs_on+0xd/0x10
      [] knuma_migrated+0x259/0xab0
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10
    SOFTIRQ-ON-W at:
      [] mark_held_locks+0x5f/0x140
      [] trace_hardirqs_on_caller+0x10d/0x1a0
      [] trace_hardirqs_on+0xd/0x10
      [] knuma_migrated+0x259/0xab0
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10
    IN-RECLAIM_FS-W at:
      [] __lock_acquire+0x5c4/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock+0x3b/0x50
      [] __autonuma_migrate_page_remove+0xdd/0x1d0
      [] free_pages_prepare+0xe3/0x190
      [] free_hot_cold_page+0x44/0x1d0
      [] free_hot_cold_page_list+0x3e/0x60
      [] release_pages+0x1f1/0x230
      [] pagevec_lru_move_fn+0xf0/0x110
      [] __pagevec_lru_add+0x17/0x20
      [] lru_add_drain_cpu+0x9b/0x130
      [] lru_add_drain+0x29/0x40
      [] shrink_active_list+0x65/0x340
      [] balance_pgdat+0x323/0x890
      [] kswapd+0x1c3/0x340
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
      [] __lock_acquire+0x2ff/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock+0x3b/0x50
      [] numa_hinting_fault+0x2bb/0x5b0
      [] __pmd_numa_fixup+0x1cd/0x200
      [] handle_mm_fault+0x2c8/0x380
      [] do_page_fault+0x18e/0x520
      [] page_fault+0x25/0x30
      [] sys_poll+0x6c/0x100
      [] system_call_fastpath+0x16/0x1b
  }
  ... key      at: [] __key.16051+0x0/0x18
  ...
   acquired at:
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock+0x3b/0x50
      [] autonuma_migrate_split_huge_page+0x119/0x210
      [] split_huge_page+0x267/0x7f0
      [] knuma_migrated+0x362/0xab0
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10

 -> (&(&zone->lru_lock)->rlock){..-.-.} ops: 10130605 {
    IN-SOFTIRQ-W at:
      [] __lock_acquire+0x765/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock_irqsave+0x53/0x70
      [] pagevec_lru_move_fn+0x9e/0x110
      [] pagevec_move_tail+0x1f/0x30
      [] rotate_reclaimable_page+0xdd/0x100
      [] end_page_writeback+0x4d/0x60
      [] end_swap_bio_write+0x2b/0x80
      [] bio_endio+0x18/0x30
      [] req_bio_endio.clone.53+0x8b/0xd0
      [] blk_update_request+0xf0/0x5a0
      [] blk_update_bidi_request+0x2f/0x90
      [] blk_end_bidi_request+0x2a/0x80
      [] blk_end_request+0xb/0x10
      [] scsi_io_completion+0x97/0x640
      [] scsi_finish_command+0xbe/0xf0
      [] scsi_softirq_done+0x9f/0x130
      [] blk_done_softirq+0x82/0xa0
      [] __do_softirq+0xc8/0x180
      [] call_softirq+0x1c/0x30
      [] do_softirq+0xa5/0xe0
      [] irq_exit+0x9e/0xc0
      [] smp_call_function_single_interrupt+0x2f/0x40
      [] call_function_single_interrupt+0x6f/0x80
      [] mem_cgroup_from_task+0x4e/0xd0
      [] __mem_cgroup_try_charge+0x3bd/0xbb0
      [] mem_cgroup_charge_common+0x64/0xc0
      [] mem_cgroup_newpage_charge+0x31/0x40
      [] handle_pte_fault+0x70a/0xa90
      [] handle_mm_fault+0x253/0x380
      [] do_page_fault+0x18e/0x520
      [] page_fault+0x25/0x30
    IN-RECLAIM_FS-W at:
      [] __lock_acquire+0x5c4/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock_irqsave+0x53/0x70
      [] pagevec_lru_move_fn+0x9e/0x110
      [] __pagevec_lru_add+0x17/0x20
      [] lru_add_drain_cpu+0x9b/0x130
      [] lru_add_drain+0x29/0x40
      [] shrink_active_list+0x65/0x340
      [] balance_pgdat+0x323/0x890
      [] kswapd+0x1c3/0x340
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
      [] __lock_acquire+0x2ff/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock_irqsave+0x53/0x70
      [] pagevec_lru_move_fn+0x9e/0x110
      [] __pagevec_lru_add+0x17/0x20
      [] lru_add_drain_cpu+0x9b/0x130
      [] lru_add_drain+0x29/0x40
      [] __pagevec_release+0x11/0x30
      [] truncate_inode_pages_range+0x344/0x4b0
      [] truncate_inode_pages+0x10/0x20
      [] kill_bdev+0x2a/0x40
      [] __blkdev_put+0x6f/0x1d0
      [] blkdev_put+0x5b/0x170
      [] add_disk+0x41a/0x4a0
      [] sd_probe_async+0x120/0x1d0
      [] async_run_entry_fn+0x7d/0x180
      [] process_one_work+0x19f/0x510
      [] worker_thread+0x1a7/0x4b0
      [] kthread+0xb6/0xc0
      [] kernel_thread_helper+0x4/0x10
  }
  ... key      at: [] __key.34621+0x0/0x8
  ...
   acquired at:
      [] check_usage_forwards+0x8e/0x110
      [] mark_lock+0x1d6/0x630
      [] __lock_acquire+0x765/0x1dd0
      [] lock_acquire+0x62/0x80
      [] _raw_spin_lock_irqsave+0x53/0x70
      [] pagevec_lru_move_fn+0x9e/0x110
      [] pagevec_move_tail+0x1f/0x30
      [] rotate_reclaimable_page+0xdd/0x100
      [] end_page_writeback+0x4d/0x60
      [] end_swap_bio_write+0x2b/0x80
      [] bio_endio+0x18/0x30
      [] req_bio_endio.clone.53+0x8b/0xd0
      [] blk_update_request+0xf0/0x5a0
      [] blk_update_bidi_request+0x2f/0x90
      [] blk_end_bidi_request+0x2a/0x80
      [] blk_end_request+0xb/0x10
      [] scsi_io_completion+0x97/0x640
      [] scsi_finish_command+0xbe/0xf0
      [] scsi_softirq_done+0x9f/0x130
      [] blk_done_softirq+0x82/0xa0
      [] __do_softirq+0xc8/0x180
      [] call_softirq+0x1c/0x30
      [] do_softirq+0xa5/0xe0
      [] irq_exit+0x9e/0xc0
      [] smp_call_function_single_interrupt+0x2f/0x40
      [] call_function_single_interrupt+0x6f/0x80
      [] mem_cgroup_from_task+0x4e/0xd0
      [] __mem_cgroup_try_charge+0x3bd/0xbb0
      [] mem_cgroup_charge_common+0x64/0xc0
      [] mem_cgroup_newpage_charge+0x31/0x40
      [] handle_pte_fault+0x70a/0xa90
      [] handle_mm_fault+0x253/0x380
      [] do_page_fault+0x18e/0x520
      [] page_fault+0x25/0x30

stack backtrace:
Pid: 7725, comm: numa01 Not tainted 3.6.0-rc2+ #46
Call Trace:
 [] print_irq_inversion_bug+0x1c6/0x210
 [] ? print_irq_inversion_bug+0x210/0x210
 [] check_usage_forwards+0x8e/0x110
 [] mark_lock+0x1d6/0x630
 [] __lock_acquire+0x765/0x1dd0
 [] ? mempool_alloc_slab+0x10/0x20
 [] ? kmem_cache_alloc+0xbb/0x1b0
 [] lock_acquire+0x62/0x80
 [] ? pagevec_lru_move_fn+0x9e/0x110
 [] _raw_spin_lock_irqsave+0x53/0x70
 [] ? pagevec_lru_move_fn+0x9e/0x110
 [] pagevec_lru_move_fn+0x9e/0x110
 [] ? __pagevec_lru_add_fn+0x130/0x130
 [] pagevec_move_tail+0x1f/0x30
 [] rotate_reclaimable_page+0xdd/0x100
 [] end_page_writeback+0x4d/0x60
 [] ? scsi_request_fn+0xa2/0x4b0
 [] end_swap_bio_write+0x2b/0x80
 [] bio_endio+0x18/0x30
 [] req_bio_endio.clone.53+0x8b/0xd0
 [] blk_update_request+0xf0/0x5a0
 [] ? blk_update_request+0x32a/0x5a0
 [] blk_update_bidi_request+0x2f/0x90
 [] blk_end_bidi_request+0x2a/0x80
 [] blk_end_request+0xb/0x10
 [] scsi_io_completion+0x97/0x640
 [] scsi_finish_command+0xbe/0xf0
 [] scsi_softirq_done+0x9f/0x130
 [] blk_done_softirq+0x82/0xa0
 [] __do_softirq+0xc8/0x180
 [] ? trace_hardirqs_off+0xd/0x10
 [] call_softirq+0x1c/0x30
 [] do_softirq+0xa5/0xe0
 [] irq_exit+0x9e/0xc0
 [] smp_call_function_single_interrupt+0x2f/0x40
 [] call_function_single_interrupt+0x6f/0x80
 [] ? debug_lockdep_rcu_enabled+0x29/0x40
 [] mem_cgroup_from_task+0x4e/0xd0
 [] __mem_cgroup_try_charge+0x3bd/0xbb0
 [] ? __mem_cgroup_try_charge+0x348/0xbb0
 [] mem_cgroup_charge_common+0x64/0xc0
 [] mem_cgroup_newpage_charge+0x31/0x40
 [] handle_pte_fault+0x70a/0xa90
 [] ? __free_pages+0x35/0x40
 [] handle_mm_fault+0x253/0x380
 [] do_page_fault+0x18e/0x520
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] ? rcu_irq_exit+0x7f/0xd0
 [] ? retint_restore_args+0x13/0x13
 [] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [] page_fault+0x25/0x30
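
To make the consequence of the report concrete, here is a minimal
sketch of the ordering rule the quoted comment describes. The helper
name and the lock declarations are hypothetical stand-ins (the real
locks live in struct zone and pg_data_t as pgdat->autonuma_lock /
autonuma_migrate_lock), not the actual autonuma code: because the
lru_lock can now be taken from softirq context at I/O completion, a
lock that nests inside the lru_lock must only be held with irqs
disabled.

    #include <linux/spinlock.h>
    #include <linux/mm_types.h>

    /* Hypothetical stand-in for the per-node autonuma migrate lock. */
    static DEFINE_SPINLOCK(example_autonuma_migrate_lock);

    /* Hypothetical helper, for illustration only. */
    static void autonuma_queue_page_example(struct page *page)
    {
            unsigned long flags;

            /*
             * irqs must stay disabled while the migrate lock is held:
             * if an I/O completion softirq ran here and spun on the
             * lru_lock (end_page_writeback -> rotate_reclaimable_page)
             * while another CPU held the lru_lock and waited for the
             * migrate lock (as in split_huge_page), the inversion
             * lockdep reports above would become a real deadlock.
             */
            spin_lock_irqsave(&example_autonuma_migrate_lock, flags);
            /* ... queue "page" on the (hypothetical) migrate list ... */
            spin_unlock_irqrestore(&example_autonuma_migrate_lock, flags);
    }

Whether a given site uses spin_lock_irqsave() or plain spin_lock_irq()
is a detail of the patch; the point here is only the ordering
constraint against the lru_lock.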