Date: Tue, 22 Oct 2019 14:59:02 -0700
From: Andrew Morton
To: Waiman Long
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner,
    Michal Hocko, Roman Gushchin, Vlastimil Babka, Konstantin Khlebnikov,
    Jann Horn,
    Song Liu, Greg Kroah-Hartman, Rafael Aquini
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
Message-Id: <20191022145902.d9c4a719c0b32175e06e4eee@linux-foundation.org>
In-Reply-To: <20191022162156.17316-1-longman@redhat.com>
References: <20191022162156.17316-1-longman@redhat.com>

On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long wrote:

> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts. There are
> bug reports of hard-lockup panics when reading the /proc/pagetypeinfo
> file, simply because it takes too long to iterate all the free lists
> within a zone while holding the zone lock with IRQs disabled.
>
> Given that /proc/pagetypeinfo is readable by all, the possibility of
> crashing a system by the simple act of reading /proc/pagetypeinfo by
> any user is a security problem that needs to be addressed.

Yes.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to fast paths
> like moving pages from one migration type to another, which may not
> be desirable.
>
> We can instead skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order,
> we will have to store the counts in an internal 2D array before
> printing them out.
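For illustration, here is the "walk every list except one, derive the
skipped one from a maintained total" idea boiled down to a standalone
C sketch.  The bucket layout, the numbers, and the names (NBUCKETS,
SKIP, counts, walked) are invented for this example; only the
technique mirrors the patch.

#include <stdio.h>

#define NBUCKETS	4
#define SKIP		1	/* plays the role of MIGRATE_MOVABLE */

int main(void)
{
	/* nr_free-style grand total, maintained cheaply elsewhere */
	unsigned long nr_free = 100;
	/* per-bucket counts; counts[SKIP] stays unknown until derived */
	unsigned long counts[NBUCKETS] = { 10, 0, 30, 20 };
	unsigned long walked = 0;

	for (int i = 0; i < NBUCKETS; i++) {
		if (i == SKIP)
			continue;	/* too expensive to walk */
		walked += counts[i];
	}
	counts[SKIP] = nr_free - walked;	/* 100 - 60 = 40 */
	printf("derived count for skipped bucket: %lu\n", counts[SKIP]);
	return 0;
}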
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding
> the zone lock for too long, blocking other zone lock waiters from
> running. This can be problematic for systems with large amounts of
> memory. So a check is added to temporarily release the lock and
> reschedule if more than 64k list entries have been iterated for each
> order. With a MAX_ORDER of 11, the worst case is iterating about
> 700k list entries before releasing the lock.
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> 					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];

600+ bytes is a bit much.  I guess it's OK in this situation.

> -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> -					pgdat->node_id,
> -					zone->name,
> -					migratetype_names[mtype]);
> -		for (order = 0; order < MAX_ORDER; ++order) {
> +	lockdep_assert_held(&zone->lock);
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> +	 * systems. We skip iterating that list. Instead, we compute it
> +	 * by subtracting the total of the rest from free_area->nr_free.
> +	 */
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		unsigned long nr_total = 0;
> +		struct free_area *area = &(zone->free_area[order]);
> +
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  			unsigned long freecount = 0;
> -			struct free_area *area;
>  			struct list_head *curr;
>
> -			area = &(zone->free_area[order]);
> -
> +			if (mtype == MIGRATE_MOVABLE)
> +				continue;
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			nfree[order][mtype] = freecount;
> +			nr_total += freecount;
>  		}
> +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> +		/*
> +		 * If we have already iterated more than 64k list
> +		 * entries, we might have held the zone lock for too
> +		 * long. Temporarily release the lock and reschedule
> +		 * before continuing so that other lock waiters get a
> +		 * chance to run.
> +		 */
> +		if (nr_total > (1 << 16)) {
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
> +		}
> +	}
> +
> +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> +					pgdat->node_id,
> +					zone->name,
> +					migratetype_names[mtype]);
> +		for (order = 0; order < MAX_ORDER; ++order)
> +			seq_printf(m, "%6lu ", nfree[order][mtype]);
>  		seq_putc(m, '\n');

This is not exactly a thing of beauty :(  Presumably there might still
be situations where the irq-off times remain excessive.

Why are we actually holding zone->lock so much?  Can we get away with
holding it across the list_for_each() loop and nothing else?

If so, this still isn't a bulletproof fix.  Maybe just terminate the
list walk if freecount reaches 1024 (sketch below).  Would anyone
really care?

Sigh.  I wonder if anyone really uses this thing for anything
important.  Can we just remove it all?
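A minimal sketch of that capped walk, patterned on the inner loop
quoted above.  The 1024 cutoff and the ">" overflow marker are
illustrative choices here, not a tested patch:

	bool overflow = false;
	unsigned long freecount = 0;
	struct list_head *curr;

	/* Bound the zone->lock hold time: stop counting at 1024. */
	list_for_each(curr, &area->free_list[mtype]) {
		if (++freecount >= 1024) {
			overflow = true;
			break;
		}
	}
	/* A ">" prefix flags a clamped count as a lower bound. */
	seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);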