From: Doug Ledford <dledford@redhat.com>
To: Andrew Morton, Jonathan Toppins
Cc: linux-mm@kvack.org, linux-rdma@vger.kernel.org, Michal Hocko,
 Vlastimil Babka, Mel Gorman, Hillf Danton, open list
Subject: Re: [PATCH] mm: ratelimit PFNs busy info message
Date: Fri, 04 Aug 2017 14:55:06 -0400
Message-ID: <1501872906.79618.10.camel@redhat.com>
In-Reply-To: <20170802141720.228502368b534f517e3107ff@linux-foundation.org>
References: <499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com>
 <20170802141720.228502368b534f517e3107ff@linux-foundation.org>
Organization: Red Hat, Inc.

On Wed, 2017-08-02 at 14:17 -0700, Andrew Morton wrote:
> On Wed, 2 Aug 2017 13:44:57 -0400 Jonathan Toppins wrote:
>
> > The RDMA subsystem can generate several thousand of these messages
> > per second, eventually leading to a kernel crash. Ratelimit these
> > messages to prevent this crash.
>
> Well... why are all these EBUSY's occurring? It sounds inefficient
> (at least), but if it is expected, normal, and unavoidable, then
> perhaps we should just remove that message altogether?

I don't have an answer to that question. To be honest, I haven't
looked very hard. We never saw this at all, then it started out of the
blue, but only on our Dell 730xd machines (and it hits all of them),
not on any other class or brand of machine. And our 730xd machines are
loaded up with different brands and models of cards (for instance, one
dedicated to mlx4 hardware, one for qib, one for mlx5, an ocrdma/cxgb4
combo, etc.), so the fact that it hit all of the machines meant it
wasn't tied to any particular brand or model of RDMA hardware. To me,
it always smelled of a hardware oddity specific to maybe the CPUs or
mainboard chipsets in these machines, so given that I'm not an mm
expert anyway, I never chased it down.

A few other relevant details: it showed up somewhere around 4.8/4.9 or
thereabouts. It never happened before that, but the printk has been
there since the 3.18 days, so possibly the test that triggers this
message was changed, or something else in the allocator changed such
that the situation started happening on these machines.

And, like I said, it is specific to our 730xd machines. But they are
all identical, so that could mean something like their particular RAM
configuration causes the allocator to hit this on these machines but
not on the other machines in the cluster. I don't want to say it's
necessarily the model of chipset or CPU; there are other bits of
identicalness between these machines.
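For anyone following along, my understanding is that the fix boils
down to switching the message to the ratelimited printk variant. A
rough sketch (assuming the message is still the pr_info() emitted from
alloc_contig_range() in mm/page_alloc.c, which I haven't re-checked):

	-	pr_info("%s: [%lx, %lx) PFNs busy\n",
	-		__func__, outer_start, end);
	+	pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
	+			    __func__, outer_start, end);

pr_info_ratelimited() uses the default printk ratelimit (a burst of 10
messages per 5-second interval), so a flood of PFNs-busy reports gets
squelched instead of swamping the console and eventually taking the
box down.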
--
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD