Date: Wed, 17 Jan 2018 10:32:12 +0100 (CET)
From: Thomas Gleixner
To: Keith Busch
cc: LKML
Subject: Re: [BUG 4.15-rc7] IRQ matrix management errors
References: <20180115025759.GG13580@localhost.localdomain>
 <20180115030255.GA13921@localhost.localdomain>
 <20180116061641.GB32639@localhost.localdomain>
 <20180116071145.GA5643@localhost.localdomain>
 <20180117022511.GD6259@localhost.localdomain>
 <20180117075500.GB7562@localhost.localdomain>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 17 Jan 2018, Thomas Gleixner wrote:
> On Wed, 17 Jan 2018, Keith Busch wrote:
> > On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> > > Can you trace the matrix allocations from the very beginning or tell me how
> > > to reproduce. I'd like to figure out why this is happening.
> >
> > Sure, I'll get the irq_matrix events.
> >
> > I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
> > first two NVMe want 112 MSI-x vectors, and the last only 31 vectors. The
> > test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with 10
> > second delay between each step. Repro occurs within a few iterations,
> > sometimes already broken after the initial boot.
>
> That doesn't sound right. The vectors should be spread evenly across the
> CPUs. So ENOSPC should never happen.
>
> Can you please take snapshots of /sys/kernel/debug/irq/ between the
> modprobe and modprobe -r steps?

The allocation fails because CPU1 has exhausted its vector space here:

 [002] d... 333.028216: irq_matrix_alloc_managed: bit=34 cpu=1 online=1 avl=0 alloc=202 managed=2 online_maps=112 global_avl=22085, global_rsvd=158, total_alloc=460

Now the interesting question is how that happens.

Thanks,

	tglx
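
For going through a longer trace, something along the lines of the sketch below might help to spot every CPU that hits avl=0, like CPU1 in the irq_matrix_alloc_managed line above. It is only an illustration: the field names come from that trace line, while the function names and the assumption that every field is an integer key=value pair are made up here and are not part of any kernel tooling.

```python
import re

# Field names (cpu, avl, alloc, ...) are taken from the trace line quoted
# above. The function names are hypothetical, and the parser assumes that
# every field is an integer written as key=value.
EVENT_RE = re.compile(r"(?P<event>irq_matrix_\w+):\s*(?P<fields>.*)$")
FIELD_RE = re.compile(r"(\w+)=(\d+)")


def parse_matrix_event(line):
    """Return (event_name, {field: int}), or None for non-matrix lines."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    fields = {key: int(val) for key, val in FIELD_RE.findall(m.group("fields"))}
    return m.group("event"), fields


def exhausted_cpus(lines):
    """Yield (cpu, fields) for every event that reports avl=0 for a CPU."""
    for line in lines:
        parsed = parse_matrix_event(line)
        if parsed is None:
            continue
        _, fields = parsed
        if "cpu" in fields and fields.get("avl") == 0:
            yield fields["cpu"], fields
```

Feeding it the lines of a saved trace (any iterable of strings works, for example an open file) yields one entry per event where a CPU has no vectors left, together with the alloc and managed counts seen at that point.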