Date: Tue, 29 Jan 2019 19:46:19 +0000
From: Christopher Lameter
To: Miles Chen
Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim, Jonathan Corbet, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-mediatek@lists.infradead.org
Subject: Re: [PATCH v2] mm/slub: introduce SLAB_WARN_ON_ERROR
Message-ID: <010001689b25e696-3caebea9-56c2-46eb-bb49-34e504a123ee-000000@email.amazonses.com>
In-Reply-To: <1548748424.18511.34.camel@mtkswgap22>

On Tue, 29 Jan 2019, Miles Chen wrote:

> a) classic slub issue. e.g., use-after-free, redzone overwritten. It's
> more efficient to report an issue as soon as slub detects it. (comparing
> to monitor the log, set a breakpoint, and re-produce the issue). With
> the coredump file, we can analyze the issue.

What usually happens is that the system fails with a strange error
message. Then the system is rebooted using slub_debug options and the
issue is reproduced, yielding more information about the problem.

Then you run the scenario again with additional debugging in the
subsystem that caused the problem.

So you are already reproducing the issue because you need to activate
debugging to get more information. Doing it for the 3rd time is not that
much more difficult.

None of your modifications will be active in a production kernel.
slub_debug must be activated to use them, and thus you are already
reproducing the issue.

> b) memory corruption issues caused by h/w write. e.g., memory
> overwritten by a DMA engine. Memory corruptions may or may not be related
> to the slab cache that reports any error. For example: kmalloc-256 or
> dentry may report the same errors. If we can preserve the coredump
> file without any restore/reset processing in slub, we could have more
> information of this memory corruption.

If debugging is active then the report will accurately identify the
affected slab cache.
The memory layout is already changing when you enable the existing
debugging code. None of your code runs without that, and thus it cannot
add a coredump for the production case without debugging.

> c) memory corruption issues caused by unstable h/w. e.g., bit flipping
> because of xxxx DRAM die or applying new power settings. It's hard to
> re-produce this kind of issue and it is much easier to tell this kind of
> issue in the coredump file without any restore/reset processing.

But then your patch does not help in this situation, because the code has
to be enabled by special slub debug options.

> Users can set the option by slub_debug. We can still have the original
> behavior (keep the system alive) if the option is not set. We can turn on
> the option when we need the coredump file. (with panic_on_warn set,
> of course).

I think we would need to turn on debugging by default and have your patch
for this to make sense. We are already reproducing the issue multiple
times for debugging. This patch does not change that.
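
To make the dependency argued above concrete, here is a minimal userspace sketch in C. It is a toy model, not kernel code: every name in it (toy_obj, toy_check, TOY_DEBUG, TOY_WARN_ON_ERROR, the 0xbb pattern) is hypothetical and chosen only for illustration. It assumes a red zone that is filled and checked only when a debug flag is set, with a second flag choosing between "restore and keep going" and "stop and leave the corrupted state untouched". The point it shows is that the second flag has no effect unless the first one is on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE 16
#define RED_SIZE 8
#define RED_BYTE 0xbb /* arbitrary fill pattern for this toy model */

/* Hypothetical flags: debugging on/off, and "warn instead of restore". */
enum { TOY_DEBUG = 1 << 0, TOY_WARN_ON_ERROR = 1 << 1 };

enum toy_result { TOY_OK, TOY_UNCHECKED, TOY_RESTORED, TOY_WARNED };

struct toy_obj {
    uint8_t payload[OBJ_SIZE];
    uint8_t redzone[RED_SIZE]; /* guard bytes placed after the payload */
};

/* The red zone is only written when debugging is enabled. */
static void toy_init(struct toy_obj *o, unsigned flags)
{
    assert(o != NULL);
    memset(o->payload, 0, OBJ_SIZE);
    if (flags & TOY_DEBUG)
        memset(o->redzone, RED_BYTE, RED_SIZE);
}

static enum toy_result toy_check(struct toy_obj *o, unsigned flags)
{
    assert(o != NULL);

    /* Without debugging no check runs, so the warn flag is never consulted. */
    if (!(flags & TOY_DEBUG))
        return TOY_UNCHECKED;

    for (size_t i = 0; i < RED_SIZE; i++) {
        if (o->redzone[i] == RED_BYTE)
            continue;
        if (flags & TOY_WARN_ON_ERROR)
            return TOY_WARNED; /* leave the corrupted bytes as found */
        memset(o->redzone, RED_BYTE, RED_SIZE); /* repair and continue */
        return TOY_RESTORED;
    }
    return TOY_OK;
}
```

Calling toy_check() with only TOY_WARN_ON_ERROR set returns TOY_UNCHECKED, which mirrors the objection: the new behavior is reachable only on a run where debugging was already switched on.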