Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp461828pxk; Wed, 9 Sep 2020 09:46:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyJS+naNs3lh4j20I8qU1/7pHTmOU569uEkbqwfEfUBMIPflTmg2IH6ObOUneFgcCJccXby X-Received: by 2002:a05:6402:50f:: with SMTP id m15mr5223461edv.41.1599670015016; Wed, 09 Sep 2020 09:46:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1599670015; cv=none; d=google.com; s=arc-20160816; b=yej0NYABwR3SQ5xyVHoHvq2lgFZHOS7o/kY9FR9l/fSSTYXBncokH/2JZ3M9nMrByH qx7+28OGSXf2CP8kJbpJKak/90DemDeM60b4iDO3f8acoClylB/FnRDffM4Njtp/kYQs ShG6nJ3PdVlMS9NnDDDVa+aho0VbodF0L451UmDwePWSUJkGjqO0WJ+cfwaKnwktuNl2 y5H9ff2Pa2tR8Poah91fN+7sESN36fdLgZQN3+VKfBmNf77kXILOY6B75tJN94/k8ccM cCx7teUMXe4db82tdB3fLrNOLX0FtHvs5cV31LU34r8R1cKDhsxLJjbeclxvvXQL2bAU cNbA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Z2tU+qSkg//p3QUUJru+1eyE4zumits1z47PmsvMV+I=; b=f61oLUFDpM8zxt6xkZok1N0q7Co3VcjYciI/eoum8AAWb5FNSIoH7sMPanFvYCEmGi JCwrQWnIRnpY8LaQgaSCXvdO/mXVudHKRkdDSyo8EPY5F7UYwF6WPzddkzSThz8LxE+A dqx+b02bAy1HjT7DTCCREHPSl95bcx5lCurMcB9tw/thau+6cBUXtSAUyfd+KhLbnPYX b83UnAkkhQ/xTHwn1rsMScJr1vnS5LF3ABC9JmCUzgqk8r0DoEj9p7laAk74AiznkhM3 LEKzUF0VeB7X83P3NSDHM9ynFou/QOlsqAgXf0geIA6dJRsUbx1fc4E7LJq353tB7sTl rTAw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Yf164iXI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id d23si1748325edt.338.2020.09.09.09.46.31; Wed, 09 Sep 2020 09:46:55 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Yf164iXI; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730994AbgIIQni (ORCPT + 99 others); Wed, 9 Sep 2020 12:43:38 -0400 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:53748 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1730967AbgIIQnd (ORCPT ); Wed, 9 Sep 2020 12:43:33 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1599669810; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Z2tU+qSkg//p3QUUJru+1eyE4zumits1z47PmsvMV+I=; b=Yf164iXIqOJiK6bwQelGdpr3FbyN2FuuKRa+hJVVja1i7nPsEEehNtohIEVJ1SRBR9Ig47 b/750mbnnOQJOnWCEAXIEnqPlPxUGMYD1qNJAOaY0gIaq4CymY5u+ZjHZDgp5ZuUslfZPJ AtBH8bOc8Aifw4rWqdXpFJU1dopTDhE= Received: from mail-il1-f197.google.com (mail-il1-f197.google.com [209.85.166.197]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-441-Gm9ohxF3PfCAXMxKd9isiA-1; Wed, 09 Sep 2020 12:43:28 -0400 X-MC-Unique: Gm9ohxF3PfCAXMxKd9isiA-1 Received: by mail-il1-f197.google.com with SMTP id f20so2501347ilg.3 for ; Wed, 09 Sep 2020 09:43:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Z2tU+qSkg//p3QUUJru+1eyE4zumits1z47PmsvMV+I=; b=mvha0VtfC89LN5JQjPkTJsGvWDG7pal3pO81M2lXqWA+71PQ7RHqFhwV0LuCSthRw4 hNuZik1YQa0qNzrRRSkZ5FbQEA7dOGUtr2NNeBwWGgfdpAoC1U0/F8024YtxOc2hWO5P ZJf5Zyb9ZQ15Sbwj7DflfgUIEbsTzsa5LxRD73lfet6jPg0WUi4EIuV9y8hDZACcu/dt n5AG+UYGomgx+LAuWAhFQztjyrY3DJwqju6aOq+31eBhRa0YBlzsOPVs7XvWFI1iP9mL K3Wor7oGtO1sbtxoFJ9PhXhtv81YRf/7INRAgJiwnY4PBr4pD+ccYRNO83dfYGzc9urQ 4Tyg== X-Gm-Message-State: AOAM530n1FnAgEwZuJAsZmES05KXySisO6MwGOe3mr+lcYwIbN1aRUcd /53vCuP5/5VOST7jP9/QQdkmUX69vum9hYB9pxsyISsGURkSxrtQn04+dQP6Wv1ts6QJWMEv+Be +EfCGGD7TDBh7DopCrVQFcGhIxacWhfsqWDaCjAg2 X-Received: by 2002:a05:6638:3ea:: with SMTP id s10mr4765105jaq.2.1599669807526; Wed, 09 Sep 2020 09:43:27 -0700 (PDT) X-Received: by 2002:a05:6638:3ea:: with SMTP id s10mr4765033jaq.2.1599669806457; Wed, 09 Sep 2020 09:43:26 -0700 (PDT) MIME-Version: 1.0 References: <20200909075016.104407-1-kasong@redhat.com> <87ft7r82sf.fsf@x220.int.ebiederm.org> In-Reply-To: <87ft7r82sf.fsf@x220.int.ebiederm.org> From: Kairui Song Date: Thu, 10 Sep 2020 00:43:15 +0800 Message-ID: Subject: Re: [RFC PATCH 0/3] Add writing support to vmcore for reusing oldmem To: "Eric W. Biederman" Cc: Linux Kernel Mailing List , Dave Young , Baoquan He , Vivek Goyal , Alexey Dobriyan , Thomas Gleixner , Ingo Molnar , Borislav Petkov , kexec@lists.infradead.org Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 9, 2020 at 10:04 PM Eric W. Biederman wrote: > > Kairui Song writes: > > > Currently vmcore only supports reading, this patch series is an RFC > > to add writing support to vmcore. It's x86_64 only yet, I'll add other > > architecture later if there is no problem with this idea. > > > > My purpose of adding writing support is to reuse the crashed kernel's > > old memory in kdump kernel, reduce kdump memory pressure, and > > allow kdump to run with a smaller crashkernel reservation. > > > > This is doable because in most cases, after kernel panic, user only > > interested in the crashed kernel itself, and userspace/cache/free > > memory pages are not dumped. `makedumpfile` is widely used to skip > > these pages. Kernel pages usually only take a small part of > > the whole old memory. So there will be many reusable pages. > > > > By adding writing support, userspace then can use these pages as a fast > > and temporary storage. This helps reduce memory pressure in many ways. > > > > For example, I've written a POC program based on this, it will find > > the reusable pages, and creates an NBD device which maps to these pages. > > The NBD device can then be used as swap, or to hold some temp files > > which previouly live in RAM. > > > > The link of the POC tool: https://github.com/ryncsn/kdumpd > > A couple of thoughts. > 1) Unless I am completely mistaken treating this as a exercise in > memory hotplug would be much simpler. > > AKA just plug in the memory that is not needed as part of the kdump. > > I see below that you have problems doing this because > of fragmentation. I still think hotplug is doable using some > kind of fragmented memory zone. > > 2) The purpose of the memory reservation is because hardware is > still potentially running agains the memory of the old kernel. > > By the time we have brought up a new kernel enough of the hardware > may have been reinitialized that we don't have to worry about > hardware randomly dma'ing into the memory used by the old kernel. > > With IOMMUs and care we may be able to guarantee for some machine > configurations it is impossible for DMA to come from some piece of > hardware that is present but the kernel does not have a driver > loaded for.\ > > I really do not like this approach because it is fundamentlly doing the > wrong thing. Adding write support to read-only drivers. I do not see > anywhere that you even mentioned the hard problem and the reason we > reserve memory in the first place. Hardware spontaneously DMA'ing onto > it. > That POC tool looks ugly for now as it only a draft to prove this works, sorry about it. For the patch, yes, it is expecting IOMMU to lower the chance of potential DMA issue, and expecting DMA will not hit userspace/free page, or at least won't override a massive amount of reusable old memory. And I thought about some solutions for the potential DMA issue. As old memories are used as a block device, which is proxied by userspace, so upon each IO, the userspace tool could do an integrity check of the corresponding data stored in old mem, and keep multiple copies of the data. (eg. use 512M of old memory to hold a 128M block device). These copies will be kept far away from each other regarding the physical memory location. The reusable old memories are sparse so the actual memory containing the data should be also sparse. So if some part is corrupted, it is still recoverable. Unless the DMA went very wrong and wiped a large region of memory, but if such thing happens, it's most likely kernel pages are also being wiped by DMA, so the vmcore is already corrupted and kdump may not help. But at least it won't fail silently, the userspace tool can still do something like dump some available data to an easy to setup target. And also that's one of the reasons not using old memory as kdump's memory directly. > > It's have been a long time issue that kdump suffers from OOM issue > > with limited crashkernel memory. So reusing old memory could be very > > helpful. > > There is a very fine line here between reusing existing code (aka > drivers and userspace) and doing something that should work. > > It might make sense to figure out what is using so much memory > that an OOM is triggered. > > Ages ago I did something that was essentially dumping the kernels printk > buffer to the serial console in case of a crash and I had things down to > something comparatively miniscule like 8M or less. > > My memory is that historically it has been high performance scsi raid > drivers or something like that, that are behind the need to have such > large memory reservations. > > Now that I think about it, you aren't by any chance doing something > silly like running systemd in your initrd are you? Are these OOMs by > any chance a userspace problem rather than a problem with inefficient > drivers? The problem with the user space is that, kdump is expected to dump to different kinds of dump targets, and there are some very memory consuming ones, eg. the NFSv3 case. And there are many other labor heavy jobs for the dump target setup, like network setup, lvm setup, iscsi setup, multipath etc, etc, not to mention potential corner cases with these dump targets. And it is not practical to reimplement them again in a memory friendly way. And the user space is growing, even if user only included "bash makedumpfile ssh ip" and required libs in initramfs, which are essential for dumping the vmcore over SSH (dump over SSH is commonly used), the initramfs will take 20M after decompressed. The kernel driver memory usage is trackable and I only encountered a few of such issues, and many have applied workaround for kdump. And if userspace memory pressure is reduced, kernel will also have more memory. And now in Fedora, it is using the existing tool Dracut to generate the initramfs, which heavily depends on systemd indeed. Even with these helpers, it has been taking quite some work to support all the dump targets. > > In summary you either need to show that it is safe to reuse the > memory of the old kernel, or do some work to reduce the memory footprint > of the crashdump kernel, and the crashdump userspace. > > Eric > -- Best Regards, Kairui Song