Date: Wed, 29 Nov 2023 11:51:53 +0100
From: Jiri Bohac
To: Baoquan He
Cc: Michal Hocko, Pingfan Liu, Tao Liu, Vivek Goyal, Dave Young,
    kexec@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

Hi Baoquan,

thanks for your interest...

On Wed, Nov 29, 2023 at 03:57:59PM +0800, Baoquan He wrote:
> On 11/28/23 at 10:08am, Michal Hocko wrote:
> > On Tue 28-11-23 10:11:31, Baoquan He wrote:
> > > On 11/28/23 at 09:12am, Tao Liu wrote:
> > [...]
> > > Thanks for the effort to bring this up, Jiri.
> > >
> > > I am wondering how you will use this crashkernel=,cma parameter. I mean
> > > the scenario of crashkernel=,cma. Asking this because I don't know how
> > > SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
> > > driver will be filter out? If latter case, It's possibly having the
> > > on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
> > > reset during kdump bootup because the NIC driver is not loaded in to
> > > initialize. Not sure if this is 100%, possible in theory?

Yes, we also only add the necessary drivers to the kdump initrd (using
dracut --hostonly).

The plan was to use this feature by default only on systems where
we are reasonably sure it is safe and let the user experiment
with it when we're not sure.

I grepped a list of all calls to pin_user_pages*.
Of the 55, about half use FOLL_LONGTERM, so these should be
migrated away from the CMA area. Of the rest, four cases don't
use the pages to set up DMA:

	mm/process_vm_access.c: pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
	net/rds/info.c: ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	drivers/vhost/vhost.c: r = pin_user_pages_fast(log, 1, FOLL_WRITE, &page);
	kernel/trace/trace_events_user.c: ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,

The remaining cases are potentially problematic:

	drivers/gpu/drm/i915/gem/i915_gem_userptr.c: ret = pin_user_pages_fast(obj->userptr.ptr + pinned * PAGE_SIZE,
	drivers/iommu/iommufd/iova_bitmap.c: ret = pin_user_pages_fast((unsigned long)addr, npages,
	drivers/iommu/iommufd/pages.c: rc = pin_user_pages_remote(
	drivers/media/pci/ivtv/ivtv-udma.c: err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
	drivers/media/pci/ivtv/ivtv-yuv.c: uv_pages = pin_user_pages_unlocked(uv_dma.uaddr,
	drivers/media/pci/ivtv/ivtv-yuv.c: y_pages = pin_user_pages_unlocked(y_dma.uaddr,
	drivers/misc/genwqe/card_utils.c: rc = pin_user_pages_fast(data & PAGE_MASK, /* page aligned addr */
	drivers/misc/xilinx_sdfec.c: res = pin_user_pages_fast((unsigned long)src_ptr, nr_pages, 0, pages);
	drivers/platform/goldfish/goldfish_pipe.c: ret = pin_user_pages_fast(first_page, requested_pages,
	drivers/rapidio/devices/rio_mport_cdev.c: pinned = pin_user_pages_fast(
	drivers/sbus/char/oradax.c: ret = pin_user_pages_fast((unsigned long)va, 1, FOLL_WRITE, p);
	drivers/scsi/st.c: res = pin_user_pages_fast(uaddr, nr_pages, rw == READ ? FOLL_WRITE : 0,
	drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c: actual_pages = pin_user_pages_fast((unsigned long)ubuf & PAGE_MASK, num_pages,
	drivers/tee/tee_shm.c: rc = pin_user_pages_fast(start, num_pages, FOLL_WRITE,
	drivers/vfio/vfio_iommu_spapr_tce.c: if (pin_user_pages_fast(tce & PAGE_MASK, 1,
	drivers/video/fbdev/pvr2fb.c: ret = pin_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE, pages);
	drivers/xen/gntdev.c: ret = pin_user_pages_fast(addr, 1, batch->writeable ? FOLL_WRITE : 0, &page);
	drivers/xen/privcmd.c: page_count = pin_user_pages_fast(
	fs/orangefs/orangefs-bufmap.c: ret = pin_user_pages_fast((unsigned long)user_desc->ptr,
	arch/x86/kvm/svm/sev.c: npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
	drivers/fpga/dfl-afu-dma-region.c: pinned = pin_user_pages_fast(region->user_addr, npages, FOLL_WRITE,
	lib/iov_iter.c: res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);

We can easily check if some of these drivers (some of which we
don't even ship/support) are loaded and decide this system is not
safe for CMA crashkernel. Maybe looking at the list more
thoroughly will show that even some of the above calls are
actually safe, e.g. because the DMA is set up for reading only.
lib/iov_iter.c seems like it could be the real problem since it's
used by the generic block layer...

> > > The crashkernel=,cma requires no userspace data dumping, from our
> > > support engineers' feedback, customer never express they don't need to
> > > dump user space data. Assume a server with huge databse deployed, and
> > > the database often collapsed recently and database provider claimed that
> > > it's not database's fault, OS need prove their innocence. What will you
> > > do?
> >
> > Don't use CMA backed crash memory then? This is an optional feature.

Right. Our kdump does not dump userspace by default and we would
of course make sure ,cma is not used when the user wants to turn
on userspace dumping.
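To make the "check if some of these drivers are loaded" idea a bit
more concrete, here is a rough sketch of what such a userspace check
might look like. It is only an illustration, not what our kdump
tooling does. The module names in RISKY_MODULES are my guesses at how
some of the source files listed above map to module names, and the
function names are made up for this example. Built-in code (and
lib/iov_iter.c in particular) never shows up in /proc/modules, so a
check like this can only ever say "known-risky module loaded", never
"system is safe".

```python
from pathlib import Path

# ASSUMPTION: illustrative guesses at module names for a few of the
# drivers listed above; verify each one (e.g. with modinfo) before use.
RISKY_MODULES = {
    "i915", "ivtv", "genwqe_card", "xilinx_sdfec", "rio_mport_cdev",
    "st", "tee", "kvm_amd", "orangefs", "dfl_afu",
}


def loaded_modules(proc_modules_text):
    """Return module names found in /proc/modules-formatted text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first field is the module name; normalize dashes.
            names.add(fields[0].replace("-", "_"))
    return names


def risky_loaded(proc_modules_text, risky=RISKY_MODULES):
    """Return the sorted list of loaded modules that are on the deny-list."""
    return sorted(loaded_modules(proc_modules_text) & set(risky))


def cma_crashkernel_looks_safe(proc_modules_text, risky=RISKY_MODULES):
    """True if no deny-listed module is loaded (a necessary check only)."""
    return not risky_loaded(proc_modules_text, risky)


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    hits = risky_loaded(text)
    if hits:
        print("not enabling CMA crashkernel, loaded:", ", ".join(hits))
    else:
        print("no deny-listed modules loaded")
```

A deny-list seemed the natural shape here because the default would be
to enable the feature only where we are reasonably sure it is safe;
anything not covered by the list would still need a manual decision.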
> > Jiri will know better than me but for us a proper crash memory
> > configuration has become a real nut. You do not want to reserve too much
> > because it is effectively cutting of the usable memory and we regularly
> > hit into "not enough memory" if we tried to be savvy. The more tight you
> > try to configure the easier to fail that is. Even worse any in kernel
> > memory consumer can increase its memory demand and get the overall
> > consumption off the cliff. So this is not an easy to maintain solution.
> > CMA backed crash memory can be much more generous while still usable.
>
> Hmm, Redhat could go in a different way. We have been trying to:
> 1) customize initrd for kdump kernel specifically, e.g exclude unneeded
> devices's driver to save memory;

ditto

> 2) monitor device and kenrel memory usage if they begin to consume much
> more memory than before. We have CI testing cases to watch this. We ever
> found one NIC even eat up GB level memory, then this need be
> investigated and fixed.
> With these effort, our default crashkernel values satisfy most of cases,
> surely not call cases. Only rare cases need be handled manually,
> increasing crashkernel.

We get a lot of problems reported by partners testing kdump on
their setups prior to release. But even if we tune the reserved
size up, OOM is still the most common reason for kdump to fail
when the product starts getting used in real life. It's been
pretty frustrating for a long time.

> Wondering how you will use this crashkernel=,cma syntax. On normal
> machines and virt guests, not much meomry is needed, usually 256M or a
> little more is enough. On those high end systems with hundreds of Giga
> bytes, even Tera bytes of memory, I don't think the saved memory with
> crashkernel=,cma make much sense.

I feel the exact opposite about VMs. Reserving hundreds of MB for
crash kernel on _every_ VM on a busy VM host wastes the most
memory.
VMs are often tuned to a well-defined task and can be set up
with very little memory, so the ~256 MB can be a huge part of
that. And while it's theoretically better to dump from the
hypervisor, users still often prefer kdump because the hypervisor
may not be under their control. Also, in a VM it should be much
easier to be sure the machine is safe WRT the potential DMA
corruption as it has fewer HW drivers. So I actually thought the
CMA reservation could be most useful on VMs.

Thanks,

-- 
Jiri Bohac
SUSE Labs, Prague, Czechia