Date: Mon, 18 Jun 2018 18:11:26 -0400 (EDT)
From: Mikulas Patocka
To: Michal Hocko
cc: jing xia, Mike Snitzer, agk@redhat.com, dm-devel@redhat.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: dm bufio: Reduce dm_bufio_lock contention
In-Reply-To: <20180615130925.GI24039@dhcp22.suse.cz>

On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> >
> >
> > On Fri, 15 Jun 2018, Michal Hocko wrote:
> >
> > > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > >
> > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses
> > > > these flags too. dm-bufio is just a big mempool.
> > >
> > > This doesn't answer my question though. Somebody else is doing it is not
> > > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > > allocation AFAICS. So why do you really need it now? Why cannot you
> >
> > dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
> > __GFP_NOWARN" since the kernel 3.2 when it was introduced.
> >
> > In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT
> > allocation, then drops the lock and does GFP_NOIO with the dropped lock
> > (because someone was likely experiencing the same issue that is reported
> > in this thread) - there are two commits that change it - 9ea61cac0 and
> > 41c73a49df31.
>
> OK, I see. Is there any fundamental reason why this path has to do one
> round of GFP_IO or it can keep NOWAIT, drop the lock, sleep and retry
> again?

If the process is woken up, there was some buffer added to the freelist,
or refcount of some buffer was dropped to 0. In this case, we don't want
to drop the lock and use GFP_NOIO, because the freed buffer may disappear
when we drop the lock.

> [...]
> > > is the same class of problem, honestly, I dunno. And I've already said
> > > that stalling __GFP_NORETRY might be a good way around that but that
> > > needs much more consideration and existing users examination. I am not
> > > aware anybody has done that. Doing changes like that based on a single
> > > user is certainly risky.
> >
> > Why don't you set any rules how these flags should be used?
>
> It is really hard to change rules during the game. You basically have to
> examine all existing users and that is well beyond my time scope. I've
> tried that where it was possible. E.g. __GFP_REPEAT and turned it into a
> well defined semantic. __GFP_NORETRY is a bigger beast.
>
> Anyway, I believe that it would be much safer to look at the problem
> from a highlevel perspective. You seem to be focused on __GFP_NORETRY
> little bit too much IMHO. We are not throttling callers which explicitly
> do not want to or cannot - see current_may_throttle. Is it possible that
> both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
> or use other means? E.g. do you have backing_dev_info at that time?
> --
> Michal Hocko
> SUSE Labs

I grepped the kernel for __GFP_NORETRY and triaged the callers. I found 16
cases without a fallback - those are bugs that make various functions
randomly return -ENOMEM.

Most of the callers provide a fallback.

There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two
different functions - if the allocation is larger than
PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller.
If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER,
__GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations
don't trigger the oom killer at all).

So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in
the cases where the caller wants to avoid triggering the oom killer (the
problem is that __GFP_NORETRY causes random failure even in no-oom
situations but __GFP_RETRY_MAYFAIL doesn't).

So my suggestion is - fix these obvious bugs when someone allocates memory
with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be
just changed to return NULL instead of sleeping.

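To make the distinction between a correct and a buggy fallback concrete,
here is a minimal userspace sketch, not kernel code. Everything in it is an
assumption made for illustration: SKETCH_NORETRY, struct sketch_allocator,
sketch_alloc() and alloc_shrinking() are hypothetical names, and the
"allocator" simply refuses no-retry requests above a size limit to mimic an
allocation that gives up early. The opportunistic attempts use the no-retry
flag and shrink the request; what matters is which flags the last,
minimum-size attempt uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag for this sketch only; not the kernel's gfp value. */
#define SKETCH_NORETRY 0x1u

/*
 * Hypothetical allocator model: a no-retry request larger than
 * noretry_limit bytes fails immediately, imitating an allocation that
 * gives up under memory pressure. Requests without the flag go to malloc().
 */
struct sketch_allocator {
	size_t noretry_limit;
};

static void *sketch_alloc(const struct sketch_allocator *a, size_t size,
			  unsigned int flags)
{
	if ((flags & SKETCH_NORETRY) && size > a->noretry_limit)
		return NULL;
	return malloc(size);
}

/*
 * Try "want" bytes with the no-retry flag, halving the request on each
 * failure. The final attempt asks for "min" bytes with last_flags:
 *   last_flags == 0              - the flag is dropped (real fallback)
 *   last_flags == SKETCH_NORETRY - the flag is kept (the buggy pattern)
 * On return, *got holds the size obtained, or 0 on failure.
 */
static void *alloc_shrinking(const struct sketch_allocator *a, size_t want,
			     size_t min, unsigned int last_flags, size_t *got)
{
	size_t size;
	void *p;

	assert(min > 0 && min <= want);

	for (size = want; size > min; size /= 2) {
		p = sketch_alloc(a, size, SKETCH_NORETRY);
		if (p) {
			*got = size;
			return p;
		}
	}

	p = sketch_alloc(a, min, last_flags);
	*got = p ? min : 0;
	return p;
}
```

With an allocator whose noretry_limit is 0, calling alloc_shrinking() with
last_flags set to 0 still ends up with the minimum-size buffer, while
passing SKETCH_NORETRY as last_flags fails outright. That second case is
the "falls back to a smaller size, but doesn't drop __GFP_NORETRY" pattern
in this model.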
arch/arm/mm/dma-mapping.c - fallback to a smaller size without __GFP_NORETRY
arch/mips/mm/dma-default.c - says that it uses __GFP_NORETRY to avoid the oom killer, provides no fallback - it seems to be a BUG
arch/sparc/mm/tsb.c - fallback to a smaller size without __GFP_NORETRY
arch/x86/include/asm/floppy.h - __GFP_NORETRY doesn't seem to serve any purpose, it may cause random failures during initialization, can be removed - BUG
arch/powerpc/mm/mmu_context_iommu.c - uses it just during moving pages, there's no problem with failure
arch/powerpc/platforms/pseries/cmm.c - a vm balloon driver, no problem with failure
block/bio.c - falls back to mempool
block/blk-mq.c - erroneous use of __GFP_NORETRY during initialization, it falls back to a smaller size, but doesn't drop the __GFP_NORETRY flag (BUG)
drivers/gpu/drm/i915/i915_gem.c - it starts with __GFP_NORETRY and on failure, it ORs it with __GFP_RETRY_MAYFAIL (which of these conflicting flags wins?)
drivers/gpu/drm/i915/i915_gem_gtt.c - __GFP_NORETRY is used during initialization (BUG), it shouldn't be used
drivers/gpu/drm/i915/i915_gem_execbuffer.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/i915/i915_gem_internal.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/i915/i915_gem_userptr.c - seems to provide fallback
drivers/gpu/drm/i915/i915_gpu_error.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/etnaviv/etnaviv_dump.c - coredump on error path, no problem if it fails
drivers/gpu/drm/ttm/ttm_page_alloc.c - uses __GFP_NORETRY for transparent hugepages, no problem with failure
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c - uses __GFP_NORETRY for transparent hugepages, no problem with failure
drivers/gpu/drm/msm/msm_gem_submit.c - uses __GFP_NORETRY to process ioctl and lacks a fallback - it is a BUG - __GFP_NORETRY should be dropped
drivers/hv/hv_balloon.c - a vm balloon driver, no problem with failure
drivers/crypto/chelsio/chtls/chtls_io.c - fallback to a smaller size without __GFP_NORETRY
drivers/xen/balloon.c - a vm balloon driver, no problem with failure
drivers/mtd/mtdcore.c - fallback to a smaller size without __GFP_NORETRY
drivers/md/dm-verity-target.c - skips prefetch on failure, no problem
drivers/md/dm-writecache.c - falls back to smaller i/os on failure
drivers/md/dm-bufio.c - reserves some buffers on creation and falls back to them
drivers/md/dm-integrity.c - falls back to sector-by-sector verification
drivers/md/dm-kcopyd.c - falls back to reserved pages
drivers/md/dm-crypt.c - falls back to mempool
drivers/iommu/dma-iommu.c - fallback to a smaller size without __GFP_NORETRY
drivers/mmc/core/mmc_test.c - fallback to a smaller size, but doesn't drop __GFP_NORETRY - BUG
drivers/staging/android/ion/ion_system_heap.c - fallback to a smaller size without __GFP_NORETRY
fs/cachefiles/ - uses __GFP_NORETRY extensively - since this is just a cache, failure supposedly shouldn't be a problem - but it's hard to verify that the whole driver handles failures properly
fs/xfs/xfs_buf.c - uses __GFP_NORETRY only on readahead
fs/fscache/cookie.c - no fallback, but if it fails, the caller will just invalidate the cache entry - no problem
fs/fscache/page.c - like above - failure will just inhibit caching
fs/nfs/write.c - fails only if allowed by the argument never_fail
include/linux/kexec.h - no fallback - it seems to be a BUG
include/linux/pagemap.h - uses __GFP_NORETRY only on readahead
kernel/bpf/syscall.c - it says it needs __GFP_NORETRY to avoid the oom killer - provides no fallback, so it seems to be a BUG
kernel/events/ring_buffer.c - falls back to a smaller size, but doesn't drop __GFP_NORETRY - BUG
kernel/groups.c - falls back to vmalloc (should it use kvmalloc?)
kernel/trace/ring_buffer.c - no fallback - BUG
kernel/trace/trace.c - no fallback - BUG
kernel/power/swap.c - falls back, but there is a useless WARN_ON_ONCE(1) on the fallback path
lib/debugobjects.c - turns off debugging on allocation failure
lib/rhashtable.c - uses __GFP_NORETRY only with GFP_ATOMIC - __GFP_NORETRY is useless
mm/hugetlb.c - no problem if hugepage allocation fails
mm/mempool.c - falls back to mempool
mm/kmemleak.c - disables kmemleak on allocation failure
mm/page_alloc.c/__page_frag_cache_refill - fallback to a smaller size without __GFP_NORETRY
mm/zswap.c/zswap_pool_create - used during pool creation - probably a BUG
mm/zswap.c/zswap_frontswap_store - used when compressing a page - probably ok to fail
mm/memcontrol.c - I don't know if it is a bug
mm/shmem.c - used during hugepage allocation - ok to fail
mm/slub.c - fallback to a smaller size without __GFP_NORETRY
mm/util.c - fallback to vmalloc
net/packet/af_packet.c - fallback without __GFP_NORETRY
net/xdp/xsk_queue.c - no fallback - BUG
net/core/skbuff.c - fallback to a smaller size without __GFP_NORETRY
net/core/sock.c - fallback to a smaller size without __GFP_NORETRY
net/netlink/af_netlink.c - it's ok to fail when decreasing the size of skb
net/smc/smc_core.c - falls back to a smaller size, but doesn't drop __GFP_NORETRY - BUG
net/netfilter/x_tables.c - __GFP_NORETRY is used to avoid the oom killer - provides no fallback - it seems to be a BUG
security/integrity/ima/ima_crypto.c - fallback to a smaller size without __GFP_NORETRY
sound/core/memalloc.c - __GFP_NORETRY is used to avoid the oom killer - provides no fallback - it seems to be a BUG

Mikulas