From: Shakeel Butt
Date: Mon, 2 Dec 2019 16:13:56 -0800
Subject: Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab
To: Andrey Ryabinin
Cc: Pavel Tikhomirov, Andrew Morton, LKML, Cgroups, Linux MM,
    Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
    Chris Down, Yang Shi, Tejun Heo, Thomas Gleixner, "Kirill A. Shutemov",
    Konstantin Khorenko, Kirill Tkhai, Trond Myklebust, Anna Schumaker,
    "J. Bruce Fields", Chuck Lever, linux-nfs@vger.kernel.org,
    Alexander Viro, linux-fsdevel
In-Reply-To: <4e2d959a-0b0e-30aa-59b4-8e37728e9793@virtuozzo.com>
References: <20191129214541.3110-1-ptikhomirov@virtuozzo.com>
    <4e2d959a-0b0e-30aa-59b4-8e37728e9793@virtuozzo.com>
X-Mailing-List: linux-nfs@vger.kernel.org

On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin wrote:
>
> On 11/30/19 12:45 AM, Pavel Tikhomirov wrote:
> > We have a problem: shrinker_rwsem can be held for read for a long time
> > in shrink_slab, and meanwhile any process trying to manage shrinkers
> > hangs.
> >
> > The shrinker_rwsem is taken in shrink_slab while traversing
> > shrinker_list. The traversal tries to shrink something on nfs (a hard
> > mount), but the nfs server is already dead at that moment and the rpc
> > will never succeed. In general, any shrinker can take significant time
> > in do_shrink_slab, so it's a bad idea to hold the list lock there.
> >
> > We have a similar problem in shrink_slab_memcg, except that there we
> > traverse shrinker_map+shrinker_idr instead.
> >
> > The idea of the patch is to take a refcount on the chosen shrinker so
> > it won't disappear, and to release shrinker_rwsem while we are in
> > do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
> > refcount and continue the traversal.
> >
> > We also need a wait_queue so that unregister_shrinker can wait for the
> > refcnt to become zero. Only after that can we safely remove the
> > shrinker from the list and idr, and free the shrinker.
> >
> > With the patch applied on a mainstream kernel I've reproduced the nfs
> > hang in do_shrink_slab; all other mounts/umounts pass fine without any
> > hang.
> >
> > Here is the reproduction on a kernel without the patch:
> >
> > 1) Set up nfs on the server node with some files in it (e.g. 200):
> >
> > [server]# cat /etc/exports
> > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async)
> >
> > 2) Hard-mount it on the client node:
> >
> > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt
> >
> > 3) Open some (e.g. 200) files on the mount:
> >
> > [client]# for i in $(find /mnt/ -type f | head -n 200); \
> >     do setsid sleep 1000 &>/dev/null <$i & done
> >
> > 4) Kill all the openers:
> >
> > [client]# killall sleep -9
> >
> > 5) Pull the network cable out on the client node.
> >
> > 6) Drop caches on the client; it will hang on nfs while holding the
> > shrinker_rwsem lock for read:
> >
> > [client]# echo 3 > /proc/sys/vm/drop_caches
> >
> > crash> bt ...
> > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash"
> >  #0 [...] __schedule at ...
> >  #1 [...] schedule at ...
> >  #2 [...] rpc_wait_bit_killable at ... [sunrpc]
> >  #3 [...] __wait_on_bit at ...
> >  #4 [...] out_of_line_wait_on_bit at ...
> >  #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
> >  #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
> >  #7 [...] nfs_do_return_delegation at ... [nfsv4]
> >  #8 [...] nfs4_evict_inode at ... [nfsv4]
> >  #9 [...] evict at ...
> > #10 [...] dispose_list at ...
> > #11 [...] prune_icache_sb at ...
> > #12 [...] super_cache_scan at ...
> > #13 [...] do_shrink_slab at ...
> > #14 [...] shrink_slab at ...
> > #15 [...] drop_slab_node at ...
> > #16 [...] drop_slab at ...
> > #17 [...] drop_caches_sysctl_handler at ...
> > #18 [...] proc_sys_call_handler at ...
> > #19 [...] vfs_write at ...
> > #20 [...] ksys_write at ...
> > #21 [...] do_syscall_64 at ...
> > #22 [...] entry_SYSCALL_64_after_hwframe at ...
> >
> > 7) All other mount/umount activity now hangs, unable to take
> > shrinker_rwsem for write:
> >
> > [client]# mount -t tmpfs tmpfs /tmp
> >
> > crash> bt ...
> > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount"
> >  #0 [...] __schedule at ...
> >  #1 [...] schedule at ...
> >  #2 [...] rwsem_down_write_slowpath at ...
> >  #3 [...] prealloc_shrinker at ...
> >  #4 [...] alloc_super at ...
> >  #5 [...] sget at ...
> >  #6 [...] mount_nodev at ...
> >  #7 [...] legacy_get_tree at ...
> >  #8 [...] vfs_get_tree at ...
> >  #9 [...] do_mount at ...
> > #10 [...] ksys_mount at ...
> > #11 [...] __x64_sys_mount at ...
> > #12 [...] do_syscall_64 at ...
> > #13 [...] entry_SYSCALL_64_after_hwframe at ...
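The scheme quoted above (pin the shrinker, drop shrinker_rwsem across
do_shrink_slab, then re-take the lock and unpin, with unregister_shrinker
waiting on a wait_queue) would look roughly like the sketch below. To be
clear, this is an illustration and not the actual patch: the refcnt field
and shrinker_wq wait queue are made-up names, and fencing off new
references once unregistering has started is omitted for brevity.

/* Illustrative only -- not the actual patch. */

static DECLARE_WAIT_QUEUE_HEAD(shrinker_wq);

struct shrinker {
	/* ... existing fields ... */
	atomic_t refcnt;	/* pins the shrinker while the rwsem is dropped */
};

/* Traversal in shrink_slab(): pin the current shrinker, drop the lock
 * around the potentially slow callback, then re-take the lock and unpin. */
static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				 struct mem_cgroup *memcg, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	down_read(&shrinker_rwsem);
	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		atomic_inc(&shrinker->refcnt);
		up_read(&shrinker_rwsem);

		freed += do_shrink_slab(&sc, shrinker, priority);

		down_read(&shrinker_rwsem);
		if (atomic_dec_and_test(&shrinker->refcnt))
			wake_up_all(&shrinker_wq);
		/* the pinned node is still linked (see unregister below),
		 * so continuing the list walk from it is safe */
	}
	up_read(&shrinker_rwsem);
	return freed;
}

/* unregister_shrinker() waits for the refcnt to drain; only then is it
 * safe to unlink the shrinker from the list/idr and free it. */
void unregister_shrinker(struct shrinker *shrinker)
{
	wait_event(shrinker_wq, atomic_read(&shrinker->refcnt) == 0);

	down_write(&shrinker_rwsem);
	list_del(&shrinker->list);
	up_write(&shrinker_rwsem);
}

The key property is that the shrinker is only unlinked and freed once its
refcount has drained, so a traversal that re-takes the rwsem can safely
continue the walk from a node it has pinned.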
>
> I don't think this patch solves the problem; it only fixes one minor
> symptom of it. The actual problem here is the reclaim hang in nfs:
> it means that any process, including kswapd, may go into nfs inode
> reclaim and get stuck there.
>
> Even mount() itself has GFP_KERNEL allocations in its path, so it
> might still get stuck there even with your patch.
>
> I think this should be handled at the nfs/vfs level by making inode
> eviction during reclaim more asynchronous.

Though I agree that we should fix shrinkers so they don't get stuck (and
are more async), I still think the problem this patch solves is worth
fixing. On machines running multiple workloads, one job stuck in a slab
shrinker while blocking all other unrelated jobs that want
shrinker_rwsem breaks isolation and amounts to a denial of service.

Shakeel
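P.S. To make the contention concrete, here is roughly where the two
backtraces above meet. This is simplified and paraphrased from the
relevant code paths of that era, not verbatim kernel source:

/* Reclaim side (first backtrace): shrink_slab() holds shrinker_rwsem
 * for read across every callback, so a dead nfs server parks it here. */
static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				 struct mem_cgroup *memcg, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))
		return 0;

	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		/* frame #13 of the drop_caches backtrace: this call can
		 * block forever waiting on the dead nfs server */
		freed += do_shrink_slab(&sc, shrinker, priority);
	}

	up_read(&shrinker_rwsem);
	return freed;
}

/* Mount side (second backtrace): registering the new superblock's
 * shrinker needs the rwsem for write, so every mount sleeps behind the
 * stuck reader above. */
int prealloc_shrinker(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);	/* frame #2: rwsem_down_write_slowpath */
	/* ... allocate the shrinker's idr slot, etc. ... */
	up_write(&shrinker_rwsem);
	return 0;
}

So any workload that can park inside a single shrinker callback
serializes every other mount and umount on the system behind it.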