Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754611AbcKIS3H (ORCPT ); Wed, 9 Nov 2016 13:29:07 -0500 Received: from mail-wm0-f41.google.com ([74.125.82.41]:35689 "EHLO mail-wm0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752397AbcKIS0T (ORCPT ); Wed, 9 Nov 2016 13:26:19 -0500 MIME-Version: 1.0 In-Reply-To: <20161109111005.GA32353@quack2.suse.cz> References: <20161102220851.GA1839@veci.piliscsaba.szeredi.hu> <20161105213411.GA32353@quack2.suse.cz> <20161109111005.GA32353@quack2.suse.cz> From: Amir Goldstein Date: Wed, 9 Nov 2016 20:26:16 +0200 Message-ID: Subject: Re: fsnotify_mark_srcu wtf? To: Jan Kara Cc: Miklos Szeredi , Eric Paris , linux-fsdevel , linux-kernel Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3972 Lines: 85 On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara wrote: > On Sun 06-11-16 08:45:54, Amir Goldstein wrote: >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara wrote: >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote: >> >> We've got a report where a fanotify daemon that implements permission checks >> >> screws up and doesn't send a reply. This then causes widespread hangs due to >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu() >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()-> >> >> fsnotify_mark_destroy_list() to block. >> > >> > Yes. But if a program implementing permission checks does not reply, your >> > system is likely hosed anyway. We can only try to somewhat limit the >> > damage... >> > >> >> That was my initial thought as well, but at least with the sample code >> Miklos sent >> the only thing that gets hosed is the one process watching that one file. >> You could think of a use case of fanotify being used to watch over files >> in a specific user directory, where the damage on the entire system >> should/could be limited. No? > > Yes, the damage could at least theoretically be limited only to those files > / dirs watched by the buggy process. > >> >> Below program demonstrates the issue. It should output a single line: >> >> >> >> close(inotify_fd): success >> >> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by >> >> the waiting permission event. >> >> >> >> Wouldn't making the srcu per-group fix this? Would that be too expensive? >> > >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm >> > not sure srcu would scale to that. Furthermore the SRCU protects the list >> > of groups that need to get notification so it would not even be easily >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply >> > to his patch. I'll try to find something to improve the situation but so >> > far I have no good idea... >> > >> >> Yes, very much buggy indeed :/ >> Anyway, the reason I drafted it quickly was to highlight the fact that the >> marks only need to live to the point of decision whether or not the event >> should be sent to the group and afterwards, its sufficient to grab the >> group reference, without having impact on the entire system. > > Yes, fanotify code as such does not need the marks anymore. But the core > fsnotify code does... > >> Yet another possible ugly (but less buggy) solution would be >> to iterate all marks under SRCU read protection. >> If any group is about to block (either by suggested return value >> EAGAIN or another >> by using a new op should_handle_event_deferred), defer event handling to post >> marks iteration, by keeping a few group references on stack. > > And this does not work as well... Fanotify must notify groups by their > priority so you cannot arbitrarily reorder ordering in which groups get > notified. I'm currently pondering on using mark refcount to pin it when > processing permission event but there are still some details to check. > All right, mark refcount sound like the proper solution. In case this approach doesn't work out for some reason, you may want to consider fsnotify_mark_srcu (and destroy_list) per priority. Or just 2 srcu, one for for priority 0 and one for other. Because priority > 0 may block and priority 0 may not block. The priority set on an inode/vfsmount can be easily encoded into the i_fsnotify_mask/mnt_fsnotify_mask, e.g.: #define FS_GROUP_PRIO_1 0x00040000 /* fanotify content based access control */ #define FS_GROUP_PRIO_2 0x00080000 /* fanotify pre-content access */ in fsnotify(), only need to take the read side srcu of relevant priority groups, but need to take extra care to set the priority bit in the inode/mnt mask *before* adding the mark to the list, with a relevant memory barrier before checking the priority bits in fsnotify(). Amir.