Message-id: <55828064.5040301@samsung.com>
Date: Thu, 18 Jun 2015 10:25:08 +0200
From: Beata Michalska
To: Dave Chinner
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com,
	hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
	kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
	<1434460173-18427-2-git-send-email-b.michalska@samsung.com>
	<20150617230605.GK10224@dastard>
In-reply-to: <20150617230605.GK10224@dastard>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On 06/18/2015 01:06 AM, Dave Chinner wrote:
> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote:
>> Introduce configurable generic interface for file
>> system-wide event notifications, to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
>>
>> The notifications are to be issued through generic
>> netlink interface by newly introduced multicast group.
>>
>> Threshold notifications have been included, allowing
>> triggering an event whenever the amount of free space drops
>> below a certain level - or levels to be more precise as two
>> of them are being supported: the lower and the upper range.
>> The notifications work both ways: once the threshold level
>> has been reached, an event shall be generated whenever
>> the number of available blocks goes up again re-activating
>> the threshold.
>>
>> The interface has been exposed through a vfs. Once mounted,
>> it serves as an entry point for the set-up where one can
>> register for particular file system events.
>>
>> Signed-off-by: Beata Michalska
>
> This has massive scalability problems:
>
>> + 4.3 Threshold notifications:
>> +
>> + #include
>> + void fs_event_alloc_space(struct super_block *sb, u64 ncount);
>> + void fs_event_free_space(struct super_block *sb, u64 ncount);
>> +
>> + Each filesystem supporting the threshold notifications should call
>> + fs_event_alloc_space/fs_event_free_space respectively whenever the
>> + amount of available blocks changes.
>> + - sb: the filesystem's super block
>> + - ncount: number of blocks being acquired/released
>
> ... here.
>
>> + Note that to properly handle the threshold notifications the fs events
>> + interface needs to be kept up to date by the filesystems. Each should
>> + register fs_trace_operations to enable querying the current number of
>> + available blocks.
>
> Have you noticed that the filesystems have percpu counters for
> tracking global space usage? There's good reason for that - taking a
> spinlock in such a hot accounting path causes severe contention.
>
>> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
>> +{
>> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
>> +		      nla_total_size(sizeof(u64));
>> +
>> +	fs_netlink_send_event(size, event_id, create_common_msg, en);
>> +}
>> +
>> +static void fs_event_send_thresh(struct fs_trace_entry *en,
>> +				 unsigned int event_id)
>> +{
>> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
>> +		      nla_total_size(sizeof(u64)) * 2;
>> +
>> +	fs_netlink_send_event(size, event_id, create_thresh_msg, en);
>> +}
>> +
>> +void fs_event_notify(struct super_block *sb, unsigned int event_id)
>> +{
>> +	struct fs_trace_entry *en;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return;
>> +
>> +	spin_lock(&en->lock);
>> +	if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC))
>> +		fs_event_send(en, event_id);
>> +	spin_unlock(&en->lock);
>> +	fs_trace_entry_put(en);
>> +}
>> +EXPORT_SYMBOL(fs_event_notify);
>> +
>> +void fs_event_alloc_space(struct super_block *sb, u64 ncount)
>> +{
>> +	struct fs_trace_entry *en;
>> +	s64 count;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return;
>
> Adds an atomic write to get the trace entry,
>
>> +	spin_lock(&en->lock);
>
> a spin lock to lock the entry,
>
>
>> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
>> +		goto leave;
>> +	/*
>> +	 * we shouldn't drop below 0 here,
>> +	 * unless there is a sync issue somewhere (?)
>> +	 */
>> +	count = en->th.avail_space - ncount;
>> +	en->th.avail_space = count < 0 ? 0 : count;
>> +
>> +	if (en->th.avail_space > en->th.lrange)
>> +		/* Not 'even' close - leave */
>> +		goto leave;
>> +
>> +	if (en->th.avail_space > en->th.urange) {
>> +		/* Close enough - the lower range has been reached */
>> +		if (!(en->th.state & THRESH_LR_BEYOND)) {
>> +			/* Send notification */
>> +			fs_event_send_thresh(en, FS_THR_LRBELOW);
>> +			en->th.state &= ~THRESH_LR_BELOW;
>> +			en->th.state |= THRESH_LR_BEYOND;
>> +		}
>> +		goto leave;
>
> Then puts the entire netlink send path inside this spinlock, which
> includes memory allocation and all sorts of non-filesystem code
> paths. And it may be inside critical filesystem locks as well....
>
> Apart from the serialisation problem of the locking, adding
> memory allocation and the network send path to filesystem code
> that is effectively considered "innermost" filesystem code is going
> to have all sorts of problems for various filesystems. In the XFS
> case, we simply cannot execute this sort of function in the places
> where we update global space accounting.
>
> As it is, I think the basic concept of separate tracking of free
> space is fundamentally flawed. What I think needs to be done is that
> filesystems need access to the thresholds for events, and then the
> filesystems call fs_event_send_thresh() themselves from appropriate
> contexts (ie. without compromising locking, scalability, memory
> allocation recursion constraints, etc).
>
> e.g. instead of tracking every change in free space, a filesystem
> might execute this once every few seconds from a workqueue:
>
>	event = fs_event_need_space_warning(sb, )
>	if (event)
>		fs_event_send_thresh(sb, event);
>
> User still gets warnings about space usage, but there's no runtime
> overhead or problems with lock/memory allocation contexts, etc.
>
> Cheers,
>
> Dave.
>

Having the fs keep a firm hand on the threshold limits would indeed be
a far more sane approach, though that would require each fs to add
support for that and handle most of it on their own.
Avoiding this was the main rationale behind this RFC. If fs people agree
to that, I'll be more than willing to drop this in favour of the per-fs
tracking solution. Personally, I hope they will.

Best Regards
Beata
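
For illustration only, the polling approach proposed in the quoted
review (compare free space against the thresholds periodically, outside
the allocation hot path, and report only when the level changes) could
be sketched in plain userspace C roughly as below. Everything in it is
hypothetical: struct space_thresh, enum space_level, classify_space()
and need_space_warning() are made-up names, not part of the patch or of
any kernel API, and the sketch leaves out the workqueue, the locking and
the netlink send entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical levels; these are not the FS_THR_* events from the patch. */
enum space_level { SPACE_OK, SPACE_WARN, SPACE_CRIT };

/* Hypothetical per-filesystem threshold state. */
struct space_thresh {
	uint64_t warn_blocks;	/* at or below this many free blocks: SPACE_WARN */
	uint64_t crit_blocks;	/* at or below this many free blocks: SPACE_CRIT */
	enum space_level last;	/* level seen by the previous poll */
};

/* Map a free-block count onto a level. */
static enum space_level classify_space(const struct space_thresh *t,
				       uint64_t avail)
{
	/* The critical threshold must not sit above the warning one. */
	assert(t->crit_blocks <= t->warn_blocks);

	if (avail <= t->crit_blocks)
		return SPACE_CRIT;
	if (avail <= t->warn_blocks)
		return SPACE_WARN;
	return SPACE_OK;
}

/*
 * Meant to be called from a periodic context with the current free-block
 * count. Returns true, and stores the new level in *level, only when the
 * level differs from the one seen by the previous call; otherwise returns
 * false and leaves *level untouched.
 */
static bool need_space_warning(struct space_thresh *t, uint64_t avail,
			       enum space_level *level)
{
	enum space_level now = classify_space(t, avail);

	if (now == t->last)
		return false;

	t->last = now;
	*level = now;
	return true;
}
```

A caller would sample the free-block count at whatever interval suits
it, call need_space_warning(), and send a notification only when it
returns true, so crossing a threshold in either direction yields a
single event rather than one per allocation.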