Hi All,
This series is a follow-up to the RFC patchset for the generic filesystem
events interface [1]. As the synchronization method has changed
significantly, more extensive (stress) testing has been performed, hence
the delay.
Changes from v1:
- Improved synchronization: switched to RCU accompanied by a
ref counting mechanism
- Limited the scope of supported event types along with default
event codes
- Slightly modified configuration (event types followed by arguments
where required)
- Updated documentation
- Unified naming for netlink attributes
- Updated netlink message format to include dev minor:major numbers
regardless of the filesystem type
- Switched to single cmd id for messages
- Removed the per-config-entry ids
---
[1] https://lkml.org/lkml/2015/4/15/46
---
Beata Michalska (4):
fs: Add generic file system event notifications
ext4: Add helper function to mark group as corrupted
ext4: Add support for generic FS events
shmem: Add support for generic FS events
Documentation/filesystems/events.txt | 231 ++++++++++
fs/Makefile | 1 +
fs/events/Makefile | 6 +
fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
fs/events/fs_event.h | 25 ++
fs/events/fs_event_netlink.c | 99 +++++
fs/ext4/balloc.c | 25 +-
fs/ext4/ext4.h | 10 +
fs/ext4/ialloc.c | 5 +-
fs/ext4/inode.c | 2 +-
fs/ext4/mballoc.c | 17 +-
fs/ext4/resize.c | 1 +
fs/ext4/super.c | 39 ++
fs/namespace.c | 1 +
include/linux/fs.h | 6 +-
include/linux/fs_event.h | 58 +++
include/uapi/linux/fs_event.h | 54 +++
include/uapi/linux/genetlink.h | 1 +
mm/shmem.c | 33 +-
net/netlink/genetlink.c | 7 +-
20 files changed, 1357 insertions(+), 34 deletions(-)
create mode 100644 Documentation/filesystems/events.txt
create mode 100644 fs/events/Makefile
create mode 100644 fs/events/fs_event.c
create mode 100644 fs/events/fs_event.h
create mode 100644 fs/events/fs_event_netlink.c
create mode 100644 include/linux/fs_event.h
create mode 100644 include/uapi/linux/fs_event.h
--
1.7.9.5
Introduce a configurable generic interface for file
system-wide event notifications, to provide file
systems with a common way of reporting any potential
issues as they emerge.
The notifications are issued through the generic netlink
interface, using a newly introduced multicast group.
Threshold notifications are included, allowing an event
to be triggered whenever the amount of free space drops
below a certain level - or levels, to be more precise, as two
of them are supported: the lower and the upper range.
The notifications work both ways: once a threshold level
has been reached, an event is also generated when
the number of available blocks goes up again, re-activating
the threshold.
The interface is exposed through a virtual file system. Once
mounted, it serves as an entry point for the set-up where one can
register for particular file system events.
Signed-off-by: Beata Michalska <[email protected]>
---
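Note: the two-level threshold behaviour described above can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions, not the kernel code from this patch: struct thresh_model,
model_alloc(), model_free() and the enum values are hypothetical names
made up for illustration, and the two state bits kept per range in
fs_event.c are collapsed into a single "already reported" flag.

```c
#include <stdint.h>

/* Illustrative event codes; the names only loosely mirror FS_THR_*. */
enum thr_event { THR_NONE, THR_LRBELOW, THR_LRABOVE, THR_URBELOW, THR_URABOVE };

/* Hypothetical userspace model of one watched filesystem. */
struct thresh_model {
	uint64_t avail;   /* currently available blocks */
	uint64_t lrange;  /* lower range (L1); must be >= urange */
	uint64_t urange;  /* upper range (L2) */
	int lr_crossed;   /* "dropped to L1" already reported */
	int ur_crossed;   /* "dropped to L2" already reported */
};

/* Blocks were allocated: report each downward crossing only once. */
static enum thr_event model_alloc(struct thresh_model *m, uint64_t n)
{
	m->avail = n > m->avail ? 0 : m->avail - n;

	if (m->avail > m->lrange)
		return THR_NONE;

	if (m->avail > m->urange) {
		if (m->lr_crossed)
			return THR_NONE;
		m->lr_crossed = 1;
		return THR_LRBELOW;
	}

	if (m->ur_crossed)
		return THR_NONE;
	m->ur_crossed = 1;
	return THR_URBELOW;
}

/* Blocks were freed: re-arm a threshold once space rises above it. */
static enum thr_event model_free(struct thresh_model *m, uint64_t n)
{
	m->avail += n;

	if (m->avail > m->lrange && m->lr_crossed) {
		m->lr_crossed = 0;
		return THR_LRABOVE;
	}
	if (m->avail > m->urange && m->ur_crossed) {
		m->ur_crossed = 0;
		return THR_URABOVE;
	}
	return THR_NONE;
}
```

With L1 = 700 and L2 = 500, a filesystem going from 1000 down to 400
available blocks and back up would see one event per crossing in each
direction, and no repeats while it stays within the same band.
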
Documentation/filesystems/events.txt | 231 ++++++++++
fs/Makefile | 1 +
fs/events/Makefile | 6 +
fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
fs/events/fs_event.h | 25 ++
fs/events/fs_event_netlink.c | 99 +++++
fs/namespace.c | 1 +
include/linux/fs.h | 6 +-
include/linux/fs_event.h | 58 +++
include/uapi/linux/fs_event.h | 54 +++
include/uapi/linux/genetlink.h | 1 +
net/netlink/genetlink.c | 7 +-
12 files changed, 1257 insertions(+), 2 deletions(-)
create mode 100644 Documentation/filesystems/events.txt
create mode 100644 fs/events/Makefile
create mode 100644 fs/events/fs_event.c
create mode 100644 fs/events/fs_event.h
create mode 100644 fs/events/fs_event_netlink.c
create mode 100644 include/linux/fs_event.h
create mode 100644 include/uapi/linux/fs_event.h
diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt
new file mode 100644
index 0000000..df0b0e2
--- /dev/null
+++ b/Documentation/filesystems/events.txt
@@ -0,0 +1,231 @@
+
+ Generic file system event notification interface
+
+Document created 23 April 2015 by Beata Michalska <[email protected]>
+
+1. The reason behind:
+=====================
+
+There are many corner cases when things might get messy with the filesystems.
+And it is not always obvious what went wrong and when. Sometimes you might
+get some subtle hints that there is something going on - but by the time
+you realise it, it might be too late as you are already out of space
+or the filesystem has been remounted as read-only (for example). The generic
+interface for the filesystem events fills the gap by providing a rather
+easy way of real-time notifications triggered whenever something interesting
+happens, allowing filesystems to report events in a common way, as they occur.
+
+2. How does it work:
+====================
+
+The interface itself is exposed as an fstrace-type virtual file system,
+primarily to ease the process of setting up the configuration for the
+notifications. So for starters, it needs to be mounted:
+
+ mount -t fstrace none /sys/fs/events
+
+This will unveil the single fstrace filesystem entry - the 'config' file,
+through which the notifications are set up.
+
+Activating notifications for a particular filesystem is as straightforward
+as writing into the 'config' file. Note that by default all events, regardless
+of the actual filesystem type, are disregarded.
+
+Synopsis of config:
+------------------
+
+ MOUNT EVENT_TYPE [L1] [L2]
+
+ MOUNT : the filesystem's mount point
+ EVENT_TYPE : event types - currently two of them are being supported:
+
+ * generic events ("G") covering most common warnings
+ and errors that might be reported by any filesystem;
+ this option does not take any arguments;
+
+ * threshold notifications ("T") - events sent whenever
+ the amount of available space drops below a certain level;
+ it is possible to specify two threshold levels though
+ only one is required to properly set up the notifications;
+ as those refer to the number of available blocks, the lower
+ level [L1] must not be smaller than the upper one [L2]
+
+A sample request could look like the following:
+
+ echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
+
+Multiple requests may be specified, provided they are separated by semicolons.
+
+The configuration itself may be modified at any time. One can add/remove
+particular event types for a given filesystem, modify the threshold levels,
+and remove single or all entries from the 'config' file.
+
+ - Adding new event type:
+
+ $ echo MOUNT EVENT_TYPE > /sys/fs/events/config
+
+(Note that it is enough to provide only the event type to be enabled;
+the ones already set need not be repeated.)
+
+ - Removing event type:
+
+ $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
+
+ - Updating threshold limits:
+
+ $ echo MOUNT T L1 L2 > /sys/fs/events/config
+
+ - Removing single entry:
+
+ $ echo '!MOUNT' > /sys/fs/events/config
+
+ - Removing all entries:
+
+ $ echo > /sys/fs/events/config
+
+Reading the file will list all registered entries with their current set-up
+along with some additional info like the filesystem type and the backing device
+name if available.
+
+A final, though very important, note on the configuration: when and whether
+the actual events are triggered falls well beyond the scope of the generic
+filesystem events interface. It is up to a particular filesystem
+implementation which events are supported - if any at all. So if a
+given filesystem does not support event notifications, an attempt to
+enable them through the 'config' file will fail.
+
+
+3. The generic netlink interface support:
+=========================================
+
+Whenever an event notification is triggered (by a given filesystem) the current
+configuration is validated to decide whether a userspace notification
+should be sent. If there has been no request (by means of a 'config' file
+entry) for the given event, it will be silently disregarded. If, on the other
+hand, someone is 'watching' the given filesystem for specific events, a generic
+netlink message will be sent. A dedicated multicast group has been provided
+solely for this purpose, so in order to receive such notifications, one should
+subscribe to this new multicast group.
+
+3.1 Message format
+
+The FS_NL_C_EVENT command is stored within the generic netlink message header
+as the command field. The message payload provides more detailed info:
+the backing device major and minor numbers, the event code and the id of
+the process whose action led to the event occurrence. In case of threshold
+notifications, the current number of available blocks is included
+in the payload as well.
+
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | NETLINK MESSAGE HEADER |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | GENERIC NETLINK MESSAGE HEADER |
+ | (with FS_NL_C_EVENT as genlmsghdr cmd field) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Optional user specific message header |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | GENERIC MESSAGE PAYLOAD: |
+ +---------------------------------------------------------------+
+ | FS_NL_A_EVENT_ID (NLA_U32) |
+ +---------------------------------------------------------------+
+ | FS_NL_A_DEV_MAJOR (NLA_U32) |
+ +---------------------------------------------------------------+
+ | FS_NL_A_DEV_MINOR (NLA_U32) |
+ +---------------------------------------------------------------+
+ | FS_NL_A_CAUSED_ID (NLA_U32) |
+ +---------------------------------------------------------------+
+ | FS_NL_A_DATA (NLA_U64) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+The above figure is based on:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
+
+
+4. API Reference:
+=================
+
+ 4.1 Generic file system event interface data & operations
+
+ #include <linux/fs_event.h>
+
+ struct fs_trace_info {
+ void __rcu *e_priv; /* READ ONLY */
+ unsigned int events_cap_mask; /* Supported notifications */
+ const struct fs_trace_operations *ops;
+ };
+
+ struct fs_trace_operations {
+ void (*query)(struct super_block *, u64 *);
+ };
+
+ For events to be delivered, each filesystem needs to set up
+ the events_cap_mask field of the fs_trace_info structure, which is
+ embedded within the super_block structure. This should reflect the types of
+ events the filesystem wants to support. In case of threshold notifications,
+ apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
+ be provided, as this enables the events interface to get the up-to-date
+ number of available blocks whenever those notifications are
+ being requested.
+
+ The 'e_priv' field of the fs_trace_info structure should be completely ignored
+ as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
+ do not want to get yourself into some real trouble. If still, you are tempted
+ to do so - feel free, it's gonna be pure fun. Consider yourself warned.
+
+
+ 4.2 Event notification:
+
+ #include <linux/fs_event.h>
+ void fs_event_notify(struct super_block *sb, unsigned int event_id);
+
+ Notify the generic FS event interface of an occurring event.
+ This shall be used by any file system that wishes to inform any potential
+ listeners/watchers of a particular event.
+ - sb: the filesystem's super block
+ - event_id: an event identifier
+
+ 4.3 Threshold notifications:
+
+ #include <linux/fs_event.h>
+ void fs_event_alloc_space(struct super_block *sb, u64 ncount);
+ void fs_event_free_space(struct super_block *sb, u64 ncount);
+
+ Each filesystem supporting the threshold notifications should call
+ fs_event_alloc_space/fs_event_free_space respectively whenever the
+ number of available blocks changes.
+ - sb: the filesystem's super block
+ - ncount: number of blocks being acquired/released
+
+ Note that to properly handle the threshold notifications the fs events
+ interface needs to be kept up to date by the filesystems. Each should
+ register fs_trace_operations to enable querying the current number of
+ available blocks.
+
+ 4.4 Sending message through generic netlink interface
+
+ #include <linux/fs_event.h>
+
+ int fs_netlink_send_event(size_t size, unsigned int event_id,
+ int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
+
+ Although the fs event interface is fully responsible for sending the messages
+ over the netlink, filesystems might use the FS_EVENT multicast group to send
+ their own custom messages.
+ - size: the size of the message payload
+ - event_id: the event identifier
+ - compose_msg: a callback responsible for filling-in the message payload
+ - cbdata: message custom data
+
+ Calling fs_netlink_send_event will result in a message being sent to
+ the FS_EVENT multicast group. Note that the body of the message should be
+ prepared (set up) by the caller - through the compose_msg callback. The message's
+ sk_buff will be allocated on behalf of the caller (thus the size parameter).
+ The compose_msg should only fill the payload with proper data. Unless
+ the event id is specified as FS_EVENT_NONE, its value shall be added
+ to the payload prior to calling the compose_msg.
+
+
diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..798021d 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
obj-$(CONFIG_CEPH_FS) += ceph/
obj-$(CONFIG_PSTORE) += pstore/
obj-$(CONFIG_EFIVAR_FS) += efivarfs/
+obj-y += events/
diff --git a/fs/events/Makefile b/fs/events/Makefile
new file mode 100644
index 0000000..58d1454
--- /dev/null
+++ b/fs/events/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for the Linux Generic File System Event Interface
+#
+
+obj-y := fs_event.o
+obj-$(CONFIG_NET) += fs_event_netlink.o
diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
new file mode 100644
index 0000000..ea6afdd
--- /dev/null
+++ b/fs/events/fs_event.c
@@ -0,0 +1,770 @@
+/*
+ * Generic File System Events Interface
+ *
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/fs.h>
+#include <linux/hashtable.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/parser.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/rcupdate.h>
+#include <net/genetlink.h>
+#include "../mount.h"
+#include "fs_event.h"
+
+static LIST_HEAD(fs_trace_list);
+static DEFINE_SPINLOCK(fs_trace_lock);
+
+static struct kmem_cache *fs_trace_cachep __read_mostly;
+
+static atomic_t stray_traces = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(trace_wq);
+/*
+ * Threshold notification state bits.
+ * Note the reverse as this refers to the number
+ * of available blocks.
+ */
+#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */
+#define THRESH_LR_BEYOND 0x0002
+#define THRESH_UR_BELOW 0x0004
+#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */
+
+#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND)
+#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND)
+
+#define FS_TRACE_ADD 0x100000
+
+struct fs_trace_entry {
+ atomic_t count;
+ atomic_t active;
+ struct super_block *sb;
+ unsigned int notify;
+ struct path mnt_path;
+ struct list_head node;
+
+ struct fs_event_thresh {
+ u64 avail_space;
+ u64 lrange;
+ u64 urange;
+ unsigned int state;
+ } th;
+ struct rcu_head rcu_head;
+ spinlock_t lock;
+};
+
+static const match_table_t fs_etypes = {
+ { FS_EVENT_GENERIC, "G" },
+ { FS_EVENT_THRESH, "T" },
+ { 0, NULL },
+};
+
+static __always_inline int fs_trace_query_data(struct super_block *sb,
+ struct fs_trace_entry *en)
+{
+ if (sb->s_etrace.ops && sb->s_etrace.ops->query) {
+ sb->s_etrace.ops->query(sb, &en->th.avail_space);
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
+static inline void fs_trace_entry_list_del(struct fs_trace_entry *en)
+{
+ lockdep_assert_held(&fs_trace_lock);
+
+ if (!atomic_add_unless(&en->active, -1, 0))
+ return;
+ /*
+ * At this point the trace entry is being marked as inactive
+ * so no new references will be allowed.
+ */
+ list_del(&en->node);
+ atomic_inc(&stray_traces);
+}
+
+static inline void fs_trace_entry_free(struct fs_trace_entry *en)
+{
+ kmem_cache_free(fs_trace_cachep, en);
+}
+
+static void fs_destroy_trace_entry(struct rcu_head *rcu_head)
+{
+ struct fs_trace_entry *en = container_of(rcu_head,
+ struct fs_trace_entry, rcu_head);
+
+ WARN_ON(atomic_read(&en->active));
+ fs_trace_entry_free(en);
+ if (atomic_dec_and_test(&stray_traces))
+ wake_up(&trace_wq);
+}
+
+static inline void fs_release_trace_entry(struct fs_trace_entry *en)
+{
+ struct super_block *sb = en->sb;
+
+ rcu_assign_pointer(sb->s_etrace.e_priv, NULL);
+ call_rcu(&en->rcu_head, fs_destroy_trace_entry);
+}
+
+static inline void fs_trace_entry_put(struct fs_trace_entry *en)
+{
+ if (en && atomic_dec_and_test(&en->count))
+ fs_release_trace_entry(en);
+}
+
+static inline
+struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en)
+{
+ if (en) {
+ if (!atomic_inc_not_zero(&en->count))
+ return NULL;
+ /* Don't allow referencing inactive object */
+ if (!atomic_read(&en->active)) {
+ fs_trace_entry_put(en);
+ return NULL;
+ }
+ }
+ return en;
+}
+
+static struct fs_trace_entry *fetch_trace_entry(struct super_block *sb)
+{
+ struct fs_trace_entry *en;
+
+ if (!sb)
+ return NULL;
+
+ rcu_read_lock();
+ en = rcu_dereference(sb->s_etrace.e_priv);
+ en = fs_trace_entry_get(en);
+ rcu_read_unlock();
+
+ return en;
+}
+
+static int fs_remove_trace_entry(struct super_block *sb)
+{
+ struct fs_trace_entry *en;
+
+ en = fetch_trace_entry(sb);
+ if (!en)
+ return -EINVAL;
+
+ spin_lock(&fs_trace_lock);
+ /*
+ * The trace entry might have already been removed
+ * from the list of active traces with the proper
+ * ref drop, though it was still in use handling
+ * one of the fs events. This means that the object
+ * has been already scheduled for being released.
+ */
+ if (atomic_read(&en->active)) {
+ fs_trace_entry_list_del(en);
+ fs_trace_entry_put(en);
+ }
+ spin_unlock(&fs_trace_lock);
+ fs_trace_entry_put(en);
+ return 0;
+}
+
+static void fs_remove_all_traces(void)
+{
+ struct fs_trace_entry *en, *guard;
+
+ spin_lock(&fs_trace_lock);
+ list_for_each_entry_safe(en, guard, &fs_trace_list, node) {
+ fs_trace_entry_list_del(en);
+ fs_trace_entry_put(en);
+ }
+ spin_unlock(&fs_trace_lock);
+}
+
+static int create_common_msg(struct sk_buff *skb, void *data)
+{
+ struct fs_trace_entry *en = (struct fs_trace_entry *)data;
+ struct super_block *sb = en->sb;
+
+ if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
+ || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
+ return -EINVAL;
+
+ if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_nr(task_pid(current))))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int create_thresh_msg(struct sk_buff *skb, void *data)
+{
+ struct fs_trace_entry *en = (struct fs_trace_entry *)data;
+ int ret;
+
+ ret = create_common_msg(skb, data);
+ if (!ret)
+ ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space);
+ return ret;
+}
+
+static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
+{
+ size_t size = nla_total_size(sizeof(u32)) * 2 +
+ nla_total_size(sizeof(u64));
+
+ fs_netlink_send_event(size, event_id, create_common_msg, en);
+}
+
+static void fs_event_send_thresh(struct fs_trace_entry *en,
+ unsigned int event_id)
+{
+ size_t size = nla_total_size(sizeof(u32)) * 2 +
+ nla_total_size(sizeof(u64)) * 2;
+
+ fs_netlink_send_event(size, event_id, create_thresh_msg, en);
+}
+
+void fs_event_notify(struct super_block *sb, unsigned int event_id)
+{
+ struct fs_trace_entry *en;
+
+ en = fetch_trace_entry(sb);
+ if (!en)
+ return;
+
+ spin_lock(&en->lock);
+ if (en->notify & FS_EVENT_GENERIC)
+ fs_event_send(en, event_id);
+ spin_unlock(&en->lock);
+ fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_notify);
+
+void fs_event_alloc_space(struct super_block *sb, u64 ncount)
+{
+ struct fs_trace_entry *en;
+ s64 count;
+
+ en = fetch_trace_entry(sb);
+ if (!en)
+ return;
+
+ spin_lock(&en->lock);
+
+ if (!(en->notify & FS_EVENT_THRESH))
+ goto leave;
+ /* we shouldn't drop below 0 here,
+ * unless there is a sync issue somewhere (?)
+ */
+ count = en->th.avail_space - ncount;
+ en->th.avail_space = count < 0 ? 0 : count;
+
+ if (en->th.avail_space > en->th.lrange)
+ /* Not 'even' close - leave */
+ goto leave;
+
+ if (en->th.avail_space > en->th.urange) {
+ /* Close enough - the lower range has been reached */
+ if (!(en->th.state & THRESH_LR_BEYOND)) {
+ /* Send notification */
+ fs_event_send_thresh(en, FS_THR_LRBELOW);
+ en->th.state &= ~THRESH_LR_BELOW;
+ en->th.state |= THRESH_LR_BEYOND;
+ }
+ goto leave;
+ }
+ if (!(en->th.state & THRESH_UR_BEYOND)) {
+ fs_event_send_thresh(en, FS_THR_URBELOW);
+ en->th.state &= ~THRESH_UR_BELOW;
+ en->th.state |= THRESH_UR_BEYOND;
+ }
+
+leave:
+ spin_unlock(&en->lock);
+ fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_alloc_space);
+
+void fs_event_free_space(struct super_block *sb, u64 ncount)
+{
+ struct fs_trace_entry *en;
+
+ en = fetch_trace_entry(sb);
+ if (!en)
+ return;
+
+ spin_lock(&en->lock);
+
+ if (!(en->notify & FS_EVENT_THRESH))
+ goto leave;
+
+ en->th.avail_space += ncount;
+
+ if (en->th.avail_space > en->th.lrange) {
+ if (!(en->th.state & THRESH_LR_BELOW)
+ && en->th.state & THRESH_LR_BEYOND) {
+ /* Send notification */
+ fs_event_send_thresh(en, FS_THR_LRABOVE);
+ en->th.state &= ~THRESH_LR_BEYOND;
+ en->th.state |= THRESH_LR_BELOW;
+ goto leave;
+ }
+ }
+ if (en->th.avail_space > en->th.urange) {
+ if (!(en->th.state & THRESH_UR_BELOW)
+ && en->th.state & THRESH_UR_BEYOND) {
+ /* Notify */
+ fs_event_send_thresh(en, FS_THR_URABOVE);
+ en->th.state &= ~THRESH_UR_BEYOND;
+ en->th.state |= THRESH_UR_BELOW;
+ }
+ }
+leave:
+ spin_unlock(&en->lock);
+ fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_free_space);
+
+void fs_event_mount_dropped(struct vfsmount *mnt)
+{
+ /*
+ * This gets tricky here: the mount is dropped but the super
+ * might not get released at once so there is very small chance
+ * some notifications will come through.
+ */
+ fs_remove_trace_entry(mnt->mnt_sb);
+}
+
+static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh,
+ unsigned int nmask)
+{
+ struct fs_trace_entry *en;
+ struct super_block *sb;
+ struct mount *r_mnt;
+
+ en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL);
+ if (unlikely(!en))
+ return -ENOMEM;
+ /*
+ * Note that no reference is being taken here for the path as it would
+ * make the umount unnecessarily puzzling (due to an extra 'valid'
+ * reference for the mnt).
+ * This is *rather* safe as the notification on mount being dropped
+ * will get called prior to releasing the super block - so right
+ * in time to perform appropriate clean-up
+ */
+ r_mnt = real_mount(path->mnt);
+ en->mnt_path.dentry = r_mnt->mnt.mnt_root;
+ en->mnt_path.mnt = &r_mnt->mnt;
+
+ sb = path->mnt->mnt_sb;
+ en->sb = sb;
+
+ nmask &= sb->s_etrace.events_cap_mask;
+ if (!nmask)
+ goto leave;
+
+ spin_lock_init(&en->lock);
+ INIT_LIST_HEAD(&en->node);
+
+ en->notify = nmask;
+ memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state));
+ if (nmask & FS_EVENT_THRESH)
+ fs_trace_query_data(sb, en);
+
+ atomic_set(&en->count, 1);
+
+ if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) {
+ struct fs_trace_entry *prev_en;
+
+ prev_en = fetch_trace_entry(sb);
+ if (prev_en) {
+ fs_trace_entry_put(prev_en);
+ goto leave;
+ }
+ }
+ atomic_set(&en->active, 1);
+
+ spin_lock(&fs_trace_lock);
+ list_add(&en->node, &fs_trace_list);
+ spin_unlock(&fs_trace_lock);
+
+ rcu_assign_pointer(sb->s_etrace.e_priv, en);
+ synchronize_rcu();
+
+ return 0;
+leave:
+ kmem_cache_free(fs_trace_cachep, en);
+ return -EINVAL;
+}
+
+static int fs_update_trace_entry_locked(struct fs_trace_entry *en,
+ struct fs_event_thresh *thresh,
+ unsigned int nmask)
+{
+ int extend = nmask & FS_TRACE_ADD;
+ struct super_block *sb = en->sb;
+
+ nmask &= ~FS_TRACE_ADD;
+ if (!(nmask & sb->s_etrace.events_cap_mask))
+ return -EINVAL;
+
+ if (nmask & FS_EVENT_THRESH) {
+ if (extend) {
+ /* Get the current state */
+ if (!(en->notify & FS_EVENT_THRESH))
+ if (fs_trace_query_data(sb, en))
+ return -EINVAL;
+
+ if (thresh->state & THRESH_LR_ON) {
+ en->th.lrange = thresh->lrange;
+ en->th.state &= ~THRESH_LR_ON;
+ }
+
+ if (thresh->state & THRESH_UR_ON) {
+ en->th.urange = thresh->urange;
+ en->th.state &= ~THRESH_UR_ON;
+ }
+ } else {
+ memset(&en->th, 0, sizeof(en->th));
+ }
+ }
+
+ if (extend)
+ en->notify |= nmask;
+ else
+ en->notify &= ~nmask;
+ return 0;
+}
+
+static int fs_update_trace_entry(struct path *path,
+ struct fs_event_thresh *thresh,
+ unsigned int nmask)
+{
+ struct fs_trace_entry *en;
+ int ret;
+
+ en = fetch_trace_entry(path->mnt->mnt_sb);
+ if (!en)
+ return (nmask & FS_TRACE_ADD)
+ ? fs_new_trace_entry(path, thresh, nmask)
+ : -EINVAL;
+
+ spin_lock(&en->lock);
+ ret = fs_update_trace_entry_locked(en, thresh, nmask);
+ spin_unlock(&en->lock);
+
+ fs_trace_entry_put(en);
+ return ret;
+}
+
+static int fs_parse_trace_request(int argc, char **argv)
+{
+ struct fs_event_thresh thresh = {0};
+ struct path path;
+ substring_t args[MAX_OPT_ARGS];
+ unsigned int nmask = FS_TRACE_ADD;
+ int token;
+ char *s;
+ int ret = -EINVAL;
+
+ if (!argc) {
+ fs_remove_all_traces();
+ return 0;
+ }
+
+ s = *(argv);
+ if (*s == '!') {
+ /* Clear the trace entry */
+ nmask &= ~FS_TRACE_ADD;
+ ++s;
+ }
+
+ if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW))
+ return -EINVAL;
+
+ if (!(--argc)) {
+ if (!(nmask & FS_TRACE_ADD))
+ ret = fs_remove_trace_entry(path.mnt->mnt_sb);
+ goto leave;
+ }
+
+repeat:
+ args[0].to = args[0].from = NULL;
+ token = match_token(*(++argv), fs_etypes, args);
+ if (!token && !nmask)
+ goto leave;
+
+ nmask |= token & FS_EVENTS_ALL;
+ --argc;
+ if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) {
+ /*
+ * Get the threshold config data:
+ * lower range
+ * upper range
+ */
+ if (!argc)
+ goto leave;
+
+ ret = kstrtoull(*(++argv), 10, &thresh.lrange);
+ if (ret)
+ goto leave;
+ thresh.state |= THRESH_LR_ON;
+ if ((--argc)) {
+ ret = kstrtoull(*(++argv), 10, &thresh.urange);
+ if (ret)
+ goto leave;
+ thresh.state |= THRESH_UR_ON;
+ --argc;
+ }
+ /* The thresholds are based on number of available blocks */
+ if (thresh.lrange < thresh.urange) {
+ ret = -EINVAL;
+ goto leave;
+ }
+ }
+ if (argc)
+ goto repeat;
+
+ ret = fs_update_trace_entry(&path, &thresh, nmask);
+leave:
+ path_put(&path);
+ return ret;
+}
+
+#define DEFAULT_BUF_SIZE PAGE_SIZE
+
+static ssize_t fs_trace_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ char **argv;
+ char *kern_buf, *next, *cfg;
+ size_t size, dcount = 0;
+ int argc;
+
+ if (!count)
+ return 0;
+
+ kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL);
+ if (!kern_buf)
+ return -ENOMEM;
+
+ while (dcount < count) {
+
+ size = count - dcount;
+ if (size >= DEFAULT_BUF_SIZE)
+ size = DEFAULT_BUF_SIZE - 1;
+ if (copy_from_user(kern_buf, buffer + dcount, size)) {
+ dcount = -EFAULT;
+ goto leave;
+ }
+
+ kern_buf[size] = '\0';
+
+ next = cfg = kern_buf;
+
+ do {
+ next = strchr(cfg, ';');
+ if (next)
+ *next = '\0';
+
+ argv = argv_split(GFP_KERNEL, cfg, &argc);
+ if (!argv) {
+ dcount = -ENOMEM;
+ goto leave;
+ }
+
+ if (fs_parse_trace_request(argc, argv)) {
+ dcount = -EINVAL;
+ argv_free(argv);
+ goto leave;
+ }
+
+ argv_free(argv);
+ if (next)
+ cfg = ++next;
+
+ } while (next);
+ dcount += size;
+ }
+leave:
+ kfree(kern_buf);
+ return dcount;
+}
+
+static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos)
+{
+ spin_lock(&fs_trace_lock);
+ return seq_list_start(&fs_trace_list, *pos);
+}
+
+static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ return seq_list_next(v, &fs_trace_list, pos);
+}
+
+static void fs_trace_seq_stop(struct seq_file *m, void *v)
+{
+ spin_unlock(&fs_trace_lock);
+}
+
+static int fs_trace_seq_show(struct seq_file *m, void *v)
+{
+ struct fs_trace_entry *en;
+ struct super_block *sb;
+ struct mount *r_mnt;
+ const struct match_token *match;
+ unsigned int nmask;
+
+ en = list_entry(v, struct fs_trace_entry, node);
+ sb = en->sb;
+
+ seq_path(m, &en->mnt_path, "\t\n\\");
+ seq_putc(m, ' ');
+
+ seq_escape(m, sb->s_type->name, " \t\n\\");
+ if (sb->s_subtype && sb->s_subtype[0]) {
+ seq_putc(m, '.');
+ seq_escape(m, sb->s_subtype, " \t\n\\");
+ }
+
+ seq_putc(m, ' ');
+ if (sb->s_op->show_devname) {
+ sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root);
+ } else {
+ r_mnt = real_mount(en->mnt_path.mnt);
+ seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none",
+ " \t\n\\");
+ }
+ seq_puts(m, " (");
+
+ nmask = en->notify;
+ for (match = fs_etypes; match->pattern; ++match) {
+ if (match->token & nmask) {
+ seq_puts(m, match->pattern);
+ nmask &= ~match->token;
+ if (nmask)
+ seq_putc(m, ',');
+ }
+ }
+ seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);
+ seq_puts(m, ")\n");
+ return 0;
+}
+
+static const struct seq_operations fs_trace_seq_ops = {
+ .start = fs_trace_seq_start,
+ .next = fs_trace_seq_next,
+ .stop = fs_trace_seq_stop,
+ .show = fs_trace_seq_show,
+};
+
+static int fs_trace_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &fs_trace_seq_ops);
+}
+
+static const struct file_operations fs_trace_fops = {
+ .owner = THIS_MODULE,
+ .open = fs_trace_open,
+ .write = fs_trace_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int fs_trace_init(void)
+{
+ fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0);
+ if (!fs_trace_cachep)
+ return -ENOMEM;
+ init_waitqueue_head(&trace_wq);
+ return 0;
+}
+
+/* VFS support */
+static int fs_trace_fill_super(struct super_block *sb, void *data, int silent)
+{
+ int ret;
+ static struct tree_descr desc[] = {
+ [2] = {
+ .name = "config",
+ .ops = &fs_trace_fops,
+ .mode = S_IWUSR | S_IRUGO,
+ },
+ {""},
+ };
+
+ ret = simple_fill_super(sb, 0x7246332, desc);
+ return !ret ? fs_trace_init() : ret;
+}
+
+static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type,
+ int ntype, const char *dev_name, void *data)
+{
+ return mount_single(fs_type, ntype, data, fs_trace_fill_super);
+}
+
+static void fs_trace_kill_super(struct super_block *sb)
+{
+ /*
+ * The rcu_barrier here will/should make sure all call_rcu
+ * callbacks are completed - still there might be some active
+ * trace objects in use, so the callbacks will not be triggered
+ * for them at all (at this point), which makes calling the
+ * kmem_cache_destroy unsafe. So we wait until all traces
+ * are finally released.
+ */
+ fs_remove_all_traces();
+ rcu_barrier();
+
+ wait_event(trace_wq, !atomic_read(&stray_traces));
+
+ kmem_cache_destroy(fs_trace_cachep);
+ kill_litter_super(sb);
+}
+
+static struct kset *fs_trace_kset;
+
+static struct file_system_type fs_trace_fstype = {
+ .name = "fstrace",
+ .mount = fs_trace_do_mount,
+ .kill_sb = fs_trace_kill_super,
+};
+
+static void __init fs_trace_vfs_init(void)
+{
+ fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj);
+
+ if (!fs_trace_kset)
+ return;
+
+ if (!register_filesystem(&fs_trace_fstype)) {
+ if (!fs_event_netlink_register())
+ return;
+ unregister_filesystem(&fs_trace_fstype);
+ }
+ kset_unregister(fs_trace_kset);
+}
+
+static int __init fs_trace_events_init(void)
+{
+ fs_trace_vfs_init();
+ return 0;
+}
+module_init(fs_trace_events_init);
+
diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h
new file mode 100644
index 0000000..716a2dd
--- /dev/null
+++ b/fs/events/fs_event.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef __GENERIC_FS_EVENTS_H
+#define __GENERIC_FS_EVENTS_H
+
+#ifdef CONFIG_NET
+int fs_event_netlink_register(void);
+#else /* CONFIG_NET */
+static inline int fs_event_netlink_register(void) { return -ENOSYS; }
+#endif /* CONFIG_NET */
+
+#endif /* __GENERIC_FS_EVENTS_H */
diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c
new file mode 100644
index 0000000..b59c6dc
--- /dev/null
+++ b/fs/events/fs_event_netlink.c
@@ -0,0 +1,99 @@
+/*
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <net/netlink.h>
+#include <net/genetlink.h>
+#include "fs_event.h"
+
+static const struct genl_multicast_group fs_event_mcgroups[] = {
+ { .name = "event", },
+};
+
+static struct genl_family fs_event_family = {
+ .id = GENL_ID_FS_EVENT,
+ .hdrsize = 0,
+ .name = "FS_EVENT",
+ .version = 1,
+ .maxattr = FS_NL_A_MAX,
+ .mcgrps = fs_event_mcgroups,
+ .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups),
+};
+
+int fs_netlink_send_event(size_t size, unsigned int event_id,
+ int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata)
+{
+ static atomic_t seq;
+ struct sk_buff *skb;
+ void *msg_head;
+ int ret = -EMSGSIZE;
+
+ if (!size || !compose_msg)
+ return -EINVAL;
+
+ if (event_id != FS_EVENT_NONE)
+ size += nla_total_size(sizeof(u32));
+ size += nla_total_size(sizeof(u64));
+ skb = genlmsg_new(size, GFP_NOWAIT);
+
+ if (!skb) {
+ pr_err("Failed to allocate new FS generic netlink message\n");
+ return -ENOMEM;
+ }
+
+ msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq),
+ &fs_event_family, 0, FS_NL_C_EVENT);
+ if (!msg_head)
+ goto cleanup;
+
+ if (event_id != FS_EVENT_NONE)
+ if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id))
+ goto cancel;
+
+ ret = compose_msg(skb, cbdata);
+ if (ret)
+ goto cancel;
+
+ genlmsg_end(skb, msg_head);
+ /*
+ * genlmsg_multicast() consumes the skb, even on failure,
+ * so it must not be freed again here.
+ */
+ return genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT);
+cancel:
+ genlmsg_cancel(skb, msg_head);
+cleanup:
+ nlmsg_free(skb);
+ return ret;
+}
+EXPORT_SYMBOL(fs_netlink_send_event);
+
+int fs_event_netlink_register(void)
+{
+ int ret;
+
+ ret = genl_register_family(&fs_event_family);
+ if (ret)
+ pr_err("Failed to register FS netlink interface\n");
+ return ret;
+}
+
+void fs_event_netlink_unregister(void)
+{
+ genl_unregister_family(&fs_event_family);
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index 82ef140..ec6e2ef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt)
if (unlikely(mnt->mnt_pins.first))
mnt_pin_kill(mnt);
fsnotify_vfsmount_delete(&mnt->mnt);
+ fs_event_mount_dropped(&mnt->mnt);
dput(mnt->mnt.mnt_root);
deactivate_super(mnt->mnt.mnt_sb);
mnt_free_id(mnt);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..b7dadd9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -263,6 +263,10 @@ struct iattr {
* Includes for diskquotas.
*/
#include <linux/quota.h>
+/*
+ * Include for Generic File System Events Interface
+ */
+#include <linux/fs_event.h>
/*
* Maximum number of layers of fs stack. Needs to be limited to
@@ -1253,7 +1257,7 @@ struct super_block {
struct hlist_node s_instances;
unsigned int s_quota_types; /* Bitmask of supported quota types */
struct quota_info s_dquot; /* Diskquota specific options */
-
+ struct fs_trace_info s_etrace;
struct sb_writers s_writers;
char s_id[32]; /* Informational name */
diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h
new file mode 100644
index 0000000..aa182ec
--- /dev/null
+++ b/include/linux/fs_event.h
@@ -0,0 +1,58 @@
+/*
+ * Generic File System Events Interface
+ *
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#ifndef _LINUX_GENERIC_FS_EVENTS_
+#define _LINUX_GENERIC_FS_EVENTS_
+#include <net/netlink.h>
+#include <uapi/linux/fs_event.h>
+
+/*
+ * Currently supported event types
+ */
+#define FS_EVENT_GENERIC 0x001
+#define FS_EVENT_THRESH 0x002
+
+#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH)
+
+struct fs_trace_operations {
+ void (*query)(struct super_block *, u64 *);
+};
+
+struct fs_trace_info {
+ void __rcu *e_priv; /* READ ONLY */
+ unsigned int events_cap_mask; /* Supported notifications */
+ const struct fs_trace_operations *ops;
+};
+
+void fs_event_notify(struct super_block *sb, unsigned int event_id);
+void fs_event_alloc_space(struct super_block *sb, u64 ncount);
+void fs_event_free_space(struct super_block *sb, u64 ncount);
+void fs_event_mount_dropped(struct vfsmount *mnt);
+
+#ifdef CONFIG_NET
+int fs_netlink_send_event(size_t size, unsigned int event_id,
+ int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
+#else /* CONFIG_NET */
+static inline
+int fs_netlink_send_event(size_t size, unsigned int event_id,
+ int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_NET */
+
+#endif /* _LINUX_GENERIC_FS_EVENTS_ */
+
diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h
new file mode 100644
index 0000000..081c39b
--- /dev/null
+++ b/include/uapi/linux/fs_event.h
@@ -0,0 +1,54 @@
+/*
+ * Generic netlink support for Generic File System Events Interface
+ *
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_
+#define _UAPI_LINUX_GENERIC_FS_EVENTS_
+/*
+ * Generic netlink attribute types
+ */
+enum {
+ FS_NL_A_NONE,
+ FS_NL_A_EVENT_ID,
+ FS_NL_A_DEV_MAJOR,
+ FS_NL_A_DEV_MINOR,
+ FS_NL_A_CAUSED_ID,
+ FS_NL_A_DATA,
+ __FS_NL_A_MAX,
+};
+#define FS_NL_A_MAX (__FS_NL_A_MAX - 1)
+/*
+ * Generic netlink commands
+ */
+#define FS_NL_C_EVENT 1
+
+/*
+ * Supported set of FS events
+ */
+enum {
+ FS_EVENT_NONE,
+ FS_WARN_ENOSPC, /* No space left to reserve data blks */
+ FS_WARN_ENOSPC_META, /* No space left for metadata */
+ FS_THR_LRBELOW, /* The threshold lower range has been reached */
+ FS_THR_LRABOVE, /* The threshold lower range re-activated */
+ FS_THR_URBELOW, /* The threshold upper range has been reached */
+ FS_THR_URABOVE, /* The threshold upper range re-activated */
+ FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */
+ FS_ERR_CORRUPTED /* Critical error - fs corrupted */
+
+};
+
+#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */
+
diff --git a/include/uapi/linux/genetlink.h b/include/uapi/linux/genetlink.h
index c3363ba..6464129 100644
--- a/include/uapi/linux/genetlink.h
+++ b/include/uapi/linux/genetlink.h
@@ -29,6 +29,7 @@ struct genlmsghdr {
#define GENL_ID_CTRL NLMSG_MIN_TYPE
#define GENL_ID_VFS_DQUOT (NLMSG_MIN_TYPE + 1)
#define GENL_ID_PMCRAID (NLMSG_MIN_TYPE + 2)
+#define GENL_ID_FS_EVENT (NLMSG_MIN_TYPE + 3)
/**************************************************************************
* Controller
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 2ed5f96..e8e0bd68 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -82,7 +82,8 @@ static struct list_head family_ht[GENL_FAM_TAB_SIZE];
*/
static unsigned long mc_group_start = 0x3 | BIT(GENL_ID_CTRL) |
BIT(GENL_ID_VFS_DQUOT) |
- BIT(GENL_ID_PMCRAID);
+ BIT(GENL_ID_PMCRAID) |
+ BIT(GENL_ID_FS_EVENT);
static unsigned long *mc_groups = &mc_group_start;
static unsigned long mc_groups_longs = 1;
@@ -146,6 +147,7 @@ static u16 genl_generate_id(void)
for (i = 0; i <= GENL_MAX_ID - GENL_MIN_ID; i++) {
if (id_gen_idx != GENL_ID_VFS_DQUOT &&
id_gen_idx != GENL_ID_PMCRAID &&
+ id_gen_idx != GENL_ID_FS_EVENT &&
!genl_family_find_byid(id_gen_idx))
return id_gen_idx;
if (++id_gen_idx > GENL_MAX_ID)
@@ -249,6 +251,9 @@ static int genl_validate_assign_mc_groups(struct genl_family *family)
} else if (family->id == GENL_ID_PMCRAID) {
first_id = GENL_ID_PMCRAID;
BUG_ON(n_groups != 1);
+ } else if (family->id == GENL_ID_FS_EVENT) {
+ first_id = GENL_ID_FS_EVENT;
+ BUG_ON(n_groups != 1);
} else {
groups_allocated = true;
err = genl_allocate_reserve_groups(n_groups, &first_id);
--
1.7.9.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Add support for generic FS events including threshold
notifications, ENOSPC and remount as read-only warnings,
along with generic internal warnings/errors.
Signed-off-by: Beata Michalska <[email protected]>
---
fs/ext4/balloc.c | 10 ++++++++--
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 2 +-
fs/ext4/mballoc.c | 6 +++++-
fs/ext4/resize.c | 1 +
fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++
6 files changed, 55 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e95b27a..a48450f 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
{
if (ext4_has_free_clusters(sbi, nclusters, flags)) {
percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters);
+ fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters));
return 0;
} else
return -ENOSPC;
@@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
{
if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) ||
(*retries)++ > 3 ||
- !EXT4_SB(sb)->s_journal)
+ !EXT4_SB(sb)->s_journal) {
+ fs_event_notify(sb, FS_WARN_ENOSPC);
return 0;
-
+ }
jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
@@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
dquot_alloc_block_nofail(inode,
EXT4_C2B(EXT4_SB(inode->i_sb), ar.len));
}
+
+ if (*errp == -ENOSPC)
+ fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META);
+
return ret;
}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 163afe2..7d75ff9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi,
if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
+ fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free));
}
/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..2a7af0f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free);
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
-
+ fs_event_free_space(sbi->s_sb, EXT4_C2B(sbi, to_free));
dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 24a4b6d..c2df6f0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4511,6 +4511,9 @@ out:
kmem_cache_free(ext4_ac_cachep, ac);
if (inquota && ar->len < inquota)
dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len));
+ if (reserv_clstrs && ar->len < reserv_clstrs)
+ fs_event_free_space(sbi->s_sb,
+ EXT4_C2B(sbi, reserv_clstrs - ar->len));
if (!ar->len) {
if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0)
/* release all the reserved blocks if non delalloc */
@@ -4848,7 +4851,7 @@ do_more:
if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE))
dquot_free_block(inode, EXT4_C2B(sbi, count_clusters));
percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);
-
+ fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters));
ext4_mb_unload_buddy(&e4b);
/* We dirtied the bitmap block */
@@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeclusters_counter,
EXT4_NUM_B2C(sbi, blocks_freed));
+ fs_event_free_space(sb, blocks_freed);
if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 8a8ec62..dbf08d6 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb,
EXT4_NUM_B2C(sbi, free_blocks));
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT4_INODES_PER_GROUP(sb) * flex_gd->count);
+ fs_event_free_space(sb, free_blocks - reserved_blocks);
ext4_debug("free blocks count %llu",
percpu_counter_read(&sbi->s_freeclusters_counter));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e061e66..108b667 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function,
if (EXT4_SB(sb)->s_journal)
jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO);
save_error_info(sb, function, line);
+ fs_event_notify(sb, FS_ERR_REMOUNT_RO);
+
}
if (test_opt(sb, ERRORS_PANIC))
panic("EXT4-fs panic from previous error\n");
@@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = {
};
#endif
+static void ext4_trace_query(struct super_block *sb, u64 *ncount);
+
+static const struct fs_trace_operations ext4_trace_ops = {
+ .query = ext4_trace_query,
+};
+
static const struct super_operations ext4_sops = {
.alloc_inode = ext4_alloc_inode,
.destroy_inode = ext4_destroy_inode,
@@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count)
{
ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >>
sbi->s_cluster_bits;
+ ext4_fsblk_t current_resv;
if (count >= clusters)
return -EINVAL;
+ current_resv = atomic64_read(&sbi->s_resv_clusters);
atomic64_set(&sbi->s_resv_clusters, count);
+
+ if (count > current_resv)
+ fs_event_alloc_space(sbi->s_sb,
+ EXT4_C2B(sbi, count - current_resv));
+ else
+ fs_event_free_space(sbi->s_sb,
+ EXT4_C2B(sbi, current_resv - count));
return 0;
}
@@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_qcop = &ext4_qctl_operations;
sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP;
#endif
+ sb->s_etrace.ops = &ext4_trace_ops;
+ sb->s_etrace.events_cap_mask = FS_EVENTS_ALL;
+
memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
@@ -5438,6 +5458,25 @@ out:
#endif
+static void ext4_trace_query(struct super_block *sb, u64 *ncount)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_super_block *es = sbi->s_es;
+ ext4_fsblk_t rsv_blocks;
+ ext4_fsblk_t nblocks;
+
+ nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
+ percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
+ nblocks = EXT4_C2B(sbi, nblocks);
+ rsv_blocks = ext4_r_blocks_count(es) +
+ EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
+ if (nblocks < rsv_blocks)
+ nblocks = 0;
+ else
+ nblocks -= rsv_blocks;
+ *ncount = nblocks;
+}
+
static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
const char *dev_name, void *data)
{
--
1.7.9.5
Add support for the generic FS events interface
covering threshold notifications and the ENOSPC
warning.
Signed-off-by: Beata Michalska <[email protected]>
---
mm/shmem.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index cf2d0ca..a044d12 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -201,6 +201,7 @@ static int shmem_reserve_inode(struct super_block *sb)
spin_lock(&sbinfo->stat_lock);
if (!sbinfo->free_inodes) {
spin_unlock(&sbinfo->stat_lock);
+ fs_event_notify(sb, FS_WARN_ENOSPC);
return -ENOSPC;
}
sbinfo->free_inodes--;
@@ -239,8 +240,10 @@ static void shmem_recalc_inode(struct inode *inode)
freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
if (freed > 0) {
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
- if (sbinfo->max_blocks)
+ if (sbinfo->max_blocks) {
percpu_counter_add(&sbinfo->used_blocks, -freed);
+ fs_event_free_space(inode->i_sb, freed);
+ }
info->alloced -= freed;
inode->i_blocks -= freed * BLOCKS_PER_PAGE;
shmem_unacct_blocks(info->flags, freed);
@@ -1164,6 +1167,7 @@ repeat:
goto unacct;
}
percpu_counter_inc(&sbinfo->used_blocks);
+ fs_event_alloc_space(inode->i_sb, 1);
}
page = shmem_alloc_page(gfp, info, index);
@@ -1245,8 +1249,10 @@ trunc:
spin_unlock(&info->lock);
decused:
sbinfo = SHMEM_SB(inode->i_sb);
- if (sbinfo->max_blocks)
+ if (sbinfo->max_blocks) {
percpu_counter_add(&sbinfo->used_blocks, -1);
+ fs_event_free_space(inode->i_sb, 1);
+ }
unacct:
shmem_unacct_blocks(info->flags, 1);
failed:
@@ -1258,12 +1264,16 @@ unlock:
unlock_page(page);
page_cache_release(page);
}
- if (error == -ENOSPC && !once++) {
+ if (error == -ENOSPC) {
+ if (!once++) {
info = SHMEM_I(inode);
spin_lock(&info->lock);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
goto repeat;
+ } else {
+ fs_event_notify(inode->i_sb, FS_WARN_ENOSPC);
+ }
}
if (error == -EEXIST) /* from above or from radix_tree_insert */
goto repeat;
@@ -2729,12 +2739,26 @@ static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len,
return 1;
}
+static void shmem_trace_query(struct super_block *sb, u64 *ncount)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
+ if (sbinfo->max_blocks)
+ *ncount = sbinfo->max_blocks -
+ percpu_counter_sum(&sbinfo->used_blocks);
+
+}
+
static const struct export_operations shmem_export_ops = {
.get_parent = shmem_get_parent,
.encode_fh = shmem_encode_fh,
.fh_to_dentry = shmem_fh_to_dentry,
};
+static const struct fs_trace_operations shmem_trace_ops = {
+ .query = shmem_trace_query,
+};
+
static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
bool remount)
{
@@ -3020,6 +3044,9 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sb->s_flags |= MS_NOUSER;
}
sb->s_export_op = &shmem_export_ops;
+ sb->s_etrace.ops = &shmem_trace_ops;
+ sb->s_etrace.events_cap_mask = FS_EVENTS_ALL;
+
sb->s_flags |= MS_NOSEC;
#else
sb->s_flags |= MS_NOUSER;
--
1.7.9.5
On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.
>
> Signed-off-by: Beata Michalska <[email protected]>
> ---
> Documentation/filesystems/events.txt | 231 ++++++++++
> fs/Makefile | 1 +
> fs/events/Makefile | 6 +
> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
> fs/events/fs_event.h | 25 ++
> fs/events/fs_event_netlink.c | 99 +++++
> fs/namespace.c | 1 +
> include/linux/fs.h | 6 +-
> include/linux/fs_event.h | 58 +++
> include/uapi/linux/fs_event.h | 54 +++
> include/uapi/linux/genetlink.h | 1 +
> net/netlink/genetlink.c | 7 +-
> 12 files changed, 1257 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/filesystems/events.txt
> create mode 100644 fs/events/Makefile
> create mode 100644 fs/events/fs_event.c
> create mode 100644 fs/events/fs_event.h
> create mode 100644 fs/events/fs_event_netlink.c
> create mode 100644 include/linux/fs_event.h
> create mode 100644 include/uapi/linux/fs_event.h
Any reason why you just don't do uevents for the block devices today,
and not create a new type of netlink message and userspace tool required
to read these?
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
> obj-$(CONFIG_CEPH_FS) += ceph/
> obj-$(CONFIG_PSTORE) += pstore/
> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> +obj-y += events/
Always?
> diff --git a/fs/events/Makefile b/fs/events/Makefile
> new file mode 100644
> index 0000000..58d1454
> --- /dev/null
> +++ b/fs/events/Makefile
> @@ -0,0 +1,6 @@
> +#
> +# Makefile for the Linux Generic File System Event Interface
> +#
> +
> +obj-y := fs_event.o
Always? Even if the option is not selected? Why is everyone forced to
always use this code? Can't you disable it for the "tiny" systems that
don't need it?
> +struct fs_trace_entry {
> + atomic_t count;
Why not just use a 'struct kref' for your count, which will save a bunch
of open-coding of reference counting, and forcing us to audit your code
to verify you got all the corner cases correct? :)
> + atomic_t active;
> + struct super_block *sb;
Are you properly reference counting this pointer? I didn't see where
that was happening, so I must have missed it.
thanks,
greg k-h
On 04/27/2015 04:24 PM, Greg KH wrote:
> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>> Introduce configurable generic interface for file
>> system-wide event notifications, to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
>>
>> The notifications are to be issued through generic
>> netlink interface by newly introduced multicast group.
>>
>> Threshold notifications have been included, allowing
>> triggering an event whenever the amount of free space drops
>> below a certain level - or levels to be more precise as two
>> of them are being supported: the lower and the upper range.
>> The notifications work both ways: once the threshold level
>> has been reached, an event shall be generated whenever
>> the number of available blocks goes up again re-activating
>> the threshold.
>>
>> The interface has been exposed through a vfs. Once mounted,
>> it serves as an entry point for the set-up where one can
>> register for particular file system events.
>>
>> Signed-off-by: Beata Michalska <[email protected]>
>> ---
>> Documentation/filesystems/events.txt | 231 ++++++++++
>> fs/Makefile | 1 +
>> fs/events/Makefile | 6 +
>> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
>> fs/events/fs_event.h | 25 ++
>> fs/events/fs_event_netlink.c | 99 +++++
>> fs/namespace.c | 1 +
>> include/linux/fs.h | 6 +-
>> include/linux/fs_event.h | 58 +++
>> include/uapi/linux/fs_event.h | 54 +++
>> include/uapi/linux/genetlink.h | 1 +
>> net/netlink/genetlink.c | 7 +-
>> 12 files changed, 1257 insertions(+), 2 deletions(-)
>> create mode 100644 Documentation/filesystems/events.txt
>> create mode 100644 fs/events/Makefile
>> create mode 100644 fs/events/fs_event.c
>> create mode 100644 fs/events/fs_event.h
>> create mode 100644 fs/events/fs_event_netlink.c
>> create mode 100644 include/linux/fs_event.h
>> create mode 100644 include/uapi/linux/fs_event.h
>
> Any reason why you just don't do uevents for the block devices today,
> and not create a new type of netlink message and userspace tool required
> to read these?
The idea here is to support filesystems with no backing device as well.
Parsing the message with libnl is really simple and requires only a few lines
of code (a sample application was presented in the initial version of this RFC).
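For readers without that sample application at hand, the attribute payload of such a message can also be walked by hand. The following is only a minimal userspace sketch, not the sample from the RFC: it assumes the attribute ids from include/uapi/linux/fs_event.h in this series, and it assumes that the event id and the device major/minor numbers are each carried as a host-endian u32 (the encoding of the device numbers is an assumption here). The names attr_hdr, fs_evt and fs_evt_parse are made up for illustration; attr_hdr mirrors the layout of struct nlattr, and real code would more likely rely on libnl.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Attribute ids, in the order of the enum in include/uapi/linux/fs_event.h */
enum { FS_NL_A_NONE, FS_NL_A_EVENT_ID, FS_NL_A_DEV_MAJOR, FS_NL_A_DEV_MINOR,
       FS_NL_A_CAUSED_ID, FS_NL_A_DATA };

/* Same layout as struct nlattr: len covers the header plus the payload */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

/* Netlink attributes are padded to a 4-byte boundary */
#define ATTR_ALIGN(n)	(((n) + 3u) & ~3u)

/* Hypothetical container for the fields this sketch extracts */
struct fs_evt {
	uint32_t event_id;
	uint32_t major;
	uint32_t minor;
};

/*
 * Walk the attribute stream that follows the genetlink header.
 * Returns 0 on success, -1 on a malformed attribute.
 * Assumption: every attribute of interest carries a host-endian u32.
 */
int fs_evt_parse(const unsigned char *buf, size_t len, struct fs_evt *out)
{
	memset(out, 0, sizeof(*out));
	while (len >= sizeof(struct attr_hdr)) {
		struct attr_hdr h;
		uint32_t val = 0;

		memcpy(&h, buf, sizeof(h));
		if (h.len < sizeof(h) || h.len > len)
			return -1;
		if (h.len - sizeof(h) >= sizeof(val))
			memcpy(&val, buf + sizeof(h), sizeof(val));

		switch (h.type) {
		case FS_NL_A_EVENT_ID:
			out->event_id = val;
			break;
		case FS_NL_A_DEV_MAJOR:
			out->major = val;
			break;
		case FS_NL_A_DEV_MINOR:
			out->minor = val;
			break;
		default:
			break;	/* skip attributes this sketch does not handle */
		}

		if (ATTR_ALIGN(h.len) >= len)
			break;	/* that was the last attribute */
		buf += ATTR_ALIGN(h.len);
		len -= ATTR_ALIGN(h.len);
	}
	return 0;
}
```

A listener would call fs_evt_parse() on the bytes following the genlmsghdr of each message received on the "event" multicast group and then dispatch on event_id.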
>
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
>> obj-$(CONFIG_CEPH_FS) += ceph/
>> obj-$(CONFIG_PSTORE) += pstore/
>> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
>> +obj-y += events/
>
> Always?
>
>> diff --git a/fs/events/Makefile b/fs/events/Makefile
>> new file mode 100644
>> index 0000000..58d1454
>> --- /dev/null
>> +++ b/fs/events/Makefile
>> @@ -0,0 +1,6 @@
>> +#
>> +# Makefile for the Linux Generic File System Event Interface
>> +#
>> +
>> +obj-y := fs_event.o
>
> Always? Even if the option is not selected? Why is everyone forced to
> always use this code? Can't you disable it for the "tiny" systems that
> don't need it?
>
I was considering making it optional and I guess it's worth getting back
to this idea.
>> +struct fs_trace_entry {
>> + atomic_t count;
>
> Why not just use a 'struct kref' for your count, which will save a bunch
> of open-coding of reference counting, and forcing us to audit your code
> to verify you got all the corner cases correct? :)
>
>> + atomic_t active;
>> + struct super_block *sb;
Not sure if using kref would change much here as the kref would not really
make it easier to verify those corner cases, unfortunately.
>
> Are you properly reference counting this pointer? I didn't see where
> that was happening, so I must have missed it.
>
> thanks,
>
You haven't. And if I haven't missed anything, the sb is being used only
as long as the super is alive. Most of the code operates on the sb only if
it was explicitly asked to, through a call from the filesystem. There is also
a callback notifying of the mount being dropped (which precedes the call to
kill_super) that invalidates the object that depends on it.
Still, it should be made explicit that the sb is in use, by
bumping up the s_count counter, though that would require taking the
sb_lock. AFAIK, one can get a reference to a super block, but only for a
particular device. Maybe it would be worth making that more generic (?).
> greg k-h
>
BR
Beata
On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
> On 04/27/2015 04:24 PM, Greg KH wrote:
> > On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
> >> Introduce configurable generic interface for file
> >> system-wide event notifications, to provide file
> >> systems with a common way of reporting any potential
> >> issues as they emerge.
> >>
> >> The notifications are to be issued through generic
> >> netlink interface by newly introduced multicast group.
> >>
> >> Threshold notifications have been included, allowing
> >> triggering an event whenever the amount of free space drops
> >> below a certain level - or levels to be more precise as two
> >> of them are being supported: the lower and the upper range.
> >> The notifications work both ways: once the threshold level
> >> has been reached, an event shall be generated whenever
> >> the number of available blocks goes up again re-activating
> >> the threshold.
> >>
> >> The interface has been exposed through a vfs. Once mounted,
> >> it serves as an entry point for the set-up where one can
> >> register for particular file system events.
> >>
> >> Signed-off-by: Beata Michalska <[email protected]>
> >> ---
> >> Documentation/filesystems/events.txt | 231 ++++++++++
> >> fs/Makefile | 1 +
> >> fs/events/Makefile | 6 +
> >> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
> >> fs/events/fs_event.h | 25 ++
> >> fs/events/fs_event_netlink.c | 99 +++++
> >> fs/namespace.c | 1 +
> >> include/linux/fs.h | 6 +-
> >> include/linux/fs_event.h | 58 +++
> >> include/uapi/linux/fs_event.h | 54 +++
> >> include/uapi/linux/genetlink.h | 1 +
> >> net/netlink/genetlink.c | 7 +-
> >> 12 files changed, 1257 insertions(+), 2 deletions(-)
> >> create mode 100644 Documentation/filesystems/events.txt
> >> create mode 100644 fs/events/Makefile
> >> create mode 100644 fs/events/fs_event.c
> >> create mode 100644 fs/events/fs_event.h
> >> create mode 100644 fs/events/fs_event_netlink.c
> >> create mode 100644 include/linux/fs_event.h
> >> create mode 100644 include/uapi/linux/fs_event.h
> >
> > Any reason why you just don't do uevents for the block devices today,
> > and not create a new type of netlink message and userspace tool required
> > to read these?
>
> The idea here is to support filesystems with no backing device as well.
> Parsing the message with libnl is really simple and requires only a few lines
> of code (a sample application was presented in the initial version of this RFC).
I'm not saying it's not "simple" to parse, just that now you are doing
something that requires a different tool. If you have a block device,
you should be able to emit uevents for it, you don't need a backing
device, we handle virtual filesystems in /sys/block/ just fine :)
People already have tools that listen to libudev for system monitoring
and management, why require them to hook up to yet-another-library? And
what is going to provide the ability for multiple userspace tools to
listen to these netlink messages in case you have more than one program
that wants to watch for these things (i.e. multiple desktop filesystem
monitoring tools, system-health checkers, etc.)?
> >> --- a/fs/Makefile
> >> +++ b/fs/Makefile
> >> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
> >> obj-$(CONFIG_CEPH_FS) += ceph/
> >> obj-$(CONFIG_PSTORE) += pstore/
> >> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> >> +obj-y += events/
> >
> > Always?
> >
> >> diff --git a/fs/events/Makefile b/fs/events/Makefile
> >> new file mode 100644
> >> index 0000000..58d1454
> >> --- /dev/null
> >> +++ b/fs/events/Makefile
> >> @@ -0,0 +1,6 @@
> >> +#
> >> +# Makefile for the Linux Generic File System Event Interface
> >> +#
> >> +
> >> +obj-y := fs_event.o
> >
> > Always? Even if the option is not selected? Why is everyone forced to
> > always use this code? Can't you disable it for the "tiny" systems that
> > don't need it?
> >
>
> I was considering making it optional and I guess it's worth getting back
> to this idea.
The "linux-tiny" people will appreciate that :)
> >> +struct fs_trace_entry {
> >> + atomic_t count;
> >
> > Why not just use a 'struct kref' for your count, which will save a bunch
> > of open-coding of reference counting, and forcing us to audit your code
> > to verify you got all the corner cases correct? :)
> >
> >> + atomic_t active;
> >> + struct super_block *sb;
>
> Not sure if using kref would change much here as the kref would not really
> make it easier to verify those corner cases, unfortunately.
Why not, that's the goal of a kref. Yes, you already did the hard work,
but now you require everyone else to also do the hard work of trying to
audit your code. That's why we have common functions/data structures in
the kernel, to make long-term maintenance easier.
Please switch to make it so that we "know" you are doing this correctly.
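
[Editor's sketch] The point about common reference-counting helpers can be illustrated with a small userspace model. It is not the patch's code: `struct ref`, `ref_get()`, `ref_put()` and `struct trace_entry` are hypothetical stand-ins that only mimic the shape of the kernel's `struct kref`, `kref_get()` and `kref_put()`, assuming C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct kref (illustration only). */
struct ref {
	atomic_int refcount;
};

static void ref_init(struct ref *r)
{
	atomic_init(&r->refcount, 1);
}

/* Taking a reference is only legal while the caller already holds one. */
static void ref_get(struct ref *r)
{
	int old = atomic_fetch_add(&r->refcount, 1);

	assert(old > 0);
	(void)old;
}

/* Drop one reference; call release() and return 1 if it was the last. */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub(&r->refcount, 1) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

/* Hypothetical, trimmed-down entry with an embedded counter. */
struct trace_entry {
	struct ref ref;
	int active;
};

static int released;	/* counts release calls, for demonstration only */

static void trace_entry_release(struct ref *r)
{
	/* Recover the containing object, as container_of() does in the kernel. */
	struct trace_entry *en =
		(struct trace_entry *)((char *)r - offsetof(struct trace_entry, ref));

	free(en);
	released++;
}

static struct trace_entry *trace_entry_alloc(void)
{
	struct trace_entry *en = calloc(1, sizeof(*en));

	if (en)
		ref_init(&en->ref);
	return en;
}
```

With this shape, the only places a reviewer has to check are the get/put call sites; the counting and the final free live in one well-known helper.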
> > Are you properly reference counting this pointer? I didn't see where
> > that was happening, so I must have missed it.
> >
> > thanks,
> >
>
> You haven't. And if I haven't missed anything, the sb is being used only
> as long as the super is alive.
How do you know that? :)
> Most of the code operates on sb only if it
> was explicitly asked to, through call from filesystem. There is also
> a callback notifying of mount being dropped (which precedes the call to
> kill_super) that invalidates the object that depends on it.
> Still, it should be explicitly stated that the sb is being used through
> bumping up the s_count counter, though that would require taking the
> sb_lock. AFAIK, one can get the reference to super block but for a particular
> device. Maybe it would be worth having it more generic (?).
Why not just grab a reference to the sb when you save the pointer, and
release it when you are done with it? That should handle the lifecycle
properly. It's always a very bad idea to have a pointer to a reference
counted object without actually grabbing the reference, as you have no
idea what is happening with it behind your back.
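
[Editor's sketch] The "grab a reference when you save the pointer" rule can be modelled in userspace as follows. All names here (`struct object`, `object_tryget()`, `struct holder`) are hypothetical; the real super block has its own counters and locking rules, which this sketch does not reproduce. It only shows the pattern: the stored pointer is valid exactly as long as the holder owns a reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object such as a super block. */
struct object {
	atomic_int refs;	/* 0 means teardown has started */
};

/* Take a reference only if the object is still live (count > 0). */
static bool object_tryget(struct object *obj)
{
	int old = atomic_load(&obj->refs);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&obj->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool object_put(struct object *obj)
{
	int old = atomic_fetch_sub(&obj->refs, 1);

	assert(old > 0);
	return old == 1;
}

/* Hypothetical holder that stores a pointer to the object. */
struct holder {
	struct object *obj;	/* non-NULL only while a reference is held */
};

/* Save the pointer only after a reference has been obtained. */
static bool holder_attach(struct holder *h, struct object *obj)
{
	if (!object_tryget(obj))
		return false;
	h->obj = obj;
	return true;
}

/* Forget the pointer, then drop the reference that protected it. */
static bool holder_detach(struct holder *h)
{
	struct object *obj = h->obj;

	if (!obj)
		return false;
	h->obj = NULL;
	return object_put(obj);
}
```

The holder never has to reason about what the owner does "behind its back": either the attach fails because teardown has begun, or the object stays alive until the holder detaches.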
thanks,
greg k-h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Hi,
On 04/27/2015 05:37 PM, Greg KH wrote:
> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
>> On 04/27/2015 04:24 PM, Greg KH wrote:
>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>>>> Introduce configurable generic interface for file system-wide
>>>> event notifications, to provide file systems with a common way
>>>> of reporting any potential issues as they emerge.
>>>>
>>>> The notifications are to be issued through generic netlink
>>>> interface by newly introduced multicast group.
>>>>
>>>> Threshold notifications have been included, allowing triggering
>>>> an event whenever the amount of free space drops below a
>>>> certain level - or levels to be more precise as two of them are
>>>> being supported: the lower and the upper range. The
>>>> notifications work both ways: once the threshold level has been
>>>> reached, an event shall be generated whenever the number of
>>>> available blocks goes up again re-activating the threshold.
>>>>
>>>> The interface has been exposed through a vfs. Once mounted, it
>>>> serves as an entry point for the set-up where one can register
>>>> for particular file system events.
>>>>
>>>> Signed-off-by: Beata Michalska <[email protected]>
>>>> ---
>>>> Documentation/filesystems/events.txt | 231 ++++++++++
>>>> fs/Makefile | 1 +
>>>> fs/events/Makefile | 6 +
>>>> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
>>>> fs/events/fs_event.h | 25 ++
>>>> fs/events/fs_event_netlink.c | 99 +++++
>>>> fs/namespace.c | 1 +
>>>> include/linux/fs.h | 6 +-
>>>> include/linux/fs_event.h | 58 +++
>>>> include/uapi/linux/fs_event.h | 54 +++
>>>> include/uapi/linux/genetlink.h | 1 +
>>>> net/netlink/genetlink.c | 7 +-
>>>> 12 files changed, 1257 insertions(+), 2 deletions(-)
>>>> create mode 100644 Documentation/filesystems/events.txt
>>>> create mode 100644 fs/events/Makefile
>>>> create mode 100644 fs/events/fs_event.c
>>>> create mode 100644 fs/events/fs_event.h
>>>> create mode 100644 fs/events/fs_event_netlink.c
>>>> create mode 100644 include/linux/fs_event.h
>>>> create mode 100644 include/uapi/linux/fs_event.h
>>>
>>> Any reason why you just don't do uevents for the block devices
>>> today, and not create a new type of netlink message and userspace
>>> tool required to read these?
>>
>> The idea here is to have support for filesystems with no backing
>> device as well. Parsing the message with libnl is really simple and
>> requires few lines of code (sample application has been presented
>> in the initial version of this RFC)
>
> I'm not saying it's not "simple" to parse, just that now you are
> doing something that requires a different tool. If you have a block
> device, you should be able to emit uevents for it, you don't need a
> backing device, we handle virtual filesystems in /sys/block/ just
> fine :)
>
The generic netlink interface is already being used by quota. As this is to
support file system events, including the threshold notifications, it just
seemed like a nice extension. I'm not really convinced that the concept here
goes well with uevents and their current usage. On the other hand, GFS2
already benefits from them. Still, generic netlink seems somehow ... lighter.
Anyway, I'm open to any suggestions :)
> People already have tools that listen to libudev for system
> monitoring and management, why require them to hook up to
> yet-another-library? And what is going to provide the ability for
> multiple userspace tools to listen to these netlink messages in case
> you have more than one program that wants to watch for these things
> (i.e. multiple desktop filesystem monitoring tools, system-health
> checkers, etc.)?
>
I might be missing something here, but any application might subscribe to
the multicast group so I'm not sure I understand your concerns here (?)
>>>> --- a/fs/Makefile
>>>> +++ b/fs/Makefile
>>>> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
>>>> obj-$(CONFIG_CEPH_FS) += ceph/
>>>> obj-$(CONFIG_PSTORE) += pstore/
>>>> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
>>>> +obj-y += events/
>>>
>>> Always?
>>>
>>>> diff --git a/fs/events/Makefile b/fs/events/Makefile
>>>> new file mode 100644
>>>> index 0000000..58d1454
>>>> --- /dev/null
>>>> +++ b/fs/events/Makefile
>>>> @@ -0,0 +1,6 @@
>>>> +#
>>>> +# Makefile for the Linux Generic File System Event Interface
>>>> +#
>>>> +
>>>> +obj-y := fs_event.o
>>>
>>> Always? Even if the option is not selected? Why is everyone
>>> forced to always use this code? Can't you disable it for the
>>> "tiny" systems that don't need it?
>>>
>>
>> I was considering making it optional and I guess it's worth getting
>> back to this idea.
>
> The "linux-tiny" people will appreciate that :)
>
Consider it done for the next round.
>>>> +struct fs_trace_entry {
>>>> + atomic_t count;
>>>
>>> Why not just use a 'struct kref' for your count, which will save
>>> a bunch of open-coding of reference counting, and forcing us to
>>> audit your code to verify you got all the corner cases correct?
>>> :)
>>>
>>>> + atomic_t active;
>>>> + struct super_block *sb;
>>
>> Not sure if using kref would change much here as the kref would not
>> really make it easier to verify those corner cases, unfortunately.
>
> Why not, that's the goal of a kref. Yes, you already did the hard
> work, but now you require everyone else to also do the hard work of
> trying to audit your code. That's why we have common functions/data
> structures in the kernel, to make long-term maintenance easier.
>
> Please switch to make it so that we "know" you are doing this
> correctly.
>
Alright, if this is to make the review any easier I'll do that.
Still, it will just replace fs_trace_entry_get/put with kref calls,
and I doubt it will help verify whether the references are being
acquired / released properly.
>>> Are you properly reference counting this pointer? I didn't see
>>> where that was happening, so I must have missed it.
>>>
>>> thanks,
>>>
>>
>> You haven't. And if I haven't missed anything, the sb is being used
>> only as long as the super is alive.
>
> How do you know that? :)
>
>> Most of the code operates on sb only if it was explicitly asked to,
>> through call from filesystem. There is also a callback notifying of
>> mount being dropped (which precedes the call to kill_super) that
>> invalidates the object that depends on it. Still, it should be
>> explicitly stated that the sb is being used through bumping up the
>> s_count counter, though that would require taking the sb_lock.
>> AFAIK, one can get the reference to super block but for a
>> particular device. Maybe it would be worth having it more generic
>> (?).
>
> Why not just grab a reference to the sb when you save the pointer,
> and release it when you are done with it? That should handle the
> lifecycle properly. It's always a very bad idea to have a pointer to
> a reference counted object without actually grabbing the reference,
> as you have no idea what is happening with it behind your back.
>
Ok, will do.
> thanks,
>
> greg k-h
>
Thanks for your comments so far. Much appreciated.
BR
Beata
On Mon 27-04-15 17:37:11, Greg KH wrote:
> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
> > On 04/27/2015 04:24 PM, Greg KH wrote:
> > > On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
> > >
> > > Any reason why you just don't do uevents for the block devices today,
> > > and not create a new type of netlink message and userspace tool required
> > > to read these?
> >
> > The idea here is to have support for filesystems with no backing device as well.
> > Parsing the message with libnl is really simple and requires few lines of code
> > (sample application has been presented in the initial version of this RFC)
>
> I'm not saying it's not "simple" to parse, just that now you are doing
> something that requires a different tool. If you have a block device,
> you should be able to emit uevents for it, you don't need a backing
> device, we handle virtual filesystems in /sys/block/ just fine :)
>
> People already have tools that listen to libudev for system monitoring
> and management, why require them to hook up to yet-another-library? And
> what is going to provide the ability for multiple userspace tools to
> listen to these netlink messages in case you have more than one program
> that wants to watch for these things (i.e. multiple desktop filesystem
> monitoring tools, system-health checkers, etc.)?
As much as I understand your concerns, I'm not convinced the uevent interface
is a good fit. There are filesystems that don't have an underlying block
device - think of e.g. tmpfs or filesystems working directly on top of
flash devices. These still want to send notifications to userspace (one of
the primary motivations for this interface was so that tmpfs can notify about
something). And creating some fake nodes in /sys/block for tmpfs and
similar filesystems seems like doing more harm than good to me...
Honza
> > Most of the code operates on sb only if it
> > was explicitly asked to, through call from filesystem. There is also
> > a callback notifying of mount being dropped (which precedes the call to
> > kill_super) that invalidates the object that depends on it.
> > Still, it should be explicitly stated that the sb is being used through
> > bumping up the s_count counter, though that would require taking the
> > sb_lock. AFAIK, one can get the reference to super block but for a particular
> > device. Maybe it would be worth having it more generic (?).
>
> Why not just grab a reference to the sb when you save the pointer, and
> release it when you are done with it? That should handle the lifecycle
> properly. It's always a very bad idea to have a pointer to a reference
> counted object without actually grabbing the reference, as you have no
> idea what is happening with it behind your back.
>
> thanks,
>
> greg k-h
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
> On Mon 27-04-15 17:37:11, Greg KH wrote:
> > On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
> > > On 04/27/2015 04:24 PM, Greg KH wrote:
> > > > On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
> > > >
> > > > Any reason why you just don't do uevents for the block devices today,
> > > > and not create a new type of netlink message and userspace tool required
> > > > to read these?
> > >
> > > The idea here is to have support for filesystems with no backing device as well.
> > > Parsing the message with libnl is really simple and requires few lines of code
> > > (sample application has been presented in the initial version of this RFC)
> >
> > I'm not saying it's not "simple" to parse, just that now you are doing
> > something that requires a different tool. If you have a block device,
> > you should be able to emit uevents for it, you don't need a backing
> > device, we handle virtual filesystems in /sys/block/ just fine :)
> >
> > People already have tools that listen to libudev for system monitoring
> > and management, why require them to hook up to yet-another-library? And
> > what is going to provide the ability for multiple userspace tools to
> > listen to these netlink messages in case you have more than one program
> > that wants to watch for these things (i.e. multiple desktop filesystem
> > monitoring tools, system-health checkers, etc.)?
> As much as I understand your concerns I'm not convinced uevent interface
> is a good fit. There are filesystems that don't have underlying block
> device - think of e.g. tmpfs or filesystems working directly on top of
> flash devices. These still want to send notification to userspace (one of
> primary motivation for this interfaces was so that tmpfs can notify about
> something). And creating some fake nodes in /sys/block for tmpfs and
> similar filesystems seems like doing more harm than good to me...
If these are "fake" block devices, what's going to be present in the
block major/minor fields of the netlink message? For some reason I
thought it was a required field, and because of that, I thought we had a
"real" filesystem somewhere to refer to, otherwise how would userspace
know what filesystem was creating these events?
What am I missing here?
confused,
greg k-h
On 04/28/2015 04:09 PM, Greg KH wrote:
> On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
>> On Mon 27-04-15 17:37:11, Greg KH wrote:
>>> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
>>>> On 04/27/2015 04:24 PM, Greg KH wrote:
>>>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>>>>>
>>>>> Any reason why you just don't do uevents for the block devices today,
>>>>> and not create a new type of netlink message and userspace tool required
>>>>> to read these?
>>>>
>>>> The idea here is to have support for filesystems with no backing device as well.
>>>> Parsing the message with libnl is really simple and requires few lines of code
>>>> (sample application has been presented in the initial version of this RFC)
>>>
>>> I'm not saying it's not "simple" to parse, just that now you are doing
>>> something that requires a different tool. If you have a block device,
>>> you should be able to emit uevents for it, you don't need a backing
>>> device, we handle virtual filesystems in /sys/block/ just fine :)
>>>
>>> People already have tools that listen to libudev for system monitoring
>>> and management, why require them to hook up to yet-another-library? And
>>> what is going to provide the ability for multiple userspace tools to
>>> listen to these netlink messages in case you have more than one program
>>> that wants to watch for these things (i.e. multiple desktop filesystem
>>> monitoring tools, system-health checkers, etc.)?
>> As much as I understand your concerns I'm not convinced uevent interface
>> is a good fit. There are filesystems that don't have underlying block
>> device - think of e.g. tmpfs or filesystems working directly on top of
>> flash devices. These still want to send notification to userspace (one of
>> primary motivation for this interfaces was so that tmpfs can notify about
>> something). And creating some fake nodes in /sys/block for tmpfs and
>> similar filesystems seems like doing more harm than good to me...
>
> If these are "fake" block devices, what's going to be present in the
> block major/minor fields of the netlink message? For some reason I
> thought it was a required field, and because of that, I thought we had a
> "real" filesystem somewhere to refer to, otherwise how would userspace
> know what filesystem was creating these events?
>
> What am I missing here?
>
> confused,
>
> greg k-h
>
For those 'fake' block devs, upon mount, get_anon_bdev will assign
the major:minor numbers. Userspace might get those through stat.
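
[Editor's sketch] A minimal illustration of the stat-based lookup mentioned above, assuming Linux with glibc (`major()`/`minor()` from `<sys/sysmacros.h>`). `struct watched_fs` and both helpers are hypothetical names, not part of the patch: the application records `st_dev` for each mount point it watches and later matches a reported major:minor pair against that table.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical record tying a watched mount point to its device id. */
struct watched_fs {
	const char *mount_point;
	dev_t dev;
};

/* Record st_dev for a mount point; returns 0 on success, -1 on error. */
static int watched_fs_init(struct watched_fs *w, const char *mount_point)
{
	struct stat st;

	assert(w != NULL && mount_point != NULL);
	if (stat(mount_point, &st) != 0)
		return -1;
	w->mount_point = mount_point;
	w->dev = st.st_dev;
	return 0;
}

/* Find the first watched entry matching a major:minor pair, or NULL. */
static const struct watched_fs *
watched_fs_lookup(const struct watched_fs *tab, size_t n,
		  unsigned int maj, unsigned int min)
{
	for (size_t i = 0; i < n; i++)
		if (major(tab[i].dev) == maj && minor(tab[i].dev) == min)
			return &tab[i];
	return NULL;
}
```

This works the same way for tmpfs and other filesystems without a backing device, since `st_dev` then carries the anonymous device number assigned at mount time.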
BR
Beata
On Tue, Apr 28, 2015 at 04:46:46PM +0200, Beata Michalska wrote:
> On 04/28/2015 04:09 PM, Greg KH wrote:
> > On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
> >> On Mon 27-04-15 17:37:11, Greg KH wrote:
> >>> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
> >>>> On 04/27/2015 04:24 PM, Greg KH wrote:
> >>>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
> >>>>>
> >>>>> Any reason why you just don't do uevents for the block devices today,
> >>>>> and not create a new type of netlink message and userspace tool required
> >>>>> to read these?
> >>>>
> >>>> The idea here is to have support for filesystems with no backing device as well.
> >>>> Parsing the message with libnl is really simple and requires few lines of code
> >>>> (sample application has been presented in the initial version of this RFC)
> >>>
> >>> I'm not saying it's not "simple" to parse, just that now you are doing
> >>> something that requires a different tool. If you have a block device,
> >>> you should be able to emit uevents for it, you don't need a backing
> >>> device, we handle virtual filesystems in /sys/block/ just fine :)
> >>>
> >>> People already have tools that listen to libudev for system monitoring
> >>> and management, why require them to hook up to yet-another-library? And
> >>> what is going to provide the ability for multiple userspace tools to
> >>> listen to these netlink messages in case you have more than one program
> >>> that wants to watch for these things (i.e. multiple desktop filesystem
> >>> monitoring tools, system-health checkers, etc.)?
> >> As much as I understand your concerns I'm not convinced uevent interface
> >> is a good fit. There are filesystems that don't have underlying block
> >> device - think of e.g. tmpfs or filesystems working directly on top of
> >> flash devices. These still want to send notification to userspace (one of
> >> primary motivation for this interfaces was so that tmpfs can notify about
> >> something). And creating some fake nodes in /sys/block for tmpfs and
> >> similar filesystems seems like doing more harm than good to me...
> >
> > If these are "fake" block devices, what's going to be present in the
> > block major/minor fields of the netlink message? For some reason I
> > thought it was a required field, and because of that, I thought we had a
> > "real" filesystem somewhere to refer to, otherwise how would userspace
> > know what filesystem was creating these events?
> >
> > What am I missing here?
> >
> > confused,
> >
> > greg k-h
> >
>
> For those 'fake' block devs, upon mount, get_anon_bdev will assign
> the major:minor numbers. Userspace might get those through stat.
How can userspace do the mapping backwards from this "anonymous"
major:minor number for these types of filesystems in such a way that
they can "know" how to report the block device that is causing the
event?
thanks,
greg k-h
On 04/28/2015 07:39 PM, Greg KH wrote:
> On Tue, Apr 28, 2015 at 04:46:46PM +0200, Beata Michalska wrote:
>> On 04/28/2015 04:09 PM, Greg KH wrote:
>>> On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
>>>> On Mon 27-04-15 17:37:11, Greg KH wrote:
>>>>> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
>>>>>> On 04/27/2015 04:24 PM, Greg KH wrote:
>>>>>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>>>>>>>
>>>>>>> Any reason why you just don't do uevents for the block devices today,
>>>>>>> and not create a new type of netlink message and userspace tool required
>>>>>>> to read these?
>>>>>>
>>>>>> The idea here is to have support for filesystems with no backing device as well.
>>>>>> Parsing the message with libnl is really simple and requires few lines of code
>>>>>> (sample application has been presented in the initial version of this RFC)
>>>>>
>>>>> I'm not saying it's not "simple" to parse, just that now you are doing
>>>>> something that requires a different tool. If you have a block device,
>>>>> you should be able to emit uevents for it, you don't need a backing
>>>>> device, we handle virtual filesystems in /sys/block/ just fine :)
>>>>>
>>>>> People already have tools that listen to libudev for system monitoring
>>>>> and management, why require them to hook up to yet-another-library? And
>>>>> what is going to provide the ability for multiple userspace tools to
>>>>> listen to these netlink messages in case you have more than one program
>>>>> that wants to watch for these things (i.e. multiple desktop filesystem
>>>>> monitoring tools, system-health checkers, etc.)?
>>>> As much as I understand your concerns I'm not convinced uevent interface
>>>> is a good fit. There are filesystems that don't have underlying block
>>>> device - think of e.g. tmpfs or filesystems working directly on top of
>>>> flash devices. These still want to send notification to userspace (one of
>>>> primary motivation for this interfaces was so that tmpfs can notify about
>>>> something). And creating some fake nodes in /sys/block for tmpfs and
>>>> similar filesystems seems like doing more harm than good to me...
>>>
>>> If these are "fake" block devices, what's going to be present in the
>>> block major/minor fields of the netlink message? For some reason I
>>> thought it was a required field, and because of that, I thought we had a
>>> "real" filesystem somewhere to refer to, otherwise how would userspace
>>> know what filesystem was creating these events?
>>>
>>> What am I missing here?
>>>
>>> confused,
>>>
>>> greg k-h
>>>
>>
>> For those 'fake' block devs, upon mount, get_anon_bdev will assign
>> the major:minor numbers. Userspace might get those through stat.
>
> How can userspace do the mapping backwards from this "anonymous"
> major:minor number for these types of filesystems in such a way that
> they can "know" how to report the block device that is causing the
> event?
>
> thanks,
>
> greg k-h
>
It needs to be done internally by the app, but it is doable.
The app knows what it is watching, so it can maintain the mappings itself.
Prior to activating the notifications, it can call 'stat' on the mount point.
The stat struct provides 'st_dev', which is the device id. The same id is
reported in the message payload (as major:minor numbers), so with this in hand
the app is able to get any other information it needs.
Note that the events refer to the file system as a whole and they may not
necessarily have anything to do with the actual block device.
BR
Beata
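A minimal sketch of the stat-based mapping described above, written in Python for brevity. It assumes the application already knows which mount points it watches. The helper names (device_numbers, build_watch_table, mount_point_for_event) are made up for this illustration and are not part of the proposed interface; decoding the netlink message itself is left out.

```python
import os


def device_numbers(path):
    """Return (major, minor) of the filesystem that contains `path`."""
    st_dev = os.stat(path).st_dev
    return os.major(st_dev), os.minor(st_dev)


def build_watch_table(mount_points):
    """Map (major, minor) -> mount point for every watched mount point."""
    return {device_numbers(mp): mp for mp in mount_points}


def mount_point_for_event(table, major, minor):
    """Look up the mount point for the numbers carried by an event.

    Returns None when the event refers to a filesystem we do not watch.
    """
    return table.get((major, minor))


if __name__ == "__main__":
    # Build the table before enabling notifications, then consult it
    # whenever an event arrives with a major:minor pair.
    table = build_watch_table(["/"])
    for (major, minor), mount_point in table.items():
        print(f"{major}:{minor} -> {mount_point}")
```

The table is built once, before the notifications are activated, so an incoming event only costs a dictionary lookup. It would need rebuilding if the watched mounts change.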
On Wed 29-04-15 09:03:08, Beata Michalska wrote:
> It needs to be done internally by the app but is doable.
> The app knows what it is watching, so it can maintain the mappings.
> So prior to activating the notifications it can call 'stat' on the mount point.
> Stat struct gives the 'st_dev' which is the device id. Same will be reported
> within the message payload (through major:minor numbers). So having this,
> the app is able to get any other information it needs.
> Note that the events refer to the file system as a whole and they may not
> necessarily have anything to do with the actual block device.
Or you can use /proc/self/mountinfo for the mapping. There you can see
device numbers, real device names if applicable and mountpoints. This has
the advantage that it works even if filesystem mountpoints change.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
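The /proc/self/mountinfo approach suggested above could look roughly like the sketch below. It relies only on the documented mountinfo layout (third field is major:minor, fifth field is the mount point, and the filesystem type and source follow the " - " separator). The function name parse_mountinfo and the shape of the returned table are assumptions made for this example.

```python
def parse_mountinfo(text):
    """Map (major, minor) -> list of (mount point, fs type, source).

    One device number can show up at several mount points (bind mounts),
    hence the list. Spaces inside paths are escaped as \\040 by the
    kernel, so splitting on whitespace is safe; unescaping is omitted.
    """
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Everything after " - " describes the filesystem itself.
        left, _, right = line.partition(" - ")
        fields = left.split()
        fstype, source = right.split()[:2]
        major, minor = (int(part) for part in fields[2].split(":"))
        table.setdefault((major, minor), []).append((fields[4], fstype, source))
    return table


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        for dev, mounts in parse_mountinfo(f.read()).items():
            print(dev, mounts)
```

Re-reading the file when an event arrives, rather than caching the result, is what makes this robust against mount points changing.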
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
On Wed, Apr 29, 2015 at 09:42:59AM +0200, Jan Kara wrote:
> On Wed 29-04-15 09:03:08, Beata Michalska wrote:
> > It needs to be done internally by the app but is doable.
> > The app knows what it is watching, so it can maintain the mappings.
> > So prior to activating the notifications it can call 'stat' on the mount point.
> > Stat struct gives the 'st_dev' which is the device id. Same will be reported
> > within the message payload (through major:minor numbers). So having this,
> > the app is able to get any other information it needs.
> > Note that the events refer to the file system as a whole and they may not
> > necessarily have anything to do with the actual block device.
How are you going to show an event for a filesystem that is made up of
multiple block devices?
> Or you can use /proc/self/mountinfo for the mapping. There you can see
> device numbers, real device names if applicable and mountpoints. This has
> the advantage that it works even if filesystem mountpoints change.
Ok, then that brings up my next question, how does this handle
namespaces? What namespace is the event being sent in? block devices
aren't namespaced, but the mount points are, is that going to cause
problems?
thanks,
greg k-h
On 04/29/2015 11:13 AM, Greg KH wrote:
> On Wed, Apr 29, 2015 at 09:42:59AM +0200, Jan Kara wrote:
>> On Wed 29-04-15 09:03:08, Beata Michalska wrote:
>>> It needs to be done internally by the app but is doable.
>>> The app knows what it is watching, so it can maintain the mappings.
>>> So prior to activating the notifications it can call 'stat' on the mount point.
>>> Stat struct gives the 'st_dev' which is the device id. Same will be reported
>>> within the message payload (through major:minor numbers). So having this,
>>> the app is able to get any other information it needs.
>>> Note that the events refer to the file system as a whole and they may not
>>> necessarily have anything to do with the actual block device.
>
> How are you going to show an event for a filesystem that is made up of
> multiple block devices?
AFAIK, such filesystems are a similar case: they also use anonymous
major:minor numbers - at least btrfs does. I am not sure we can
identify the actual block device here, so in this case such events
serve merely as a hint for userspace. At this point a user might
decide to run some scanning tools. We might extend the scope of the
info being sent, though I would consider this a nice-to-have rather than a
requirement for this initial version of the notifications. Filesystems
might also decide to send their own custom messages, so it is
possible for filesystems like btrfs to send more detailed information
using the new genetlink multicast group.
>
>> Or you can use /proc/self/mountinfo for the mapping. There you can see
>> device numbers, real device names if applicable and mountpoints. This has
>> the advantage that it works even if filesystem mountpoints change.
>
> Ok, then that brings up my next question, how does this handle
> namespaces? What namespace is the event being sent in? block devices
> aren't namespaced, but the mount points are, is that going to cause
> problems?
>
The path should get resolved properly (as from root level), though I must
admit I'm not sure whether there will be issues when it comes to network
namespaces. I'll double-check it. Any hints are more than welcome, though :)
> thanks,
>
> greg k-h
>
BR
Beata
On Wed, Apr 29, 2015 at 01:10:34PM +0200, Beata Michalska wrote:
> >>> It needs to be done internally by the app but is doable.
> >>> The app knows what it is watching, so it can maintain the mappings.
> >>> So prior to activating the notifications it can call 'stat' on the mount point.
> >>> Stat struct gives the 'st_dev' which is the device id. Same will be reported
> >>> within the message payload (through major:minor numbers). So having this,
> >>> the app is able to get any other information it needs.
> >>> Note that the events refer to the file system as a whole and they may not
> >>> necessarily have anything to do with the actual block device.
> >
> > How are you going to show an event for a filesystem that is made up of
> > multiple block devices?
>
> AFAIK, for such filesystems there will be similar case with the anonymous
> major:minor numbers - at least the btrfs is doing so. Not sure we can
> differentiate here the actual block device. So in this case such events
> serves merely as a hint for the userspace.
"hint" seems like this isn't really going to work well.
Do you have userspace code that can properly map this back to the "real"
device that is causing problems? Without that, this doesn't seem all
that useful as no one would be able to use those events.
> At this point a user might decide to run some scanning tools.
You can't run a scanning tool on a tmpfs :)
So what can a user do with information about one of these "virtual"
filesystems that it can't directly see or access?
> We might extend the scope of the
> info being sent, though I would consider this as a nice-to-have but not
> required for this initial version of notifications. The filesystems
> might also want to decide to send their own custom messages so it is
> possible for filesystems like btrfs to send more detailed information
> using the new genetlink multicast group.
> >> Or you can use /proc/self/mountinfo for the mapping. There you can see
> >> device numbers, real device names if applicable and mountpoints. This has
> >> the advantage that it works even if filesystem mountpoints change.
> >
> > Ok, then that brings up my next question, how does this handle
> > namespaces? What namespace is the event being sent in? block devices
> > aren't namespaced, but the mount points are, is that going to cause
> > problems?
> >
>
> The path should get resolved properly (as from root level). though I must
> admit I'm not sure if there will be no issues when it comes to the network
> namespaces. I'll double check it. Any hints though are more than welcomed :)
What is "root level" here? You can mount things in different namespaces
all over the place.
This is going to get really complex very quickly :(
I still think you should tie this to an existing sysfs device, which
handles the namespace issues for you, and it also handles the fact that
userspace can properly identify the device, if at all possible.
thanks,
greg k-h
On 04/29/2015 03:45 PM, Greg KH wrote:
> On Wed, Apr 29, 2015 at 01:10:34PM +0200, Beata Michalska wrote:
>>>>> It needs to be done internally by the app but is doable.
>>>>> The app knows what it is watching, so it can maintain the mappings.
>>>>> So prior to activating the notifications it can call 'stat' on the mount point.
>>>>> Stat struct gives the 'st_dev' which is the device id. Same will be reported
>>>>> within the message payload (through major:minor numbers). So having this,
>>>>> the app is able to get any other information it needs.
>>>>> Note that the events refer to the file system as a whole and they may not
>>>>> necessarily have anything to do with the actual block device.
>>>
>>> How are you going to show an event for a filesystem that is made up of
>>> multiple block devices?
>>
>> AFAIK, for such filesystems there will be similar case with the anonymous
>> major:minor numbers - at least the btrfs is doing so. Not sure we can
>> differentiate here the actual block device. So in this case such events
>> serves merely as a hint for the userspace.
>
> "hint" seems like this isn't really going to work well.
>
> Do you have userspace code that can properly map this back to the "real"
> device that is causing problems? Without that, this doesn't seem all
> that useful as no one would be able to use those events.
I'm not sure we are on the same page here.
This is about watching the file system rather than the 'real' device.
Take the threshold notifications: you would like to know when you
are approaching a certain level of available space for the tmpfs
mounted on /tmp. You know you are watching /tmp
and you know that the dev numbers for it are 0:20 (or so),
either through calling stat on /tmp or through reading /proc/$$/mountinfo.
With this interface you can set up threshold levels
for /tmp. Then, once the limit is reached, the event will be
sent with those anonymous major:minor numbers.
I can provide sample code which will demonstrate how this
can be achieved.
>
>> At this point a user might decide to run some scanning tools.
>
> You can't run a scanning tool on a tmpfs :)
I was referring to btrfs here, as a filesystem with multiple devices,
and its 'btrfs device scan' :)
>
> So what can a user do with information about one of these "virtual"
> filesystems that it can't directly see or access?
>
>> We might extend the scope of the
>> info being sent, though I would consider this as a nice-to-have but not
>> required for this initial version of notifications. The filesystems
>> might also want to decide to send their own custom messages so it is
>> possible for filesystems like btrfs to send more detailed information
>> using the new genetlink multicast group.
>>>> Or you can use /proc/self/mountinfo for the mapping. There you can see
>>>> device numbers, real device names if applicable and mountpoints. This has
>>>> the advantage that it works even if filesystem mountpoints change.
>>>
>>> Ok, then that brings up my next question, how does this handle
>>> namespaces? What namespace is the event being sent in? block devices
>>> aren't namespaced, but the mount points are, is that going to cause
>>> problems?
>>>
>>
>> The path should get resolved properly (as from root level). though I must
>> admit I'm not sure if there will be no issues when it comes to the network
>> namespaces. I'll double check it. Any hints though are more than welcomed :)
>
> What is "root level" here? You can mount things in different namespaces
> all over the place.
I was referring here to mount visibility and mount propagation,
which on some distros is set to shared by default (the make-shared option),
so mounts created in a new namespace are visible outside of it (running
cat /proc/$$/mountinfo showed the new mounts). Which got me really
confused, obviously.
>
> This is going to get really complex very quickly :(
It will be, and already is, indeed - still, I believe it's worth giving it a try.
I'll try to work out the namespace issue here and get back to you.
BR
Beata
>
> I still think you should tie this to an existing sysfs device, which
> handles the namespace issues for you, and it also handles the fact that
> userspace can properly identify the device, if at all possible.
>
> thanks,
>
> greg k-h
>
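To make the mount propagation point in the reply above a bit more concrete, here is a small sketch of how a userspace tool might check which mount namespace a process is in and whether a mount is marked as shared. It uses only /proc/<pid>/ns/mnt and the optional fields of a mountinfo line. The function names are made up for this illustration, and nothing here says how the proposed interface itself would treat namespaces.

```python
import os


def mount_namespace_id(pid="self"):
    """Return the mount-namespace identifier of a process.

    The symlink target looks like 'mnt:[4026531840]'; two processes share
    a mount namespace when their identifiers are equal.
    """
    return os.readlink(f"/proc/{pid}/ns/mnt")


def propagation_tags(mountinfo_line):
    """Return the optional fields ('shared:5', 'master:1', ...) of one line."""
    left, sep, _ = mountinfo_line.partition(" - ")
    if not sep:
        raise ValueError("not a mountinfo line: missing ' - ' separator")
    # Fields 0-5 are fixed; optional propagation fields follow them.
    return left.split()[6:]


def is_shared(mountinfo_line):
    """True when the mount belongs to a shared peer group."""
    return any(tag.startswith("shared:") for tag in propagation_tags(mountinfo_line))
```

A mount carrying a shared:N tag propagates to its peers, which would explain new mounts showing up in the mountinfo of a process outside the namespace where they were created.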
On Wed, Apr 29, 2015 at 05:48:14PM +0200, Beata Michalska wrote:
> On 04/29/2015 03:45 PM, Greg KH wrote:
> > On Wed, Apr 29, 2015 at 01:10:34PM +0200, Beata Michalska wrote:
> >>>>> It needs to be done internally by the app but is doable.
> >>>>> The app knows what it is watching, so it can maintain the mappings.
> >>>>> So prior to activating the notifications it can call 'stat' on the mount point.
> >>>>> Stat struct gives the 'st_dev' which is the device id. Same will be reported
> >>>>> within the message payload (through major:minor numbers). So having this,
> >>>>> the app is able to get any other information it needs.
> >>>>> Note that the events refer to the file system as a whole and they may not
> >>>>> necessarily have anything to do with the actual block device.
> >>>
> >>> How are you going to show an event for a filesystem that is made up of
> >>> multiple block devices?
> >>
> >> AFAIK, for such filesystems there will be similar case with the anonymous
> >> major:minor numbers - at least the btrfs is doing so. Not sure we can
> >> differentiate here the actual block device. So in this case such events
> >> serves merely as a hint for the userspace.
> >
> > "hint" seems like this isn't really going to work well.
> >
> > Do you have userspace code that can properly map this back to the "real"
> > device that is causing problems? Without that, this doesn't seem all
> > that useful as no one would be able to use those events.
>
> I'm not sure we are on the same page here.
> This is about watching the file system rather than the 'real' device.
> Like the threshold notifications: you would like to know when you
> will be approaching certain level of available space for the tmpfs
> mounted on /tmp. You do know you are watching the /tmp
> and you know that the dev numbers for this are 0:20 (or so).
> (either through calling stat on /tmp or through reading the /proc/$$/mountinfo)
> With this interface you can setup threshold levels
> for /tmp. Then, once the limit is reached the event will be
> sent with those anonymous major:minor numbers.
>
> I can provide a sample code which will demonstrate how this
> can be achieved.
Yes, example code would be helpful to understand this, thanks.
greg k-h
Hi,
On 04/29/2015 05:55 PM, Greg KH wrote:
> Yes, example code would be helpful to understand this, thanks.
>
> greg k-h
>
Below is an absolutely *simplified* sample application.
Hope this will be helpful.
---------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>
#include <sys/types.h>
#include <netlink/cli/utils.h>
#include <fs_event.h>

#define ARRAY_SIZE(x) (sizeof(x)/sizeof((x)[0]))
#define LOG(args...) fprintf(stderr, args)

#define BUFF_SIZE 256

/* Minimal circular doubly-linked list */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Same layout as the kernel-internal MKDEV(): minor in the low 20 bits */
#define MBITS 20
#define MAKE_DEV(major, minor) \
	(((major) << MBITS) | ((minor) & ((1U << MBITS) - 1)))

/* One entry per mount point found in /proc/self/mountinfo */
struct mount_data {
	struct list_node link;
	dev_t dev;
	char *dname;
};

static struct list_node mount_list = {&mount_list, &mount_list};

/* Insert new_node right after head */
static void list_add(struct list_node *new_node, struct list_node *head)
{
	struct list_node *node;

	node = head->next;
	head->next = new_node;
	new_node->prev = head;
	new_node->next = node;
	node->prev = new_node;
}

static struct mount_data *find_mount(struct list_node *mlist, dev_t dev)
{
	struct list_node *node;
	struct mount_data *mdata;

	for (node = mlist->prev; node != mlist; node = node->prev) {
		/* Step back from the embedded link to its enclosing mount_data */
		mdata = (struct mount_data *)((char *)node -
				(size_t) &((struct mount_data *)0)->link);
		if (mdata->dev == dev)
			return mdata;
	}
	return NULL;
}

static void create_mount_base(struct list_node *mlist)
{
	FILE *f;
	char entry[BUFF_SIZE];
	regex_t re;

	if (!(f = fopen("/proc/self/mountinfo", "r")))
		return;
	/* The first "major:minor" pair on each line identifies the mount */
	if (regcomp(&re, "[0-9]+:[0-9]+", REG_EXTENDED))
		goto leave;
	while (fgets(entry, BUFF_SIZE, f)) {
		regmatch_t pmatch;
		int dev_major, dev_minor;
		char *s;

		if (regexec(&re, entry, 1, &pmatch, 0))
			continue;
		if (pmatch.rm_so == -1)
			continue;
		if (sscanf(entry + pmatch.rm_so, "%d:%d",
			   &dev_major, &dev_minor) != 2)
			continue;
		/* Skip the root field; the next token is the mount point */
		s = entry + pmatch.rm_eo;
		s = strtok(++s, " ");
		if (!s)
			continue;
		if ((s = strtok(NULL, " "))) {
			struct mount_data *data = malloc(sizeof(*data));

			if (!data)
				continue;
			data->dev = MAKE_DEV(dev_major, dev_minor);
			data->dname = strdup(s);
			list_add(&data->link, mlist);
		}
	}
	regfree(&re);
leave:
	fclose(f);
	return;
}

static int parse_event(struct nl_cache_ops *unused, struct genl_cmd *cmd,
		       struct genl_info *info, void *arg)
{
	struct mount_data *mdata;
	int dev_major, dev_minor;

	dev_major = info->attrs[FS_NL_A_DEV_MAJOR]
		? nla_get_u32(info->attrs[FS_NL_A_DEV_MAJOR])
		: 0;
	dev_minor = info->attrs[FS_NL_A_DEV_MINOR]
		? nla_get_u32(info->attrs[FS_NL_A_DEV_MINOR])
		: 0;
	/* Map the reported device numbers back to a known mount point */
	mdata = find_mount(&mount_list, MAKE_DEV(dev_major, dev_minor));
	if (!mdata) {
		LOG("Unable to identify file system\n");
		return 0;
	}
	LOG("Notification received for %s\n", mdata->dname);
	/* Attributes may be absent, so never dereference them unchecked */
	LOG("Event ID: %u\n", info->attrs[FS_NL_A_EVENT_ID]
		? nla_get_u32(info->attrs[FS_NL_A_EVENT_ID])
		: 0);
	LOG("Owner: %u\n", info->attrs[FS_NL_A_CAUSED_ID]
		? nla_get_u32(info->attrs[FS_NL_A_CAUSED_ID])
		: 0);
	LOG("Threshold data: %llu\n", info->attrs[FS_NL_A_DATA]
		? (unsigned long long)nla_get_u64(info->attrs[FS_NL_A_DATA])
		: 0ULL);
	return 0;
}

static struct genl_cmd cmd[] = {
	{
		.c_id		= 1,
		.c_name		= "event",
		.c_maxattr	= 5,
		.c_msg_parser	= parse_event,
	},
};

static struct genl_ops ops = {
	.o_id		= GENL_ID_FS_EVENT,
	.o_name		= "FS_EVENT",
	.o_hdrsize	= 0,
	.o_cmds		= cmd,
	.o_ncmds	= ARRAY_SIZE(cmd),
};

int events_cb(struct nl_msg *msg, void *arg)
{
	return genl_handle_msg(msg, arg);
}

int main(int argc, char **argv)
{
	struct nl_sock *sock;
	int ret;

	/* Build the dev -> mount point table before any event can arrive */
	create_mount_base(&mount_list);
	sock = nl_cli_alloc_socket();
	nl_socket_set_local_port(sock, 0);
	/* Multicast notifications are unsolicited, so skip sequence checking */
	nl_socket_disable_seq_check(sock);
	nl_socket_modify_cb(sock, NL_CB_VALID, NL_CB_CUSTOM, events_cb, NULL);
	nl_cli_connect(sock, NETLINK_GENERIC);
	if ((ret = nl_socket_add_membership(sock, GENL_ID_FS_EVENT))) {
		LOG("Failed to add membership\n");
		goto leave;
	}
	if ((ret = genl_register_family(&ops))) {
		LOG("Failed to register protocol family\n");
		goto leave;
	}
	if ((ret = genl_ops_resolve(sock, &ops)) < 0) {
		LOG("Unable to resolve the family name\n");
		goto leave;
	}
	if (genl_ctrl_resolve(sock, "FS_EVENT") < 0) {
		LOG("Failed to resolve the family name\n");
		goto leave;
	}
	while (1) {
		if ((ret = nl_recvmsgs_default(sock)) < 0)
			LOG("Unable to receive message: %s\n",
			    nl_geterror(ret));
	}
leave:
	nl_close(sock);
	nl_socket_free(sock);
	return 0;
}
----------------------------
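The sample builds its table from /proc/self/mountinfo. The stat()-based
lookup mentioned earlier in the thread could be sketched roughly as below.
This is only a minimal sketch: mount_dev_numbers() is a hypothetical helper
name (not part of the patchset), and it assumes glibc's <sys/sysmacros.h>
for major()/minor().

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>	/* major(), minor() */

/*
 * Hypothetical helper (not part of the patchset): look up the device
 * numbers of the file system that contains 'path'.
 * Returns 0 on success, -1 if stat() fails.
 */
static int mount_dev_numbers(const char *path,
			     unsigned int *dev_major, unsigned int *dev_minor)
{
	struct stat st;

	assert(path && dev_major && dev_minor);
	if (stat(path, &st) < 0)
		return -1;
	/* st_dev uses the userspace encoding, so split it with major()/minor() */
	*dev_major = major(st.st_dev);
	*dev_minor = minor(st.st_dev);
	return 0;
}
```

Called on /tmp before registering the trace, this should give the same
major:minor pair that later shows up in the notification, so the app can
compare the two numbers separately instead of building a combined dev_t.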
The configuration setup for the app:
# echo /tmp T 50000 10000 > /sys/fs/events/config;
# echo /opt/usr G T 710000 500000 > /sys/fs/events/config;
(tmpfs and ext4 as the support for those is part of the patchset)
And the output after playing around with the 'dd':
Notification received for /tmp
Event ID: 3 /* FS_THR_LRBELOW */
Owner: 3128
Threshold data: 50000
Notification received for /opt/usr
Event ID: 3 /* FS_THR_LRBELOW */
Owner: 3127
Threshold data: 710000
Notification received for /tmp
Event ID: 5 /* FS_THR_URBELOW */
Owner: 3128
Threshold data: 10000
Notification received for /opt/usr
Event ID: 5 /* FS_THR_URBELOW */
Owner: 3127
Threshold data: 500000
Notification received for /opt/usr
Event ID: 1 /* FS_WARN_ENOSPC */
Owner: 3127
Threshold data: 0
Notification received for /opt/usr
Event ID: 1 /* FS_WARN_ENOSPC */
Owner: 3127
Threshold data: 0
-------------------------
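To make the event IDs in the output above easier to read, a listener might
translate them into names. The sketch below covers only the three IDs that
show up in this run (1, 3 and 5); fs_event_name() and format_event() are
hypothetical helpers, and any other ID is reported as unknown rather than
guessed.

```c
#include <assert.h>
#include <stdio.h>

/* Only the IDs seen in the sample output are listed; the rest stay unknown */
static const char *fs_event_name(unsigned int id)
{
	switch (id) {
	case 1:
		return "FS_WARN_ENOSPC";
	case 3:
		return "FS_THR_LRBELOW";
	case 5:
		return "FS_THR_URBELOW";
	default:
		return "unknown";
	}
}

/*
 * Format one "Event ID" line with the symbolic name as a trailing comment.
 * Returns what snprintf() returns: the length the full line needs.
 */
static int format_event(char *buf, size_t len, unsigned int id)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "Event ID: %u /* %s */", id,
			fs_event_name(id));
}
```

In the sample app this could replace the plain "Event ID: %d" print in
parse_event().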
BR
Beata
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Hi again,
On 04/29/2015 11:13 AM, Greg KH wrote:
> On Wed, Apr 29, 2015 at 09:42:59AM +0200, Jan Kara wrote:
>> On Wed 29-04-15 09:03:08, Beata Michalska wrote:
>>> On 04/28/2015 07:39 PM, Greg KH wrote:
>>>> On Tue, Apr 28, 2015 at 04:46:46PM +0200, Beata Michalska wrote:
>>>>> On 04/28/2015 04:09 PM, Greg KH wrote:
>>>>>> On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
>>>>>>> On Mon 27-04-15 17:37:11, Greg KH wrote:
>>>>>>>> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
>>>>>>>>> On 04/27/2015 04:24 PM, Greg KH wrote:
>>>>>>>>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>>>>>>>>>>> Introduce configurable generic interface for file
>>>>>>>>>>> system-wide event notifications, to provide file
>>>>>>>>>>> systems with a common way of reporting any potential
>>>>>>>>>>> issues as they emerge.
>>>>>>>>>>>
>>>>>>>>>>> The notifications are to be issued through generic
>>>>>>>>>>> netlink interface by newly introduced multicast group.
>>>>>>>>>>>
>>>>>>>>>>> Threshold notifications have been included, allowing
>>>>>>>>>>> triggering an event whenever the amount of free space drops
>>>>>>>>>>> below a certain level - or levels to be more precise as two
>>>>>>>>>>> of them are being supported: the lower and the upper range.
>>>>>>>>>>> The notifications work both ways: once the threshold level
>>>>>>>>>>> has been reached, an event shall be generated whenever
>>>>>>>>>>> the number of available blocks goes up again re-activating
>>>>>>>>>>> the threshold.
>>>>>>>>>>>
>>>>>>>>>>> The interface has been exposed through a vfs. Once mounted,
>>>>>>>>>>> it serves as an entry point for the set-up where one can
>>>>>>>>>>> register for particular file system events.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Beata Michalska <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> Documentation/filesystems/events.txt | 231 ++++++++++
>>>>>>>>>>> fs/Makefile | 1 +
>>>>>>>>>>> fs/events/Makefile | 6 +
>>>>>>>>>>> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
>>>>>>>>>>> fs/events/fs_event.h | 25 ++
>>>>>>>>>>> fs/events/fs_event_netlink.c | 99 +++++
>>>>>>>>>>> fs/namespace.c | 1 +
>>>>>>>>>>> include/linux/fs.h | 6 +-
>>>>>>>>>>> include/linux/fs_event.h | 58 +++
>>>>>>>>>>> include/uapi/linux/fs_event.h | 54 +++
>>>>>>>>>>> include/uapi/linux/genetlink.h | 1 +
>>>>>>>>>>> net/netlink/genetlink.c | 7 +-
>>>>>>>>>>> 12 files changed, 1257 insertions(+), 2 deletions(-)
>>>>>>>>>>> create mode 100644 Documentation/filesystems/events.txt
>>>>>>>>>>> create mode 100644 fs/events/Makefile
>>>>>>>>>>> create mode 100644 fs/events/fs_event.c
>>>>>>>>>>> create mode 100644 fs/events/fs_event.h
>>>>>>>>>>> create mode 100644 fs/events/fs_event_netlink.c
>>>>>>>>>>> create mode 100644 include/linux/fs_event.h
>>>>>>>>>>> create mode 100644 include/uapi/linux/fs_event.h
>>>>>>>>>>
>>>>>>>>>> Any reason why you just don't do uevents for the block devices today,
>>>>>>>>>> and not create a new type of netlink message and userspace tool required
>>>>>>>>>> to read these?
>>>>>>>>>
>>>>>>>>> The idea here is to have support for filesystems with no backing device as well.
>>>>>>>>> Parsing the message with libnl is really simple and requires few lines of code
>>>>>>>>> (sample application has been presented in the initial version of this RFC)
>>>>>>>>
>>>>>>>> I'm not saying it's not "simple" to parse, just that now you are doing
>>>>>>>> something that requires a different tool. If you have a block device,
>>>>>>>> you should be able to emit uevents for it, you don't need a backing
>>>>>>>> device, we handle virtual filesystems in /sys/block/ just fine :)
>>>>>>>>
>>>>>>>> People already have tools that listen to libudev for system monitoring
>>>>>>>> and management, why require them to hook up to yet-another-library? And
>>>>>>>> what is going to provide the ability for multiple userspace tools to
>>>>>>>> listen to these netlink messages in case you have more than one program
>>>>>>>> that wants to watch for these things (i.e. multiple desktop filesystem
>>>>>>>> monitoring tools, system-health checkers, etc.)?
>>>>>>> As much as I understand your concerns I'm not convinced uevent interface
>>>>>>> is a good fit. There are filesystems that don't have underlying block
>>>>>>> device - think of e.g. tmpfs or filesystems working directly on top of
>>>>>>> flash devices. These still want to send notification to userspace (one of
>>>>>>> primary motivation for this interfaces was so that tmpfs can notify about
>>>>>>> something). And creating some fake nodes in /sys/block for tmpfs and
>>>>>>> similar filesystems seems like doing more harm than good to me...
>>>>>>
>>>>>> If these are "fake" block devices, what's going to be present in the
>>>>>> block major/minor fields of the netlink message? For some reason I
>>>>>> thought it was a required field, and because of that, I thought we had a
>>>>>> "real" filesystem somewhere to refer to, otherwise how would userspace
>>>>>> know what filesystem was creating these events?
>>>>>>
>>>>>> What am I missing here?
>>>>>>
>>>>>> confused,
>>>>>>
>>>>>> greg k-h
>>>>>>
>>>>>
>>>>> For those 'fake' block devs, upon mount, get_anon_bdev will assign
>>>>> the major:minor numbers. Userspace might get those through stat.
>>>>
>>>> How can userspace do the mapping backwards from this "anonymous"
>>>> major:minor number for these types of filesystems in such a way that
>>>> they can "know" how to report the block device that is causing the
>>>> event?
>>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>>
>>>
>>> It needs to be done internally by the app but is doable.
>>> The app knows what it is watching, so it can maintain the mappings.
>>> So prior to activating the notifications it can call 'stat' on the mount point.
>>> Stat struct gives the 'st_dev' which is the device id. Same will be reported
>>> within the message payload (through major:minor numbers). So having this,
>>> the app is able to get any other information it needs.
>>> Note that the events refer to the file system as a whole and they may not
>>> necessarily have anything to do with the actual block device.
>
> How are you going to show an event for a filesystem that is made up of
> multiple block devices?
>
>> Or you can use /proc/self/mountinfo for the mapping. There you can see
>> device numbers, real device names if applicable and mountpoints. This has
>> the advantage that it works even if filesystem mountpoints change.
>
> Ok, then that brings up my next question, how does this handle
> namespaces? What namespace is the event being sent in? block devices
> aren't namespaced, but the mount points are, is that going to cause
> problems?
>
> thanks,
>
> greg k-h
>
Getting back to the namespaces ...

In the current state the notifications are sent to the init network namespace,
which means that processes belonging to a different net namespace will not
be able to receive them. To be more precise, those processes will not be
able to subscribe to the multicast group, though this can easily be changed.
Furthermore, the notifications could instead be sent to a specific namespace:
the one from which the trace for the mount point has been registered.
I believe that would be the best approach.

As for the mount namespaces, reading the config file needs to be slightly tweaked
to hide all the registered mount points which do not belong to the current
mount namespace.

Still, there is one possible 'issue' - the private/slave mount points.
As the notifications will be sent to all the listeners (within the same netns),
the events might be visible to processes outside the given mount ns.
Ideally this would be limited to only those listeners that share the mount namespace
to which such private/slave mount points belong. While filtering the outgoing
messages with generic netlink is doable (with small changes to the current
implementation), the filters themselves seem rather cumbersome, as they would require
finding the mount namespace of the socket's owner, which just doesn't seem right.
On the other hand, processes outside such a namespace will not be able to
identify the file system which generated the event: device major:minor
numbers are not bound to any namespace (afaict), so they will not provide any
valid information there. They will remain unresolved.

The best way out here, though, is to leave it to userspace to properly set up new namespaces:
a mount namespace with possible private/slave mounts should have a separate
network namespace to isolate the potential fs events, if required.
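
For what it's worth, from userspace it is at least possible to tell whether
two processes share a mount namespace, by comparing the device and inode
numbers behind /proc/<pid>/ns/mnt. A rough sketch of that check follows;
same_mnt_ns() is a hypothetical helper name, and this says nothing about how
(or whether) the kernel side should do the filtering.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>	/* getpid(), for callers checking themselves */

/*
 * Hypothetical helper: returns 1 if both PIDs are in the same mount
 * namespace, 0 if they are not, and -1 if either /proc/<pid>/ns/mnt
 * entry cannot be examined (no /proc, no permission, no such process).
 */
static int same_mnt_ns(pid_t a, pid_t b)
{
	char path_a[64], path_b[64];
	struct stat st_a, st_b;

	assert(a > 0 && b > 0);
	snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/mnt", (long)a);
	snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/mnt", (long)b);
	/* stat() follows the link; a namespace is identified by (st_dev, st_ino) */
	if (stat(path_a, &st_a) < 0 || stat(path_b, &st_b) < 0)
		return -1;
	return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}
```

A listener could call same_mnt_ns(getpid(), owner) with the owner reported
in the event and simply ignore notifications it has no way to resolve.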
BR
Beata
Hi,
On 05/05/2015 02:16 PM, Beata Michalska wrote:
> Hi again,
>
> Getting back to the namespaces ...
> In the current state the notifications will be sent to the init network namespace,
> which means that processes belonging to a different net namespace will not
> be able to receive them. To be more precise, those processes will not be
> able to subscribe to the multicast group, though this can be easily changed.
> Furthermore, the notifications might also be sent to specific namespace.
> In this case, the one, with which the trace for the mount point has been registered,
> which as I believe would be the best approach.
>
> As for the mount namespaces, reading the config file needs to be slightly tweaked,
> to hide away all the registered mount points which does not belong to the current
> mount namespace.
>
> Still, there is one possible 'issue' - the private/slave mount points.
> As the notifications will be sent to all the listeners (within the same netns),
> the events might be visible to processes outside the given mount ns.
> This should be limited to only those listeners that share the mount namespace,
> to which such private/slave mount points belong. As using the generic netlink
> to filter the outgoing messages is doable (with small changes to current
> implementation), the filters themselves seem rather cumbersome, as they would require
> finding the socket’s owner mount namespace, which just doesn't seems right.
> On the other hand, identifying the file system, which generated the event, will
> not be possible for processes outside such namespace, as device major:minor
> numbers are not bound to any namespace (afaict) so they will not provide any
> valid information. They will remain unresolved.
>
> The best way out here though, is to leave it to userspace to properly setup new namespaces:
> the mount namespace with possible private/slave mounts should have a separate
> network namespace to isolate the potential fs events, if required.
>
>
> BR
> Beata
>
>
>
I'm not really sure where we stand with this RFC now.
Just wanted to let you know I won't be available for the next two weeks,
in case this comes around.
Best Regards
Beata
Hi,
On 05/07/2015 01:57 PM, Beata Michalska wrote:
> Hi,
>
> On 05/05/2015 02:16 PM, Beata Michalska wrote:
>> Hi again,
>>
>> On 04/29/2015 11:13 AM, Greg KH wrote:
>>> On Wed, Apr 29, 2015 at 09:42:59AM +0200, Jan Kara wrote:
>>>> On Wed 29-04-15 09:03:08, Beata Michalska wrote:
>>>>> On 04/28/2015 07:39 PM, Greg KH wrote:
>>>>>> On Tue, Apr 28, 2015 at 04:46:46PM +0200, Beata Michalska wrote:
>>>>>>> On 04/28/2015 04:09 PM, Greg KH wrote:
>>>>>>>> On Tue, Apr 28, 2015 at 03:56:53PM +0200, Jan Kara wrote:
>>>>>>>>> On Mon 27-04-15 17:37:11, Greg KH wrote:
>>>>>>>>>> On Mon, Apr 27, 2015 at 05:08:27PM +0200, Beata Michalska wrote:
>>>>>>>>>>> On 04/27/2015 04:24 PM, Greg KH wrote:
>>>>>>>>>>>> On Mon, Apr 27, 2015 at 01:51:41PM +0200, Beata Michalska wrote:
>>>>>>>>>>>>> Introduce configurable generic interface for file
>>>>>>>>>>>>> system-wide event notifications, to provide file
>>>>>>>>>>>>> systems with a common way of reporting any potential
>>>>>>>>>>>>> issues as they emerge.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The notifications are to be issued through generic
>>>>>>>>>>>>> netlink interface by newly introduced multicast group.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Threshold notifications have been included, allowing
>>>>>>>>>>>>> triggering an event whenever the amount of free space drops
>>>>>>>>>>>>> below a certain level - or levels to be more precise as two
>>>>>>>>>>>>> of them are being supported: the lower and the upper range.
>>>>>>>>>>>>> The notifications work both ways: once the threshold level
>>>>>>>>>>>>> has been reached, an event shall be generated whenever
>>>>>>>>>>>>> the number of available blocks goes up again re-activating
>>>>>>>>>>>>> the threshold.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The interface has been exposed through a vfs. Once mounted,
>>>>>>>>>>>>> it serves as an entry point for the set-up where one can
>>>>>>>>>>>>> register for particular file system events.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Beata Michalska <[email protected]>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> Documentation/filesystems/events.txt | 231 ++++++++++
>>>>>>>>>>>>> fs/Makefile | 1 +
>>>>>>>>>>>>> fs/events/Makefile | 6 +
>>>>>>>>>>>>> fs/events/fs_event.c | 770 ++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>> fs/events/fs_event.h | 25 ++
>>>>>>>>>>>>> fs/events/fs_event_netlink.c | 99 +++++
>>>>>>>>>>>>> fs/namespace.c | 1 +
>>>>>>>>>>>>> include/linux/fs.h | 6 +-
>>>>>>>>>>>>> include/linux/fs_event.h | 58 +++
>>>>>>>>>>>>> include/uapi/linux/fs_event.h | 54 +++
>>>>>>>>>>>>> include/uapi/linux/genetlink.h | 1 +
>>>>>>>>>>>>> net/netlink/genetlink.c | 7 +-
>>>>>>>>>>>>> 12 files changed, 1257 insertions(+), 2 deletions(-)
>>>>>>>>>>>>> create mode 100644 Documentation/filesystems/events.txt
>>>>>>>>>>>>> create mode 100644 fs/events/Makefile
>>>>>>>>>>>>> create mode 100644 fs/events/fs_event.c
>>>>>>>>>>>>> create mode 100644 fs/events/fs_event.h
>>>>>>>>>>>>> create mode 100644 fs/events/fs_event_netlink.c
>>>>>>>>>>>>> create mode 100644 include/linux/fs_event.h
>>>>>>>>>>>>> create mode 100644 include/uapi/linux/fs_event.h
>>>>>>>>>>>>
>>>>>>>>>>>> Any reason why you just don't do uevents for the block devices today,
>>>>>>>>>>>> and not create a new type of netlink message and userspace tool required
>>>>>>>>>>>> to read these?
>>>>>>>>>>>
>>>>>>>>>>> The idea here is to have support for filesystems with no backing device as well.
>>>>>>>>>>> Parsing the message with libnl is really simple and requires few lines of code
>>>>>>>>>>> (sample application has been presented in the initial version of this RFC)
>>>>>>>>>>
>>>>>>>>>> I'm not saying it's not "simple" to parse, just that now you are doing
>>>>>>>>>> something that requires a different tool. If you have a block device,
>>>>>>>>>> you should be able to emit uevents for it, you don't need a backing
>>>>>>>>>> device, we handle virtual filesystems in /sys/block/ just fine :)
>>>>>>>>>>
>>>>>>>>>> People already have tools that listen to libudev for system monitoring
>>>>>>>>>> and management, why require them to hook up to yet-another-library? And
>>>>>>>>>> what is going to provide the ability for multiple userspace tools to
>>>>>>>>>> listen to these netlink messages in case you have more than one program
>>>>>>>>>> that wants to watch for these things (i.e. multiple desktop filesystem
>>>>>>>>>> monitoring tools, system-health checkers, etc.)?
>>>>>>>>> As much as I understand your concerns I'm not convinced uevent interface
>>>>>>>>> is a good fit. There are filesystems that don't have underlying block
>>>>>>>>> device - think of e.g. tmpfs or filesystems working directly on top of
>>>>>>>>> flash devices. These still want to send notification to userspace (one of
>>>>>>>>> primary motivation for this interfaces was so that tmpfs can notify about
>>>>>>>>> something). And creating some fake nodes in /sys/block for tmpfs and
>>>>>>>>> similar filesystems seems like doing more harm than good to me...
>>>>>>>>
>>>>>>>> If these are "fake" block devices, what's going to be present in the
>>>>>>>> block major/minor fields of the netlink message? For some reason I
>>>>>>>> thought it was a required field, and because of that, I thought we had a
>>>>>>>> "real" filesystem somewhere to refer to, otherwise how would userspace
>>>>>>>> know what filesystem was creating these events?
>>>>>>>>
>>>>>>>> What am I missing here?
>>>>>>>>
>>>>>>>> confused,
>>>>>>>>
>>>>>>>> greg k-h
>>>>>>>>
>>>>>>>
>>>>>>> For those 'fake' block devs, upon mount, get_anon_bdev will assign
>>>>>>> the major:minor numbers. Userspace might get those through stat.
>>>>>>
>>>>>> How can userspace do the mapping backwards from this "anonymous"
>>>>>> major:minor number for these types of filesystems in such a way that
>>>>>> they can "know" how to report the block device that is causing the
>>>>>> event?
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> greg k-h
>>>>>>
>>>>>
>>>>> It needs to be done internally by the app, but it is doable.
>>>>> The app knows what it is watching, so it can maintain the mappings.
>>>>> Prior to activating the notifications, it can call 'stat' on the mount point.
>>>>> The stat struct gives 'st_dev', which is the device id. The same id will be
>>>>> reported within the message payload (through the major:minor numbers). With this,
>>>>> the app is able to get any other information it needs.
>>>>> Note that the events refer to the file system as a whole and they may not
>>>>> necessarily have anything to do with the actual block device.
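
The stat-based mapping described above could look roughly like the sketch
below. It is only an illustration of the idea, not code from the patchset:
struct fs_id and both helper names are hypothetical, and the event's
major:minor pair is assumed to have been extracted from the netlink message
already.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Hypothetical record an app might keep per watched mount point. */
struct fs_id {
	unsigned int dev_major;
	unsigned int dev_minor;
};

/*
 * Record the device id of the filesystem mounted at 'path'.
 * Returns 0 on success, -1 if stat() fails (errno is set by stat()).
 */
static int fs_id_from_path(const char *path, struct fs_id *out)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	/* st_dev identifies the filesystem; split it into major:minor. */
	out->dev_major = major(st.st_dev);
	out->dev_minor = minor(st.st_dev);
	return 0;
}

/* Compare a stored id with the major:minor pair carried by an event. */
static bool fs_id_matches(const struct fs_id *id,
			  unsigned int ev_major, unsigned int ev_minor)
{
	return id->dev_major == ev_major && id->dev_minor == ev_minor;
}
```

An app would call fs_id_from_path() once per mount point before enabling
notifications and then run fs_id_matches() against each incoming event.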
>>>
>>> How are you going to show an event for a filesystem that is made up of
>>> multiple block devices?
>>>
>>>> Or you can use /proc/self/mountinfo for the mapping. There you can see
>>>> device numbers, real device names if applicable and mountpoints. This has
>>>> the advantage that it works even if filesystem mountpoints change.
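
A minimal sketch of the /proc/self/mountinfo lookup, assuming the field
layout documented in proc(5) (mount ID, parent ID, major:minor, root, mount
point, ...). The helper names and MNT_PATH_MAX are hypothetical. Only the
first matching entry is returned, although bind mounts can make several
entries share one device number.

```c
#include <stdio.h>

#define MNT_PATH_MAX 4096

/*
 * Parse one mountinfo line. On success returns 0 and fills in the device
 * numbers and the mount point (which must hold MNT_PATH_MAX bytes).
 * Returns -1 if the line does not have the expected layout.
 */
static int parse_mountinfo_line(const char *line, unsigned int *maj,
				unsigned int *min, char *mnt)
{
	/* Skip mount ID, parent ID and root; keep major:minor and mount point. */
	if (sscanf(line, "%*d %*d %u:%u %*s %4095s", maj, min, mnt) != 3)
		return -1;
	return 0;
}

/*
 * Find the first mount point whose device numbers match maj:min.
 * Returns 0 and copies the mount point into 'mnt' on success, -1 otherwise.
 */
static int mountpoint_for_dev(unsigned int maj, unsigned int min,
			      char *mnt, size_t mnt_len)
{
	char line[2 * MNT_PATH_MAX];
	char mp[MNT_PATH_MAX];
	unsigned int m, n;
	int found = -1;
	FILE *f = fopen("/proc/self/mountinfo", "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (parse_mountinfo_line(line, &m, &n, mp) == 0 &&
		    m == maj && n == min) {
			snprintf(mnt, mnt_len, "%s", mp);
			found = 0;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Because the file is re-read on each lookup, the result reflects the current
mount table, so it keeps working when mount points change.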
>>>
>>> Ok, then that brings up my next question, how does this handle
>>> namespaces? What namespace is the event being sent in? block devices
>>> aren't namespaced, but the mount points are, is that going to cause
>>> problems?
>>>
>>> thanks,
>>>
>>> greg k-h
>>>
>>
>> Getting back to the namespaces ...
>> In the current state the notifications will be sent to the init network namespace,
>> which means that processes belonging to a different net namespace will not
>> be able to receive them. To be more precise, those processes will not be
>> able to subscribe to the multicast group, though this can be easily changed.
>> Furthermore, the notifications might also be sent to a specific namespace:
>> in this case, the one with which the trace for the mount point has been registered,
>> which I believe would be the best approach.
>>
>> As for the mount namespaces, reading the config file needs to be slightly tweaked
>> to hide all the registered mount points which do not belong to the current
>> mount namespace.
>>
>> Still, there is one possible 'issue' - the private/slave mount points.
>> As the notifications will be sent to all the listeners (within the same netns),
>> the events might be visible to processes outside the given mount ns.
>> This should be limited to only those listeners that share the mount namespace
>> to which such private/slave mount points belong. While using generic netlink
>> to filter the outgoing messages is doable (with small changes to the current
>> implementation), the filters themselves seem rather cumbersome, as they would require
>> finding the mount namespace of the socket's owner, which just doesn't seem right.
>> On the other hand, identifying the file system which generated the event will
>> not be possible for processes outside such a namespace, as device major:minor
>> numbers are not bound to any namespace (afaict), so they will not provide any
>> valid information. They will remain unresolved.
>>
>> The best way out here, though, is to leave it to userspace to properly set up new namespaces:
>> a mount namespace with possible private/slave mounts should have a separate
>> network namespace to isolate the potential fs events, if required.
>>
>>
>> BR
>> Beata
>>
>>
>>
>
> I'm not really sure where we are with this RFC now (?).
> Just wanted to let You know I won't be available for the next two weeks,
> in case this comes around.
>
> Best Regards
> Beata
>
>
Things have gone a bit quiet thread-wise ...
As I believe I've managed to snap back to reality, I was hoping we could continue with this?
I'm not sure if we've got everything cleared up or ... have we reached a dead end?
Please let me know if we can move to the next stage? Or, if there are any showstoppers?
Thank You,
Best Regards
Beata
On Tue, May 26, 2015 at 06:39:48PM +0200, Beata Michalska wrote:
> Hi,
>
> Things have gone a bit quiet thread-wise ...
> As I believe I've managed to snap back to reality, I was hoping we could continue with this?
> I'm not sure if we've got everything cleared up or ... have we reached a dead end?
> Please let me know if we can move to the next stage? Or, if there are any showstoppers?
Please resend if you think it's ready and you have addressed the issues
raised so far.
thanks,
greg k-h
On 05/27/2015 04:34 AM, Greg KH wrote:
> On Tue, May 26, 2015 at 06:39:48PM +0200, Beata Michalska wrote:
>> Hi,
>>
>> Things have gone a bit quiet thread-wise ...
>> As I believe I've managed to snap back to reality, I was hoping we could continue with this?
>> I'm not sure if we've got everything cleared up or ... have we reached a dead end?
>> Please let me know if we can move to the next stage? Or, if there are any showstoppers?
>
> Please resend if you think it's ready and you have addressed the issues
> raised so far.
>
> thanks,
>
> greg k-h
>
Alright.
I'm still running some tests so I'll resend it most probably tomorrow
or on Friday.
Best Regards
Beata