2008-10-02 05:55:42

by Andrew Morton

Subject: Re: [PATCH 2/3] Memory management livelock


> Subject: [PATCH 2/3] Memory management livelock

Please don't send multiple patches with identical titles - think up a
good, unique, meaningful title for each patch.

On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:

> Avoid starvation when walking address space.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>

Please include a full changelog with each iteration of each patch.

That changelog should explain the reason for playing games with
bitlocks so Linus doesn't have kittens when he sees it.

> include/linux/pagemap.h | 1 +
> mm/filemap.c | 20 ++++++++++++++++++++
> mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
> mm/truncate.c | 24 +++++++++++++++++++++++-
> 4 files changed, 80 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
> +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
> @@ -21,6 +21,7 @@
> #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
> #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
> +#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
>
> static inline void mapping_set_error(struct address_space *mapping, int error)
> {
> Index: linux-2.6.27-rc7-devel/mm/filemap.c
> ===================================================================
> --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
> +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
> @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
> int nr_pages;
> int ret = 0;
> pgoff_t index;
> + long pages_to_process;
>
> if (end < start)
> return 0;
>
> + /*
> + * Estimate the number of pages to process. If we process significantly
> + * more than this, someone is making writeback pages under us.
> + * We must pull the anti-starvation plug.
> + */
> + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> + pages_to_process += (pages_to_process >> 3) + 16;

This sequence appears twice and it would probably be clearer to
implement it in a well-commented function.

> pagevec_init(&pvec, 0);
> index = start;
> while ((index <= end) &&
> @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
> if (page->index > end)
> continue;
>
> + if (pages_to_process >= 0)
> + if (!pages_to_process--)
> + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);

This is copied three times and perhaps also should be factored out.

Please note that an effort has been made to make mm/filemap.c look
presentable in an 80-col display.

> wait_on_page_writeback(page);
> if (PageError(page))
> ret = -EIO;
> @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct
> cond_resched();
> }
>
> + if (pages_to_process < 0) {
> + smp_mb__before_clear_bit();
> + clear_bit(AS_STARVATION, &mapping->flags);
> + smp_mb__after_clear_bit();
> + wake_up_bit(&mapping->flags, AS_STARVATION);
> + }

This sequence is repeated three or four times and should be pulled out
into a well-commented function. That comment should explain the logic
behind the use of these barriers, please.


2008-10-05 22:12:27

by Mikulas Patocka

Subject: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)

Hi

I removed the repeated code and created new bit mutexes. They are
space-efficient mutexes that consume only one bit each. See the next 3 patches.

If you are concerned about the size of an inode, I can convert other
mutexes to bit mutexes: i_mutex and inotify_mutex. I could also create
bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
for address_space->tree_lock, address_space->i_mmap_lock,
address_space->private_lock, inode->i_lock.

Look at it and say what you think about the idea of condensing mutexes
into single bits.

Mikulas

>
> > Subject: [PATCH 2/3] Memory management livelock
>
> Please don't send multiple patches with identical titles - think up a
> good, unique, meaningful title for each patch.
>
> On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka <[email protected]> wrote:
>
> > Avoid starvation when walking address space.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> >
>
> Please include a full changelog with each iteration of each patch.
>
> That changelog should explain the reason for playing games with
> bitlocks so Linus doesn't have kittens when he sees it.
>
> > include/linux/pagemap.h | 1 +
> > mm/filemap.c | 20 ++++++++++++++++++++
> > mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
> > mm/truncate.c | 24 +++++++++++++++++++++++-
> > 4 files changed, 80 insertions(+), 2 deletions(-)
> >
> > Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
> > ===================================================================
> > --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
> > +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
> > @@ -21,6 +21,7 @@
> > #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
> > #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
> > +#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
> >
> > static inline void mapping_set_error(struct address_space *mapping, int error)
> > {
> > Index: linux-2.6.27-rc7-devel/mm/filemap.c
> > ===================================================================
> > --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
> > +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
> > @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
> > int nr_pages;
> > int ret = 0;
> > pgoff_t index;
> > + long pages_to_process;
> >
> > if (end < start)
> > return 0;
> >
> > + /*
> > + * Estimate the number of pages to process. If we process significantly
> > + * more than this, someone is making writeback pages under us.
> > + * We must pull the anti-starvation plug.
> > + */
> > + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > + pages_to_process += (pages_to_process >> 3) + 16;
>
> This sequence appears twice and it would probably be clearer to
> implement it in a well-commented function.
>
> > pagevec_init(&pvec, 0);
> > index = start;
> > while ((index <= end) &&
> > @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
> > if (page->index > end)
> > continue;
> >
> > + if (pages_to_process >= 0)
> > + if (!pages_to_process--)
> > + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
>
> This is copied three times and perhaps also should be factored out.
>
> Please note that an effort has been made to make mm/filemap.c look
> presentable in an 80-col display.
>
> > wait_on_page_writeback(page);
> > if (PageError(page))
> > ret = -EIO;
> > @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct
> > cond_resched();
> > }
> >
> > + if (pages_to_process < 0) {
> > + smp_mb__before_clear_bit();
> > + clear_bit(AS_STARVATION, &mapping->flags);
> > + smp_mb__after_clear_bit();
> > + wake_up_bit(&mapping->flags, AS_STARVATION);
> > + }
>
> This sequence is repeated three or four times and should be pulled out
> into a well-commented function. That comment should explain the logic
> behind the use of these barriers, please.
>
>

2008-10-05 22:14:22

by Mikulas Patocka

Subject: [PATCH 1/3] bit mutexes

A bit mutex implementation.

A bit mutex is a space-efficient mutex that consumes exactly one bit in
the appropriate structure (if mutex debugging is turned off). It is
implemented as a bit in an unsigned long field. Other bits in the field
may be used for other purposes, if the test/set/clear_bit functions are
used to manipulate them. There is no wait queue in the structure containing
the bit mutex; the hashed wait queues for waiting on bits are used instead.

An additional structure, struct bit_mutex, is needed; it contains lock
debugging & dependency information. When the kernel is compiled without
lock debugging, this structure is empty.

Bit mutexes are used with functions
bit_mutex_init
bit_mutex_lock
bit_mutex_unlock
bit_mutex_is_locked
These functions take 3 arguments: pointer to the unsigned long field,
the bit position and pointer to struct bit_mutex.

Signed-off-by: Mikulas Patocka <[email protected]>

---
include/linux/bit-mutex.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/wait.h | 8 +++
kernel/Makefile | 2
kernel/bit-mutex-debug.c | 55 +++++++++++++++++++++++++
kernel/wait.c | 7 +++
5 files changed, 169 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc8-devel/include/linux/bit-mutex.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/include/linux/bit-mutex.h 2008-10-05 19:58:30.000000000 +0200
@@ -0,0 +1,98 @@
+#ifndef __LINUX_BIT_MUTEX_H
+#define __LINUX_BIT_MUTEX_H
+
+#include <linux/bitops.h>
+#include <linux/lockdep.h>
+#include <linux/wait.h>
+#include <linux/sched.h>
+
+struct bit_mutex {
+#ifdef CONFIG_DEBUG_MUTEXES
+ unsigned long *field;
+ int bit;
+ struct thread_info *owner;
+ const char *name;
+ void *magic;
+#endif
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+#ifndef CONFIG_DEBUG_MUTEXES
+
+static inline void bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+ clear_bit(bit, field);
+ smp_mb__after_clear_bit();
+}
+
+static inline void bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+ wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+}
+
+static inline void bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+ smp_mb__before_clear_bit();
+ clear_bit(bit, field);
+ smp_mb__after_clear_bit();
+ wake_up_bit(field, bit);
+}
+
+static inline int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+ return test_bit(bit, field);
+}
+
+#define __DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx)
+
+#else
+
+void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key);
+
+#define bit_mutex_init(field, bit, mtx) \
+do { \
+ static struct lock_class_key __key; \
+ __bit_mutex_init(field, bit, mtx, #mtx, &__key); \
+} while (0)
+
+void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
+
+#define bit_mutex_lock(field, bit, mtx) \
+ __bit_mutex_lock(field, bit, mtx, 0)
+
+void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
+
+#define bit_mutex_unlock(field, bit, mtx) \
+ __bit_mutex_unlock(field, bit, mtx, 0)
+
+int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx);
+
+#define __DEBUG_BIT_MUTEX_INITIALIZER(field_, bit_, mtx) \
+ .magic = &(mtx), \
+ .field = (field_), \
+ .bit = (bit_)
+
+#endif
+
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx) \
+ , .dep_map = { .name = #mtx }
+
+#else
+
+#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)
+
+#endif
+
+
+#define __BIT_MUTEX_INITIALIZER(field, bit, mtx) \
+ { \
+ __DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx) \
+ __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)\
+ }
+
+#endif
Index: linux-2.6.27-rc8-devel/include/linux/wait.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/wait.h 2008-10-04 13:40:46.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/wait.h 2008-10-04 13:41:21.000000000 +0200
@@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void
return 0;
return out_of_line_wait_on_bit_lock(word, bit, action, mode);
}
-
+
+/**
+ * wait_action_schedule - this function can be passed to wait_on_bit or
+ * wait_on_bit_lock and it will just call schedule().
+ */
+int wait_action_schedule(void *);
+
#endif /* __KERNEL__ */

#endif
Index: linux-2.6.27-rc8-devel/kernel/wait.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/kernel/wait.c 2008-10-04 13:37:24.000000000 +0200
+++ linux-2.6.27-rc8-devel/kernel/wait.c 2008-10-04 13:38:21.000000000 +0200
@@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
}
EXPORT_SYMBOL(bit_waitqueue);
+
+int wait_action_schedule(void *word)
+{
+ schedule();
+ return 0;
+}
+EXPORT_SYMBOL(wait_action_schedule);
Index: linux-2.6.27-rc8-devel/kernel/Makefile
===================================================================
--- linux-2.6.27-rc8-devel.orig/kernel/Makefile 2008-10-05 14:03:24.000000000 +0200
+++ linux-2.6.27-rc8-devel/kernel/Makefile 2008-10-05 14:11:25.000000000 +0200
@@ -17,6 +17,7 @@ ifdef CONFIG_FTRACE
# Do not trace debug files and internal ftrace files
CFLAGS_REMOVE_lockdep.o = -pg
CFLAGS_REMOVE_lockdep_proc.o = -pg
+CFLAGS_REMOVE_bit-mutex-debug.o = -pg
CFLAGS_REMOVE_mutex-debug.o = -pg
CFLAGS_REMOVE_rtmutex-debug.o = -pg
CFLAGS_REMOVE_cgroup-debug.o = -pg
@@ -29,6 +30,7 @@ obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sy
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
+obj-$(CONFIG_DEBUG_MUTEXES) += bit-mutex-debug.o
obj-$(CONFIG_LOCKDEP) += lockdep.o
ifeq ($(CONFIG_PROC_FS),y)
obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
Index: linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c 2008-10-05 19:12:06.000000000 +0200
@@ -0,0 +1,55 @@
+#include <linux/module.h>
+#include <linux/bitops.h>
+#include <linux/debug_locks.h>
+#include <linux/bit-mutex.h>
+
+void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ debug_check_no_locks_freed((void *)mtx, sizeof(*mtx));
+ lockdep_init_map(&mtx->dep_map, name, key, 0);
+#endif
+ mtx->field = field;
+ mtx->bit = bit;
+ mtx->owner = NULL;
+ mtx->magic = mtx;
+ clear_bit(bit, field);
+ smp_mb__after_clear_bit();
+}
+EXPORT_SYMBOL(__bit_mutex_init);
+
+void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass)
+{
+ DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+ DEBUG_LOCKS_WARN_ON(field != mtx->field);
+ DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
+ mutex_acquire(&mtx->dep_map, subclass, 0, _RET_IP_);
+ wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
+ lock_acquired(&mtx->dep_map);
+ mtx->owner = current_thread_info();
+}
+EXPORT_SYMBOL(__bit_mutex_lock);
+
+void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int nested)
+{
+ DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+ DEBUG_LOCKS_WARN_ON(mtx->owner != current_thread_info());
+ DEBUG_LOCKS_WARN_ON(mtx->field != field);
+ DEBUG_LOCKS_WARN_ON(mtx->bit != bit);
+ mtx->owner = NULL;
+ mutex_release(&mtx->dep_map, nested, _RET_IP_);
+ smp_mb__before_clear_bit();
+ clear_bit(bit, field);
+ smp_mb__after_clear_bit();
+ wake_up_bit(field, bit);
+}
+EXPORT_SYMBOL(__bit_mutex_unlock);
+
+int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
+{
+ DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
+ DEBUG_LOCKS_WARN_ON(field != mtx->field);
+ DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
+ return test_bit(bit, field);
+}
+EXPORT_SYMBOL(bit_mutex_is_locked);

2008-10-05 22:15:18

by Mikulas Patocka

Subject: [PATCH 2/3] Fix fsync livelock

Fix starvation in memory management.

The bug happens when one process is doing sequential buffered writes to
a block device (or file) and another process is attempting to execute
sync(), fsync() or direct-IO on that device (or file). This syncing
process will wait indefinitely, until the first writing process
finishes.

For example, run these two commands:
dd if=/dev/zero of=/dev/sda1 bs=65536 &
dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

The bug is caused by sequential walking of address space in
write_cache_pages and wait_on_page_writeback_range: if some other
process is constantly making dirty and writeback pages while these
functions run, the functions will wait on every new page, resulting in
indefinite wait.

To fix the starvation, I declared a bit mutex starvation_barrier in
struct address_space. The bit AS_STARVATION_BARRIER in
address_space->flags is used for actual locking. When the mutex is
taken, anyone making dirty pages on that address space should stop. The
functions that walk address space sequentially first estimate a number
of pages to process. If they process more than this amount (plus some
small delta), it means that someone is making dirty pages under them ---
they take the mutex and hold it until they finish. When the mutex is
taken, the function balance_dirty_pages will wait and not allow more
dirty pages to be made.

The mutex is not really used as a mutex, it does not prevent access to
any critical section. It is used as a barrier that blocks new dirty
pages from being created. I use a mutex and not a wait queue to make sure
that the starvation can't happen the other way (with a wait queue,
excessive calls to write_cache_pages and
wait_on_page_writeback_range would block balance_dirty_pages forever;
with a mutex it is guaranteed that every process eventually makes
progress).

The essential property of this patch is that if the starvation doesn't
happen, no additional locks are taken and no atomic operations are
performed. So the patch shouldn't damage performance.

The indefinite starvation was observed in write_cache_pages and
wait_on_page_writeback_range. Another possibility where it could happen
is invalidate_inode_pages2_range.

There are two more functions that walk all the pages in address space,
but I think they don't need this starvation protection:
truncate_inode_pages_range: it is called with i_mutex locked, so no new
pages are created under it.
__invalidate_mapping_pages: it skips locked, dirty and writeback pages,
so there's no point in protecting the function against starving on them.

Signed-off-by: Mikulas Patocka <[email protected]>

---
fs/inode.c | 1 +
include/linux/fs.h | 3 +++
include/linux/pagemap.h | 14 ++++++++++++++
mm/filemap.c | 5 +++++
mm/page-writeback.c | 18 +++++++++++++++++-
mm/swap_state.c | 1 +
mm/truncate.c | 10 +++++++++-
7 files changed, 50 insertions(+), 2 deletions(-)

Index: linux-2.6.27-rc8-devel/fs/inode.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/fs/inode.c 2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/fs/inode.c 2008-10-04 16:48:04.000000000 +0200
@@ -216,6 +216,7 @@ void inode_init_once(struct inode *inode
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
+ bit_mutex_init(&inode->i_data.flags, AS_STARVATION_BARRIER, &inode->i_data.starvation_barrier);
i_size_ordered_init(inode);
#ifdef CONFIG_INOTIFY
INIT_LIST_HEAD(&inode->inotify_watches);
Index: linux-2.6.27-rc8-devel/include/linux/fs.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/fs.h 2008-10-04 16:47:54.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/fs.h 2008-10-04 16:49:54.000000000 +0200
@@ -289,6 +289,7 @@ extern int dir_notify_enable;
#include <linux/init.h>
#include <linux/pid.h>
#include <linux/mutex.h>
+#include <linux/bit-mutex.h>
#include <linux/capability.h>
#include <linux/semaphore.h>

@@ -539,6 +540,8 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+
+ struct bit_mutex starvation_barrier;/* taken when fsync starves */
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
Index: linux-2.6.27-rc8-devel/include/linux/pagemap.h
===================================================================
--- linux-2.6.27-rc8-devel.orig/include/linux/pagemap.h 2008-10-04 16:47:54.000000000 +0200
+++ linux-2.6.27-rc8-devel/include/linux/pagemap.h 2008-10-04 16:48:04.000000000 +0200
@@ -21,6 +21,7 @@
#define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
#define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
+#define AS_STARVATION_BARRIER (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */

static inline void mapping_set_error(struct address_space *mapping, int error)
{
@@ -424,4 +425,17 @@ static inline int add_to_page_cache(stru
return error;
}

+#define starvation_protection_head(n) \
+ long pages_to_process = (n); \
+ pages_to_process += (pages_to_process >> 3) + 16;
+
+#define starvation_protection_step(mapping) \
+ if (pages_to_process >= 0) \
+ if (!pages_to_process--) \
+ bit_mutex_lock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
+
+#define starvation_protection_end(mapping) \
+ if (pages_to_process < 0) \
+ bit_mutex_unlock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
+
#endif /* _LINUX_PAGEMAP_H */
Index: linux-2.6.27-rc8-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/filemap.c 2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/filemap.c 2008-10-04 16:48:04.000000000 +0200
@@ -269,6 +269,7 @@ int wait_on_page_writeback_range(struct
int nr_pages;
int ret = 0;
pgoff_t index;
+ starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK));

if (end < start)
return 0;
@@ -288,6 +289,8 @@ int wait_on_page_writeback_range(struct
if (page->index > end)
continue;

+ starvation_protection_step(mapping);
+
wait_on_page_writeback(page);
if (PageError(page))
ret = -EIO;
@@ -296,6 +299,8 @@ int wait_on_page_writeback_range(struct
cond_resched();
}

+ starvation_protection_end(mapping);
+
/* Check for outstanding write errors */
if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
ret = -ENOSPC;
Index: linux-2.6.27-rc8-devel/mm/page-writeback.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/page-writeback.c 2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/page-writeback.c 2008-10-04 16:48:04.000000000 +0200
@@ -435,6 +435,14 @@ static void balance_dirty_pages(struct a

struct backing_dev_info *bdi = mapping->backing_dev_info;

+ if (unlikely(bit_mutex_is_locked(&mapping->flags, AS_STARVATION_BARRIER,
+ &mapping->starvation_barrier))) {
+ bit_mutex_lock(&mapping->flags, AS_STARVATION_BARRIER,
+ &mapping->starvation_barrier);
+ bit_mutex_unlock(&mapping->flags, AS_STARVATION_BARRIER,
+ &mapping->starvation_barrier);
+ }
+
for (;;) {
struct writeback_control wbc = {
.bdi = bdi,
@@ -876,6 +884,7 @@ int write_cache_pages(struct address_spa
pgoff_t end; /* Inclusive */
int scanned = 0;
int range_whole = 0;
+ starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE));

if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
@@ -902,7 +911,11 @@ retry:

scanned = 1;
for (i = 0; i < nr_pages; i++) {
- struct page *page = pvec.pages[i];
+ struct page *page;
+
+ starvation_protection_step(mapping);
+
+ page = pvec.pages[i];

/*
* At this point we hold neither mapping->tree_lock nor
@@ -963,6 +976,9 @@ retry:

if (wbc->range_cont)
wbc->range_start = index << PAGE_CACHE_SHIFT;
+
+ starvation_protection_end(mapping);
+
return ret;
}
EXPORT_SYMBOL(write_cache_pages);
Index: linux-2.6.27-rc8-devel/mm/swap_state.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/swap_state.c 2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/swap_state.c 2008-10-04 17:14:03.000000000 +0200
@@ -43,6 +43,7 @@ struct address_space swapper_space = {
.a_ops = &swap_aops,
.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
+ .starvation_barrier = __BIT_MUTEX_INITIALIZER(&swapper_space.flags, AS_STARVATION_BARRIER, swapper_space.starvation_barrier),
};

#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
Index: linux-2.6.27-rc8-devel/mm/truncate.c
===================================================================
--- linux-2.6.27-rc8-devel.orig/mm/truncate.c 2008-10-04 16:47:55.000000000 +0200
+++ linux-2.6.27-rc8-devel/mm/truncate.c 2008-10-04 16:48:04.000000000 +0200
@@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
int ret2 = 0;
int did_range_unmap = 0;
int wrapped = 0;
+ starvation_protection_head(mapping->nrpages);

pagevec_init(&pvec, 0);
next = start;
@@ -399,9 +400,13 @@ int invalidate_inode_pages2_range(struct
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
for (i = 0; i < pagevec_count(&pvec); i++) {
- struct page *page = pvec.pages[i];
+ struct page *page;
pgoff_t page_index;

+ starvation_protection_step(mapping);
+
+ page = pvec.pages[i];
+
lock_page(page);
if (page->mapping != mapping) {
unlock_page(page);
@@ -449,6 +454,9 @@ int invalidate_inode_pages2_range(struct
pagevec_release(&pvec);
cond_resched();
}
+
+ starvation_protection_end(mapping);
+
return ret;
}
EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

2008-10-05 22:17:46

by Mikulas Patocka

Subject: [PATCH 3/3] Fix fsync-vs-write misbehavior

Fix violation of sync()/fsync() semantics. Previous code walked up to
mapping->nrpages * 2 pages. Because pages could be created while
__filemap_fdatawrite_range was in progress, this could lead to
misbehavior. Example: there are two pages in the address space with indices
4, 5. Both are dirty. Someone calls __filemap_fdatawrite_range, it sets
.nr_to_write = 4. Meanwhile, some other process creates dirty pages 0,
1, 2, 3. __filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out
that it reached the limit and exits.

Result: pages that were dirty before __filemap_fdatawrite_range was
invoked were not written.

With starvation protection from the previous patch, this
mapping->nrpages * 2 logic is no longer needed.

Signed-off-by: Mikulas Patocka <[email protected]>

---
mm/filemap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.27-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 14:47:01.000000000 +0200
+++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 15:01:23.000000000 +0200
@@ -202,6 +202,11 @@ static int sync_page_killable(void *word
* opposed to a regular memory cleansing writeback. The difference between
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
+ *
+ * Because new dirty pages can be created while this is executing, the
+ * mapping->nrpages * 2 condition is unsafe. If we are doing a data integrity
+ * write, we must write all the pages. The AS_STARVATION_BARRIER bit will
+ * eventually prevent creating more dirty pages to avoid starvation.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
@@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = sync_mode == WB_SYNC_NONE ? mapping->nrpages * 2 : LONG_MAX,
.range_start = start,
.range_end = end,
};

2008-10-05 22:33:22

by Arjan van de Ven

Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:

> Fix starvation in memory management.
>
> The bug happens when one process is doing sequential buffered writes
> to a block device (or file) and another process is attempting to
> execute sync(), fsync() or direct-IO on that device (or file). This
> syncing process will wait indefinitely, until the first writing
> process finishes.
>
> For example, run these two commands:
> dd if=/dev/zero of=/dev/sda1 bs=65536 &
> dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
>
> The bug is caused by sequential walking of address space in
> write_cache_pages and wait_on_page_writeback_range: if some other
> process is constantly making dirty and writeback pages while these
> functions run, the functions will wait on every new page, resulting in
> indefinite wait.

are you sure?
isn't the right fix to just walk the file pages only once?

2008-10-05 23:03:44

by Mikulas Patocka

Subject: Re: [PATCH 2/3] Fix fsync livelock



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > Fix starvation in memory management.
> >
> > The bug happens when one process is doing sequential buffered writes
> > to a block device (or file) and another process is attempting to
> > execute sync(), fsync() or direct-IO on that device (or file). This
> > syncing process will wait indefinitely, until the first writing
> > process finishes.
> >
> > For example, run these two commands:
> > dd if=/dev/zero of=/dev/sda1 bs=65536 &
> > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
> >
> > The bug is caused by sequential walking of address space in
> > write_cache_pages and wait_on_page_writeback_range: if some other
> > process is constantly making dirty and writeback pages while these
> > functions run, the functions will wait on every new page, resulting in
> > indefinite wait.
>
> are you sure?
> isn't the right fix to just walk the file pages only once?

It walks the pages only once --- and waits on each of them. But because
new pages are constantly appearing under it, that "walk only once" becomes
an infinite loop (well, finite, until the whole disk is written).

Mikulas

2008-10-05 23:07:48

by Arjan van de Ven

Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:

> > are you sure?
> > isn't the right fix to just walk the file pages only once?
>
> It walks the pages only once --- and waits on each of them. But
> because new pages are constantly appearing under it, that "walk only
> once" becomes an infinite loop (well, finite, until the whole disk is
> written).

well. fsync() promises that everything that's dirty at the time of the
call will hit the disk. That is not something you can compromise.
The only way out would be to not allow new dirtying during an fsync()...
which is imo even worse.

Submit them all in one go, then wait, should not be TOO bad. Unless a
lot was dirty already, but then you go back to "but it has to go to
disk".


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-05 23:18:49

by Mikulas Patocka

Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > > are you sure?
> > > isn't the right fix to just walk the file pages only once?
> >
> > It walks the pages only once --- and waits on each of them. But
> > because new pages are constantly appearing under it, that "walk only
> > once" becomes an infinite loop (well, finite, until the whole disk is
> > written).
>
> well. fsync() promises that everything that's dirty at the time of the
> call will hit the disk. That is not something you can compromise.
> The only way out would be to not allow new dirtying during an fsync()...
> which is imo even worse.
>
> Submit them all in one go, then wait, should not be TOO bad. Unless a
> lot was dirty already, but then you go back to "but it has to go to
> disk".

The problem here is that you have two processes --- one is writing, the
other is simultaneously syncing. The syncing process can't distinguish the
pages that were created before fsync() was invoked (it has to wait on
them) and the pages that were created while fsync() was running (it
doesn't have to wait on them) --- so it waits on them all. The result is
a livelock: it waits indefinitely, because more and more pages are being
created.

The patch changes it so that if it waits long enough, it stops the other
writers creating dirty pages.

Or, how otherwise would you implement "Submit them all in one go, then
wait"? The current code is:
you grab page 0, see it is under writeback, wait on it
you grab page 1, see it is under writeback, wait on it
you grab page 2, see it is under writeback, wait on it
you grab page 3, see it is under writeback, wait on it
...
--- and the other process is just making more and more writeback pages
while your waiting routine runs. So the waiting is indefinite.

Mikulas

2008-10-05 23:29:15

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
>
>
> The problem here is that you have two processes --- one is writing,
> the other is simultaneously syncing. The syncing process can't
> distinguish the pages that were created before fsync() was invoked
> (it has to wait on them) and the pages that were created while
> fsync() was running (it doesn't have to wait on them) --- so it waits
> on them all.

for pages it already passed that's not true.
but yes while you walk, you may find new ones. tough luck.
>
> Or, how otherwise would you implement "Submit them all in one go,
> then wait"? The current code is:
> you grab page 0, see it is under writeback, wait on it
> you grab page 1, see it is under writeback, wait on it
> you grab page 2, see it is under writeback, wait on it
> you grab page 3, see it is under writeback, wait on it

it shouldn't be doing this. It should be "submit the lot"
"wait on the lot that got submitted".

keeping away new writers is just shifting the latency to an innocent
party (since the fsync() is just bloody expensive, the person doing it
should be paying the price, not the rest of the system), and a grave
mistake.

If the fsync() implementation isn't smart enough, sure, lets improve
it. But not by shifting latency around... lets make it more efficient
at submitting IO.
If we need to invent something like "chained IO" where if you wait on
the last of the chain, you wait on the entire chain, so be it.



2008-10-06 00:02:16

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> >
> >
> > The problem here is that you have two processes --- one is writing,
> > the other is simultaneously syncing. The syncing process can't
> > distinguish the pages that were created before fsync() was invoked
> > (it has to wait on them) and the pages that were created while
> > fsync() was running (it doesn't have to wait on them) --- so it waits
> > on them all.
>
> for pages it already passed that's not true.
> but yes while you walk, you may find new ones. tough luck.
> >
> > Or, how otherwise would you implement "Submit them all in one go,
> > then wait"? The current code is:
> > you grab page 0, see it is under writeback, wait on it
> > you grab page 1, see it is under writeback, wait on it
> > you grab page 2, see it is under writeback, wait on it
> > you grab page 3, see it is under writeback, wait on it
>
> it shouldn't be doing this. It should be "submit the lot"
> "wait on the lot that got submitted".

And how do you want to implement that "wait on the lot that got
submitted"? Note that you can't allocate an array or linked list that
holds pointers to all submitted pages. And if you add anything to the page
structure or radix tree, you have to serialize concurrent sync and fsync
calls, otherwise they'd fight over these bits.

> keeping away new writers is just shifting the latency to an innocent
> party (since the fsync() is just bloody expensive, the person doing it
> should be paying the price, not the rest of the system), and a grave
> mistake.

I assume that if very few people complained about the livelock till now,
very few people will see degraded write performance. My patch blocks the
writes only if the livelock happens, so if the livelock doesn't happen in
unpatched kernel for most people, the patch won't make it worse.

> If the fsync() implementation isn't smart enough, sure, lets improve
> it. But not by shifting latency around... lets make it more efficient
> at submitting IO.
> If we need to invent something like "chained IO" where if you wait on
> the last of the chain, you wait on the entire chain, so be it.

This looks madly complicated. And ineffective, because if some page was
submitted before fsync() was invoked, and is under writeback while fsync()
is called, fsync() still has to wait on it.

Mikulas

2008-10-06 00:30:52

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:

> I assume that if very few people complained about the livelock till
> now, very few people will see degraded write performance. My patch
> blocks the writes only if the livelock happens, so if the livelock
> doesn't happen in unpatched kernel for most people, the patch won't
> make it worse.

I object to calling this a livelock. It's not.
And yes, fsync is slow and lots of people are seeing that.
It's not helped by how ext3 is implemented (where fsync is effectively
equivalent of a sync for many cases).
But again, moving the latency to "innocent" parties is not acceptable.

>
> > If the fsync() implementation isn't smart enough, sure, lets improve
> > it. But not by shifting latency around... lets make it more
> > efficient at submitting IO.
> > If we need to invent something like "chained IO" where if you wait
> > on the last of the chain, you wait on the entire chain, so be it.
>
> This looks madly complicated. And ineffective, because if some page
> was submitted before fsync() was invoked, and is under writeback
> while fsync() is called, fsync() still has to wait on it.

so?
just make a chain per inode always...



2008-10-06 02:51:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, Oct 05, 2008 at 08:01:46PM -0400, Mikulas Patocka wrote:
> This looks madly complicated. And ineffective, because if some page was
> submitted before fsync() was invoked, and is under writeback while fsync()
> is called, fsync() still has to wait on it.

fsync() waiting on pre-issued writeback pages is the correct
behaviour.

IOW, if the page is under writeback at the time an fsync() is
issued (e.g. issued by pdflush), the page was *not clean* at the
time the fsync() was called and hence must be clean when fsync()
returns. fsync() needs to wait for all pages under I/O at the time
it is called, not just the dirty pages it issues I/O on.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-10-06 03:31:45

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock



On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > I assume that if very few people complained about the livelock till
> > now, very few people will see degraded write performance. My patch
> > blocks the writes only if the livelock happens, so if the livelock
> > doesn't happen in unpatched kernel for most people, the patch won't
> > make it worse.
>
> I object to calling this a livelock. It's not.

It unlocks itself when the whole disk is written, and that can take
several hours (or days, if you have a many-terabyte array). So formally it
is not a livelock; from the user's experience it is --- he sees an
unkillable process in 'D' state for many hours.

> And yes, fsync is slow and lots of people are seeing that.
> It's not helped by how ext3 is implemented (where fsync is effectively
> equivalent of a sync for many cases).
> But again, moving the latency to "innocent" parties is not acceptable.
>
> >
> > > If the fsync() implementation isn't smart enough, sure, lets improve
> > > it. But not by shifting latency around... lets make it more
> > > efficient at submitting IO.
> > > If we need to invent something like "chained IO" where if you wait
> > > on the last of the chain, you wait on the entire chain, so be it.
> >
> > This looks madly complicated. And ineffective, because if some page
> > was submitted before fsync() was invoked, and is under writeback
> > while fsync() is called, fsync() still has to wait on it.
>
> so?
> just make a chain per inode always...

The point is that many fsync()s may run in parallel and you have just one
inode and just one chain. And if you add two-word list_head to a page, to
link it on this list, many developers will hate it for increasing its
size.

See the work done by Nick Piggin somewhere in this thread. He uses just
one bit in the radix tree to mark pages to process. But he needs to
serialize all syncs on a given file, so they no longer run in parallel.

Mikulas

2008-10-06 04:20:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008 23:30:51 -0400 (EDT)
> The point is that many fsync()s may run in parallel and you have just
> one inode and just one chain. And if you add two-word list_head to a
> page, to link it on this list, many developers will hate it for
> increasing its size.

why to a page?
a list head in the inode and chain up the bios....
or not make an actual list but just a "is the previous one done" thing.
it's not all that hard to get something that works on a per-inode basis,
that gives "wait for all io up to this one".




2008-10-06 13:01:19

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun, 5 Oct 2008, Arjan van de Ven wrote:

> On Sun, 5 Oct 2008 23:30:51 -0400 (EDT)
> > The point is that many fsync()s may run in parallel and you have just
> > one inode and just one chain. And if you add two-word list_head to a
> > page, to link it on this list, many developers will hate it for
> > increasing its size.
>
> why to a page?
> a list head in the inode and chain up the bios....

And if you want to wait for a bio submitted by a different process?
There's no way you can find the bio from the page.

> or not make an actual list but just a "is the previous one done" thing
> it's not all that hard to get something that works on a per inode basis,
> that gives "wait for all io upto this one".

So code it :)

Mikulas

2008-10-06 13:50:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Mon, 6 Oct 2008 09:00:14 -0400 (EDT)
Mikulas Patocka <[email protected]> wrote:

> On Sun, 5 Oct 2008, Arjan van de Ven wrote:
>
> > On Sun, 5 Oct 2008 23:30:51 -0400 (EDT)
> > > The point is that many fsync()s may run in parallel and you have
> > > just one inode and just one chain. And if you add two-word
> > > list_head to a page, to link it on this list, many developers
> > > will hate it for increasing its size.
> >
> > why to a page?
> > a list head in the inode and chain up the bios....
>
> And if you want to wait for a bio submitted by a different process?
> There's no way you can find the bio from the page.

the point is that the kernel would always chain it to the inode,
independent of who submits it or when.



2008-10-06 20:45:52

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Mon, 6 Oct 2008, Arjan van de Ven wrote:

> On Mon, 6 Oct 2008 09:00:14 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
> >
> > > On Sun, 5 Oct 2008 23:30:51 -0400 (EDT)
> > > > The point is that many fsync()s may run in parallel and you have
> > > > just one inode and just one chain. And if you add two-word
> > > > list_head to a page, to link it on this list, many developers
> > > > will hate it for increasing its size.
> > >
> > > why to a page?
> > > a list head in the inode and chain up the bios....
> >
> > And if you want to wait for a bio submitted by a different process?
> > There's no way you can find the bio from the page.
>
> the point is that the kernel would always chain it to the inode,
> independent of who or when it is submitted

If you add a list to an inode, you need to protect it with a spinlock. So
you take one more spinlock for every write bio submitted --- a lot of
developers would hate it.

Another problem: how do you want to walk all dirty pages and submit bio
for them?

The act of allocating and submitting a bio can block (if you run out of
some mempool), and in that case it waits until some other bio finishes.
During this time, more dirty pages can be created.

Also, if you find a page that is both dirty and under writeback, you need
to wait until the writeback finishes and then initiate another writeback
(because the old writeback may be writing stale data). Again you block,
and more dirty pages can appear.

And if you block and more dirty pages appear, you are prone to the
livelock.

[ In Nick Piggin's patch, it is needed to lock the whole address space,
mark dirty pages in one non-blocking pass and write marked pages again in
a blocking pass --- so that if more dirty pages appear while bios are
submitted, the new pages will be skipped ]

Mikulas

2008-10-08 16:46:33

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2/3] Fix fsync livelock

On Sun 2008-10-05 17:30:19, Arjan van de Ven wrote:
> On Sun, 5 Oct 2008 20:01:46 -0400 (EDT)
> Mikulas Patocka <[email protected]> wrote:
>
> > I assume that if very few people complained about the livelock till
> > now, very few people will see degraded write performance. My patch
> > blocks the writes only if the livelock happens, so if the livelock
> > doesn't happen in unpatched kernel for most people, the patch won't
> > make it worse.
>
> I object to calling this a livelock. It's not.

Eight hours of a process in D state is a livelock. And we can do a
minimal fix here; this almost never happens in real life anyway.

Latency imposed on the writer should not be a problem...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-10-09 01:13:27

by Randy Dunlap

[permalink] [raw]
Subject: [PATCH] documentation: explain memory barriers

On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:

> This sequence is repeated three or four times and should be pulled out
> into a well-commented function. That comment should explain the logic
> behind the use of these barriers, please.

and on 2008-OCT-08 Ben Hutchings wrote:

> All memory barriers need a comment to explain why and what they're doing.


Should this be added to CodingStyle or some other file?




From: Randy Dunlap <[email protected]>

We want all uses of memory barriers to be explained in the source
code.

Signed-off-by: Randy Dunlap <[email protected]>
---
Documentation/SubmitChecklist | 3 +++
1 file changed, 3 insertions(+)

--- lin2627-rc9-docs.orig/Documentation/SubmitChecklist
+++ lin2627-rc9-docs/Documentation/SubmitChecklist
@@ -85,3 +85,6 @@ kernel patches.
23: Tested after it has been merged into the -mm patchset to make sure
that it still works with all of the other queued patches and various
changes in the VM, VFS, and other subsystems.
+
+24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+ source code that explains the logic of what they are doing and why.

2008-10-09 01:18:55

by Chris Snook

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

Randy Dunlap wrote:
> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
>
>> This sequence is repeated three or four times and should be pulled out
>> into a well-commented function. That comment should explain the logic
>> behind the use of these barriers, please.
>
> and on 2008-OCT-08 Ben Hutchings wrote:
>
>> All memory barriers need a comment to explain why and what they're doing.

Seriously? When a barrier is used, it's generally self-evident what
it's doing. The real problem is when a barrier is *not* used. It would
probably be more helpful to comment every non-barrier line of code to
explain why it's okay if memory operations are getting reordered.

-- Chris

2008-10-09 01:32:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <[email protected]> wrote:

> Randy Dunlap wrote:
> > On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> >
> >> This sequence is repeated three or four times and should be pulled out
> >> into a well-commented function. That comment should explain the logic
> >> behind the use of these barriers, please.
> >
> > and on 2008-OCT-08 Ben Hutchings wrote:
> >
> >> All memory barriers need a comment to explain why and what they're doing.

I approve this message.

> Seriously? When a barrier is used, it's generally self-evident what
> it's doing.

fs/buffer.c:sync_buffer(). Have fun.

2008-10-09 01:52:48

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:

> +
> +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
> + source code that explains the logic of what they are doing and why.

"what they are doing" will almost always be "flush value to RAM" or similar.
How about this instead:

+ 24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
+ source code that explains the race condition being prevented, and states
+ the location of the other code or hardware feature that races with this.
+
+ An example comment:
+
+ /*
+ * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
+ * card will get confused and wedge the hardware...
+ */
+ wmb();



2008-10-09 05:53:11

by Chris Snook

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

Andrew Morton wrote:
> On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <[email protected]> wrote:
>
>> Randy Dunlap wrote:
>>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
>>>
>>>> This sequence is repeated three or four times and should be pulled out
>>>> into a well-commented function. That comment should explain the logic
>>>> behind the use of these barriers, please.
>>> and on 2008-OCT-08 Ben Hutchings wrote:
>>>
>>>> All memory barriers need a comment to explain why and what they're doing.
>
> I approve this message.
>
>> Seriously? When a barrier is used, it's generally self-evident what
>> it's doing.
>
> fs/buffer.c:sync_buffer(). Have fun.

The real disaster there is the clear_buffer_##name macro and friends, as
evidenced by fs/ext2/inode.c:599

clear_buffer_new(bh_result); /* What's this do? */

I'm completely in favor of documenting everything that can potentially interact
with that train wreck, but I maintain that the vast majority of memory barriers
are self-evident.

-- Chris

2008-10-09 06:29:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Thursday 09 October 2008 16:51, Chris Snook wrote:
> Andrew Morton wrote:
> > On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <[email protected]> wrote:
> >> Randy Dunlap wrote:
> >>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> >>>> This sequence is repeated three or four times and should be pulled out
> >>>> into a well-commented function. That comment should explain the logic
> >>>> behind the use of these barriers, please.
> >>>
> >>> and on 2008-OCT-08 Ben Hutchings wrote:
> >>>> All memory barriers need a comment to explain why and what they're
> >>>> doing.
> >
> > I approve this message.
> >
> >> Seriously? When a barrier is used, it's generally self-evident what
> >> it's doing.
> >
> > fs/buffer.c:sync_buffer(). Have fun.
>
> The real disaster there is the clear_buffer_##name macro and friends, as
> evidenced by fs/ext2/inode.c:599
>
> clear_buffer_new(bh_result); /* What's this do? */

That's not a disaster. It is relatively easy, even if you have no
idea about any of that code, to a) work out what the BH_New flag does,
see where it gets set and where it gets cleared, and then work out
what that does. Actually, buffer.c used to leak BH_New in some
cases, but now it should be a bug if BH_New is found to be set
there (buffer.c should take any BH_New buffers, initialize them
appropriately, and clear BH_New; a dangling BH_New would indicate
uninitialized data going to or coming from the block).

No, they're easy, because you can find every single place where anyone
cares about them with a single simple grep.

Again, fs/buffer.c:sync_buffer(). Which stores and/or loads is that
barrier sequencing against which subsequent stores and/or loads? Why?

For another fun example, mm/filemap.c:sync_page. This actually has a
big comment, but (to me) it isn't even evident then which loads and
stores are being sequenced against which subsequent ones because it
is not explicitly documented. And I do have some experience in adding
barriers to existing mm code where they have been missed completely.

mempool_free, set_dumpable, freeze_bdev.



> I'm completely in favor of documenting everything that can potentially
> interact with that train wreck,

What's the train-wreck, again?


> but I maintain that the vast majority of
> memory barriers are self-evident.

They are self-evident if you have spent a lot of time getting the
state machine and locking/concurrency model in your mind. If you
have not, then how do you know there is not some memory operation
way back or forward in the instruction stream that is supposed to
be ordered by this barrier?

All memory barriers have to be documented, except acquire/release
for critical sections, IMO.

2008-10-09 06:36:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Thursday 09 October 2008 12:50, [email protected] wrote:
> On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:
> > +
> > +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
> > + source code that explains the logic of what they are doing and why.
>
> "what they are doing" will almost always be "flush value to RAM" or
> similar.

Memory barriers don't flush anything anywhere. It's simple: you must
explain which operations you are ordering against which.


> How about this instead:
>
> + 24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
> + source code that explains the race condition being prevented, and states
> + the location of the other code or hardware feature that races with this.
> +
> + An example comment:
> +
> + /*
> + * If we don't do a wmb() here, the RBFROBNIZ register on the XU293
> + * card will get confused and wedge the hardware...
> + */
> + wmb();

This comment is no good, because it doesn't tell you what the memory barrier
does. It tells you what might happen if you omit it.

/*
* If we don't do a wmb() here, the store to the RBFROBNIZ,
* above, might reach the device before the store X, below.
*
* If that happens, then the XU293 card will get confused
* and wedge the hardware...
*/
wmb();

If you don't comment like that, then how does the reader know that the wmb
is not *also* supposed to order the store with any other of the limitless
subsequent stores until the next memory ordering operation? Or any of the
previous stores since the last one?

2008-10-09 06:53:50

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Fri, 10 Oct 2008 04:35:47 +1100, Nick Piggin said:
> On Thursday 09 October 2008 12:50, [email protected] wrote:
> > On Wed, 08 Oct 2008 18:12:23 PDT, Randy Dunlap said:
> > > +
> > > +24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the
> > > + source code that explains the logic of what they are doing and why.
> >
> > "what they are doing" will almost always be "flush value to RAM" or
> > similar.
>
> Memory barriers don't flush anything anywhere.

That's what I get for commenting on stuff when I'm into a 40-hour week by
Wednesday. :)

I was speaking of the generic programmer who does stuff like:

x = 10; /* set x to 10 */

for "what they are doing". You know the type. ;)

"flush value to RAM", "force memory barrier operation", and I think I've seen
a few kzalloc()'s that have "allocate zero'ed memory" on them. "what they are
doing" is usually not worth writing down, but being verbose for the *why*
is almost always good, especially for things like memory barriers that
almost nobody can get their brains wrapped around (how many flame-fests per
year do we have about "volatile"? ;)

> /*
> * If we don't do a wmb() here, the store to the RBFROBNIZ,
> * above, might reach the device before the store X, below.
> *
> * If that happens, then the XU293 card will get confused
> * and wedge the hardware...
> */
> wmb();
>
> If you don't comment like that, then how does the reader know that the wmb
> is not *also* supposed to order the store with any other of the limitless
> subsequent stores until the next memory ordering operation? Or any of the
> previous stores since the last one?

Even better (as I missed the "also supposed to know" case). My general point
was that a concrete example would improve Randy's original patch by showing
what sort of things should be in the comment, and your correction pointed
out *why* a concrete example was needed. ;)



2008-10-09 09:58:35

by Ben Hutchings

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Thu, 2008-10-09 at 01:51 -0400, Chris Snook wrote:
> Andrew Morton wrote:
> > On Wed, 08 Oct 2008 21:17:58 -0400 Chris Snook <[email protected]> wrote:
> >
> >> Randy Dunlap wrote:
> >>> On Wed, 1 Oct 2008 22:54:04 -0700 Andrew Morton wrote:
> >>>
> >>>> This sequence is repeated three or four times and should be pulled out
> >>>> into a well-commented function. That comment should explain the logic
> >>>> behind the use of these barriers, please.
> >>> and on 2008-OCT-08 Ben Hutchings wrote:
> >>>
> >>>> All memory barriers need a comment to explain why and what they're doing.
> >
> > I approve this message.
> >
> >> Seriously? When a barrier is used, it's generally self-evident what
> >> it's doing.
> >
> > fs/buffer.c:sync_buffer(). Have fun.
>
> The real disaster there is the clear_buffer_##name macro and friends, as
> evidenced by fs/ext2/inode.c:599
>
> clear_buffer_new(bh_result); /* What's this do? */
>
> I'm completely in favor of documenting everything that can potentially interact
> with that train wreck, but I maintain that the vast majority of memory barriers
> are self-evident.

Acquire and release barriers attached to operations are usually self-
evident; standalone wmb() and rmb() much less so. It is helpful to be
explicit about exactly which memory operations need to be ordered, which
are often not the memory operations immediately preceding and following
it. "all" may have been a bit strong though.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

2008-10-09 10:28:17

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] documentation: explain memory barriers

On Thursday 09 October 2008 20:58, Ben Hutchings wrote:
> On Thu, 2008-10-09 at 01:51 -0400, Chris Snook wrote:

> > I'm completely in favor of documenting everything that can potentially
> > interact with that train wreck, but I maintain that the vast majority of
> > memory barriers are self-evident.
>
> Acquire and release barriers attached to operations are usually self-
> evident; standalone wmb() and rmb() much less so. It is helpful to be
> explicit about exactly which memory operations need to be ordered, which
> are often not the memory operations immediately preceding and following
> it. "all" may have been a bit strong though.

No, I don't think so. We should absolutely force "all". That allows nobody
to be lazy, no confusion, and reminds people that memory barriers are not
easy to follow for a new reader of the code, or necessarily even the author,
6 months later. If somebody is too lazy to write a comment, they can use
locks.

One last quick quiz, easier than the earlier ones...
mm/vmscan.c:__remove_mapping has a score of lines documenting exactly
what memory operations are being ordered, and even an example of what
happens if the ordering is not followed. This is a pretty good comment,
if I say so myself. However, it has one deficiency in that it doesn't
explicitly state where the write barrier(s) is (IMO the comments for one
part of an ordering protocol should reference the other parts of the
protocol).

Where are the store barriers, or why are they not required?

2008-10-11 12:06:38

by Nick Piggin

[permalink] [raw]
Subject: Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)

On Monday 06 October 2008 09:11, Mikulas Patocka wrote:
> Hi
>
> I removed the repeated code and create a new bit mutexes. They are
> space-efficient mutexes that consume only one bit. See the next 3 patches.

Pretty reasonable to have.


> If you are concerned about the size of an inode, I can convert other
> mutexes to bit mutexes: i_mutex and inotify_mutex.

I wouldn't worry for now. mutexes can be unlocked much faster than bit
mutexes, especially in the fastpath. And due to slab, it would be
unlikely to actually save any space.


> I could also create
> bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
> for address_space->tree_lock, address_space->i_mmap_lock,
> address_space->private_lock, inode->i_lock.

We have that already. It is much much faster to unlock spinlocks than
bit spinlocks in general (if you own the word exclusively, then it's
not, but then you would be less likely to save space), and we can also
do proper FIFO ticket locks with a larger word.


> Look at it and say what you think about the idea of condensing mutexes
> into single bits.

Looks pretty good to me.

2008-10-20 20:14:50

by Mikulas Patocka

[permalink] [raw]
Subject: Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)

> > If you are concerned about the size of an inode, I can convert other
> > mutexes to bit mutexes: i_mutex and inotify_mutex.
>
> I wouldn't worry for now. mutexes can be unlocked much faster than bit
> mutexes, especially in the fastpath. And due to slab, it would be
> unlikely to actually save any space.

Maybe inotify_mutex. You are right that i_mutex is so heavily contended
that slowing it down to save a few words wouldn't be good. Do you know of
any inotify-intensive workload?

> > I could also create
> > bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
> > for address_space->tree_lock, address_space->i_mmap_lock,
> > address_space->private_lock, inode->i_lock.
>
> We have that already. It is much much faster to unlock spinlocks than
> bit spinlocks in general (if you own the word exclusively, then it's
> not, but then you would be less likely to save space), and we can also
> do proper FIFO ticket locks with a larger word.

BTW, why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits?
Are atomic 32-bit instructions faster?

Can an x86(64) system have more than 256 CPUs?

Mikulas

2008-10-21 01:51:30

by Nick Piggin

[permalink] [raw]
Subject: Re: RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)

On Tuesday 21 October 2008 07:14, Mikulas Patocka wrote:
> > > If you are concerned about the size of an inode, I can convert other
> > > mutexes to bit mutexes: i_mutex and inotify_mutex.
> >
> > I wouldn't worry for now. mutexes can be unlocked much faster than bit
> > mutexes, especially in the fastpath. And due to slab, it would be
> > unlikely to actually save any space.
>
> Maybe inotify_mutex. You are right that i_mutex is so heavily contended
> that slowing it down to save a few words wouldn't be good. Do you know of
> any inotify-intensive workload?

Don't really know, no. I think most desktop environments use it to
some extent, but no idea how much.


> > > I could also create
> > > bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save
> > > space for address_space->tree_lock, address_space->i_mmap_lock,
> > > address_space->private_lock, inode->i_lock.
> >
> > We have that already. It is much much faster to unlock spinlocks than
> > bit spinlocks in general (if you own the word exclusively, then it's
> > not, but then you would be less likely to save space), and we can also
> > do proper FIFO ticket locks with a larger word.
>
> BTW, why do spinlocks on x86(64) have 32 bits and not 8 bits or 16 bits?
> Are atomic 32-bit instructions faster?

In the case of <= 256 CPUs, they could be an unsigned short I think.
Probably it has never been found to be a huge win because they are
often beside other ints or longs. I think I actually booted up the
kernel with 16-bit spinlocks when doing the FIFO locks, but never
sent a patch for it... Don't let me stop you from trying though.
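To illustrate why the lock word size bounds the CPU count: a FIFO ticket
lock packs a "next ticket" counter and a "now serving" counter into the lock
word, so a 16-bit lock leaves only 8 bits per half and thus at most 256
waiters. A hypothetical userspace sketch (not the kernel's implementation,
which lives in the arch spinlock code):

```c
#include <stdatomic.h>
#include <stdint.h>

/* 16-bit ticket lock: two 8-bit halves. With 8-bit tickets, at most
 * 256 CPUs can queue before the counter wraps into a live ticket. */
struct ticket_lock {
        _Atomic uint8_t next;   /* next ticket to hand out */
        _Atomic uint8_t owner;  /* ticket currently being served */
};

static inline void ticket_lock(struct ticket_lock *l)
{
        /* Take a ticket; fetch_add returns our number. */
        uint8_t me = atomic_fetch_add(&l->next, 1);
        /* Spin until our number is called -- waiters are served in
         * strict FIFO order, unlike a plain test-and-set lock. */
        while (atomic_load(&l->owner) != me)
                ; /* busy-wait */
}

static inline void ticket_unlock(struct ticket_lock *l)
{
        /* Call the next ticket; no RMW race since only the owner
         * increments this half. */
        atomic_fetch_add(&l->owner, 1);
}
```

Doubling the word to 32 bits gives 16-bit halves and room for 65536 CPUs,
which is the "larger word" trade-off mentioned earlier in the thread.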


> Can an x86(64) system have more than 256 CPUs?

Well, none that I know of actually exists. SGI is hoping to have
4096-CPU x86 systems, as far as I can tell.