2008-06-24 05:54:21

by Mikulas Patocka

[permalink] [raw]
Subject: [10 PATCHES] inline functions to avoid stack overflow

Hi

Here I'm sending 10 patches to inline various functions.

To give you some understanding of sparc64: every function there uses a big
stack frame (at least 192 bytes). 128 bytes are required by the architecture
(16 64-bit registers), 48 bytes are there due to a mistake by the Sparc64 ABI
designers (the calling function has to allocate 48 bytes for the called
function), and 16 bytes are some dubious padding.

So, on sparc64, a simple function that merely passes its arguments on to
another function still takes 192 bytes --- regardless of how simple the
function is. A tail call could be used instead, but tail calls are disabled
in the kernel when debugging is enabled (Makefile: ifdef CONFIG_FRAME_POINTER
KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls).

The stack trace below has 75 nested functions, which totals at least 14400
bytes --- and that kills the 16k stack space on sparc64. In the trace there
are many functions that do nothing but pass parameters on to another
function. In this series of patches, I found 10 such functions and turned
them into inlines, saving 1920 bytes. Waking a wait queue is especially bad:
it calls 8 nested functions, 7 of which do nothing, and I turned 5 of them
into inlines.

I'll post the 10 patches in follow-up emails.

Some of the patches fix cases that could instead be optimized with tail
calls. If you want to drop those patches, you should enable tail calls by
removing -fno-optimize-sibling-calls from the Makefile (which makes
debugging a bit harder).

This was the trace:

linux_sparc_syscall32
sys_read
vfs_read
do_sync_read
generic_file_aio_read
generic_file_direct_io
filemap_write_and_wait
filemap_fdatawrite
__filemap_fdatawrite_range
do_writepages
generic_writepages
write_cache_pages
__writepage
blkdev_writepage
block_write_full_page
__block_write_full_page
submit_bh
submit_bio
generic_make_request
dm_request
__split_bio
__map_bio
origin_map
start_copy
dm_kcopyd_copy
dispatch_job
wake
queue_work
__queue_work
__spin_unlock_irqrestore
sys_call_table
timer_interrupt
irq_exit
do_softirq
__do_softirq
run_timer_softirq
__spin_unlock_irq
sys_call_table
handler_irq
handler_fasteoi_irq
handle_irq_event
ide_intr
ide_dma_intr
task_end_request
ide_end_request
__ide_end_request
__blk_end_request
__end_that_request_first
req_bio_endio
bio_endio
clone_endio
dec_pending
bio_endio
clone_endio
dec_pending
bio_endio
clone_endio
dec_pending
bio_endio
end_bio_bh_io_sync
end_buffer_read_sync
__end_buffer_read_notouch
unlock_buffer
wake_up_bit
__wake_up_bit
__wake_up
__wake_up_common
wake_bio_function
autoremove_wake_function
default_wake_function
try_to_wake_up
task_rq_lock
__spin_lock
lock_acquire
__lock_acquire
*** crash ***


2008-06-24 05:55:29

by Mikulas Patocka

[permalink] [raw]
Subject: [1/10 PATCH] inline __queue_work

Make __queue_work inlined. It just takes a spinlock and calls another function.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/kernel/workqueue.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/workqueue.c 2008-06-24 07:28:15.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/workqueue.c 2008-06-24 07:37:08.000000000 +0200
@@ -141,8 +141,8 @@
}

/* Preempt must be disabled. */
-static void __queue_work(struct cpu_workqueue_struct *cwq,
- struct work_struct *work)
+static __always_inline void __queue_work(struct cpu_workqueue_struct *cwq,
+ struct work_struct *work)
{
unsigned long flags;

2008-06-24 05:56:29

by Mikulas Patocka

[permalink] [raw]
Subject: [2/10 PATCH] inline generic_writepages

Make generic_writepages inlinable in the current file.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/mm/page-writeback.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/mm/page-writeback.c 2008-06-24 07:28:14.000000000 +0200
+++ linux-2.6.26-rc7-devel/mm/page-writeback.c 2008-06-24 07:37:15.000000000 +0200
@@ -981,8 +981,8 @@
* This is a library function, which implements the writepages()
* address_space_operation.
*/
-int generic_writepages(struct address_space *mapping,
- struct writeback_control *wbc)
+__always_inline int generic_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
{
/* deal with chardevs and other special file */
if (!mapping->a_ops->writepage)

2008-06-24 05:57:18

by Mikulas Patocka

[permalink] [raw]
Subject: [3/10 PATCH] inline wake_up_bit

Inline wake_up_bit. The function just passes arguments around.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/wait.h 2008-06-24 07:28:13.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/wait.h 2008-06-24 07:37:20.000000000 +0200
@@ -147,11 +147,32 @@
void __wake_up_bit(wait_queue_head_t *, void *, int);
int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
-void wake_up_bit(void *, int);
int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
int out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);

+/**
+ * wake_up_bit - wake up a waiter on a bit
+ * @word: the word being waited on, a kernel virtual address
+ * @bit: the bit of the word being waited on
+ *
+ * There is a standard hashed waitqueue table for generic use. This
+ * is the part of the hashtable's accessor API that wakes up waiters
+ * on a bit. For instance, if one were to have waiters on a bitflag,
+ * one would call wake_up_bit() after clearing the bit.
+ *
+ * In order for this to function properly, as it uses waitqueue_active()
+ * internally, some kind of memory barrier must be done prior to calling
+ * this. Typically, this will be smp_mb__after_clear_bit(), but in some
+ * cases where bitflags are manipulated non-atomically under a lock, one
+ * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
+ * because spin_unlock() does not guarantee a memory barrier.
+ */
+static __always_inline void wake_up_bit(void *word, int bit)
+{
+ __wake_up_bit(bit_waitqueue(word, bit), word, bit);
+}
+
#define wake_up(x) __wake_up(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
Index: linux-2.6.26-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/wait.c 2008-06-24 07:28:14.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/wait.c 2008-06-24 07:37:20.000000000 +0200
@@ -219,29 +219,6 @@
}
EXPORT_SYMBOL(__wake_up_bit);

-/**
- * wake_up_bit - wake up a waiter on a bit
- * @word: the word being waited on, a kernel virtual address
- * @bit: the bit of the word being waited on
- *
- * There is a standard hashed waitqueue table for generic use. This
- * is the part of the hashtable's accessor API that wakes up waiters
- * on a bit. For instance, if one were to have waiters on a bitflag,
- * one would call wake_up_bit() after clearing the bit.
- *
- * In order for this to function properly, as it uses waitqueue_active()
- * internally, some kind of memory barrier must be done prior to calling
- * this. Typically, this will be smp_mb__after_clear_bit(), but in some
- * cases where bitflags are manipulated non-atomically under a lock, one
- * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
- * because spin_unlock() does not guarantee a memory barrier.
- */
-void wake_up_bit(void *word, int bit)
-{
- __wake_up_bit(bit_waitqueue(word, bit), word, bit);
-}
-EXPORT_SYMBOL(wake_up_bit);
-
wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
const int shift = BITS_PER_LONG == 32 ? 5 : 6;

2008-06-24 05:58:10

by Mikulas Patocka

[permalink] [raw]
Subject: [4/10 PATCH] inline __wake_up_bit

Make __wake_up_bit inline. This requires moving the task state definitions
from sched.h to wait.h.

This patch has the worst size-increase impact, increasing total kernel size
by 0.2%.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/wait.h 2008-06-24 07:37:20.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/wait.h 2008-06-24 07:37:25.000000000 +0200
@@ -25,6 +25,42 @@
#include <asm/system.h>
#include <asm/current.h>

+/*
+ * Task state bitmask. NOTE! These bits are also
+ * encoded in fs/proc/array.c: get_task_state().
+ *
+ * We have two separate sets of flags: task->state
+ * is about runnability, while task->exit_state are
+ * about the task exiting. Confusing, but this way
+ * modifying one set can't modify the other one by
+ * mistake.
+ */
+#define TASK_RUNNING 0
+#define TASK_INTERRUPTIBLE 1
+#define TASK_UNINTERRUPTIBLE 2
+#define __TASK_STOPPED 4
+#define __TASK_TRACED 8
+/* in tsk->exit_state */
+#define EXIT_ZOMBIE 16
+#define EXIT_DEAD 32
+/* in tsk->state again */
+#define TASK_DEAD 64
+#define TASK_WAKEKILL 128
+
+/* Convenience macros for the sake of set_task_state */
+#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
+#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
+#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
+
+/* Convenience macros for the sake of wake_up */
+#define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
+#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
+
+/* get_task_state() */
+#define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | \
+ TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
+ __TASK_TRACED)
+
typedef struct __wait_queue wait_queue_t;
typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int sync, void *key);
int default_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
@@ -144,13 +180,19 @@
void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
extern void __wake_up_locked(wait_queue_head_t *q, unsigned int mode);
extern void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
-void __wake_up_bit(wait_queue_head_t *, void *, int);
int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
int out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);

+static __always_inline void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
+{
+ struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+ if (waitqueue_active(wq))
+ __wake_up(wq, TASK_NORMAL, 1, &key);
+}
+
/**
* wake_up_bit - wake up a waiter on a bit
* @word: the word being waited on, a kernel virtual address
Index: linux-2.6.26-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/wait.c 2008-06-24 07:37:20.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/wait.c 2008-06-24 07:37:25.000000000 +0200
@@ -211,14 +211,6 @@
}
EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);

-void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
-{
- struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
- if (waitqueue_active(wq))
- __wake_up(wq, TASK_NORMAL, 1, &key);
-}
-EXPORT_SYMBOL(__wake_up_bit);
-
wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
const int shift = BITS_PER_LONG == 32 ? 5 : 6;
Index: linux-2.6.26-rc7-devel/include/linux/sched.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/sched.h 2008-06-24 07:28:12.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/sched.h 2008-06-24 07:37:26.000000000 +0200
@@ -160,42 +160,6 @@

extern unsigned long long time_sync_thresh;

-/*
- * Task state bitmask. NOTE! These bits are also
- * encoded in fs/proc/array.c: get_task_state().
- *
- * We have two separate sets of flags: task->state
- * is about runnability, while task->exit_state are
- * about the task exiting. Confusing, but this way
- * modifying one set can't modify the other one by
- * mistake.
- */
-#define TASK_RUNNING 0
-#define TASK_INTERRUPTIBLE 1
-#define TASK_UNINTERRUPTIBLE 2
-#define __TASK_STOPPED 4
-#define __TASK_TRACED 8
-/* in tsk->exit_state */
-#define EXIT_ZOMBIE 16
-#define EXIT_DEAD 32
-/* in tsk->state again */
-#define TASK_DEAD 64
-#define TASK_WAKEKILL 128
-
-/* Convenience macros for the sake of set_task_state */
-#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
-#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
-#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
-
-/* Convenience macros for the sake of wake_up */
-#define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
-#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
-
-/* get_task_state() */
-#define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | \
- TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
- __TASK_TRACED)
-
#define task_is_traced(task) ((task->state & __TASK_TRACED) != 0)
#define task_is_stopped(task) ((task->state & __TASK_STOPPED) != 0)
#define task_is_stopped_or_traced(task) \
@@ -2026,6 +1990,19 @@
return signal_pending(p) && __fatal_signal_pending(p);
}

+static __always_inline int signal_pending_state(long state, struct task_struct *p)
+{
+ if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
+ return 0;
+ if (!signal_pending(p))
+ return 0;
+
+ if (state & (__TASK_STOPPED | __TASK_TRACED))
+ return 0;
+
+ return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
+}
+
static inline int need_resched(void)
{
return unlikely(test_thread_flag(TIF_NEED_RESCHED));

2008-06-24 05:58:45

by Mikulas Patocka

[permalink] [raw]
Subject: [5/10 PATCH] inline __wake_up

Inline __wake_up and __wake_up_locked.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/wait.h 2008-06-24 07:37:25.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/wait.h 2008-06-24 07:37:31.000000000 +0200
@@ -177,8 +177,7 @@
list_del(&old->task_list);
}

-void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-extern void __wake_up_locked(wait_queue_head_t *q, unsigned int mode);
+void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync, void *key);
extern void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
@@ -186,6 +185,31 @@
int out_of_line_wait_on_bit_lock(void *, int, int (*)(void *), unsigned);
wait_queue_head_t *bit_waitqueue(void *, int);

+/**
+ * __wake_up - wake up threads blocked on a waitqueue.
+ * @q: the waitqueue
+ * @mode: which threads
+ * @nr_exclusive: how many wake-one or wake-many threads to wake up
+ * @key: is directly passed to the wakeup function
+ */
+static __always_inline void __wake_up(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, void *key)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __wake_up_common(q, mode, nr_exclusive, 0, key);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+/*
+ * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
+ */
+static __always_inline void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
+{
+ __wake_up_common(q, mode, 1, 0, NULL);
+}
+
static __always_inline void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
{
struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
Index: linux-2.6.26-rc7-devel/kernel/sched.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/sched.c 2008-06-24 07:28:11.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/sched.c 2008-06-24 07:37:31.000000000 +0200
@@ -4284,8 +4284,8 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
- int nr_exclusive, int sync, void *key)
+void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+ int nr_exclusive, int sync, void *key)
{
wait_queue_t *curr, *next;

@@ -4297,32 +4297,7 @@
break;
}
}
-
-/**
- * __wake_up - wake up threads blocked on a waitqueue.
- * @q: the waitqueue
- * @mode: which threads
- * @nr_exclusive: how many wake-one or wake-many threads to wake up
- * @key: is directly passed to the wakeup function
- */
-void __wake_up(wait_queue_head_t *q, unsigned int mode,
- int nr_exclusive, void *key)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive, 0, key);
- spin_unlock_irqrestore(&q->lock, flags);
-}
-EXPORT_SYMBOL(__wake_up);
-
-/*
- * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
- */
-void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
-{
- __wake_up_common(q, mode, 1, 0, NULL);
-}
+EXPORT_SYMBOL(__wake_up_common);

/**
* __wake_up_sync - wake up threads blocked on a waitqueue.

2008-06-24 05:59:24

by Mikulas Patocka

[permalink] [raw]
Subject: [6/10 PATCH] inline default_wake_function

Make the autoremove_wake_function -> default_wake_function call inlined.

default_wake_function cannot be made a static inline in a header, because
its address is taken in several places.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/kernel/sched.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/sched.c 2008-06-24 07:37:31.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/sched.c 2008-06-24 07:37:36.000000000 +0200
@@ -4268,13 +4268,23 @@

#endif /* CONFIG_PREEMPT */

-int default_wake_function(wait_queue_t *curr, unsigned mode, int sync,
- void *key)
+__always_inline int default_wake_function(wait_queue_t *curr, unsigned mode,
+ int sync, void *key)
{
return try_to_wake_up(curr->private, mode, sync);
}
EXPORT_SYMBOL(default_wake_function);

+int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+ int ret = default_wake_function(wait, mode, sync, key);
+
+ if (ret)
+ list_del_init(&wait->task_list);
+ return ret;
+}
+EXPORT_SYMBOL(autoremove_wake_function);
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
Index: linux-2.6.26-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/wait.c 2008-06-24 07:37:25.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/wait.c 2008-06-24 07:37:36.000000000 +0200
@@ -127,16 +127,6 @@
}
EXPORT_SYMBOL(finish_wait);

-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
-{
- int ret = default_wake_function(wait, mode, sync, key);
-
- if (ret)
- list_del_init(&wait->task_list);
- return ret;
-}
-EXPORT_SYMBOL(autoremove_wake_function);
-
int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
struct wait_bit_key *key = arg;
Index: linux-2.6.26-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/wait.h 2008-06-24 07:37:31.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/wait.h 2008-06-24 07:37:36.000000000 +0200
@@ -63,7 +63,7 @@

typedef struct __wait_queue wait_queue_t;
typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int sync, void *key);
-int default_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
+__always_inline int default_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);

struct __wait_queue {
unsigned int flags;

2008-06-24 05:59:59

by Mikulas Patocka

[permalink] [raw]
Subject: [7/10 PATCH] inline autoremove_wake_function

Make the wake_bit_function -> autoremove_wake_function call inlined.

autoremove_wake_function cannot be made a static inline in a header, because
its address is taken in several places. So use the "inline" keyword, which
makes it both exported and inlinable in the current file.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/kernel/sched.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/sched.c 2008-06-24 07:37:36.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/sched.c 2008-06-24 07:37:42.000000000 +0200
@@ -4275,7 +4275,7 @@
}
EXPORT_SYMBOL(default_wake_function);

-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+__always_inline int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
int ret = default_wake_function(wait, mode, sync, key);

@@ -4285,6 +4285,21 @@
}
EXPORT_SYMBOL(autoremove_wake_function);

+int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
+{
+ struct wait_bit_key *key = arg;
+ struct wait_bit_queue *wait_bit
+ = container_of(wait, struct wait_bit_queue, wait);
+
+ if (wait_bit->key.flags != key->flags ||
+ wait_bit->key.bit_nr != key->bit_nr ||
+ test_bit(key->bit_nr, key->flags))
+ return 0;
+ else
+ return autoremove_wake_function(wait, mode, sync, key);
+}
+EXPORT_SYMBOL(wake_bit_function);
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
Index: linux-2.6.26-rc7-devel/kernel/wait.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/kernel/wait.c 2008-06-24 07:37:36.000000000 +0200
+++ linux-2.6.26-rc7-devel/kernel/wait.c 2008-06-24 07:37:42.000000000 +0200
@@ -127,21 +127,6 @@
}
EXPORT_SYMBOL(finish_wait);

-int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
-{
- struct wait_bit_key *key = arg;
- struct wait_bit_queue *wait_bit
- = container_of(wait, struct wait_bit_queue, wait);
-
- if (wait_bit->key.flags != key->flags ||
- wait_bit->key.bit_nr != key->bit_nr ||
- test_bit(key->bit_nr, key->flags))
- return 0;
- else
- return autoremove_wake_function(wait, mode, sync, key);
-}
-EXPORT_SYMBOL(wake_bit_function);
-
/*
* To allow interruptible waiting and asynchronous (i.e. nonblocking)
* waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
Index: linux-2.6.26-rc7-devel/include/linux/wait.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/wait.h 2008-06-24 07:37:36.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/wait.h 2008-06-24 07:37:42.000000000 +0200
@@ -527,7 +527,7 @@
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
+__always_inline int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);

#define DEFINE_WAIT(name) \

2008-06-24 06:01:19

by Mikulas Patocka

[permalink] [raw]
Subject: [8/10 PATCH] inline filemap_fdatawrite

Make the filemap_fdatawrite and filemap_flush calls inlined.

This requires moving the WB_SYNC_* definitions from writeback.h to fs.h.
writeback.h requires fs.h, so this creates no problem.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/include/linux/fs.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/fs.h 2008-06-24 07:28:06.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/fs.h 2008-06-24 07:37:48.000000000 +0200
@@ -1731,8 +1731,6 @@
extern int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end);
extern int write_inode_now(struct inode *, int);
-extern int filemap_fdatawrite(struct address_space *);
-extern int filemap_flush(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
@@ -1742,6 +1740,38 @@
extern int __filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end, int sync_mode);

+/*
+ * fs/fs-writeback.c
+ */
+enum writeback_sync_modes {
+ WB_SYNC_NONE, /* Don't wait on anything */
+ WB_SYNC_ALL, /* Wait on every mapping */
+ WB_SYNC_HOLD, /* Hold the inode on sb_dirty for sys_sync() */
+};
+
+static __always_inline int __filemap_fdatawrite(struct address_space *mapping,
+ int sync_mode)
+{
+ return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+}
+
+static __always_inline int filemap_fdatawrite(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
+}
+
+/**
+ * filemap_flush - mostly a non-blocking flush
+ * @mapping: target address_space
+ *
+ * This is a mostly non-blocking flush. Not suitable for data-integrity
+ * purposes - I/O may not be started against all dirty pages.
+ */
+static __always_inline int filemap_flush(struct address_space *mapping)
+{
+ return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
+}
+
extern long do_fsync(struct file *file, int datasync);
extern void sync_supers(void);
extern void sync_filesystems(int wait);
@@ -2000,7 +2030,10 @@
extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
extern void simple_release_fs(struct vfsmount **mount, int *count);

-extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
+extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
+ loff_t *ppos, const void *from, size_t available);
+extern ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos,
+ const void *from, size_t available);

#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
Index: linux-2.6.26-rc7-devel/include/linux/writeback.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/writeback.h 2008-06-24 07:28:07.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/writeback.h 2008-06-24 07:37:49.000000000 +0200
@@ -25,15 +25,6 @@
#define current_is_pdflush() task_is_pdflush(current)

/*
- * fs/fs-writeback.c
- */
-enum writeback_sync_modes {
- WB_SYNC_NONE, /* Don't wait on anything */
- WB_SYNC_ALL, /* Wait on every mapping */
- WB_SYNC_HOLD, /* Hold the inode on sb_dirty for sys_sync() */
-};
-
-/*
* A control structure which tells the writeback code what to do. These are
* always on the stack, and hence need no locking. They are always initialised
* in a manner such that unspecified fields are set to zero.
Index: linux-2.6.26-rc7-devel/mm/filemap.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/mm/filemap.c 2008-06-24 07:28:07.000000000 +0200
+++ linux-2.6.26-rc7-devel/mm/filemap.c 2008-06-24 07:37:49.000000000 +0200
@@ -223,39 +223,15 @@
ret = do_writepages(mapping, &wbc);
return ret;
}
+EXPORT_SYMBOL(__filemap_fdatawrite_range);

-static inline int __filemap_fdatawrite(struct address_space *mapping,
- int sync_mode)
-{
- return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
-}
-
-int filemap_fdatawrite(struct address_space *mapping)
-{
- return __filemap_fdatawrite(mapping, WB_SYNC_ALL);
-}
-EXPORT_SYMBOL(filemap_fdatawrite);
-
-static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+static __always_inline int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end)
{
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}

/**
- * filemap_flush - mostly a non-blocking flush
- * @mapping: target address_space
- *
- * This is a mostly non-blocking flush. Not suitable for data-integrity
- * purposes - I/O may not be started against all dirty pages.
- */
-int filemap_flush(struct address_space *mapping)
-{
- return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
-}
-EXPORT_SYMBOL(filemap_flush);
-
-/**
* wait_on_page_writeback_range - wait for writeback to complete
* @mapping: target address_space
* @start: beginning page index

2008-06-24 06:01:59

by Mikulas Patocka

[permalink] [raw]
Subject: [9/10 PATCH] inline wake in dm-kcopyd


Inline wake. It just passes arguments around.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/drivers/md/dm-kcopyd.c 2008-06-24 07:28:06.000000000 +0200
+++ linux-2.6.26-rc7-devel/drivers/md/dm-kcopyd.c 2008-06-24 07:37:54.000000000 +0200
@@ -61,7 +61,7 @@
struct list_head pages_jobs;
};

-static void wake(struct dm_kcopyd_client *kc)
+static __always_inline void wake(struct dm_kcopyd_client *kc)
{
queue_work(kc->kcopyd_wq, &kc->kcopyd_work);
}

2008-06-24 06:03:27

by Mikulas Patocka

[permalink] [raw]
Subject: [10/10 PATCH] inline dispatch_job

Inline dispatch_job. It is called from two places, but it is quite small.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/drivers/md/dm-kcopyd.c
===================================================================
--- linux-2.6.26-rc7-devel.orig/drivers/md/dm-kcopyd.c 2008-06-24 07:37:54.000000000 +0200
+++ linux-2.6.26-rc7-devel/drivers/md/dm-kcopyd.c 2008-06-24 07:37:59.000000000 +0200
@@ -433,7 +433,7 @@
* to do the copy, otherwise the io has to be split up into many
* jobs.
*/
-static void dispatch_job(struct kcopyd_job *job)
+static __always_inline void dispatch_job(struct kcopyd_job *job)
{
struct dm_kcopyd_client *kc = job->kc;
atomic_inc(&kc->nr_jobs);

2008-06-24 06:07:00

by Mikulas Patocka

[permalink] [raw]
Subject: [PATCH] limit irq nesting

Another potential problem (found during code review) that could cause stack
overflow is unbounded irq nesting. Linux doesn't place any limit on the
number of nested irq handlers, so there may be as many handlers on a stack
as there are registered hardware interrupts --- enough to cause a crash.

This patch limits interrupt nesting to at most 2 levels.

--

IRQs without IRQF_DISABLED can nest to an arbitrary level.

At worst this means having as many IRQ handler stack frames as there are
interrupts registered --- enough to cause a stack overflow.

This patch limits nesting to at most two handlers on the stack.

Signed-off-by: Mikulas Patocka <[email protected]>

Index: linux-2.6.26-rc7-devel/include/linux/interrupt.h
===================================================================
--- linux-2.6.26-rc7-devel.orig/include/linux/interrupt.h 2008-06-23 17:47:16.000000000 +0200
+++ linux-2.6.26-rc7-devel/include/linux/interrupt.h 2008-06-23 18:02:40.000000000 +0200
@@ -16,6 +16,11 @@
#include <asm/system.h>

/*
+ * Max number of interrupt handlers on a stack. To prevent stack overflow.
+ */
+#define MAX_NESTED_INTERRUPTS 2
+
+/*
* These correspond to the IORESOURCE_IRQ_* defines in
* linux/ioport.h to select the interrupt line behaviour. When
* requesting an interrupt without specifying a IRQF_TRIGGER, the
@@ -95,7 +100,7 @@
#ifdef CONFIG_LOCKDEP
# define local_irq_enable_in_hardirq() do { } while (0)
#else
-# define local_irq_enable_in_hardirq() local_irq_enable()
+# define local_irq_enable_in_hardirq() do { if (hardirq_count() < (MAX_NESTED_INTERRUPTS << HARDIRQ_SHIFT)) local_irq_enable(); } while (0)
#endif

extern void disable_irq_nosync(unsigned int irq);

2008-06-24 07:02:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow


* Mikulas Patocka <[email protected]> wrote:

> Hi
>
> Here I'm sending 10 patches to inline various functions.

( sidenote: the patches are seriously whitespace damaged. Please see
Documentation/email-clients.txt about how to send patches. )

NAK on this whole current line of approach. One problem is that it
affects a lot more than just sparc64:

> This patch has the worst size-increase impact, increasing total kernel
> size by 0.2%.

[...]

> To give you some understanding of sparc64, every function there uses a
> big stack frame (at least 192 bytes). 128 bytes are required by the
> architecture (16 64-bit registers), 48 bytes are there due to a mistake
> of the Sparc64 ABI designers (the calling function has to allocate 48
> bytes for the called function), and 16 bytes are some dubious padding.
>
> So, on sparc64, if you have a simple function that just passes arguments
> to another function, it still takes 192 bytes --- regardless of how
> simple the function is. Tail-calls could be used, but they are disabled
> in the kernel if debugging is enabled (Makefile: ifdef CONFIG_FRAME_POINTER
> KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls).
>
> The stack trace has 75 nested functions, which totals at least 14400
> bytes --- and that kills the 16k stack space on sparc. In the stack
> trace, there are many functions which do nothing but pass parameters to
> another function. In this series of patches, I found 10 such functions
> and turned them into inlines, saving 1920 bytes. Waking a wait queue is
> especially bad: it calls 8 nested functions, 7 of which do nothing. I
> turned 5 of them into inlines.

please solve this sparc64 problem without hurting other architectures.

also, the trace looks suspect:

> This was the trace:
>
> linux_sparc_syscall32
> sys_read
> vfs_read
> do_sync_read
> generic_file_aio_read
> generic_file_direct_io
> filemap_write_and_wait
> filemap_fdatawrite
> __filemap_fdatawrite_range
> do_writepages
> generic_writepages
> write_cache_pages
> __writepage
> blkdev_writepage
> block_write_full_page
> __block_write_full_page
> submit_bh
> submit_bio
> generic_make_request
> dm_request
> __split_bio
> __map_bio
> origin_map
> start_copy
> dm_kcopyd_copy
> dispatch_job
> wake
> queue_work
> __queue_work
> __spin_unlock_irqrestore
> sys_call_table
> timer_interrupt
> irq_exit
> do_softirq
> __do_softirq
> run_timer_softirq
> __spin_unlock_irq
> sys_call_table
> handler_irq
> handler_fasteoi_irq
> handle_irq_event
> ide_intr
> ide_dma_intr
> task_end_request
> ide_end_request
> __ide_end_request
> __blk_end_request
> __end_that_request_first
> req_bio_endio
> bio_endio
> clone_endio
> dec_pending
> bio_endio
> clone_endio
> dec_pending
> bio_endio
> clone_endio
> dec_pending
> bio_endio
> end_bio_bh_io_sync
> end_buffer_read_sync
> __end_buffer_read_notouch
> unlock_buffer
> wake_up_bit
> __wake_up_bit
> __wake_up
> __wake_up_common
> wake_bio_function
> autoremove_wake_function
> default_wake_function
> try_to_wake_up
> task_rq_lock
> __spin_lock
> lock_acquire
> __lock_acquire

if function frames are so large, why are there no separate IRQ stacks on
Sparc64? IRQ stacks can drastically lower the worst-case stack footprint,
and the change would affect only sparc64.

Also, the stack trace above seems to be imprecise (for example, sys_read
cannot nest inside an irq context - so it does not show 75 function
frames), and there are no stack frame size annotations that could tell us
exactly where the stack overhead comes from.

( Please Cc: me on future iterations of this patchset - as long as it
still has generic impact. Thanks! )

Ingo

2008-06-25 12:53:43

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

On Wed, 25 Jun 2008, Helge Hafting wrote:

> Mikulas Patocka wrote:
>> Hi
>>
>> Here I'm sending 10 patches to inline various functions.
>>
>> To give you some understanding of sparc64, every function there uses big
>> stack frame (at least 192 bytes). 128 bytes are required by architecture
>> (16 64-bit registers), 48 bytes are there due to mistake of Sparc64 ABI
>> designers (calling function has to allocate 48 bytes for called function)
>> and 16 bytes are some dubious padding.
> I guess there is no way around those 128 bytes required by hardware - but
> is it necessary to allocate the 48 bytes of "mistake" and the dubious
> padding?
>
> Linux kernel functions don't call any outside functions, and are not
> called from outside either. So violating the ABI should be ok - there is
> nothing else to be compatible with. Similar to how x86 experimented with
> -mregparm to get a little more performance from changing the kernel
> calling convention.
> Helge Hafting

So, ask the gcc developers to create a kernel-specific ABI with only a
128-byte stack frame.

BTW, could some gcc developer explain the reason for the additional 16
bytes on the stack on sparc64? The 64-bit ABI mandates 176 bytes, but gcc
allocates 192 bytes.

Even worse, gcc doesn't use these additional bytes. If you try this:

extern void f(int *i);
void g()
{
	int a;
	f(&a);
}

it allocates an additional 16 bytes for the variable "a" (so there's a
total of 208 bytes), even though it could place the variable into the
48-byte ABI-mandated area that it inherited from the caller, or into its
own 16-byte padding that it made when calling "f".

Mikulas

2008-06-25 14:17:47

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Tuesday 24 June 2008 07:57, Mikulas Patocka wrote:
> Inline wake_up_bit. The function just passes arguments around.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> -void wake_up_bit(void *, int);
> int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);

> +static __always_inline void wake_up_bit(void *word, int bit)
> +{
> + __wake_up_bit(bit_waitqueue(word, bit), word, bit);
> +}

So every call to wake_up_bit(word, bit) is now converted to:

__wake_up_bit(bit_waitqueue(word, bit), word, bit);

which is in turn converted to (looking into your next patch):

{
	wait_queue_head_t *wq = bit_waitqueue(word, bit);
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
	if (waitqueue_active(wq))
		__wake_up(wq, TASK_NORMAL, 1, &key);
}

which is in turn converted to (looking into your other patch):

{
	wait_queue_head_t *wq = bit_waitqueue(word, bit);
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
	if (waitqueue_active(wq)) {
		unsigned long flags;
		spin_lock_irqsave(&wq->lock, flags);
		__wake_up_common(wq, TASK_NORMAL, 1, 0, &key);
		spin_unlock_irqrestore(&wq->lock, flags);
	}
}

And you know what? This is likely not the end yet! It's possible
spin_lock_irqXXX, __wake_up_common, waitqueue_active or bit_waitqueue
are inlines - I didn't check.
--
vda

2008-06-25 14:37:39

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Wed, 25 Jun 2008, Denys Vlasenko wrote:

> On Tuesday 24 June 2008 07:57, Mikulas Patocka wrote:
>> Inline wake_up_bit. The function just passes arguments around.
>>
>> Signed-off-by: Mikulas Patocka <[email protected]>
>>
>> int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
>> -void wake_up_bit(void *, int);
>> int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
>
>> +static __always_inline void wake_up_bit(void *word, int bit)
>> +{
>> + __wake_up_bit(bit_waitqueue(word, bit), word, bit);
>> +}
>
> So every call to wake_up_bit(word, bit) is now converted to:
>
> __wake_up_bit(bit_waitqueue(word, bit), word, bit);
>
> which is in turn converted to (looking into your next patch):
>
> {
> wait_queue_head_t *wq = bit_waitqueue(word, bit);
> struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> if (waitqueue_active(wq))
> __wake_up(wq, TASK_NORMAL, 1, &key);
> }
>
> which is in turn converted to (looking into your other patch):
>
> {
> wait_queue_head_t *wq = bit_waitqueue(word, bit);
> struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> if (waitqueue_active(wq))
> {
> unsigned long flags;
> spin_lock_irqsave(&wq->lock, flags);
> __wake_up_common(wq, TASK_NORMAL, 1, 0, &key);
> spin_unlock_irqrestore(&wq->lock, flags);
> }
> }
>
> And you know what? This is likely not the end yet! It's possible
> spin_lock_irqXXX, __wake_up_common, waitqueue_active or bit_waitqueue
> are inlines - I didn't check.
> --
> vda

Yes, that's a 0.2% code size increase (or no increase, if you drop
inline-__wake_up_bit.patch and apply only the other patches). To me it
seems crazy how this code was refactored again and again over time, into
up to 8 levels of functions (including passing a pointer to a method). In
the 2.0.x kernel series, it was just a single call to wake up a queue.

Mikulas

2008-06-25 15:25:14

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Wednesday 25 June 2008 16:36, Mikulas Patocka wrote:
> > On Tuesday 24 June 2008 07:57, Mikulas Patocka wrote:
> >> Inline wake_up_bit. The function just passes arguments around.
> >>
> >> Signed-off-by: Mikulas Patocka <[email protected]>
> >>
> >> int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> >> -void wake_up_bit(void *, int);
> >> int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
> >
> >> +static __always_inline void wake_up_bit(void *word, int bit)
> >> +{
> >> + __wake_up_bit(bit_waitqueue(word, bit), word, bit);
> >> +}
> >
> > So every call to wake_up_bit(word, bit) is now converted to:
> >
> > __wake_up_bit(bit_waitqueue(word, bit), word, bit);
> >
> > which is in turn converted to (looking into your next patch):
> >
> > {
> > wait_queue_head_t *wq = bit_waitqueue(word, bit);
> > struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > if (waitqueue_active(wq))
> > __wake_up(wq, TASK_NORMAL, 1, &key);
> > }
> >
> > which is in turn converted to (looking into your other patch):
> >
> > {
> > wait_queue_head_t *wq = bit_waitqueue(word, bit);
> > struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > if (waitqueue_active(wq))
> > {
> > unsigned long flags;
> > spin_lock_irqsave(&wq->lock, flags);
> > __wake_up_common(wq, TASK_NORMAL, 1, 0, &key);
> > spin_unlock_irqrestore(&wq->lock, flags);
> > }
> > }
> >
> > And you know what? This is likely not the end yet! It's possible
> > spin_lock_irqXXX, __wake_up_common, waitqueue_active or bit_waitqueue
> > are inlines - I didn't check.
> > --
> > vda
>
> Yes, that's 0.2% code size increase

...In just 17 callsites in the entire kernel.

> (or none increase, if drop
> inline-__wake_up_bit.patch and apply only the other patches).

Now this is a better approach - to actually see how many
callsites there are, and inline only where it makes sense.
But in practice that's hard to do, and it also changes all the time
during development. What is optimal today won't be optimal in
2.6.45 :)

Ingo's suggestion to talk to the gcc people to remedy the
insane call convention sounds like a more workable solution.

BTW, i386 uses the regparm call convention; is a similar trick
possible for sparc64?

> To me it
> seems crazy, how this code was refactored again and again over time, up to
> 8 levels of functions (including passing a pointer to a method). In 2.0.x
> kernel series, it was just a single call to wake up a queue.

Yes, probably... If you can simplify it, everyone will be glad.
--
vda

2008-06-25 16:01:51

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

>>> And you know what? This is likely not the end yet! It's possible
>>> spin_lock_irqXXX, __wake_up_common, waitqueue_active or bit_waitqueue
>>> are inlines - I didn't check.
>>> --
>>> vda
>>
>> Yes, that's 0.2% code size increase
>
> ...In just 17 callsites in entire kernel.
>
>> (or none increase, if drop
>> inline-__wake_up_bit.patch and apply only the other patches).
>
> Now this is a better approach - to actually see how many
> callsites there are, and inline only where it makes sense.
> But in practice that's hard to do, and it also changes all the time
> during development. What is optimal today won't be optimal in
> 2.6.45 :)
>
> Ingo's suggestion to talk to the gcc people to remedy the
> insane call convention sounds like a more workable solution.
>
> BTW, i386 uses the regparm call convention; is a similar trick
> possible for sparc64?

Sparc64 has register windows: it passes arguments in registers, but it
must allocate stack space for those registers. If the call stack gets too
deep (more than 8 levels), the CPU runs out of register windows and starts
spilling the registers of the function 8 levels deep to the stack.

The stack usage could be reduced to 176 bytes with little work from the
gcc developers, and to 128 bytes with more work (an ABI change). If you
wanted to go below 128 bytes, you could use one register to indicate the
number of used registers and modify the spill/fill handlers to load only
that many registers, reducing the stack usage even more --- but that would
be a big code change in both gcc and linux.

Mikulas

>> To me it
>> seems crazy, how this code was refactored again and again over time, up to
>> 8 levels of functions (including passing a pointer to a method). In 2.0.x
>> kernel series, it was just a single call to wake up a queue.
>
> Yes, probably... If you can simplify it, everyone will be glad.
> --
> vda
>

2008-06-25 20:38:12

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Wednesday 25 June 2008 18:01, Mikulas Patocka wrote:
> > Ingo's suggestion to talk to the gcc people to remedy the
> > insane call convention sounds like a more workable solution.
> >
> > BTW, i386 uses the regparm call convention; is a similar trick
> > possible for sparc64?
>
> Sparc64 has register windows: it passes arguments in registers, but it
> must allocate stack space for those registers. If the call stack gets too
> deep (more than 8 levels), the CPU runs out of register windows and starts
> spilling the registers of the function 8 levels deep to the stack.
>
> The stack usage could be reduced to 176 bytes with little work from the
> gcc developers, and to 128 bytes with more work (an ABI change). If you wanted to

Wow, that's nearly a 2x reduction.

An ABI change is not a problem for the kernel, since it is a "freestanding
application". Exactly like i386 switched to regparm, which is a different ABI.

> go below 128 bytes, you could use one register to indicate number of used
> registers and modify the spill/fill handlers to load only that number of
> registers and reduce the stack usage even more --- that would be a big
> code change in both gcc and linux.
--
vda

2008-06-25 22:09:42

by David Miller

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

From: Mikulas Patocka <[email protected]>
Date: Wed, 25 Jun 2008 08:53:10 -0400 (EDT)

> Even worse, gcc doesn't use these additional bytes. If you try this:
>
> extern void f(int *i);
> void g()
> {
> 	int a;
> 	f(&a);
> }
>
> it allocates an additional 16 bytes for the variable "a" (so there's a
> total of 208 bytes), even though it could place the variable into the
> 48-byte ABI-mandated area that it inherited from the caller, or into its
> own 16-byte padding that it made when calling "f".

The extra 16 bytes of space allocated is so that GCC can perform a
secondary reload of a quad floating point value. It always has to be
present, because we can't satisfy a secondary reload by emitting yet
another reload; that's the end of the levels of recursion allowed by the
reload pass.

GCC could be smart and eliminate that slot when it's not used, but
such a thing is not implemented yet.

It would also require quite a bit of new code to determine cases
like you mention above, where the incoming arg slots from the
caller are unused, assuming this would be legal.

And that legality is doubtful. We'd need to be careful because I
think the caller is allowed to assume that those slots are untouched
by the callee, and thus can be assumed to have whatever values the
caller put there even after the callee returns.

2008-06-25 22:23:43

by David Miller

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

From: Denys Vlasenko <[email protected]>
Date: Wed, 25 Jun 2008 17:24:57 +0200

> talk to gcc people to remedy insane call convention sounds as a more
> workable solution.

It's not a tenable solution to this problem. We have to make this
work properly with compiler tools currently available to users, rather
than versions of gcc that don't even exist yet.

And what's more, the amount we can get back from the current stack
size allocation is, at best, 16 bytes. The problem isn't going away
no matter what we do to the compiler.

> BTW, i386 uses regparm call convention, is similar trick
> possible for sparc64?

sparc64 already passes arguments in registers.

The stack frame is (predominantly, size-wise) there to accommodate the
save area for the CPU's register windows in non-leaf functions, and that
isn't going anywhere.

Back to the patch set, several of these non-inlined cases really
are extremely suboptimal. If it's just moving args around, it
should be inline. This would help even x86.

Also, suggesting that IRQSTACKS are mandatory for correct operation is
a pretty unreasonable thing to say. It's currently still optional on
the platforms where it is implemented, so I wonder if this means that
correct operation is optional on these platforms? :-)

I have a sparc64 IRQSTACKS implementation that I'm working on, but
it's not something I can stick into 2.6.26 by any stretch of the
imagination.

2008-06-25 22:30:21

by David Miller

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

From: Denys Vlasenko <[email protected]>
Date: Wed, 25 Jun 2008 16:17:40 +0200

> And you know what? This is likely not the end yet! It's possible
> spin_lock_irqXXX, __wake_up_common, waitqueue_active or bit_waitqueue
> are inlines - I didn't check.

The thing to do in these kinds of scenarios is to provide top-level
routines in kernel/sched.c where all the inlining occurs, rather than
doing it in header files and thus expanding this stuff in multiple
places.

2008-06-26 00:28:48

by David Miller

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

From: Denys Vlasenko <[email protected]>
Date: Wed, 25 Jun 2008 22:37:58 +0200

> On Wednesday 25 June 2008 18:01, Mikulas Patocka wrote:
> > > Ingo's suggestion to talk to the gcc people to remedy the
> > > insane call convention sounds like a more workable solution.
> > >
> > > BTW, i386 uses the regparm call convention; is a similar trick
> > > possible for sparc64?
> >
> > Sparc64 has register windows: it passes arguments in registers, but it
> > must allocate stack space for those registers. If the call stack gets too
> > deep (more than 8 levels), the CPU runs out of register windows and starts
> > spilling the registers of the function 8 levels deep to the stack.
> >
> > The stack usage could be reduced to 176 bytes with little work from the
> > gcc developers, and to 128 bytes with more work (an ABI change). If you wanted to
>
> Wow, that's nearly a 2x reduction.
>
> An ABI change is not a problem for the kernel, since it is a "freestanding
> application". Exactly like i386 switched to regparm, which is a different ABI.

Except that nobody has written this code, and therefore relying on this
unimplemented compiler facility for correctness is not tenable.

2008-06-26 03:36:11

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Thursday 26 June 2008 02:28, David Miller wrote:
> From: Denys Vlasenko <[email protected]>
> Date: Wed, 25 Jun 2008 22:37:58 +0200
> > > Sparc64 has register windows: it passes arguments in registers, but it
> > > must allocate stack space for those registers. If the call stack gets too
> > > deep (more than 8 levels), the CPU runs out of register windows and starts
> > > spilling the registers of the function 8 levels deep to the stack.
> > >
> > > The stack usage could be reduced to 176 bytes with little work from the
> > > gcc developers, and to 128 bytes with more work (an ABI change). If you wanted to
> >
> > Wow, that's nearly a 2x reduction.
> >
> > An ABI change is not a problem for the kernel, since it is a "freestanding
> > application". Exactly like i386 switched to regparm, which is a different ABI.
>
> Except that nobody has written this code, and therefore relying on this
> unimplemented compiler facility for correctness is not tenable.

Inlining everything is even less tenable. Why should architectures which
do not require 128+ bytes of stack for every function call suffer?

I am all for fixing code where there are extra useless levels of calls,
but in this example I pointed out that the patch adds inlines too
liberally. Do you agree that blowing up every wake_up_bit() into half a
dozen or more C lines is not what we want?
--
vda

2008-06-26 04:18:58

by David Miller

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

From: Denys Vlasenko <[email protected]>
Date: Thu, 26 Jun 2008 05:35:59 +0200

> On Thursday 26 June 2008 02:28, David Miller wrote:
> > From: Denys Vlasenko <[email protected]>
> > Date: Wed, 25 Jun 2008 22:37:58 +0200
> > > > Sparc64 has register windows: it passes arguments in registers, but it
> > > > must allocate stack space for those registers. If the call stack gets too
> > > > deep (more than 8 levels), the CPU runs out of register windows and starts
> > > > spilling the registers of the function 8 levels deep to the stack.
> > > >
> > > > The stack usage could be reduced to 176 bytes with little work from the
> > > > gcc developers, and to 128 bytes with more work (an ABI change). If you wanted to
> > >
> > > Wow, that's nearly a 2x reduction.
> > >
> > > An ABI change is not a problem for the kernel, since it is a "freestanding
> > > application". Exactly like i386 switched to regparm, which is a different ABI.
> >
> > Except that nobody has written this code, and therefore relying on this
> > unimplemented compiler facility for correctness is not tenable.
>
> Inlining everything is even less tenable.

I never suggested this. Although I do support inlining the cases which
merely adjust the ordering of arguments passed on to function calls,
because such inlines are essentially zero-cost and a gain to all
platforms.

> I am all for fixing code where there are extra useless levels of calls,
> but in this example I pointed out that patch adds inlines too liberally.
> Do you agree that blowing up every wake_up_bit() into half a dozen
> or more C lines is not what we want?

I stated my suggested alternative to this in another posting, so of
course I do not support it as-is.

2008-06-26 06:32:47

by Bart Van Assche

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

On Thu, Jun 26, 2008 at 12:09 AM, David Miller <[email protected]> wrote:
> From: Mikulas Patocka <[email protected]>
> Date: Wed, 25 Jun 2008 08:53:10 -0400 (EDT)
>
>> Even worse, gcc doesn't use these additional bytes. If you try this:
>>
>> extern void f(int *i);
>> void g()
>> {
>> 	int a;
>> 	f(&a);
>> }
>>
>> it allocates an additional 16 bytes for the variable "a" (so there's a
>> total of 208 bytes), even though it could place the variable into the
>> 48-byte ABI-mandated area that it inherited from the caller, or into its
>> own 16-byte padding that it made when calling "f".
>
> The extra 16 bytes of space allocated is so that GCC can perform a
> secondary reload of a quad floating point value. It always has to be
> present, because we can't satisfy a secondary reload by emitting yet
> another reload, it's the end of the possible level of recursions
> allowed by the reload pass.

Is there any floating-point code present in the Linux kernel? Would it be
a good idea to add an option to gcc that tells it the compiled code
contains no floating-point instructions, so that gcc knows no space has
to be provided for a quad floating point value?

Bart.

2008-06-26 09:06:31

by David Miller

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

From: "Bart Van Assche" <[email protected]>
Date: Thu, 26 Jun 2008 08:32:35 +0200

> On Thu, Jun 26, 2008 at 12:09 AM, David Miller <[email protected]> wrote:
> > The extra 16 bytes of space allocated is so that GCC can perform a
> > secondary reload of a quad floating point value. It always has to be
> > present, because we can't satisfy a secondary reload by emitting yet
> > another reload, it's the end of the possible level of recursions
> > allowed by the reload pass.
>
> Is there any floating-point code present in the Linux kernel ?

Yes, but not coming from C compiled code. Floating point is
used in most of the memcpy/memset implementations of the
sparc64 kernel.

> Would it be a good idea to add an option to gcc that tells gcc that
> the compiled code does not contain floating-point instructions, such
> that gcc knows that no space has to be provided for a quad floating
> point value ?

I think it exists already, it's called -mno-fpu :-)

2008-06-27 12:19:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [3/10 PATCH] inline wake_up_bit

On Wed 2008-06-25 17:28:38, David Miller wrote:
> From: Denys Vlasenko <[email protected]>
> Date: Wed, 25 Jun 2008 22:37:58 +0200
>
> > On Wednesday 25 June 2008 18:01, Mikulas Patocka wrote:
> > > > Ingo's suggestion to talk to the gcc people to remedy the
> > > > insane call convention sounds like a more workable solution.
> > > >
> > > > BTW, i386 uses the regparm call convention; is a similar trick
> > > > possible for sparc64?
> > >
> > > Sparc64 has register windows: it passes arguments in registers, but it
> > > must allocate stack space for those registers. If the call stack gets too
> > > deep (more than 8 levels), the CPU runs out of register windows and starts
> > > spilling the registers of the function 8 levels deep to the stack.
> > >
> > > The stack usage could be reduced to 176 bytes with little work from the
> > > gcc developers, and to 128 bytes with more work (an ABI change). If you wanted to
> >
> > Wow, that's nearly a 2x reduction.
> >
> > An ABI change is not a problem for the kernel, since it is a "freestanding
> > application". Exactly like i386 switched to regparm, which is a different ABI.
>
> Except that nobody has written this code, and therefore relying on this
> unimplemented compiler facility for correctness is not tenable.

Switch to a 32K stack on sparc64, then?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-02 04:40:14

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow



On Wed, 25 Jun 2008, David Miller wrote:

> From: Mikulas Patocka <[email protected]>
> Date: Wed, 25 Jun 2008 08:53:10 -0400 (EDT)
>
>> Even worse, gcc doesn't use these additional bytes. If you try this:
>>
>> extern void f(int *i);
>> void g()
>> {
>> 	int a;
>> 	f(&a);
>> }
>>
>> it allocates an additional 16 bytes for the variable "a" (so there's a
>> total of 208 bytes), even though it could place the variable into the
>> 48-byte ABI-mandated area that it inherited from the caller, or into its
>> own 16-byte padding that it made when calling "f".
>
> The extra 16 bytes of space allocated is so that GCC can perform a
> secondary reload of a quad floating point value. It always has to be
> present, because we can't satisfy a secondary reload by emitting yet
> another reload, it's the end of the possible level of recursions
> allowed by the reload pass.
>
> GCC could be smart and eliminate that slot when it's not used, but
> such a thing is not implemented yet.
>
> It would also require quite a bit of new code to determine cases
> like you mention above, where the incoming arg slots from the
> caller are unused, assuming this would be legal.
>
> And that legality is doubtful. We'd need to be careful because I
> think the caller is allowed to assume that those slots are untouched
> by the callee, and thus can be assumed to have whatever values the
> caller put there even after the callee returns.

The ABI is very vague about it. The V9 ABI just displays that 6-word
space in a figure but doesn't say anything about its usage. The V8 ABI
just says that "the function may write incoming arguments there". Whether
it may write anything else is unknown --- probably yes, but the document
doesn't say.

The document nicely specifies who owns which registers, but doesn't say
that about the stack space :-(

Mikulas

2008-07-02 04:45:22

by David Miller

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

From: Mikulas Patocka <[email protected]>
Date: Wed, 2 Jul 2008 00:39:35 -0400 (EDT)

> The ABI is very vague about it. The V9 ABI just displays that 6-word
> space in a figure but doesn't say anything about its usage. The V8 ABI
> just says that "the function may write incoming arguments there". Whether
> it may write anything else is unknown --- probably yes, but the document
> doesn't say.
>
> The document nicely specifies who owns which registers, but doesn't say
> that about the stack space :-(

Actually, I know for a fact that you have to have those slots there.

A long time ago in the sparc64 kernel, in the trap entry code, I tried
giving only 128 bytes of stack frame when the trap entry called into C
code. It did not work; I had to put the 6 slots there.

2008-07-03 21:12:53

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [10 PATCHES] inline functions to avoid stack overflow

On Tue, 1 Jul 2008, David Miller wrote:

> From: Mikulas Patocka <[email protected]>
> Date: Wed, 2 Jul 2008 00:39:35 -0400 (EDT)
>
>> The ABI is very vague about it. The V9 ABI just displays that 6-word
>> space in a figure but doesn't say anything about its usage. The V8 ABI
>> just says that "the function may write incoming arguments there". Whether
>> it may write anything else is unknown --- probably yes, but the document
>> doesn't say.
>>
>> The document nicely specifies who owns which registers, but doesn't say
>> that about the stack space :-(
>
> Actually, I know for a fact that you have to have those slots there.
>
> A long time ago in the sparc64 kernel, in the trap entry code, I tried
> only giving 128 bytes of stack frame as the trap entry called into C
> code. And it did not work, I had to put the 6 slots there.

The bad thing is that gcc can't use those slots optimally. If you have,
for example:

void f(int *x)
{
}

void g()
{
	int a;
	f(&a);
}

void h()
{
	g();
}

then the variable "a" can't be placed into one of the 6 implicit slots
for the g->f call (because "f" may overwrite that slot). But "a" could be
placed into one of the 6 slots that "h" allocates for "g" (because those
slots are owned by "g"). Yet it isn't --- additional space is allocated
for "a" :-/

Mikulas