2015-04-23 21:40:26

by Hagen Paul Pfeifer

Subject: [PATCH] enforce function inlining for hot functions

GCC inlining heuristics are sometimes puzzling, especially with inline
assembler constructs, where GCC seems to have issues. An allyesconfig build
shows a rather long list of functions where GCC's inlining decisions are
questionable (not inlined). Furthermore, because the functions are declared
with static linkage, each function is copied n times - and n can be rather
high:

atomic_inc: 544 duplicates
rcu_read_unlock: 453 duplicates
rcu_read_lock: 383 duplicates
get_dma_ops: 271 duplicates
arch_local_irq_restore: 258 duplicates
atomic_dec: 215 duplicates
kzalloc: 185 duplicates
cpumask_check: 157 duplicates
test_and_set_bit: 156 duplicates
cpumask_next: 146 duplicates
list_del: 131 duplicates
kref_get: 126 duplicates

This patch enforces inlining for the hottest functions. "Hottest" here is a
subjectively biased list of functions where a) inlining is a performance
must, and/or b) the number of call sites is quite high, and/or c) the
function call overhead is too high for a two- or three-instruction inline
assembler function (usually all three). E.g.:

static inline void kref_get(struct kref *kref)
{
	WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);
}
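
With this patch applied the declaration simply becomes (the same change is
visible in the kref.h hunk below):

	static __always_inline void kref_get(struct kref *kref)
	{
		WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);
	}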

For the larger part of the inline functions GCC is able to make proper
inlining decisions. Let's pray that inlining heuristics are further improved
in future versions of GCC so that we can eventually revert this patch - but
this should last for a while.

Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hagen Paul Pfeifer <[email protected]>
---

net_generic() is a borderline candidate, probably too heavy for inlining.
Possibly a candidate for removing the static linkage and inline attributes?!
I haven't touched it yet - but I can change that too.

arch/x86/include/asm/bitops.h | 4 ++--
arch/x86/include/asm/dma-mapping.h | 4 ++--
arch/x86/include/asm/irqflags.h | 8 ++++----
arch/x86/include/asm/paravirt.h | 10 +++++-----
arch/x86/include/asm/uaccess.h | 4 ++--
drivers/net/wireless/rtlwifi/wifi.h | 2 +-
include/linux/buffer_head.h | 2 +-
include/linux/clk.h | 2 +-
include/linux/completion.h | 2 +-
include/linux/cpumask.h | 6 +++---
include/linux/interrupt.h | 2 +-
include/linux/kref.h | 2 +-
include/linux/list.h | 4 ++--
include/linux/netdevice.h | 6 +++---
include/linux/rcupdate.h | 6 +++---
include/linux/skbuff.h | 4 ++--
include/linux/slab.h | 2 +-
include/linux/workqueue.h | 10 +++++-----
include/net/netlink.h | 2 +-
include/net/netns/generic.h | 2 +-
include/net/sock.h | 2 +-
21 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index cfe3b95..39ccf6c 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -201,7 +201,7 @@ static inline void change_bit(long nr, volatile unsigned long *addr)
* This operation is atomic and cannot be reordered.
* It also implies a memory barrier.
*/
-static inline int test_and_set_bit(long nr, volatile unsigned long *addr)
+static __always_inline int test_and_set_bit(long nr, volatile unsigned long *addr)
{
GEN_BINARY_RMWcc(LOCK_PREFIX "bts", *addr, "Ir", nr, "%0", "c");
}
@@ -247,7 +247,7 @@ static inline int __test_and_set_bit(long nr, volatile unsigned long *addr)
* This operation is atomic and cannot be reordered.
* It also implies a memory barrier.
*/
-static inline int test_and_clear_bit(long nr, volatile unsigned long *addr)
+static __always_inline int test_and_clear_bit(long nr, volatile unsigned long *addr)
{
GEN_BINARY_RMWcc(LOCK_PREFIX "btr", *addr, "Ir", nr, "%0", "c");
}
diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
index 808dae6..55097bc 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -29,7 +29,7 @@ extern int panic_on_overflow;

extern struct dma_map_ops *dma_ops;

-static inline struct dma_map_ops *get_dma_ops(struct device *dev)
+static __always_inline struct dma_map_ops *get_dma_ops(struct device *dev)
{
#ifndef CONFIG_X86_DEV_DMA_OPS
return dma_ops;
@@ -44,7 +44,7 @@ static inline struct dma_map_ops *get_dma_ops(struct device *dev)
#include <asm-generic/dma-mapping-common.h>

/* Make sure we keep the same behaviour */
-static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
+static __always_inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{
struct dma_map_ops *ops = get_dma_ops(dev);
debug_dma_mapping_error(dev, dma_addr);
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b77f5ed..c3842bc 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -62,22 +62,22 @@ static inline void native_halt(void)
#ifndef __ASSEMBLY__
#include <linux/types.h>

-static inline notrace unsigned long arch_local_save_flags(void)
+static __always_inline notrace unsigned long arch_local_save_flags(void)
{
return native_save_fl();
}

-static inline notrace void arch_local_irq_restore(unsigned long flags)
+static __always_inline notrace void arch_local_irq_restore(unsigned long flags)
{
native_restore_fl(flags);
}

-static inline notrace void arch_local_irq_disable(void)
+static __always_inline notrace void arch_local_irq_disable(void)
{
native_irq_disable();
}

-static inline notrace void arch_local_irq_enable(void)
+static __always_inline notrace void arch_local_irq_enable(void)
{
native_irq_enable();
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 8957810..2d139e1 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -799,27 +799,27 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
#define __PV_IS_CALLEE_SAVE(func) \
((struct paravirt_callee_save) { func })

-static inline notrace unsigned long arch_local_save_flags(void)
+static __always_inline notrace unsigned long arch_local_save_flags(void)
{
return PVOP_CALLEE0(unsigned long, pv_irq_ops.save_fl);
}

-static inline notrace void arch_local_irq_restore(unsigned long f)
+static __always_inline notrace void arch_local_irq_restore(unsigned long f)
{
PVOP_VCALLEE1(pv_irq_ops.restore_fl, f);
}

-static inline notrace void arch_local_irq_disable(void)
+static __always_inline notrace void arch_local_irq_disable(void)
{
PVOP_VCALLEE0(pv_irq_ops.irq_disable);
}

-static inline notrace void arch_local_irq_enable(void)
+static __always_inline notrace void arch_local_irq_enable(void)
{
PVOP_VCALLEE0(pv_irq_ops.irq_enable);
}

-static inline notrace unsigned long arch_local_irq_save(void)
+static __always_inline notrace unsigned long arch_local_irq_save(void)
{
unsigned long f;

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ace9dec..a34dff0 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -684,7 +684,7 @@ __copy_from_user_overflow(int size, unsigned long count)

#endif

-static inline unsigned long __must_check
+static __always_inline unsigned long __must_check
copy_from_user(void *to, const void __user *from, unsigned long n)
{
int sz = __compiletime_object_size(to);
@@ -719,7 +719,7 @@ copy_from_user(void *to, const void __user *from, unsigned long n)
return n;
}

-static inline unsigned long __must_check
+static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
int sz = __compiletime_object_size(from);
diff --git a/drivers/net/wireless/rtlwifi/wifi.h b/drivers/net/wireless/rtlwifi/wifi.h
index 5157291..96d568c 100644
--- a/drivers/net/wireless/rtlwifi/wifi.h
+++ b/drivers/net/wireless/rtlwifi/wifi.h
@@ -2900,7 +2900,7 @@ static inline u32 rtl_read_dword(struct rtl_priv *rtlpriv, u32 addr)
return rtlpriv->io.read32_sync(rtlpriv, addr);
}

-static inline void rtl_write_byte(struct rtl_priv *rtlpriv, u32 addr, u8 val8)
+static __always_inline void rtl_write_byte(struct rtl_priv *rtlpriv, u32 addr, u8 val8)
{
rtlpriv->io.write8_async(rtlpriv, addr, val8);

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 73b4522..d3942a7 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -281,7 +281,7 @@ static inline void put_bh(struct buffer_head *bh)
atomic_dec(&bh->b_count);
}

-static inline void brelse(struct buffer_head *bh)
+static __always_inline void brelse(struct buffer_head *bh)
{
if (bh)
__brelse(bh);
diff --git a/include/linux/clk.h b/include/linux/clk.h
index 68c16a6..7fb8314 100644
--- a/include/linux/clk.h
+++ b/include/linux/clk.h
@@ -450,7 +450,7 @@ static inline struct clk *clk_get_parent(struct clk *clk)
#endif

/* clk_prepare_enable helps cases using clk_enable in non-atomic context. */
-static inline int clk_prepare_enable(struct clk *clk)
+static __always_inline int clk_prepare_enable(struct clk *clk)
{
int ret;

diff --git a/include/linux/completion.h b/include/linux/completion.h
index 5d5aaae..06163a6 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -70,7 +70,7 @@ struct completion {
* This inline function will initialize a dynamically created completion
* structure.
*/
-static inline void init_completion(struct completion *x)
+static __always_inline void init_completion(struct completion *x)
{
x->done = 0;
init_waitqueue_head(&x->wait);
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 27e285b..912dede 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -111,7 +111,7 @@ extern const struct cpumask *const cpu_active_mask;
#endif

/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static __always_inline unsigned int cpumask_check(unsigned int cpu)
{
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
WARN_ON_ONCE(cpu >= nr_cpumask_bits);
@@ -183,7 +183,7 @@ static inline unsigned int cpumask_first(const struct cpumask *srcp)
*
* Returns >= nr_cpu_ids if no further cpus set.
*/
-static inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
+static __always_inline unsigned int cpumask_next(int n, const struct cpumask *srcp)
{
/* -1 is a legal arg here. */
if (n != -1)
@@ -473,7 +473,7 @@ static inline bool cpumask_full(const struct cpumask *srcp)
* cpumask_weight - Count of bits in *srcp
* @srcp: the cpumask to count bits (< nr_cpu_ids) in.
*/
-static inline unsigned int cpumask_weight(const struct cpumask *srcp)
+static __always_inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 950ae45..ef2279a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -527,7 +527,7 @@ static inline void tasklet_unlock_wait(struct tasklet_struct *t)

extern void __tasklet_schedule(struct tasklet_struct *t);

-static inline void tasklet_schedule(struct tasklet_struct *t)
+static __always_inline void tasklet_schedule(struct tasklet_struct *t)
{
if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
__tasklet_schedule(t);
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 484604d..0a91a21 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -38,7 +38,7 @@ static inline void kref_init(struct kref *kref)
* kref_get - increment refcount for object.
* @kref: object.
*/
-static inline void kref_get(struct kref *kref)
+static __always_inline void kref_get(struct kref *kref)
{
/* If refcount was 0 before incrementing then we have a race
* condition when this kref is freeing by some other thread right now.
diff --git a/include/linux/list.h b/include/linux/list.h
index feb773c..85d3a1e 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -102,7 +102,7 @@ static inline void __list_del_entry(struct list_head *entry)
__list_del(entry->prev, entry->next);
}

-static inline void list_del(struct list_head *entry)
+static __always_inline void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
@@ -140,7 +140,7 @@ static inline void list_replace_init(struct list_head *old,
* list_del_init - deletes entry from list and reinitialize it.
* @entry: the element to delete from the list.
*/
-static inline void list_del_init(struct list_head *entry)
+static __always_inline void list_del_init(struct list_head *entry)
{
__list_del_entry(entry);
INIT_LIST_HEAD(entry);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bcbde79..87d83b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2547,7 +2547,7 @@ static inline void netif_tx_wake_all_queues(struct net_device *dev)
}
}

-static inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
+static __always_inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
{
if (WARN_ON(!dev_queue)) {
pr_info("netif_stop_queue() cannot be called before register_netdev()\n");
@@ -2563,12 +2563,12 @@ static inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
* Stop upper layers calling the device hard_start_xmit routine.
* Used for flow control when transmit resources are unavailable.
*/
-static inline void netif_stop_queue(struct net_device *dev)
+static __always_inline void netif_stop_queue(struct net_device *dev)
{
netif_tx_stop_queue(netdev_get_tx_queue(dev, 0));
}

-static inline void netif_tx_stop_all_queues(struct net_device *dev)
+static __always_inline void netif_tx_stop_all_queues(struct net_device *dev)
{
unsigned int i;

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 573a5af..1acd8ef 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -446,7 +446,7 @@ static inline bool rcu_lockdep_current_cpu_online(void)

#ifdef CONFIG_DEBUG_LOCK_ALLOC

-static inline void rcu_lock_acquire(struct lockdep_map *map)
+static __always_inline void rcu_lock_acquire(struct lockdep_map *map)
{
lock_acquire(map, 0, 0, 2, 0, NULL, _THIS_IP_);
}
@@ -907,7 +907,7 @@ static inline void rcu_preempt_sleep_check(void)
* read-side critical sections may be preempted and they may also block, but
* only when acquiring spinlocks that are subject to priority inheritance.
*/
-static inline void rcu_read_lock(void)
+static __always_inline void rcu_read_lock(void)
{
__rcu_read_lock();
__acquire(RCU);
@@ -961,7 +961,7 @@ static inline void rcu_read_lock(void)
*
* See rcu_read_lock() for more information.
*/
-static inline void rcu_read_unlock(void)
+static __always_inline void rcu_read_unlock(void)
{
rcu_lockdep_assert(rcu_is_watching(),
"rcu_read_unlock() used illegally while idle");
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 06793b5..0f98aa5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1328,7 +1328,7 @@ static inline void __skb_queue_head_init(struct sk_buff_head *list)
* network layer or drivers should need annotation to consolidate the
* main types of usage into 3 classes.
*/
-static inline void skb_queue_head_init(struct sk_buff_head *list)
+static __always_inline void skb_queue_head_init(struct sk_buff_head *list)
{
spin_lock_init(&list->lock);
__skb_queue_head_init(list);
@@ -1722,7 +1722,7 @@ static inline unsigned char *pskb_pull(struct sk_buff *skb, unsigned int len)
return unlikely(len > skb->len) ? NULL : __pskb_pull(skb, len);
}

-static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
+static __always_inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
{
if (likely(len <= skb_headlen(skb)))
return 1;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index ffd24c8..06af7bc 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -578,7 +578,7 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
*/
-static inline void *kzalloc(size_t size, gfp_t flags)
+static __always_inline void *kzalloc(size_t size, gfp_t flags)
{
return kmalloc(size, flags | __GFP_ZERO);
}
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index deee212..9f902c9 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -466,8 +466,8 @@ extern void show_workqueue_state(void);
* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.
*/
-static inline bool queue_work(struct workqueue_struct *wq,
- struct work_struct *work)
+static __always_inline bool queue_work(struct workqueue_struct *wq,
+ struct work_struct *work)
{
return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
@@ -525,7 +525,7 @@ static inline bool schedule_work_on(int cpu, struct work_struct *work)
* queued and leaves it in the same position on the kernel-global
* workqueue otherwise.
*/
-static inline bool schedule_work(struct work_struct *work)
+static __always_inline bool schedule_work(struct work_struct *work)
{
return queue_work(system_wq, work);
}
@@ -553,8 +553,8 @@ static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
* After waiting for a given time this puts a job in the kernel-global
* workqueue.
*/
-static inline bool schedule_delayed_work(struct delayed_work *dwork,
- unsigned long delay)
+static __always_inline bool schedule_delayed_work(struct delayed_work *dwork,
+ unsigned long delay)
{
return queue_delayed_work(system_wq, dwork, delay);
}
diff --git a/include/net/netlink.h b/include/net/netlink.h
index 2a5dbcc..9c5ea29 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -517,7 +517,7 @@ static inline void *nlmsg_get_pos(struct sk_buff *skb)
*
* Trims the message to the provided mark.
*/
-static inline void nlmsg_trim(struct sk_buff *skb, const void *mark)
+static __always_inline void nlmsg_trim(struct sk_buff *skb, const void *mark)
{
if (mark) {
WARN_ON((unsigned char *) mark < skb->data);
diff --git a/include/net/netns/generic.h b/include/net/netns/generic.h
index 70e1585..b529fa3 100644
--- a/include/net/netns/generic.h
+++ b/include/net/netns/generic.h
@@ -31,7 +31,7 @@ struct net_generic {
void *ptr[0];
};

-static inline void *net_generic(const struct net *net, int id)
+static __always_inline void *net_generic(const struct net *net, int id)
{
struct net_generic *ng;
void *ptr;
diff --git a/include/net/sock.h b/include/net/sock.h
index 3a4898e..92e12fb 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1619,7 +1619,7 @@ void sock_init_data(struct socket *sock, struct sock *sk);
*/

/* Ungrab socket and destroy it, if it was the last reference. */
-static inline void sock_put(struct sock *sk)
+static __always_inline void sock_put(struct sock *sk)
{
if (atomic_dec_and_test(&sk->sk_refcnt))
sk_free(sk);
--
2.1.4


2015-04-24 19:49:10

by Andrew Morton

Subject: Re: [PATCH] enforce function inlining for hot functions

On Thu, 23 Apr 2015 23:40:01 +0200 Hagen Paul Pfeifer <[email protected]> wrote:

> GCC inlining heuristics are sometimes puzzling, especially with inline
> assembler constructs, where GCC seems to have issues. An allyesconfig build
> shows a rather long list of functions where GCC's inlining decisions are
> questionable (not inlined).

I can't reproduce this with either gcc-4.8.2 or gcc-4.4.4. The patch
makes zero difference to `size vmlinux' and a bit of poking around with
nm doesn't show any out-of-lined versions of the functions you
identify.

So. More details, please. How to demonstrate this, gcc versions, etc.

> Furthermore, because the functions are declared with static
> linkage, each function is copied n times - and n can be rather high:
>
> atomic_inc: 544 duplicates
> rcu_read_unlock: 453 duplicates
> rcu_read_lock: 383 duplicates
> get_dma_ops: 271 duplicates
> arch_local_irq_restore: 258 duplicates
> atomic_dec: 215 duplicates
> kzalloc: 185 duplicates
> cpumask_check: 157 duplicates
> test_and_set_bit: 156 duplicates
> cpumask_next: 146 duplicates
> list_del: 131 duplicates
> kref_get: 126 duplicates

That's pretty pathetic.

2015-04-24 20:13:48

by Paul E. McKenney

Subject: Re: [PATCH] enforce function inlining for hot functions

On Fri, Apr 24, 2015 at 12:49:05PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 23:40:01 +0200 Hagen Paul Pfeifer <[email protected]> wrote:
>
> > GCC inlining heuristics are sometimes puzzling, especially with inline
> > assembler constructs, where GCC seems to have issues. An allyesconfig build
> > shows a rather long list of functions where GCC's inlining decisions are
> > questionable (not inlined).
>
> I can't reproduce this with either gcc-4.8.2 or gcc-4.4.4. The patch
> makes zero difference to `size vmlinux' and a bit of poking around with
> nm doesn't show any out-of-lined versions of the functions you
> identify.
>
> So. More details, please. How to demonstrate this, gcc versions, etc.
>
> > Furthermore, because the functions are declared with static
> > linkage, each function is copied n times - and n can be rather high:
> >
> > atomic_inc: 544 duplicates
> > rcu_read_unlock: 453 duplicates
> > rcu_read_lock: 383 duplicates

Hmmm... allyesconfig would have PROVE_RCU=y, which would mean that the
above two would contain lockdep calls that might in some cases defeat
inlining. With the more typical production choice of PROVE_RCU=n, I would
expect these to just be a call instruction, which should get inlined.
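
(Roughly, with those debug options off the non-preemptible rcu_read_lock()
collapses to the following - a simplified sketch based on the rcupdate.h code
quoted in the patch, not the verbatim source:)

	static inline void rcu_read_lock(void)
	{
		__rcu_read_lock();	/* preempt_disable(); just a barrier if !CONFIG_PREEMPT_COUNT */
		__acquire(RCU);		/* sparse annotation only, no generated code */
		rcu_lock_acquire(&rcu_lock_map);	/* empty with CONFIG_DEBUG_LOCK_ALLOC=n */
	}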

Thanx, Paul

> > get_dma_ops: 271 duplicates
> > arch_local_irq_restore: 258 duplicates
> > atomic_dec: 215 duplicates
> > kzalloc: 185 duplicates
> > cpumask_check: 157 duplicates
> > test_and_set_bit: 156 duplicates
> > cpumask_next: 146 duplicates
> > list_del: 131 duplicates
> > kref_get: 126 duplicates
>
> That's pretty pathetic.
>

2015-04-24 20:39:25

by Hagen Paul Pfeifer

Subject: Re: [PATCH] enforce function inlining for hot functions

On 24 April 2015 at 21:49, Andrew Morton <[email protected]> wrote:
> I can't reproduce this with either gcc-4.8.2 or gcc-4.4.4. The patch
> makes zero difference to `size vmlinux' and a bit of poking around with
> nm doesn't show any out-of-lined versions of the functions you
> identify.
>
> So. More details, please. How to demonstrate this, gcc versions, etc.

Hey Andrew,

first of all you should probably scan over https://lkml.org/lkml/2015/4/21/178
Some questions are already answered there.

Here is the situation: the inlining problem occurs with the 4.9.x
branch - I tried to reproduce it with 4.8.x and saw *no* problems. So
it is probably limited to 4.9 - which is sad. I haven't checked it with
5.0 or 5.1 yet. Then, if CONFIG_OPTIMIZE_INLINING is disabled, inlining
is enforced anyway, so there is no problem because all inline-marked
functions are always inlined - gcc's heuristic is de facto disabled. This
patch makes sure that the hot functions are always inlined, no matter
which config is selected or which compiler version is used. Yes, in an
ideal world gcc would have inlined most of them - now we enforce the
ideal world.
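
(For reference, a simplified sketch of what disabling CONFIG_OPTIMIZE_INLINING
does - this approximates the include/linux/compiler-gcc.h logic of that era,
the exact details vary between kernel versions:)

	#if !defined(CONFIG_OPTIMIZE_INLINING)
	/* force every 'inline' function to really be inlined, bypassing gcc's heuristics */
	#define inline inline __attribute__((always_inline)) notrace
	#endif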

Hagen

2015-04-24 20:44:45

by Hagen Paul Pfeifer

Subject: Re: [PATCH] enforce function inlining for hot functions

On 24 April 2015 at 22:13, Paul E. McKenney <[email protected]> wrote:

Hey Paul,

> Hmmm... allyesconfig would have PROVE_RCU=y, which would mean that the
> above two would contain lockdep calls that might in some cases defeat
> inlining. With the more typical production choice of PROVE_RCU=n, I would
> expect these to just be a call instruction, which should get inlined.

I can rebuild and check with PROVE_RCU=n - the question is what the
reaction to the result will be. I tend to enforce the inlining anyway for
both rcu functions because nobody is harmed. But wait, the compiler has
already started ... ;-)

Hagen

2015-04-24 23:11:05

by Hagen Paul Pfeifer

Subject: Re: [PATCH] enforce function inlining for hot functions

* Paul E. McKenney | 2015-04-24 13:13:40 [-0700]:

>Hmmm... allyesconfig would have PROVE_RCU=y, which would mean that the
>above two would contain lockdep calls that might in some cases defeat
>inlining. With the more typical production choice of PROVE_RCU=n, I would
>expect these to just be a call instruction, which should get inlined.


Ok, here are the results:

with PROVE_RCU=y:
rcu_read_lock: 383 duplicates
with PROVE_RCU=n:
rcu_read_lock: 114 duplicates


If you look at the disassembly of rcu_read_lock you often see the
following definition:

<rcu_read_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 c7 c7 50 64 e7 85 mov $0xffffffff85e76450,%rdi
e8 ce ff ff ff callq ffffffff816af206 <rcu_lock_acquire>
5d pop %rbp
c3 retq

but sometimes rcu_read_lock looks like this:

<rcu_read_lock>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
50 push %rax
68 83 1e 1c 81 pushq $0xffffffff811c1e83
b9 02 00 00 00 mov $0x2,%ecx
31 d2 xor %edx,%edx
45 31 c9 xor %r9d,%r9d
45 31 c0 xor %r8d,%r8d
31 f6 xor %esi,%esi
48 c7 c7 50 64 e7 85 mov $0xffffffff85e76450,%rdi
e8 86 4c f9 ff callq ffffffff81156b2e <lock_acquire>
5a pop %rdx
59 pop %rcx
c9 leaveq
c3 retq


This means rcu_lock_acquire() is inlined here - but not in every compilation
unit. I don't know exactly what keeps gcc from inlining it everywhere. Maybe
register pressure in the surrounding function, or at least gcc thinks so. I
don't know.

In the end you may notice that gcc's inlining decisions are not always perfect
and are a little bit fuzzy (sure, they have their metric/scoring system). And
sometimes inlining should be enforced - as this patch does for some important
functions. But as I said, we should not enforce it everywhere; rather we should
pray for better heuristics and let the compiler choose the best strategy (and
incorporate the -Os/-O2 decision too). I think this is the best compromise
here.
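
(For reference, the function whose inlining varies here is tiny - its
definition, also visible in the rcupdate.h hunk of the patch, is:)

	static inline void rcu_lock_acquire(struct lockdep_map *map)
	{
		lock_acquire(map, 0, 0, 2, 0, NULL, _THIS_IP_);
	}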

Cheers, Hagen

2015-04-25 10:31:50

by Paul E. McKenney

Subject: Re: [PATCH] enforce function inlining for hot functions

On Sat, Apr 25, 2015 at 01:10:56AM +0200, Hagen Paul Pfeifer wrote:
> * Paul E. McKenney | 2015-04-24 13:13:40 [-0700]:
>
> >Hmmm... allyesconfig would have PROVE_RCU=y, which would mean that the
> >above two would contain lockdep calls that might in some cases defeat
> >inlining. With the more typical production choice of PROVE_RCU=n, I would
> >expect these to just be a call instruction, which should get inlined.
>
>
> Ok, here are the results:
>
> with PROVE_RCU=y:
> rcu_read_lock: 383 duplicates
> with PROVE_RCU=n:
> rcu_read_lock: 114 duplicates
>
>
> If you look at the disassembly of rcu_read_lock you often see the
> following definition:
>
> <rcu_read_lock>:
> 55 push %rbp
> 48 89 e5 mov %rsp,%rbp
> 48 c7 c7 50 64 e7 85 mov $0xffffffff85e76450,%rdi
> e8 ce ff ff ff callq ffffffff816af206 <rcu_lock_acquire>
> 5d pop %rbp
> c3 retq

OK, so you have PROVE_RCU=n and CONFIG_DEBUG_LOCK_ALLOC=y in this
case? That would get rid of the rcu_lockdep_assert(), but keep the
rcu_lock_acquire().

> but sometimes rcu_read_lock looks like this:
>
> <rcu_read_lock>:
> 55 push %rbp
> 48 89 e5 mov %rsp,%rbp
> 50 push %rax
> 68 83 1e 1c 81 pushq $0xffffffff811c1e83
> b9 02 00 00 00 mov $0x2,%ecx
> 31 d2 xor %edx,%edx
> 45 31 c9 xor %r9d,%r9d
> 45 31 c0 xor %r8d,%r8d
> 31 f6 xor %esi,%esi
> 48 c7 c7 50 64 e7 85 mov $0xffffffff85e76450,%rdi
> e8 86 4c f9 ff callq ffffffff81156b2e <lock_acquire>
> 5a pop %rdx
> 59 pop %rcx
> c9 leaveq
> c3 retq
>
>
> This means rcu_lock_acquire() is inlined here - but not in every compilation
> unit. I don't know exactly what keeps gcc from inlining it everywhere. Maybe
> register pressure in the surrounding function, or at least gcc thinks so. I
> don't know.
>
> In the end you may notice that gcc's inlining decisions are not always perfect
> and are a little bit fuzzy (sure, they have their metric/scoring system). And
> sometimes inlining should be enforced - as this patch does for some important
> functions. But as I said, we should not enforce it everywhere; rather we should
> pray for better heuristics and let the compiler choose the best strategy (and
> incorporate the -Os/-O2 decision too). I think this is the best compromise
> here.

I am not arguing either way on the wisdom or lack thereof of gcc's
inlining decisions. But PROVE_RCU=n and CONFIG_DEBUG_LOCK_ALLOC=n should
make rcu_read_lock() and rcu_read_unlock() both be empty functions in
a CONFIG_PREEMPT=n kernel, which should hopefully trivialize gcc's inlining
decisions in that particular case.

Apologies for not identifying CONFIG_DEBUG_LOCK_ALLOC=n to begin with.

Thanx, Paul

2015-04-25 10:53:48

by Markus Trippelsdorf

Subject: Re: [PATCH] enforce function inlining for hot functions

On 2015.04.24 at 22:39 +0200, Hagen Paul Pfeifer wrote:
> On 24 April 2015 at 21:49, Andrew Morton <[email protected]> wrote:
> > I can't reproduce this with either gcc-4.8.2 or gcc-4.4.4. The patch
> > makes zero difference to `size vmlinux' and a bit of poking around with
> > nm doesn't show any out-of-lined versions of the functions you
> > identify.
> >
> > So. More details, please. How to demonstrate this, gcc versions, etc.
>
> first of all you should probably scan over https://lkml.org/lkml/2015/4/21/178
> Some questions are already answered there.
>
> Here is the situation: the inlining problem occurs with the 4.9.x
> branch - I tried to reproduce it with 4.8.x and saw *no* problems. So
> it is probably limited to 4.9 - which is sad. I haven't checked it with
> 5.0 or 5.1 yet. Then, if CONFIG_OPTIMIZE_INLINING is disabled, inlining
> is enforced anyway, so there is no problem because all inline-marked
> functions are always inlined - gcc's heuristic is de facto disabled. This
> patch makes sure that the hot functions are always inlined, no matter
> which config is selected or which compiler version is used. Yes, in an
> ideal world gcc would have inlined most of them - now we enforce the
> ideal world.

Please note that this only happens with -Os (aka
CONFIG_CC_OPTIMIZE_FOR_SIZE). The reason is that gcc simply doesn't
inline a call it considers cold when inlining it would grow the
function body.
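
(A minimal, purely hypothetical illustration of that heuristic - not kernel
code: with -Os gcc may keep such a tiny helper out of line in every
translation unit that includes the header if it judges the call cold and
inlining would grow the caller.)

	/* hypothetical header, for illustration only */
	extern void slow_path(void *obj);

	static inline void put_obj(void *obj)
	{
		if (obj)
			slow_path(obj);	/* with -Os this wrapper may stay out of line */
	}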

--
Markus

2015-04-25 13:26:50

by Hagen Paul Pfeifer

Subject: Re: [PATCH] enforce function inlining for hot functions

On 25 April 2015 at 12:31, Paul E. McKenney <[email protected]> wrote:

> I am not arguing either way on the wisdom or lack thereof of gcc's
> inlining decisions. But PROVE_RCU=n and CONFIG_DEBUG_LOCK_ALLOC=n should
> make rcu_read_lock() and rcu_read_unlock() both be empty functions in
> a CONFIG_PREEMPT=n, which should hopefully trivialize gcc's inlining
> decisions in that particular case.

Hey Paul,

yes, with DEBUG_LOCK_ALLOC disabled all rcu_read_lock and unlock
functions are perfectly inlined. So now we have the following
situation: depending on the gcc version and the particular kernel
configuration some hot functions are not inlined - they are duplicated
hundreds of times. Which is bad no matter which gcc/kernel configuration
you consider. I think this should *never* happen.

With the patch we can make sure that hot functions are *always*
inlined - no matter which gcc version and kernel configuration are used.

Furthermore, as Markus already noted: compiled with -O2 this does not
happen. Duplicates are only generated for -Os [1]. Ok, but now the
question: should this happen for -Os? I don't think so. I think we can
do better and mark these few functions as always inline. For the
remaining inline-marked functions we should give gcc the flexibility
and not artificially enforce inlining. The current situation is bad:
OPTIMIZE_INLINING defaults to no, which de facto enforces inlining for
*all* inline-marked functions. GCC's inlining mechanism is de facto
disabled, which is also bad. Last but not least: the patch does not
change anything for current users, because OPTIMIZE_INLINING is still
disabled by default (resulting in __always_inline for all inline-marked
functions). The patch affects users who enable OPTIMIZE_INLINING and
trust the compiler.

Hagen

PS: thank you Markus for the comment.

[1] which is nonsense: the functions are not inlined, yet they are
copied hundreds of times for "size optimized builds". gcc should rather
redeclare the functions as global, define each one once and call that
single definition every time. But implementing such a scheme is probably
a monster of its own, and LTO would be required to solve all the issues
with such a concept.

2015-04-25 13:51:42

by Paul E. McKenney

Subject: Re: [PATCH] enforce function inlining for hot functions

On Sat, Apr 25, 2015 at 03:26:48PM +0200, Hagen Paul Pfeifer wrote:
> On 25 April 2015 at 12:31, Paul E. McKenney <[email protected]> wrote:
>
> > I am not arguing either way on the wisdom or lack thereof of gcc's
> > inlining decisions. But PROVE_RCU=n and CONFIG_DEBUG_LOCK_ALLOC=n should
> > make rcu_read_lock() and rcu_read_unlock() both be empty functions in
> > a CONFIG_PREEMPT=n, which should hopefully trivialize gcc's inlining
> > decisions in that particular case.
>
> Hey Paul,
>
> yes, with DEBUG_LOCK_ALLOC disabled all rcu_read_lock and unlock
> functions are perfectly inlined.

Whew!!! ;-)

> So now we have the following
> situation: depending on the gcc version and the particular kernel
> configuration some hot functions are not inlined - they are duplicated
> hundreds of times. Which is bad no matter which gcc/kernel configuration
> you consider. I think this should *never* happen.
>
> With the patch we can make sure that hot functions are *always*
> inlined - no matter which gcc version and kernel configuration are used.
>
> Furthermore, as Markus already noted: compiled with -O2 this does not
> happen. Duplicates are only generated for -Os [1]. Ok, but now the
> question: should this happen for -Os? I don't think so. I think we can
> do better and mark these few functions as always inline. For the
> remaining inline-marked functions we should give gcc the flexibility
> and not artificially enforce inlining. The current situation is bad:
> OPTIMIZE_INLINING defaults to no, which de facto enforces inlining for
> *all* inline-marked functions. GCC's inlining mechanism is de facto
> disabled, which is also bad. Last but not least: the patch does not
> change anything for current users, because OPTIMIZE_INLINING is still
> disabled by default (resulting in __always_inline for all inline-marked
> functions). The patch affects users who enable OPTIMIZE_INLINING and
> trust the compiler.
>
> Hagen
>
> PS: thank you Markus for the comment.
>
> [1] which is nonsense: the functions are not inlined, yet they are
> copied hundreds of times for "size optimized builds". gcc should rather
> redeclare the functions as global, define each one once and call that
> single definition every time. But implementing such a scheme is probably
> a monster of its own, and LTO would be required to solve all the issues
> with such a concept.

I am guessing that there is only one duplicate per compilation unit?
I would also guess that the LTO guys would have a ready solution. ;-)

That said, if a function was invoked extremely many times, it might
make sense to duplicate it even within a single compilation unit if
doing so allowed saving more than the size of the function in the
form of call instructions with shorter address fields. But I have
no idea whether or not gcc would do this sort of thing.

Thanx, Paul