Message-ID: <554C85B2.1010605@redhat.com>
Date: Fri, 08 May 2015 11:45:22 +0200
From: Denys Vlasenko
To: Alexander Duyck, "David S. Miller"
CC: Jiri Pirko, linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
 netfilter-devel@vger.kernel.org
Subject: Re: [PATCH] net: deinline netif_tx_stop_queue() and netif_tx_stop_all_queues()
References: <1430998870-1453-1-git-send-email-dvlasenk@redhat.com> <554B9D82.80101@gmail.com>
In-Reply-To: <554B9D82.80101@gmail.com>

On 05/07/2015 07:14 PM, Alexander Duyck wrote:
> On 05/07/2015 04:41 AM, Denys Vlasenko wrote:
>> These functions compile to ~60 bytes of machine code each.
>>
>> With this .config: http://busybox.net/~vda/kernel_config
>> there are 617 calls to netif_tx_stop_queue()
>> and 49 calls to netif_tx_stop_all_queues() in vmlinux.
>>
>> Code size is reduced by 27 kbytes:
>>
>>     text     data      bss       dec     hex filename
>> 82426986 22255416 20627456 125309858 77813a2 vmlinux.before
>> 82399481 22255416 20627456 125282353 777a831 vmlinux
>>
>> It may seem strange that seemingly simple code like the one in
>> netif_tx_stop_queue() compiles to ~60 bytes. Well, it's true.
>> Here's its disassembly:
>>
>> netif_tx_stop_queue:
>>   ...
>>   55                       push   %rbp
>>   be 7a 18 00 00           mov    $0x187a,%esi
>>   48 c7 c7 50 59 d8 85     mov    $.rodata+0x1d85950,%rdi
>>   48 89 e5                 mov    %rsp,%rbp
>>   e8 54 5a 7d fd           callq
>>   48 c7 c7 5f 59 d8 85     mov    $.rodata+0x1d8595f,%rdi
>>   31 c0                    xor    %eax,%eax
>>   e8 b0 47 48 00           callq
>>   eb 09                    jmp
>
> This is the WARN_ON action. One thing you might try doing is moving
> this to a function of its own instead of moving the entire thing out
> of being an inline.

If the WARN_ON check were moved into a function of its own, the call
overhead would still be there, while each callsite would be larger
than with this patch.
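To make the comparison concrete, here is roughly what the two variants
look like. This is a sketch from memory, not the exact netdevice.h
source, and netif_tx_stop_queue_warn() is a name I made up for the
would-be out-of-line helper:

/* Variant A -- the current inline, approximately. Every one of the
 * 617 callsites carries the NULL test, the WARN_ON bookkeeping, the
 * pr_info() setup and the "lock orb": ~60 bytes per caller.
 */
static inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
{
        if (WARN_ON(!dev_queue)) {
                pr_info("netif_stop_queue() cannot be called before register_netdev()\n");
                return;
        }
        set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
}

/* Variant B -- your suggestion, approximately: only the warning goes
 * out of line (the helper would live in net/core/dev.c). The callsite
 * still contains the NULL test, the branch, the "lock orb" and a call
 * slot for the helper.
 */
void netif_tx_stop_queue_warn(void);

static inline void netif_tx_stop_queue(struct netdev_queue *dev_queue)
{
        if (unlikely(!dev_queue)) {
                netif_tx_stop_queue_warn();
                return;
        }
        set_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state);
}

My patch instead turns variant A into an ordinary exported function,
so each caller collapses to a single callq.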
> You may find you still get most of the space savings, as I wonder if
> the string for the printk isn't being duplicated for each caller.

Yes, the strings are duplicated:

$ strings vmlinux0 | grep 'cannot be called before register_netdev'
6netif_stop_queue() cannot be called before register_netdev()
6tun: netif_stop_queue() cannot be called before register_netdev()
6cc770: netif_stop_queue() cannot be called before register_netdev()
63c589_cs: netif_stop_queue() cannot be called before register_netdev()
63c574_cs: netif_stop_queue() cannot be called before register_netdev()
6typhoon netif_stop_queue() cannot be called before register_netdev()
6axnet_cs: netif_stop_queue() cannot be called before register_netdev()
6pcnet_cs: netif_stop_queue() cannot be called before register_netdev()
...

However, they amount to only ~5.7k of the 27k:

$ strings vmlinux0 | grep 'cannot be called before register_netdev' | wc -c
5731

>> f0 80 8f e0 01 00 00 01  lock orb $0x1,0x1e0(%rdi)
>
> This is your set bit operation. If you were to drop the whole WARN_ON
> then this is the only thing you would be inlining.

It's up to the networking people to decide. I would happily send a
patch which drops the WARN_ON if they say that's OK with them. Davem?

> That is only 8 bytes in size, which would probably be comparable to
> the callq and register sorting needed for a function call.

"lock or" in my tests takes 21 cycles even on an exclusively cached
L1 data cache line. An added "call+ret" is 4-5 cycles.

> Have you done any performance testing on this change?

No.

> I suspect there will likely be a noticeable impact on some tests.

(1) It's a *transmit off* operation. Usually it means we have to turn
transmit off because the hw TX queue is full, so the bottleneck is
likely the network, not the CPU.

(2) It was auto-deinlined by gcc anyway. We were already unknowingly
using the uninlined version for some time. Apparently, nobody noticed.

-- 
vda