2006-08-03 13:48:36

by Arnd Hannemann

Subject: problems with e1000 and jumboframes

Hi,

I'm running vanilla 2.6.17.6, and if I try to set the MTU of my e1000 NIC
to 9000 bytes, page allocation failures occur (see below).

However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
enabled, so there should be plenty of memory available. HIGHMEM support
is off. The e1000 nic seems to be an 82540EM, which to my knowledge
should support jumboframes.

However I can't always reproduce this on a freshly booted system, so
someone else may be the culprit and leaking pages?

Any ideas how to debug this?

kernel config and other stuff available:
http://arndnet.de/~arnd/config-2.6.17.6
http://arndnet.de/~arnd/lsmod.txt
http://arndnet.de/~arnd/lspci.txt
http://arndnet.de/~arnd/dmesg.txt
http://arndnet.de/~arnd/slabinfo.txt


> e1000: eth1: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> e:0 free:3308 slab:41895 mapped:119264 pagetables:392
> DMA free:3576kB min:68kB low:84kB high:100kB active:4144kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 880 880
> DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 880 880
> Normal free:9656kB min:3756kB low:4692kB high:5632kB active:593312kB inactive:116408kB present:901120kB pages_scanned:37 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0 0 0
> DMA: 256*4kB 47*8kB 2*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3576kB
> DMA32: empty
> Normal: 1910*4kB 106*8kB 61*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9656kB
> HighMem: empty
> Swap cache: add 333601, delete 331441, find 397667/415025, race 0+0
> Free swap = 937756kB
> Total swap = 979956kB
> Free swap: 937756kB
> 229376 pages of RAM
> 0 pages of HIGHMEM
> 2731 reserved pages
> 69480 pages shared
> 2160 pages swap cached
> 4 pages dirty
> 0 pages writeback
> 119264 pages mapped
> 41895 pages slab
> 392 pages pagetables
> kswapd0: page allocation failure. order:3, mode:0x20
> <c01369e2> __alloc_pages+0x1f2/0x2d0 <c0149c92> kmem_getpages+0x32/0xa0
> <c014a8fb> cache_grow+0x9b/0x150 <c014aae3> cache_alloc_refill+0x133/0x1b0
> <c014ae7e> __kmalloc+0x5e/0x70 <c024302a> __alloc_skb+0x4a/0x100
> <f8b6d1f7> e1000_alloc_rx_buffers+0x227/0x3a0 [e1000] <c0113c57> __wake_up_common+0x37/0x70
> <f8b6c807> e1000_clean_rx_irq+0x247/0x520 [e1000] <c01bfab8> end_that_request_last+0x98/0xd0
> <f8b6c2e0> e1000_intr+0x60/0x100 [e1000] <c0130f89> handle_IRQ_event+0x29/0x60
> <c0131012> __do_IRQ+0x52/0xa0 <c010567e> do_IRQ+0x3e/0x60
> =======================
> <c0103aea> common_interrupt+0x1a/0x20 <c011b6fe> __do_softirq+0x2e/0x90
> <c0105791> do_softirq+0x41/0x50
> =======================
> <c0105685> do_IRQ+0x45/0x60 <c0103aea> common_interrupt+0x1a/0x20
> <c01c8f52> _atomic_dec_and_lock+0x2/0x10 <c01631de> dput+0x1e/0x120
> <c01635f6> prune_dcache+0xe6/0xf0 <c0163914> shrink_dcache_memory+0x14/0x40
> <c013a09f> shrink_slab+0x16f/0x1d0 <c0137d56> throttle_vm_writeout+0x26/0x70
> <c013b413> balance_pgdat+0x2e3/0x3b0 <c013b5d3> kswapd+0xf3/0x110
> <c0127fb0> autoremove_wake_function+0x0/0x50 <c0127fb0> autoremove_wake_function+0x0/0x50
> <c013b4e0> kswapd+0x0/0x110 <c0100fe5> kernel_thread_helper+0x5/0x10

Thanks,
Arnd Hannemann





2006-08-03 13:59:38

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann ([email protected]) wrote:
> Hi,
>
> I'm running vanilla 2.6.17.6, and if I try to set the MTU of my e1000 NIC
> to 9000 bytes, page allocation failures occur (see below).
>
> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
> enabled, so there should be plenty of memory available. HIGHMEM support
> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
> should support jumboframes.

But it does not support splitting them into page-sized chunks, so it
requires the whole jumbo frame to be allocated in one contiguous chunk. 9k
will be turned into a 16k allocation (order 3), since SLAB uses
power-of-2 allocation.

> However I can't always reproduce this on a freshly booted system, so
> someone else may be the culprit and leaking pages?

You will almost 100% reproduce it after "find / > /dev/null".

> Any ideas how to debug this?

It cannot be debugged - you have caught a memory fragmentation problem,
which is quite common.

> > kswapd0: page allocation failure. order:3, mode:0x20

e1000 tries to allocate 3-order pages atomically?
Well, that's wrong.

> Thanks,
> Arnd Hannemann

--
Evgeniy Polyakov

2006-08-03 14:24:28

by Benjamin LaHaise

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote:
> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
> enabled, so there should be plenty of memory available. HIGHMEM support
> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
> should support jumboframes.

> However I can't always reproduce this on a freshly booted system, so
> someone else may be the culprit and leaking pages?
>
> Any ideas how to debug this?

This is memory fragmentation, and all you can do is work around it until
the e1000 driver is changed to split jumbo frames up on rx. Here are a
few ideas that should improve things for you:

- switch to a 2GB/2GB split to recover the memory lost to highmem
(see Processor Type and Features / Memory split)
- increase /proc/sys/vm/min_free_kbytes -- more free memory will
improve the odds that enough unfragmented memory is available for
incoming network packets
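
To check the current value before raising it, the sysctl file named above can simply be read. A minimal sketch in C, assuming the file holds a single decimal number of kilobytes; read_kbytes is a hypothetical helper, not part of any kernel or libc API:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one decimal kB value from an already opened
 * stream, e.g. fopen("/proc/sys/vm/min_free_kbytes", "r").
 * Returns the value, or -1 if no number could be parsed. */
long read_kbytes(FILE *f)
{
    long kb;

    assert(f != NULL);
    if (fscanf(f, "%ld", &kb) != 1)
        return -1;
    return kb;
}
```

The value is changed by writing a new number to the same file as root.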

I hope this helps.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-08-03 14:37:31

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes

Hi,

Evgeniy Polyakov wrote:
> On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann ([email protected]) wrote:
>> Hi,
>>
>> I'm running vanilla 2.6.17.6, and if I try to set the MTU of my e1000 NIC
>> to 9000 bytes, page allocation failures occur (see below).
>>
>> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
>> enabled, so there should be plenty of memory available. HIGHMEM support
>> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
>> should support jumboframes.
>
> But it does not support splitting them into page-sized chunks, so it
> requires the whole jumbo frame to be allocated in one contiguous chunk. 9k
> will be turned into a 16k allocation (order 3), since SLAB uses
> power-of-2 allocation.

Hmm, ok, what is the meaning of this line then:
> Normal: 44578*4kB 11117*8kB 800*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 280240kB

Are these the allocations which already happened? I thought they would
represent the free memory, not the memory already in use?

>> However I can't always reproduce this on a freshly booted system, so
>> someone else may be the culprit and leaking pages?
>
> You will almost 100% reproduce it after "find / > /dev/null".
>
>> Any ideas how to debug this?
>
> It cannot be debugged - you have caught a memory fragmentation problem,
> which is quite common.

That's too bad :-(
However it seems hard for me to imagine why there is no contiguous chunk
of 16k when there are hundreds of Mbyte free. Can't those other pages be
moved by the kernel, if a higher order allocation is requested?

>
>>> kswapd0: page allocation failure. order:3, mode:0x20
>
> e1000 tries to allocate 3-order pages atomically?
> Well, that's wrong.
>

Why? After your explanation that makes sense to me. The driver needs
one contiguous chunk for each 9k packet buffer and thus requests a
3-order page of 16k. Or do I still not understand this?

> Evgeniy Polyakov

Thank you for your fast answer,
Arnd Hannemann



2006-08-03 14:49:24

by Krzysztof Oledzki

Subject: Re: problems with e1000 and jumboframes



On Thu, 3 Aug 2006, Benjamin LaHaise wrote:

> On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote:
>> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
>> enabled, so there should be plenty of memory available. HIGHMEM support
>> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
>> should support jumboframes.
>
>> However I can't always reproduce this on a freshly booted system, so
>> someone else may be the culprit and leaking pages?
>>
>> Any ideas how to debug this?
>
> This is memory fragmentation, and all you can do is work around it until
> the e1000 driver is changed to split jumbo frames up on rx. Here are a
> few ideas that should improve things for you:
>
> - switch to a 2GB/2GB split to recover the memory lost to highmem
> (see Processor Type and Features / Memory split)
With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be
enough...

> - increase /proc/sys/vm/min_free_kbytes -- more free memory will
> improve the odds that enough unfragmented memory is available for
> incoming network packets

True. IMO, 65535 is a good starting point.

Best regards,
Krzysztof Olędzki

2006-08-03 14:52:42

by Benjamin LaHaise

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 04:49:15PM +0200, Krzysztof Oledzki wrote:
> With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be
> enough...

Nope, you lose ~128MB of RAM for vmalloc space.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-08-03 15:03:46

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 04:37:35PM +0200, Arnd Hannemann ([email protected]) wrote:
> >> I'm running vanilla 2.6.17.6, and if I try to set the MTU of my e1000 NIC
> >> to 9000 bytes, page allocation failures occur (see below).
> >>
> >> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
> >> enabled, so there should be plenty of memory available. HIGHMEM support
> >> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
> >> should support jumboframes.
> >
> > But it does not support splitting them into page-sized chunks, so it
> > requires the whole jumbo frame to be allocated in one contiguous chunk. 9k
> > will be turned into a 16k allocation (order 3), since SLAB uses
> > power-of-2 allocation.
>
> Hmm, ok, what is the meaning of this line then:
> > Normal: 44578*4kB 11117*8kB 800*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 280240kB
>
> Are these the allocations which already happened? I thought they would
> represent the free memory, not the memory already in use?

3-order is 32k actually.
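
For reading such a line: each count*size term is the number of free blocks of that size, so the line does describe free memory. What matters for an order-3 request is how many blocks of 32kB or more are left. A small C sketch that totals them; count_free_blocks is a hypothetical helper, and the parsing assumes the "count*sizekB" format shown in this thread:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count free blocks of at least min_kb in a buddy free-list line such as
 * "Normal: 1910*4kB 106*8kB ... = 9656kB". Parsing stops at the '='. */
long count_free_blocks(const char *line, long min_kb)
{
    long total = 0;
    const char *p;

    assert(line != NULL);
    p = strchr(line, ':');
    p = p ? p + 1 : line;           /* skip the zone name if present */

    while (*p != '\0' && *p != '=') {
        char *end;
        long count, size_kb;

        count = strtol(p, &end, 10);
        if (end == p || *end != '*') {
            /* not a "count*size" term: step over it */
            p = (end == p) ? p + 1 : end;
            continue;
        }
        p = end + 1;                /* move past the '*' */
        size_kb = strtol(p, &end, 10);
        if (end == p)
            continue;
        p = end;
        if (size_kb >= min_kb)
            total += count;
    }
    return total;
}
```

For the Normal line quoted above this gives 2 blocks of 32kB or larger (the 64kB and 128kB entries), against 802 blocks of 16kB or larger.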

> >> However I can't always reproduce this on a freshly booted system, so
> >> someone else may be the culprit and leaking pages?
> >
> > You will almost 100% reproduce it after "find / > /dev/null".
> >
> >> Any ideas how to debug this?
> >
> > It cannot be debugged - you have caught a memory fragmentation problem,
> > which is quite common.
>
> That's too bad :-(
> However it seems hard for me to imagine why there is no contiguous chunk
> of 16k when there are hundreds of Mbyte free. Can't those other pages be
> moved by the kernel, if a higher order allocation is requested?

e1000 is trying to allocate 32k, not 16 for jumbo frames.

> >>> kswapd0: page allocation failure. order:3, mode:0x20
> >
> > e1000 tries to allocate 3-order pages atomically?
> > Well, that's wrong.
> >
>
> Why? After your explanation that makes sense to me. The driver needs
> one contiguous chunk for each 9k packet buffer and thus requests a
> 3-order page of 16k. Or do I still not understand this?

Correct, except that it wants 32k.
e1000 logic is following:
align frame size to power-of-two, then skb_alloc adds a little
(sizeof(struct skb_shared_info)) at the end, and this ends up
in 32k request just for 9k jumbo frame.

And it wants it in atomic context.
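
The arithmetic can be sketched in C as follows. This is an illustration under assumptions, not driver code: 9018 bytes stands for a 9000-byte MTU plus Ethernet header and FCS, 164 is a placeholder for sizeof(struct skb_shared_info) (it depends on the kernel build), any extra headroom is ignored, the 256-byte minimum buffer is assumed, and the allocator is assumed to round up to powers of two as described above. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n, starting from floor (itself a power of two). */
static size_t round_up_pow2(size_t n, size_t floor)
{
    size_t v = floor;

    assert(floor != 0 && (floor & (floor - 1)) == 0);
    while (v < n)
        v <<= 1;
    return v;
}

/* Bytes finally requested for one rx buffer. shinfo_bytes stands in for
 * sizeof(struct skb_shared_info). */
size_t rx_alloc_bytes(size_t max_frame, size_t shinfo_bytes)
{
    /* step 1: the driver rounds the frame size up to a power of two */
    size_t rx_buffer_len = round_up_pow2(max_frame, 256);

    /* step 2: shared info is appended, then the slab rounds up again */
    return round_up_pow2(rx_buffer_len + shinfo_bytes, 32);
}

/* Page order of an allocation: smallest n with page_size << n >= bytes. */
unsigned int page_order(size_t bytes, size_t page_size)
{
    unsigned int order = 0;

    assert(page_size != 0);
    while ((page_size << order) < bytes)
        order++;
    return order;
}
```

With these assumptions a 9018-byte frame becomes a 16384-byte buffer and then a 32768-byte request, which is order 3 with 4kB pages, while a 1518-byte frame ends up in a 4096-byte request.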

--
Evgeniy Polyakov

2006-08-03 15:04:54

by Krzysztof Oledzki

Subject: Re: problems with e1000 and jumboframes



On Thu, 3 Aug 2006, Benjamin LaHaise wrote:

> On Thu, Aug 03, 2006 at 04:49:15PM +0200, Krzysztof Oledzki wrote:
>> With 1 GB of RAM full 1GB/3GB (CONFIG_VMSPLIT_3G_OPT) seems to be
>> enough...
>
> Nope, you lose ~128MB of RAM for vmalloc space.

Not sure:

Linux version 2.6.17.7 (root@r1) (gcc version 3.4.6) #1 SMP PREEMPT Fri Jul 28 18:05:40 CEST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 000000003ffc0000 (usable)
BIOS-e820: 000000003ffc0000 - 000000003ffcfc00 (ACPI data)
BIOS-e820: 000000003ffcfc00 - 000000003ffff000 (reserved)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec90000 (reserved)
BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
1023MB LOWMEM available.
found SMP MP-table at 000fe710
On node 0 totalpages: 262080
DMA zone: 4096 pages, LIFO batch:0
Normal zone: 257984 pages, LIFO batch:31
(...)

$ zcat /proc/config.gz |grep VMSPLIT
# CONFIG_VMSPLIT_3G is not set
CONFIG_VMSPLIT_3G_OPT=y
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_1G is not set


Best regards,

Krzysztof Olędzki

2006-08-03 15:09:00

by Krzysztof Oledzki

Subject: Re: problems with e1000 and jumboframes



On Thu, 3 Aug 2006, Evgeniy Polyakov wrote:
<CUT>
>> Why? After your explanation that makes sense to me. The driver needs
>> one contiguous chunk for each 9k packet buffer and thus requests a
>> 3-order page of 16k. Or do I still not understand this?
>
> Correct, except that it wants 32k.
> e1000 logic is following:
> align frame size to power-of-two,
16K?

> then skb_alloc adds a little
> (sizeof(struct skb_shared_info)) at the end, and this ends up
> in 32k request just for 9k jumbo frame.

Strange, why this skb_shared_info cannot be added before first alignment?
And what about smaller frames like 1500, does this driver behave similar
(first align then add)?

Best regards,

Krzysztof Olędzki

2006-08-03 15:16:45

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 05:08:51PM +0200, Krzysztof Oledzki ([email protected]) wrote:
> >>Why? After your explanation that makes sense to me. The driver needs
> >>one contiguous chunk for each 9k packet buffer and thus requests a
> >>3-order page of 16k. Or do I still not understand this?
> >
> >Correct, except that it wants 32k.
> >e1000 logic is following:
> >align frame size to power-of-two,
> 16K?

Yep.

> >then skb_alloc adds a little
> >(sizeof(struct skb_shared_info)) at the end, and this ends up
> >in 32k request just for 9k jumbo frame.
>
> Strange, why this skb_shared_info cannot be added before first alignment?
> And what about smaller frames like 1500, does this driver behave similar
> (first align then add)?

It can be.
Could attached (completely untested) patch help?

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index da62db8..cf6506d 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3132,6 +3132,8 @@ #define MAX_STD_JUMBO_FRAME_SIZE 9234
* larger slab size
* i.e. RXBUFFER_2048 --> size-4096 slab */

+ max_frame += sizeof(struct skb_shared_info);
+
if (max_frame <= E1000_RXBUFFER_256)
adapter->rx_buffer_len = E1000_RXBUFFER_256;
else if (max_frame <= E1000_RXBUFFER_512)
@@ -3146,6 +3148,8 @@ #define MAX_STD_JUMBO_FRAME_SIZE 9234
adapter->rx_buffer_len = E1000_RXBUFFER_8192;
else if (max_frame <= E1000_RXBUFFER_16384)
adapter->rx_buffer_len = E1000_RXBUFFER_16384;
+
+ max_frame -= sizeof(struct skb_shared_info);

/* adjust allocation if LPE protects us, and we aren't using SBP */
if (!adapter->hw.tbi_compatibility_on &&

> Best regards,
>
> Krzysztof Olędzki


--
Evgeniy Polyakov

2006-08-03 15:23:59

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 05:08:51PM +0200, Krzysztof Oledzki ([email protected]) wrote:
> >then skb_alloc adds a little
> >(sizeof(struct skb_shared_info)) at the end, and this ends up
> >in 32k request just for 9k jumbo frame.
>
> Strange, why this skb_shared_info cannot be added before first alignment?
> And what about smaller frames like 1500, does this driver behave similar
> (first align then add)?

e1000 aligns it to 2k, which will be transformed into 4k allocation.

> Best regards,
>
> Krzysztof Olędzki


--
Evgeniy Polyakov

2006-08-03 15:32:55

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes

Benjamin LaHaise wrote:
> On Thu, Aug 03, 2006 at 03:48:39PM +0200, Arnd Hannemann wrote:
>> However the box is a VIA Epia MII12000 with 1 GB of Ram and 1 GB of swap
>> enabled, so there should be plenty of memory available. HIGHMEM support
>> is off. The e1000 nic seems to be an 82540EM, which to my knowledge
>> should support jumboframes.
>
>> However I can't always reproduce this on a freshly booted system, so
>> someone else may be the culprit and leaking pages?
>>
>> Any ideas how to debug this?
>
> This is memory fragmentation, and all you can do is work around it until
> the e1000 driver is changed to split jumbo frames up on rx. Here are a
> few ideas that should improve things for you:
>
> - switch to a 2GB/2GB split to recover the memory lost to highmem
> (see Processor Type and Features / Memory split)
> - increase /proc/sys/vm/min_free_kbytes -- more free memory will
> improve the odds that enough unfragmented memory is available for
> incoming network packets
>
> I hope this helps.

:-) Yes it did. I increased /proc/sys/vm/min_free_kbytes to 65000 and
now it works. Thank you!

>
> -ben

Best regards,
Arnd Hannemann



2006-08-03 15:37:37

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes


Evgeniy Polyakov wrote:
> On Thu, Aug 03, 2006 at 05:08:51PM +0200, Krzysztof Oledzki ([email protected]) wrote:
>>>> Why? After your explanation that makes sense to me. The driver needs
>>>> one contiguous chunk for each 9k packet buffer and thus requests a
>>>> 3-order page of 16k. Or do I still not understand this?
>>> Correct, except that it wants 32k.
>>> e1000 logic is following:
>>> align frame size to power-of-two,
>> 16K?
>
> Yep.
>
>>> then skb_alloc adds a little
>>> (sizeof(struct skb_shared_info)) at the end, and this ends up
>>> in 32k request just for 9k jumbo frame.
>> Strange, why this skb_shared_info cannot be added before first alignment?
>> And what about smaller frames like 1500, does this driver behave similar
>> (first align then add)?
>
> It can be.
> Could attached (completely untested) patch help?

I will try this in a minute. However is there any way to see which
allocation e1000 does without triggering allocation failures? ;-)

Thanks,
Arnd Hannemann


2006-08-03 15:41:39

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 07:16:31PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> > >then skb_alloc adds a little
> > >(sizeof(struct skb_shared_info)) at the end, and this ends up
> > >in 32k request just for 9k jumbo frame.
> >
> > Strange, why this skb_shared_info cannot be added before first alignment?
> > And what about smaller frames like 1500, does this driver behave similar
> > (first align then add)?
>
> It can be.
> Could attached (completely untested) patch help?

Actually this patch will not help, this new one could.

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index da62db8..1514628 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3978,9 +3978,11 @@ e1000_alloc_rx_buffers(struct e1000_adap
buffer_info = &rx_ring->buffer_info[i];

while (cleaned_count--) {
- if (!(skb = buffer_info->skb))
+ if (!(skb = buffer_info->skb)) {
+ if (SKB_DATA_ALIGN(adapter->hw.max_frame_size) + sizeof(struct skb_shared_info) <= bufsz)
+ bufsz -= sizeof(struct skb_shared_info);
skb = dev_alloc_skb(bufsz);
- else {
+ } else {
skb_trim(skb, 0);
goto map_skb;
}

--
Evgeniy Polyakov

2006-08-03 15:43:35

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 05:37:41PM +0200, Arnd Hannemann ([email protected]) wrote:
> > It can be.
> > Could attached (completely untested) patch help?
>
> I will try this in a minute. However is there any way to see which
> allocation e1000 does without triggering allocation failures? ;-)

One can add a printk at the end of e1000_change_mtu() and dump
adapter->rx_buffer_len + NET_IP_ALIGN there.

> Thanks,
> Arnd Hannemann

--
Evgeniy Polyakov

2006-08-03 15:57:40

by Chris Leech

Subject: Re: problems with e1000 and jumboframes

On 8/3/06, Evgeniy Polyakov <[email protected]> wrote:

> > Strange, why this skb_shared_info cannot be added before first alignment?
> > And what about smaller frames like 1500, does this driver behave similar
> > (first align then add)?
>
> It can be.
> Could attached (completely untested) patch help?

Note that e1000 uses power of two buffers because that's what the
hardware supports. Also, there's no programmable MTU - only a single
bit for "long packet enable" that disables frame length checks when
using jumbo frames. That means that if you tell the e1000 it has a
16k buffer, and a 16k frame shows up on the wire, it's going to write
to the entire 16k regardless of your 9k MTU setting. If a 32k frame
shows up, two full 16k buffers get written to (OK, assuming the frame
can fit into the receive FIFO)

That's why I've always been against trying to optimize the allocation
sizes in the driver, even with your small change the skb_shinfo area
can get corrupted. It may be unlikely, because the frame still has to
be valid, but some switches aren't real picky about what sized frame
they'll forward on if you enable jumbo support either. So any box on
the LAN could send you larger than MTU frames in an attempt to corrupt
memory.

I believe that if you tell a hardware device it has a buffer of a
certain size, you need to be prepared for that entire buffer to get
written to. Unfortunately that means wasteful allocations for e1000
if a single buffer per frame is going to be used.
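
The two effects described here can be put into numbers with a short C sketch. It only models the arithmetic from this message and does not touch hardware; the function names are hypothetical, and the 16220-byte figure used in the note below is an illustrative value for a buffer that was allocated 164 bytes short of the 16384 the hardware was told about.

```c
#include <assert.h>
#include <stddef.h>

/* Number of receive buffers a frame occupies when the hardware fills
 * fixed-size buffers of hw_buf_len bytes each. */
size_t buffers_needed(size_t frame_len, size_t hw_buf_len)
{
    assert(hw_buf_len != 0);
    return (frame_len + hw_buf_len - 1) / hw_buf_len;   /* round up */
}

/* Bytes written past the usable data area of the first buffer when the
 * driver reserved only usable_len bytes but told the hardware hw_buf_len. */
size_t overrun_bytes(size_t frame_len, size_t hw_buf_len, size_t usable_len)
{
    /* the hardware writes at most one full buffer into the first slot */
    size_t written = frame_len < hw_buf_len ? frame_len : hw_buf_len;

    return written > usable_len ? written - usable_len : 0;
}
```

A 32768-byte frame needs two 16384-byte buffers. A 9018-byte frame stays inside a 16220-byte data area, but a full 16384-byte frame would write 164 bytes beyond it, which is the corruption risk raised above.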

- Chris

2006-08-03 16:11:09

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 08:57:36AM -0700, Chris Leech ([email protected]) wrote:
> On 8/3/06, Evgeniy Polyakov <[email protected]> wrote:
>
> >> Strange, why this skb_shared_info cannot be added before first alignment?
> >> And what about smaller frames like 1500, does this driver behave similar
> >> (first align then add)?
> >
> >It can be.
> >Could attached (completely untested) patch help?
>
> Note that e1000 uses power of two buffers because that's what the
> hardware supports. Also, there's no programmable MTU - only a single
> bit for "long packet enable" that disables frame length checks when
> using jumbo frames. That means that if you tell the e1000 it has a
> 16k buffer, and a 16k frame shows up on the wire, it's going to write
> to the entire 16k regardless of your 9k MTU setting. If a 32k frame
> shows up, two full 16k buffers get written to (OK, assuming the frame
> can fit into the receive FIFO)

The maximum e1000 frame is 16128 bytes, which leaves enough room below the
16k boundary for the shared info.
My patch just tricks the refill logic into requesting slightly less
than was set up when the MTU was changed.

> That's why I've always been against trying to optimize the allocation
> sizes in the driver, even with your small change the skb_shinfo area
> can get corrupted. It may be unlikely, because the frame still has to
> be valid, but some switches aren't real picky about what sized frame
> they'll forward on if you enable jumbo support either. So any box on
> the LAN could send you larger than MTU frames in an attempt to corrupt
> memory.

It is a trivial patch and it may be incorrect (especially for small
packets), but it is a hint that a 9k jumbo frame should not require a 32k
allocation.

> I believe that if you tell a hardware device it has a buffer of a
> certain size, you need to be prepared for that entire buffer to get
> written to. Unfortunately that means wasteful allocations for e1000
> if a single buffer per frame is going to be used.

The hardware is not affected; the second patch just checks whether there is
enough space (e1000 stores the real MTU). I cannot believe that a modern NIC
like e1000 does not know the size of the received packet in the receive
interrupt. If that is true, then in general you are right and some more
clever mechanism should be used (at least turn the hack off for small
packets and only enable it for jumbo frames of less than 16k, where there
is always room). If the size of the received packet is known, then it is
enough to compare the aligned size and the size of the packet to make the
allocation decision.

> - Chris

--
Evgeniy Polyakov

2006-08-03 16:25:52

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes

Chris Leech wrote:
> On 8/3/06, Evgeniy Polyakov <[email protected]> wrote:
>
>> > Strange, why this skb_shared_info cannot be added before first
>> alignment?
>> > And what about smaller frames like 1500, does this driver behave
>> similar
>> > (first align then add)?
>>
>> It can be.
>> Could attached (completely untested) patch help?
>
> Note that e1000 uses power of two buffers because that's what the
> hardware supports. Also, there's no programmable MTU - only a single
> bit for "long packet enable" that disables frame length checks when
> using jumbo frames. That means that if you tell the e1000 it has a
> 16k buffer, and a 16k frame shows up on the wire, it's going to write
> to the entire 16k regardless of your 9k MTU setting. If a 32k frame
> shows up, two full 16k buffers get written to (OK, assuming the frame
> can fit into the receive FIFO)
>
> That's why I've always been against trying to optimize the allocation
> sizes in the driver, even with your small change the skb_shinfo area
> can get corrupted. It may be unlikely, because the frame still has to
> be valid, but some switches aren't real picky about what sized frame
> they'll forward on if you enable jumbo support either. So any box on
> the LAN could send you larger than MTU frames in an attempt to corrupt
> memory.
>
> I believe that if you tell a hardware device it has a buffer of a
> certain size, you need to be prepared for that entire buffer to get
> written to. Unfortunately that means wasteful allocations for e1000
> if a single buffer per frame is going to be used.

You say "if a single buffer per frame is going to be used". So,
if I understood you correctly, I could set the MTU to, let's say, 4000.
Then the driver would enable the "jumbo frame bit" of the hardware, and
allocate only a 4k rx buffer, right? (and allocate 8k, because of
skb_shinfo)
Now if a 9k frame arrives, the hardware will accept it regardless of
the 4k MTU and will split it across 3x 4k rx buffers?
Does the current driver work this way? That would be great.

Perhaps one should then change the driver so that the MTU can be
changed independently of the buffer size?

> - Chris
> -

Thanks,
Arnd Hannemann.



2006-08-03 18:09:01

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes

--- linux-2.6.17.6/drivers/net/e1000/e1000_main.c 2006-08-03 17:38:53.000000000 +0200
+++ linux-2.6.17.6.patched/drivers/net/e1000/e1000_main.c 2006-08-03 19:38:53.000000000 +0200
@@ -3843,9 +3843,13 @@
buffer_info = &rx_ring->buffer_info[i];

while (cleaned_count--) {
- if (!(skb = buffer_info->skb))
+ if (!(skb = buffer_info->skb)) {
+ if (SKB_DATA_ALIGN(adapter->hw.max_frame_size) + sizeof(struct skb_shared_info) <= bufsz) {
+ bufsz -= sizeof(struct skb_shared_info);
+ printk(KERN_INFO "%s - bufsz %d\n",e1000_driver_string, bufsz);
+ }
skb = dev_alloc_skb(bufsz);
- else {
+ } else {
skb_trim(skb, 0);
goto map_skb;
}


Attachments:
patch.txt (661.00 B)

2006-08-03 18:29:33

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 08:09:07PM +0200, Arnd Hannemann ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > On Thu, Aug 03, 2006 at 07:16:31PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> >>>> then skb_alloc adds a little
> >>>> (sizeof(struct skb_shared_info)) at the end, and this ends up
> >>>> in 32k request just for 9k jumbo frame.
> >>> Strange, why this skb_shared_info cannot be added before first alignment?
> >>> And what about smaller frames like 1500, does this driver behave similar
> >>> (first align then add)?
> >> It can be.
> >> Could attached (completely untested) patch help?
> >
> > Actually this patch will not help, this new one could.
> >
>
> I applied the attached patch and got this output:
>
> > Intel(R) PRO/1000 Network Driver - bufsz 13762
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16058
> > Intel(R) PRO/1000 Network Driver - bufsz 15894
> > Intel(R) PRO/1000 Network Driver - bufsz 15730
> > Intel(R) PRO/1000 Network Driver - bufsz 15566
> > Intel(R) PRO/1000 Network Driver - bufsz 15402
> > Intel(R) PRO/1000 Network Driver - bufsz 15238
> > Intel(R) PRO/1000 Network Driver - bufsz 15074
> > Intel(R) PRO/1000 Network Driver - bufsz 14910
> > Intel(R) PRO/1000 Network Driver - bufsz 14746
> > Intel(R) PRO/1000 Network Driver - bufsz 14582
> > Intel(R) PRO/1000 Network Driver - bufsz 14418
> > Intel(R) PRO/1000 Network Driver - bufsz 14254
> > Intel(R) PRO/1000 Network Driver - bufsz 14090
> > Intel(R) PRO/1000 Network Driver - bufsz 13926
> > Intel(R) PRO/1000 Network Driver - bufsz 13762
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16058
> > Intel(R) PRO/1000 Network Driver - bufsz 15894
> > Intel(R) PRO/1000 Network Driver - bufsz 15730
> > Intel(R) PRO/1000 Network Driver - bufsz 15566
> > Intel(R) PRO/1000 Network Driver - bufsz 15402
> > Intel(R) PRO/1000 Network Driver - bufsz 15238
> > Intel(R) PRO/1000 Network Driver - bufsz 15074
> > Intel(R) PRO/1000 Network Driver - bufsz 14910
> > Intel(R) PRO/1000 Network Driver - bufsz 14746
> > Intel(R) PRO/1000 Network Driver - bufsz 14582
> > Intel(R) PRO/1000 Network Driver - bufsz 14418
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
> > Intel(R) PRO/1000 Network Driver - bufsz 16222
>
> I'm a bit puzzled that there are so many allocations. However, the patch
> seems to work (at least it does not obviously break things for me yet).

Very strange output actually - comments in the code say that frame size
can not exceed 0x3f00, but in this log it is much more than 16128 and
that is after sizeof(struct skb_shared_info) has been removed...
Could you please remove the debug output and run a network stress test in
parallel with high disk/memory activity to check that it does not break
your system, and watch /proc/slabinfo for the 16k and 32k sized pools?
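
One possible reading of the log, offered as an observation rather than a confirmed diagnosis: consecutive values differ by exactly 164, and 16222 is 16386 - 164. That would fit bufsz being reduced again on every pass through the while loop instead of once per call. The C sketch below replays only that arithmetic. The constants are inferred from the log and not taken from the driver source: 164 as sizeof(struct skb_shared_info) on this build, 16386 as the initial bufsz, 9024 as the aligned frame size for a 9000-byte MTU, and 16 loop passes per call. replay_bufsz is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Replays the size arithmetic of the patched refill loop (no allocation):
 * bufsz shrinks by shinfo on every iteration while the frame still fits.
 * Writes one value per shrinking iteration into out[] and returns how many
 * values were written; out[] must have room for `iterations` entries. */
size_t replay_bufsz(unsigned int bufsz, unsigned int shinfo,
                    unsigned int aligned_frame, size_t iterations,
                    unsigned int *out)
{
    size_t n = 0;

    assert(out != NULL);
    while (iterations--) {
        if (aligned_frame + shinfo <= bufsz) {
            bufsz -= shinfo;
            out[n++] = bufsz;   /* the debug printk would fire here */
        }
    }
    return n;
}
```

Started at 16386 with 16 iterations, the sketch yields 16222, 16058 and so on down to 13762, the same run of 16 values that appears in the log.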

> Best regards
> Arnd


--
Evgeniy Polyakov

2006-08-03 20:32:15

by Chris Leech

Subject: Re: problems with e1000 and jumboframes

> Maximum e1000 frame is 16128 bytes, which is enough before being rounded
> to 16k to have a space for shared info.
> My patch just tricks refilling logic to request to allocate slightly less
> than was setup when mtu was changed.

The maximum supported MTU size differs between e1000 devices due to
differences in FIFO size. For performance reasons the driver won't
enable an MTU that doesn't allow for at least two frames in the Tx FIFO
at once - you really want e1000 to be able to DMA the next frame into
the Tx FIFO while the current one is going out on the wire. This doesn't
change the fact that with LPE set, anything that can fit into the Rx
FIFO and has a valid CRC will be DMAed into buffers regardless of
length.

> Hardware is not affected, the second patch just checks if there is enough
> space (e1000 stores the real mtu). I cannot believe that such a modern NIC
> as e1000 cannot know the size of the received packet in the receive
> interrupt. If that is true, then in general you are right and some more
> clever mechanism should be used (at least turn the hack off for small
> packets and only enable it for less than 16 jumbo frames, where space
> is always available). If the size of the received packet is known, then
> it is enough to compare the aligned size and the size of the packet to
> make a decision for allocation.

You're changing the size of the buffer without telling the hardware.
In the interrupt context e1000 knows the size of what was DMAed into
the skb, but that's after the fact. So e1000 could detect that memory
was corrupted, but not prevent it if you don't give it power of 2
buffers. Actually, the power of 2 thing doesn't hold true for all
e1000 devices. Some have 1k granularity, but not Arnd's 82540.

You can't know the size of a received packet before it's DMAed into
host memory, no high performance network controller works that way.

- Chris

2006-08-03 20:34:15

by Chris Leech

Subject: Re: problems with e1000 and jumboframes

On 8/3/06, Arnd Hannemann <[email protected]> wrote:
> Well you say "if a single buffer per frame is going to be used". Well,
> if I understood you correctly I could set the MTU to, let's say, 4000.
> Then the driver would enable the "jumbo frame bit" of the hardware, and
> allocate only a 4k rx buffer, right? (and allocate 16k, because of
> skb_shinfo)
> Now if a new 9k frame arrives the hardware will accept it regardless of
> the 4k MTU and will split it into 3x 4k rx buffers?
> Does the current driver work in this way? That would be great.
>
> Perhaps then one should change the driver in a way that the MTU can
> be changed independently of the buffer size?

Yes, e1000 devices will spill over and use multiple buffers for a
single frame. We've been trying to find a good way to use multiple
buffers to take care of these allocation problems. The structure of
the sk_buff does not make it easy. Or should I say that it's the
limitation that drivers are not allowed to chain together multiple
sk_buffs to represent a single frame that does not make it easy.
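
As a rough illustration of the spill-over behaviour described above, the number of receive buffers a single frame occupies is a rounded-up division. The helper name and the sizes used with it are made up for this example; it is plain userspace C, not driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many fixed-size rx buffers one frame occupies
 * when the hardware spills a frame across consecutive buffers. */
static inline size_t rx_buffers_for_frame(size_t frame_len, size_t buf_len)
{
	assert(buf_len > 0);
	return (frame_len + buf_len - 1) / buf_len;	/* round up */
}
```

With 4096 byte buffers, a 9000 byte frame would occupy three of them, which matches the "3x 4k rx buffers" in the question.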

PCI-Express e1000 devices support a feature called header split, where
the protocol headers go into a different buffer from the payload. We
use that today to put headers into the kmalloc() allocated skb->data
area, and payload into one or more skb->frags[] pages. You don't ever
have multiple page allocations from the driver in this mode.

We could try and only use page allocations for older e1000 devices,
putting headers and payload into skb->frags and copying the headers
out into the skb->data area as needed for processing. That would do
away with large allocations, but in Jesse's experiments calling
alloc_page() is slower than kmalloc(), so there can actually be a
performance hit from trying to use page allocations all the time.

It's an interesting problem.

- Chris

2006-08-03 21:40:22

by Arnd Hannemann

Subject: Re: problems with e1000 and jumboframes

Evgeniy Polyakov wrote:
> On Thu, Aug 03, 2006 at 08:09:07PM +0200, Arnd Hannemann ([email protected]) wrote:
>> Evgeniy Polyakov schrieb:
>>> On Thu, Aug 03, 2006 at 07:16:31PM +0400, Evgeniy Polyakov ([email protected]) wrote:
>>>>>> then skb_alloc adds a little
>>>>>> (sizeof(struct skb_shared_info)) at the end, and this ends up
>>>>>> in 32k request just for 9k jumbo frame.
>>>>> Strange, why can this skb_shared_info not be added before the first alignment?
>>>>> And what about smaller frames like 1500, does this driver behave similarly
>>>>> (first align, then add)?
>>>> It can be.
>>>> Could attached (completely untested) patch help?
>>> Actually this patch will not help, this new one could.
>>>
>> I applied the attached patch and got this output:
>>
>>> Intel(R) PRO/1000 Network Driver - bufsz 13762
>>> ...

>> I'm a bit puzzled that there are so many allocations. However, the patch
>> seems to work (at least it has not obviously broken anything for me yet).
>
> Very strange output actually - comments in the code say that the frame size
> cannot exceed 0x3f00, but in this log it is more than 16128, and
> that is after sizeof(struct skb_shared_info) has been removed...
> Could you please remove the debug output and run some network stress test in
> parallel with high disk/memory activity to check that it does not break
> your system, and watch /proc/slabinfo for the 16k and 32k sized pools?

The system still seems to be stable.

From /proc/slabinfo during the netio test:
> size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
> size-32768 84 89 32768 1 8 : tunables 8 4 0 : slabdata 84 89 0
> size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
> size-16384 184 188 16384 1 4 : tunables 8 4 0 : slabdata 184 188 0

Netio results:

NETIO - Network Throughput Benchmark, Version 1.26
(C) 1997-2005 Kai Uwe Rommel

TCP connection established.
Packet size 1k bytes: 72320 KByte/s Tx, 86656 KByte/s Rx.
Packet size 2k bytes: 71400 KByte/s Tx, 94703 KByte/s Rx.
Packet size 4k bytes: 71544 KByte/s Tx, 88463 KByte/s Rx.
Packet size 8k bytes: 70392 KByte/s Tx, 92127 KByte/s Rx.
Packet size 16k bytes: 70512 KByte/s Tx, 102607 KByte/s Rx.
Packet size 32k bytes: 71705 KByte/s Tx, 101083 KByte/s Rx.
Done.

Strangely, receiving seems to be much faster than transmitting.


> --
> Evgeniy Polyakov

Thanks,
Arnd



2006-08-04 05:52:57

by Herbert Xu

Subject: Re: problems with e1000 and jumboframes

Evgeniy Polyakov <[email protected]> wrote:
>
> But it does not support splitting them into page sized chunks, so it
> requires the whole jumbo frame allocation in one contiguous chunk, 9k
> will be transferred into 16k allocation (order 3), since SLAB uses
> power-of-2 allocation.

Actually order 3 is 32KB.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2006-08-04 05:55:05

by David Miller

Subject: Re: problems with e1000 and jumboframes

From: Herbert Xu <[email protected]>
Date: Fri, 04 Aug 2006 15:52:40 +1000

> Evgeniy Polyakov <[email protected]> wrote:
> >
> > But it does not support splitting them into page sized chunks, so it
> > requires the whole jumbo frame allocation in one contiguous chunk, 9k
> > will be transferred into 16k allocation (order 3), since SLAB uses
> > power-of-2 allocation.
>
> Actually order 3 is 32KB.

It's 64KB on my computer :)

2006-08-04 05:59:58

by Herbert Xu

Subject: Re: problems with e1000 and jumboframes

Chris Leech <[email protected]> wrote:
>
> We could try and only use page allocations for older e1000 devices,
> putting headers and payload into skb->frags and copying the headers
> out into the skb->data area as needed for processing. That would do
> away with large allocations, but in Jesse's experiments calling
> alloc_page() is slower than kmalloc(), so there can actually be a
> performance hit from trying to use page allocations all the time.

Interesting. Could you guys post figures on alloc_page speed vs. kmalloc?

Also, getting memory slower is better than not getting it at all :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2006-08-04 05:59:34

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 10:55:01PM -0700, David Miller ([email protected]) wrote:
> From: Herbert Xu <[email protected]>
> Date: Fri, 04 Aug 2006 15:52:40 +1000
>
> > Evgeniy Polyakov <[email protected]> wrote:
> > >
> > > But it does not support splitting them into page sized chunks, so it
> > > requires the whole jumbo frame allocation in one contiguous chunk, 9k
> > > will be transferred into 16k allocation (order 3), since SLAB uses
> > > power-of-2 allocation.
> >
> > Actually order 3 is 32KB.

Yep, e1000 aligns 9k to 16k, then alloc_skb adds the shared info and the
result is aligned to 32k.
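
The arithmetic can be sketched in userspace C as follows. This is only a model under stated assumptions: the function names are invented for the example, the size of the shared info is passed in by the caller rather than taken from a real kernel build, and small paddings are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (n > 0). */
static inline size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Smallest page order whose 2^order pages cover `bytes`. */
static inline unsigned int page_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size > 0);
	while ((page_size << order) < bytes)
		order++;
	return order;
}

/* Model of one rx allocation: the driver rounds the frame up to a power
 * of two, the shared info is appended, and a power-of-two allocator
 * rounds the sum up once more. */
static inline size_t rx_alloc_size(size_t frame_len, size_t shinfo_len)
{
	return roundup_pow2(roundup_pow2(frame_len) + shinfo_len);
}
```

For a 9018 byte frame (9000 byte MTU plus Ethernet header and CRC) and any shared info size between 1 and 16384 bytes, rx_alloc_size() gives 32768. With 4096 byte pages that is order 3; with 8192 byte pages, order 3 would be 64KB.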

> It's 64KB on my computer :)

Nice overhead...

--
Evgeniy Polyakov

2006-08-04 06:15:47

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Fri, Aug 04, 2006 at 03:59:37PM +1000, Herbert Xu ([email protected]) wrote:
> Chris Leech <[email protected]> wrote:
> >
> > We could try and only use page allocations for older e1000 devices,
> > putting headers and payload into skb->frags and copying the headers
> > out into the skb->data area as needed for processing. That would do
> > away with large allocations, but in Jesse's experiments calling
> > alloc_page() is slower than kmalloc(), so there can actually be a
> > performance hit from trying to use page allocations all the time.
>
> Interesting. Could you guys post figures on alloc_page speed vs. kmalloc?

They probably measured kmalloc cache access, which only falls back to
alloc_pages when the cache is refilled, so it will be faster for some short
period of time, but in general (especially for such big allocations)
it is essentially the same.

> Also, getting memory slower is better than not getting it at all :)

Sure.

> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--
Evgeniy Polyakov

2006-08-04 06:20:41

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Thu, Aug 03, 2006 at 01:32:10PM -0700, Chris Leech ([email protected]) wrote:
> >Maximum e1000 frame is 16128 bytes, which is enough before being rounded
> >to 16k to have a space for shared info.
> >My patch just tricks refilling logic to request to allocate slightly less
> >than was setup when mtu was changed.
>
> The maximum supported MTU size differs between e1000 devices due to
> differences in FIFO size. For performance reasons the driver won't
> enable an MTU that doesn't allow for at least two frames in the Tx FIFO
> at once - you really want e1000 to be able to DMA the next frame into
> the Tx FIFO while the current one is going out on the wire. This doesn't
> change the fact that with LPE set, anything that can fit into the Rx
> FIFO and has a valid CRC will be DMAed into buffers regardless of
> length.

But it still must be less than MAX_JUMBO_FRAME_SIZE, which is 16128
bytes; at least that is the maximum allowed MTU in e1000_change_mtu().

> >Hardware is not affected, the second patch just checks if there is enough
> >space (e1000 stores the real mtu). I cannot believe that such a modern NIC
> >as e1000 cannot know the size of the received packet in the receive
> >interrupt. If that is true, then in general you are right and some more
> >clever mechanism should be used (at least turn the hack off for small
> >packets and only enable it for less than 16 jumbo frames, where space
> >is always available). If the size of the received packet is known, then
> >it is enough to compare the aligned size and the size of the packet to
> >make a decision for allocation.
>
> You're changing the size of the buffer without telling the hardware.
> In the interrupt context e1000 knows the size of what was DMAed into
> the skb, but that's after the fact. So e1000 could detect that memory
> was corrupted, but not prevent it if you don't give it power of 2
> buffers. Actually, the power of 2 thing doesn't hold true for all
> e1000 devices. Some have 1k granularity, but not Arnd's 82540.

I cannot change it: the code checks whether the requested MTU plus the
additional size is less than the allocated aligned buffer, and only then
does it trick the allocator.
Or do you mean that even after a 9k MTU has been set up, the card can
still receive packets of up to 16k?

> You can't know the size of a received packet before it's DMAed into
> host memory, no high performance network controller works that way.

> - Chris

--
Evgeniy Polyakov

2006-08-04 15:16:45

by Chris Leech

Subject: Re: problems with e1000 and jumboframes

On 8/3/06, Evgeniy Polyakov <[email protected]> wrote:
> > You're changing the size of the buffer without telling the hardware.
> > In the interrupt context e1000 knows the size of what was DMAed into
> > the skb, but that's after the fact. So e1000 could detect that memory
> > was corrupted, but not prevent it if you don't give it power of 2
> > buffers. Actually, the power of 2 thing doesn't hold true for all
> > e1000 devices. Some have 1k granularity, but not Arnd's 82540.
>
> I cannot change it: the code checks whether the requested MTU plus the
> additional size is less than the allocated aligned buffer, and only then
> does it trick the allocator.
> Or do you mean that even after a 9k MTU has been set up, the card can
> still receive packets of up to 16k?

Yes, that's exactly what I mean. For anything above the standard 1500
bytes the e1000 _hardware_ has no concept of MTU, only buffer length.
So even if the driver is set to an MTU of 9000, the NIC will still
receive 16k frames. Otherwise the driver would simply allocate MTU
sized buffers.
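
To make the risk concrete, here is a small userspace model of the point above. It assumes, as described in this thread, that the hardware writes up to the programmed buffer length per buffer regardless of MTU; the function names and the numbers used with them are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the NIC only knows hw_buf_len, so a buffer that was
 * really allocated smaller than that can be overrun. */
static inline bool rx_can_overrun(size_t hw_buf_len, size_t allocated_len)
{
	return allocated_len < hw_buf_len;
}

/* Bytes written past the end of the real allocation for one frame. */
static inline size_t rx_overrun_bytes(size_t frame_len, size_t hw_buf_len,
				      size_t allocated_len)
{
	size_t written;

	assert(hw_buf_len > 0);
	/* The hardware stops at its own idea of the buffer length. */
	written = frame_len < hw_buf_len ? frame_len : hw_buf_len;
	return written > allocated_len ? written - allocated_len : 0;
}
```

With a 16384 byte hardware buffer length and only 16222 bytes actually allocated, a 9018 byte frame is harmless in this model, but a 16300 byte frame would write 78 bytes past the end.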

-Chris

2006-08-04 15:34:49

by Chris Leech

Subject: Re: problems with e1000 and jumboframes

On 8/3/06, Evgeniy Polyakov <[email protected]> wrote:
> On Fri, Aug 04, 2006 at 03:59:37PM +1000, Herbert Xu ([email protected]) wrote:
> > Interesting. Could you guys post figures on alloc_page speed vs. kmalloc?
>
> They probably measured kmalloc cache access, which only falls back to
> alloc_pages when the cache is refilled, so it will be faster for some short
> period of time, but in general (especially for such big allocations)
> it is essentially the same.

I think you're right about that. In particular, I think Jesse was
looking at the impact that changing the driver's buffer allocation
method would have on 1500 byte MTU users. With a running network
driver you should see lots of fixed size allocations hitting the slab
cache, and occasionally causing an alloc_pages. If you replace that
with a call to alloc_pages for every packet that ever gets received,
it's a performance hit.
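
A toy userspace model of that effect, for illustration only: a small object cache serves repeated fixed size allocations from its free list and only falls back to the slower backing allocator when it is empty. Everything here (struct toy_cache, CACHE_SLOTS, malloc standing in for alloc_pages) is an assumption of the sketch and not how the slab allocator is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 8	/* assumed cache capacity, for illustration only */

struct toy_cache {
	void *free_objs[CACHE_SLOTS];
	size_t nfree;
	size_t obj_size;
	unsigned long backing_calls;	/* trips to the slow allocator */
};

static inline void toy_cache_init(struct toy_cache *c, size_t obj_size)
{
	c->nfree = 0;
	c->obj_size = obj_size;
	c->backing_calls = 0;
}

/* Fast path pops a cached object; only an empty cache goes to the
 * backing allocator (malloc stands in for alloc_pages here). */
static inline void *toy_cache_alloc(struct toy_cache *c)
{
	if (c->nfree == 0) {
		c->backing_calls++;
		return malloc(c->obj_size);
	}
	return c->free_objs[--c->nfree];
}

/* Freed objects are kept for reuse until the cache is full. */
static inline void toy_cache_free(struct toy_cache *c, void *obj)
{
	if (c->nfree < CACHE_SLOTS)
		c->free_objs[c->nfree++] = obj;
	else
		free(obj);
}
```

In a receive loop that allocates and frees one buffer per packet, this model reaches the backing allocator once, whereas a scheme without the cache would reach it once per packet.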

So how many skb allocation schemes do you code into a single driver?
Kmalloc everything, page alloc everything, combination of kmalloc and
page buffers for hardware that does header split? That's three
versions of the driver's receive processing and skb allocation that
need to be maintained.

> > Also, getting memory slower is better than not getting it at all :)

Yep.

- Chris

2006-08-04 19:42:45

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Fri, Aug 04, 2006 at 08:34:46AM -0700, Chris Leech ([email protected]) wrote:
> So how many skb allocation schemes do you code into a single driver?
> Kmalloc everything, page alloc everything, combination of kmalloc and
> page buffers for hardware that does header split? That's three
> versions of the driver's receive processing and skb allocation that
> need to be maintained.

At least try to create a scheme which will not end up with a 32k allocation in
atomic context. Generally I would recommend using frag_list as much as
possible (or you can reuse the skb list).

> - Chris

--
Evgeniy Polyakov

2006-08-04 21:02:54

by Jesse Brandeburg

Subject: Re: problems with e1000 and jumboframes

On 8/4/06, Evgeniy Polyakov <[email protected]> wrote:
> On Fri, Aug 04, 2006 at 08:34:46AM -0700, Chris Leech ([email protected]) wrote:
> > So how many skb allocation schemes do you code into a single driver?
> > Kmalloc everything, page alloc everything, combination of kmalloc and
> > page buffers for hardware that does header split? That's three
> > versions of the driver's receive processing and skb allocation that
> > need to be maintained.
>
> At least try to create a scheme which will not end up with a 32k allocation in
> atomic context. Generally I would recommend using frag_list as much as
> possible (or you can reuse the skb list).

This is exactly what we ran into: you can't use the skb list because the
IP fragmentation reassembly code overwrites it. If someone is feeling
particularly miffed by this, I would love to see a patch that used
alloc_page() for all of our receive buffers for the legacy receive
path (e1000_clean_rx_irq); then we would be able to use nr_frags and
frag_list for receives.

Oh, except that eth_type_trans can't handle the entire packet in the
frag_list (it wants the header in the skb->data).

Anyway, this is not as easy a problem to solve as it would seem on the surface.

Jesse

2006-08-05 09:59:39

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Fri, Aug 04, 2006 at 02:02:51PM -0700, Jesse Brandeburg ([email protected]) wrote:
> >> So how many skb allocation schemes do you code into a single driver?
> >> Kmalloc everything, page alloc everything, combination of kmalloc and
> >> page buffers for hardware that does header split? That's three
> >> versions of the driver's receive processing and skb allocation that
> >> need to be maintained.
> >
> >At least try to create a scheme which will not end up with a 32k allocation in
> >atomic context. Generally I would recommend using frag_list as much as
> >possible (or you can reuse the skb list).
>
> this is exactly what we ran into, you can't use skb list because the
> ip fragmentation reassembly code overwrites it. If someone is feeling
> particularly miffed by this i would love to see a patch that used
> alloc_page() for all of our receive buffers for the legacy receive
> path (e1000_clean_rx_irq) then we would be able to use nr_frags and
> frag_list for receives.
>
> Oh, except that eth_type_trans can't handle the entire packet in the
> frag_list (it wants the header in the skb->data)

Yes, part of the packet must live in skb->data, but that does not differ
from frag_list management - place part of the data in skb->data and the
rest into frag_list.
If you can create several skbs and link them together, you definitely can
organize pages into frag_list: just get the pages from the different
skb->data areas and free those skbs.
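
A minimal userspace sketch of the "headers in the linear area, rest in pages" layout, to show that no single allocation has to exceed one page. The struct, the constants and the function are hypothetical and only plan the sizes; this is not the kernel's struct sk_buff and it moves no data.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size */
#define TOY_MAX_FRAGS 18	/* assumed cap on page fragments */

/* Hypothetical layout plan for one received frame. */
struct toy_rx_layout {
	size_t linear_len;		/* bytes kept in the linear, skb->data-like area */
	size_t nr_frags;		/* page fragments used */
	size_t frag_len[TOY_MAX_FRAGS];	/* bytes stored in each fragment */
};

/* Put up to hdr_len bytes in the linear area and the rest in page sized
 * fragments. Returns 0 on success, -1 if more than TOY_MAX_FRAGS pages
 * would be needed. */
static inline int toy_plan_rx(size_t frame_len, size_t hdr_len,
			      struct toy_rx_layout *out)
{
	size_t rest;

	assert(out != NULL);
	out->linear_len = frame_len < hdr_len ? frame_len : hdr_len;
	out->nr_frags = 0;
	rest = frame_len - out->linear_len;

	while (rest > 0) {
		size_t chunk = rest < TOY_PAGE_SIZE ? rest : TOY_PAGE_SIZE;

		if (out->nr_frags == TOY_MAX_FRAGS)
			return -1;
		out->frag_len[out->nr_frags++] = chunk;
		rest -= chunk;
	}
	return 0;
}
```

For a 9018 byte frame with 128 bytes kept in the linear area, the plan uses three fragments of 4096, 4096 and 698 bytes.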

> anyway, this is not as easy a problem to solve as it would seem on the
> surface.

No one says whether it is easy or not, but I'm 100% sure that a 32k
allocation for a 9k jumbo frame in atomic context is not what people
expect from a high-performance NIC.

> Jesse

--
Evgeniy Polyakov

2006-08-05 10:10:21

by Herbert Xu

Subject: Re: problems with e1000 and jumboframes

On Sat, Aug 05, 2006 at 01:58:46PM +0400, Evgeniy Polyakov wrote:
>
> If you can create several skbs and link them together, you definitely can
> organize pages into frag_list: just get the pages from the different
> skb->data areas and free those skbs.

Having a more flexible mechanism for managing skb_shared_info->frags
would definitely be an improvement. At the moment we can't indicate
whether the individual frags are writable so we assume every frag to
be read-only.

If we had a flag to indicate writability we could also have a flag to
indicate that the memory comes from kmalloc rather than alloc_page.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2006-08-05 10:25:02

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Sat, Aug 05, 2006 at 08:09:54PM +1000, Herbert Xu ([email protected]) wrote:
> > If you can create several skbs and link them together, you definitely can
> > organize pages into frag_list: just get the pages from the different
> > skb->data areas and free those skbs.
>
> Having a more flexible mechanism for managing skb_shared_info->frags
> would definitely be an improvement. At the moment we can't indicate
> whether the individual frags are writable so we assume every frag to
> be read-only.

Having one page inside frag_list writable does not make a lot of sense,
so we really need either all of them writable, or none.
And it is much less error-prone to assume that every page is read-only.

> If we had a flag to indicate writability we could also have a flag to
> indicate that the memory comes from kmalloc rather than alloc_page.

Yes, that would be good, but who will give us a bit in the struct page?
Can we rework the frag_list elements to use bitmasks and steal a couple
of bits there, so we would not increase the fragment structure's size?

> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--
Evgeniy Polyakov

2006-08-05 10:33:24

by Herbert Xu

Subject: Re: problems with e1000 and jumboframes

On Sat, Aug 05, 2006 at 02:24:36PM +0400, Evgeniy Polyakov wrote:
>
> > If we had a flag to indicate writability we could also have a flag to
> > indicate that the memory comes from kmalloc rather than alloc_page.
>
> Yes, that would be good, but who will give us a bit in the struct page?
> Can we rework the frag_list elements to use bitmasks and steal a couple
> of bits there, so we would not increase the fragment structure's size?

I wasn't thinking of a bit in struct page, but rather a bit in skb_frag_t.
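
One way to picture such a per-fragment flag, as a userspace sketch only: the struct below is a made-up stand-in for a fragment descriptor, and the flag names and bit widths are assumptions of this example. Nothing like it is claimed to exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for a fragment descriptor. */
#define TOY_FRAG_WRITABLE	0x1u	/* fragment may be modified in place */
#define TOY_FRAG_KMALLOC	0x2u	/* memory came from kmalloc, not alloc_page */

/* Toy descriptor: two flag bits are carved out of the 32 bits that hold
 * offset and size, so adding them does not grow the structure. */
struct toy_frag {
	void *mem;
	unsigned int offset : 15;
	unsigned int size   : 15;
	unsigned int flags  : 2;
};

static inline bool toy_frag_writable(const struct toy_frag *f)
{
	assert(f != NULL);
	return (f->flags & TOY_FRAG_WRITABLE) != 0;
}

static inline bool toy_frag_from_kmalloc(const struct toy_frag *f)
{
	assert(f != NULL);
	return (f->flags & TOY_FRAG_KMALLOC) != 0;
}
```

The trade-off in this sketch is that 15 bits limit offset and size to 32767, which would be too small on systems with large pages.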

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2006-08-05 10:41:56

by Evgeniy Polyakov

Subject: Re: problems with e1000 and jumboframes

On Sat, Aug 05, 2006 at 08:33:07PM +1000, Herbert Xu ([email protected]) wrote:
> On Sat, Aug 05, 2006 at 02:24:36PM +0400, Evgeniy Polyakov wrote:
> >
> > > If we had a flag to indicate writability we could also have a flag to
> > > indicate that the memory comes from kmalloc rather than alloc_page.
> >
> > Yes, that would be good, but who will give us a bit in the struct page?
> > Can we rework the frag_list elements to use bitmasks and steal a couple
> > of bits there, so we would not increase the fragment structure's size?
>
> I wasn't thinking of a bit in struct page, but rather a bit in skb_frag_t.

Actually we can look into struct page, namely into page->lru.next,
the PG_slab bit or page->private (for combined pages), which are pointers to
the appropriate cache if the given page was obtained through kmalloc.
Or we can create bitmasks in fragments.

> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu 许志壬 <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--
Evgeniy Polyakov