2023-08-19 12:20:52

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v3 2/2] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()

From: David Howells
> Sent: Friday, August 18, 2023 5:49 PM
>
> David Laight <[email protected]> wrote:
>
> > > iov_iter_init inc 0x27 -> 0x31 +0xa
> >
> > Are you hitting the gcc bug that loads the constant from memory?
>
> I'm not sure what that looks like. For your perusal, here's a disassembly of
> the use-switch-on-enum variant:
>
> 0xffffffff8177726c <+0>: cmp $0x1,%esi
> 0xffffffff8177726f <+3>: jbe 0xffffffff81777273 <iov_iter_init+7>
> 0xffffffff81777271 <+5>: ud2
> 0xffffffff81777273 <+7>: test %esi,%esi
> 0xffffffff81777275 <+9>: movw $0x1,(%rdi)
> 0xffffffff8177727a <+14>: setne 0x3(%rdi)
> 0xffffffff8177727e <+18>: xor %eax,%eax
> 0xffffffff81777280 <+20>: movb $0x0,0x2(%rdi)
> 0xffffffff81777284 <+24>: movb $0x1,0x4(%rdi)
> 0xffffffff81777288 <+28>: mov %rax,0x8(%rdi)
> 0xffffffff8177728c <+32>: mov %rdx,0x10(%rdi)
> 0xffffffff81777290 <+36>: mov %r8,0x18(%rdi)
> 0xffffffff81777294 <+40>: mov %rcx,0x20(%rdi)
> 0xffffffff81777298 <+44>: jmp 0xffffffff81d728a0 <__x86_return_thunk>
>
> versus the use-bitmap variant:
>
> 0xffffffff81777311 <+0>: cmp $0x1,%esi
> 0xffffffff81777314 <+3>: jbe 0xffffffff81777318 <iov_iter_init+7>
> 0xffffffff81777316 <+5>: ud2
> 0xffffffff81777318 <+7>: test %esi,%esi
> 0xffffffff8177731a <+9>: movb $0x2,(%rdi)
> 0xffffffff8177731d <+12>: setne 0x1(%rdi)
> 0xffffffff81777321 <+16>: xor %eax,%eax
> 0xffffffff81777323 <+18>: mov %rdx,0x10(%rdi)
> 0xffffffff81777327 <+22>: mov %rax,0x8(%rdi)
> 0xffffffff8177732b <+26>: mov %r8,0x18(%rdi)
> 0xffffffff8177732f <+30>: mov %rcx,0x20(%rdi)
> 0xffffffff81777333 <+34>: jmp 0xffffffff81d72960 <__x86_return_thunk>
>
> It seems to be that the former is loading byte constants individually, whereas
> Linus combined all those fields into a single byte and eliminated one of them.

I think you need to re-order the structure.
The top set writes to bytes 0..4 with:
> 0xffffffff81777275 <+9>: movw $0x1,(%rdi)
> 0xffffffff8177727a <+14>: setne 0x3(%rdi)
> 0xffffffff81777280 <+20>: movb $0x0,0x2(%rdi)
> 0xffffffff81777284 <+24>: movb $0x1,0x4(%rdi)
Note that the 'setne' writes into the middle of the constants.

The lower writes bytes 0..1 with:
> 0xffffffff8177731a <+9>: movb $0x2,(%rdi)
> 0xffffffff8177731d <+12>: setne 0x1(%rdi)

I think that if you move the 'conditional' value to offset 4
you'll get fewer writes.
Probably a 32bit load into %eax and then a write.

I don't think gcc likes generating 16bit immediates.
In some tests I did it loaded a 32bit value into %eax
and then wrote the low bits.
So the code is much the same (on x86) for 2 or 4 bytes
of constants.
I'm sure you can use the 'data-16' prefix with an immediate.

I'm not sure why you have two non-zero values when Linus
only had one though.

OTOH you don't want to be writing 3 bytes of constants.
Also gcc won't generate:
movl $0xaabbccdd,%eax
setne %al // overwriting the dd
movl %eax,(%rdi)
and I suspect the partial write (to %al) will be a stall.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)