2022-09-29 16:10:11

by Maxim Levitsky

Subject: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

Hi!

Recently I noticed that this commit broke the boot of some of the VMs that I run on my dev machine.

It seems that I am not the first to notice this, but in my case it is a bit different:

https://lore.kernel.org/all/[email protected]/

My VM is a normal x86 VM, and it uses virtio-blk in the guest to access the virtual disk,
which is a qcow2 file stored on an ext4 filesystem on an NVMe drive with 4K sectors
(however, I was also able to reproduce this with a raw file).

It seems that the only two things that are needed to reproduce the issue are:

1. The qcow2/raw file has to be located on a drive which has a 4K hardware block size.
2. Qemu needs to use direct IO (both aio and 'threads' reproduce this).

I did some debugging, and I isolated the kernel's change in behavior from qemu's point of view:


Qemu, when using direct IO, 'probes' the underlying file.

It probes two things:

1. It probes the minimum block size it can read.
It does so by trying to read 1, 512, 1024, 2048 and 4096 bytes at offset 0,
using a 4096-byte aligned buffer, and notes the first read that works as the hardware block size.

(The relevant function is 'raw_probe_alignment' in src/block/file-posix.c in the qemu source code).


2. It probes the buffer alignment by reading 4096 bytes, also at file offset 0,
this time using a buffer that is 1-, 512-, 1024-, 2048- and 4096-byte aligned
(this is done by allocating a buffer which is 4K aligned and adding 1, 512 and so on to its address).

The first successful read is saved as the required buffer alignment.
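
For reference, the probing logic is roughly the following (just a minimal
reconstruction of what is described above, not qemu's actual code; error
handling is omitted and the target file is passed on the command line):

/* Sketch of the two O_DIRECT probes described above (not qemu's actual code). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t sizes[] = { 1, 512, 1024, 2048, 4096 };
    void *buf;
    size_t i;
    int fd = open(argv[1], O_RDONLY | O_DIRECT);

    /* 4K-aligned scratch buffer, large enough for the largest probe
     * plus the largest misalignment added below. */
    if (posix_memalign(&buf, 4096, 2 * 4096))
        return 1;

    /* Probe 1: smallest read length that works with an aligned buffer. */
    for (i = 0; i < 5; i++) {
        if (pread(fd, buf, sizes[i], 0) >= 0) {
            printf("block size: %zu\n", sizes[i]);
            break;
        }
    }

    /* Probe 2: smallest buffer alignment that works for a 4096-byte read,
     * tried by misaligning the 4K-aligned buffer by 1, 512, ... bytes. */
    for (i = 0; i < 5; i++) {
        size_t off = sizes[i] % 4096;   /* 1, 512, 1024, 2048, 0 */
        if (pread(fd, (char *)buf + off, 4096, 0) >= 0) {
            printf("buffer alignment: %zu\n", sizes[i]);
            break;
        }
    }

    free(buf);
    close(fd);
    return 0;
}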


Before the patch, both probes would yield 4096 and everything would work fine.
(The file in question is stored on a 4K block device.)


After the patch, the buffer alignment probe succeeds at 512 bytes.
This means that the kernel now allows reading 4K of data at file offset 0 with a buffer that
is only 512-byte aligned.

It is worth noting that the probe was done using the 'pread' syscall.


Later on, qemu likely reads the first 512-byte sector of the drive.

It uses preadv with 2 io vectors:

The first one is for 512 bytes and seems to have a 0xC00 offset into the page
(this likely depends on the debug session but seems to be consistent).

The second one is for 3584 bytes and also has a buffer that is not 4K aligned
(a 0x200 page offset this time).

This means that qemu does respect the 4K block size but only respects 512-byte buffer alignment,
which is consistent with the result of the probing.

That preadv then fails with -EINVAL.
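
To illustrate, the failing request is roughly equivalent to the following
(a reconstruction from the debug session above, not qemu's actual code; the
0xC00 and 0x200 page offsets are the ones observed there):

/* 4096 bytes total, split 512 + 3584, both iov bases only 512-byte aligned.
 * Reconstruction of the observed request, not qemu's actual code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct iovec iov[2];
    void *p1, *p2;
    int fd = open(argv[1], O_RDONLY | O_DIRECT);

    posix_memalign(&p1, 4096, 4096);
    posix_memalign(&p2, 4096, 4096);

    iov[0].iov_base = (char *)p1 + 0xC00;   /* 512 bytes, not 4K aligned */
    iov[0].iov_len  = 512;
    iov[1].iov_base = (char *)p2 + 0x200;   /* 3584 bytes, not 4K aligned */
    iov[1].iov_len  = 3584;

    if (preadv(fd, iov, 2, 0) < 0)
        perror("preadv");   /* fails with EINVAL here, as described above */

    return 0;
}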

Forcing qemu to use a 4K buffer size fixes the issue, as does reverting the offending commit.

Any patches or suggestions are welcome.

I am using 6.0-rc7, from the mainline master branch as of yesterday.

Best regards,
Maxim Levitsky


2022-09-29 16:28:33

by Maxim Levitsky

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Thu, 2022-09-29 at 09:48 -0600, Keith Busch wrote:
> I am aware, and I've submitted the fix to qemu here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
>


Thanks for the quick response!

The question is, though: isn't this a kernel ABI breakage?

(I myself don't care, I would be happy to patch my qemu),

but I am afraid that this will break *lots* of users who only updated the kernel
and not qemu.

What do you think?

Best regards,
Maxim Levitsky

2022-09-29 17:23:30

by Christoph Hellwig

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> I don't think so. Memory alignment and length granularity are two completely
> different concepts. If anything, the kernel's ABI had been that the length
> requirement was also required for the memory alignment, not the other way
> around. That usage will continue working with this kernel patch.

Well, Linus does treat anything that breaks significant userspace
as a regression. Qemu certainly is significant, but that might depend
a bit on how common the configurations hitting this issue are.

2022-09-29 17:37:00

by Keith Busch

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Thu, Sep 29, 2022 at 07:16:29PM +0300, Maxim Levitsky wrote:
> On Thu, 2022-09-29 at 09:48 -0600, Keith Busch wrote:
> > I am aware, and I've submitted the fix to qemu here:
> >
> > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> >
>
>
> Thanks for quick response!
>
> Question is though, isn't this an kernel ABI breakage?

I don't think so. Memory alignment and length granularity are two completely
different concepts. If anything, the kernel's ABI had been that the length
requirement was also required for the memory alignment, not the other way
around. That usage will continue working with this kernel patch.

2022-09-29 18:10:21

by Paolo Bonzini

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On 9/29/22 18:39, Christoph Hellwig wrote:
> On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
>>> I am aware, and I've submitted the fix to qemu here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
>>
>> I don't think so. Memory alignment and length granularity are two completely
>> different concepts. If anything, the kernel's ABI had been that the length
>> requirement was also required for the memory alignment, not the other way
>> around. That usage will continue working with this kernel patch.
>
> Well, Linus does treat anything that breaks significant userspace
> as a regression. Qemu certainly is significant, but that might depend
> on bit how common configurations hitting this issue are.

Seeing the QEMU patch, I agree that it's a QEMU bug though. I'm
surprised it has ever worked.

It requires 4K sectors in the host but not in the guest, and can be
worked around (if not migrating) by disabling O_DIRECT. I think it's
not that awful, but we probably should do some extra releases of QEMU
stable branches.

Paolo

2022-09-30 12:12:02

by Thorsten Leemhuis

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

Hi, this is your Linux kernel regression tracker. This might be a Qemu
bug, but it's exposed by a kernel change, so I at least want to have it in
the tracking. I'll simply remove it in a few weeks if it turns out that
nobody except Maxim hits this.

On 29.09.22 17:41, Maxim Levitsky wrote:
> Hi!
>
> Recently I noticed that this commit broke the boot of some of the VMs that I run on my dev machine.
>
> It seems that I am not the first to notice this but in my case it is a bit different
>
> https://lore.kernel.org/all/[email protected]/
>
> My VM is a normal x86 VM, and it uses virtio-blk in the guest to access the virtual disk,
> which is a qcow2 file stored on ext4 filesystem which is stored on NVME drive with 4K sectors.
> (however I was also able to reproduce this on a raw file)
>
> It seems that the only two things that is needed to reproduce the issue are:
>
> 1. The qcow2/raw file has to be located on a drive which has 4K hardware block size.
> 2. Qemu needs to use direct IO (both aio and 'threads' reproduce this).
>
> I did some debugging and I isolated the kernel change in behavior from qemu point of view:
>
>
> Qemu, when using direct IO, 'probes' the underlying file.
>
> It probes two things:
>
> 1. It probes the minimum block size it can read.
> It does so by trying to read 1, 512, 1024, 2048 and 4096 bytes at offset 0,
> using a 4096 bytes aligned buffer, and notes the first read that works as the hardware block size.
>
> (The relevant function is 'raw_probe_alignment' in src/block/file-posix.c in qemu source code).
>
>
> 2. It probes the buffer alignment by reading 4096 bytes also at file offset 0,
> this time using a buffer that is 1, 512, 1024, 2048 and 4096 aligned
> (this is done by allocating a buffer which is 4K aligned and adding 1/512 and so on to its address)
>
> First successful read is saved as the required buffer alignment.
>
>
> Before the patch, both probes would yield 4096 and everything would work fine.
> (The file in question is stored on 4K block device)
>
>
> After the patch the buffer alignment probe succeeds at 512 bytes.
> This means that the kernel now allows to read 4K of data at file offset 0 with a buffer that
> is only 512 bytes aligned.
>
> It is worth to note that the probe was done using 'pread' syscall.
>
>
> Later on, qemu likely reads the 1st 512 sector of the drive.
>
> It uses preadv with 2 io vectors:
>
> First one is for 512 bytes and it seems to have 0xC00 offset into page
> (likely depends on debug session but seems to be consistent)
>
> Second one is for 3584 bytes and also has a buffer that is not 4K aligned.
> (0x200 page offset this time)
>
> This means that the qemu does respect the 4K block size but only respects 512 bytes buffer alignment,
> which is consistent with the result of the probing.
>
> And that preadv fails with -EINVAL
>
> Forcing qemu to use 4K buffer size fixes the issue, as well as reverting the offending commit.
>
> Any patches, suggestions are welcome.
>
> I use 6.0-rc7, using mainline master branch as yesterday.
>
> Best regards,
> Maxim Levitsky
>
Thanks for the report. To be sure the issue below doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced bf8d08532bc1
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out that I got the title or
something else totally wrong? Then just reply -- ideally while also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained
in the Linux kernel's documentation; the webpage above explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply; it's in everyone's interest to set the public record straight.

2022-10-02 09:20:37

by Maxim Levitsky

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> On 9/29/22 18:39, Christoph Hellwig wrote:
> > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > I am aware, and I've submitted the fix to qemu here:
> > > >
> > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > >
> > > I don't think so. Memory alignment and length granularity are two completely
> > > different concepts. If anything, the kernel's ABI had been that the length
> > > requirement was also required for the memory alignment, not the other way
> > > around. That usage will continue working with this kernel patch.

Yes, this is how I also understand it - for example, for O_DIRECT on a file which
resides on a 4K block device, you have to use page-aligned buffers.

But here, after the patch, a 512-byte aligned buffer starts working as well - if I
understand you correctly, the ABI didn't guarantee that such usage would fail,
but rather that it might fail.

> >
> > Well, Linus does treat anything that breaks significant userspace
> > as a regression. Qemu certainly is significant, but that might depend
> > on bit how common configurations hitting this issue are.
>
> Seeing the QEMU patch, I agree that it's a QEMU bug though. I'm
> surprised it has ever worked.
>
> It requires 4K sectors in the host but not in the guest, and can be
> worked around (if not migrating) by disabling O_DIRECT. I think it's
> not that awful, but we probably should do some extra releases of QEMU
> stable branches.
>
> Paolo
>

I must admit I am out of the loop on the exact requirements of O_DIRECT.


If I understand this correctly, after the patch in question,
qemu is able to use just a 512-byte aligned buffer to read a single 4K block from the disk,
which was supposed to fail but wasn't guaranteed to fail.



Later qemu submits an iovec which also reads a 4K block, but in two parts,
and if I understand this correctly, each part (iov) is considered
to be a separate IO operation, and thus each has to be, in my case, 4K in size,
and its memory buffer *should* also be 4K aligned.

(But it can work with smaller alignment as well.)


Assuming that I understand all of this correctly, I agree with Paolo that this is a qemu
bug, but I do fear that it can cause quite some problems for users,
especially for users that run an outdated qemu version.

It might be too much to ask, but maybe add a Kconfig option to keep the legacy behavior
for those that need it?

Best regards,
Maxim Levitsky

2022-10-02 14:16:32

by Keith Busch

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Sun, Oct 02, 2022 at 11:59:42AM +0300, Maxim Levitsky wrote:
> On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> > On 9/29/22 18:39, Christoph Hellwig wrote:
> > > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > > I am aware, and I've submitted the fix to qemu here:
> > > > >
> > > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > > >
> > > > I don't think so. Memory alignment and length granularity are two completely
> > > > different concepts. If anything, the kernel's ABI had been that the length
> > > > requirement was also required for the memory alignment, not the other way
> > > > around. That usage will continue working with this kernel patch.
>
> Yes, this is how I also understand it - for example for O_DIRECT on a file which
> resides on 4K block device, you have to use page aligned buffers.
>
> But here after the patch, 512 aligned buffer starts working as well - If I
> understand you correctly the ABI didn't guarantee that such usage would fail,
> but rather that it might fail.

The kernel patch will allow buffer alignment to work with whatever the hardware
reports it can support. It could even be as low as byte-aligned if that is what
the hardware can use.

The patch aligns direct-io with the same criteria blk_rq_map_user() has always
used to know if the user space buffer is compatible with the hardware's dma
requirements. Prior to this patch, the direct-io memory alignment was an
artificial software constraint, and that constraint creates a lot of
unnecessary memory pressure.

As has always been the case, each segment needs to have logical block length
granularity. QEMU assumed a buffer's page offset also defined the logical block
size instead of using the actual logical block size that it had previously
discovered directly.

> If I understand that correctly, after the patch in question,
> qemu is able to use just 512 bytes aligned buffer to read a single 4K block from the disk,
> which supposed to fail but wasn't guarnteed to fail.
>
> Later qemu it submits iovec which also reads a 4K block but in two parts,
> and if I understand that correctly, each part (iov) is considered
> to be a separate IO operation, and thus each has to be in my case 4K in size,
> and its memory buffer *should* also be 4K aligned.
>
> (but it can work with smaller alignement as well).

Right. The iov length needs to match the logical block size. The iov's memory
offset needs to align to the queue's dma_alignment attribute. The memory
alignment may be smaller than the block size.
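
To illustrate, a single-segment read that satisfies both constraints on the
device from this report (4096-byte logical_block_size, 512-byte dma_alignment)
looks roughly like this (a minimal sketch, error handling omitted; the values
come from this report, not from any particular device):

/* Minimal sketch: one iov whose base honors dma_alignment and whose
 * length matches the logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct iovec iov;
    void *buf;
    int fd = open(argv[1], O_RDONLY | O_DIRECT);

    posix_memalign(&buf, 512, 4096);    /* base aligned to dma_alignment */
    iov.iov_base = buf;
    iov.iov_len  = 4096;                /* length is the logical block size */

    return preadv(fd, &iov, 1, 0) == 4096 ? 0 : 1;
}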

> Assuming that I understand all of this correctly, I agree with Paolo that this is qemu
> bug, but I do fear that it can cause quite some problems for users,
> especially for users that use outdated qemu version.
>
> It might be too much to ask, but maybe add a Kconfig option to keep legacy behavier
> for those that need it?

Kconfig doesn't sound right.

The block layer exports all the attributes user space needs to know about for
direct io.

iov length: /sys/block/<block-dev>/queue/logical_block_size
iov mem align: /sys/block/<block-dev>/queue/dma_alignment
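
For example, user space can query both values at run time with something like
this (a minimal sketch; 'nvme0n1' is just an example device name):

/* Minimal sketch: read the two queue attributes that govern direct IO. */
#include <stdio.h>

static long read_attr(const char *dev, const char *attr)
{
    char path[256];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("iov length granularity: %ld\n", read_attr("nvme0n1", "logical_block_size"));
    printf("iov memory alignment:   %ld\n", read_attr("nvme0n1", "dma_alignment"));
    return 0;
}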

If you really want to change the behavior, I think maybe we could make the
dma_alignment attribute writeable (or perhaps add a new attribute specifically
for dio_alignment) so the user can request something larger.

2022-10-03 08:56:47

by Maxim Levitsky

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

On Sun, 2022-10-02 at 07:56 -0600, Keith Busch wrote:
> On Sun, Oct 02, 2022 at 11:59:42AM +0300, Maxim Levitsky wrote:
> > On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> > > On 9/29/22 18:39, Christoph Hellwig wrote:
> > > > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > > > I am aware, and I've submitted the fix to qemu here:
> > > > > >
> > > > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > > > >
> > > > > I don't think so. Memory alignment and length granularity are two completely
> > > > > different concepts. If anything, the kernel's ABI had been that the length
> > > > > requirement was also required for the memory alignment, not the other way
> > > > > around. That usage will continue working with this kernel patch.
> >
> > Yes, this is how I also understand it - for example for O_DIRECT on a file which
> > resides on 4K block device, you have to use page aligned buffers.
> >
> > But here after the patch, 512 aligned buffer starts working as well - If I
> > understand you correctly the ABI didn't guarantee that such usage would fail,
> > but rather that it might fail.
>
> The kernel patch will allow buffer alignment to work with whatever the hardware
> reports it can support. It could even as low as byte aligned if that's the
> hardware can use that.
>
> The patch aligns direct-io with the same criteria blk_rq_map_user() has always
> used to know if the user space buffer is compatible with the hardware's dma
> requirements. Prior to this patch, the direct-io memory alignment was an
> artificial software constraint, and that constraint creates a lot of
> unnecessary memory pressure.
>
> As has always been the case, each segment needs to be a logical block length
> granularity. QEMU assumed a buffer's page offset also defined the logical block
> size instead of using the actual logical block size that it had previously
> discovered directly.
>
> > If I understand that correctly, after the patch in question,
> > qemu is able to use just 512 bytes aligned buffer to read a single 4K block from the disk,
> > which supposed to fail but wasn't guarnteed to fail.
> >
> > Later qemu it submits iovec which also reads a 4K block but in two parts,
> > and if I understand that correctly, each part (iov) is considered
> > to be a separate IO operation, and thus each has to be in my case 4K in size,
> > and its memory buffer *should* also be 4K aligned.
> >
> > (but it can work with smaller alignement as well).
>
> Right. The iov length needs to match the logical block size. The iov's memory
> offset needs to align to the queue's dma_alignment attribute. The memory
> alignment may be smaller than a block size.
>
> > Assuming that I understand all of this correctly, I agree with Paolo that this is qemu
> > bug, but I do fear that it can cause quite some problems for users,
> > especially for users that use outdated qemu version.
> >
> > It might be too much to ask, but maybe add a Kconfig option to keep legacy behavier
> > for those that need it?
>
> Kconfig doesn't sound right.
>
> The block layer exports all the attributes user space needs to know about for
> direct io.
>
> iov length: /sys/block/<block-dev>/queue/logical_block_size
> iov mem align: /sys/block/<block-dev>/queue/dma_alignment
>
> If you really want to change the behavior, I think maybe we could make the
> dma_alignment attribute writeable (or perhaps add a new attribute specifically
> for dio_alignment) so the user can request something larger.
>
It all makes sense now.

A new attribute could make sense, I guess, and could be set by a udev rule or something.


Anyway, I won't worry about this for now, and if there are issues I'll see how we can work
around them.

Thanks for everything,
Best regards,
Maxim Levitsky

2022-11-04 12:15:09

by Thorsten Leemhuis

Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures #forregzbot

[Note: this mail is primarily sent for documentation purposes and/or for
regzbot, my Linux kernel regression tracking bot. That's why I removed
most or all folks from the list of recipients but left any that looked
like mailing lists. These mails usually contain '#forregzbot' in the
subject, to make them easy to spot and filter out.]

On 30.09.22 13:52, Thorsten Leemhuis wrote:
>
> Hi, this is your Linux kernel regression tracker. This might be a Qemu
> bug, but it's exposed by kernel change, so I at least want to have it in
> the tracking. I'll simply remove it in a few weeks, if it turns out that
> nobody except Maxim hits this.

There was one more report:
https://lore.kernel.org/all/[email protected]/

But that's it so far. I'll put this to rest for now:

#regzbot monitor:
https://lore.kernel.org/all/[email protected]/
#regzbot invalid: a kernel change exposed a bug in qemu that maybe still
needs to be fixed, *if* more people run into this