2015-12-20 17:51:27

by Linus Torvalds

[permalink] [raw]
Subject: IO errors after "block: remove bio_get_nr_vecs()"

Kent, Jens, Christoph et al,
please see this bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=109661

where Artem Tashkinov bisected his problems with 4.3 down to commit
b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
signed off on.

(Also Tejun - maybe you can see what's up - maybe that error message
tells you something)

I'm not sure what's up with his machine, the disk doesn't seem to be
anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:

ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133

which doesn't strike me as odd.

Looking at the dmesg, it also looks like it's a pretty normal
Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
ID for the AHCI chip seems to be (INTEL, 0x1c02).

Any ideas? Anybody?

Linus


2015-12-20 18:18:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
> Kent, Jens, Christoph et al,
> please see this bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>
> where Artem Tashkinov bisected his problems with 4.3 down to commit
> b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> signed off on.

Artem,

can you re-check the commits around this series again? I would be
extremtly surprised if it's really this particular commit and not
one just before it causing the problem - it just allocates bios
to the biggest possible instead of only allocating up to what
bio_add_page would accept.

2015-12-20 18:41:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig <[email protected]> wrote:
>
> Artem,
>
> can you re-check the commits around this series again? I would be
> extremtly surprised if it's really this particular commit and not
> one just before it causing the problem - it just allocates bios
> to the biggest possible instead of only allocating up to what
> bio_add_page would accept.

Judging by Artem's bisect log, the last commit he tested before the
bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper
bio_add_page() instead of open coding on bi_io_vec") and he marked
that one good.

Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up
the dmesg files (in the same bisection tar-file as the bisection log)
with the actual versions. Also, Artem's bisect.log isn't actually the
.git/BISECT_LOG file that contains the full information about what was
marked good and bad, so it's a bit hard to read (ie I can tell that
Artem had to mark commit 6cf66b4caf9c as "good" not because his log
says so, but because that explains the next commit to be tested).

Of course, it's fairly easy to make a mistake while bisecting (just
doing a thinko), but usually bisection miistakes end up causing you to
go into some "all good" or "all bad" region of commits, and the fact
that Artem seems to have marked the previous commit good and the final
commit bad does seem to imply the bisection was successful.

But yes, it is always nice to double-check the bisection results. The
best way to do it is generally to try to revert the bad commit and
verify that things work after that, but that commit doesn't revert
cleanly on top of 4.3 due to other changes.

Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's
basically a revert of b54ffb73cadc, but with a few fixups to make the
revert work on top of 4.3.

So Artem, if you can test whether 4.3 works with that revert, and/or
double-check booting that b54ffb73cadc again (to verify that it's
really bad), and its parent (to double-check that it's really good),
that would be a good way to verify that yes, it is really that *one*
commit that breaks things for you.

Linus


Attachments:
patch.diff (10.86 kB)

2015-12-20 18:44:13

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
> > Kent, Jens, Christoph et al,
> ie please see this bugzilla:
> >o
> > httpps://bugzilla.kernel.org/show_bug.cgi?id=109661
> >
> > where Artem Tashkinov bisected his problems with 4.3 down to commit
> > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> > signed off on.
>
> Artem,
>
> can you re-check the commits around this series again? I would be
> extremtly surprised if it's really this particular commit and not
> one just before it causing the problem - it just allocates bios
> to the biggest possible instead of only allocating up to what
> bio_add_page would accept.

pretty sure it's something with how blk_bio_segment_split() decides what
segments are mergable and not. bio_get_nr_vecs() was just returning nr_pages ==
queue_max_segments (ignoring sectors for the moment) - so wait, wtf? that's
basically assuming no segment merging can ever happen, if it does then this was
causing us to send smaller requests to the device than we could have been.

so actually two possibilities I can see:
- in blk_bio_segment_split(), something's screwed up with how it decides what
segments are going to be mergable or not. but I don't think that's likely
since it's doing the exact same thing the rest of the segment merging code
does.
- or, the driver was lying in its queue limits, using queue_max_segments for
"the maximum number of pages I can possibly take", and that bug lurked
undiscovered because of the screwed-upness in bio_get_nr_vecs().

Offhand I don't know where to start digging in the driver code to look into the
second theory though. Tejun, you got any ideas?

2015-12-20 23:29:54

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-20 22:51, Linus Torvalds wrote:
> Kent, Jens, Christoph et al,
> please see this bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>
> where Artem Tashkinov bisected his problems with 4.3 down to commit
> b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> signed off on.
>
> (Also Tejun - maybe you can see what's up - maybe that error message
> tells you something)
>
> I'm not sure what's up with his machine, the disk doesn't seem to be
> anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:
>
> ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133
>
> which doesn't strike me as odd.
>
> Looking at the dmesg, it also looks like it's a pretty normal
> Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
> ID for the AHCI chip seems to be (INTEL, 0x1c02).
>
> Any ideas? Anybody?
>

That's correct. That's a very usual Asus P8P67 Pro motherboard (Intel
P67 chipset) in AHCI mode and run of the mill HDD which is the one you
identified.

2015-12-20 23:30:35

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-20 23:18, Christoph Hellwig wrote:
> On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
>> Kent, Jens, Christoph et al,
>> please see this bugzilla:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>>
>> where Artem Tashkinov bisected his problems with 4.3 down to commit
>> b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
>> signed off on.
>
> Artem,
>
> can you re-check the commits around this series again? I would be
> extremtly surprised if it's really this particular commit and not
> one just before it causing the problem - it just allocates bios
> to the biggest possible instead of only allocating up to what
> bio_add_page would accept.

I'm positive about this particular commit. Of course, it might be
another
GCC 4.7.4 miscompilation which causes the errors which shouldn't be
there but
I'm not an expert, so.

2015-12-20 23:36:43

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-20 23:41, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig wrote:
>>
>> Artem,
>>
>> can you re-check the commits around this series again? I would be
>> extremtly surprised if it's really this particular commit and not
>> one just before it causing the problem - it just allocates bios
>> to the biggest possible instead of only allocating up to what
>> bio_add_page would accept.
>
> Judging by Artem's bisect log, the last commit he tested before the
> bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper
> bio_add_page() instead of open coding on bi_io_vec") and he marked
> that one good.
>
> Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up
> the dmesg files (in the same bisection tar-file as the bisection log)
> with the actual versions. Also, Artem's bisect.log isn't actually the
> .git/BISECT_LOG file that contains the full information about what was
> marked good and bad, so it's a bit hard to read (ie I can tell that
> Artem had to mark commit 6cf66b4caf9c as "good" not because his log
> says so, but because that explains the next commit to be tested).
>
> Of course, it's fairly easy to make a mistake while bisecting (just
> doing a thinko), but usually bisection miistakes end up causing you to
> go into some "all good" or "all bad" region of commits, and the fact
> that Artem seems to have marked the previous commit good and the final
> commit bad does seem to imply the bisection was successful.
>
> But yes, it is always nice to double-check the bisection results. The
> best way to do it is generally to try to revert the bad commit and
> verify that things work after that, but that commit doesn't revert
> cleanly on top of 4.3 due to other changes.
>
> Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's
> basically a revert of b54ffb73cadc, but with a few fixups to make the
> revert work on top of 4.3.
>
> So Artem, if you can test whether 4.3 works with that revert, and/or
> double-check booting that b54ffb73cadc again (to verify that it's
> really bad), and its parent (to double-check that it's really good),
> that would be a good way to verify that yes, it is really that *one*
> commit that breaks things for you.
>

After reverting (applying) this patch on top of 4.3.3 everything is back
to normal. It's indeed a guilty commit.

2015-12-20 23:41:11

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-20 23:44, Kent Overstreet wrote:
> On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote:
>> On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
>> > Kent, Jens, Christoph et al,
>> ie please see this bugzilla:
>> >o
>> > httpps://bugzilla.kernel.org/show_bug.cgi?id=109661
>> >
>> > where Artem Tashkinov bisected his problems with 4.3 down to commit
>> > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
>> > signed off on.
>>
>> Artem,
>>
>> can you re-check the commits around this series again? I would be
>> extremtly surprised if it's really this particular commit and not
>> one just before it causing the problem - it just allocates bios
>> to the biggest possible instead of only allocating up to what
>> bio_add_page would accept.
>
> pretty sure it's something with how blk_bio_segment_split() decides
> what
> segments are mergable and not. bio_get_nr_vecs() was just returning
> nr_pages ==
> queue_max_segments (ignoring sectors for the moment) - so wait, wtf?
> that's
> basically assuming no segment merging can ever happen, if it does then
> this was
> causing us to send smaller requests to the device than we could have
> been.
>
> so actually two possibilities I can see:
> - in blk_bio_segment_split(), something's screwed up with how it
> decides what
> segments are going to be mergable or not. but I don't think that's
> likely
> since it's doing the exact same thing the rest of the segment
> merging code
> does.
> - or, the driver was lying in its queue limits, using
> queue_max_segments for
> "the maximum number of pages I can possibly take", and that bug
> lurked
> undiscovered because of the screwed-upness in bio_get_nr_vecs().
>
> Offhand I don't know where to start digging in the driver code to look
> into the
> second theory though. Tejun, you got any ideas?

Here's an actual bisect log which Linus was missing:

git bisect start
# bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861
# good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2
git bisect good 64291f7db5bd8150a74ad2036f1037e6a0428df2
# bad: [807249d3ada1ff28a47c4054ca4edd479421b671] Merge branch
'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
git bisect bad 807249d3ada1ff28a47c4054ca4edd479421b671
# good: [102178108e2246cb4b329d3fb7872cd3d7120205] Merge tag
'armsoc-drivers' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 102178108e2246cb4b329d3fb7872cd3d7120205
# good: [62da98656b62a5ca57f22263705175af8ded5aa1] netfilter:
nf_conntrack: make nf_ct_zone_dflt built-in
git bisect good 62da98656b62a5ca57f22263705175af8ded5aa1
# good: [f1a3c0b933e7ff856223d6fcd7456d403e54e4e5] Merge tag
'devicetree-for-4.3' of
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect good f1a3c0b933e7ff856223d6fcd7456d403e54e4e5
# bad: [9cbf22b37ae0592dea809cb8d424990774c21786] Merge tag 'dlm-4.3' of
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
git bisect bad 9cbf22b37ae0592dea809cb8d424990774c21786
# good: [8bdc69b764013a9b5ebeef7df8f314f1066c5d79] Merge branch
'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect good 8bdc69b764013a9b5ebeef7df8f314f1066c5d79
# good: [df910390e2db07a76c87f258475f6c96253cee6c] Merge tag 'scsi-misc'
of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect good df910390e2db07a76c87f258475f6c96253cee6c
# bad: [d975f309a8b250e67b66eabeb56be6989c783629] Merge branch
'for-4.3/sg' of git://git.kernel.dk/linux-block
git bisect bad d975f309a8b250e67b66eabeb56be6989c783629
# bad: [89e2a8404e4415da1edbac6ca4f7332b4a74fae2] crypto/omap-sham:
remove an open coded access to ->page_link
git bisect bad 89e2a8404e4415da1edbac6ca4f7332b4a74fae2
# good: [0e28997ec476bad4c7dbe0a08775290051325f53] btrfs: remove bio
splitting and merge_bvec_fn() calls
git bisect good 0e28997ec476bad4c7dbe0a08775290051325f53
# bad: [2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad] Documentation: update
notes in biovecs about arbitrarily sized bios
git bisect bad 2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad
# good: [7140aafce2fc14c5af02fdb7859b6bea0108be3d] md/raid5: get rid of
bio_fits_rdev()
git bisect good 7140aafce2fc14c5af02fdb7859b6bea0108be3d
# good: [6cf66b4caf9c71f64a5486cadbd71ab58d0d4307] fs: use helper
bio_add_page() instead of open coding on bi_io_vec
git bisect good 6cf66b4caf9c71f64a5486cadbd71ab58d0d4307
# bad: [b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c] block: remove
bio_get_nr_vecs()
git bisect bad b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c

And like he said since the step before the last one was good and the
very last one was bad there was no way I could have made a mistake.

2015-12-20 23:42:42

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote:
> On 2015-12-20 23:18, Christoph Hellwig wrote:
> >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
> >>Kent, Jens, Christoph et al,
> >> please see this bugzilla:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661
> >>
> >>where Artem Tashkinov bisected his problems with 4.3 down to commit
> >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> >>signed off on.
> >
> >Artem,
> >
> >can you re-check the commits around this series again? I would be
> >extremtly surprised if it's really this particular commit and not
> >one just before it causing the problem - it just allocates bios
> >to the biggest possible instead of only allocating up to what
> >bio_add_page would accept.
>
> I'm positive about this particular commit. Of course, it might be another
> GCC 4.7.4 miscompilation which causes the errors which shouldn't be there
> but
> I'm not an expert, so.

I believe you on the commit, and I doubt this has anything to do with gcc - the
errors you're getting are exactly what you normally get when you send the device
an sglist to dma to/from that it doesn't like.

The queue limits stuff is annoyingly fragile, you'd think we'd be able to check
directly in the driver that the stuff we're sending the device is sane but we
don't.

If I came up with a debug patch could you try it out? I don't have any ideas for
one yet, but if someone who knows the ATA code doesn't jump in I'll call up
Tejun and make him walk me through it.

2015-12-20 23:49:51

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 04:42, Kent Overstreet wrote:
> On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote:
>> On 2015-12-20 23:18, Christoph Hellwig wrote:
>> >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
>> >>Kent, Jens, Christoph et al,
>> >> please see this bugzilla:
>> >>
>> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>> >>
>> >>where Artem Tashkinov bisected his problems with 4.3 down to commit
>> >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
>> >>signed off on.
>> >
>> >Artem,
>> >
>> >can you re-check the commits around this series again? I would be
>> >extremtly surprised if it's really this particular commit and not
>> >one just before it causing the problem - it just allocates bios
>> >to the biggest possible instead of only allocating up to what
>> >bio_add_page would accept.
>>
>> I'm positive about this particular commit. Of course, it might be
>> another
>> GCC 4.7.4 miscompilation which causes the errors which shouldn't be
>> there
>> but
>> I'm not an expert, so.
>
> I believe you on the commit, and I doubt this has anything to do with
> gcc - the
> errors you're getting are exactly what you normally get when you send
> the device
> an sglist to dma to/from that it doesn't like.
>
> The queue limits stuff is annoyingly fragile, you'd think we'd be able
> to check
> directly in the driver that the stuff we're sending the device is sane
> but we
> don't.
>
> If I came up with a debug patch could you try it out? I don't have any
> ideas for
> one yet, but if someone who knows the ATA code doesn't jump in I'll
> call up
> Tejun and make him walk me through it.

No problem, I just hope that this particular access mode (and you debug
patch) won't decrease the lifespan of my HDD. Seagate HDDs have been
very fragile (read atrociously unreliable) for the past five years.

2015-12-21 01:38:50

by Ming Lei

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds
<[email protected]> wrote:
> Kent, Jens, Christoph et al,
> please see this bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>
> where Artem Tashkinov bisected his problems with 4.3 down to commit
> b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> signed off on.
>
> (Also Tejun - maybe you can see what's up - maybe that error message
> tells you something)
>
> I'm not sure what's up with his machine, the disk doesn't seem to be
> anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:
>
> ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133
>
> which doesn't strike me as odd.
>
> Looking at the dmesg, it also looks like it's a pretty normal
> Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
> ID for the AHCI chip seems to be (INTEL, 0x1c02).
>
> Any ideas? Anybody?

BTW, I have posted very similar issue in the link:

http://marc.info/?l=linux-ide&m=145066119623811&w=2

Artem, I noticed from bugzillar that the hardware is i386, just
wondering if PAE is enabled? If yes, I am more confident
that both the two kinds of report are similar or same.

Thanks,

>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Ming Lei

2015-12-21 01:50:25

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 06:38, Ming Lei wrote:
> On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote:
>> Kent, Jens, Christoph et al,
>> please see this bugzilla:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=109661
>>
>> where Artem Tashkinov bisected his problems with 4.3 down to commit
>> b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
>> signed off on.
>>
>> (Also Tejun - maybe you can see what's up - maybe that error message
>> tells you something)
>>
>> I'm not sure what's up with his machine, the disk doesn't seem to be
>> anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:
>>
>> ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133
>>
>> which doesn't strike me as odd.
>>
>> Looking at the dmesg, it also looks like it's a pretty normal
>> Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
>> ID for the AHCI chip seems to be (INTEL, 0x1c02).
>>
>> Any ideas? Anybody?
>
> BTW, I have posted very similar issue in the link:
>
> http://marc.info/?l=linux-ide&m=145066119623811&w=2
>
> Artem, I noticed from bugzillar that the hardware is i386, just
> wondering if PAE is enabled? If yes, I am more confident
> that both the two kinds of report are similar or same.
>

Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically
mentioned in the corresponding bug report.

P.S. I know Linus doesn't condone PAE but I still find it more
preferrable than running a mixed environment with almost zero benefit in
regard to performance and quite obvious performance regressions related
to an increased number of libraries being loaded (i686 + x86_64) and
slightly bloated code which sometimes cannot fit in the CPU cache. Call
me old fashioned but I won't upgrade to x86_64 until most of the things
that I run locally are available for x86_64 and that won't happen any
time soon.

2015-12-21 02:18:45

by Ming Lei

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov <[email protected]> wrote:
>> BTW, I have posted very similar issue in the link:
>>
>> http://marc.info/?l=linux-ide&m=145066119623811&w=2
>>
>> Artem, I noticed from bugzillar that the hardware is i386, just
>> wondering if PAE is enabled? If yes, I am more confident
>> that both the two kinds of report are similar or same.
>>
>
> Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned
> in the corresponding bug report.

OK, could you dump value of the following files under /sys/block/sdN/queue/ ?

max_hw_sectors_kb
max_sectors_kb
max_segments
max_segment_size

'sdN' is the faulted disk name.

Thanks,
Ming Lei

2015-12-21 02:25:25

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 07:18, Ming Lei wrote:
> On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov wrote:
>>> BTW, I have posted very similar issue in the link:
>>>
>>> http://marc.info/?l=linux-ide&m=145066119623811&w=2
>>>
>>> Artem, I noticed from bugzillar that the hardware is i386, just
>>> wondering if PAE is enabled? If yes, I am more confident
>>> that both the two kinds of report are similar or same.
>>>
>>
>> Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically
>> mentioned
>> in the corresponding bug report.
>
> OK, could you dump value of the following files under
> /sys/block/sdN/queue/ ?
>
> max_hw_sectors_kb
> max_sectors_kb
> max_segments
> max_segment_size
>
> 'sdN' is the faulted disk name.
>

# cat
/sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size}
32767
32767
168
65536

2015-12-21 02:32:43

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 06:50:21AM +0500, Artem S. Tashkinov wrote:
> On 2015-12-21 06:38, Ming Lei wrote:
> >On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote:
> >>Kent, Jens, Christoph et al,
> >> please see this bugzilla:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661
> >>
> >>where Artem Tashkinov bisected his problems with 4.3 down to commit
> >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all
> >>signed off on.
> >>
> >>(Also Tejun - maybe you can see what's up - maybe that error message
> >>tells you something)
> >>
> >>I'm not sure what's up with his machine, the disk doesn't seem to be
> >>anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:
> >>
> >> ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133
> >>
> >>which doesn't strike me as odd.
> >>
> >>Looking at the dmesg, it also looks like it's a pretty normal
> >>Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
> >>ID for the AHCI chip seems to be (INTEL, 0x1c02).
> >>
> >>Any ideas? Anybody?
> >
> >BTW, I have posted very similar issue in the link:
> >
> >http://marc.info/?l=linux-ide&m=145066119623811&w=2
> >
> >Artem, I noticed from bugzillar that the hardware is i386, just
> >wondering if PAE is enabled? If yes, I am more confident
> >that both the two kinds of report are similar or same.
> >
>
> Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned
> in the corresponding bug report.
>
> P.S. I know Linus doesn't condone PAE but I still find it more preferrable
> than running a mixed environment with almost zero benefit in regard to
> performance and quite obvious performance regressions related to an
> increased number of libraries being loaded (i686 + x86_64) and slightly
> bloated code which sometimes cannot fit in the CPU cache. Call me old
> fashioned but I won't upgrade to x86_64 until most of the things that I run
> locally are available for x86_64 and that won't happen any time soon.

oy vey. WTF's been happening in blk-merge.c?

Theyy're not the same bug. The bug in your thread was introduced by Jens in
5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed
up the bvprv handling - but that patch comes after the patch Artem bisected to.

blk_bio_segment_split() looks correct in b54ffb73ca.

What we need to do is:
in the _driver_, immediately before handing the sglist off to the device, walk
the sglist and verify it obeys all the restrictions for that particular device
- and if it's not, print out exactly what we screwed up.

I don't know where that code lives in the ahci driver, and more importantly I
don't know where the dma restrictions come from, but if someone who knows the
driver code can walk me through it I'll write the patch.

--------------

Also - Ming, Christoph, anyone else who might be working on this stuff in the
future:

The way all the queue limits stuff works is still way too fragile; this has been
a recurring source of bugs. There's way too many different restrictions
different devices need, and it's easy for a driver to specify the restrictions
incorrectly in a way that just happens to work, but for the wrong reasons - e.g.
"I can't handle more than x segments, but saying I can't handle more than x
sectors happens to work for now because of some other bug in the upper layers" -
and then when we have to debug that later, we're screwed.

My intent when I was working on this was to eventually push the implementation
of the limits down as much as possible to the actual drivers - i.e. there the
limitations come from, so the driver can say, for example:

"ok, my device can only do scatter/gather dma to max 20 different addresses, so
I'll allocate sglists with 20 entries, and it doesn't matter if the bio or
request or whatever is bigger because when I call blk_rq_map_sg() it's just
going to map as much of the request as will fit in a given sglist and requests
will get processed incrementally until they're finished - and if a particular sg
entry can only be a particular size, or has alignment restrictions or whatever,
I'll just pass that directly to blk_rq_map_sg()"

so that the driver is ideally specifying _only_ its real restrictions, and
they're being specified in the code exactly where they're being used.

-------

Basically, blk_queue_split() was only meant to be an interim solution, so I'd
suggest that instead of doing performance optimizations on that codepath a
better use of time and effort would be to work towards ripping it out entirely.

2015-12-21 03:21:38

by Ming Lei

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov <[email protected]> wrote:
> # cat
> /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size}
> 32767
> 32767
> 168
> 65536

Looks it is fine, then maybe it is related with BIOVEC_PHYS_MERGEABLE(),
BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and
phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *'
is still 32bit.

It was confirmed that there isn't the issue if PAE is disabled.

Dumping both sata/ahci hw sg table and bio's bvec might be helpful.

On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet
<[email protected]> wrote:
>
> oy vey. WTF's been happening in blk-merge.c?
>
> Theyy're not the same bug. The bug in your thread was introduced by Jens in
> 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed
> up the bvprv handling - but that patch comes after the patch Artem bisected to.
>
> blk_bio_segment_split() looks correct in b54ffb73ca.

Yes, that is why reverting 578270bfb(block: fix segment split) can make the
issue disappear, because 5014c311ba "block: fix bogus compiler
warnings in blk-merge.c" basically disables sg-merge and prevents the
issue from being
triggered.



Thanks,
Ming Lei

2015-12-21 03:36:56

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 08:21, Ming Lei wrote:
> On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov wrote:
>> # cat
>> /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size}
>> 32767
>> 32767
>> 168
>> 65536
>
> Looks it is fine, then maybe it is related with
> BIOVEC_PHYS_MERGEABLE(),
> BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and
> phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *'
> is still 32bit.
>
> It was confirmed that there isn't the issue if PAE is disabled.
>
> Dumping both sata/ahci hw sg table and bio's bvec might be helpful.

Um, sorry, what exact variables/files do you want to see? I'm not an
expert in /sys.

>
> On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet wrote:
>>
>> oy vey. WTF's been happening in blk-merge.c?
>>
>> Theyy're not the same bug. The bug in your thread was introduced by
>> Jens in
>> 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where
>> he screwed
>> up the bvprv handling - but that patch comes after the patch Artem
>> bisected to.
>>
>> blk_bio_segment_split() looks correct in b54ffb73ca.
>
> Yes, that is why reverting 578270bfb(block: fix segment split) can make
> the
> issue disappear, because 5014c311ba "block: fix bogus compiler
> warnings in blk-merge.c" basically disables sg-merge and prevents the
> issue from being
> triggered.

2015-12-21 04:26:16

by Tejun Heo

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

Hello, Linus.

On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote:
...
> (Also Tejun - maybe you can see what's up - maybe that error message
> tells you something)

Hmmm... all it says is that something went wrong on the PCI side.

> I'm not sure what's up with his machine, the disk doesn't seem to be
> anyuthing particularly unusual, it looks like a 1TB Seagate Barracuda:
>
> ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133
>
> which doesn't strike me as odd.
>
> Looking at the dmesg, it also looks like it's a pretty normal
> Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI
> ID for the AHCI chip seems to be (INTEL, 0x1c02).
>
> Any ideas? Anybody?

I wonder whether ahci is screwing up command / sg table setup in a way
that e.g. if there are too many segments the sg table overflows into
the neighboring one which is now being exposed by upper layer being
fixed to send down larger commands. Looking into it.

Thanks.

--
tejun

2015-12-21 04:32:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov <[email protected]> wrote:
>
> P.S. I know Linus doesn't condone PAE but I still find it more preferrable
> than running a mixed environment with almost zero benefit in regard to
> performance and quite obvious performance regressions related to an
> increased number of libraries being loaded (i686 + x86_64) and slightly
> bloated code which sometimes cannot fit in the CPU cache. Call me old
> fashioned but I won't upgrade to x86_64 until most of the things that I run
> locally are available for x86_64 and that won't happen any time soon.

Don't upgrade *user* land. User land doesn't use the braindamage that is PAE.

Just run a 64-bit kernel. Keep all your 32-bit userland apps and libraries.

Trust me, that *will* be faster. PAE works really horribly badly,
because all your really important data structures like your inodes and
directory cache will all be in the low 1GB even if you have 16BG of
RAM.

Of course, I'd also like more people to run things that way just to
get more coverage of the whole "yes, we do all the compat stuff
correctly". So I have some other reasons to prefer people running
64-bit kernels with 32-bit user land. But PAE really is a disaster.

Linus

2015-12-21 04:43:27

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 09:32, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov wrote:
>>
>> P.S. I know Linus doesn't condone PAE but I still find it more
>> preferrable
>> than running a mixed environment with almost zero benefit in regard to
>> performance and quite obvious performance regressions related to an
>> increased number of libraries being loaded (i686 + x86_64) and
>> slightly
>> bloated code which sometimes cannot fit in the CPU cache. Call me old
>> fashioned but I won't upgrade to x86_64 until most of the things that
>> I run
>> locally are available for x86_64 and that won't happen any time soon.
>
> Don't upgrade *user* land. User land doesn't use the braindamage that
> is PAE.
>
> Just run a 64-bit kernel. Keep all your 32-bit userland apps and
> libraries.
>
> Trust me, that *will* be faster. PAE works really horribly badly,
> because all your really important data structures like your inodes and
> directory cache will all be in the low 1GB even if you have 16BG of
> RAM.
>
> Of course, I'd also like more people to run things that way just to
> get more coverage of the whole "yes, we do all the compat stuff
> correctly". So I have some other reasons to prefer people running
> 64-bit kernels with 32-bit user land. But PAE really is a disaster.
>

In the past I happily ran an x86_64 bit kernel together with 32bit
userland for quite some time but then I hit a wall: VirtualBox expects
its kernel modules to have the same bitness as the application itself so
I had to revert back to an i686 PAE setup. It's probably high time to
try qemu however last time I looked at it a few years ago it lacked
several crucial features I need from a VM.

2015-12-21 04:47:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 8:43 PM, Artem S. Tashkinov <[email protected]> wrote:
>
> In the past I happily ran an x86_64 bit kernel together with 32bit userland
> for quite some time but then I hit a wall: VirtualBox expects its kernel
> modules to have the same bitness as the application itself so I had to
> revert back to an i686 PAE setup.

Ugh, ok. That kind of forces your hand, yes.

Although:

> t's probably high time to try qemu however last time I looked at it a few
> years ago it lacked several crucial features I need from a VM.

kvm-qemu really ends up working pretty well.. Give it a try.

That said, we obviously need to figure out this current problem
regardless first..

Linus

2015-12-21 05:10:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 8:26 PM, Tejun Heo <[email protected]> wrote:
>
> I wonder whether ahci is screwing up command / sg table setup in a way
> that e.g. if there are too many segments the sg table overflows into
> the neighboring one which is now being exposed by upper layer being
> fixed to send down larger commands. Looking into it.

That would explain the

Corrupted low memory at c0001000 ...

that Artem also saw.

Anyway, it would be lovely to have some verification in the ATA
routines that the passed-on IO actually h9onors the limits it set.
Could you add a WARN_ON_ONCE(check_io_limits())" or similar, and maybe
we could catch whatever causes the overflow red-handed?

On a totally separate issue:

Just looking at some of the merging code, and I have to say that it
strikes me as insane. This in particular:

#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
(((addr1) | (mask)) == (((addr2) - 1) | (mask)))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
__BIO_SEG_BOUNDARY(bvec_to_phys((b1)), bvec_to_phys((b2)) +
(b2)->bv_len, queue_segment_boundary((q)))

seems just *stupid*.

Why does it do that "bvec_to_phys((b2)) + (b2)->bv_len -1" on the
second bvec? That's the :"physical address of the last byte of the
second bvec".

I understand the "round both addresses up by the mask, and we want to
make sure that they are in the same segment" part.

But since an individual bvec had better be fully inside one segment
(since we split at bvec boundaries anyway, so if ). why do all that
crap anyway? The end address doesn't matter, you could just use the
beginning.

So remove the "-1" and remove the "+bv_len".

At which it would become just

#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
((addr1) | (mask) == (addr2)|(mask))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
__BIO_SEG_BOUNDARY(bvec_to_phys((b1)), bvec_to_phys((b2)),
queue_segment_boundary((q)))

which seems simpler and more understandable. "Are the beginning
addresses in within the same segment"

Or are there ever bv_len == 0 things at the boundary that we want to
merge. Because then the "-1+bv_len" case migth make sense.

Anyway, that shouldn't change the end result in any way, so that
doesn't all *matter*, but it worries me when things look more
complicated than I think they should be.

Is there something I'm missing?

Linus

2015-12-21 05:23:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds
<[email protected]> wrote:
>
> That said, we obviously need to figure out this current problem
> regardless first..

... although maybe it *would* be interesting to hear what happens if
you just compile a 64-bit kernel instead?

Do you still see the problem? Because if not, then we should look very
specifically for some 32-bit PAE issue.

For example, maybe we use "unsigned long" somewhere where we should
use "phys_addr_t". On x86-64, they obviously end up being the same. On
normal non-PAE x86-32, they are also the same. But ..

Linus

2015-12-21 06:55:32

by Tejun Heo

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

Artem, can you please reproduce the issue with the following patch
applied and attach the kernel log?

Thanks.

---
drivers/ata/libahci.c | 40 ++++++++++++++++++++++++++++++++++++++--
drivers/ata/libata-eh.c | 4 ++++
drivers/ata/libata-scsi.c | 1 +
3 files changed, 43 insertions(+), 2 deletions(-)

--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -2278,7 +2278,7 @@ static int ahci_port_start(struct ata_po
struct ahci_host_priv *hpriv = ap->host->private_data;
struct device *dev = ap->host->dev;
struct ahci_port_priv *pp;
- void *mem;
+ void *mem, *base;
dma_addr_t mem_dma;
size_t dma_sz, rx_fis_sz;

@@ -2319,7 +2319,9 @@ static int ahci_port_start(struct ata_po
rx_fis_sz = AHCI_RX_FIS_SZ;
}

- mem = dmam_alloc_coherent(dev, dma_sz, &mem_dma, GFP_KERNEL);
+ base = mem = dmam_alloc_coherent(dev, dma_sz, &mem_dma, GFP_KERNEL);
+ printk("XXX port %d dma_sz=%zu mem=%p mem_dma=%p",
+ ap->port_no, dma_sz, mem, (void *)mem_dma);
if (!mem)
return -ENOMEM;
memset(mem, 0, dma_sz);
@@ -2331,6 +2333,8 @@ static int ahci_port_start(struct ata_po
pp->cmd_slot = mem;
pp->cmd_slot_dma = mem_dma;

+ pr_cont(" cmd_slot=%zu", mem - base);
+
mem += AHCI_CMD_SLOT_SZ;
mem_dma += AHCI_CMD_SLOT_SZ;

@@ -2340,6 +2344,8 @@ static int ahci_port_start(struct ata_po
pp->rx_fis = mem;
pp->rx_fis_dma = mem_dma;

+ pr_cont(" rx_fis=%zu", mem - base);
+
mem += rx_fis_sz;
mem_dma += rx_fis_sz;

@@ -2350,6 +2356,8 @@ static int ahci_port_start(struct ata_po
pp->cmd_tbl = mem;
pp->cmd_tbl_dma = mem_dma;

+ pr_cont(" cmd_tbl=%zu\n", mem - base);
+
/*
* Save off initial list of interrupts to be enabled.
* This could be changed later
@@ -2540,6 +2548,34 @@ int ahci_host_activate(struct ata_host *
}
EXPORT_SYMBOL_GPL(ahci_host_activate);

+void ahci_dump_dma(struct ata_queued_cmd *qc)
+{
+ struct ata_port *ap = qc->ap;
+ struct ahci_port_priv *pp = ap->private_data;
+ struct ahci_cmd_hdr *cmd = &pp->cmd_slot[qc->tag];
+ void *cmd_tbl = pp->cmd_tbl + qc->tag * AHCI_CMD_TBL_SZ;
+ u32 *fis = cmd_tbl;
+ struct ahci_sg *ahci_sg = cmd_tbl + AHCI_CMD_TBL_HDR_SZ;
+ int prdtl = (cmd->opts & 0xffff0000) >> 16;
+ int i;
+
+ printk("XXX cmd=%p cmd_tbl=%p ahci_sg=%p\n", cmd, cmd_tbl, ahci_sg);
+ printk("XXX opts=%x st=%x addr=%x addr_hi=%x rsvd=%x:%x:%x:%x\n",
+ cmd->opts, cmd->status, cmd->tbl_addr, cmd->tbl_addr_hi,
+ cmd->reserved[0], cmd->reserved[1], cmd->reserved[2], cmd->reserved[3]);
+ printk("XXX fis=%08x:%08x:%08x:%08x %08x:%08x:%08x:%08x\n",
+ fis[0], fis[1], fis[2], fis[3],
+ fis[4], fis[5], fis[6], fis[7]);
+
+ printk("XXX qc->n_elem=%d fis_len=%d prdtl=%d\n",
+ qc->n_elem, cmd->opts & 0xf, prdtl);
+
+ for (i = 0; i < prdtl; i++)
+ printk("XXX sg[%d] = %x %x %x (%d)\n",
+ i, ahci_sg[i].addr, ahci_sg[i].addr_hi, ahci_sg[i].flags_size,
+ (ahci_sg[i].flags_size & 0x7fffffff) + 1);
+}
+
MODULE_AUTHOR("Jeff Garzik");
MODULE_DESCRIPTION("Common AHCI SATA low-level routines");
MODULE_LICENSE("GPL");
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1059,6 +1059,7 @@ static int ata_do_link_abort(struct ata_

if (qc && (!link || qc->dev->link == link)) {
qc->flags |= ATA_QCFLAG_FAILED;
+ qc->err_mask = AC_ERR_DEV;
ata_qc_complete(qc);
nr_aborted++;
}
@@ -2416,6 +2417,8 @@ const char *ata_get_cmd_descript(u8 comm
}
EXPORT_SYMBOL_GPL(ata_get_cmd_descript);

+void ahci_dump_dma(struct ata_queued_cmd *qc);
+
/**
* ata_eh_link_report - report error handling to user
* @link: ATA link EH is going on
@@ -2590,6 +2593,7 @@ static void ata_eh_link_report(struct at
res->feature & ATA_IDNF ? "IDNF " : "",
res->feature & ATA_ABORTED ? "ABRT " : "");
#endif
+ ahci_dump_dma(qc);
}
}

--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4035,6 +4035,7 @@ int ata_scsi_user_scan(struct Scsi_Host
}

if (rc == 0) {
+ ata_port_freeze(ap);
ata_port_schedule_eh(ap);
spin_unlock_irqrestore(ap->lock, flags);
ata_port_wait_eh(ap);

2015-12-21 07:25:31

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 11:55, Tejun Heo wrote:
> Artem, can you please reproduce the issue with the following patch
> applied and attach the kernel log?
>
> Thanks.
>

I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es
revert). Hopefully it's how you intended it to be.

Here's the result (I skipped the beginning of dmesg - it's the same as
always - see bugzilla).


Attachments:
dmesg.log (79.84 kB)

2015-12-21 07:31:19

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 10:23, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> That said, we obviously need to figure out this current problem
>> regardless first..
>
> ... although maybe it *would* be interesting to hear what happens if
> you just compile a 64-bit kernel instead?
>
> Do you still see the problem? Because if not, then we should look very
> specifically for some 32-bit PAE issue.
>
> For example, maybe we use "unsigned long" somewhere where we should
> use "phys_addr_t". On x86-64, they obviously end up being the same. On
> normal non-PAE x86-32, they are also the same. But ..
>

Let's wait for what Tejun Heo might say - I've applied his debugging
patch and sent back the results.

Building x86_64 kernel here involves installing a 64bit Linux VM, so I'd
like it to be the last resort.

2015-12-21 11:21:48

by Dan Aloni

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Sun, Dec 20, 2015 at 10:41:44AM -0800, Linus Torvalds wrote:
>[..]
> Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up
> the dmesg files (in the same bisection tar-file as the bisection log)
> with the actual versions.

Perhaps we can print the Git revision in a manner independent of
CONFIG_LOCALVERSION_AUTO, using the attached patch. It will be emitted
in the dmesg Linux banner (though not in /proc/version, that's more
interface-ish and may break things).

--
Dan Aloni


Attachments:
(No filename) (497.00 B)
0001-init-version.c-add-SCM_REVISION.patch (3.29 kB)
Download all attachments

2015-12-21 19:35:55

by Tejun Heo

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

Hello, Artem.

On Mon, Dec 21, 2015 at 12:25:06PM +0500, Artem S. Tashkinov wrote:
> I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es
> revert). Hopefully it's how you intended it to be.
>
> Here's the result (I skipped the beginning of dmesg - it's the same as
> always - see bugzilla).

I added some debug messages during init. It isn't critical but it'd
be great if you attach the full log from now on. Something seemingly
unrelated surprisingly often turns out to be an important clue.

> [ 60.387407] Corrupted low memory at c0001000 (1000 phys) = cba3d25f
...
> [ 60.389131] Corrupted low memory at c0001ffc (1ffc phys) = 2712322a

It looks like the controller shat on the entire second page, which is
really puzzling. Looks like the controller is being fed corrupt DMA
SG table.

...
> [ 74.367608] ata1.00: exception Emask 0x3 SAct 0x180000 SErr 0x3040400 action 0x6 frozen
> [ 74.367613] ata1.00: irq_stat 0x45000008

The intresting bit here is that the controller is indicating OVERFLOW
which means that it consumed all PRD entries (ahci's DMA SG table) for
the command but the disk is still sending data to the host.

> [ 74.367617] ata1: SError: { Proto CommWake TrStaTrns UnrecFIS }
> [ 74.367621] ata1.00: failed command: READ FPDMA QUEUED
> [ 74.367627] ata1.00: cmd 60/00:98:00:89:21/07:00:04:00:00/40 tag 19 ncq 917504 in
> [ 74.367627] res 40/00:98:00:89:21/00:00:04:00:00/40 Emask 0x3 (HSM violation)
> [ 74.367630] ata1.00: status: { DRDY }

The followings are the data fed to the controller as seen from the
CPU.

> [ 74.367632] XXX cmd=ee9e0260 cmd_tbl=ee9ed600 ahci_sg=ee9ed680
> [ 74.367634] XXX opts=140005 st=0 addr=2e9ed600 addr_hi=0 rsvd=0:0:0:0
> [ 74.367637] XXX fis=00608027:40218900:07000004:08000098 00000000:00000000:00000000:00001fff
> [ 74.367639] XXX qc->n_elem=20 fis_len=5 prdtl=20
> [ 74.367641] XXX sg[0] = 29007000 0 8fff (36864)
> [ 74.367643] XXX sg[1] = 29218000 0 7fff (32768)
> [ 74.367645] XXX sg[2] = 29298000 0 7fff (32768)
> [ 74.367646] XXX sg[3] = 29ad8000 0 7fff (32768)
> [ 74.367648] XXX sg[4] = 29338000 0 7fff (32768)
> [ 74.367650] XXX sg[5] = 29370000 0 2fff (12288)
> [ 74.367652] XXX sg[6] = 219000 0 2fff (12288)
> [ 74.367653] XXX sg[7] = 230000 0 3fff (16384)
> [ 74.367655] XXX sg[8] = 29373000 0 4fff (20480)
> [ 74.367657] XXX sg[9] = 29130000 0 ffff (65536)
> [ 74.367659] XXX sg[10] = 29170000 0 ffff (65536)
> [ 74.367660] XXX sg[11] = 29280000 0 ffff (65536)
> [ 74.367662] XXX sg[12] = 29200000 0 ffff (65536)
> [ 74.367664] XXX sg[13] = 29320000 0 ffff (65536)
> [ 74.367666] XXX sg[14] = 29360000 0 ffff (65536)
> [ 74.367667] XXX sg[15] = 29340000 0 ffff (65536)
> [ 74.367669] XXX sg[16] = 29350000 0 ffff (65536)
> [ 74.367671] XXX sg[17] = 29300000 0 ffff (65536)
> [ 74.367672] XXX sg[18] = 29310000 0 ffff (65536)
> [ 74.367674] XXX sg[19] = 29020000 0 7fff (32768)

And everything checks out. Data lenghts are consistent and all the
addresses look kosher - at least nothing should upset the data
transfer itself.

> [ 74.367677] ata1.00: failed command: READ FPDMA QUEUED
> [ 74.367682] ata1.00: cmd 60/90:a0:c0:fe:23/00:00:04:00:00/40 tag 20 ncq 73728 in
> [ 74.367682] res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x7 (timeout)

This one looks like a collateral damage.

...
> [ 74.763895] ata1.00: exception Emask 0x3 SAct 0x800000 SErr 0x3040400 action 0x6
> [ 74.763900] ata1.00: irq_stat 0x45000008
> [ 74.763903] ata1: SError: { Proto CommWake TrStaTrns UnrecFIS }
> [ 74.763907] ata1.00: failed command: READ FPDMA QUEUED
> [ 74.763913] ata1.00: cmd 60/98:b8:00:c9:20/03:00:04:00:00/40 tag 23 ncq 471040 in
> [ 74.763913] res 40/00:b8:00:c9:20/00:00:04:00:00/40 Emask 0x3 (HSM violation)
> [ 74.763916] ata1.00: status: { DRDY }
> [ 74.763918] XXX cmd=ee9e02e0 cmd_tbl=ee9f0200 ahci_sg=ee9f0280
> [ 74.763921] XXX opts=160005 st=0 addr=2e9f0200 addr_hi=0 rsvd=0:0:0:0
> [ 74.763924] XXX fis=98608027:4020c900:03000004:080000b8 00000000:00000000:00000000:00000fff
> [ 74.763925] XXX qc->n_elem=22 fis_len=5 prdtl=22
> [ 74.763928] XXX sg[0] = 2ab1b000 0 fff (4096)
...
> [ 74.763964] XXX sg[21] = 29170000 0 dfff (57344)

This is a separate failure and shares the same pattern as before.
Everything looks proper.

The thing is ahci doesn't have much restrictions in terms of its DMA
capabilities. It can digest pretty much anything. The only
restriction is that each entry can't be larger than 4M - but our
segment maximum is 64k. There's no noticeable boundary crossing
happening both in target DMA regions and command tables. All
addresses are in linear mapped normal area.

If the controller is seeing what the host is seeing in the command
area, I can't see why it would be declaring overflow or dumping stuff
into the lowest pages.

Ming Lei reported a similar issue on 32bit ARM w/ PAE. I don't
understand why PAE is making any difference. Native 64bit machines
should be generating IOs just as large. Maybe iommu is hiding the
issue?

I'm afraid we'll have to go brute force with the problem - dump more
information on both 32 and 64bits and see where the differences lie.
At the moment, given that the DMA table looks completely fine and ahci
isn't picky about the shape of data areas, I think it's more likely
some obscure issue in address mapping under PAE or a silly bug in ahci
than block layer screwing up merging.

Thanks.

--
tejun

2015-12-21 20:07:26

by Tejun Heo

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

Hello, Artem.

Can you please apply the following patch on top and see whether
anything changes? If it does make the issue go away, can you please
revert the ".can_queue" part and test again?

Thanks.

---
drivers/ata/ahci.h | 2 +-
drivers/ata/libahci.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
*/
#define AHCI_SHT(drv_name) \
ATA_NCQ_SHT(drv_name), \
- .can_queue = AHCI_MAX_CMDS - 1, \
+ .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \
.sg_tablesize = AHCI_MAX_SG, \
.dma_boundary = AHCI_DMA_BOUNDARY, \
.shost_attrs = ahci_shost_attrs, \
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
hpriv->saved_cap2 = cap2 = 0;

/* some chips have errata preventing 64bit use */
- if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
+ if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) {
dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
cap &= ~HOST_CAP_64;
}

--
tejun

2015-12-21 21:08:15

by Tejun Heo

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

Hello, again.

On Mon, Dec 21, 2015 at 03:07:21PM -0500, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether
> anything changes? If it does make the issue go away, can you please
> revert the ".can_queue" part and test again?

If the patch doesn't change anything, can you please try the
followings and see which one makes difference?

1. Exclude memory above 4G line with boot param "max_addr=4G".
2. Disable highmem with "highmem=0".
3. Try booting 64bit kernel.

At the moment, the only thing I can think of which can explain the PAE
+ bio_get_nr_vecs() situation is that the bio split code which is
activated by the bio_get_nr_vecs() somehow messes up 64bit or high
addresses on 32bit kernels. I scanned for the obvious but at bio
layer, memory is represented by struct page, so nothing obvious seems
broken.

Note that for now I'm ignoring the debug dumps from the ahci debug
patch which indicates that the passed in addresses are all fine. It
is possible that the controller gets confused with certain receiving
addresses and reports failure on later commands or maybe there is a
different sequence of events which can encompass both that we don't
know of yet.

Thanks.

--
tejun

2015-12-21 22:51:30

by Ming Lei

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Tue, Dec 22, 2015 at 3:35 AM, Tejun Heo <[email protected]> wrote:
>
>> [ 74.367632] XXX cmd=ee9e0260 cmd_tbl=ee9ed600 ahci_sg=ee9ed680
>> [ 74.367634] XXX opts=140005 st=0 addr=2e9ed600 addr_hi=0 rsvd=0:0:0:0
>> [ 74.367637] XXX fis=00608027:40218900:07000004:08000098 00000000:00000000:00000000:00001fff
>> [ 74.367639] XXX qc->n_elem=20 fis_len=5 prdtl=20
>> [ 74.367641] XXX sg[0] = 29007000 0 8fff (36864)
>> [ 74.367643] XXX sg[1] = 29218000 0 7fff (32768)
>> [ 74.367645] XXX sg[2] = 29298000 0 7fff (32768)
>> [ 74.367646] XXX sg[3] = 29ad8000 0 7fff (32768)
>> [ 74.367648] XXX sg[4] = 29338000 0 7fff (32768)
>> [ 74.367650] XXX sg[5] = 29370000 0 2fff (12288)
>> [ 74.367652] XXX sg[6] = 219000 0 2fff (12288)
>> [ 74.367653] XXX sg[7] = 230000 0 3fff (16384)
>> [ 74.367655] XXX sg[8] = 29373000 0 4fff (20480)
>> [ 74.367657] XXX sg[9] = 29130000 0 ffff (65536)
>> [ 74.367659] XXX sg[10] = 29170000 0 ffff (65536)
>> [ 74.367660] XXX sg[11] = 29280000 0 ffff (65536)
>> [ 74.367662] XXX sg[12] = 29200000 0 ffff (65536)
>> [ 74.367664] XXX sg[13] = 29320000 0 ffff (65536)
>> [ 74.367666] XXX sg[14] = 29360000 0 ffff (65536)
>> [ 74.367667] XXX sg[15] = 29340000 0 ffff (65536)
>> [ 74.367669] XXX sg[16] = 29350000 0 ffff (65536)
>> [ 74.367671] XXX sg[17] = 29300000 0 ffff (65536)
>> [ 74.367672] XXX sg[18] = 29310000 0 ffff (65536)
>> [ 74.367674] XXX sg[19] = 29020000 0 7fff (32768)
>
> And everything checks out. Data lenghts are consistent and all the
> addresses look kosher - at least nothing should upset the data
> transfer itself.

Maybe we can check more, such as if the sg element is correctly
merged from bvec, and the following code should be useful to check
that:

+static void ahci_dump_req(struct ata_queued_cmd *qc)
+{
+ struct scsi_cmnd *cmd = qc->scsicmd;
+ struct request *req = cmd->request;
+ struct req_iterator iter;
+ struct bio_vec bv;
+ int i = 0;
+ phys_addr_t paddr;
+
+ printk("%s: \n", __func__);
+ rq_for_each_segment(bv, req, iter) {
+ paddr = page_to_phys(bv.bv_page);
+ printk("\t %3d: %x-%x %x %u\n", i++,
+ (unsigned)paddr & 0xffffffff,
+ (unsigned)(paddr >> 32),
+ bv.bv_offset,
+ bv.bv_len);
+ }
+}

Thanks,

2015-12-22 03:43:27

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

I just reproduced it - Artem, I'll let you know when we have a possible fix but
hopefully there won't be any need for you to beat up your hardware any more :)

On Mon, Dec 21, 2015 at 04:08:11PM -0500, Tejun Heo wrote:
> Hello, again.
>
> On Mon, Dec 21, 2015 at 03:07:21PM -0500, Tejun Heo wrote:
> > Hello, Artem.
> >
> > Can you please apply the following patch on top and see whether
> > anything changes? If it does make the issue go away, can you please
> > revert the ".can_queue" part and test again?
>
> If the patch doesn't change anything, can you please try the
> followings and see which one makes difference?
>
> 1. Exclude memory above 4G line with boot param "max_addr=4G".
> 2. Disable highmem with "highmem=0".
> 3. Try booting 64bit kernel.
>
> At the moment, the only thing I can think of which can explain the PAE
> + bio_get_nr_vecs() situation is that the bio split code which is
> activated by the bio_get_nr_vecs() somehow messes up 64bit or high
> addresses on 32bit kernels. I scanned for the obvious but at bio
> layer, memory is represented by struct page, so nothing obvious seems
> broken.
>
> Note that for now I'm ignoring the debug dumps from the ahci debug
> patch which indicates that the passed in addresses are all fine. It
> is possible that the controller gets confused with certain receiving
> addresses and reports failure on later commands or maybe there is a
> different sequence of events which can encompass both that we don't
> know of yet.

It repros pretty easily, I should be able to just write some test code that
sends down single specific IOs so no matter what the controller is doing we know
exactly what IO triggered the error.

I'm checking all the 64 bit/pae/etc. stuff now.

2015-12-22 03:59:52

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Mon, Dec 21, 2015 at 04:08:11PM -0500, Tejun Heo wrote:

reproduced it with 32 bit pae:

> 1. Exclude memory above 4G line with boot param "max_addr=4G".

doesn't work - max_addr=1G doesn't work either

> 2. Disable highmem with "highmem=0".

works!

> 3. Try booting 64bit kernel.

works

Ok, so maybe it actually is PAE specific... but like you noted the block layer
works entirely in terms of pages so...

The one idea I can think of is - maybe BIOVEC_PHYS_MERGEABLE() is broken in PAE
mode? I am unfamiliar with anything PAE though.

Where does the ahci code consume the sglist?

2015-12-22 04:06:13

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-21 10:23, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> That said, we obviously need to figure out this current problem
>> regardless first..
>
> ... although maybe it *would* be interesting to hear what happens if
> you just compile a 64-bit kernel instead?

Under x86-64 I cannot reproduce this problem. It seems like it's PAE
specific (Kent Overstreet says he has reproduced it).

>
> Do you still see the problem? Because if not, then we should look very
> specifically for some 32-bit PAE issue.
>
> For example, maybe we use "unsigned long" somewhere where we should
> use "phys_addr_t". On x86-64, they obviously end up being the same. On
> normal non-PAE x86-32, they are also the same. But ..
>
> Linus

2015-12-22 04:45:23

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

So what I _really_ want to know is - how the fuck is the actual ATA command
itself malformed?

You told me at one point that the error code indicated the controller was
claiming it overran the end of the sglist - well, if that's the case we ought to
be able to prove it with an assertion (I already tried; qc->nbytes does match
the sglist, checking from ata_sg_setup()).

Ignore bvec merging, PAE, all that crap - the controller doesn't know about any
of that. _WHAT_ are we feeding it that it doesn't like?

There shouldn't even be that much stuff to check, since in theory the only
possible thing that could be at fault is the sglist. Maybe they're too big, too
small, misaligned, too many of them, god knows what, but somehow the sglist
we're feeding the device has to be at fault, right?

But the sglists in Artem's debugging output look pretty uninteresting (one
of them has _no_ merged segments - it's just 18 4k pages - how could THAT be an
issue?)

Gonna apply your debugging patch and start throwing stuff at the wall next, I
guess...

2015-12-22 05:10:26

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether
> anything changes? If it does make the issue go away, can you please
> revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
> drivers/ata/ahci.h | 2 +-
> drivers/ata/libahci.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
> */
> #define AHCI_SHT(drv_name) \
> ATA_NCQ_SHT(drv_name), \
> - .can_queue = AHCI_MAX_CMDS - 1, \
> + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \
> .sg_tablesize = AHCI_MAX_SG, \
> .dma_boundary = AHCI_DMA_BOUNDARY, \
> .shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
> hpriv->saved_cap2 = cap2 = 0;
>
> /* some chips have errata preventing 64bit use */
> - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> + if ((cap & HOST_CAP_64)/* && (hpriv->flags &
> AHCI_HFLAG_32BIT_ONLY)*/) {
> dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
> cap &= ~HOST_CAP_64;
> }

This patch fixes the issue for me. Now rechecking without .can_queue
part.

BTW, since I left debugging on, here's the part you wanted:

[ 0.613851] XXX port 0 dma_sz=91392 mem=c0020000 mem_dma=00020000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[ 0.613865] XXX port 1 dma_sz=91392 mem=eea00000 mem_dma=2ea00000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[ 0.620464] XXX port 2 dma_sz=91392 mem=eea20000 mem_dma=2ea20000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[ 0.627121] XXX port 3 dma_sz=91392 mem=eea40000 mem_dma=2ea40000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[ 0.633791] XXX port 4 dma_sz=91392 mem=eea60000 mem_dma=2ea60000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[ 0.640445] XXX port 5 dma_sz=91392 mem=eea80000 mem_dma=2ea80000
cmd_slot=0 rx_fis=1024 cmd_tbl=1280

2015-12-22 05:20:22

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether
> anything changes? If it does make the issue go away, can you please
> revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
> drivers/ata/ahci.h | 2 +-
> drivers/ata/libahci.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
> */
> #define AHCI_SHT(drv_name) \
> ATA_NCQ_SHT(drv_name), \
> - .can_queue = AHCI_MAX_CMDS - 1, \
> + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \
> .sg_tablesize = AHCI_MAX_SG, \
> .dma_boundary = AHCI_DMA_BOUNDARY, \
> .shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
> hpriv->saved_cap2 = cap2 = 0;
>
> /* some chips have errata preventing 64bit use */
> - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> + if ((cap & HOST_CAP_64)/* && (hpriv->flags &
> AHCI_HFLAG_32BIT_ONLY)*/) {
> dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
> cap &= ~HOST_CAP_64;
> }

With the ".can_queue" part left intact the bug resurfaced. Full dmesg
output is attached.


Attachments:
dmesg.xz (13.30 kB)

2015-12-22 05:27:15

by Junichi Nomura

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 12/22/15 12:59, Kent Overstreet wrote:
> reproduced it with 32 bit pae:
>
>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>
> doesn't work - max_addr=1G doesn't work either
>
>> 2. Disable highmem with "highmem=0".
>
> works!
>
>> 3. Try booting 64bit kernel.
>
> works

blk_queue_bio() does split then bounce, which makes the segment
counting based on pages before bouncing and could go wrong.

What do you think of a patch like this?

--
Jun'ichi Nomura, NEC Corporation

diff --git a/block/blk-core.c b/block/blk-core.c
index 5131993b..1d1c3c7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
struct request *req;
unsigned int request_count = 0;

- blk_queue_split(q, &bio, q->bio_split);
-
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
@@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
*/
blk_queue_bounce(q, &bio);

+ blk_queue_split(q, &bio, q->bio_split);
+
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio->bi_error = -EIO;
bio_endio(bio);-

2015-12-22 05:37:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
> On 12/22/15 12:59, Kent Overstreet wrote:
> > reproduced it with 32 bit pae:
> >
> >> 1. Exclude memory above 4G line with boot param "max_addr=4G".
> >
> > doesn't work - max_addr=1G doesn't work either
> >
> >> 2. Disable highmem with "highmem=0".
> >
> > works!
> >
> >> 3. Try booting 64bit kernel.
> >
> > works
>
> blk_queue_bio() does split then bounce, which makes the segment
> counting based on pages before bouncing and could go wrong.
>
> What do you think of a patch like this?

Shit, you nailed it. Can't believe I didn't think to check that.

>
> --
> Jun'ichi Nomura, NEC Corporation
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5131993b..1d1c3c7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> struct request *req;
> unsigned int request_count = 0;
>
> - blk_queue_split(q, &bio, q->bio_split);
> -
> /*
> * low level driver can indicate that it wants pages above a
> * certain limit bounced to low memory (ie for highmem, or even
> @@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> */
> blk_queue_bounce(q, &bio);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
> bio->bi_error = -EIO;
> bio_endio(bio);

2015-12-22 05:39:00

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
> On 12/22/15 12:59, Kent Overstreet wrote:
> > reproduced it with 32 bit pae:
> >
> >> 1. Exclude memory above 4G line with boot param "max_addr=4G".
> >
> > doesn't work - max_addr=1G doesn't work either
> >
> >> 2. Disable highmem with "highmem=0".
> >
> > works!
> >
> >> 3. Try booting 64bit kernel.
> >
> > works
>
> blk_queue_bio() does split then bounce, which makes the segment
> counting based on pages before bouncing and could go wrong.
>
> What do you think of a patch like this?

Artem, can you give this patch a try?

>
> --
> Jun'ichi Nomura, NEC Corporation
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5131993b..1d1c3c7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> struct request *req;
> unsigned int request_count = 0;
>
> - blk_queue_split(q, &bio, q->bio_split);
> -
> /*
> * low level driver can indicate that it wants pages above a
> * certain limit bounced to low memory (ie for highmem, or even
> @@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
> */
> blk_queue_bounce(q, &bio);
>
> + blk_queue_split(q, &bio, q->bio_split);
> +
> if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
> bio->bi_error = -EIO;
> bio_endio(bio);

2015-12-22 05:52:47

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-22 10:38, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
>> On 12/22/15 12:59, Kent Overstreet wrote:
>> > reproduced it with 32 bit pae:
>> >
>> >> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>> >
>> > doesn't work - max_addr=1G doesn't work either
>> >
>> >> 2. Disable highmem with "highmem=0".
>> >
>> > works!
>> >
>> >> 3. Try booting 64bit kernel.
>> >
>> > works
>>
>> blk_queue_bio() does split then bounce, which makes the segment
>> counting based on pages before bouncing and could go wrong.
>>
>> What do you think of a patch like this?
>
> Artem, can you give this patch a try?


This patch ostensibly fixes the issue - at least I cannot immediately
reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"

>
>>
>> --
>> Jun'ichi Nomura, NEC Corporation
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 5131993b..1d1c3c7 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct
>> request_queue *q, struct bio *bio)
>> struct request *req;
>> unsigned int request_count = 0;
>>
>> - blk_queue_split(q, &bio, q->bio_split);
>> -
>> /*
>> * low level driver can indicate that it wants pages above a
>> * certain limit bounced to low memory (ie for highmem, or even
>> @@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct
>> request_queue *q, struct bio *bio)
>> */
>> blk_queue_bounce(q, &bio);
>>
>> + blk_queue_split(q, &bio, q->bio_split);
>> +
>> if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
>> bio->bi_error = -EIO;
>> bio_endio(bio);

2015-12-22 05:55:18

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote:
> On 2015-12-22 10:38, Kent Overstreet wrote:
> >On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
> >>On 12/22/15 12:59, Kent Overstreet wrote:
> >>> reproduced it with 32 bit pae:
> >>>
> >>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
> >>>
> >>> doesn't work - max_addr=1G doesn't work either
> >>>
> >>>> 2. Disable highmem with "highmem=0".
> >>>
> >>> works!
> >>>
> >>>> 3. Try booting 64bit kernel.
> >>>
> >>> works
> >>
> >>blk_queue_bio() does split then bounce, which makes the segment
> >>counting based on pages before bouncing and could go wrong.
> >>
> >>What do you think of a patch like this?
> >
> >Artem, can you give this patch a try?
>
>
> This patch ostensibly fixes the issue - at least I cannot immediately
> reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"

Let's all contemplate the fact that blk_segment_map_sg() _overrunning the end of
the provided sglist_ was this much of a clusterfuck to debug.

2015-12-22 05:59:15

by Artem S. Tashkinov

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 2015-12-22 10:55, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote:
>> On 2015-12-22 10:38, Kent Overstreet wrote:
>> >On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
>> >>On 12/22/15 12:59, Kent Overstreet wrote:
>> >>> reproduced it with 32 bit pae:
>> >>>
>> >>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>> >>>
>> >>> doesn't work - max_addr=1G doesn't work either
>> >>>
>> >>>> 2. Disable highmem with "highmem=0".
>> >>>
>> >>> works!
>> >>>
>> >>>> 3. Try booting 64bit kernel.
>> >>>
>> >>> works
>> >>
>> >>blk_queue_bio() does split then bounce, which makes the segment
>> >>counting based on pages before bouncing and could go wrong.
>> >>
>> >>What do you think of a patch like this?
>> >
>> >Artem, can you give this patch a try?
>>
>>
>> This patch ostensibly fixes the issue - at least I cannot immediately
>> reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"
>
> Let's all contemplate the fact that blk_segment_map_sg() _overrunning
> the end of
> the provided sglist_ was this much of a clusterfuck to debug.

From the look of it this fix has nothing to do with PAE, so then why
only PAE users like me were affected by the original
(b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?

2015-12-22 06:02:34

by Kent Overstreet

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On Tue, Dec 22, 2015 at 10:59:09AM +0500, Artem S. Tashkinov wrote:
> On 2015-12-22 10:55, Kent Overstreet wrote:
> >On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote:
> >>On 2015-12-22 10:38, Kent Overstreet wrote:
> >>>On Tue, Dec 22, 2015 at 05:26:12AM +0000, Junichi Nomura wrote:
> >>>>On 12/22/15 12:59, Kent Overstreet wrote:
> >>>>> reproduced it with 32 bit pae:
> >>>>>
> >>>>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
> >>>>>
> >>>>> doesn't work - max_addr=1G doesn't work either
> >>>>>
> >>>>>> 2. Disable highmem with "highmem=0".
> >>>>>
> >>>>> works!
> >>>>>
> >>>>>> 3. Try booting 64bit kernel.
> >>>>>
> >>>>> works
> >>>>
> >>>>blk_queue_bio() does split then bounce, which makes the segment
> >>>>counting based on pages before bouncing and could go wrong.
> >>>>
> >>>>What do you think of a patch like this?
> >>>
> >>>Artem, can you give this patch a try?
> >>
> >>
> >>This patch ostensibly fixes the issue - at least I cannot immediately
> >>reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"
> >
> >Let's all contemplate the fact that blk_segment_map_sg() _overrunning the
> >end of
> >the provided sglist_ was this much of a clusterfuck to debug.
>
> From the look of it this fix has nothing to do with PAE, so then why only
> PAE users like me were affected by the original
> (b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?

The amusing thing is that I doubt PAE actually requires bouncing - addressing
limits come from the device, not the cpu.

But evidently in PAE mode, the block layer is in fact bouncing bios. Probably
from some default setting in the queue limits that no one ever looks at.

The whole queue limits design is an atrocity, it leads to exactly this kind of
crap where no one can predict the actual behaviour of any given setup.

2015-12-22 17:28:42

by Jens Axboe

[permalink] [raw]
Subject: Re: IO errors after "block: remove bio_get_nr_vecs()"

On 12/21/2015 10:26 PM, Junichi Nomura wrote:
> On 12/22/15 12:59, Kent Overstreet wrote:
>> reproduced it with 32 bit pae:
>>
>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>>
>> doesn't work - max_addr=1G doesn't work either
>>
>>> 2. Disable highmem with "highmem=0".
>>
>> works!
>>
>>> 3. Try booting 64bit kernel.
>>
>> works
>
> blk_queue_bio() does split then bounce, which makes the segment
> counting based on pages before bouncing and could go wrong.

Good catch! The blk-mq parts aren't affected by this, the screw up only
happened in the old IO path. I've added this with the appropriate
tested-by from Artem, and CC stable and listed the commit that broke it:

commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e
Author: Kent Overstreet <[email protected]>
Date: Thu Apr 23 22:37:18 2015 -0700

block: make generic_make_request handle arbitrarily sized bios

Thanks to all involved in nailing this down, it'll go out shortly.

--
Jens Axboe