From: "Ulrich Windl"
Date: Wed, 07 Dec 2016 13:23:58 +0100
Subject: Antw: 3.0.101: "blk_rq_check_limits: over max size limit."
Message-Id: <58480D6E020000A10002393A@gwsmtp1.uni-regensburg.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi again!

An addition: Processes doing such I/O seem to be unkillable, and I also cannot
change the queue parameters while this problem occurs (the process trying to
write, e.g. to queue/scheduler, is blocked as well). The status of the process
doing the I/O looks like this:

# cat /proc/2762/status
Name:   randio
State:  D (disk sleep)
Tgid:   2762
Pid:    2762
PPid:   53250
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups: 0 105
Threads:        1
SigQ:   5/38340
SigPnd: 0000000000000000
ShdPnd: 0000000000004102
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:      0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        5
nonvoluntary_ctxt_switches:     1

Best regards,
Ulrich

>>> Ulrich Windl wrote on 07.12.2016 at 13:19 in message <5847FE66.7E4 : 161 : 60728>:
> Hi again!
>
> Maybe someone can confirm this:
> If you have a device (e.g. a multipath map) that limits max_sectors_kb to,
> say, 64, and then define an LVM LV using that multipath map as PV, the LV
> still has a larger max_sectors_kb. If you send a big request (a read in my
> case), the kernel will complain:
>
> kernel: [173116.098798] blk_rq_check_limits: over max size limit.
>
> Note that this message gives no clue about the device involved, the size of
> the I/O attempted, or the limit on the I/O size.
>
> My expectation would be that the higher layer reports back an I/O error and
> the user process receives an I/O error, or, alternatively, that the big
> request is split into acceptable chunks before being passed to the lower
> layers.
>
> However, none of the above happens; instead the request seems to block the
> request queue, because later TUR checks also fail:
> kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
> kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
> multipathd: 66:384: mark as failed
> multipathd: NAP_S11: remaining active paths: 1
> multipathd: 66:352: mark as failed
> multipathd: NAP_S11: Entering recovery mode: max_retries=6
> multipathd: NAP_S11: remaining active paths: 0
>
> (somewhat later)
> multipathd: NAP_S11: sdkh - tur checker reports path is up
> multipathd: 66:336: reinstated
> multipathd: NAP_S11: Recovered to normal mode
> kernel: [173117.286712] device-mapper: multipath: Could not failover device
> 66:368: Handler scsi_dh_alua error 8.
> (I don't know the implications of this.)
>
> Of course this error does not appear as long as all devices use the same
> maximum request size, but tests have shown that different SAN disk systems
> prefer different request sizes (as they split large requests internally to
> handle them in chunks anyway).
>
> Last seen with this kernel (SLES11 SP4 on x86_64): Linux version
> 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch
> revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message <57BC65CD.D1A : 161 : 60728>:
> > Hello!
> >
> > While performance-testing a 3PARdata StoreServ 8400 with SLES11 SP4, I
> > noticed that I/O throughput dropped until everything more or less stood
> > still. Looking into the syslog I found that multipath's TUR checker
> > considered the paths (FC, BTW) as dead. Amazingly, I did not have this
> > problem when running read-only tests.
> >
> > The start looks like this:
> > Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> > Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> > Aug 23 14:44:58 h10 kernel: [  880.159425] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159611] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159615] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.159623] device-mapper: multipath: Failing path 8:32.
> > Aug 23 14:44:58 h10 kernel: [  880.186609] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.186626] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.186628] blk_rq_check_limits: over max size limit.
> > Aug 23 14:44:58 h10 kernel: [  880.186631] device-mapper: multipath: Failing path 129:112.
> > [...]
> > It seems the TUR checker plays a ping-pong-like game: paths go up and down.
> >
> > Now for the Linux part: I found the relevant message in blk-core.c
> > (blk_rq_check_limits()). First, s/agaist/against/ in " * Such request
> > stacking drivers should check those requests agaist"; then there is the
> > problem that the message outputs neither blk_rq_sectors(), nor
> > blk_queue_get_max_sectors(), nor the underlying device.
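
(Replying to myself here: what I would like the kernel to print is roughly the
following. This is only a sketch with a made-up function name,
check_rq_against_queue(); the exact helpers and their signatures in 3.0.101 may
differ, so do not take it as the actual code.)

#include <linux/blkdev.h>

/* Hypothetical variant of the size check: report the request size, the
 * queue's limit and the device name instead of the bare
 * "over max size limit." message. */
static int check_rq_against_queue(struct request_queue *q, struct request *rq)
{
	unsigned int max = blk_queue_get_max_sectors(q, rq->cmd_flags);

	if (blk_rq_sectors(rq) > max) {
		printk(KERN_ERR "%s: %s: request of %u sectors exceeds limit of %u sectors\n",
		       __func__,
		       rq->rq_disk ? rq->rq_disk->disk_name : "?",
		       blk_rq_sectors(rq), max);
		return -EIO;
	}
	return 0;
}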
> > That makes debugging somewhat difficult if you customize the block queue
> > settings per device as I did:
> >
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-31 (0)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-31 (0)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-31 (noop)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-31 (128)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/rotational for FirstTest-32 (0)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/add_random for FirstTest-32 (0)
> > Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of queue/scheduler for FirstTest-32 (noop)
> > Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of queue/max_sectors_kb for FirstTest-32 (128)
> >
> > I suspect the "queue/max_sectors_kb=128" is the culprit:
> > # multipath -ll FirstTest-32
> > FirstTest-32 (360002ac000000000000000040001b383) dm-7 3PARdata,VV
> > size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
> > `-+- policy='service-time 0' prio=50 status=active
> >   |- 2:0:0:1 sdet 129:80  failed ready running
> >   |- 2:0:2:1 sdev 129:112 failed ready running
> >   |- 1:0:0:1 sdb  8:16    failed ready running
> >   `- 1:0:1:1 sdc  8:32    failed ready running
> > # cat /sys/block/{dm-7,sd{b,c},sde{t,v}}/queue/max_sectors_kb
> > 128
> > 128
> > 128
> > 128
> > 128
> >
> > While writing this message, I noticed that I had created a primary
> > partition on dm-7:
> > # dmsetup ls |grep Fi
> > FirstTest-32_part1      (253:8)
> > FirstTest-32    (253:7)
> > # cat /sys/block/dm-8/queue/max_sectors_kb
> > 1024
> >
> > After "# echo 128 >/sys/block/dm-8/queue/max_sectors_kb" things still did
> > not get better.
> >
> > Can't blk_rq_check_limits() do anything more clever than returning -EIO?
> >
> > Regards,
> > Ulrich
> > P.S.: Keep me in CC, please!
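
For what it's worth, my understanding is that the limits of stacked devices are
only combined when the device-mapper table is loaded (blk_stack_limits() in
block/blk-settings.c takes the minimum of the lower and upper values); lowering
max_sectors_kb on the lower device via sysfs afterwards is not propagated to
the devices stacked on top, which may be part of why the devices in the stack
end up with inconsistent limits (dm-8 still showing 1024 above). A simplified
sketch of that combination step, not the actual kernel code:

#include <linux/blkdev.h>
#include <linux/kernel.h>

/*
 * Simplified sketch (not the kernel's actual blk_stack_limits()): when an
 * upper queue's limits (t) are stacked on a lower queue's limits (b), the
 * upper values are reduced to the lower ones.  This happens only when the
 * table is (re)loaded, so a later "echo ... > .../queue/max_sectors_kb" on
 * the lower device leaves the upper device at its old, larger value.
 */
static void stack_max_sectors(struct queue_limits *t, struct queue_limits *b)
{
	t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
	t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
}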