2018-07-31 13:50:26

by Johannes Thumshirn

[permalink] [raw]
Subject: [PATCH 0/3] scsi: fcoe: memleak fixes

Ard reported a memory leak in FCoE at [1] and this patch set contains
the fix for this leak, a second leak I discovered while hunting the
first leak and a use-after-free in FCoE's debugging.

[1] https://marc.info/?l=linux-scsi&m=153261165228133&w=2

Johannes Thumshirn (3):
fcoe: fix use-after-free in fcoe_ctlr_els_send
scsi: fcoe: drop frames in ELS LOGO error path
scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO

drivers/scsi/fcoe/fcoe_ctlr.c | 6 +++---
drivers/scsi/libfc/fc_rport.c | 1 +
2 files changed, 4 insertions(+), 3 deletions(-)

--
2.16.4



2018-07-31 13:49:32

by Johannes Thumshirn

[permalink] [raw]
Subject: [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path

Drop the frames in the ELS LOGO error path instead of just returning
an error.

This fixes the following kmemleak report:
unreferenced object 0xffff880064cb1000 (size 424):
comm "kworker/0:2", pid 24, jiffies 4294904293 (age 68.504s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<(____ptrval____)>] _fc_frame_alloc+0x2c/0x180 [libfc]
[<(____ptrval____)>] fc_lport_enter_logo+0x106/0x360 [libfc]
[<(____ptrval____)>] fc_fabric_logoff+0x8c/0xc0 [libfc]
[<(____ptrval____)>] fcoe_if_destroy+0x79/0x3b0 [fcoe]
[<(____ptrval____)>] fcoe_destroy_work+0xd2/0x170 [fcoe]
[<(____ptrval____)>] process_one_work+0x7ff/0x1420
[<(____ptrval____)>] worker_thread+0x87/0xef0
[<(____ptrval____)>] kthread+0x2db/0x390
[<(____ptrval____)>] ret_from_fork+0x35/0x40
[<(____ptrval____)>] 0xffffffffffffffff

which can be triggered by issuing
echo eth0 > /sys/bus/fcoe/ctlr_destroy

Signed-off-by: Johannes Thumshirn <[email protected]>
---
drivers/scsi/fcoe/fcoe_ctlr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index ceb35ebbeb8f..ffec695e0bfb 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -754,9 +754,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
case ELS_LOGO:
if (fip->mode == FIP_MODE_VN2VN) {
if (fip->state != FIP_ST_VNMP_UP)
- return -EINVAL;
+ goto drop;
if (ntoh24(fh->fh_d_id) == FC_FID_FLOGI)
- return -EINVAL;
+ goto drop;
} else {
if (fip->state != FIP_ST_ENABLED)
return 0;
--
2.16.4


2018-07-31 13:49:36

by Johannes Thumshirn

[permalink] [raw]
Subject: [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO

When receiving a LOGO request we forget to clear the FC_RP_STARTED
flag before starting the rport delete routine.

As the started flag was not cleared, we're not deleting the rport but
waiting for a restart and thus are keeping the reference count of the
rdata object at 1.

This leads to the following kmemleak report:
unreferenced object 0xffff88006542aa00 (size 512):
comm "kworker/0:2", pid 24, jiffies 4294899222 (age 226.880s)
hex dump (first 32 bytes):
68 96 fe 65 00 88 ff ff 00 00 00 00 00 00 00 00 h..e............
01 00 00 00 08 00 00 00 02 c5 45 24 ac b8 00 10 ..........E$....
backtrace:
[<(____ptrval____)>] fcoe_ctlr_vn_add.isra.5+0x7f/0x770 [libfcoe]
[<(____ptrval____)>] fcoe_ctlr_vn_recv+0x12af/0x27f0 [libfcoe]
[<(____ptrval____)>] fcoe_ctlr_recv_work+0xd01/0x32f0 [libfcoe]
[<(____ptrval____)>] process_one_work+0x7ff/0x1420
[<(____ptrval____)>] worker_thread+0x87/0xef0
[<(____ptrval____)>] kthread+0x2db/0x390
[<(____ptrval____)>] ret_from_fork+0x35/0x40
[<(____ptrval____)>] 0xffffffffffffffff

Signed-off-by: Johannes Thumshirn <[email protected]>
Reported-by: ard <[email protected]>
---
drivers/scsi/libfc/fc_rport.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c
index 31d31aad3de1..89b1f1af2fd4 100644
--- a/drivers/scsi/libfc/fc_rport.c
+++ b/drivers/scsi/libfc/fc_rport.c
@@ -2164,6 +2164,7 @@ static void fc_rport_recv_logo_req(struct fc_lport *lport, struct fc_frame *fp)
FC_RPORT_DBG(rdata, "Received LOGO request while in state %s\n",
fc_rport_state(rdata));

+ rdata->flags &= ~FC_RP_STARTED;
fc_rport_enter_delete(rdata, RPORT_EV_STOP);
mutex_unlock(&rdata->rp_mutex);
kref_put(&rdata->kref, fc_rport_destroy);
--
2.16.4


2018-07-31 13:49:51

by Johannes Thumshirn

[permalink] [raw]
Subject: [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send

KASAN reports a use-after-free in fcoe_ctlr_els_send() when we're
sending a LOGO and have FIP debugging enabled. This is because we're
first freeing the skb and then printing the frame's DID. But the DID
is a member of the FC frame header which in turn is the skb's payload.

Exchange the debug print and kfree_skb() calls so we're not touching
the freed data.

Signed-off-by: Johannes Thumshirn <[email protected]>
---
drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index ea23c8dffc25..ceb35ebbeb8f 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -799,9 +799,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
fip->send(fip, skb);
return -EINPROGRESS;
drop:
- kfree_skb(skb);
LIBFCOE_FIP_DBG(fip, "drop els_send op %u d_id %x\n",
op, ntoh24(fh->fh_d_id));
+ kfree_skb(skb);
return -EINVAL;
}
EXPORT_SYMBOL(fcoe_ctlr_els_send);
--
2.16.4


2018-08-01 06:31:36

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 1/3] fcoe: fix use-after-free in fcoe_ctlr_els_send

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> KASAN reports a use-after-free in fcoe_ctlr_els_send() when we're
> sending a LOGO and have FIP debugging enabled. This is because we're
> first freeing the skb and then printing the frame's DID. But the DID
> is a member of the FC frame header which in turn is the skb's payload.
>
> Exchange the debug print and kfree_skb() calls so we're not touching
> the freed data.
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> drivers/scsi/fcoe/fcoe_ctlr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
> index ea23c8dffc25..ceb35ebbeb8f 100644
> --- a/drivers/scsi/fcoe/fcoe_ctlr.c
> +++ b/drivers/scsi/fcoe/fcoe_ctlr.c
> @@ -799,9 +799,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
> fip->send(fip, skb);
> return -EINPROGRESS;
> drop:
> - kfree_skb(skb);
> LIBFCOE_FIP_DBG(fip, "drop els_send op %u d_id %x\n",
> op, ntoh24(fh->fh_d_id));
> + kfree_skb(skb);
> return -EINVAL;
> }
> EXPORT_SYMBOL(fcoe_ctlr_els_send);
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-01 06:31:45

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 2/3] scsi: fcoe: drop frames in ELS LOGO error path

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> Drop the frames in the ELS LOGO error path instead of just returning
> an error.
>
> This fixes the following kmemleak report:
> unreferenced object 0xffff880064cb1000 (size 424):
> comm "kworker/0:2", pid 24, jiffies 4294904293 (age 68.504s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> backtrace:
> [<(____ptrval____)>] _fc_frame_alloc+0x2c/0x180 [libfc]
> [<(____ptrval____)>] fc_lport_enter_logo+0x106/0x360 [libfc]
> [<(____ptrval____)>] fc_fabric_logoff+0x8c/0xc0 [libfc]
> [<(____ptrval____)>] fcoe_if_destroy+0x79/0x3b0 [fcoe]
> [<(____ptrval____)>] fcoe_destroy_work+0xd2/0x170 [fcoe]
> [<(____ptrval____)>] process_one_work+0x7ff/0x1420
> [<(____ptrval____)>] worker_thread+0x87/0xef0
> [<(____ptrval____)>] kthread+0x2db/0x390
> [<(____ptrval____)>] ret_from_fork+0x35/0x40
> [<(____ptrval____)>] 0xffffffffffffffff
>
> which can be triggered by issuing
> echo eth0 > /sys/bus/fcoe/ctlr_destroy
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> drivers/scsi/fcoe/fcoe_ctlr.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
> index ceb35ebbeb8f..ffec695e0bfb 100644
> --- a/drivers/scsi/fcoe/fcoe_ctlr.c
> +++ b/drivers/scsi/fcoe/fcoe_ctlr.c
> @@ -754,9 +754,9 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
> case ELS_LOGO:
> if (fip->mode == FIP_MODE_VN2VN) {
> if (fip->state != FIP_ST_VNMP_UP)
> - return -EINVAL;
> + goto drop;
> if (ntoh24(fh->fh_d_id) == FC_FID_FLOGI)
> - return -EINVAL;
> + goto drop;
> } else {
> if (fip->state != FIP_ST_ENABLED)
> return 0;
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-01 06:33:21

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 3/3] scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO

On 07/31/2018 03:46 PM, Johannes Thumshirn wrote:
> When receiving a LOGO request we forget to clear the FC_RP_STARTED
> flag before starting the rport delete routine.
>
> As the started flag was not cleared, we're not deleting the rport but
> waiting for a restart and thus are keeping the reference count of the
> rdata object at 1.
>
> This leads to the following kmemleak report:
> unreferenced object 0xffff88006542aa00 (size 512):
> comm "kworker/0:2", pid 24, jiffies 4294899222 (age 226.880s)
> hex dump (first 32 bytes):
> 68 96 fe 65 00 88 ff ff 00 00 00 00 00 00 00 00 h..e............
> 01 00 00 00 08 00 00 00 02 c5 45 24 ac b8 00 10 ..........E$....
> backtrace:
> [<(____ptrval____)>] fcoe_ctlr_vn_add.isra.5+0x7f/0x770 [libfcoe]
> [<(____ptrval____)>] fcoe_ctlr_vn_recv+0x12af/0x27f0 [libfcoe]
> [<(____ptrval____)>] fcoe_ctlr_recv_work+0xd01/0x32f0 [libfcoe]
> [<(____ptrval____)>] process_one_work+0x7ff/0x1420
> [<(____ptrval____)>] worker_thread+0x87/0xef0
> [<(____ptrval____)>] kthread+0x2db/0x390
> [<(____ptrval____)>] ret_from_fork+0x35/0x40
> [<(____ptrval____)>] 0xffffffffffffffff
>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> Reported-by: ard <[email protected]>
> ---
> drivers/scsi/libfc/fc_rport.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/scsi/libfc/fc_rport.c b/drivers/scsi/libfc/fc_rport.c
> index 31d31aad3de1..89b1f1af2fd4 100644
> --- a/drivers/scsi/libfc/fc_rport.c
> +++ b/drivers/scsi/libfc/fc_rport.c
> @@ -2164,6 +2164,7 @@ static void fc_rport_recv_logo_req(struct fc_lport *lport, struct fc_frame *fp)
> FC_RPORT_DBG(rdata, "Received LOGO request while in state %s\n",
> fc_rport_state(rdata));
>
> + rdata->flags &= ~FC_RP_STARTED;
> fc_rport_enter_delete(rdata, RPORT_EV_STOP);
> mutex_unlock(&rdata->rp_mutex);
> kref_put(&rdata->kref, fc_rport_destroy);
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-06 09:26:07

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Ard,

Did you have time to test these patches?

Byte,
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-08-06 13:28:28

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> Ahh, yeah, sorry.
> I did test them.
> I can't say I see a regression.
> I also can't say I see progress, although it feels like there is
> a tad less leaking.
> For that I need to do some statistics.
>
> But for now I ack that it does not regress further.

From what I've seen in your kmemleak reports, it wasn't only
libfc/fcoe that was leaking memory.

So kmemleak reports are welcome (especially if they confirm libfc/fcoe
is clean now ;-)).

@Martin can you stage these patches for 4.19?

--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-08-06 13:57:04

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Hi,

Johannes,

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
> Ard,
>
> Did you have time to test these patches?

Ahh, yeah, sorry.
I did test them.
I can't say I see a regression.
I also can't say I see progress, although it feels like there is
a tad less leaking.
For that I need to do some statistics.

But for now I ack that it does not regress further.

--
.signature not found

2018-08-06 14:25:29

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Hi Johannes,

On Mon, Aug 06, 2018 at 03:27:25PM +0200, Johannes Thumshirn wrote:
> On Mon, Aug 06, 2018 at 03:22:34PM +0200, ard wrote:
> > Ahh, yeah, sorry.
> > I did test them.
> > I can't say I see a regression.
> > I also can't say I see progress, although it feels like there is
> > a tad less leaking.
> > For that I need to do some statistics.
> >
> > But for now I ack that it does not regress further.
>
> From what I've seen in your kmemleak reports, it wasn't only
> libfc/fcoe that was leaking memory.
>
> So kmemleak reports are welcome (especially if they confirm libfc/fcoe
> is clean now ;-)).
I've updated the ticket:
https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
dmesg (with full debugging):
https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
and kmemleak:
https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt

I'll have to take a good look at the what and why.
I'm gonna git clean -x -d and rebuild the kernel, because this is
weird.

This is the tree I build:
https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
So vanilla linux-4.14 with the patch set:
https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
Meh, I should try to use git for patching.

I will spam some printk's later on, just to check which flows are
followed (if the git clean doesn't reveal some problem, although
I can't imagine).

Regards,
Ard

--
.signature not found

2018-08-07 06:55:25

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

On Mon, Aug 06, 2018 at 04:24:14PM +0200, ard wrote:
> I've updated the ticket:
> https://github.com/hardkernel/linux/issues/360#issuecomment-410721997
> dmesg (with full debugging):
> https://github.com/hardkernel/linux/files/2262858/dmesg-2018-08-06.txt
> and kmemleak:
> https://github.com/hardkernel/linux/files/2262861/kmemleak-2018-08-06.txt
>
> I'll have to take a good look at the what and why.
> I'm gonna git clean -x -d and rebuild the kernel, because this is
> weird.
>
> This is the tree I build:
> https://github.com/ardje/linux/tree/gb-4.14-fcoe-patch
> So vanilla linux-4.14 with the patch set:
> https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c
> Meh, I should try to use git for patching.
>
> I will spam some printk's later on, just to check which flows are
> followed (if the git clean doesn't reveal some problem, although
> I can't imagine).

OK, now this is wired. Are you seeing this on the initiator or on the
target side? Also on x86_64 or just the odroid? I could reproduce your
reports in my Virtualized Environment [1][2] by issuing deletes from the
initiator side..

[1] https://github.com/rapido-linux/rapido
[2] https://github.com/rapido-linux/rapido/blob/master/fcoe_local_autorun.sh

Byte,
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-08-07 09:31:41

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Hi,

On Tue, Aug 07, 2018 at 08:54:00AM +0200, Johannes Thumshirn wrote:
> OK, now this is wired. Are you seeing this on the initiator or on the
> target side? Also on x86_64 or just the odroid? I could reproduce your
> reports in my Virtualized Environment [1][2] by issuing deletes from the
> initiator side..
Yes it is weird, and it is even more weird when I looked at the
collectd statistics:
The memory leak was almost none existent on my test odroid with
the PC turned off. When I turn it back on, it rises to 150MB/day
So it seems you need at least some party.
The most important thing to realise: this is pure vn2vn chatter.
There is no traffic going from or to the test odroid (to the test
pc there is some).
If I disable the FCoE vlan on the switch port, the chatter *and*
the memory leaks vanishes.
Meeh, this reports needs a better place than just e-mail, I got a
few nice graphs to show.

But here is an overview of my FCoE vlan:
(Sorted by hand)
(GS724Tv4) #show mac-addr-table vlan 11

Address Entries Currently in Use............... 89

MAC Address Interface Status
----------------- --------- ------------
00:1E:06:30:05:50 g4 odroid4 Xu4/exynos 5422/4.4.0-rc6 stable (330 days up)
0E:FD:00:00:05:50 g4 Learned
00:1E:06:30:04:E0 g6 odroid6 Xu4/exynos 5422/4.9.28 stable (330 days up)
0E:FD:00:00:04:E0 g6 Learned
00:1E:06:30:05:52 g7 odroid7 Xu4/exynos 5422/4.14.55 leaking (150MB leak/day)
0E:FD:00:00:05:52 g7 Learned
00:0E:0C:B0:68:37 g14 storage SS4000E/Xscale 80219/3.7.1 stable (295 days up)
0E:FD:00:00:68:37 g14 Learned
00:14:FD:16:DD:50 g15 thecus1 n4200eco/D525/4.3.0 stable (295 days up)
0E:FD:00:00:DD:50 g15 Learned
00:24:1D:7F:40:88 g17 antec PC/i7-920/4.14.59 leaking
0E:FD:00:00:40:88 g17 Learned


The system on G14 and G15 are both long time targets.
G4,6 and 7 (my production server is on 5 with FCoE and kmemleak, but with the
FCoE vlan removed) are odroids doing nothing more with FCoE but being there.
(Waiting for experiments for bcache on eMMC, I used to be able to
crash the FCoE *target* using btrfs on bcache on eMMC and FCoE.
(Target was running 4.0.0 back then).
Generic config (PC and odroid):
root@odroid6:~# cat /etc/network/interfaces.d/20-fcoe
auto fcoe
iface fcoe inet manual
pre-up modprobe fcoe || true
pre-up ip link add link eth0 name fcoe type vlan id 11
pre-up sysctl -w net.ipv6.conf.fcoe.disable_ipv6=1
up ip link set up dev fcoe
up sh -c 'echo fcoe > /sys/module/libfcoe/parameters/create_vn2vn'
#up /root/mountfcoe
#pre-down /root/stop-bcaches
pre-down sh -c 'echo fcoe > /sys/module/libfcoe/parameters/destroy'
down ip link set down dev fcoe
down ip link del fcoe

The targets are configured with some version of targetcli (so a
big echo shell script).

This is on the 4.14 systems:
root@antec:~# grep . /sys/class/fc_*/*/port_*
/sys/class/fc_host/host10/port_id:0x004088
/sys/class/fc_host/host10/port_name:0x200000241d7f4088
/sys/class/fc_host/host10/port_state:Online
/sys/class/fc_host/host10/port_type:NPort (fabric via point-to-point)
/sys/class/fc_remote_ports/rport-10:0-0/port_id:0x00dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_name:0x20000014fd16dd50
/sys/class/fc_remote_ports/rport-10:0-0/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-1/port_id:0x006837
/sys/class/fc_remote_ports/rport-10:0-1/port_name:0x2000000e0cb06837
/sys/class/fc_remote_ports/rport-10:0-1/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-2/port_id:0x000550
/sys/class/fc_remote_ports/rport-10:0-2/port_name:0x2000001e06300550
/sys/class/fc_remote_ports/rport-10:0-2/port_state:Online
/sys/class/fc_remote_ports/rport-10:0-3/port_id:0x0004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_name:0x2000001e063004e0
/sys/class/fc_remote_ports/rport-10:0-3/port_state:Online
/sys/class/fc_transport/target10:0:0/port_id:0x00dd50
/sys/class/fc_transport/target10:0:0/port_name:0x20000014fd16dd50

None of the other systems have an fc_transport, as they do not
have targets assigned to them (currently).
Notice that antec (PC) does not see odroid7.
The same is true vice versa.
All other systems see both antec and odroid7.

So they all can see eachother except for the 4.14 systems that
can't see eachother.

Now when I noticed that it only happened when my PC starts, I
wondered why it also happened when my PC is turned off, as I turn
it on once every few months and sometimes in the winter, it's
power usage is the same as the remaining systems combined.

And my next thing is: why did my production server seemed to die
less fast since a few kernel upgrades (in the 4.14 line).
I got it figured out now:
Before the heatwave, I had odroid5 turned on, my steam machine
(also with FCoE as an active initiator and 4.14 kernel) and the
PC turned off. So that still makes 6 FCoE ports on the network.
When the summer came I needed to turn off the steam machine as
much as possible. This resulted in my main production server only
needing a reboot once ever week instead of every 2 days. I
attributed that to kernel fixes (as I knew there was a memory
leak, just didn't know where yet).

Thinking about that some more: do I need 4.14 systems to trigger
a bug within eachother, or is it pure the number of fc hosts
that should be bigger than 5 to trigger a bug in 4.14?

So a conclusion of my rambling:
1) you either need 6 vn2vn hosts *or* you need more than one 4.14
kernel in a network to trigger. One of the two. I need to think
about this. The fact that the 4.14 systems can't see eachother is
an indicator. I can turn off the FCoE on some other system to
see if the memleak stops.
2) kernels up to 4.9.28 do not have a memoryleak. 4.14.28+ do
have the memory leak.
3) I need a place for graphs, I will see if I can abuse the
github ticket some more 8-D.
4) Just having FCoE enabled on an interface and
*receiving*/interacting with FCoE vn2vn chatter triggers the bug.
So that's only setting up the rports, maintaining ownership of
your port id.
5) The memleak itself is architecture independent and NIC
independent.


--
.signature not found

2018-08-07 09:58:26

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> 3) I need a place for graphs, I will see if I can abuse the
> github ticket some more 8-D.
https://github.com/hardkernel/linux/issues/360#issuecomment-411001362

Regards,
Ard van Breemen
--
.signature not found

2018-08-07 16:47:30

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Hi,

On Tue, Aug 07, 2018 at 11:26:19AM +0200, ard wrote:
> So a conclusion of my rambling:
> 1) you either need 6 vn2vn hosts *or* you need more than one 4.14
> kernel in a network to trigger. One of the two. I need to think
> about this. The fact that the 4.14 systems can't see eachother is
> an indicator. I can turn off the FCoE on some other system to
> see if the memleak stops.

I've gone all out:

Ok, deleted everything, long story short:
(for now(!))

PC+steam machine with 4.14 (patched) and 4.16 (upstream,
nodebug): no kmemleaks
Every device sees every device.

Now if I add two odroids running 4.14 (not patched):
The odroids will see *every* device running < 4.14 and vv
The odroids will not see any device running >= 4.14 and vv
The PC starts to kmemleak a lot.

I've removed the two 4.14 odroids again from the FCoE lan, and
upgraded them with a patched 4.14 .
I rebooted the PC with steam (4.16 upstream unpatched) and all
other systems in the lan I mentioned.
To be clear: the steam machine has a dedicated nic for FCoE. All
other systems use a vlan.
I will let that talk with eachother for now.

I fear there is a platform specific thing in the vn2vn chatter
that has changed > 4.9 .
4.14 can communicate with lower revisions, but they can't see
eachother, and whatever they chat about, it induces memory leaks
on the PC.
Oh, another difference is that steam and the PC has targets
assigned. I might try that for the 4.14 too.

Anyway, I will do some cooking and let the current systems chat
with eachother and see if we get a kmemleak making my statement
bogus.

Regards,
Ard

--
.signature not found

2018-08-09 08:08:21

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

On Mon, Aug 06, 2018 at 11:25:03AM +0200, Johannes Thumshirn wrote:
Martin,

though it looks like these patches aren't fixing all of the errors Ard
has, they fix all I could reproduce up to now and Ard already stated
they're not regressing.

So can we please stage them for 4.19?

Thanks,
Johannes

--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-08-09 09:53:26

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes


Johannes,

> Martin,
>
> though it looks like these patches aren't fixing all of the errors Ard
> has, they fix all I could reproduce up to now and Ard already stated
> they're not regressing.
>
> So can we please stage them for 4.19?

I merged them last week.

--
Martin K. Petersen Oracle Linux Engineering

2018-08-09 09:57:51

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

On Thu, Aug 09, 2018 at 05:52:13AM -0400, Martin K. Petersen wrote:
>
> Johannes,
>
> > Martin,
> >
> > though it looks like these patches aren't fixing all of the errors Ard
> > has, they fix all I could reproduce up to now and Ard already stated
> > they're not regressing.
> >
> > So can we please stage them for 4.19?
>
> I merged them last week.

Oh I'm sorry, missed that.

--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-08-09 10:03:17

by ard

[permalink] [raw]
Subject: Re: [PATCH 0/3] scsi: fcoe: memleak fixes

Hi Guys,

On Tue, Aug 07, 2018 at 06:04:52PM +0200, ard wrote:
> PC+steam machine with 4.14 (patched) and 4.16 (upstream,
> nodebug): no kmemleaks
> Every device sees every device.

New day, new conflicting results.
Yay \0/.

As I did not trust the results, I redid the tests, and the same
tests gave some different results.
Before giving the results I've changed my stance on the bug:
The bug is not a regression in memory leak. As far as I can tell
now, the memory leaks were already there.
It's a regression in vn2vn enodes being able to PLOGI.
Since I've seen the steam machine and the PC setup an rport,
there must be some racy thing going on how the accept or reject
the PLOGI.
Now once it rejects, it will never succeed to accept, and the
relogin happens ad infinitum.
In this mode there are about 47 kmemleaks per 10 minutes.
I also notice that the kmemleaks takes a while to be detected or
to die out. So there are state timers involved that hold on to
the memory and after time out do not free it.
And another thing I noticed: When the pc and the steam machine
had a working rport, after a while the steam machine (4.16
unpatched) fc_timedout the rports to all nodes (so all nodes with
kernel < 4.14 too), and all with different timeouts, except the
one it has an fc_transport with.
So it's sole remaining rport was the "designated" target.
Currently I am compiling 4.9 with kmemleak to determine if that
exhibits the same leaks when disconnecting and reconnecting the
FCoE vlan.
This to determine if we have a single regression in just the
login handling or both.
I will add the dmesg's of a working rport, and a failing rport
later.

Regards,
Ard

--
.signature not found