2021-03-07 14:23:50

Subject: [PATCH AUTOSEL 5.11 01/12] btrfs: avoid checking for RO block group twice during nocow writeback

From: Filipe Manana <[email protected]>

[ Upstream commit 20903032cd9f0260b99aeab92e6540f0350e4a23 ]

During the nocow writeback path, we currently iterate the rbtree of block
groups twice: once for checking if the target block group is RO (with the
call to btrfs_extent_readonly()), and once again for getting a nocow
reference on the block group with a call to btrfs_inc_nocow_writers().

Since btrfs_inc_nocow_writers() already returns false when the target
block group is RO, remove the call to btrfs_extent_readonly(). Not only
do we avoid searching the block group rbtree twice, it also helps reduce
contention on the lock that protects it (especially since it is a spin
lock and not a read-write lock). That may make a noticeable difference
on very large filesystems, with thousands of allocated block groups.

Reviewed-by: Anand Jain <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
fs/btrfs/inode.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ad34c5a09bef..02c4bfa515fb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
*/
btrfs_release_path(path);

- /* If extent is RO, we must COW it */
- if (btrfs_extent_readonly(fs_info, disk_bytenr))
- goto out_check;
ret = btrfs_cross_ref_exist(root, ino,
found_key.offset -
extent_offset, disk_bytenr, false);
@@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
WARN_ON_ONCE(freespace_inode);
goto out_check;
}
+ /* If the extent's block group is RO, we must COW */
if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
goto out_check;
nocow = true;
--
2.30.1
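
Editor's note: the effect of dropping the separate RO check can be
sketched in userspace C. Everything below is a simplified stand-in and
not the kernel's code: the struct layout, the linear search (in place of
the rbtree walk under the spin lock) and the lookups counter are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block group; not the kernel's struct. */
struct block_group {
	uint64_t start;
	uint64_t length;
	bool ro;
	int nocow_writers;
};

struct bg_cache {
	struct block_group *groups;
	size_t nr;
	unsigned long lookups;	/* counts searches, for illustration only */
};

/* Linear search standing in for the locked rbtree walk. */
static struct block_group *bg_lookup(struct bg_cache *c, uint64_t bytenr)
{
	assert(c != NULL);
	c->lookups++;
	for (size_t i = 0; i < c->nr; i++) {
		struct block_group *bg = &c->groups[i];

		if (bytenr >= bg->start && bytenr - bg->start < bg->length)
			return bg;
	}
	return NULL;
}

/* Stand-alone RO check: the kind of call the patch removes. */
static bool extent_readonly(struct bg_cache *c, uint64_t bytenr)
{
	struct block_group *bg = bg_lookup(c, bytenr);

	return !bg || bg->ro;
}

/* Take a nocow reference; fails when the group is missing or RO. */
static bool inc_nocow_writers(struct bg_cache *c, uint64_t bytenr)
{
	struct block_group *bg = bg_lookup(c, bytenr);

	if (!bg || bg->ro)
		return false;
	bg->nocow_writers++;
	return true;
}

/* Before: two searches per extent. */
static bool can_nocow_old(struct bg_cache *c, uint64_t bytenr)
{
	if (extent_readonly(c, bytenr))
		return false;
	return inc_nocow_writers(c, bytenr);
}

/* After: one search, same answer. */
static bool can_nocow_new(struct bg_cache *c, uint64_t bytenr)
{
	return inc_nocow_writers(c, bytenr);
}
```

Both variants return the same result for any extent; the second simply
performs one search instead of two.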

2021-03-07 14:24:14

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 04/12] selftests: kvm: Mmap the entire vcpu mmap area

From: Aaron Lewis <[email protected]>

[ Upstream commit 6528fc0a11de3d16339cf17639e2f69a68fcaf4d ]

The vcpu mmap area may consist of more than just the kvm_run struct.
Allocate enough space for the entire vcpu mmap area. Without this, on
x86, the PIO page, for example, will be missing. This is problematic
when dealing with an unhandled exception from the guest as the exception
vector will be incorrectly reported as 0x0.

Message-Id: <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Co-developed-by: Steve Rutherford <[email protected]>
Signed-off-by: Aaron Lewis <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
tools/testing/selftests/kvm/lib/kvm_util.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index fa5a90e6c6f0..859a0b57c683 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -21,6 +21,8 @@
#define KVM_UTIL_PGS_PER_HUGEPG 512
#define KVM_UTIL_MIN_PFN 2

+static int vcpu_mmap_sz(void);
+
/* Aligns x up to the next multiple of size. Size must be a power of 2. */
static void *align(void *x, size_t size)
{
@@ -509,7 +511,7 @@ static void vm_vcpu_rm(struct kvm_vm *vm, struct vcpu *vcpu)
vcpu->dirty_gfns = NULL;
}

- ret = munmap(vcpu->state, sizeof(*vcpu->state));
+ ret = munmap(vcpu->state, vcpu_mmap_sz());
TEST_ASSERT(ret == 0, "munmap of VCPU fd failed, rc: %i "
"errno: %i", ret, errno);
close(vcpu->fd);
@@ -978,7 +980,7 @@ void vm_vcpu_add(struct kvm_vm *vm, uint32_t vcpuid)
TEST_ASSERT(vcpu_mmap_sz() >= sizeof(*vcpu->state), "vcpu mmap size "
"smaller than expected, vcpu_mmap_sz: %i expected_min: %zi",
vcpu_mmap_sz(), sizeof(*vcpu->state));
- vcpu->state = (struct kvm_run *) mmap(NULL, sizeof(*vcpu->state),
+ vcpu->state = (struct kvm_run *) mmap(NULL, vcpu_mmap_sz(),
PROT_READ | PROT_WRITE, MAP_SHARED, vcpu->fd, 0);
TEST_ASSERT(vcpu->state != MAP_FAILED, "mmap vcpu_state failed, "
"vcpu id: %u errno: %i", vcpuid, errno);
--
2.30.1

2021-03-07 14:25:00

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 06/12] drm/amdgpu: enable BACO runpm by default on sienna cichlid and navy flounder

From: Alex Deucher <[email protected]>

[ Upstream commit 25951362db7b3791488ec45bf56c0043f107b94b ]

It works fine and was only disabled because primary GPUs
don't enter runpm if there is a console bound to the fbdev due
to the kmap. This will at least allow runpm on secondary cards.

Reviewed-by: Evan Quan <[email protected]>
Reviewed-by: Rajneesh Bhardwaj <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index b16b32797624..ccfa2f9d5446 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -173,8 +173,6 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
switch (adev->asic_type) {
case CHIP_VEGA20:
case CHIP_ARCTURUS:
- case CHIP_SIENNA_CICHLID:
- case CHIP_NAVY_FLOUNDER:
/* enable runpm if runpm=1 */
if (amdgpu_runtime_pm > 0)
adev->runpm = true;
--
2.30.1

2021-03-07 14:25:44

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 09/12] nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state

From: Zoltán Böszörményi <[email protected]>

[ Upstream commit dc22c1c058b5c4fe967a20589e36f029ee42a706 ]

My 2TB SKC2000 showed the exact same symptoms that were described
in 538e4a8c57 ("nvme-pci: avoid the deepest sleep state on
Kingston A2000 SSDs"), i.e. a complete NVMe lockup that needed
a cold boot to get it back.

According to some sources, the A2000 is simply a rebadged
SKC2000 with a slightly optimized firmware.

Adding the SKC2000 PCI ID to the quirk list with the same workaround
as the A2000 made my laptop survive a 5-hour-long Yocto bootstrap
buildfest which reliably triggered the SSD lockup previously.

Signed-off-by: Zoltán Böszörményi <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/nvme/host/pci.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7a38d764b486..14c5b52400ef 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3262,6 +3262,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
{ PCI_DEVICE(0x1d97, 0x2263), /* SPCC */
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+ { PCI_DEVICE(0x2646, 0x2262), /* KINGSTON SKC2000 NVMe SSD */
+ .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
{ PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */
.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
--
2.30.1

2021-03-07 14:25:58

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 10/12] nvme-pci: add quirks for Lexar 256GB SSD

From: Pascal Terjan <[email protected]>

[ Upstream commit 6e6a6828c517fb6819479bf5187df5f39084eb9e ]

Add the NVME_QUIRK_NO_NS_DESC_LIST and NVME_QUIRK_IGNORE_DEV_SUBNQN
quirks for this buggy device.

Reported and tested in https://bugs.mageia.org/show_bug.cgi?id=28417

Signed-off-by: Pascal Terjan <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/nvme/host/pci.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 14c5b52400ef..806a5d071ef6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3245,6 +3245,9 @@ static const struct pci_device_id nvme_id_table[] = {
NVME_QUIRK_IGNORE_DEV_SUBNQN, },
{ PCI_DEVICE(0x1987, 0x5016), /* Phison E16 */
.driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, },
+ { PCI_DEVICE(0x1b4b, 0x1092), /* Lexar 256 GB SSD */
+ .driver_data = NVME_QUIRK_NO_NS_DESC_LIST |
+ NVME_QUIRK_IGNORE_DEV_SUBNQN, },
{ PCI_DEVICE(0x1d1d, 0x1f1f), /* LighNVM qemu device */
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */
--
2.30.1
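
Editor's note: the nvme-pci patches in this series all extend the same
kind of table, mapping a PCI vendor/device pair to a bitmask of quirks.
A minimal userspace sketch of that lookup follows. The PCI IDs are the
ones from the patches; the QUIRK_* bit values, struct quirk_entry and
lookup_quirks() are hypothetical and do not reflect the kernel's actual
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values; the kernel's assignments may differ. */
#define QUIRK_NO_DEEPEST_PS     (1UL << 0)
#define QUIRK_NO_NS_DESC_LIST   (1UL << 1)
#define QUIRK_IGNORE_DEV_SUBNQN (1UL << 2)

struct quirk_entry {
	uint16_t vendor;
	uint16_t device;
	unsigned long quirks;	/* OR of QUIRK_* bits */
};

static const struct quirk_entry quirk_table[] = {
	/* Lexar 256 GB SSD: two quirks combined with bitwise OR */
	{ 0x1b4b, 0x1092, QUIRK_NO_NS_DESC_LIST | QUIRK_IGNORE_DEV_SUBNQN },
	/* KINGSTON SKC2000 NVMe SSD */
	{ 0x2646, 0x2262, QUIRK_NO_DEEPEST_PS },
	/* KINGSTON A2000 NVMe SSD */
	{ 0x2646, 0x2263, QUIRK_NO_DEEPEST_PS },
};

/* Return the quirk mask for a device, or 0 if it is not listed. */
static unsigned long lookup_quirks(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(quirk_table) / sizeof(quirk_table[0]);

	for (size_t i = 0; i < n; i++) {
		/* A listed device should carry at least one quirk. */
		assert(quirk_table[i].quirks != 0);
		if (quirk_table[i].vendor == vendor &&
		    quirk_table[i].device == device)
			return quirk_table[i].quirks;
	}
	return 0;
}
```

A caller tests individual bits with a mask, for example
lookup_quirks(0x1b4b, 0x1092) & QUIRK_NO_NS_DESC_LIST.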

2021-03-07 15:25:00

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 03/12] btrfs: subpage: fix the false data csum mismatch error

From: Qu Wenruo <[email protected]>

[ Upstream commit c28ea613fafad910d08f67efe76ae552b1434e44 ]

[BUG]
When running fsstress, we can hit a strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).

With some extra debug info added, we have the following traces:

0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0

[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.

But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.

This means for subpage if we have an inode whose content looks like below:

    0       16K     32K     48K     64K
    |///////|       |///////|       |

    |//| = data needs to be read from disk
    |  | = hole

And i_size is 64K initially.

Then the following race can happen:

              T1                |              T2
--------------------------------+--------------------------------
btrfs_do_readpage()             |
|- isize = 64K;                 |
|  At this time, the isize is   |
|  64K                          |
|                               |
|- submit_extent_page()         |
|  submit previous assembled bio|
|  assemble bio for [0, 16K)    |
|                               |
|- submit_extent_page()         |
   submit read bio for [0, 16K) |
   assemble read bio for        |
   [32K, 48K)                   |
                                |
                                | btrfs_setsize()
                                | |- i_size_write(, 16K);
                                |    Now i_size is only 16K
end_io() for [0K, 16K)          |
|- end_bio_extent_readpage()    |
   |- btrfs_verify_data_csum()  |
   |  No csum error             |
   |- i_size = 16K;             |
   |- zero_user_segment(16K,    |
      PAGE_SIZE);               |
      !!! We zeroed range       |
      !!! [32K, 48K)            |
                                | end_io for [32K, 48K)
                                | |- end_bio_extent_readpage()
                                |    |- btrfs_verify_data_csum()
                                |       ! CSUM MISMATCH !
                                |       ! As the range is zeroed now !

[FIX]
To fix the problem, make end_bio_extent_readpage() zero only the
range of the bvec.

The bug only affects subpage read-write support, since on a fully
read-only mount i_size can't change and the race can't be hit.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
fs/btrfs/extent_io.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c9cee458e001..ff1a0f97ba84 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2972,12 +2972,23 @@ static void end_bio_extent_readpage(struct bio *bio)
if (likely(uptodate)) {
loff_t i_size = i_size_read(inode);
pgoff_t end_index = i_size >> PAGE_SHIFT;
- unsigned off;

- /* Zero out the end if this page straddles i_size */
- off = offset_in_page(i_size);
- if (page->index == end_index && off)
- zero_user_segment(page, off, PAGE_SIZE);
+ /*
+ * Zero out the remaining part if this range straddles
+ * i_size.
+ *
+ * Here we should only zero the range inside the bvec,
+ * not touch anything else.
+ *
+ * NOTE: i_size is exclusive while end is inclusive.
+ */
+ if (page->index == end_index && i_size <= end) {
+ u32 zero_start = max(offset_in_page(i_size),
+ offset_in_page(end));
+
+ zero_user_segment(page, zero_start,
+ offset_in_page(end) + 1);
+ }
}
ASSERT(bio_offset + len > bio_offset);
bio_offset += len;
--
2.30.1
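
Editor's note: the difference between page-wide zeroing and bvec-limited
zeroing can be worked through with the numbers from the race described
in the commit message (64K page, i_size shrunk to 16K). This is a
userspace sketch only. The SKETCH_* constants, struct zero_range and
both helpers are hypothetical names. The lower bound in the second
helper is clamped to the first byte of the bvec, which is one reading of
"only zero the range inside the bvec"; the hunk in the patch spells that
bound differently, so treat this as an illustration of the stated intent
and not as a transcription of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 16	/* 64K pages, as in the commit message */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Half-open range [from, to) of offsets inside one page. */
struct zero_range {
	bool needed;
	uint32_t from;
	uint32_t to;
};

static uint32_t off_in_page(uint64_t pos)
{
	return (uint32_t)(pos & (SKETCH_PAGE_SIZE - 1));
}

/* Old behaviour: zero from i_size to the end of the page. */
static struct zero_range zero_page_wide(uint64_t page_index, uint64_t i_size)
{
	struct zero_range r = { false, 0, 0 };
	uint32_t off = off_in_page(i_size);

	if (page_index == (i_size >> SKETCH_PAGE_SHIFT) && off) {
		r.needed = true;
		r.from = off;
		r.to = (uint32_t)SKETCH_PAGE_SIZE;
	}
	return r;
}

/*
 * Intended behaviour: zero only the part of the bvec that lies at or
 * beyond i_size. start and end are file offsets of the first and last
 * byte of the bvec (end is inclusive) and lie within the same page.
 */
static struct zero_range zero_bvec_only(uint64_t page_index, uint64_t i_size,
					uint64_t start, uint64_t end)
{
	struct zero_range r = { false, 0, 0 };

	assert(start <= end);
	if (page_index == (i_size >> SKETCH_PAGE_SHIFT) && i_size <= end) {
		uint32_t a = off_in_page(i_size);
		uint32_t b = off_in_page(start);

		r.needed = true;
		r.from = a > b ? a : b;
		r.to = off_in_page(end) + 1;
	}
	return r;
}
```

With i_size at 16K, the page-wide helper reports [16K, 64K) when the
first read completes, which covers the still-pending [32K, 48K) read.
The bvec-limited helper reports nothing to zero for the [0, 16K) bvec.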

2021-03-07 15:25:21

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 05/12] pstore/ram: Rate-limit "uncorrectable error in header" message

From: Dmitry Osipenko <[email protected]>

[ Upstream commit 7db688e99c0f770ae73e0f1f3fb67f9b64266445 ]

There is quite a large flood of "uncorrectable error in header" messages
in KMSG on a clean system boot, since there is no pstore buffer saved in
RAM. Let's silence the redundant noisy messages by rate-limiting the
printk message. Now at most 10 messages are printed in a row instead
of 35+.

Signed-off-by: Dmitry Osipenko <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
---
fs/pstore/ram_core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index aa8e0b65ff1a..fff363bfd484 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -246,7 +246,7 @@ static int persistent_ram_init_ecc(struct persistent_ram_zone *prz,
pr_info("error in header, %d\n", numerr);
prz->corrected_bytes += numerr;
} else if (numerr < 0) {
- pr_info("uncorrectable error in header\n");
+ pr_info_ratelimited("uncorrectable error in header\n");
prz->bad_blocks++;
}

--
2.30.1
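
Editor's note: the "at most 10 instead of 35+" behaviour follows from
burst-based rate limiting. Below is a minimal userspace sketch of the
idea, assuming a burst of 10 per window to match the number in the
commit message. struct ratelimit and ratelimit_allow() are hypothetical
names, the time source is passed in by the caller, and this is not the
kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct ratelimit {
	long interval;	/* window length, in caller-defined time units */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages let through in this window */
	int missed;	/* messages suppressed so far */
};

/* Return true if a message may be printed at time now. */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst > 0);

	/* A new window starts once the interval has elapsed. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

Calling ratelimit_allow() 35 times within one window lets the first 10
calls through and suppresses the remaining 25.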

2021-03-07 15:25:21

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 07/12] tracing: Skip selftests if tracing is disabled

From: "Steven Rostedt (VMware)" <[email protected]>

[ Upstream commit ee666a185558ac9a929e53b902a568442ed62416 ]

If tracing is disabled for some reason (traceoff_on_warning, command line,
etc), the ftrace selftests are guaranteed to fail, as their results are
defined by trace data in the ring buffers. If the ring buffers are turned
off, the tests will fail, due to lack of data.

Because tracing being disabled is for a specific reason (warning, user
decided to, etc), it does not make sense to enable tracing to run the self
tests, as the test output may corrupt the reason for the tracing to be
disabled.

Instead, simply skip the self tests and report that they are being skipped
due to tracing being disabled.

Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
kernel/trace/trace.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b5815a022ecc..4b6df07d6dc6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1932,6 +1932,12 @@ static int run_tracer_selftest(struct tracer *type)
if (!selftests_can_run)
return save_selftest(type);

+ if (!tracing_is_on()) {
+ pr_warn("Selftest for tracer %s skipped due to tracing disabled\n",
+ type->name);
+ return 0;
+ }
+
/*
* Run a selftest on this tracer.
* Here we reset the trace buffer, and set the current
--
2.30.1

2021-03-07 15:25:43

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 08/12] nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.

From: Julian Einwag <[email protected]>

[ Upstream commit 5e112d3fb89703a4981ded60561b5647db3693bf ]

The kernel fails to fully detect these SSDs, only the character devices
are present:

[ 10.785605] nvme nvme0: pci function 0000:04:00.0
[ 10.876787] nvme nvme1: pci function 0000:81:00.0
[ 13.198614] nvme nvme0: missing or invalid SUBNQN field.
[ 13.198658] nvme nvme1: missing or invalid SUBNQN field.
[ 13.206896] nvme nvme0: Shutdown timeout set to 20 seconds
[ 13.215035] nvme nvme1: Shutdown timeout set to 20 seconds
[ 13.225407] nvme nvme0: 16/0/0 default/read/poll queues
[ 13.233602] nvme nvme1: 16/0/0 default/read/poll queues
[ 13.239627] nvme nvme0: Identify Descriptors failed (8194)
[ 13.246315] nvme nvme1: Identify Descriptors failed (8194)

Adding the NVME_QUIRK_NO_NS_DESC_LIST fixes this problem.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=205679
Signed-off-by: Julian Einwag <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6bad4d4dcdf0..7a38d764b486 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3230,7 +3230,8 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE(0x126f, 0x2263), /* Silicon Motion unidentified */
.driver_data = NVME_QUIRK_NO_NS_DESC_LIST, },
{ PCI_DEVICE(0x1bb1, 0x0100), /* Seagate Nytro Flash Storage */
- .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
+ .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY |
+ NVME_QUIRK_NO_NS_DESC_LIST, },
{ PCI_DEVICE(0x1c58, 0x0003), /* HGST adapter */
.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
{ PCI_DEVICE(0x1c58, 0x0023), /* WDC SN200 adapter */
--
2.30.1

2021-03-07 15:25:56

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 11/12] nvme-fabrics: fix kato initialization

From: Martin George <[email protected]>

[ Upstream commit 32feb6de47242e54692eceab52cfae8616aa0518 ]

Currently kato is initialized to NVME_DEFAULT_KATO for both
discovery & i/o controllers. This is a problem specifically
for non-persistent discovery controllers since it always ends
up with a non-zero kato value. Fix this by initializing kato
to zero instead, and ensuring various controllers are assigned
appropriate kato values as follows:

non-persistent controllers - kato set to zero
persistent controllers - kato set to NVMF_DEV_DISC_TMO
(or any positive int via nvme-cli)
i/o controllers - kato set to NVME_DEFAULT_KATO
(or any positive int via nvme-cli)

Signed-off-by: Martin George <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/nvme/host/fabrics.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 72ac00173500..684acd6813bc 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -634,7 +634,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->queue_size = NVMF_DEF_QUEUE_SIZE;
opts->nr_io_queues = num_online_cpus();
opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
- opts->kato = NVME_DEFAULT_KATO;
+ opts->kato = 0;
opts->duplicate_connect = false;
opts->fast_io_fail_tmo = NVMF_DEF_FAIL_FAST_TMO;
opts->hdr_digest = false;
@@ -897,6 +897,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->nr_write_queues = 0;
opts->nr_poll_queues = 0;
opts->duplicate_connect = true;
+ } else {
+ if (!opts->kato)
+ opts->kato = NVME_DEFAULT_KATO;
}
if (ctrl_loss_tmo < 0) {
opts->max_reconnects = -1;
--
2.30.1
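
Editor's note: the kernel-side part of the kato assignment described
above can be sketched as a small decision function. The names below
(struct ctrl_opts, resolve_kato(), SKETCH_DEFAULT_KATO) are
hypothetical, and the value 5 is only a stand-in for NVME_DEFAULT_KATO.
The value for persistent discovery controllers is modelled as a
user-supplied option, which is an assumption based on the
"via nvme-cli" remark in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for NVME_DEFAULT_KATO; the value is an assumption. */
#define SKETCH_DEFAULT_KATO 5

struct ctrl_opts {
	bool discovery;		/* connecting to a discovery controller */
	unsigned int kato;	/* keep-alive timeout in seconds, 0 = off */
};

/*
 * Mirror the patched flow: start from zero, apply an explicit value
 * if the user passed one, and fall back to the default only for
 * I/O controllers that still have no kato.
 */
static void resolve_kato(struct ctrl_opts *opts, bool user_set,
			 unsigned int user_kato)
{
	assert(opts != NULL);

	opts->kato = 0;
	if (user_set)
		opts->kato = user_kato;
	if (!opts->discovery && !opts->kato)
		opts->kato = SKETCH_DEFAULT_KATO;
}
```

In this sketch a discovery controller with no explicit option keeps
kato at zero, while an I/O controller with no explicit option ends up
with the default. One side effect visible here is that an explicit zero
on an I/O controller is also replaced by the default.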

2021-03-07 15:26:35

by Sasha Levin

Subject: [PATCH AUTOSEL 5.11 12/12] nvmet: model_number must be immutable once set

From: Max Gurtovoy <[email protected]>

[ Upstream commit d9f273b7585c380d7a10d4b3187ddc2d37f2740b ]

Once a connection to an nvmf target has been established, changing the
model_number should not be allowed. E.g. if someone identifies the ctrl
and gets a model_number of "my_model", and later changes the
model_number via configfs to "my_new_model", this breaks the NVMe
specification for "Get Log Page – Persistent Event Log", which refers to
Model Number as: "This field contains the same value as reported in the
Model Number field of the Identify Controller data structure, bytes
63:24."

Although the specification doesn't state explicitly that this field
can't be changed, we can assume it.

So allow setting this field only once: using configfs or in the first
identify ctrl operation.

Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
drivers/nvme/target/admin-cmd.c | 36 ++++++++++++++++--------
drivers/nvme/target/configfs.c | 50 +++++++++++++++------------------
drivers/nvme/target/core.c | 2 +-
drivers/nvme/target/nvmet.h | 7 +----
4 files changed, 50 insertions(+), 45 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 1827d8d8f3b0..44d6d9f419d0 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -319,27 +319,40 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req)
nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR);
}

-static void nvmet_id_set_model_number(struct nvme_id_ctrl *id,
- struct nvmet_subsys *subsys)
+static u16 nvmet_set_model_number(struct nvmet_subsys *subsys)
{
- const char *model = NVMET_DEFAULT_CTRL_MODEL;
- struct nvmet_subsys_model *subsys_model;
+ u16 status = 0;
+
+ mutex_lock(&subsys->lock);
+ if (!subsys->model_number) {
+ subsys->model_number =
+ kstrdup(NVMET_DEFAULT_CTRL_MODEL, GFP_KERNEL);
+ if (!subsys->model_number)
+ status = NVME_SC_INTERNAL;
+ }
+ mutex_unlock(&subsys->lock);

- rcu_read_lock();
- subsys_model = rcu_dereference(subsys->model);
- if (subsys_model)
- model = subsys_model->number;
- memcpy_and_pad(id->mn, sizeof(id->mn), model, strlen(model), ' ');
- rcu_read_unlock();
+ return status;
}

static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
{
struct nvmet_ctrl *ctrl = req->sq->ctrl;
+ struct nvmet_subsys *subsys = ctrl->subsys;
struct nvme_id_ctrl *id;
u32 cmd_capsule_size;
u16 status = 0;

+ /*
+ * If there is no model number yet, set it now. It will then remain
+ * stable for the life time of the subsystem.
+ */
+ if (!subsys->model_number) {
+ status = nvmet_set_model_number(subsys);
+ if (status)
+ goto out;
+ }
+
id = kzalloc(sizeof(*id), GFP_KERNEL);
if (!id) {
status = NVME_SC_INTERNAL;
@@ -353,7 +366,8 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
memset(id->sn, ' ', sizeof(id->sn));
bin2hex(id->sn, &ctrl->subsys->serial,
min(sizeof(ctrl->subsys->serial), sizeof(id->sn) / 2));
- nvmet_id_set_model_number(id, ctrl->subsys);
+ memcpy_and_pad(id->mn, sizeof(id->mn), subsys->model_number,
+ strlen(subsys->model_number), ' ');
memcpy_and_pad(id->fr, sizeof(id->fr),
UTS_RELEASE, strlen(UTS_RELEASE), ' ');

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index c61ffd767062..4b809fe499c2 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -1120,16 +1120,12 @@ static ssize_t nvmet_subsys_attr_model_show(struct config_item *item,
char *page)
{
struct nvmet_subsys *subsys = to_subsys(item);
- struct nvmet_subsys_model *subsys_model;
- char *model = NVMET_DEFAULT_CTRL_MODEL;
int ret;

- rcu_read_lock();
- subsys_model = rcu_dereference(subsys->model);
- if (subsys_model)
- model = subsys_model->number;
- ret = snprintf(page, PAGE_SIZE, "%s\n", model);
- rcu_read_unlock();
+ mutex_lock(&subsys->lock);
+ ret = snprintf(page, PAGE_SIZE, "%s\n", subsys->model_number ?
+ subsys->model_number : NVMET_DEFAULT_CTRL_MODEL);
+ mutex_unlock(&subsys->lock);

return ret;
}
@@ -1140,14 +1136,17 @@ static bool nvmet_is_ascii(const char c)
return c >= 0x20 && c <= 0x7e;
}

-static ssize_t nvmet_subsys_attr_model_store(struct config_item *item,
- const char *page, size_t count)
+static ssize_t nvmet_subsys_attr_model_store_locked(struct nvmet_subsys *subsys,
+ const char *page, size_t count)
{
- struct nvmet_subsys *subsys = to_subsys(item);
- struct nvmet_subsys_model *new_model;
- char *new_model_number;
int pos = 0, len;

+ if (subsys->model_number) {
+ pr_err("Can't set model number. %s is already assigned\n",
+ subsys->model_number);
+ return -EINVAL;
+ }
+
len = strcspn(page, "\n");
if (!len)
return -EINVAL;
@@ -1157,28 +1156,25 @@ static ssize_t nvmet_subsys_attr_model_store(struct config_item *item,
return -EINVAL;
}

- new_model_number = kmemdup_nul(page, len, GFP_KERNEL);
- if (!new_model_number)
+ subsys->model_number = kmemdup_nul(page, len, GFP_KERNEL);
+ if (!subsys->model_number)
return -ENOMEM;
+ return count;
+}

- new_model = kzalloc(sizeof(*new_model) + len + 1, GFP_KERNEL);
- if (!new_model) {
- kfree(new_model_number);
- return -ENOMEM;
- }
- memcpy(new_model->number, new_model_number, len);
+static ssize_t nvmet_subsys_attr_model_store(struct config_item *item,
+ const char *page, size_t count)
+{
+ struct nvmet_subsys *subsys = to_subsys(item);
+ ssize_t ret;

down_write(&nvmet_config_sem);
mutex_lock(&subsys->lock);
- new_model = rcu_replace_pointer(subsys->model, new_model,
- mutex_is_locked(&subsys->lock));
+ ret = nvmet_subsys_attr_model_store_locked(subsys, page, count);
mutex_unlock(&subsys->lock);
up_write(&nvmet_config_sem);

- kfree_rcu(new_model, rcuhead);
- kfree(new_model_number);
-
- return count;
+ return ret;
}
CONFIGFS_ATTR(nvmet_subsys_, attr_model);

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 8ce4d59cc9e7..c7af907912f2 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1521,7 +1521,7 @@ static void nvmet_subsys_free(struct kref *ref)
nvmet_passthru_subsys_free(subsys);

kfree(subsys->subsysnqn);
- kfree_rcu(subsys->model, rcuhead);
+ kfree(subsys->model_number);
kfree(subsys);
}

diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 592763732065..aac741bf378c 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -208,11 +208,6 @@ struct nvmet_ctrl {
bool pi_support;
};

-struct nvmet_subsys_model {
- struct rcu_head rcuhead;
- char number[];
-};
-
struct nvmet_subsys {
enum nvme_subsys_type type;

@@ -242,7 +237,7 @@ struct nvmet_subsys {
struct config_group namespaces_group;
struct config_group allowed_hosts_group;

- struct nvmet_subsys_model __rcu *model;
+ char *model_number;

#ifdef CONFIG_NVME_TARGET_PASSTHRU
struct nvme_ctrl *passthru_ctrl;
--
2.30.1
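
Editor's note: the set-once rule introduced by this patch can be
sketched in userspace C as follows. The struct and function names are
hypothetical, SKETCH_DEFAULT_MODEL is a stand-in for
NVMET_DEFAULT_CTRL_MODEL, and locking is left out: the caller is assumed
to hold the subsystem lock, as the "_locked" helper in the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for NVMET_DEFAULT_CTRL_MODEL; the value is an assumption. */
#define SKETCH_DEFAULT_MODEL "Linux"

struct subsys {
	char *model_number;	/* NULL until assigned, then immutable */
};

/* Duplicate the first len bytes of src as a NUL-terminated string. */
static char *dup_n(const char *src, size_t len)
{
	char *p = malloc(len + 1);

	if (p) {
		memcpy(p, src, len);
		p[len] = '\0';
	}
	return p;
}

/* Store a model number once. Caller is assumed to hold the lock. */
static int model_store_locked(struct subsys *s, const char *page)
{
	size_t len;

	assert(s != NULL && page != NULL);
	if (s->model_number)
		return -EINVAL;		/* already assigned */

	len = strcspn(page, "\n");
	if (!len)
		return -EINVAL;
	for (size_t i = 0; i < len; i++)
		if (page[i] < 0x20 || page[i] > 0x7e)
			return -EINVAL;	/* printable ASCII only */

	s->model_number = dup_n(page, len);
	return s->model_number ? 0 : -ENOMEM;
}

/* First identify pins the default if nothing was configured. */
static const char *model_for_identify(struct subsys *s)
{
	if (!s->model_number)
		s->model_number = dup_n(SKETCH_DEFAULT_MODEL,
					strlen(SKETCH_DEFAULT_MODEL));
	return s->model_number;	/* NULL only on allocation failure */
}
```

Once either path has assigned a value, every later store attempt returns
-EINVAL, so the value a host saw in its first identify stays stable.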

2021-03-07 22:26:04

by Zoltán Böszörményi

Subject: Re: [PATCH AUTOSEL 5.11 09/12] nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state

Hi,

On 2021-03-07 at 14:57, Sasha Levin wrote:
> From: Zoltán Böszörményi <[email protected]>
>
> [ Upstream commit dc22c1c058b5c4fe967a20589e36f029ee42a706 ]
>
> My 2TB SKC2000 showed the exact same symptoms that were described
> in 538e4a8c57 ("nvme-pci: avoid the deepest sleep state on
> Kingston A2000 SSDs"), i.e. a complete NVMe lockup that needed
> a cold boot to get it back.
>
> According to some sources, the A2000 is simply a rebadged
> SKC2000 with a slightly optimized firmware.
>
> Adding the SKC2000 PCI ID to the quirk list with the same workaround
> as the A2000 made my laptop survive a 5-hour-long Yocto bootstrap
> buildfest which reliably triggered the SSD lockup previously.
>
> Signed-off-by: Zoltán Böszörményi <[email protected]>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Sasha Levin <[email protected]>

Thanks for picking it up for stable.

May I suggest including it in 5.10.x as well?
This patch was originally tested on 5.10.17.

Best regards,
Zoltán Böszörményi

> ---
> drivers/nvme/host/pci.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7a38d764b486..14c5b52400ef 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3262,6 +3262,8 @@ static const struct pci_device_id nvme_id_table[] = {
> .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
> { PCI_DEVICE(0x1d97, 0x2263), /* SPCC */
> .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
> + { PCI_DEVICE(0x2646, 0x2262), /* KINGSTON SKC2000 NVMe SSD */
> + .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
> { PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */
> .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
> { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),

2021-03-08 15:47:17

by David Sterba

Subject: Re: [PATCH AUTOSEL 5.11 03/12] btrfs: subpage: fix the false data csum mismatch error

On Sun, Mar 07, 2021 at 08:57:37AM -0500, Sasha Levin wrote:
> From: Qu Wenruo <[email protected]>
>
> [ Upstream commit c28ea613fafad910d08f67efe76ae552b1434e44 ]
>
> [BUG]
> When running fsstress, we can hit a strange data csum mismatch where the
> on-disk data is in fact correct (passes scrub).
>
> With some extra debug info added, we have the following traces:
>
> 0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
> 0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
> 0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
> 0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
> 0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
> 0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
> 0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
> 0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
> 1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
> 1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
> 1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
> 7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
>
> [CAUSE]
> Normally we expect all submitted bio reads to only touch the range we
> specified, and under subpage context, it means we should only touch the
> range specified in each bvec.
>
> But in data read path, inside end_bio_extent_readpage(), we have page
> zeroing which only takes regular page size into consideration.
>
> This means for subpage if we have an inode whose content looks like below:
>
>     0       16K     32K     48K     64K
>     |///////|       |///////|       |
>
>     |//| = data needs to be read from disk
>     |  | = hole
>
> And i_size is 64K initially.
>
> Then the following race can happen:
>
>               T1                |              T2
> --------------------------------+--------------------------------
> btrfs_do_readpage()             |
> |- isize = 64K;                 |
> |  At this time, the isize is   |
> |  64K                          |
> |                               |
> |- submit_extent_page()         |
> |  submit previous assembled bio|
> |  assemble bio for [0, 16K)    |
> |                               |
> |- submit_extent_page()         |
>    submit read bio for [0, 16K) |
>    assemble read bio for        |
>    [32K, 48K)                   |
>                                 |
>                                 | btrfs_setsize()
>                                 | |- i_size_write(, 16K);
>                                 |    Now i_size is only 16K
> end_io() for [0K, 16K)          |
> |- end_bio_extent_readpage()    |
>    |- btrfs_verify_data_csum()  |
>    |  No csum error             |
>    |- i_size = 16K;             |
>    |- zero_user_segment(16K,    |
>       PAGE_SIZE);               |
>       !!! We zeroed range       |
>       !!! [32K, 48K)            |
>                                 | end_io for [32K, 48K)
>                                 | |- end_bio_extent_readpage()
>                                 |    |- btrfs_verify_data_csum()
>                                 |       ! CSUM MISMATCH !
>                                 |       ! As the range is zeroed now !
>
> [FIX]
> To fix the problem, make end_bio_extent_readpage() zero only the
> range of the bvec.
>
> The bug only affects subpage read-write support, since on a fully
> read-only mount i_size can't change and the race can't be hit.

Please drop this patch from autosel because of the above: this is in a
feature that's still in progress and does not affect regular filesystems.