This is the start of the stable review cycle for the 4.4.58 release.
There are 76 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu Mar 30 12:25:40 UTC 2017.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.58-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <[email protected]>
Linux 4.4.58-rc1
Jiri Slaby <[email protected]>
crypto: algif_hash - avoid zero-sized array
Takashi Iwai <[email protected]>
fbcon: Fix vc attr at deinit
Sumit Semwal <[email protected]>
serial: 8250_pci: Detach low-level driver during PCI error recovery
Sumit Semwal <[email protected]>
ACPI / blacklist: Make Dell Latitude 3350 ethernet work
Sumit Semwal <[email protected]>
ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520
Sumit Semwal <[email protected]>
uvcvideo: uvc_scan_fallback() for webcams with broken chain
Sumit Semwal <[email protected]>
s390/zcrypt: Introduce CEX6 toleration
Sumit Semwal <[email protected]>
block: allow WRITE_SAME commands with the SG_IO ioctl
Sumit Semwal <[email protected]>
vfio/spapr: Postpone allocation of userspace version of TCE table
Sumit Semwal <[email protected]>
PCI: Do any VF BAR updates before enabling the BARs
Sumit Semwal <[email protected]>
PCI: Ignore BAR updates on virtual functions
Sumit Semwal <[email protected]>
PCI: Update BARs using property bits appropriate for type
Sumit Semwal <[email protected]>
PCI: Don't update VF BARs while VF memory space is enabled
Sumit Semwal <[email protected]>
PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE
Sumit Semwal <[email protected]>
PCI: Add comments about ROM BAR updating
Sumit Semwal <[email protected]>
PCI: Remove pci_resource_bar() and pci_iov_resource_bar()
Sumit Semwal <[email protected]>
PCI: Separate VF BAR updates from standard BAR updates
Sumit Semwal <[email protected]>
x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic
Sumit Semwal <[email protected]>
igb: add i211 to i210 PHY workaround
Sumit Semwal <[email protected]>
igb: Workaround for igb i210 firmware issue
Sumit Semwal <[email protected]>
xen: do not re-use pirq number cached in pci device msi msg data
Darrick J. Wong <[email protected]>
xfs: clear _XBF_PAGES from buffers when readahead page
Johan Hovold <[email protected]>
USB: usbtmc: add missing endpoint sanity check
Johannes Berg <[email protected]>
nl80211: fix dumpit error path RTNL deadlocks
Eric Sandeen <[email protected]>
xfs: fix up xfs_swap_extent_forks inline extent handling
Darrick J. Wong <[email protected]>
xfs: don't allow di_size with high bit set
Ilya Dryomov <[email protected]>
libceph: don't set weight to IN when OSD is destroyed
Tomasz Majchrzak <[email protected]>
raid10: increment write counter after bio is split
Ilya Dryomov <[email protected]>
libceph: force GFP_NOIO for socket allocations
Viresh Kumar <[email protected]>
cpufreq: Restore policy min/max limits on CPU online
Nicolas Ferre <[email protected]>
ARM: dts: at91: sama5d2: add dma properties to UART nodes
Nicolas Ferre <[email protected]>
ARM: at91: pm: cpu_idle: switch DDR to power-down mode
Koos Vriezen <[email protected]>
iommu/vt-d: Fix NULL pointer dereference in device_to_iommu
Ankur Arora <[email protected]>
xen/acpi: upload PM state from init-domain to Xen
Adrian Hunter <[email protected]>
mmc: sdhci: Do not disable interrupts while waiting for clock
Eric Biggers <[email protected]>
ext4: mark inode dirty after converting inline directory
Sudip Mukherjee <[email protected]>
parport: fix attempt to write duplicate procfiles
Song Hongyan <[email protected]>
iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3
Michael Engl <[email protected]>
iio: adc: ti_am335x_adc: fix fifo overrun recovery
Johan Hovold <[email protected]>
mmc: ushc: fix NULL-deref at probe
Johan Hovold <[email protected]>
uwb: hwa-rc: fix NULL-deref at probe
Johan Hovold <[email protected]>
uwb: i1480-dfu: fix NULL-deref at probe
Guenter Roeck <[email protected]>
usb: hub: Fix crash after failure to read BOS descriptor
Bin Liu <[email protected]>
usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
Johan Hovold <[email protected]>
USB: wusbcore: fix NULL-deref at probe
Johan Hovold <[email protected]>
USB: idmouse: fix NULL-deref at probe
Johan Hovold <[email protected]>
USB: lvtest: fix NULL-deref at probe
Johan Hovold <[email protected]>
USB: uss720: fix NULL-deref at probe
Samuel Thibault <[email protected]>
usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
Roger Quadros <[email protected]>
usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
Oliver Neukum <[email protected]>
ACM gadget: fix endianness in notifications
Bjørn Mork <[email protected]>
USB: serial: qcserial: add Dell DW5811e
Dan Williams <[email protected]>
USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
Hui Wang <[email protected]>
ALSA: hda - Adding a group of pin definition to fix headset problem
Takashi Iwai <[email protected]>
ALSA: ctxfi: Fix the incorrect check of dma_set_mask() call
Takashi Iwai <[email protected]>
ALSA: seq: Fix racy cell insertions during snd_seq_pool_done()
Johan Hovold <[email protected]>
Input: sur40 - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: kbtab - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: cm109 - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: yealink - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: hanwang - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: ims-pcu - validate number of endpoints before using them
Johan Hovold <[email protected]>
Input: iforce - validate number of endpoints before using them
Kai-Heng Feng <[email protected]>
Input: i8042 - add noloop quirk for Dell Embedded Box PC 3000
Matjaz Hegedic <[email protected]>
Input: elan_i2c - add ASUS EeeBook X205TA special touchpad fw
Eric Dumazet <[email protected]>
tcp: initialize icsk_ack.lrcvtime at session start time
Daniel Borkmann <[email protected]>
socket, bpf: fix sk_filter use after free in sk_clone_lock
Eric Dumazet <[email protected]>
ipv4: provide stronger user input validation in nl_fib_input()
Doug Berger <[email protected]>
net: bcmgenet: remove bcmgenet_internal_phy_setup()
Gal Pressman <[email protected]>
net/mlx5e: Count LRO packets correctly
Maor Gottlieb <[email protected]>
net/mlx5: Increase number of max QPs in default profile
Andrey Ulanov <[email protected]>
net: unix: properly re-increment inflight counter of GC discarded candidates
Lendacky, Thomas <[email protected]>
amd-xgbe: Fix jumbo MTU processing on newer hardware
Eric Dumazet <[email protected]>
net: properly release sk_frag.page
Florian Fainelli <[email protected]>
net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled
Or Gerlitz <[email protected]>
net/openvswitch: Set the ipv6 source tunnel key address attribute correctly
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/sama5d2.dtsi | 35 ++++++
arch/arm/mach-at91/pm.c | 18 ++-
arch/x86/kernel/cpu/mshyperv.c | 24 ++++
arch/x86/pci/xen.c | 23 ++--
block/scsi_ioctl.c | 3 +
crypto/algif_hash.c | 2 +-
drivers/acpi/blacklist.c | 28 +++++
drivers/cpufreq/cpufreq.c | 3 +
drivers/iio/adc/ti_am335x_adc.c | 13 ++-
.../iio/common/hid-sensors/hid-sensor-trigger.c | 6 +-
drivers/input/joystick/iforce/iforce-usb.c | 3 +
drivers/input/misc/cm109.c | 4 +
drivers/input/misc/ims-pcu.c | 4 +
drivers/input/misc/yealink.c | 4 +
drivers/input/mouse/elan_i2c_core.c | 20 ++--
drivers/input/serio/i8042-x86ia64io.h | 7 ++
drivers/input/tablet/hanwang.c | 3 +
drivers/input/tablet/kbtab.c | 3 +
drivers/input/touchscreen/sur40.c | 3 +
drivers/iommu/intel-iommu.c | 2 +-
drivers/md/raid10.c | 4 +-
drivers/media/usb/uvc/uvc_driver.c | 118 +++++++++++++++++++-
drivers/mmc/host/sdhci.c | 4 +-
drivers/mmc/host/ushc.c | 3 +
drivers/net/ethernet/amd/xgbe/xgbe-common.h | 6 +-
drivers/net/ethernet/amd/xgbe/xgbe-dev.c | 20 ++--
drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 102 ++++++++++-------
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 6 +-
drivers/net/ethernet/broadcom/genet/bcmmii.c | 15 ---
drivers/net/ethernet/intel/igb/e1000_phy.c | 4 +
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 4 +
drivers/net/ethernet/mellanox/mlx5/core/main.c | 2 +-
drivers/parport/share.c | 6 +-
drivers/pci/iov.c | 70 +++++++++---
drivers/pci/pci.c | 34 ------
drivers/pci/pci.h | 7 +-
drivers/pci/probe.c | 3 +-
drivers/pci/rom.c | 5 +
drivers/pci/setup-res.c | 48 +++++---
drivers/s390/crypto/ap_bus.c | 3 +
drivers/s390/crypto/ap_bus.h | 1 +
drivers/tty/serial/8250/8250_pci.c | 23 +++-
drivers/usb/class/usbtmc.c | 9 +-
drivers/usb/core/config.c | 10 ++
drivers/usb/core/hub.c | 2 +-
drivers/usb/core/quirks.c | 8 ++
drivers/usb/gadget/function/f_acm.c | 4 +-
drivers/usb/gadget/function/f_uvc.c | 2 +-
drivers/usb/misc/idmouse.c | 3 +
drivers/usb/misc/lvstest.c | 4 +
drivers/usb/misc/uss720.c | 5 +
drivers/usb/musb/musb_cppi41.c | 23 +++-
drivers/usb/serial/option.c | 17 ++-
drivers/usb/serial/qcserial.c | 2 +
drivers/usb/wusbcore/wa-hc.c | 3 +
drivers/uwb/hwa-rc.c | 3 +
drivers/uwb/i1480/dfu/usb.c | 3 +
drivers/vfio/vfio_iommu_spapr_tce.c | 20 ++--
drivers/video/console/fbcon.c | 67 +++++++-----
drivers/xen/xen-acpi-processor.c | 34 ++++--
fs/ext4/inline.c | 5 +-
fs/xfs/libxfs/xfs_inode_buf.c | 8 ++
fs/xfs/xfs_bmap_util.c | 7 +-
fs/xfs/xfs_buf.c | 1 +
include/linux/usb/quirks.h | 6 +
net/ceph/messenger.c | 6 +
net/ceph/osdmap.c | 1 -
net/core/sock.c | 16 ++-
net/ipv4/fib_frontend.c | 3 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_minisocks.c | 1 +
net/openvswitch/flow_netlink.c | 2 +-
net/unix/garbage.c | 17 +--
net/wireless/nl80211.c | 121 +++++++++------------
sound/core/seq/seq_clientmgr.c | 1 +
sound/core/seq/seq_fifo.c | 3 +
sound/core/seq/seq_memory.c | 17 ++-
sound/core/seq/seq_memory.h | 1 +
sound/pci/ctxfi/cthw20k1.c | 2 +-
sound/pci/hda/patch_realtek.c | 2 +
81 files changed, 804 insertions(+), 337 deletions(-)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 03ace948a4eb89d1cf51c06afdfc41ebca5fdb27 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer or accessing memory beyond the endpoint array should a
malicious device lack the expected endpoints.
This specifically fixes the NULL-pointer dereference when probing HWA HC
devices.
Fixes: df3654236e31 ("wusb: add the Wire Adapter (WA) core")
Cc: Inaky Perez-Gonzalez <[email protected]>
Cc: David Vrabel <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/wusbcore/wa-hc.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/usb/wusbcore/wa-hc.c
+++ b/drivers/usb/wusbcore/wa-hc.c
@@ -39,6 +39,9 @@ int wa_create(struct wahc *wa, struct us
int result;
struct device *dev = &iface->dev;
+ if (iface->cur_altsetting->desc.bNumEndpoints < 3)
+ return -ENODEV;
+
result = wa_rpipes_create(wa);
if (result < 0)
goto error_rpipes_create;
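[ Editor's note: this endpoint-count check follows the same defensive
pattern as the other probe fixes in this series (ushc, the uwb drivers,
and the various input drivers). As a generic illustration of the
pattern -- a hypothetical probe function, not code from any of these
patches:

	#include <linux/usb.h>

	static int example_probe(struct usb_interface *intf,
				 const struct usb_device_id *id)
	{
		/* Illustration only: hypothetical driver, not from this series */
		struct usb_host_interface *alt = intf->cur_altsetting;

		/*
		 * A malicious or broken device may expose fewer endpoints
		 * than the driver expects; bail out before indexing into
		 * the (possibly empty) endpoint array.
		 */
		if (alt->desc.bNumEndpoints < 1)
			return -ENODEV;

		if (!usb_endpoint_is_int_in(&alt->endpoint[0].desc))
			return -ENODEV;

		/* ... rest of probe ... */
		return 0;
	}
]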
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Guenter Roeck <[email protected]>
commit 7b2db29fbb4e766fcd02207eb2e2087170bd6ebc upstream.
If usb_get_bos_descriptor() returns an error, udev->bos will be NULL.
Nevertheless, it is dereferenced unconditionally in
hub_set_initial_usb2_lpm_policy() if usb2_hw_lpm_capable is set.
This results in a crash.
usb 5-1: unable to get BOS descriptor
...
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = ffffffc00165f000
[00000008] *pgd=000000000174f003, *pud=000000000174f003,
*pmd=0000000001750003, *pte=00e8000001751713
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Modules linked in: uinput uvcvideo videobuf2_vmalloc cmac [ ... ]
CPU: 5 PID: 3353 Comm: kworker/5:3 Tainted: G B 4.4.52 #480
Hardware name: Google Kevin (DT)
Workqueue: events driver_set_config_work
task: ffffffc0c3690000 ti: ffffffc0ae9a8000 task.ti: ffffffc0ae9a8000
PC is at hub_port_init+0xc3c/0xd10
LR is at hub_port_init+0xc3c/0xd10
...
Call trace:
[<ffffffc0007fbbfc>] hub_port_init+0xc3c/0xd10
[<ffffffc0007fbe2c>] usb_reset_and_verify_device+0x15c/0x82c
[<ffffffc0007fc5e0>] usb_reset_device+0xe4/0x298
[<ffffffbffc0e3fcc>] rtl8152_probe+0x84/0x9b0 [r8152]
[<ffffffc00080ca8c>] usb_probe_interface+0x244/0x2f8
[<ffffffc000774a24>] driver_probe_device+0x180/0x3b4
[<ffffffc000774e48>] __device_attach_driver+0xb4/0xe0
[<ffffffc000772168>] bus_for_each_drv+0xb4/0xe4
[<ffffffc0007747ec>] __device_attach+0xd0/0x158
[<ffffffc000775080>] device_initial_probe+0x24/0x30
[<ffffffc0007739d4>] bus_probe_device+0x50/0xe4
[<ffffffc000770bd0>] device_add+0x414/0x738
[<ffffffc000809fe8>] usb_set_configuration+0x89c/0x914
[<ffffffc00080a120>] driver_set_config_work+0xc0/0xf0
[<ffffffc000249bb8>] process_one_work+0x390/0x6b8
[<ffffffc00024abcc>] worker_thread+0x480/0x610
[<ffffffc000251a80>] kthread+0x164/0x178
[<ffffffc0002045d0>] ret_from_fork+0x10/0x40
Since we don't know anything about LPM capabilities without the BOS
descriptor, don't attempt to enable LPM if it is not available.
Fixes: 890dae886721 ("xhci: Enable LPM support only for hardwired ...")
Cc: Mathias Nyman <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Acked-by: Mathias Nyman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/core/hub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -4199,7 +4199,7 @@ static void hub_set_initial_usb2_lpm_pol
struct usb_hub *hub = usb_hub_to_struct_hub(udev->parent);
int connect_type = USB_PORT_CONNECT_TYPE_UNKNOWN;
- if (!udev->usb2_hw_lpm_capable)
+ if (!udev->usb2_hw_lpm_capable || !udev->bos)
return;
if (hub)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 181302dc7239add8ab1449c23ecab193f52ee6ab upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Fixes: 53f3a9e26ed5 ("mmc: USB SD Host Controller (USHC) driver")
Cc: David Vrabel <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/mmc/host/ushc.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/mmc/host/ushc.c
+++ b/drivers/mmc/host/ushc.c
@@ -426,6 +426,9 @@ static int ushc_probe(struct usb_interfa
struct ushc_data *ushc;
int ret;
+ if (intf->cur_altsetting->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
mmc = mmc_alloc_host(sizeof(struct ushc_data), &intf->dev);
if (mmc == NULL)
return -ENOMEM;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Adrian Hunter <[email protected]>
commit e2ebfb2142acefecc2496e71360f50d25726040b upstream.
Disabling interrupts for even a millisecond can cause problems for some
devices. That can happen when sdhci changes clock frequency because it
waits for the clock to become stable under a spin lock.
The spin lock is not necessary here. Anything that is racing with changes
to the I/O state is already broken. The mmc core already provides
synchronization via "claiming" the host.
Although the spin lock probably should be removed from the code paths that
lead to this point, such a patch would touch too much code to be suitable
for stable trees. Consequently, for this patch, just drop the spin lock
while waiting.
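[ Editor's note: for context, "claiming" the host is the mmc core's
exclusive-access primitive. A minimal sketch of that contract, using the
mmc_claim_host() helpers declared in <linux/mmc/core.h> in this era
(illustration only, not sdhci code):

	static void example_set_clock(struct mmc_host *host)
	{
		/* sketch, not the sdhci change itself */
		mmc_claim_host(host);	/* exclusive access to the controller */
		/* clock/ios changes happen here and may safely sleep */
		mmc_release_host(host);	/* let other claimants proceed */
	}

Because callers already serialize through this mechanism, dropping
host->lock around the sleep does not open a new race. ]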
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
Tested-by: Ludovic Desroches <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/mmc/host/sdhci.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1274,7 +1274,9 @@ clock_set:
return;
}
timeout--;
- mdelay(1);
+ spin_unlock_irq(&host->lock);
+ usleep_range(900, 1100);
+ spin_lock_irq(&host->lock);
}
clk |= SDHCI_CLOCK_CARD_EN;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Song Hongyan <[email protected]>
commit 3bec247474469f769af41e8c80d3a100dd97dd76 upstream.
In function _hid_sensor_power_state(), when hid_sensor_read_poll_value()
is called, all of the sensor's properties are updated with values read
back from the sensor hardware/firmware.
In some implementations, the sensor hardware/firmware power-cycles
during S3. In that case, after resume, once hid_sensor_read_poll_value()
is called, all of the sensor properties that the driver preserved
across S3 are reset to their default values.
If, however, a set-feature call is made first, the sensor
hardware/firmware is restored to its last status. So move the
hid_sensor_read_poll_value() call to after sensor_hub_set_feature(),
to avoid losing the sensor properties.
Signed-off-by: Song Hongyan <[email protected]>
Acked-by: Srinivas Pandruvada <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/iio/common/hid-sensors/hid-sensor-trigger.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
+++ b/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
@@ -51,8 +51,6 @@ static int _hid_sensor_power_state(struc
st->report_state.report_id,
st->report_state.index,
HID_USAGE_SENSOR_PROP_REPORTING_STATE_ALL_EVENTS_ENUM);
-
- poll_value = hid_sensor_read_poll_value(st);
} else {
int val;
@@ -89,7 +87,9 @@ static int _hid_sensor_power_state(struc
sensor_hub_get_feature(st->hsdev, st->power_state.report_id,
st->power_state.index,
sizeof(state_val), &state_val);
- if (state && poll_value)
+ if (state)
+ poll_value = hid_sensor_read_poll_value(st);
+ if (poll_value > 0)
msleep_interruptible(poll_value * 2);
return 0;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Hui Wang <[email protected]>
commit 3f307834e695f59dac4337a40316bdecfb9d0508 upstream.
A new Dell laptop needs ALC269_FIXUP_DELL1_MIC_NO_PRESENCE to fix
its headset problem, and this machine's pin definition is not in the
pin quirk table yet, so add it there.
Signed-off-by: Hui Wang <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
sound/pci/hda/patch_realtek.c | 2 ++
1 file changed, 2 insertions(+)
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -6040,6 +6040,8 @@ static const struct snd_hda_pin_quirk al
ALC295_STANDARD_PINS,
{0x17, 0x21014040},
{0x18, 0x21a19050}),
+ SND_HDA_PIN_QUIRK(0x10ec0295, 0x1028, "Dell", ALC269_FIXUP_DELL1_MIC_NO_PRESENCE,
+ ALC295_STANDARD_PINS),
SND_HDA_PIN_QUIRK(0x10ec0298, 0x1028, "Dell", ALC298_FIXUP_DELL1_MIC_NO_PRESENCE,
ALC298_STANDARD_PINS,
{0x17, 0x90170110}),
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andrey Ulanov <[email protected]>
[ Upstream commit 7df9c24625b9981779afb8fcdbe2bb4765e61147 ]
Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by simple code that forwards a unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().
The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.
Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have the UNIX_GC_CANDIDATE flag. It also added a
block of code that unsets this flag _before_ invoking
scan_children(, dec_inflight, ). This may lead to incorrect inflight
counters for some sockets.
This change fixes the bug by changing the order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.
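Concretely, with this change unix_gc() now: (1) restores the inflight
counters for all remaining candidates via scan_children(, inc_inflight, )
while collecting the garbage skbs into the hitlist, (2) moves the
not_cycle_list sockets back to gc_inflight_list, clearing
UNIX_GC_CANDIDATE only at that point, and (3) finally drops unix_gc_lock
and purges the hitlist.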
kernel BUG at net/unix/garbage.c:149!
RIP: 0010:[<ffffffff8717ebf4>] [<ffffffff8717ebf4>]
unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
Call Trace:
[<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
[<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
[<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
[<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
[<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
[<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
[<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
[<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
[<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
[<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
[<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
[<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
[<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
[<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
[<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
[<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
[<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
arch/x86/entry/common.c:259
[<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/unix/garbage.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -146,6 +146,7 @@ void unix_notinflight(struct user_struct
if (s) {
struct unix_sock *u = unix_sk(s);
+ BUG_ON(!atomic_long_read(&u->inflight));
BUG_ON(list_empty(&u->link));
if (atomic_long_dec_and_test(&u->inflight))
@@ -341,6 +342,14 @@ void unix_gc(void)
}
list_del(&cursor);
+ /* Now gc_candidates contains only garbage. Restore original
+ * inflight counters for these as well, and remove the skbuffs
+ * which are creating the cycle(s).
+ */
+ skb_queue_head_init(&hitlist);
+ list_for_each_entry(u, &gc_candidates, link)
+ scan_children(&u->sk, inc_inflight, &hitlist);
+
/* not_cycle_list contains those sockets which do not make up a
* cycle. Restore these to the inflight list.
*/
@@ -350,14 +359,6 @@ void unix_gc(void)
list_move_tail(&u->link, &gc_inflight_list);
}
- /* Now gc_candidates contains only garbage. Restore original
- * inflight counters for these as well, and remove the skbuffs
- * which are creating the cycle(s).
- */
- skb_queue_head_init(&hitlist);
- list_for_each_entry(u, &gc_candidates, link)
- scan_children(&u->sk, inc_inflight, &hitlist);
-
spin_unlock(&unix_gc_lock);
/* Here we are. Hitlist is filled. Die. */
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Koos Vriezen <[email protected]>
commit 5003ae1e735e6bfe4679d9bed6846274f322e77e upstream.
The function device_to_iommu() in the Intel VT-d driver
lacks a NULL-ptr check, resulting in this oops at boot on
some platforms:
BUG: unable to handle kernel NULL pointer dereference at 00000000000007ab
IP: [<ffffffff8132234a>] device_to_iommu+0x11a/0x1a0
PGD 0
[...]
Call Trace:
? find_or_alloc_domain.constprop.29+0x1a/0x300
? dw_dma_probe+0x561/0x580 [dw_dmac_core]
? __get_valid_domain_for_dev+0x39/0x120
? __intel_map_single+0x138/0x180
? intel_alloc_coherent+0xb6/0x120
? sst_hsw_dsp_init+0x173/0x420 [snd_soc_sst_haswell_pcm]
? mutex_lock+0x9/0x30
? kernfs_add_one+0xdb/0x130
? devres_add+0x19/0x60
? hsw_pcm_dev_probe+0x46/0xd0 [snd_soc_sst_haswell_pcm]
? platform_drv_probe+0x30/0x90
? driver_probe_device+0x1ed/0x2b0
? __driver_attach+0x8f/0xa0
? driver_probe_device+0x2b0/0x2b0
? bus_for_each_dev+0x55/0x90
? bus_add_driver+0x110/0x210
? 0xffffffffa11ea000
? driver_register+0x52/0xc0
? 0xffffffffa11ea000
? do_one_initcall+0x32/0x130
? free_vmap_area_noflush+0x37/0x70
? kmem_cache_alloc+0x88/0xd0
? do_init_module+0x51/0x1c4
? load_module+0x1ee9/0x2430
? show_taint+0x20/0x20
? kernel_read_file+0xfd/0x190
? SyS_finit_module+0xa3/0xb0
? do_syscall_64+0x4a/0xb0
? entry_SYSCALL64_slow_path+0x25/0x25
Code: 78 ff ff ff 4d 85 c0 74 ee 49 8b 5a 10 0f b6 9b e0 00 00 00 41 38 98 e0 00 00 00 77 da 0f b6 eb 49 39 a8 88 00 00 00 72 ce eb 8f <41> f6 82 ab 07 00 00 04 0f 85 76 ff ff ff 0f b6 4d 08 88 0e 49
RIP [<ffffffff8132234a>] device_to_iommu+0x11a/0x1a0
RSP <ffffc90001457a78>
CR2: 00000000000007ab
---[ end trace 16f974b6d58d0aad ]---
Add the missing pointer check.
Fixes: 1c387188c60f53b338c20eee32db055dfe022a9b ("iommu/vt-d: Fix IOMMU lookup for SR-IOV Virtual Functions")
Signed-off-by: Koos Vriezen <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/iommu/intel-iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -908,7 +908,7 @@ static struct intel_iommu *device_to_iom
* which we used for the IOMMU lookup. Strictly speaking
* we could do this for all PCI devices; we only need to
* get the BDF# from the scope table for ACPI matches. */
- if (pdev->is_virtfn)
+ if (pdev && pdev->is_virtfn)
goto got_pdev;
*bus = drhd->devices[i].bus;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Nicolas Ferre <[email protected]>
commit 60b89f1928af80b546b5c3fd8714a62f6f4b8844 upstream.
On some DDR controllers compatible with the sama5d3 one,
the sequence to enter/exit/re-enter self-refresh mode adds
more constraints than what is currently implemented in the at91_idle
driver: an actual access to the DDR chip is needed between the exit
and re-entry of this mode, which is difficult to implement.
This sequence can completely hang the SoC. It is particularly
likely on parts that embed an L2 cache, when the code running
between IDLE calls fits entirely in it.
Moreover, as the intention is to enter and exit IDLE rapidly,
the power-down mode is a good candidate.
So now we use power-down instead of self-refresh. As this lets us
simplify the code for sama5d3-compatible DDR controllers,
we introduce a new sama5d3_ddr_standby() function.
Signed-off-by: Nicolas Ferre <[email protected]>
Fixes: 017b5522d5e3 ("ARM: at91: Add new binding for sama5d3-ddramc")
Signed-off-by: Alexandre Belloni <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/arm/mach-at91/pm.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
--- a/arch/arm/mach-at91/pm.c
+++ b/arch/arm/mach-at91/pm.c
@@ -286,6 +286,22 @@ static void at91_ddr_standby(void)
at91_ramc_write(1, AT91_DDRSDRC_LPR, saved_lpr1);
}
+static void sama5d3_ddr_standby(void)
+{
+ u32 lpr0;
+ u32 saved_lpr0;
+
+ saved_lpr0 = at91_ramc_read(0, AT91_DDRSDRC_LPR);
+ lpr0 = saved_lpr0 & ~AT91_DDRSDRC_LPCB;
+ lpr0 |= AT91_DDRSDRC_LPCB_POWER_DOWN;
+
+ at91_ramc_write(0, AT91_DDRSDRC_LPR, lpr0);
+
+ cpu_do_idle();
+
+ at91_ramc_write(0, AT91_DDRSDRC_LPR, saved_lpr0);
+}
+
/* We manage both DDRAM/SDRAM controllers, we need more than one value to
* remember.
*/
@@ -320,7 +336,7 @@ static const struct of_device_id const r
{ .compatible = "atmel,at91rm9200-sdramc", .data = at91rm9200_standby },
{ .compatible = "atmel,at91sam9260-sdramc", .data = at91sam9_sdram_standby },
{ .compatible = "atmel,at91sam9g45-ddramc", .data = at91_ddr_standby },
- { .compatible = "atmel,sama5d3-ddramc", .data = at91_ddr_standby },
+ { .compatible = "atmel,sama5d3-ddramc", .data = sama5d3_ddr_standby },
{ /*sentinel*/ }
};
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Ilya Dryomov <[email protected]>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.
Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs. Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.
Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.
Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Sage Weil <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ceph/osdmap.c | 1 -
1 file changed, 1 deletion(-)
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1265,7 +1265,6 @@ static int decode_new_up_state_weight(vo
if ((map->osd_state[osd] & CEPH_OSD_EXISTS) &&
(xorstate & CEPH_OSD_EXISTS)) {
pr_info("osd%d does not exist\n", osd);
- map->osd_weight[osd] = CEPH_OSD_IN;
ret = set_primary_affinity(map, osd,
CEPH_OSD_DEFAULT_PRIMARY_AFFINITY);
if (ret)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 4ce362711d78a4999011add3115b8f4b0bc25e8c upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Note that the dereference happens in the cmd and wait_init_done
callbacks which are called during probe.
Fixes: 1ba47da52712 ("uwb: add the i1480 DFU driver")
Cc: Inaky Perez-Gonzalez <[email protected]>
Cc: David Vrabel <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/uwb/i1480/dfu/usb.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/uwb/i1480/dfu/usb.c
+++ b/drivers/uwb/i1480/dfu/usb.c
@@ -362,6 +362,9 @@ int i1480_usb_probe(struct usb_interface
result);
}
+ if (iface->cur_altsetting->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
result = -ENOMEM;
i1480_usb = kzalloc(sizeof(*i1480_usb), GFP_KERNEL);
if (i1480_usb == NULL) {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Dan Williams <[email protected]>
commit 6e9f44eaaef0df7b846e9316fa9ca72a02025d44 upstream.
Add Quectel UC15, UC20, EC21, and EC25. The EC20 is handled by
qcserial due to a USB VID/PID conflict with an existing Acer
device.
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/serial/option.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
--- a/drivers/usb/serial/option.c
+++ b/drivers/usb/serial/option.c
@@ -233,6 +233,14 @@ static void option_instat_callback(struc
#define BANDRICH_PRODUCT_1012 0x1012
#define QUALCOMM_VENDOR_ID 0x05C6
+/* These Quectel products use Qualcomm's vendor ID */
+#define QUECTEL_PRODUCT_UC20 0x9003
+#define QUECTEL_PRODUCT_UC15 0x9090
+
+#define QUECTEL_VENDOR_ID 0x2c7c
+/* These Quectel products use Quectel's vendor ID */
+#define QUECTEL_PRODUCT_EC21 0x0121
+#define QUECTEL_PRODUCT_EC25 0x0125
#define CMOTECH_VENDOR_ID 0x16d8
#define CMOTECH_PRODUCT_6001 0x6001
@@ -1161,7 +1169,14 @@ static const struct usb_device_id option
{ USB_DEVICE(QUALCOMM_VENDOR_ID, 0x6613)}, /* Onda H600/ZTE MF330 */
{ USB_DEVICE(QUALCOMM_VENDOR_ID, 0x0023)}, /* ONYX 3G device */
{ USB_DEVICE(QUALCOMM_VENDOR_ID, 0x9000)}, /* SIMCom SIM5218 */
- { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x9003), /* Quectel UC20 */
+ /* Quectel products using Qualcomm vendor ID */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, QUECTEL_PRODUCT_UC15)},
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, QUECTEL_PRODUCT_UC20),
+ .driver_info = (kernel_ulong_t)&net_intf4_blacklist },
+ /* Quectel products using Quectel vendor ID */
+ { USB_DEVICE(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_EC21),
+ .driver_info = (kernel_ulong_t)&net_intf4_blacklist },
+ { USB_DEVICE(QUECTEL_VENDOR_ID, QUECTEL_PRODUCT_EC25),
.driver_info = (kernel_ulong_t)&net_intf4_blacklist },
{ USB_DEVICE(CMOTECH_VENDOR_ID, CMOTECH_PRODUCT_6001) },
{ USB_DEVICE(CMOTECH_VENDOR_ID, CMOTECH_PRODUCT_CMU_300) },
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit ac2ee9ba953afe88f7a673e1c0c839227b1d7891 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Fixes: c04148f915e5 ("Input: add driver for USB VoIP phones with CM109...")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/misc/cm109.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/input/misc/cm109.c
+++ b/drivers/input/misc/cm109.c
@@ -675,6 +675,10 @@ static int cm109_usb_probe(struct usb_in
int error = -ENOMEM;
interface = intf->cur_altsetting;
+
+ if (interface->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
endpoint = &interface->endpoint[0].desc;
if (!usb_endpoint_is_int_in(endpoint))
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Dan Streetman <[email protected]>
[ Upstream commit c74fd80f2f41d05f350bb478151021f88551afe8 ]
Revert the main part of commit:
af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
That commit introduced reading the pci device's msi message data to see
if a pirq was previously configured for the device's msi/msix, and re-use
that pirq. At the time, that was the correct behavior. However, a
later change to Qemu caused it to call into the Xen hypervisor to unmap
all pirqs for a pci device, when the pci device disables its MSI/MSIX
vectors; specifically the Qemu commit:
c976437c7dba9c7444fb41df45468968aaa326ad
("qemu-xen: free all the pirqs for msi/msix when driver unload")
Once Qemu added this pirq unmapping, it was no longer correct for the
kernel to re-use the pirq number cached in the pci device msi message
data. All Qemu releases since 2.1.0 contain the patch that unmaps the
pirqs when the pci device disables its MSI/MSIX vectors.
This bug is causing failures to initialize multiple NVMe controllers
under Xen, because the NVMe driver sets up a single MSIX vector for
each controller (concurrently), and then after using that to talk to
the controller for some configuration data, it disables the single MSIX
vector and re-configures all the MSIX vectors it needs. So the MSIX
setup code tries to re-use the cached pirq from the first vector
for each controller, but the hypervisor has already given away that
pirq to another controller, and its initialization fails.
This is discussed in more detail at:
https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
Signed-off-by: Dan Streetman <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
Acked-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/x86/pci/xen.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,23 +231,14 @@ static int xen_hvm_setup_msi_irqs(struct
return 1;
for_each_pci_msi_entry(msidesc, dev) {
- __pci_read_msi_msg(msidesc, &msg);
- pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
- ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
- if (msg.data != XEN_PIRQ_MSI_DATA ||
- xen_irq_from_pirq(pirq) < 0) {
- pirq = xen_allocate_pirq_msi(dev, msidesc);
- if (pirq < 0) {
- irq = -ENODEV;
- goto error;
- }
- xen_msi_compose_msg(dev, pirq, &msg);
- __pci_write_msi_msg(msidesc, &msg);
- dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
- } else {
- dev_dbg(&dev->dev,
- "xen: msi already bound to pirq=%d\n", pirq);
+ pirq = xen_allocate_pirq_msi(dev, msidesc);
+ if (pirq < 0) {
+ irq = -ENODEV;
+ goto error;
}
+ xen_msi_compose_msg(dev, pirq, &msg);
+ __pci_write_msi_msg(msidesc, &msg);
+ dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
(type == PCI_CAP_ID_MSI) ? nvec : 1,
(type == PCI_CAP_ID_MSIX) ?
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Ilya Dryomov <[email protected]>
commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.
sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:
Workqueue: ceph-msgr con_work [libceph]
ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
[<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
[<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[<ffffffff81086335>] flush_work+0x165/0x250
[<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
[<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
[<ffffffff816d6b42>] ? __slab_free+0xee/0x234
[<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
[<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
[<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
[<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
[<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[<ffffffff811c0c18>] super_cache_scan+0x178/0x180
[<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
[<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
[<ffffffff8115af70>] shrink_slab+0x100/0x140
[<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
[<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
[<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
[<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
[<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
[<ffffffff811a0ac5>] new_slab+0x2c5/0x390
[<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
[<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
[<ffffffff811d8566>] alloc_inode+0x26/0xa0
[<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
[<ffffffff815b933e>] sock_alloc+0x1e/0x80
[<ffffffff815ba855>] __sock_create+0x95/0x220
[<ffffffff815baa04>] sock_create_kern+0x24/0x30
[<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
[<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
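For reference, memalloc_noio_save()/memalloc_noio_restore() (declared in
<linux/sched.h> in 4.4) mark the current task so that page allocations in
its context are implicitly degraded to GFP_NOIO. A minimal usage sketch,
not the libceph change itself:

	unsigned int noio_flag;

	noio_flag = memalloc_noio_save();	/* no I/O allowed in reclaim */
	/*
	 * Any allocation made here behaves as GFP_NOIO, including ones
	 * made deep inside callees such as sock_create_kern().
	 */
	memalloc_noio_restore(noio_flag);	/* restore previous context */

The save/restore pairing nests correctly, which is why the saved flag is
handed back to the restore call.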
Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ceph/messenger.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -7,6 +7,7 @@
#include <linux/kthread.h>
#include <linux/net.h>
#include <linux/nsproxy.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/string.h>
@@ -478,11 +479,16 @@ static int ceph_tcp_connect(struct ceph_
{
struct sockaddr_storage *paddr = &con->peer_addr.in_addr;
struct socket *sock;
+ unsigned int noio_flag;
int ret;
BUG_ON(con->sock);
+
+ /* sock_create_kern() allocates with GFP_KERNEL */
+ noio_flag = memalloc_noio_save();
ret = sock_create_kern(read_pnet(&con->msgr->net), paddr->ss_family,
SOCK_STREAM, IPPROTO_TCP, &sock);
+ memalloc_noio_restore(noio_flag);
if (ret)
return ret;
sock->sk->sk_allocation = GFP_NOFS;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Darrick J. Wong <[email protected]>
commit 2aa6ba7b5ad3189cc27f14540aa2f57f0ed8df4b upstream.
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
available memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other
memory-eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Eric Sandeen <[email protected]>
Cc: Ivan Kozik <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/xfs/xfs_buf.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -375,6 +375,7 @@ retry:
out_free_pages:
for (i = 0; i < bp->b_page_count; i++)
__free_page(bp->b_pages[i]);
+ bp->b_flags &= ~_XBF_PAGES;
return error;
}
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Roger Quadros <[email protected]>
commit 09424c50b7dff40cb30011c09114404a4656e023 upstream.
The streaming_maxburst module parameter is zero-based (0..15),
so we must add 1 when using it in the wBytesPerInterval
calculation for the SuperSpeed companion descriptor.
Without this, the host uvcvideo driver will always see the wrong
wBytesPerInterval for a SuperSpeed uvc gadget and may not find
a suitable video interface endpoint.
E.g. for the streaming_maxburst = 0 case it would always
fail, as wBytesPerInterval evaluated to 0.
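As a worked example (packet-size numbers hypothetical): with
max_packet_size = 1024 and max_packet_mult = 1, streaming_maxburst = 0
previously gave wBytesPerInterval = 1024 * 1 * 0 = 0, while the fixed
expression gives 1024 * 1 * (0 + 1) = 1024.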
Reviewed-by: Laurent Pinchart <[email protected]>
Signed-off-by: Roger Quadros <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/gadget/function/f_uvc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/usb/gadget/function/f_uvc.c
+++ b/drivers/usb/gadget/function/f_uvc.c
@@ -625,7 +625,7 @@ uvc_function_bind(struct usb_configurati
uvc_ss_streaming_comp.bMaxBurst = opts->streaming_maxburst;
uvc_ss_streaming_comp.wBytesPerInterval =
cpu_to_le16(max_packet_size * max_packet_mult *
- opts->streaming_maxburst);
+ (opts->streaming_maxburst + 1));
/* Allocate endpoints. */
ep = usb_ep_autoconfig(cdev->gadget, &uvc_control_ep);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 0b457dde3cf8b7c76a60f8e960f21bbd4abdc416 ]
pci_update_resource() updates a hardware BAR so its address matches the
kernel's struct resource UNLESS it's a disabled ROM BAR. We only update
those when we enable the ROM.
It's not obvious from the code why ROM BARs should be handled specially.
Apparently there are Matrox devices with defective ROM BARs that read as
zero when disabled. That means that if pci_enable_rom() reads the disabled
BAR, sets PCI_ROM_ADDRESS_ENABLE (without re-inserting the address), and
writes it back, it would enable the ROM at address zero.
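Schematically, the hazard on such a device looks like this (an
illustration of the failure mode, not kernel code):

	u32 val;

	pci_read_config_dword(pdev, rom_reg, &val); /* buggy BAR reads as 0 */
	val |= PCI_ROM_ADDRESS_ENABLE;              /* 0x0 | 0x1 == 0x1 */
	pci_write_config_dword(pdev, rom_reg, val); /* ROM now decodes at 0 */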
Add comments and references to explain why we can't make the code look more
rational.
The code changes are from 755528c860b0 ("Ignore disabled ROM resources at
setup") and 8085ce084c0f ("[PATCH] Fix PCI ROM mapping").
Link: https://lkml.org/lkml/2005/8/30/138
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
[sumits: minor fixup in rom.c for 4.4.y]
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/rom.c | 5 +++++
drivers/pci/setup-res.c | 6 ++++++
2 files changed, 11 insertions(+)
--- a/drivers/pci/rom.c
+++ b/drivers/pci/rom.c
@@ -31,6 +31,11 @@ int pci_enable_rom(struct pci_dev *pdev)
if (!res->flags)
return -1;
+ /*
+ * Ideally pci_update_resource() would update the ROM BAR address,
+ * and we would only set the enable bit here. But apparently some
+ * devices have buggy ROM BARs that read as zero when disabled.
+ */
pcibios_resource_to_bus(pdev->bus, &region, res);
pci_read_config_dword(pdev, pdev->rom_base_reg, &rom_addr);
rom_addr &= ~PCI_ROM_ADDRESS_MASK;
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -68,6 +68,12 @@ static void pci_std_update_resource(stru
if (resno < PCI_ROM_RESOURCE) {
reg = PCI_BASE_ADDRESS_0 + 4 * resno;
} else if (resno == PCI_ROM_RESOURCE) {
+
+ /*
+ * Apparently some Matrox devices have ROM BARs that read
+ * as zero when disabled, so don't update ROM BARs unless
+ * they're enabled. See https://lkml.org/lkml/2005/8/30/138.
+ */
if (!(res->flags & IORESOURCE_ROM_ENABLE))
return;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Alex Hung <[email protected]>
[ Upstream commit 9523b9bf6dceef6b0215e90b2348cd646597f796 ]
The Precision 5520 and 3520 either hang at login or hang during suspend
or reboot. It turns out that adding them to acpi_rev_dmi_table[] helps
to work around those issues.
Signed-off-by: Alex Hung <[email protected]>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/acpi/blacklist.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
--- a/drivers/acpi/blacklist.c
+++ b/drivers/acpi/blacklist.c
@@ -346,6 +346,22 @@ static struct dmi_system_id acpi_osi_dmi
DMI_MATCH(DMI_PRODUCT_NAME, "XPS 13 9343"),
},
},
+ {
+ .callback = dmi_enable_rev_override,
+ .ident = "DELL Precision 5520",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Precision 5520"),
+ },
+ },
+ {
+ .callback = dmi_enable_rev_override,
+ .ident = "DELL Precision 3520",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Precision 3520"),
+ },
+ },
#endif
{}
};
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Vitaly Kuznetsov <[email protected]>
[ Upstream commit 59107e2f48831daedc46973ce4988605ab066de3 ]
There is a feature in Hyper-V ('Debug-VM --InjectNonMaskableInterrupt')
which injects NMI to the guest. We may want to crash the guest and do kdump
on this NMI by enabling unknown_nmi_panic. To make kdump succeed we need to
allow the kdump kernel to re-establish VMBus connection so it will see
VMBus devices (storage, network,..).
To properly unload VMBus, making it possible to start over during kdump, we
need to do the following:
- Send an 'unload' message to the hypervisor. This can be done on any CPU
so we do this on the crashing CPU.
- Receive the 'unload finished' reply message. WS2012R2 delivers this
message to the CPU which was used to establish VMBus connection during
module load and this CPU may differ from the CPU sending 'unload'.
Receiving a VMBus message means the following:
- There is a per-CPU slot in memory for one message. This slot can in
theory be accessed by any CPU.
- We get an interrupt on the CPU when a message was placed into the slot.
- When we read the message we need to clear the slot and signal the fact
to the hypervisor. In case there are more messages to this CPU pending
the hypervisor will deliver the next message. The signaling is done by
writing to an MSR so this can only be done on the appropriate CPU.
To avoid doing cross-CPU work on crash we have the vmbus_wait_for_unload()
function, which checks the message slots of all CPUs in a loop, waiting for
the 'unload finished' message. However, an issue arises when
these conditions are met:
- We're crashing on a CPU which is different from the one which was used
to initially contact the hypervisor.
- The CPU which was used for the initial contact is blocked with interrupts
disabled and there is a message pending in the message slot.
In this case we won't be able to read the 'unload finished' message on the
crashing CPU. This is reproducible when we receive unknown NMIs on all CPUs
simultaneously: the first CPU entering panic() will proceed to crash and
all other CPUs will stop themselves with interrupts disabled.
The suggested solution is to handle unknown NMIs for Hyper-V guests only on
the first CPU that gets them. This allows us to rely on the VMBus
interrupt handler being able to receive the 'unload finished' message even
when it is delivered to a different CPU.
The issue is not reproducible on WS2016 as Debug-VM delivers NMI to the
boot CPU only; WS2012R2 and earlier Hyper-V versions are affected.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Acked-by: K. Y. Srinivasan <[email protected]>
Cc: [email protected]
Cc: Haiyang Zhang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/x86/kernel/cpu/mshyperv.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -30,6 +30,7 @@
#include <asm/apic.h>
#include <asm/timer.h>
#include <asm/reboot.h>
+#include <asm/nmi.h>
struct ms_hyperv_info ms_hyperv;
EXPORT_SYMBOL_GPL(ms_hyperv);
@@ -157,6 +158,26 @@ static unsigned char hv_get_nmi_reason(v
return 0;
}
+#ifdef CONFIG_X86_LOCAL_APIC
+/*
+ * Prior to WS2016 Debug-VM sends NMIs to all CPUs which makes
+ * it dificult to process CHANNELMSG_UNLOAD in case of crash. Handle
+ * unknown NMI on the first CPU which gets it.
+ */
+static int hv_nmi_unknown(unsigned int val, struct pt_regs *regs)
+{
+ static atomic_t nmi_cpu = ATOMIC_INIT(-1);
+
+ if (!unknown_nmi_panic)
+ return NMI_DONE;
+
+ if (atomic_cmpxchg(&nmi_cpu, -1, raw_smp_processor_id()) != -1)
+ return NMI_HANDLED;
+
+ return NMI_DONE;
+}
+#endif
+
static void __init ms_hyperv_init_platform(void)
{
/*
@@ -182,6 +203,9 @@ static void __init ms_hyperv_init_platfo
printk(KERN_INFO "HyperV: LAPIC Timer Frequency: %#x\n",
lapic_timer_frequency);
}
+
+ register_nmi_handler(NMI_UNKNOWN, hv_nmi_unknown, NMI_FLAG_FIRST,
+ "hv_nmi_unknown");
#endif
if (ms_hyperv.features & HV_X64_MSR_TIME_REF_COUNT_AVAILABLE)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Michael Pobega <[email protected]>
[ Upstream commit 708f5dcc21ae9b35f395865fc154b0105baf4de4 ]
The Dell Latitude 3350's ethernet card attempts to use a reserved
IRQ (18), resulting in ACPI being unable to enable the ethernet.
Adding it to acpi_rev_dmi_table[] helps to work around this problem.
Signed-off-by: Michael Pobega <[email protected]>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/acpi/blacklist.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--- a/drivers/acpi/blacklist.c
+++ b/drivers/acpi/blacklist.c
@@ -362,6 +362,18 @@ static struct dmi_system_id acpi_osi_dmi
DMI_MATCH(DMI_PRODUCT_NAME, "Precision 3520"),
},
},
+ /*
+ * Resolves a quirk with the Dell Latitude 3350 that
+ * causes the ethernet adapter to not function.
+ */
+ {
+ .callback = dmi_enable_rev_override,
+ .ident = "DELL Latitude 3350",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Latitude 3350"),
+ },
+ },
#endif
{}
};
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Takashi Iwai <[email protected]>
commit 8aac7f34369726d1a158788ae8aff3002d5eb528 upstream.
fbcon can deal with vc_hi_font_mask (the upper 256 chars) and adjust
the vc attrs dynamically when vc_hi_font_mask is changed at
fbcon_init(). When the vc_hi_font_mask is set, it remaps the attrs in
the existing console buffer with one bit shift up (for 9 bits), while
it remaps with one bit shift down (for 8 bits) when the value is
cleared. It works fine as long as the font gets updated after fbcon
was initialized.
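As an illustration of that remapping (taken from the shifted-down case
in the code below): clearing vc_hi_font_mask rewrites each 9-bit
character cell c as ((c & 0xfe00) >> 1) | (c & 0xff), folding the
attribute bits back into an 8-bit character layout.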
However, we hit a bizarre problem when the console is switched to
another fb driver (typically from vesafb or efifb to drmfb). At
switching to the new fb driver, we temporarily rebind the console to
the dummy console, then rebind to the new driver. During the
switching, we leave the modified attrs as is. Thus, the new fbcon
takes over the old buffer as if it contained 8-bit chars
(although the attrs are still shifted for 9 bits), and effectively
this results in yellow text instead of the original white
color, as found in the bugzilla entry below.
An easy fix for this is to re-adjust the attrs before leaving fbcon,
in the con_deinit callback. Since the code to adjust the attrs is
already present in the current fbcon code, in this patch, we simply
factor out the relevant code, and call it from fbcon_deinit().
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000619
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Bartlomiej Zolnierkiewicz <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/video/console/fbcon.c | 67 +++++++++++++++++++++++++-----------------
1 file changed, 40 insertions(+), 27 deletions(-)
--- a/drivers/video/console/fbcon.c
+++ b/drivers/video/console/fbcon.c
@@ -1168,6 +1168,8 @@ static void fbcon_free_font(struct displ
p->userfont = 0;
}
+static void set_vc_hi_font(struct vc_data *vc, bool set);
+
static void fbcon_deinit(struct vc_data *vc)
{
struct display *p = &fb_display[vc->vc_num];
@@ -1203,6 +1205,9 @@ finished:
if (free_font)
vc->vc_font.data = NULL;
+ if (vc->vc_hi_font_mask)
+ set_vc_hi_font(vc, false);
+
if (!con_is_bound(&fb_con))
fbcon_exit();
@@ -2439,32 +2444,10 @@ static int fbcon_get_font(struct vc_data
return 0;
}
-static int fbcon_do_set_font(struct vc_data *vc, int w, int h,
- const u8 * data, int userfont)
+/* set/clear vc_hi_font_mask and update vc attrs accordingly */
+static void set_vc_hi_font(struct vc_data *vc, bool set)
{
- struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
- struct fbcon_ops *ops = info->fbcon_par;
- struct display *p = &fb_display[vc->vc_num];
- int resize;
- int cnt;
- char *old_data = NULL;
-
- if (CON_IS_VISIBLE(vc) && softback_lines)
- fbcon_set_origin(vc);
-
- resize = (w != vc->vc_font.width) || (h != vc->vc_font.height);
- if (p->userfont)
- old_data = vc->vc_font.data;
- if (userfont)
- cnt = FNTCHARCNT(data);
- else
- cnt = 256;
- vc->vc_font.data = (void *)(p->fontdata = data);
- if ((p->userfont = userfont))
- REFCOUNT(data)++;
- vc->vc_font.width = w;
- vc->vc_font.height = h;
- if (vc->vc_hi_font_mask && cnt == 256) {
+ if (!set) {
vc->vc_hi_font_mask = 0;
if (vc->vc_can_do_color) {
vc->vc_complement_mask >>= 1;
@@ -2487,7 +2470,7 @@ static int fbcon_do_set_font(struct vc_d
((c & 0xfe00) >> 1) | (c & 0xff);
vc->vc_attr >>= 1;
}
- } else if (!vc->vc_hi_font_mask && cnt == 512) {
+ } else {
vc->vc_hi_font_mask = 0x100;
if (vc->vc_can_do_color) {
vc->vc_complement_mask <<= 1;
@@ -2519,8 +2502,38 @@ static int fbcon_do_set_font(struct vc_d
} else
vc->vc_video_erase_char = c & ~0x100;
}
-
}
+}
+
+static int fbcon_do_set_font(struct vc_data *vc, int w, int h,
+ const u8 * data, int userfont)
+{
+ struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
+ struct fbcon_ops *ops = info->fbcon_par;
+ struct display *p = &fb_display[vc->vc_num];
+ int resize;
+ int cnt;
+ char *old_data = NULL;
+
+ if (CON_IS_VISIBLE(vc) && softback_lines)
+ fbcon_set_origin(vc);
+
+ resize = (w != vc->vc_font.width) || (h != vc->vc_font.height);
+ if (p->userfont)
+ old_data = vc->vc_font.data;
+ if (userfont)
+ cnt = FNTCHARCNT(data);
+ else
+ cnt = 256;
+ vc->vc_font.data = (void *)(p->fontdata = data);
+ if ((p->userfont = userfont))
+ REFCOUNT(data)++;
+ vc->vc_font.width = w;
+ vc->vc_font.height = h;
+ if (vc->vc_hi_font_mask && cnt == 256)
+ set_vc_hi_font(vc, false);
+ else if (!vc->vc_hi_font_mask && cnt == 512)
+ set_vc_hi_font(vc, true);
if (resize) {
int cols, rows;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 546ba9f8f22f71b0202b6ba8967be5cc6dae4e21 ]
If we update a VF BAR while it's enabled, there are two potential problems:
1) Any driver that's using the VF has a cached BAR value that is stale
after the update, and
2) We can't update 64-bit BARs atomically, so the intermediate state
(new lower dword with old upper dword) may conflict with another
device, and an access by a driver unrelated to the VF may cause a bus
error.
Warn about attempts to update VF BARs while they are enabled. This is a
programming error, so use dev_WARN() to get a backtrace.
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/iov.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -567,6 +567,7 @@ void pci_iov_update_resource(struct pci_
struct resource *res = dev->resource + resno;
int vf_bar = resno - PCI_IOV_RESOURCES;
struct pci_bus_region region;
+ u16 cmd;
u32 new;
int reg;
@@ -578,6 +579,13 @@ void pci_iov_update_resource(struct pci_
if (!iov)
return;
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &cmd);
+ if ((cmd & PCI_SRIOV_CTRL_VFE) && (cmd & PCI_SRIOV_CTRL_MSE)) {
+ dev_WARN(&dev->dev, "can't update enabled VF BAR%d %pR\n",
+ vf_bar, res);
+ return;
+ }
+
/*
* Ignore unimplemented BARs, unused resource slots for 64-bit
* BARs, and non-movable resources, e.g., those described via
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Takashi Iwai <[email protected]>
commit f363a06642f28caaa78cb6446bbad90c73fe183c upstream.
In the commit [15c75b09f8d1: ALSA: ctxfi: Fallback DMA mask to 32bit],
I forgot to put "!" at the dma_set_mask() call check in cthw20k1.c (while
cthw20k2.c is OK). This patch fixes that obvious bug.
(As a side note: although the original commit was completely wrong,
it still works on most machines, as it ends up setting the 32-bit DMA
mask anyway. So the bug severity is low.)
Fixes: 15c75b09f8d1 ("ALSA: ctxfi: Fallback DMA mask to 32bit")
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
sound/pci/ctxfi/cthw20k1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/sound/pci/ctxfi/cthw20k1.c
+++ b/sound/pci/ctxfi/cthw20k1.c
@@ -1905,7 +1905,7 @@ static int hw_card_start(struct hw *hw)
return err;
/* Set DMA transfer mask */
- if (dma_set_mask(&pci->dev, DMA_BIT_MASK(dma_bits))) {
+ if (!dma_set_mask(&pci->dev, DMA_BIT_MASK(dma_bits))) {
dma_set_coherent_mask(&pci->dev, DMA_BIT_MASK(dma_bits));
} else {
dma_set_mask(&pci->dev, DMA_BIT_MASK(32));
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johannes Berg <[email protected]>
commit ea90e0dc8cecba6359b481e24d9c37160f6f524f upstream.
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss
unlocking the RTNL.
To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.
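[ note: the resulting dumpit shape, as a sketch only -- the exact error
labels vary per function: ]
	rtnl_lock();
	err = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
	if (err)
		goto out_err;
	/* ... do the dump with the RTNL held throughout ... */
 out_err:
	rtnl_unlock();
	return err;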
Reported-by: Sowmini Varadhan <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/wireless/nl80211.c | 121 +++++++++++++++++++++----------------------------
1 file changed, 53 insertions(+), 68 deletions(-)
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -492,21 +492,17 @@ static int nl80211_prepare_wdev_dump(str
{
int err;
- rtnl_lock();
-
if (!cb->args[0]) {
err = nlmsg_parse(cb->nlh, GENL_HDRLEN + nl80211_fam.hdrsize,
nl80211_fam.attrbuf, nl80211_fam.maxattr,
nl80211_policy);
if (err)
- goto out_unlock;
+ return err;
*wdev = __cfg80211_wdev_from_attrs(sock_net(skb->sk),
nl80211_fam.attrbuf);
- if (IS_ERR(*wdev)) {
- err = PTR_ERR(*wdev);
- goto out_unlock;
- }
+ if (IS_ERR(*wdev))
+ return PTR_ERR(*wdev);
*rdev = wiphy_to_rdev((*wdev)->wiphy);
/* 0 is the first index - add 1 to parse only once */
cb->args[0] = (*rdev)->wiphy_idx + 1;
@@ -516,10 +512,8 @@ static int nl80211_prepare_wdev_dump(str
struct wiphy *wiphy = wiphy_idx_to_wiphy(cb->args[0] - 1);
struct wireless_dev *tmp;
- if (!wiphy) {
- err = -ENODEV;
- goto out_unlock;
- }
+ if (!wiphy)
+ return -ENODEV;
*rdev = wiphy_to_rdev(wiphy);
*wdev = NULL;
@@ -530,21 +524,11 @@ static int nl80211_prepare_wdev_dump(str
}
}
- if (!*wdev) {
- err = -ENODEV;
- goto out_unlock;
- }
+ if (!*wdev)
+ return -ENODEV;
}
return 0;
- out_unlock:
- rtnl_unlock();
- return err;
-}
-
-static void nl80211_finish_wdev_dump(struct cfg80211_registered_device *rdev)
-{
- rtnl_unlock();
}
/* IE validation */
@@ -3884,9 +3868,10 @@ static int nl80211_dump_station(struct s
int sta_idx = cb->args[2];
int err;
+ rtnl_lock();
err = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
if (err)
- return err;
+ goto out_err;
if (!wdev->netdev) {
err = -EINVAL;
@@ -3922,7 +3907,7 @@ static int nl80211_dump_station(struct s
cb->args[2] = sta_idx;
err = skb->len;
out_err:
- nl80211_finish_wdev_dump(rdev);
+ rtnl_unlock();
return err;
}
@@ -4639,9 +4624,10 @@ static int nl80211_dump_mpath(struct sk_
int path_idx = cb->args[2];
int err;
+ rtnl_lock();
err = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
if (err)
- return err;
+ goto out_err;
if (!rdev->ops->dump_mpath) {
err = -EOPNOTSUPP;
@@ -4675,7 +4661,7 @@ static int nl80211_dump_mpath(struct sk_
cb->args[2] = path_idx;
err = skb->len;
out_err:
- nl80211_finish_wdev_dump(rdev);
+ rtnl_unlock();
return err;
}
@@ -4835,9 +4821,10 @@ static int nl80211_dump_mpp(struct sk_bu
int path_idx = cb->args[2];
int err;
+ rtnl_lock();
err = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
if (err)
- return err;
+ goto out_err;
if (!rdev->ops->dump_mpp) {
err = -EOPNOTSUPP;
@@ -4870,7 +4857,7 @@ static int nl80211_dump_mpp(struct sk_bu
cb->args[2] = path_idx;
err = skb->len;
out_err:
- nl80211_finish_wdev_dump(rdev);
+ rtnl_unlock();
return err;
}
@@ -6806,9 +6793,12 @@ static int nl80211_dump_scan(struct sk_b
int start = cb->args[2], idx = 0;
int err;
+ rtnl_lock();
err = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
- if (err)
+ if (err) {
+ rtnl_unlock();
return err;
+ }
wdev_lock(wdev);
spin_lock_bh(&rdev->bss_lock);
@@ -6831,7 +6821,7 @@ static int nl80211_dump_scan(struct sk_b
wdev_unlock(wdev);
cb->args[2] = idx;
- nl80211_finish_wdev_dump(rdev);
+ rtnl_unlock();
return skb->len;
}
@@ -6915,9 +6905,10 @@ static int nl80211_dump_survey(struct sk
int res;
bool radio_stats;
+ rtnl_lock();
res = nl80211_prepare_wdev_dump(skb, cb, &rdev, &wdev);
if (res)
- return res;
+ goto out_err;
/* prepare_wdev_dump parsed the attributes */
radio_stats = nl80211_fam.attrbuf[NL80211_ATTR_SURVEY_RADIO_STATS];
@@ -6958,7 +6949,7 @@ static int nl80211_dump_survey(struct sk
cb->args[2] = survey_idx;
res = skb->len;
out_err:
- nl80211_finish_wdev_dump(rdev);
+ rtnl_unlock();
return res;
}
@@ -10158,17 +10149,13 @@ static int nl80211_prepare_vendor_dump(s
void *data = NULL;
unsigned int data_len = 0;
- rtnl_lock();
-
if (cb->args[0]) {
/* subtract the 1 again here */
struct wiphy *wiphy = wiphy_idx_to_wiphy(cb->args[0] - 1);
struct wireless_dev *tmp;
- if (!wiphy) {
- err = -ENODEV;
- goto out_unlock;
- }
+ if (!wiphy)
+ return -ENODEV;
*rdev = wiphy_to_rdev(wiphy);
*wdev = NULL;
@@ -10189,13 +10176,11 @@ static int nl80211_prepare_vendor_dump(s
nl80211_fam.attrbuf, nl80211_fam.maxattr,
nl80211_policy);
if (err)
- goto out_unlock;
+ return err;
if (!nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_ID] ||
- !nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_SUBCMD]) {
- err = -EINVAL;
- goto out_unlock;
- }
+ !nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_SUBCMD])
+ return -EINVAL;
*wdev = __cfg80211_wdev_from_attrs(sock_net(skb->sk),
nl80211_fam.attrbuf);
@@ -10204,10 +10189,8 @@ static int nl80211_prepare_vendor_dump(s
*rdev = __cfg80211_rdev_from_attrs(sock_net(skb->sk),
nl80211_fam.attrbuf);
- if (IS_ERR(*rdev)) {
- err = PTR_ERR(*rdev);
- goto out_unlock;
- }
+ if (IS_ERR(*rdev))
+ return PTR_ERR(*rdev);
vid = nla_get_u32(nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_ID]);
subcmd = nla_get_u32(nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_SUBCMD]);
@@ -10220,19 +10203,15 @@ static int nl80211_prepare_vendor_dump(s
if (vcmd->info.vendor_id != vid || vcmd->info.subcmd != subcmd)
continue;
- if (!vcmd->dumpit) {
- err = -EOPNOTSUPP;
- goto out_unlock;
- }
+ if (!vcmd->dumpit)
+ return -EOPNOTSUPP;
vcmd_idx = i;
break;
}
- if (vcmd_idx < 0) {
- err = -EOPNOTSUPP;
- goto out_unlock;
- }
+ if (vcmd_idx < 0)
+ return -EOPNOTSUPP;
if (nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_DATA]) {
data = nla_data(nl80211_fam.attrbuf[NL80211_ATTR_VENDOR_DATA]);
@@ -10249,9 +10228,6 @@ static int nl80211_prepare_vendor_dump(s
/* keep rtnl locked in successful case */
return 0;
- out_unlock:
- rtnl_unlock();
- return err;
}
static int nl80211_vendor_cmd_dump(struct sk_buff *skb,
@@ -10266,9 +10242,10 @@ static int nl80211_vendor_cmd_dump(struc
int err;
struct nlattr *vendor_data;
+ rtnl_lock();
err = nl80211_prepare_vendor_dump(skb, cb, &rdev, &wdev);
if (err)
- return err;
+ goto out;
vcmd_idx = cb->args[2];
data = (void *)cb->args[3];
@@ -10277,18 +10254,26 @@ static int nl80211_vendor_cmd_dump(struc
if (vcmd->flags & (WIPHY_VENDOR_CMD_NEED_WDEV |
WIPHY_VENDOR_CMD_NEED_NETDEV)) {
- if (!wdev)
- return -EINVAL;
+ if (!wdev) {
+ err = -EINVAL;
+ goto out;
+ }
if (vcmd->flags & WIPHY_VENDOR_CMD_NEED_NETDEV &&
- !wdev->netdev)
- return -EINVAL;
+ !wdev->netdev) {
+ err = -EINVAL;
+ goto out;
+ }
if (vcmd->flags & WIPHY_VENDOR_CMD_NEED_RUNNING) {
if (wdev->netdev &&
- !netif_running(wdev->netdev))
- return -ENETDOWN;
- if (!wdev->netdev && !wdev->p2p_started)
- return -ENETDOWN;
+ !netif_running(wdev->netdev)) {
+ err = -ENETDOWN;
+ goto out;
+ }
+ if (!wdev->netdev && !wdev->p2p_started) {
+ err = -ENETDOWN;
+ goto out;
+ }
}
}
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 286c2378aaccc7343ebf17ec6cd86567659caf70 ]
pci_std_update_resource() only deals with standard BARs, so we don't have
to worry about the complications of VF BARs in an SR-IOV capability.
Compute the BAR address inline and remove pci_resource_bar(). That makes
pci_iov_resource_bar() unused, so remove that as well.
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/iov.c | 18 ------------------
drivers/pci/pci.c | 30 ------------------------------
drivers/pci/pci.h | 6 ------
drivers/pci/setup-res.c | 13 +++++++------
4 files changed, 7 insertions(+), 60 deletions(-)
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -555,24 +555,6 @@ void pci_iov_release(struct pci_dev *dev
}
/**
- * pci_iov_resource_bar - get position of the SR-IOV BAR
- * @dev: the PCI device
- * @resno: the resource number
- *
- * Returns position of the BAR encapsulated in the SR-IOV capability.
- */
-int pci_iov_resource_bar(struct pci_dev *dev, int resno)
-{
- if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCE_END)
- return 0;
-
- BUG_ON(!dev->is_physfn);
-
- return dev->sriov->pos + PCI_SRIOV_BAR +
- 4 * (resno - PCI_IOV_RESOURCES);
-}
-
-/**
* pci_iov_update_resource - update a VF BAR
* @dev: the PCI device
* @resno: the resource number
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4472,36 +4472,6 @@ int pci_select_bars(struct pci_dev *dev,
}
EXPORT_SYMBOL(pci_select_bars);
-/**
- * pci_resource_bar - get position of the BAR associated with a resource
- * @dev: the PCI device
- * @resno: the resource number
- * @type: the BAR type to be filled in
- *
- * Returns BAR position in config space, or 0 if the BAR is invalid.
- */
-int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
-{
- int reg;
-
- if (resno < PCI_ROM_RESOURCE) {
- *type = pci_bar_unknown;
- return PCI_BASE_ADDRESS_0 + 4 * resno;
- } else if (resno == PCI_ROM_RESOURCE) {
- *type = pci_bar_mem32;
- return dev->rom_base_reg;
- } else if (resno < PCI_BRIDGE_RESOURCES) {
- /* device specific resource */
- *type = pci_bar_unknown;
- reg = pci_iov_resource_bar(dev, resno);
- if (reg)
- return reg;
- }
-
- dev_err(&dev->dev, "BAR %d: invalid resource\n", resno);
- return 0;
-}
-
/* Some architectures require additional programming to enable VGA */
static arch_set_vga_state_t arch_set_vga_state;
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -232,7 +232,6 @@ bool pci_bus_read_dev_vendor_id(struct p
int pci_setup_device(struct pci_dev *dev);
int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
-int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type);
void pci_configure_ari(struct pci_dev *dev);
void __pci_bus_size_bridges(struct pci_bus *bus,
struct list_head *realloc_head);
@@ -276,7 +275,6 @@ static inline void pci_restore_ats_state
#ifdef CONFIG_PCI_IOV
int pci_iov_init(struct pci_dev *dev);
void pci_iov_release(struct pci_dev *dev);
-int pci_iov_resource_bar(struct pci_dev *dev, int resno);
void pci_iov_update_resource(struct pci_dev *dev, int resno);
resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
void pci_restore_iov_state(struct pci_dev *dev);
@@ -291,10 +289,6 @@ static inline void pci_iov_release(struc
{
}
-static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno)
-{
- return 0;
-}
static inline void pci_restore_iov_state(struct pci_dev *dev)
{
}
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -32,7 +32,6 @@ static void pci_std_update_resource(stru
u16 cmd;
u32 new, check, mask;
int reg;
- enum pci_bar_type type;
struct resource *res = dev->resource + resno;
if (dev->is_virtfn) {
@@ -66,14 +65,16 @@ static void pci_std_update_resource(stru
else
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
- reg = pci_resource_bar(dev, resno, &type);
- if (!reg)
- return;
- if (type != pci_bar_unknown) {
+ if (resno < PCI_ROM_RESOURCE) {
+ reg = PCI_BASE_ADDRESS_0 + 4 * resno;
+ } else if (resno == PCI_ROM_RESOURCE) {
if (!(res->flags & IORESOURCE_ROM_ENABLE))
return;
+
+ reg = dev->rom_base_reg;
new |= PCI_ROM_ADDRESS_ENABLE;
- }
+ } else
+ return;
/*
* We can't update a 64-bit BAR atomically, so when possible,
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Alexey Kardashevskiy <[email protected]>
[ Upstream commit 39701e56f5f16ea0cf8fc9e8472e645f8de91d23 ]
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure a container in one
memory context and pass the container fd to another, we have to postpone
such allocations until a container fd is passed to the destination
user process, so that the locked memory limit is accounted against the
actual container user's constraints.
This postpones the it_userspace array allocation until it is first used
for mapping. The unmapping path already checks whether the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Reviewed-by: David Gibson <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/vfio/vfio_iommu_spapr_tce.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -511,6 +511,12 @@ static long tce_iommu_build_v2(struct tc
unsigned long hpa;
enum dma_data_direction dirtmp;
+ if (!tbl->it_userspace) {
+ ret = tce_iommu_userspace_view_alloc(tbl);
+ if (ret)
+ return ret;
+ }
+
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
@@ -584,15 +590,6 @@ static long tce_iommu_create_table(struc
WARN_ON(!ret && !(*ptbl)->it_ops->free);
WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
- if (!ret && container->v2) {
- ret = tce_iommu_userspace_view_alloc(*ptbl);
- if (ret)
- (*ptbl)->it_ops->free(*ptbl);
- }
-
- if (ret)
- decrement_locked_vm(table_size >> PAGE_SHIFT);
-
return ret;
}
@@ -1064,10 +1061,7 @@ static int tce_iommu_take_ownership(stru
if (!tbl || !tbl->it_map)
continue;
- rc = tce_iommu_userspace_view_alloc(tbl);
- if (!rc)
- rc = iommu_take_ownership(tbl);
-
+ rc = iommu_take_ownership(tbl);
if (rc) {
for (j = 0; j < i; ++j)
iommu_release_ownership(
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 7a6d312b50e63f598f5b5914c4fd21878ac2b595 ]
Remove the assumption that IORESOURCE_ROM_ENABLE == PCI_ROM_ADDRESS_ENABLE.
PCI_ROM_ADDRESS_ENABLE is the ROM enable bit defined by the PCI spec, so if
we're reading or writing a BAR register value, that's what we should use.
IORESOURCE_ROM_ENABLE is a corresponding bit in struct resource flags.
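[ note, hedged against the 4.4 headers: IORESOURCE_ROM_ENABLE and
PCI_ROM_ADDRESS_ENABLE both happen to be bit 0, which is the only reason
the old "res->flags |= (l & IORESOURCE_ROM_ENABLE)" worked at all; this
patch stops relying on that coincidence. ]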
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/probe.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -226,7 +226,8 @@ int __pci_read_base(struct pci_dev *dev,
mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
}
} else {
- res->flags |= (l & IORESOURCE_ROM_ENABLE);
+ if (l & PCI_ROM_ADDRESS_ENABLE)
+ res->flags |= IORESOURCE_ROM_ENABLE;
l64 = l & PCI_ROM_ADDRESS_MASK;
sz64 = sz & PCI_ROM_ADDRESS_MASK;
mask64 = (u32)PCI_ROM_ADDRESS_MASK;
On Tue 28-03-17 14:30:45, Greg KH wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.
I haven't seen the original patch but the changelog makes me worried.
How exactly is this a problem? Where do we lock up? Does rbd/libceph take
any xfs locks?
> ------------------
>
> From: Ilya Dryomov <[email protected]>
>
> commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.
>
> sock_alloc_inode() allocates socket+inode and socket_wq with
> GFP_KERNEL, which is not allowed on the writeback path:
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [<ffffffff81086335>] flush_work+0x165/0x250
> [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
> [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
> [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
> [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
> [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
> [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
> [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
> [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
> [<ffffffff8115af70>] shrink_slab+0x100/0x140
> [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
> [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
> [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
> [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
> [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
> [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
> [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
> [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
> [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
> [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
> [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
> [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
> [<ffffffff811d8566>] alloc_inode+0x26/0xa0
> [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
> [<ffffffff815b933e>] sock_alloc+0x1e/0x80
> [<ffffffff815ba855>] __sock_create+0x95/0x220
> [<ffffffff815baa04>] sock_create_kern+0x24/0x30
> [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
> [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
>
> Link: http://tracker.ceph.com/issues/19309
> Reported-by: Sergey Jerusalimov <[email protected]>
> Signed-off-by: Ilya Dryomov <[email protected]>
> Reviewed-by: Jeff Layton <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> net/ceph/messenger.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -7,6 +7,7 @@
> #include <linux/kthread.h>
> #include <linux/net.h>
> #include <linux/nsproxy.h>
> +#include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/socket.h>
> #include <linux/string.h>
> @@ -478,11 +479,16 @@ static int ceph_tcp_connect(struct ceph_
> {
> struct sockaddr_storage *paddr = &con->peer_addr.in_addr;
> struct socket *sock;
> + unsigned int noio_flag;
> int ret;
>
> BUG_ON(con->sock);
> +
> + /* sock_create_kern() allocates with GFP_KERNEL */
> + noio_flag = memalloc_noio_save();
> ret = sock_create_kern(read_pnet(&con->msgr->net), paddr->ss_family,
> SOCK_STREAM, IPPROTO_TCP, &sock);
> + memalloc_noio_restore(noio_flag);
> if (ret)
> return ret;
> sock->sk->sk_allocation = GFP_NOFS;
>
--
Michal Hocko
SUSE Labs
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Gabriel Krisman Bertazi <[email protected]>
[ Upstream commit f209fa03fc9d131b3108c2e4936181eabab87416 ]
During a PCI error recovery, like the ones provoked by EEH on the ppc64
platform, all IO to the device must be blocked while the recovery is
completed. The current 8250_pci implementation only suspends the port
instead of detaching it, which doesn't prevent incoming accesses like
TIOCMGET and TIOCMSET calls from reaching the device. Those end up
racing with the EEH recovery, crashing it. Similar races were also
observed when opening the device and when shutting it down during
recovery.
This patch implements a more robust IO blockage for the 8250_pci
recovery by unregistering the port at the beginning of the procedure and
re-adding it afterwards. Since the port is detached from the uart
layer, we can be sure that no request will make it through to the device
during recovery. This is similar to the solution used by the JSM serial
driver.
I thank Peter Hurley <[email protected]> for valuable input on
this one over one year ago.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/tty/serial/8250/8250_pci.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -57,6 +57,7 @@ struct serial_private {
unsigned int nr;
void __iomem *remapped_bar[PCI_NUM_BAR_RESOURCES];
struct pci_serial_quirk *quirk;
+ const struct pciserial_board *board;
int line[0];
};
@@ -4058,6 +4059,7 @@ pciserial_init_ports(struct pci_dev *dev
}
}
priv->nr = i;
+ priv->board = board;
return priv;
err_deinit:
@@ -4068,7 +4070,7 @@ err_out:
}
EXPORT_SYMBOL_GPL(pciserial_init_ports);
-void pciserial_remove_ports(struct serial_private *priv)
+void pciserial_detach_ports(struct serial_private *priv)
{
struct pci_serial_quirk *quirk;
int i;
@@ -4088,7 +4090,11 @@ void pciserial_remove_ports(struct seria
quirk = find_quirk(priv->dev);
if (quirk->exit)
quirk->exit(priv->dev);
+}
+void pciserial_remove_ports(struct serial_private *priv)
+{
+ pciserial_detach_ports(priv);
kfree(priv);
}
EXPORT_SYMBOL_GPL(pciserial_remove_ports);
@@ -5819,7 +5825,7 @@ static pci_ers_result_t serial8250_io_er
return PCI_ERS_RESULT_DISCONNECT;
if (priv)
- pciserial_suspend_ports(priv);
+ pciserial_detach_ports(priv);
pci_disable_device(dev);
@@ -5844,9 +5850,18 @@ static pci_ers_result_t serial8250_io_sl
static void serial8250_io_resume(struct pci_dev *dev)
{
struct serial_private *priv = pci_get_drvdata(dev);
+ const struct pciserial_board *board;
- if (priv)
- pciserial_resume_ports(priv);
+ if (!priv)
+ return;
+
+ board = priv->board;
+ kfree(priv);
+ priv = pciserial_init_ports(dev, board);
+
+ if (!IS_ERR(priv)) {
+ pci_set_drvdata(dev, priv);
+ }
}
static const struct pci_error_handlers serial8250_err_handler = {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit f259ca3eed6e4b79ac3d5c5c9fb259fb46e86217 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer or accessing memory beyond the endpoint array should a
malicious device lack the expected endpoints.
Note that the endpoint access that causes the NULL-deref is currently
only used for debugging purposes during probe, so the oops only happens
when dynamic debugging is enabled. This means the driver could be
rewritten to continue to accept devices with only two endpoints, should
such devices exist.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/misc/uss720.c | 5 +++++
1 file changed, 5 insertions(+)
--- a/drivers/usb/misc/uss720.c
+++ b/drivers/usb/misc/uss720.c
@@ -711,6 +711,11 @@ static int uss720_probe(struct usb_inter
interface = intf->cur_altsetting;
+ if (interface->desc.bNumEndpoints < 3) {
+ usb_put_dev(usbdev);
+ return -ENODEV;
+ }
+
/*
* Allocate parport interface
*/
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 6ffa2489c51da77564a0881a73765ea2169f955d ]
Previously pci_update_resource() used the same code path for updating
standard BARs and VF BARs in SR-IOV capabilities.
Split the VF BAR update into a new pci_iov_update_resource() internal
interface, which makes it simpler to compute the BAR address (we can get
rid of pci_resource_bar() and pci_iov_resource_bar()).
This patch:
- Renames pci_update_resource() to pci_std_update_resource(),
- Adds pci_iov_update_resource(),
- Makes pci_update_resource() a wrapper that calls the appropriate one,
No functional change intended.
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/iov.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.h | 1
drivers/pci/setup-res.c | 13 ++++++++++--
3 files changed, 62 insertions(+), 2 deletions(-)
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -572,6 +572,56 @@ int pci_iov_resource_bar(struct pci_dev
4 * (resno - PCI_IOV_RESOURCES);
}
+/**
+ * pci_iov_update_resource - update a VF BAR
+ * @dev: the PCI device
+ * @resno: the resource number
+ *
+ * Update a VF BAR in the SR-IOV capability of a PF.
+ */
+void pci_iov_update_resource(struct pci_dev *dev, int resno)
+{
+ struct pci_sriov *iov = dev->is_physfn ? dev->sriov : NULL;
+ struct resource *res = dev->resource + resno;
+ int vf_bar = resno - PCI_IOV_RESOURCES;
+ struct pci_bus_region region;
+ u32 new;
+ int reg;
+
+ /*
+ * The generic pci_restore_bars() path calls this for all devices,
+ * including VFs and non-SR-IOV devices. If this is not a PF, we
+ * have nothing to do.
+ */
+ if (!iov)
+ return;
+
+ /*
+ * Ignore unimplemented BARs, unused resource slots for 64-bit
+ * BARs, and non-movable resources, e.g., those described via
+ * Enhanced Allocation.
+ */
+ if (!res->flags)
+ return;
+
+ if (res->flags & IORESOURCE_UNSET)
+ return;
+
+ if (res->flags & IORESOURCE_PCI_FIXED)
+ return;
+
+ pcibios_resource_to_bus(dev->bus, &region, res);
+ new = region.start;
+ new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
+
+ reg = iov->pos + PCI_SRIOV_BAR + 4 * vf_bar;
+ pci_write_config_dword(dev, reg, new);
+ if (res->flags & IORESOURCE_MEM_64) {
+ new = region.start >> 16 >> 16;
+ pci_write_config_dword(dev, reg + 4, new);
+ }
+}
+
resource_size_t __weak pcibios_iov_resource_alignment(struct pci_dev *dev,
int resno)
{
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -277,6 +277,7 @@ static inline void pci_restore_ats_state
int pci_iov_init(struct pci_dev *dev);
void pci_iov_release(struct pci_dev *dev);
int pci_iov_resource_bar(struct pci_dev *dev, int resno);
+void pci_iov_update_resource(struct pci_dev *dev, int resno);
resource_size_t pci_sriov_resource_alignment(struct pci_dev *dev, int resno);
void pci_restore_iov_state(struct pci_dev *dev);
int pci_iov_bus_range(struct pci_bus *bus);
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -25,8 +25,7 @@
#include <linux/slab.h>
#include "pci.h"
-
-void pci_update_resource(struct pci_dev *dev, int resno)
+static void pci_std_update_resource(struct pci_dev *dev, int resno)
{
struct pci_bus_region region;
bool disable;
@@ -110,6 +109,16 @@ void pci_update_resource(struct pci_dev
pci_write_config_word(dev, PCI_COMMAND, cmd);
}
+void pci_update_resource(struct pci_dev *dev, int resno)
+{
+ if (resno <= PCI_ROM_RESOURCE)
+ pci_std_update_resource(dev, resno);
+#ifdef CONFIG_PCI_IOV
+ else if (resno >= PCI_IOV_RESOURCES && resno <= PCI_IOV_RESOURCE_END)
+ pci_iov_update_resource(dev, resno);
+#endif
+}
+
int pci_claim_resource(struct pci_dev *dev, int resource)
{
struct resource *res = &dev->resource[resource];
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 63880b230a4af502c56dde3d4588634c70c66006 ]
VF BARs are read-only zero, so updating VF BARs will not have any effect.
See the SR-IOV spec r1.1, sec 3.4.1.11.
We already ignore these updates because of 70675e0b6a1a ("PCI: Don't try to
restore VF BARs"); this merely restructures it slightly to make it easier
to split updates for standard and SR-IOV BARs.
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Gavin Shan <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/pci.c | 4 ----
drivers/pci/setup-res.c | 5 ++---
2 files changed, 2 insertions(+), 7 deletions(-)
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -519,10 +519,6 @@ static void pci_restore_bars(struct pci_
{
int i;
- /* Per SR-IOV spec 3.4.1.11, VF BARs are RO zero */
- if (dev->is_virtfn)
- return;
-
for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
pci_update_resource(dev, i);
}
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -34,10 +34,9 @@ static void pci_std_update_resource(stru
int reg;
struct resource *res = dev->resource + resno;
- if (dev->is_virtfn) {
- dev_warn(&dev->dev, "can't update VF BAR%d\n", resno);
+ /* Per SR-IOV spec 3.4.1.11, VF BARs are RO zero */
+ if (dev->is_virtfn)
return;
- }
/*
* Ignore resources for unimplemented BARs and unused resource slots
On Tue, Mar 28, 2017 at 2:43 PM, Michal Hocko <[email protected]> wrote:
> On Tue 28-03-17 14:30:45, Greg KH wrote:
>> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> I haven't seen the original patch but the changelog makes me worried.
> How exactly is this a problem? Where do we lock up? Does rbd/libceph take
> any xfs locks?
No, it doesn't. This is just another instance of "using GFP_KERNEL on
the writeback path may lead to a deadlock" with nothing extra to it.
XFS is writing out data, libceph messenger worker tries to open
a socket and recurses back into XFS because the sockfs inode is
allocated with GFP_KERNEL. The message with some of the data never
goes out and eventually we get a deadlock.
I've only included the offending stack trace. I guess I should have
stressed that the ceph-msgr workqueue is used for reclaim.
Thanks,
Ilya
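[ note: the fix pattern, as a minimal sketch for any allocation reachable
from reclaim; on 4.4-era kernels memalloc_noio_save() and
memalloc_noio_restore() come from <linux/sched.h>, and the callee here is
hypothetical: ]
unsigned int noio_flag;
int ret;

/* every allocation in this window behaves as if it were GFP_NOIO */
noio_flag = memalloc_noio_save();
ret = make_gfp_kernel_allocation();	/* hypothetical GFP_KERNEL user */
memalloc_noio_restore(noio_flag);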
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 687e0687f71ec00e0132a21fef802dee88c2f1ad upstream.
USBTMC devices are required to have a bulk-in and a bulk-out endpoint,
but the driver failed to verify this, something which could lead to the
endpoint addresses being taken from uninitialised memory.
Make sure to zero all private data as part of allocation, and add the
missing endpoint sanity check.
Note that this also addresses a more recently introduced issue, where
the interrupt-in-presence flag would also be uninitialised whenever the
optional interrupt-in endpoint is not present. This in turn could lead
to an interrupt urb being allocated, initialised and submitted based on
uninitialised values.
Fixes: dbf3e7f654c0 ("Implement an ioctl to support the USMTMC-USB488 READ_STATUS_BYTE operation.")
Fixes: 5b775f672cc9 ("USB: add USB test and measurement class driver")
Signed-off-by: Johan Hovold <[email protected]>
[ johan: backport to v4.4 ]
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/class/usbtmc.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/drivers/usb/class/usbtmc.c
+++ b/drivers/usb/class/usbtmc.c
@@ -1105,7 +1105,7 @@ static int usbtmc_probe(struct usb_inter
dev_dbg(&intf->dev, "%s called\n", __func__);
- data = kmalloc(sizeof(*data), GFP_KERNEL);
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
return -ENOMEM;
@@ -1163,6 +1163,12 @@ static int usbtmc_probe(struct usb_inter
}
}
+ if (!data->bulk_out || !data->bulk_in) {
+ dev_err(&intf->dev, "bulk endpoints not found\n");
+ retcode = -ENODEV;
+ goto err_put;
+ }
+
retcode = get_capabilities(data);
if (retcode)
dev_err(&intf->dev, "can't read capabilities\n");
@@ -1186,6 +1192,7 @@ static int usbtmc_probe(struct usb_inter
error_register:
sysfs_remove_group(&intf->dev.kobj, &capability_attr_grp);
sysfs_remove_group(&intf->dev.kobj, &data_attr_grp);
+err_put:
kref_put(&data->kref, usbtmc_delete);
return retcode;
}
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Viresh Kumar <[email protected]>
commit ff010472fb75670cb5c08671e820eeea3af59c87 upstream.
On CPU online the cpufreq core restores the previous governor (or
the previous "policy" setting for ->setpolicy drivers), but it does
not restore the min/max limits at the same time, which is confusing,
inconsistent, and a real pain for users who set the limits and then
suspend/resume the system (using full suspend), in which case the
limits are reset on all CPUs except for the boot one.
Fix this by making cpufreq_online() restore the limits when an inactive
policy is brought online.
The commit log and patch are inspired by Rafael's earlier work.
Reported-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/cpufreq/cpufreq.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1186,6 +1186,9 @@ static int cpufreq_online(unsigned int c
for_each_cpu(j, policy->related_cpus)
per_cpu(cpufreq_cpu_data, j) = policy;
write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+ } else {
+ policy->min = policy->user_policy.min;
+ policy->max = policy->user_policy.max;
}
if (cpufreq_driver->get && !cpufreq_driver->setpolicy) {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Harald Freudenberger <[email protected]>
[ Upstream commit b3e8652bcbfa04807e44708d4d0c8cdad39c9215 ]
Signed-off-by: Harald Freudenberger <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/s390/crypto/ap_bus.c | 3 +++
drivers/s390/crypto/ap_bus.h | 1 +
2 files changed, 4 insertions(+)
--- a/drivers/s390/crypto/ap_bus.c
+++ b/drivers/s390/crypto/ap_bus.c
@@ -1651,6 +1651,9 @@ static void ap_scan_bus(struct work_stru
ap_dev->queue_depth = queue_depth;
ap_dev->raw_hwtype = device_type;
ap_dev->device_type = device_type;
+ /* CEX6 toleration: map to CEX5 */
+ if (device_type == AP_DEVICE_TYPE_CEX6)
+ ap_dev->device_type = AP_DEVICE_TYPE_CEX5;
ap_dev->functions = device_functions;
spin_lock_init(&ap_dev->lock);
INIT_LIST_HEAD(&ap_dev->pendingq);
--- a/drivers/s390/crypto/ap_bus.h
+++ b/drivers/s390/crypto/ap_bus.h
@@ -105,6 +105,7 @@ static inline int ap_test_bit(unsigned i
#define AP_DEVICE_TYPE_CEX3C 9
#define AP_DEVICE_TYPE_CEX4 10
#define AP_DEVICE_TYPE_CEX5 11
+#define AP_DEVICE_TYPE_CEX6 12
/*
* Known function facilities
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 1dc56c52d2484be09c7398a5207d6b11a4256be9 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should the probed device lack endpoints.
Note that this driver does not bind to any devices by default.
Fixes: ce21bfe603b3 ("USB: Add LVS Test device driver")
Cc: Pratyush Anand <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/misc/lvstest.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/usb/misc/lvstest.c
+++ b/drivers/usb/misc/lvstest.c
@@ -370,6 +370,10 @@ static int lvs_rh_probe(struct usb_inter
hdev = interface_to_usbdev(intf);
desc = intf->cur_altsetting;
+
+ if (desc->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
endpoint = &desc->endpoint[0].desc;
/* valid only for SS root hub */
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Chris J Arges <[email protected]>
[ Upstream commit 4e684f59d760a2c7c716bb60190783546e2d08a1 ]
Sometimes firmware may not properly initialize I347AT4_PAGE_SELECT, causing
the probe of an igb i210 NIC to fail. This patch adds an additional zeroing
of this register during igb_get_phy_id to work around this issue.
Thanks to Jochen Henneberg for the idea and original patch.
Signed-off-by: Chris J Arges <[email protected]>
Tested-by: Aaron Brown <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/intel/igb/e1000_phy.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/net/ethernet/intel/igb/e1000_phy.c
+++ b/drivers/net/ethernet/intel/igb/e1000_phy.c
@@ -77,6 +77,10 @@ s32 igb_get_phy_id(struct e1000_hw *hw)
s32 ret_val = 0;
u16 phy_id;
+ /* ensure PHY page selection to fix misconfigured i210 */
+ if (hw->mac.type == e1000_i210)
+ phy->ops.write_reg(hw, I347AT4_PAGE_SELECT, 0);
+
ret_val = phy->ops.read_reg(hw, PHY_ID1, &phy_id);
if (ret_val)
goto out;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Todd Fujinaka <[email protected]>
[ Upstream commit 5bc8c230e2a993b49244f9457499f17283da9ec7 ]
i210 and i211 share the same PHY but have different PCI IDs. Don't
forget i211 for any i210 workarounds.
Signed-off-by: Todd Fujinaka <[email protected]>
Tested-by: Aaron Brown <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/intel/igb/e1000_phy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/intel/igb/e1000_phy.c
+++ b/drivers/net/ethernet/intel/igb/e1000_phy.c
@@ -78,7 +78,7 @@ s32 igb_get_phy_id(struct e1000_hw *hw)
u16 phy_id;
/* ensure PHY page selection to fix misconfigured i210 */
- if (hw->mac.type == e1000_i210)
+ if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211))
phy->ops.write_reg(hw, I347AT4_PAGE_SELECT, 0);
ret_val = phy->ops.read_reg(hw, PHY_ID1, &phy_id);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Mauricio Faria de Oliveira <[email protected]>
[ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ]
The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ]
The problem can be reproduced with the sg_write_same command
# sg_write_same --num 1 --xferlen 512 /dev/sda
#
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
Write same: pass through os error: Operation not permitted
#
For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
#
So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:
# capsh --drop=cap_sys_rawio -- -c \
'sg_write_same --num 1 --xferlen 512 /dev/sda'
#
That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
which employ the SG_IO ioctl() and run as an unprivileged user (libvirt-qemu).
In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
[...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
[...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
[...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
[...] blk_update_request: I/O error, dev sda, sector 17096824
Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
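[ note: for illustration, a hedged userspace sketch of the request this
filter change permits -- a one-block WRITE SAME(16) at LBA 0 via SG_IO.
The opcode and sg_io_hdr usage are standard <scsi/sg.h> fare, error
handling is omitted, and none of this comes from the patch itself: ]
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int write_same16(const char *dev)
{
	unsigned char cdb[16] = { 0x93 };	/* WRITE SAME(16), LBA 0 */
	unsigned char buf[512] = { 0 };		/* one block of zeros */
	unsigned char sense[32];
	struct sg_io_hdr io;
	int fd = open(dev, O_RDWR);

	cdb[13] = 1;				/* number of blocks */

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.dxfer_direction = SG_DXFER_TO_DEV;
	io.cmd_len = sizeof(cdb);
	io.cmdp = cdb;
	io.dxferp = buf;
	io.dxfer_len = sizeof(buf);
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 10000;			/* milliseconds */

	/* without this patch, fails with EPERM for callers lacking
	 * CAP_SYS_RAWIO */
	return ioctl(fd, SG_IO, &io);
}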
Signed-off-by: Mauricio Faria de Oliveira <[email protected]>
Signed-off-by: Brahadambal Srinivasan <[email protected]>
Reported-by: Manjunatha H R <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
block/scsi_ioctl.c | 3 +++
1 file changed, 3 insertions(+)
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -182,6 +182,9 @@ static void blk_set_cmd_filter_defaults(
__set_bit(WRITE_16, filter->write_ok);
__set_bit(WRITE_LONG, filter->write_ok);
__set_bit(WRITE_LONG_2, filter->write_ok);
+ __set_bit(WRITE_SAME, filter->write_ok);
+ __set_bit(WRITE_SAME_16, filter->write_ok);
+ __set_bit(WRITE_SAME_32, filter->write_ok);
__set_bit(ERASE, filter->write_ok);
__set_bit(GPCMD_MODE_SELECT_10, filter->write_ok);
__set_bit(MODE_SELECT, filter->write_ok);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Henrik Ingo <[email protected]>
[ Upstream commit e950267ab802c8558f1100eafd4087fd039ad634 ]
Some devices have invalid baSourceID references, causing uvc_scan_chain()
to fail, but if we just take the entities we can find and put them
together in the most sensible chain we can think of, it turns out they do
work anyway. Note: This heuristic assumes there is a single chain.
At the time of writing, devices known to have such a broken chain are
- Acer Integrated Camera (5986:055a)
- Realtek rtl157a7 (0bda:57a7)
Signed-off-by: Henrik Ingo <[email protected]>
Signed-off-by: Laurent Pinchart <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/media/usb/uvc/uvc_driver.c | 118 +++++++++++++++++++++++++++++++++++--
1 file changed, 112 insertions(+), 6 deletions(-)
--- a/drivers/media/usb/uvc/uvc_driver.c
+++ b/drivers/media/usb/uvc/uvc_driver.c
@@ -1595,6 +1595,114 @@ static const char *uvc_print_chain(struc
return buffer;
}
+static struct uvc_video_chain *uvc_alloc_chain(struct uvc_device *dev)
+{
+ struct uvc_video_chain *chain;
+
+ chain = kzalloc(sizeof(*chain), GFP_KERNEL);
+ if (chain == NULL)
+ return NULL;
+
+ INIT_LIST_HEAD(&chain->entities);
+ mutex_init(&chain->ctrl_mutex);
+ chain->dev = dev;
+ v4l2_prio_init(&chain->prio);
+
+ return chain;
+}
+
+/*
+ * Fallback heuristic for devices that don't connect units and terminals in a
+ * valid chain.
+ *
+ * Some devices have invalid baSourceID references, causing uvc_scan_chain()
+ * to fail, but if we just take the entities we can find and put them together
+ * in the most sensible chain we can think of, turns out they do work anyway.
+ * Note: This heuristic assumes there is a single chain.
+ *
+ * At the time of writing, devices known to have such a broken chain are
+ * - Acer Integrated Camera (5986:055a)
+ * - Realtek rtl157a7 (0bda:57a7)
+ */
+static int uvc_scan_fallback(struct uvc_device *dev)
+{
+ struct uvc_video_chain *chain;
+ struct uvc_entity *iterm = NULL;
+ struct uvc_entity *oterm = NULL;
+ struct uvc_entity *entity;
+ struct uvc_entity *prev;
+
+ /*
+ * Start by locating the input and output terminals. We only support
+ * devices with exactly one of each for now.
+ */
+ list_for_each_entry(entity, &dev->entities, list) {
+ if (UVC_ENTITY_IS_ITERM(entity)) {
+ if (iterm)
+ return -EINVAL;
+ iterm = entity;
+ }
+
+ if (UVC_ENTITY_IS_OTERM(entity)) {
+ if (oterm)
+ return -EINVAL;
+ oterm = entity;
+ }
+ }
+
+ if (iterm == NULL || oterm == NULL)
+ return -EINVAL;
+
+ /* Allocate the chain and fill it. */
+ chain = uvc_alloc_chain(dev);
+ if (chain == NULL)
+ return -ENOMEM;
+
+ if (uvc_scan_chain_entity(chain, oterm) < 0)
+ goto error;
+
+ prev = oterm;
+
+ /*
+ * Add all Processing and Extension Units with two pads. The order
+ * doesn't matter much, use reverse list traversal to connect units in
+ * UVC descriptor order as we build the chain from output to input. This
+ * leads to units appearing in the order meant by the manufacturer for
+ * the cameras known to require this heuristic.
+ */
+ list_for_each_entry_reverse(entity, &dev->entities, list) {
+ if (entity->type != UVC_VC_PROCESSING_UNIT &&
+ entity->type != UVC_VC_EXTENSION_UNIT)
+ continue;
+
+ if (entity->num_pads != 2)
+ continue;
+
+ if (uvc_scan_chain_entity(chain, entity) < 0)
+ goto error;
+
+ prev->baSourceID[0] = entity->id;
+ prev = entity;
+ }
+
+ if (uvc_scan_chain_entity(chain, iterm) < 0)
+ goto error;
+
+ prev->baSourceID[0] = iterm->id;
+
+ list_add_tail(&chain->list, &dev->chains);
+
+ uvc_trace(UVC_TRACE_PROBE,
+ "Found a video chain by fallback heuristic (%s).\n",
+ uvc_print_chain(chain));
+
+ return 0;
+
+error:
+ kfree(chain);
+ return -EINVAL;
+}
+
/*
* Scan the device for video chains and register video devices.
*
@@ -1617,15 +1725,10 @@ static int uvc_scan_device(struct uvc_de
if (term->chain.next || term->chain.prev)
continue;
- chain = kzalloc(sizeof(*chain), GFP_KERNEL);
+ chain = uvc_alloc_chain(dev);
if (chain == NULL)
return -ENOMEM;
- INIT_LIST_HEAD(&chain->entities);
- mutex_init(&chain->ctrl_mutex);
- chain->dev = dev;
- v4l2_prio_init(&chain->prio);
-
term->flags |= UVC_ENTITY_FLAG_DEFAULT;
if (uvc_scan_chain(chain, term) < 0) {
@@ -1639,6 +1742,9 @@ static int uvc_scan_device(struct uvc_de
list_add_tail(&chain->list, &dev->chains);
}
+ if (list_empty(&dev->chains))
+ uvc_scan_fallback(dev);
+
if (list_empty(&dev->chains)) {
uvc_printk(KERN_INFO, "No valid video chain found.\n");
return -1;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Jiri Slaby <[email protected]>
commit 6207119444595d287b1e9e83a2066c17209698f3 upstream.
With this reproducer:
struct sockaddr_alg alg = {
.salg_family = 0x26,
.salg_type = "hash",
.salg_feat = 0xf,
.salg_mask = 0x5,
.salg_name = "digest_null",
};
int sock, sock2;
sock = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(sock, (struct sockaddr *)&alg, sizeof(alg));
sock2 = accept(sock, NULL, NULL);
setsockopt(sock, SOL_ALG, ALG_SET_KEY, "\x9b\xca", 2);
accept(sock2, NULL, NULL);
==== 8< ======== 8< ======== 8< ======== 8< ====
one can immediately see a UBSAN warning:
UBSAN: Undefined behaviour in crypto/algif_hash.c:187:7
variable length array bound value 0 <= 0
CPU: 0 PID: 15949 Comm: syz-executor Tainted: G E 4.4.30-0-default #1
...
Call Trace:
...
[<ffffffff81d598fd>] ? __ubsan_handle_vla_bound_not_positive+0x13d/0x188
[<ffffffff81d597c0>] ? __ubsan_handle_out_of_bounds+0x1bc/0x1bc
[<ffffffffa0e2204d>] ? hash_accept+0x5bd/0x7d0 [algif_hash]
[<ffffffffa0e2293f>] ? hash_accept_nokey+0x3f/0x51 [algif_hash]
[<ffffffffa0e206b0>] ? hash_accept_parent_nokey+0x4a0/0x4a0 [algif_hash]
[<ffffffff8235c42b>] ? SyS_accept+0x2b/0x40
It is a correct warning: the hash state size is propagated to accept()
as zero, but creating a zero-length variable-length array is not allowed
in C. Fix this as proposed by Herbert -- do "?: 1" on that site. No
sizeof or similar happens in the code there, so we just allocate one
byte even though we do not use the array.
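[ note: a minimal userspace illustration of the "?: 1" guard (GNU C's
binary "?:" extension); this demo is hypothetical and not kernel code: ]
#include <stdio.h>

static void demo(unsigned int statesize)
{
	char state[statesize ? : 1];	/* the VLA bound can never be zero */

	printf("requested %u, got %zu\n", statesize, sizeof(state));
}

int main(void)
{
	demo(0);	/* requested 0, got 1 */
	demo(64);	/* requested 64, got 64 */
	return 0;
}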
Signed-off-by: Jiri Slaby <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: "David S. Miller" <[email protected]> (maintainer:CRYPTO API)
Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
crypto/algif_hash.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -184,7 +184,7 @@ static int hash_accept(struct socket *so
struct alg_sock *ask = alg_sk(sk);
struct hash_ctx *ctx = ask->private;
struct ahash_request *req = &ctx->req;
- char state[crypto_ahash_statesize(crypto_ahash_reqtfm(req))];
+ char state[crypto_ahash_statesize(crypto_ahash_reqtfm(req)) ? : 1];
struct sock *sk2;
struct alg_sock *ask2;
struct hash_ctx *ctx2;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Sandeen <[email protected]>
commit 4dfce57db6354603641132fac3c887614e3ebe81 upstream.
There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439 TASK: ffff880550584fa0 CPU: 6 COMMAND: "xfs_fsr"
[exception RIP: xfs_trans_log_inode+0x10]
#9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate. When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.
This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.
Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.
Thanks to dchinner for looking at this with me and spotting the
root cause.
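A worked example may help (treat the numbers as a sketch: xfs_bmbt_rec_t is
16 bytes and XFS_INLINE_EXTS is 2 in this tree):

/* An inode with di_nextents == 2 plus one speculative preallocation
 * record has if_bytes == 48:
 *
 *   nextents = 48 / 16 = 3, and 3 > XFS_INLINE_EXTS (2),
 *
 * so if_u1.if_extents must NOT be pointed back at the inline array,
 * while the old di_nextents check (2 <= 2) wrongly did exactly that. */
nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);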
[nborisov: backported to 4.4]
Cc: [email protected]
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/xfs/xfs_bmap_util.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1713,6 +1713,7 @@ xfs_swap_extents(
xfs_trans_t *tp;
xfs_bstat_t *sbp = &sxp->sx_stat;
xfs_ifork_t *tempifp, *ifp, *tifp;
+ xfs_extnum_t nextents;
int src_log_flags, target_log_flags;
int error = 0;
int aforkblks = 0;
@@ -1899,7 +1900,8 @@ xfs_swap_extents(
* pointer. Otherwise it's already NULL or
* pointing to the extent.
*/
- if (ip->i_d.di_nextents <= XFS_INLINE_EXTS) {
+ nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
+ if (nextents <= XFS_INLINE_EXTS) {
ifp->if_u1.if_extents =
ifp->if_u2.if_inline_ext;
}
@@ -1918,7 +1920,8 @@ xfs_swap_extents(
* pointer. Otherwise it's already NULL or
* pointing to the extent.
*/
- if (tip->i_d.di_nextents <= XFS_INLINE_EXTS) {
+ nextents = tip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t);
+ if (nextents <= XFS_INLINE_EXTS) {
tifp->if_u1.if_extents =
tifp->if_u2.if_inline_ext;
}
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Bjorn Helgaas <[email protected]>
[ Upstream commit 45d004f4afefdd8d79916ee6d97a9ecd94bb1ffe ]
The BAR property bits (0-3 for memory BARs, 0-1 for I/O BARs) are supposed
to be read-only, but we do save them in res->flags and include them when
updating the BAR.
Mask the I/O property bits with ~PCI_BASE_ADDRESS_IO_MASK (0x3) instead of
PCI_REGION_FLAG_MASK (0xf) to make it obvious that we can't corrupt bits
2-3 of I/O addresses.
Use PCI_ROM_ADDRESS_MASK for ROM BARs. This means we'll only check the top
21 bits (instead of the 28 bits we used to check) of a ROM BAR to see if
the update was successful.
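As a standalone sketch of the mask arithmetic (constants mirror
include/uapi/linux/pci_regs.h; the addresses are made up):

#include <stdio.h>

#define PCI_BASE_ADDRESS_IO_MASK	(~0x03U)  /* bits 1:0 read-only */
#define PCI_BASE_ADDRESS_MEM_MASK	(~0x0fU)  /* bits 3:0 read-only */
#define PCI_ROM_ADDRESS_MASK		(~0x7ffU) /* only bits 31:11 writable */

int main(void)
{
	unsigned int wrote = 0xfe000000, readback = 0xfe000001;

	/* The readback comparison now ignores the low property/enable
	 * bits, so a set low bit no longer looks like a failed update. */
	printf("ROM update %s\n",
	       (wrote & PCI_ROM_ADDRESS_MASK) ==
	       (readback & PCI_ROM_ADDRESS_MASK) ? "verified" : "failed");
	return 0;
}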
Signed-off-by: Bjorn Helgaas <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/setup-res.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -58,12 +58,17 @@ static void pci_std_update_resource(stru
return;
pcibios_resource_to_bus(dev->bus, &region, res);
+ new = region.start;
- new = region.start | (res->flags & PCI_REGION_FLAG_MASK);
- if (res->flags & IORESOURCE_IO)
+ if (res->flags & IORESOURCE_IO) {
mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
- else
+ new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
+ } else if (resno == PCI_ROM_RESOURCE) {
+ mask = (u32)PCI_ROM_ADDRESS_MASK;
+ } else {
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
+ new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
+ }
if (resno < PCI_ROM_RESOURCE) {
reg = PCI_BASE_ADDRESS_0 + 4 * resno;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Samuel Thibault <[email protected]>
commit 3243367b209faed5c320a4e5f9a565ee2a2ba958 upstream.
Some USB 2.0 devices erroneously report millisecond values in
bInterval. The generic config code manages to catch most of them,
but in some cases it's not completely enough.
The case at stake here is a USB 2.0 braille device, which wants to
announce 10ms and thus sets bInterval to 10, but the USB 2.0
computation turns that into 64ms. It happens that one can type fast
enough to reach this interval and overflow the device buffers,
leading to problematic latencies. The generic config code does not
catch this case because the 64ms is considered a sane enough value.
This change thus adds a USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL quirk
to mark devices which actually report milliseconds in bInterval,
and marks Vario Ultra devices as needing it.
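A back-of-the-envelope sketch of the two computations (self-contained;
fls() is open-coded, and the clamp to the endpoint's valid range that
the real hunk applies is omitted):

#include <stdio.h>

static int fls_(unsigned int x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

int main(void)
{
	int binterval = 10;	/* device means "10 ms" */

	/* USB 2.0 encoding: period = 2^(bInterval-1) microframes */
	int spec_uf = 1 << (binterval - 1);		  /* 512 -> 64 ms */
	/* quirk: treat the value as ms, round to a power of two */
	int quirk_uf = 1 << ((fls_(binterval) + 3) - 1);  /* 64 -> 8 ms */

	printf("spec: %d us, quirk: %d us\n", spec_uf * 125, quirk_uf * 125);
	return 0;
}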
Signed-off-by: Samuel Thibault <[email protected]>
Acked-by: Alan Stern <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/core/config.c | 10 ++++++++++
drivers/usb/core/quirks.c | 8 ++++++++
include/linux/usb/quirks.h | 6 ++++++
3 files changed, 24 insertions(+)
--- a/drivers/usb/core/config.c
+++ b/drivers/usb/core/config.c
@@ -246,6 +246,16 @@ static int usb_parse_endpoint(struct dev
/*
* Adjust bInterval for quirked devices.
+ */
+ /*
+ * This quirk fixes bIntervals reported in ms.
+ */
+ if (to_usb_device(ddev)->quirks &
+ USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL) {
+ n = clamp(fls(d->bInterval) + 3, i, j);
+ i = j = n;
+ }
+ /*
* This quirk fixes bIntervals reported in
* linear microframes.
*/
--- a/drivers/usb/core/quirks.c
+++ b/drivers/usb/core/quirks.c
@@ -170,6 +170,14 @@ static const struct usb_device_id usb_qu
/* M-Systems Flash Disk Pioneers */
{ USB_DEVICE(0x08ec, 0x1000), .driver_info = USB_QUIRK_RESET_RESUME },
+ /* Baum Vario Ultra */
+ { USB_DEVICE(0x0904, 0x6101), .driver_info =
+ USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL },
+ { USB_DEVICE(0x0904, 0x6102), .driver_info =
+ USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL },
+ { USB_DEVICE(0x0904, 0x6103), .driver_info =
+ USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL },
+
/* Keytouch QWERTY Panel keyboard */
{ USB_DEVICE(0x0926, 0x3333), .driver_info =
USB_QUIRK_CONFIG_INTF_STRINGS },
--- a/include/linux/usb/quirks.h
+++ b/include/linux/usb/quirks.h
@@ -50,4 +50,10 @@
/* device can't handle Link Power Management */
#define USB_QUIRK_NO_LPM BIT(10)
+/*
+ * Device reports its bInterval as linear frames instead of the
+ * USB 2.0 calculation.
+ */
+#define USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL BIT(11)
+
#endif /* __LINUX_USB_QUIRKS_H */
On Tue 28-03-17 15:23:58, Ilya Dryomov wrote:
> On Tue, Mar 28, 2017 at 2:43 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 28-03-17 14:30:45, Greg KH wrote:
> >> 4.4-stable review patch. If anyone has any objections, please let me know.
> >
> > I haven't seen the original patch but the changelog makes me worried.
> > How exactly this is a problem? Where do we lockup? Does rbd/libceph take
> > any xfs locks?
>
> No, it doesn't. This is just another instance of "using GFP_KERNEL on
> the writeback path may lead to a deadlock" with nothing extra to it.
>
> XFS is writing out data, libceph messenger worker tries to open
> a socket and recurses back into XFS because the sockfs inode is
> allocated with GFP_KERNEL. The message with some of the data never
> goes out and eventually we get a deadlock.
>
> I've only included the offending stack trace. I guess I should have
> stressed that ceph-msgr workqueue is used for reclaim.
Could you be more specific about the lockup scenario? I still do not get
how this would lead to a deadlock.
--
Michal Hocko
SUSE Labs
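For context, the idea behind the patch under discussion, as a minimal
sketch (assuming the memalloc_noio_save()/memalloc_noio_restore()
helpers available in this tree; variable names are schematic, this is
not the literal upstream hunk):

	struct socket *sock;
	unsigned int noio_flag;
	int ret;

	/* Mark the task so every allocation reached from here,
	 * including the sockfs inode inside sock_create_kern(), is
	 * implicitly GFP_NOIO and cannot recurse into filesystem
	 * reclaim while we are on the writeback path. */
	noio_flag = memalloc_noio_save();
	ret = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, &sock);
	memalloc_noio_restore(noio_flag);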
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Tomasz Majchrzak <[email protected]>
commit 9b622e2bbcf049c82e2550d35fb54ac205965f50 upstream.
md pending write counter must be incremented after bio is split,
otherwise it gets decremented too many times in end bio callback and
becomes negative.
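A toy model of the imbalance (plain C, not kernel code: one
md_write_start() per bio versus one per split chunk, with one
md_write_end() per chunk either way):

#include <stdio.h>

int main(void)
{
	int chunks = 2, pending;

	/* old order: counted once before the split, ended per chunk */
	pending = 1 - chunks;
	printf("before fix: %d\n", pending);	/* -1: underflow */

	/* new order: __make_request() counts once per chunk */
	pending = chunks - chunks;
	printf("after fix:  %d\n", pending);	/* 0: balanced */
	return 0;
}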
Signed-off-by: Tomasz Majchrzak <[email protected]>
Reviewed-by: Artur Paszkiewicz <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/md/raid10.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1072,6 +1072,8 @@ static void __make_request(struct mddev
int max_sectors;
int sectors;
+ md_write_start(mddev, bio);
+
/*
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
@@ -1455,8 +1457,6 @@ static void make_request(struct mddev *m
return;
}
- md_write_start(mddev, bio);
-
do {
/*
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit b0addd3fa6bcd119be9428996d5d4522479ab240 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/misc/idmouse.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/usb/misc/idmouse.c
+++ b/drivers/usb/misc/idmouse.c
@@ -346,6 +346,9 @@ static int idmouse_probe(struct usb_inte
if (iface_desc->desc.bInterfaceClass != 0x0A)
return -ENODEV;
+ if (iface_desc->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
/* allocate memory for our device state and initialize it */
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
if (dev == NULL)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sumit Semwal <[email protected]>
From: Gavin Shan <[email protected]>
[ Upstream commit f40ec3c748c6912f6266c56a7f7992de61b255ed ]
Previously we enabled VFs and enable their memory space before calling
pcibios_sriov_enable(). But pcibios_sriov_enable() may update the VF BARs:
for example, on PPC PowerNV we may change them to manage the association of
VFs to PEs.
Because 64-bit BARs cannot be updated atomically, it's unsafe to update
them while they're enabled. The half-updated state may conflict with other
devices in the system.
Call pcibios_sriov_enable() before enabling the VFs so any BAR updates
happen while the VF BARs are disabled.
[bhelgaas: changelog]
Tested-by: Carol Soto <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Sumit Semwal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/pci/iov.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -303,13 +303,6 @@ static int sriov_enable(struct pci_dev *
return rc;
}
- pci_iov_set_numvfs(dev, nr_virtfn);
- iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
- pci_cfg_access_lock(dev);
- pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
- msleep(100);
- pci_cfg_access_unlock(dev);
-
iov->initial_VFs = initial;
if (nr_virtfn < initial)
initial = nr_virtfn;
@@ -320,6 +313,13 @@ static int sriov_enable(struct pci_dev *
goto err_pcibios;
}
+ pci_iov_set_numvfs(dev, nr_virtfn);
+ iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
+ pci_cfg_access_lock(dev);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+ msleep(100);
+ pci_cfg_access_unlock(dev);
+
for (i = 0; i < initial; i++) {
rc = virtfn_add(dev, i, 0);
if (rc)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Darrick J. Wong <[email protected]>
commit ef388e2054feedaeb05399ed654bdb06f385d294 upstream.
The on-disk field di_size is used to set i_size, which is a signed
loff_t. If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems. Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.
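A self-contained illustration of the sign flip (loff_t modelled as
int64_t; the value is a made-up corrupt on-disk size):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t di_size = 1ULL << 63;		/* corrupt on-disk size */
	int64_t i_size = (int64_t)di_size;	/* i_size is a signed loff_t */

	printf("i_size = %lld\n", (long long)i_size);	/* negative */
	/* the new verifier check rejects exactly this case: */
	printf("verifier rejects: %s\n",
	       (di_size & (1ULL << 63)) ? "yes" : "no");
	return 0;
}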
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/xfs/libxfs/xfs_inode_buf.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -299,6 +299,14 @@ xfs_dinode_verify(
if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
return false;
+ /* don't allow invalid i_size */
+ if (be64_to_cpu(dip->di_size) & (1ULL << 63))
+ return false;
+
+ /* No zero-length symlinks. */
+ if (S_ISLNK(be16_to_cpu(dip->di_mode)) && dip->di_size == 0)
+ return false;
+
/* only version 3 or greater inodes are extensively verified here */
if (dip->di_version < 3)
return true;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit ba340d7b83703768ce566f53f857543359aa1b98 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Fixes: bba5394ad3bd ("Input: add support for Hanwang tablets")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/tablet/hanwang.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/input/tablet/hanwang.c
+++ b/drivers/input/tablet/hanwang.c
@@ -340,6 +340,9 @@ static int hanwang_probe(struct usb_inte
int error;
int i;
+ if (intf->cur_altsetting->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
hanwang = kzalloc(sizeof(struct hanwang), GFP_KERNEL);
input_dev = input_allocate_device();
if (!hanwang || !input_dev) {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit daf229b15907fbfdb6ee183aac8ca428cb57e361 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Note that the dereference happens in the start callback which is called
during probe.
Fixes: de520b8bd552 ("uwb: add HWA radio controller driver")
Cc: Inaky Perez-Gonzalez <[email protected]>
Cc: David Vrabel <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/uwb/hwa-rc.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/uwb/hwa-rc.c
+++ b/drivers/uwb/hwa-rc.c
@@ -825,6 +825,9 @@ static int hwarc_probe(struct usb_interf
struct hwarc *hwarc;
struct device *dev = &iface->dev;
+ if (iface->cur_altsetting->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
result = -ENOMEM;
uwb_rc = uwb_rc_alloc();
if (uwb_rc == NULL) {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Bin Liu <[email protected]>
commit 0090114d336a9604aa2d90bc83f20f7cd121b76c upstream.
The CPPI 4.1 driver polls register to workaround the premature TX
interrupt issue, but it causes audio playback underrun when triggered in
Isoch transfers.
Isoch doesn't do back-to-back transfers; the TX should be done by the
time the next transfer is scheduled. So skip this polling workaround
for Isoch transfers.
Fixes: a655f481d83d6 ("usb: musb: musb_cppi41: handle pre-mature TX complete interrupt")
Reported-by: Alexandre Bailon <[email protected]>
Acked-by: Sebastian Andrzej Siewior <[email protected]>
Tested-by: Alexandre Bailon <[email protected]>
Signed-off-by: Bin Liu <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/usb/musb/musb_cppi41.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
--- a/drivers/usb/musb/musb_cppi41.c
+++ b/drivers/usb/musb/musb_cppi41.c
@@ -250,8 +250,27 @@ static void cppi41_dma_callback(void *pr
transferred < cppi41_channel->packet_sz)
cppi41_channel->prog_len = 0;
- if (cppi41_channel->is_tx)
- empty = musb_is_tx_fifo_empty(hw_ep);
+ if (cppi41_channel->is_tx) {
+ u8 type;
+
+ if (is_host_active(musb))
+ type = hw_ep->out_qh->type;
+ else
+ type = hw_ep->ep_in.type;
+
+ if (type == USB_ENDPOINT_XFER_ISOC)
+ /*
+ * Don't use the early-TX-interrupt workaround below
+ * for Isoch transfter. Since Isoch are periodic
+ * transfer, by the time the next transfer is
+ * scheduled, the current one should be done already.
+ *
+ * This avoids audio playback underrun issue.
+ */
+ empty = true;
+ else
+ empty = musb_is_tx_fifo_empty(hw_ep);
+ }
if (!cppi41_channel->is_tx || empty) {
cppi41_trans_done(cppi41_channel);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Doug Berger <[email protected]>
[ Upstream commit 31739eae738ccbe8b9d627c3f2251017ca03f4d2 ]
Commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset")
removed the bcmgenet_mii_reset() function from bcmgenet_power_up() and
bcmgenet_internal_phy_setup() functions. In so doing it broke the reset
of the internal PHY devices used by the GENETv1-GENETv3 which required
this reset before the UniMAC was enabled. It also broke the internal
GPHY devices used by the GENETv4 because the config_init that installed
the AFE workaround was no longer occurring after the reset of the GPHY
performed by bcmgenet_phy_power_set() in bcmgenet_internal_phy_setup().
In addition the code in bcmgenet_internal_phy_setup() related to the
"enable APD" comment goes with the bcmgenet_mii_reset() so it should
have also been removed.
Commit bd4060a6108b ("net: bcmgenet: Power on integrated GPHY in
bcmgenet_power_up()") moved the bcmgenet_phy_power_set() call to the
bcmgenet_power_up() function, but failed to remove it from the
bcmgenet_internal_phy_setup() function. Had it done so, the
bcmgenet_internal_phy_setup() function would have been empty and could
have been removed at that time.
Commit 5dbebbb44a6a ("net: bcmgenet: Software reset EPHY after power on")
was submitted to correct the functional problems introduced by
commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset"). It
was included in v4.4 and made available on 4.3-stable. Unfortunately,
it didn't fully revert the commit because this bcmgenet_mii_reset()
doesn't apply the soft reset to the internal GPHY used by GENETv4 like
the previous one did. This prevents the restoration of the AFE
workarounds for internal GPHY devices after the bcmgenet_phy_power_set() in
bcmgenet_internal_phy_setup().
This commit takes the alternate approach of removing the unnecessary
bcmgenet_internal_phy_setup() function which shouldn't have been in v4.3
so that when bcmgenet_mii_reset() was restored it should have only gone
into bcmgenet_power_up(). This will avoid the problems while also
removing the redundancy (and hopefully some of the confusion).
Fixes: 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset")
Signed-off-by: Doug Berger <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/broadcom/genet/bcmmii.c | 15 ---------------
1 file changed, 15 deletions(-)
--- a/drivers/net/ethernet/broadcom/genet/bcmmii.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c
@@ -220,20 +220,6 @@ void bcmgenet_phy_power_set(struct net_d
udelay(60);
}
-static void bcmgenet_internal_phy_setup(struct net_device *dev)
-{
- struct bcmgenet_priv *priv = netdev_priv(dev);
- u32 reg;
-
- /* Power up PHY */
- bcmgenet_phy_power_set(dev, true);
- /* enable APD */
- reg = bcmgenet_ext_readl(priv, EXT_EXT_PWR_MGMT);
- reg |= EXT_PWR_DN_EN_LD;
- bcmgenet_ext_writel(priv, reg, EXT_EXT_PWR_MGMT);
- bcmgenet_mii_reset(dev);
-}
-
static void bcmgenet_moca_phy_setup(struct bcmgenet_priv *priv)
{
u32 reg;
@@ -281,7 +267,6 @@ int bcmgenet_mii_config(struct net_devic
if (priv->internal_phy) {
phy_name = "internal PHY";
- bcmgenet_internal_phy_setup(dev);
} else if (priv->phy_interface == PHY_INTERFACE_MODE_MOCA) {
phy_name = "MoCA";
bcmgenet_moca_phy_setup(priv);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Or Gerlitz <[email protected]>
[ Upstream commit 3d20f1f7bd575d147ffa75621fa560eea0aec690 ]
When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.
Fixes: 6b26ba3a7d95 ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <[email protected]>
Reported-by: Paul Blakey <[email protected]>
Acked-by: Jiri Benc <[email protected]>
Acked-by: Joe Stringer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/openvswitch/flow_netlink.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -588,7 +588,7 @@ static int ip_tun_from_nlattr(const stru
ipv4 = true;
break;
case OVS_TUNNEL_KEY_ATTR_IPV6_SRC:
- SW_FLOW_KEY_PUT(match, tun_key.u.ipv6.dst,
+ SW_FLOW_KEY_PUT(match, tun_key.u.ipv6.src,
nla_get_in6_addr(a), is_mask);
ipv6 = true;
break;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Kai-Heng Feng <[email protected]>
commit 45838660e34d90db8d4f7cbc8fd66e8aff79f4fe upstream.
The aux port does not get detected without noloop quirk, so external PS/2
mouse cannot work as result.
The PS/2 mouse can work with this quirk.
BugLink: https://bugs.launchpad.net/bugs/1591053
Signed-off-by: Kai-Heng Feng <[email protected]>
Reviewed-by: Marcos Paulo de Souza <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/serio/i8042-x86ia64io.h | 7 +++++++
1 file changed, 7 insertions(+)
--- a/drivers/input/serio/i8042-x86ia64io.h
+++ b/drivers/input/serio/i8042-x86ia64io.h
@@ -120,6 +120,13 @@ static const struct dmi_system_id __init
},
},
{
+ /* Dell Embedded Box PC 3000 */
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Embedded Box PC 3000"),
+ },
+ },
+ {
/* OQO Model 01 */
.matches = {
DMI_MATCH(DMI_SYS_VENDOR, "OQO"),
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Michael Engl <[email protected]>
commit e83bb3e6f3efa21f4a9d883a25d0ecd9dfb431e1 upstream.
The tiadc_irq_h(int irq, void *private) function handles FIFO
overruns by clearing the flags and disabling and re-enabling the
ADC to recover.
If the ADC is running in continuous mode, a FIFO overrun happens
regularly. If the disabling of the ADC happens concurrently with
a new conversion, the enabling of the ADC might be ignored by the
hardware. This stops the ADC permanently; no more interrupts are
triggered.
According to the AM335x Reference Manual (SPRUH73H, October 2011 -
Revised April 2013, Chapters 12.4 and 12.5) it is necessary to
check the ADC FSM bits in REG_ADCFSM before enabling the ADC
again, because the disabling of the ADC only takes effect after
the current conversion has finished.
To trigger this bug it is necessary to run the ADC in continuous
mode. The ADC values of all channels need to be read in an endless
loop. The bug appears within the first 6 hours (~5.4 million
handled FIFO overruns). The user space application will hang on
reading new values from the character device.
Fixes: ca9a563805f7a ("iio: ti_am335x_adc: Add continuous sampling support")
Signed-off-by: Michael Engl <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/iio/adc/ti_am335x_adc.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/drivers/iio/adc/ti_am335x_adc.c
+++ b/drivers/iio/adc/ti_am335x_adc.c
@@ -151,7 +151,9 @@ static irqreturn_t tiadc_irq_h(int irq,
{
struct iio_dev *indio_dev = private;
struct tiadc_device *adc_dev = iio_priv(indio_dev);
- unsigned int status, config;
+ unsigned int status, config, adc_fsm;
+ unsigned short count = 0;
+
status = tiadc_readl(adc_dev, REG_IRQSTATUS);
/*
@@ -165,6 +167,15 @@ static irqreturn_t tiadc_irq_h(int irq,
tiadc_writel(adc_dev, REG_CTRL, config);
tiadc_writel(adc_dev, REG_IRQSTATUS, IRQENB_FIFO1OVRRUN
| IRQENB_FIFO1UNDRFLW | IRQENB_FIFO1THRES);
+
+ /* wait for idle state.
+ * ADC needs to finish the current conversion
+ * before disabling the module
+ */
+ do {
+ adc_fsm = tiadc_readl(adc_dev, REG_ADCFSM);
+ } while (adc_fsm != 0x10 && count++ < 100);
+
tiadc_writel(adc_dev, REG_CTRL, (config | CNTRLREG_TSCSSENB));
return IRQ_HANDLED;
} else if (status & IRQENB_FIFO1THRES) {
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Biggers <[email protected]>
commit b9cf625d6ecde0d372e23ae022feead72b4228a6 upstream.
If ext4_convert_inline_data() was called on a directory with inline
data, the filesystem was left in an inconsistent state (as considered by
e2fsck) because the file size was not increased to cover the new block.
This happened because the inode was not marked dirty after i_disksize
was updated. Fix this by marking the inode dirty at the end of
ext4_finish_convert_inline_dir().
This bug was probably not noticed before because most users mark the
inode dirty afterwards for other reasons. But if userspace executed
FS_IOC_SET_ENCRYPTION_POLICY with invalid parameters, as exercised by
'kvm-xfstests -c adv generic/396', then the inode was never marked dirty
after updating i_disksize.
Fixes: 3c47d54170b6a678875566b1b8d6dcf57904e49b
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
fs/ext4/inline.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1158,10 +1158,9 @@ static int ext4_finish_convert_inline_di
set_buffer_uptodate(dir_block);
err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
if (err)
- goto out;
+ return err;
set_buffer_verified(dir_block);
-out:
- return err;
+ return ext4_mark_inode_dirty(handle, inode);
}
static int ext4_convert_inline_data_nolock(handle_t *handle,
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Gal Pressman <[email protected]>
[ Upstream commit 8ab7e2ae15d84ba758b2c8c6f4075722e9bd2a08 ]
The RX packets statistic (the 'rx_packets' counter) used to count an LRO
aggregate as one packet, even though it contains multiple segments.
This patch will increment the counter by the number of segments, and
align the driver with the behavior of other drivers in the stack.
Note that no information is lost in this patch due to 'rx_lro_packets'
counter existence.
Before, ethtool showed:
$ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets"
rx_packets: 435277
rx_lro_packets: 35847
rx_packets_phy: 1935066
Now, we will see the more logical statistics:
$ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets"
rx_packets: 1935066
rx_lro_packets: 35847
rx_packets_phy: 1935066
Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Gal Pressman <[email protected]>
Cc: [email protected]
Signed-off-by: Saeed Mahameed <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -197,6 +197,10 @@ static inline void mlx5e_build_rx_skb(st
if (lro_num_seg > 1) {
mlx5e_lro_update_hdr(skb, cqe);
skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
+ /* Subtract one since we already counted this as one
+ * "regular" packet in mlx5e_complete_rx_cqe()
+ */
+ rq->stats.packets += lro_num_seg - 1;
rq->stats.lro_packets++;
rq->stats.lro_bytes += cqe_bcnt;
}
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Dumazet <[email protected]>
[ Upstream commit 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ]
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct().
TCP sockets using sk->sk_allocation == GFP_ATOMIC do not call
sk_common_release() at close time, thus leaking one (order-3) page.
iSCSI is using such sockets.
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/core/sock.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1459,6 +1459,11 @@ void sk_destruct(struct sock *sk)
pr_debug("%s: optmem leakage (%d bytes) detected\n",
__func__, atomic_read(&sk->sk_omem_alloc));
+ if (sk->sk_frag.page) {
+ put_page(sk->sk_frag.page);
+ sk->sk_frag.page = NULL;
+ }
+
if (sk->sk_peer_cred)
put_cred(sk->sk_peer_cred);
put_pid(sk->sk_peer_pid);
@@ -2691,11 +2696,6 @@ void sk_common_release(struct sock *sk)
sk_refcnt_debug_release(sk);
- if (sk->sk_frag.page) {
- put_page(sk->sk_frag.page);
- sk->sk_frag.page = NULL;
- }
-
sock_put(sk);
}
EXPORT_SYMBOL(sk_common_release);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 59cf8bed44a79ec42303151dd014fdb6434254bb upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer or accessing memory that lies beyond the end of the endpoint
array should a malicious device lack the expected endpoints.
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/joystick/iforce/iforce-usb.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/input/joystick/iforce/iforce-usb.c
+++ b/drivers/input/joystick/iforce/iforce-usb.c
@@ -141,6 +141,9 @@ static int iforce_usb_probe(struct usb_i
interface = intf->cur_altsetting;
+ if (interface->desc.bNumEndpoints < 2)
+ return -ENODEV;
+
epirq = &interface->endpoint[0].desc;
epout = &interface->endpoint[1].desc;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: "Lendacky, Thomas" <[email protected]>
[ Upstream commit 622c36f143fc9566ba49d7cec994c2da1182d9e2 ]
Newer hardware does not provide a cumulative payload length when multiple
descriptors are needed to handle the data. Once the MTU increases beyond
the size that can be handled by a single descriptor, the SKB does not get
built properly by the driver.
The driver will now calculate the size of the data buffers used by the
hardware. The first buffer of the first descriptor is for packet headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first buffer. The
second buffer is used by all the descriptors in the chain for payload data.
Based on whether the driver is processing the first, intermediate, or last
descriptor it can calculate the buffer usage and build the SKB properly.
Tested and verified on both old and new hardware.
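A worked example of the new per-descriptor accounting (all sizes made
up: 256-byte header buffer, 2 KiB data buffer, a 5000-byte packet, no
split header):

/*   desc 1 (FIRST):  buf1_len = 256 (full header buffer)
 *                    buf2_len = 2048                     len = 2304
 *   desc 2:          buf1_len = 0 (not FIRST)
 *                    buf2_len = 2048                     len = 4352
 *   desc 3 (LAST):   buf1_len = 0
 *                    buf2_len = rx.len - len = 5000 - 4352 = 648
 */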
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/amd/xgbe/xgbe-common.h | 6 +
drivers/net/ethernet/amd/xgbe/xgbe-dev.c | 20 +++--
drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 102 +++++++++++++++++-----------
3 files changed, 78 insertions(+), 50 deletions(-)
--- a/drivers/net/ethernet/amd/xgbe/xgbe-common.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
@@ -913,8 +913,8 @@
#define RX_PACKET_ATTRIBUTES_CSUM_DONE_WIDTH 1
#define RX_PACKET_ATTRIBUTES_VLAN_CTAG_INDEX 1
#define RX_PACKET_ATTRIBUTES_VLAN_CTAG_WIDTH 1
-#define RX_PACKET_ATTRIBUTES_INCOMPLETE_INDEX 2
-#define RX_PACKET_ATTRIBUTES_INCOMPLETE_WIDTH 1
+#define RX_PACKET_ATTRIBUTES_LAST_INDEX 2
+#define RX_PACKET_ATTRIBUTES_LAST_WIDTH 1
#define RX_PACKET_ATTRIBUTES_CONTEXT_NEXT_INDEX 3
#define RX_PACKET_ATTRIBUTES_CONTEXT_NEXT_WIDTH 1
#define RX_PACKET_ATTRIBUTES_CONTEXT_INDEX 4
@@ -923,6 +923,8 @@
#define RX_PACKET_ATTRIBUTES_RX_TSTAMP_WIDTH 1
#define RX_PACKET_ATTRIBUTES_RSS_HASH_INDEX 6
#define RX_PACKET_ATTRIBUTES_RSS_HASH_WIDTH 1
+#define RX_PACKET_ATTRIBUTES_FIRST_INDEX 7
+#define RX_PACKET_ATTRIBUTES_FIRST_WIDTH 1
#define RX_NORMAL_DESC0_OVT_INDEX 0
#define RX_NORMAL_DESC0_OVT_WIDTH 16
--- a/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-dev.c
@@ -1658,10 +1658,15 @@ static int xgbe_dev_read(struct xgbe_cha
/* Get the header length */
if (XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, FD)) {
+ XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
+ FIRST, 1);
rdata->rx.hdr_len = XGMAC_GET_BITS_LE(rdesc->desc2,
RX_NORMAL_DESC2, HL);
if (rdata->rx.hdr_len)
pdata->ext_stats.rx_split_header_packets++;
+ } else {
+ XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
+ FIRST, 0);
}
/* Get the RSS hash */
@@ -1684,19 +1689,16 @@ static int xgbe_dev_read(struct xgbe_cha
}
}
- /* Get the packet length */
- rdata->rx.len = XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, PL);
-
- if (!XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, LD)) {
- /* Not all the data has been transferred for this packet */
- XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
- INCOMPLETE, 1);
+ /* Not all the data has been transferred for this packet */
+ if (!XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, LD))
return 0;
- }
/* This is the last of the data for this packet */
XGMAC_SET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
- INCOMPLETE, 0);
+ LAST, 1);
+
+ /* Get the packet length */
+ rdata->rx.len = XGMAC_GET_BITS_LE(rdesc->desc3, RX_NORMAL_DESC3, PL);
/* Set checksum done indicator as appropriate */
if (netdev->features & NETIF_F_RXCSUM)
--- a/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-drv.c
@@ -1760,13 +1760,12 @@ static struct sk_buff *xgbe_create_skb(s
{
struct sk_buff *skb;
u8 *packet;
- unsigned int copy_len;
skb = napi_alloc_skb(napi, rdata->rx.hdr.dma_len);
if (!skb)
return NULL;
- /* Start with the header buffer which may contain just the header
+ /* Pull in the header buffer which may contain just the header
* or the header plus data
*/
dma_sync_single_range_for_cpu(pdata->dev, rdata->rx.hdr.dma_base,
@@ -1775,30 +1774,49 @@ static struct sk_buff *xgbe_create_skb(s
packet = page_address(rdata->rx.hdr.pa.pages) +
rdata->rx.hdr.pa.pages_offset;
- copy_len = (rdata->rx.hdr_len) ? rdata->rx.hdr_len : len;
- copy_len = min(rdata->rx.hdr.dma_len, copy_len);
- skb_copy_to_linear_data(skb, packet, copy_len);
- skb_put(skb, copy_len);
-
- len -= copy_len;
- if (len) {
- /* Add the remaining data as a frag */
- dma_sync_single_range_for_cpu(pdata->dev,
- rdata->rx.buf.dma_base,
- rdata->rx.buf.dma_off,
- rdata->rx.buf.dma_len,
- DMA_FROM_DEVICE);
-
- skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
- rdata->rx.buf.pa.pages,
- rdata->rx.buf.pa.pages_offset,
- len, rdata->rx.buf.dma_len);
- rdata->rx.buf.pa.pages = NULL;
- }
+ skb_copy_to_linear_data(skb, packet, len);
+ skb_put(skb, len);
return skb;
}
+static unsigned int xgbe_rx_buf1_len(struct xgbe_ring_data *rdata,
+ struct xgbe_packet_data *packet)
+{
+ /* Always zero if not the first descriptor */
+ if (!XGMAC_GET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES, FIRST))
+ return 0;
+
+ /* First descriptor with split header, return header length */
+ if (rdata->rx.hdr_len)
+ return rdata->rx.hdr_len;
+
+ /* First descriptor but not the last descriptor and no split header,
+ * so the full buffer was used
+ */
+ if (!XGMAC_GET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES, LAST))
+ return rdata->rx.hdr.dma_len;
+
+ /* First descriptor and last descriptor and no split header, so
+ * calculate how much of the buffer was used
+ */
+ return min_t(unsigned int, rdata->rx.hdr.dma_len, rdata->rx.len);
+}
+
+static unsigned int xgbe_rx_buf2_len(struct xgbe_ring_data *rdata,
+ struct xgbe_packet_data *packet,
+ unsigned int len)
+{
+ /* Always the full buffer if not the last descriptor */
+ if (!XGMAC_GET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES, LAST))
+ return rdata->rx.buf.dma_len;
+
+ /* Last descriptor so calculate how much of the buffer was used
+ * for the last bit of data
+ */
+ return rdata->rx.len - len;
+}
+
static int xgbe_tx_poll(struct xgbe_channel *channel)
{
struct xgbe_prv_data *pdata = channel->pdata;
@@ -1881,8 +1899,8 @@ static int xgbe_rx_poll(struct xgbe_chan
struct napi_struct *napi;
struct sk_buff *skb;
struct skb_shared_hwtstamps *hwtstamps;
- unsigned int incomplete, error, context_next, context;
- unsigned int len, rdesc_len, max_len;
+ unsigned int last, error, context_next, context;
+ unsigned int len, buf1_len, buf2_len, max_len;
unsigned int received = 0;
int packet_count = 0;
@@ -1892,7 +1910,7 @@ static int xgbe_rx_poll(struct xgbe_chan
if (!ring)
return 0;
- incomplete = 0;
+ last = 0;
context_next = 0;
napi = (pdata->per_channel_irq) ? &channel->napi : &pdata->napi;
@@ -1926,9 +1944,8 @@ read_again:
received++;
ring->cur++;
- incomplete = XGMAC_GET_BITS(packet->attributes,
- RX_PACKET_ATTRIBUTES,
- INCOMPLETE);
+ last = XGMAC_GET_BITS(packet->attributes, RX_PACKET_ATTRIBUTES,
+ LAST);
context_next = XGMAC_GET_BITS(packet->attributes,
RX_PACKET_ATTRIBUTES,
CONTEXT_NEXT);
@@ -1937,7 +1954,7 @@ read_again:
CONTEXT);
/* Earlier error, just drain the remaining data */
- if ((incomplete || context_next) && error)
+ if ((!last || context_next) && error)
goto read_again;
if (error || packet->errors) {
@@ -1949,16 +1966,22 @@ read_again:
}
if (!context) {
- /* Length is cumulative, get this descriptor's length */
- rdesc_len = rdata->rx.len - len;
- len += rdesc_len;
+ /* Get the data length in the descriptor buffers */
+ buf1_len = xgbe_rx_buf1_len(rdata, packet);
+ len += buf1_len;
+ buf2_len = xgbe_rx_buf2_len(rdata, packet, len);
+ len += buf2_len;
- if (rdesc_len && !skb) {
+ if (!skb) {
skb = xgbe_create_skb(pdata, napi, rdata,
- rdesc_len);
- if (!skb)
+ buf1_len);
+ if (!skb) {
error = 1;
- } else if (rdesc_len) {
+ goto skip_data;
+ }
+ }
+
+ if (buf2_len) {
dma_sync_single_range_for_cpu(pdata->dev,
rdata->rx.buf.dma_base,
rdata->rx.buf.dma_off,
@@ -1968,13 +1991,14 @@ read_again:
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
rdata->rx.buf.pa.pages,
rdata->rx.buf.pa.pages_offset,
- rdesc_len,
+ buf2_len,
rdata->rx.buf.dma_len);
rdata->rx.buf.pa.pages = NULL;
}
}
- if (incomplete || context_next)
+skip_data:
+ if (!last || context_next)
goto read_again;
if (!skb)
@@ -2033,7 +2057,7 @@ next_packet:
}
/* Check if we need to save state before leaving */
- if (received && (incomplete || context_next)) {
+ if (received && (!last || context_next)) {
rdata = XGBE_GET_DESC_DATA(ring, ring->cur);
rdata->state_saved = 1;
rdata->state.skb = skb;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 92461f5d723037530c1f36cce93640770037812c upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer or accessing memory that lies beyond the end of the endpoint
array should a malicious device lack the expected endpoints.
Fixes: bdb5c57f209c ("Input: add sur40 driver for Samsung SUR40... ")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/touchscreen/sur40.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/input/touchscreen/sur40.c
+++ b/drivers/input/touchscreen/sur40.c
@@ -500,6 +500,9 @@ static int sur40_probe(struct usb_interf
if (iface_desc->desc.bInterfaceClass != 0xFF)
return -ENODEV;
+ if (iface_desc->desc.bNumEndpoints < 5)
+ return -ENODEV;
+
/* Use endpoint #4 (0x86). */
endpoint = &iface_desc->endpoint[4].desc;
if (endpoint->bEndpointAddress != TOUCH_ENDPOINT)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 5cc4a1a9f5c179795c8a1f2b0f4361829d6a070e upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Fixes: aca951a22a1d ("[PATCH] input-driver-yealink-P1K-usb-phone")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/misc/yealink.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/input/misc/yealink.c
+++ b/drivers/input/misc/yealink.c
@@ -875,6 +875,10 @@ static int usb_probe(struct usb_interfac
int ret, pipe, i;
interface = intf->cur_altsetting;
+
+ if (interface->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
endpoint = &interface->endpoint[0].desc;
if (!usb_endpoint_is_int_in(endpoint))
return -ENODEV;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Maor Gottlieb <[email protected]>
[ Upstream commit 5f40b4ed975c26016cf41953b7510fe90718e21c ]
With ConnectX-4 sharing SRQs from the same space as QPs, we hit a
limit preventing some applications from allocating the needed amount of QPs.
Double the size to 256K.
Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -85,7 +85,7 @@ static struct mlx5_profile profile[] = {
[2] = {
.mask = MLX5_PROF_MASK_QP_SIZE |
MLX5_PROF_MASK_MR_CACHE,
- .log_max_qp = 17,
+ .log_max_qp = 18,
.mr_cache[0] = {
.size = 500,
.limit = 250
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Matjaz Hegedic <[email protected]>
commit 92ef6f97a66e580189a41a132d0f8a9f78d6ddce upstream.
EeeBook X205TA is yet another ASUS device with a special touchpad
firmware that needs to be accounted for during initialization, or
else the touchpad will go into an invalid state upon suspend/resume.
Adding the appropriate ic_type and product_id check fixes the problem.
Signed-off-by: Matjaz Hegedic <[email protected]>
Acked-by: KT Liao <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/mouse/elan_i2c_core.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
--- a/drivers/input/mouse/elan_i2c_core.c
+++ b/drivers/input/mouse/elan_i2c_core.c
@@ -218,17 +218,19 @@ static int elan_query_product(struct ela
static int elan_check_ASUS_special_fw(struct elan_tp_data *data)
{
- if (data->ic_type != 0x0E)
- return false;
-
- switch (data->product_id) {
- case 0x05 ... 0x07:
- case 0x09:
- case 0x13:
+ if (data->ic_type == 0x0E) {
+ switch (data->product_id) {
+ case 0x05 ... 0x07:
+ case 0x09:
+ case 0x13:
+ return true;
+ }
+ } else if (data->ic_type == 0x08 && data->product_id == 0x26) {
+ /* ASUS EeeBook X205TA */
return true;
- default:
- return false;
}
+
+ return false;
}
static int __elan_initialize(struct elan_tp_data *data)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Daniel Borkmann <[email protected]>
[ Upstream commit a97e50cc4cb67e1e7bff56f6b41cda62ca832336 ]
In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.
sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.
The issue is that commit 278571baca2a ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.
Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.
Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parent's sk_filter.
Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.
Fixes: 278571baca2a ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/core/sock.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1557,6 +1557,12 @@ struct sock *sk_clone_lock(const struct
is_charged = sk_filter_charge(newsk, filter);
if (unlikely(!is_charged || xfrm_sk_clone_policy(newsk, sk))) {
+ /* We need to make sure that we don't uncharge the new
+ * socket if we couldn't charge it in the first place
+ * as otherwise we uncharge the parent's filter.
+ */
+ if (!is_charged)
+ RCU_INIT_POINTER(newsk->sk_filter, NULL);
/* It is still raw copy of parent, so invalidate
* destructor and make plain sk_free() */
newsk->sk_destruct = NULL;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Florian Fainelli <[email protected]>
[ Upstream commit 5371bbf4b295eea334ed453efa286afa2c3ccff3 ]
Suspending the PHY would be putting it in a low power state where it
may no longer allow us to do Wake-on-LAN.
Fixes: cc013fb48898 ("net: bcmgenet: correctly suspend and resume PHY device")
Signed-off-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -3495,7 +3495,8 @@ static int bcmgenet_suspend(struct devic
bcmgenet_netif_stop(dev);
- phy_suspend(priv->phydev);
+ if (!device_may_wakeup(d))
+ phy_suspend(priv->phydev);
netif_device_detach(dev);
@@ -3592,7 +3593,8 @@ static int bcmgenet_resume(struct device
netif_device_attach(dev);
- phy_resume(priv->phydev);
+ if (!device_may_wakeup(d))
+ phy_resume(priv->phydev);
if (priv->eee.eee_enabled)
bcmgenet_eee_enable_set(dev, true);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit 1916d319271664241b7aa0cd2b05e32bdb310ce9 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack control-interface endpoints.
Fixes: 628329d52474 ("Input: add IMS Passenger Control Unit driver")
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/misc/ims-pcu.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/drivers/input/misc/ims-pcu.c
+++ b/drivers/input/misc/ims-pcu.c
@@ -1667,6 +1667,10 @@ static int ims_pcu_parse_cdc_data(struct
return -EINVAL;
alt = pcu->ctrl_intf->cur_altsetting;
+
+ if (alt->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
pcu->ep_ctrl = &alt->endpoint[0].desc;
pcu->max_ctrl_size = usb_endpoint_maxp(pcu->ep_ctrl);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Dumazet <[email protected]>
[ Upstream commit 15bb7745e94a665caf42bfaabf0ce062845b533b ]
icsk_ack.lrcvtime has a 0 value at socket creation time.
tcpi_last_data_recv can have a bogus value if no payload is ever received.
This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()
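The symptom can be seen from how tcp_get_info() derives the reported
value (schematic, based on the computation in this tree):

	/* tcpi_last_data_recv is reported as "now - lrcvtime", so a
	 * never-initialized lrcvtime of 0 makes it look like the last
	 * payload arrived at boot: */
	info->tcpi_last_data_recv =
		jiffies_to_msecs(now - icsk->icsk_ack.lrcvtime);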
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_minisocks.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5435,6 +5435,7 @@ void tcp_finish_connect(struct sock *sk,
struct inet_connection_sock *icsk = inet_csk(sk);
tcp_set_state(sk, TCP_ESTABLISHED);
+ icsk->icsk_ack.lrcvtime = tcp_time_stamp;
if (skb) {
icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
@@ -5647,7 +5648,6 @@ static int tcp_rcv_synsent_state_process
* to stand against the temptation 8) --ANK
*/
inet_csk_schedule_ack(sk);
- icsk->icsk_ack.lrcvtime = tcp_time_stamp;
tcp_enter_quickack_mode(sk);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
TCP_DELACK_MAX, TCP_RTO_MAX);
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,6 +472,7 @@ struct sock *tcp_create_openreq_child(co
newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
newtp->rtt_min[0].rtt = ~0U;
newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+ newicsk->icsk_ack.lrcvtime = tcp_time_stamp;
newtp->packets_out = 0;
newtp->retrans_out = 0;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Sudip Mukherjee <[email protected]>
commit 03270c6ac6207fc55bbf9d20d195029dca210c79 upstream.
Usually every parallel port will have a single pardev registered with
it. But the ppdev driver is an exception. This userspace parallel port
driver allows creating multiple parallel port devices for a single
parallel port. And as a result we were seeing a warning like:
"sysctl table check failed:
/dev/parport/parport0/devices/ppdev0/timeslice Sysctl already exists"
Use the same logic as used in parport_register_device() and register
the proc files only once for each parallel port.
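The "register once" idiom boils down to test_and_set_bit(), which
atomically sets the flag and returns its previous value, so only the
first pardev registered on a port creates the proc entries (a sketch
mirroring the hunk below):

	if (!test_and_set_bit(PARPORT_DEVPROC_REGISTERED, &port->devflags)) {
		/* first device on this port: the old bit value was 0 */
		port->proc_device = par_dev;
		parport_device_proc_register(par_dev);
	}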
Fixes: 6fa45a226897 ("parport: add device-model to parport subsystem")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1414656
Bugzilla: https://bugs.archlinux.org/task/52322
Tested-by: James Feeney <[email protected]>
Signed-off-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/parport/share.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/drivers/parport/share.c
+++ b/drivers/parport/share.c
@@ -936,8 +936,10 @@ parport_register_dev_model(struct parpor
* pardevice fields. -arca
*/
port->ops->init_state(par_dev, par_dev->state);
- port->proc_device = par_dev;
- parport_device_proc_register(par_dev);
+ if (!test_and_set_bit(PARPORT_DEVPROC_REGISTERED, &port->devflags)) {
+ port->proc_device = par_dev;
+ parport_device_proc_register(par_dev);
+ }
return par_dev;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Dumazet <[email protected]>
[ Upstream commit c64c0b3cac4c5b8cb093727d2c19743ea3965c0b ]
Alexander reported a KMSAN splat caused by reads of an uninitialized
field (tb_id_in) from a user-provided struct fib_result_nl.
It turns out the nl_fib_input() sanity tests on user input are a bit
wrong:
a user can pretend nlh->nlmsg_len is big enough, but provide
a too-small buffer at sendmsg() time.
Reported-by: Alexander Potapenko <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/fib_frontend.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1080,7 +1080,8 @@ static void nl_fib_input(struct sk_buff
net = sock_net(skb->sk);
nlh = nlmsg_hdr(skb);
- if (skb->len < NLMSG_HDRLEN || skb->len < nlh->nlmsg_len ||
+ if (skb->len < nlmsg_total_size(sizeof(*frn)) ||
+ skb->len < nlh->nlmsg_len ||
nlmsg_len(nlh) < sizeof(*frn))
return;
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Takashi Iwai <[email protected]>
commit c520ff3d03f0b5db7146d9beed6373ad5d2a5e0e upstream.
When snd_seq_pool_done() is called, it marks the closing flag to
refuse the further cell insertions. But snd_seq_pool_done() itself
doesn't clear the cells but just waits until all cells are cleared by
the caller side. That is, it's racy, and this leads to the endless
stall that syzkaller spotted.
This patch addresses the race by splitting the setup of the pool->closing
flag out of snd_seq_pool_done(), and calling it properly before
snd_seq_pool_done().
BugLink: http://lkml.kernel.org/r/CACT4Y+aqqy8bZA1fFieifNxR2fAfFQQABcBHj801+u5ePV0URw@mail.gmail.com
Reported-and-tested-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
sound/core/seq/seq_clientmgr.c | 1 +
sound/core/seq/seq_fifo.c | 3 +++
sound/core/seq/seq_memory.c | 17 +++++++++++++----
sound/core/seq/seq_memory.h | 1 +
4 files changed, 18 insertions(+), 4 deletions(-)
--- a/sound/core/seq/seq_clientmgr.c
+++ b/sound/core/seq/seq_clientmgr.c
@@ -1921,6 +1921,7 @@ static int snd_seq_ioctl_set_client_pool
info.output_pool != client->pool->size)) {
if (snd_seq_write_pool_allocated(client)) {
/* remove all existing cells */
+ snd_seq_pool_mark_closing(client->pool);
snd_seq_queue_client_leave_cells(client->number);
snd_seq_pool_done(client->pool);
}
--- a/sound/core/seq/seq_fifo.c
+++ b/sound/core/seq/seq_fifo.c
@@ -70,6 +70,9 @@ void snd_seq_fifo_delete(struct snd_seq_
return;
*fifo = NULL;
+ if (f->pool)
+ snd_seq_pool_mark_closing(f->pool);
+
snd_seq_fifo_clear(f);
/* wake up clients if any */
--- a/sound/core/seq/seq_memory.c
+++ b/sound/core/seq/seq_memory.c
@@ -414,6 +414,18 @@ int snd_seq_pool_init(struct snd_seq_poo
return 0;
}
+/* refuse the further insertion to the pool */
+void snd_seq_pool_mark_closing(struct snd_seq_pool *pool)
+{
+ unsigned long flags;
+
+ if (snd_BUG_ON(!pool))
+ return;
+ spin_lock_irqsave(&pool->lock, flags);
+ pool->closing = 1;
+ spin_unlock_irqrestore(&pool->lock, flags);
+}
+
/* remove events */
int snd_seq_pool_done(struct snd_seq_pool *pool)
{
@@ -424,10 +436,6 @@ int snd_seq_pool_done(struct snd_seq_poo
return -EINVAL;
/* wait for closing all threads */
- spin_lock_irqsave(&pool->lock, flags);
- pool->closing = 1;
- spin_unlock_irqrestore(&pool->lock, flags);
-
if (waitqueue_active(&pool->output_sleep))
wake_up(&pool->output_sleep);
@@ -484,6 +492,7 @@ int snd_seq_pool_delete(struct snd_seq_p
*ppool = NULL;
if (pool == NULL)
return 0;
+ snd_seq_pool_mark_closing(pool);
snd_seq_pool_done(pool);
kfree(pool);
return 0;
--- a/sound/core/seq/seq_memory.h
+++ b/sound/core/seq/seq_memory.h
@@ -84,6 +84,7 @@ static inline int snd_seq_total_cells(st
int snd_seq_pool_init(struct snd_seq_pool *pool);
/* done pool - free events */
+void snd_seq_pool_mark_closing(struct snd_seq_pool *pool);
int snd_seq_pool_done(struct snd_seq_pool *pool);
/* create pool */
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Johan Hovold <[email protected]>
commit cb1b494663e037253337623bf1ef2df727883cb7 upstream.
Make sure to check the number of endpoints to avoid dereferencing a
NULL-pointer should a malicious device lack endpoints.
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/input/tablet/kbtab.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/input/tablet/kbtab.c
+++ b/drivers/input/tablet/kbtab.c
@@ -122,6 +122,9 @@ static int kbtab_probe(struct usb_interf
struct input_dev *input_dev;
int error = -ENOMEM;
+ if (intf->cur_altsetting->desc.bNumEndpoints < 1)
+ return -ENODEV;
+
kbtab = kzalloc(sizeof(struct kbtab), GFP_KERNEL);
input_dev = input_allocate_device();
if (!kbtab || !input_dev)
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Ankur Arora <[email protected]>
commit 1914f0cd203c941bba72f9452c8290324f1ef3dc upstream.
This was broken in commit cd979883b9ed ("xen/acpi-processor:
fix enabling interrupts on syscore_resume"): do_suspend() (from
xen/manage.c), and thus xen_resume_notifier, never gets called on
the initial domain at resume (it does if running as a guest).
The rationale for the breaking change was that upload_pm_data()
potentially does blocking work in syscore_resume(). This patch
addresses the original issue by scheduling upload_pm_data() to
execute in workqueue context.
Cc: Stanislaw Gruszka <[email protected]>
Based-on-patch-by: Konrad Wilk <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Stanislaw Gruszka <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/xen/xen-acpi-processor.c | 34 ++++++++++++++++++++++++++--------
1 file changed, 26 insertions(+), 8 deletions(-)
--- a/drivers/xen/xen-acpi-processor.c
+++ b/drivers/xen/xen-acpi-processor.c
@@ -27,10 +27,10 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/types.h>
+#include <linux/syscore_ops.h>
#include <linux/acpi.h>
#include <acpi/processor.h>
#include <xen/xen.h>
-#include <xen/xen-ops.h>
#include <xen/interface/platform.h>
#include <asm/xen/hypercall.h>
@@ -466,15 +466,33 @@ static int xen_upload_processor_pm_data(
return rc;
}
-static int xen_acpi_processor_resume(struct notifier_block *nb,
- unsigned long action, void *data)
+static void xen_acpi_processor_resume_worker(struct work_struct *dummy)
{
+ int rc;
+
bitmap_zero(acpi_ids_done, nr_acpi_bits);
- return xen_upload_processor_pm_data();
+
+ rc = xen_upload_processor_pm_data();
+ if (rc != 0)
+ pr_info("ACPI data upload failed, error = %d\n", rc);
+}
+
+static void xen_acpi_processor_resume(void)
+{
+ static DECLARE_WORK(wq, xen_acpi_processor_resume_worker);
+
+ /*
+ * xen_upload_processor_pm_data() calls non-atomic code.
+ * However, the context for xen_acpi_processor_resume is syscore
+ * with only the boot CPU online and in an atomic context.
+ *
+ * So defer the upload for some point safer.
+ */
+ schedule_work(&wq);
}
-struct notifier_block xen_acpi_processor_resume_nb = {
- .notifier_call = xen_acpi_processor_resume,
+static struct syscore_ops xap_syscore_ops = {
+ .resume = xen_acpi_processor_resume,
};
static int __init xen_acpi_processor_init(void)
@@ -527,7 +545,7 @@ static int __init xen_acpi_processor_ini
if (rc)
goto err_unregister;
- xen_resume_notifier_register(&xen_acpi_processor_resume_nb);
+ register_syscore_ops(&xap_syscore_ops);
return 0;
err_unregister:
@@ -544,7 +562,7 @@ static void __exit xen_acpi_processor_ex
{
int i;
- xen_resume_notifier_unregister(&xen_acpi_processor_resume_nb);
+ unregister_syscore_ops(&xap_syscore_ops);
kfree(acpi_ids_done);
kfree(acpi_id_present);
kfree(acpi_id_cst_present);
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Nicolas Ferre <[email protected]>
commit b1708b72a0959a032cd2eebb77fa9086ea3e0c84 upstream.
The dmas/dma-names properties are added to the UART nodes. Note that additional
properties are needed to enable them at the board level: check bindings for
details.
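For reference, a hedged sketch of the board-level enabling (property
names per the atmel-usart binding; the uart0 label is assumed from
this dtsi):

    &uart0 {
            atmel,use-dma-rx;
            atmel,use-dma-tx;
            status = "okay";
    };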
Signed-off-by: Nicolas Ferre <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/arm/boot/dts/sama5d2.dtsi | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
--- a/arch/arm/boot/dts/sama5d2.dtsi
+++ b/arch/arm/boot/dts/sama5d2.dtsi
@@ -856,6 +856,13 @@
compatible = "atmel,at91sam9260-usart";
reg = <0xf801c000 0x100>;
interrupts = <24 IRQ_TYPE_LEVEL_HIGH 7>;
+ dmas = <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(35))>,
+ <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(36))>;
+ dma-names = "tx", "rx";
clocks = <&uart0_clk>;
clock-names = "usart";
status = "disabled";
@@ -865,6 +872,13 @@
compatible = "atmel,at91sam9260-usart";
reg = <0xf8020000 0x100>;
interrupts = <25 IRQ_TYPE_LEVEL_HIGH 7>;
+ dmas = <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(37))>,
+ <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(38))>;
+ dma-names = "tx", "rx";
clocks = <&uart1_clk>;
clock-names = "usart";
status = "disabled";
@@ -874,6 +888,13 @@
compatible = "atmel,at91sam9260-usart";
reg = <0xf8024000 0x100>;
interrupts = <26 IRQ_TYPE_LEVEL_HIGH 7>;
+ dmas = <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(39))>,
+ <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(40))>;
+ dma-names = "tx", "rx";
clocks = <&uart2_clk>;
clock-names = "usart";
status = "disabled";
@@ -985,6 +1006,13 @@
compatible = "atmel,at91sam9260-usart";
reg = <0xfc008000 0x100>;
interrupts = <27 IRQ_TYPE_LEVEL_HIGH 7>;
+ dmas = <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(41))>,
+ <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(42))>;
+ dma-names = "tx", "rx";
clocks = <&uart3_clk>;
clock-names = "usart";
status = "disabled";
@@ -993,6 +1021,13 @@
uart4: serial@fc00c000 {
compatible = "atmel,at91sam9260-usart";
reg = <0xfc00c000 0x100>;
+ dmas = <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(43))>,
+ <&dma0
+ (AT91_XDMAC_DT_MEM_IF(0) | AT91_XDMAC_DT_PER_IF(1) |
+ AT91_XDMAC_DT_PERID(44))>;
+ dma-names = "tx", "rx";
interrupts = <28 IRQ_TYPE_LEVEL_HIGH 7>;
clocks = <&uart4_clk>;
clock-names = "usart";
On 03/28/2017 06:29 AM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.58 release.
> There are 76 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Mar 30 12:25:40 UTC 2017.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.58-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>
Compiled and booted on my test system. No dmesg regressions.
thanks,
-- Shuah
On 03/28/2017 05:29 AM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.58 release.
> There are 76 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Mar 30 12:25:40 UTC 2017.
> Anything received after that time might be too late.
>
Build results:
total: 149 pass: 149 fail: 0
Qemu test results:
total: 115 pass: 115 fail: 0
Details are available at http://kerneltests.org/builders.
Guenter
On Tue, Mar 28, 2017 at 3:30 PM, Michal Hocko <[email protected]> wrote:
> On Tue 28-03-17 15:23:58, Ilya Dryomov wrote:
>> On Tue, Mar 28, 2017 at 2:43 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 28-03-17 14:30:45, Greg KH wrote:
>> >> 4.4-stable review patch. If anyone has any objections, please let me know.
>> >
>> > I haven't seen the original patch but the changelog makes me worried.
>> > How exactly is this a problem? Where do we lock up? Does rbd/libceph take
>> > any xfs locks?
>>
>> No, it doesn't. This is just another instance of "using GFP_KERNEL on
>> the writeback path may lead to a deadlock" with nothing extra to it.
>>
>> XFS is writing out data, libceph messenger worker tries to open
>> a socket and recurses back into XFS because the sockfs inode is
>> allocated with GFP_KERNEL. The message with some of the data never
>> goes out and eventually we get a deadlock.
>>
>> I've only included the offending stack trace. I guess I should have
>> stressed that ceph-msgr workqueue is used for reclaim.
>
> Could you be more specific about the lockup scenario? I still do not get
> how this would lead to a deadlock.
This is a set of stack traces from http://tracker.ceph.com/issues/19309
(linked in the changelog):
Workqueue: ceph-msgr con_work [libceph]
ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
[<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
[<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[<ffffffff81086335>] flush_work+0x165/0x250
[<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
[<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
[<ffffffff816d6b42>] ? __slab_free+0xee/0x234
[<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
[<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
[<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
[<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
[<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
[<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
[<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[<ffffffff811c0c18>] super_cache_scan+0x178/0x180
[<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
[<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
[<ffffffff8115af70>] shrink_slab+0x100/0x140
[<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
[<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
[<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
[<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
[<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
[<ffffffff811a0ac5>] new_slab+0x2c5/0x390
[<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
[<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
[<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
[<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
[<ffffffff811d8566>] alloc_inode+0x26/0xa0
[<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
[<ffffffff815b933e>] sock_alloc+0x1e/0x80
[<ffffffff815ba855>] __sock_create+0x95/0x220
[<ffffffff815baa04>] sock_create_kern+0x24/0x30
[<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
[<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
We are writing out data on ceph_connection X:
ceph_con_workfn
mutex_lock(&con->mutex) # ceph_connection::mutex
try_write
ceph_tcp_connect
sock_create_kern
GFP_KERNEL allocation
allocator recurses into XFS, more I/O is issued
Workqueue: rbd rbd_request_workfn [rbd]
ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
[<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
[<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
[<ffffffff816df7f6>] mutex_lock+0x36/0x4a
[<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
[<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
[<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
[<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
[<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
[<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
[<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
[<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
Here is that I/O. We grab ceph_osd_client::request_mutex, but
ceph_connection::mutex is being held by the worker that recursed into
XFS:
rbd_queue_workfn
ceph_osdc_start_request
mutex_lock(&osdc->request_mutex);
ceph_con_send
mutex_lock(&con->mutex) # deadlock
Workqueue: ceph-msgr con_work [libceph]
ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
[<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
[<ffffffff816df7f6>] mutex_lock+0x36/0x4a
[<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
[<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
[<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
Workqueue: ceph-msgr con_work [libceph]
ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
Call Trace:
[<ffffffff816dd629>] schedule+0x29/0x70
[<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
[<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
[<ffffffff816df7f6>] mutex_lock+0x36/0x4a
[<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
[<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
[<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
[<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
[<ffffffff81084c19>] process_one_work+0x159/0x4f0
[<ffffffff8108561b>] worker_thread+0x11b/0x530
[<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[<ffffffff8108b6f9>] kthread+0xc9/0xe0
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
These two are replies on ceph_connections Y and Z, which need
ceph_osd_client::request_mutex to figure out which requests can be
completed:
alloc_msg
get_reply
mutex_lock(&osdc->request_mutex);
Eventually everything else blocks on ceph_osd_client::request_mutex,
since it's used for both submitting requests and handling replies.
This really is a straightforward "using GFP_KERNEL on the writeback
path isn't allowed" case. I'm not sure what made you worried here.
Thanks,
Ilya
[CC xfs guys]
On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
[...]
> This is a set of stack traces from http://tracker.ceph.com/issues/19309
> (linked in the changelog):
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [<ffffffff81086335>] flush_work+0x165/0x250
I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
right? I kind of got lost where this waits on an IO.
> [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
> [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
> [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
[...]
> [<ffffffff815b933e>] sock_alloc+0x1e/0x80
> [<ffffffff815ba855>] __sock_create+0x95/0x220
> [<ffffffff815baa04>] sock_create_kern+0x24/0x30
> [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
> [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> We are writing out data on ceph_connection X:
>
> ceph_con_workfn
> mutex_lock(&con->mutex) # ceph_connection::mutex
> try_write
> ceph_tcp_connect
> sock_create_kern
> GFP_KERNEL allocation
> allocator recurses into XFS, more I/O is issued
I am not sure this is true actually. XFS tends to do IO from
separate kworkers rather than from the direct reclaim context.
> Workqueue: rbd rbd_request_workfn [rbd]
> ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
> 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
> ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
> [<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
> [<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
> [<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
> [<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
> [<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
> [<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
> [<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Here is that I/O. We grab ceph_osd_client::request_mutex, but
> ceph_connection::mutex is being held by the worker that recursed into
> XFS:
>
> rbd_queue_workfn
> ceph_osdc_start_request
> mutex_lock(&osdc->request_mutex);
> ceph_con_send
> mutex_lock(&con->mutex) # deadlock
>
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
> 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
> ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
> 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
> ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> These two are replies on ceph_connections Y and Z, which need
> ceph_osd_client::request_mutex to figure out which requests can be
> completed:
>
> alloc_msg
> get_reply
> mutex_lock(&osdc->request_mutex);
>
> Eventually everything else blocks on ceph_osd_client::request_mutex,
> since it's used for both submitting requests and handling replies.
>
> This really is a straightforward "using GFP_KERNEL on the writeback
> path isn't allowed" case. I'm not sure what made you worried here.
I am still not sure the dependency is there. But if con->mutex really
is the lock under which it is dangerous to recurse back into the FS,
then please wrap the whole scope which takes the lock with
memalloc_noio_save (or memalloc_nofs_save, currently sitting in the mmotm
tree, if you can wait until that API gets merged), with a big fat comment
explaining why that is needed. Sticking the scope protection down the
path is just hard to understand later on. And as already mentioned,
NOFS/NOIO contexts are (ab)used way too much without a clear/good reason.
--
Michal Hocko
SUSE Labs
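For illustration, a hedged sketch of the scope-based approach
suggested above (the connection worker is abridged and illustrative,
not the exact libceph code; memalloc_noio_save()/memalloc_noio_restore()
are the existing kernel helpers):

    static void con_workfn(struct ceph_connection *con)
    {
            unsigned int noio_flag;

            mutex_lock(&con->mutex);    /* held across the whole write path */
            /*
             * Run everything under con->mutex in NOIO scope so allocations
             * made here (e.g. sock_create_kern() on reconnect) cannot
             * recurse into FS writeback and deadlock back on con->mutex.
             */
            noio_flag = memalloc_noio_save();
            try_write(con);
            memalloc_noio_restore(noio_flag);
            mutex_unlock(&con->mutex);
    }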
On Wed 29-03-17 12:41:26, Michal Hocko wrote:
[...]
> > ceph_con_workfn
> > mutex_lock(&con->mutex) # ceph_connection::mutex
> > try_write
> > ceph_tcp_connect
> > sock_create_kern
> > GFP_KERNEL allocation
> > allocator recurses into XFS, more I/O is issued
One more note. So what happens if this is a GFP_NOIO request which
cannot make any progress? Your IO thread is blocked on con->mutex
as you write below but the above thread cannot proceed as well. So I am
_really_ not sure this actually helps.
[...]
> >
> > rbd_queue_workfn
> > ceph_osdc_start_request
> > mutex_lock(&osdc->request_mutex);
> > ceph_con_send
> > mutex_lock(&con->mutex) # deadlock
--
Michal Hocko
SUSE Labs
On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> [CC xfs guys]
>
> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> [...]
> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> > (linked in the changelog):
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> > [<ffffffff81086335>] flush_work+0x165/0x250
>
> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> right? I kind of got lost where this waits on an IO.
>
Yep. That means a CIL push is already in progress. We wait on that to
complete here. After that, the resulting task queues execution of
xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
submit I/O to the log.
I don't see any reference to xlog_cil_push() anywhere in the traces here
or in the bug referenced above, however..?
Brian
> > [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> > [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> > [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
> > [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> > [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
> > [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> > [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> > [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> > [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> > [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
> > [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
> [...]
> > [<ffffffff815b933e>] sock_alloc+0x1e/0x80
> > [<ffffffff815ba855>] __sock_create+0x95/0x220
> > [<ffffffff815baa04>] sock_create_kern+0x24/0x30
> > [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
> > [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > We are writing out data on ceph_connection X:
> >
> > ceph_con_workfn
> > mutex_lock(&con->mutex) # ceph_connection::mutex
> > try_write
> > ceph_tcp_connect
> > sock_create_kern
> > GFP_KERNEL allocation
> > allocator recurses into XFS, more I/O is issued
>
> I am not sure this is true actually. XFS tends to do IO from
> separate kworkers rather than from the direct reclaim context.
>
> > Workqueue: rbd rbd_request_workfn [rbd]
> > ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
> > 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
> > ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
> > [<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
> > [<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
> > [<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
> > [<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
> > [<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
> > [<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
> > [<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > Here is that I/O. We grab ceph_osd_client::request_mutex, but
> > ceph_connection::mutex is being held by the worker that recursed into
> > XFS:
> >
> > rbd_queue_workfn
> > ceph_osdc_start_request
> > mutex_lock(&osdc->request_mutex);
> > ceph_con_send
> > mutex_lock(&con->mutex) # deadlock
> >
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
> > 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
> > ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> > [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
> > 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
> > ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> > [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> > [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> > [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > These two are replies on ceph_connections Y and Z, which need
> > ceph_osd_client::request_mutex to figure out which requests can be
> > completed:
> >
> > alloc_msg
> > get_reply
> > mutex_lock(&osdc->request_mutex);
> >
> > Eventually everything else blocks on ceph_osd_client::request_mutex,
> > since it's used for both submitting requests and handling replies.
> >
> > This really is a straightforward "using GFP_KERNEL on the writeback
> > path isn't allowed" case. I'm not sure what made you worried here.
>
> I am still not sure the dependency is there. But if con->mutex really
> is the lock under which it is dangerous to recurse back into the FS,
> then please wrap the whole scope which takes the lock with
> memalloc_noio_save (or memalloc_nofs_save, currently sitting in the mmotm
> tree, if you can wait until that API gets merged), with a big fat comment
> explaining why that is needed. Sticking the scope protection down the
> path is just hard to understand later on. And as already mentioned,
> NOFS/NOIO contexts are (ab)used way too much without a clear/good reason.
> --
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
> On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> [...]
>> > ceph_con_workfn
>> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> > try_write
>> > ceph_tcp_connect
>> > sock_create_kern
>> > GFP_KERNEL allocation
>> > allocator recurses into XFS, more I/O is issued
>
> One more note. So what happens if this is a GFP_NOIO request which
> cannot make any progress? Your IO thread is blocked on con->mutex
> as you write below but the above thread cannot proceed as well. So I am
> _really_ not sure this actually helps.
This is not the only I/O worker. A ceph cluster typically consists of
at least a few OSDs and can be as large as thousands of OSDs. This is
the reason we are calling sock_create_kern() on the writeback path in
the first place: pre-opening thousands of sockets isn't feasible.
Thanks,
Ilya
On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> > [...]
> >> > ceph_con_workfn
> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
> >> > try_write
> >> > ceph_tcp_connect
> >> > sock_create_kern
> >> > GFP_KERNEL allocation
> >> > allocator recurses into XFS, more I/O is issued
> >
> > One more note. So what happens if this is a GFP_NOIO request which
> > cannot make any progress? Your IO thread is blocked on con->mutex
> > as you write below but the above thread cannot proceed as well. So I am
> > _really_ not sure this actually helps.
>
> This is not the only I/O worker. A ceph cluster typically consists of
> at least a few OSDs and can be as large as thousands of OSDs. This is
> the reason we are calling sock_create_kern() on the writeback path in
> the first place: pre-opening thousands of sockets isn't feasible.
Sorry for being dense here but what actually guarantees the forward
progress? My current understanding is that the deadlock is caused by
con->mutex being held while the allocation cannot make forward
progress. I can imagine this would be possible if the other io flushers
depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
much difference. What am I missing?
--
Michal Hocko
SUSE Labs
On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <[email protected]> wrote:
> On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
>> [CC xfs guys]
>>
>> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
>> [...]
>> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
>> > (linked in the changelog):
>> >
>> > Workqueue: ceph-msgr con_work [libceph]
>> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
>> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
>> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
>> > Call Trace:
>> > [<ffffffff816dd629>] schedule+0x29/0x70
>> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
>> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
>> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
>> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
>> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> > [<ffffffff81086335>] flush_work+0x165/0x250
>>
>> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
>> right? I kind of got lost where this waits on an IO.
>>
>
> Yep. That means a CIL push is already in progress. We wait on that to
> complete here. After that, the resulting task queues execution of
> xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> submit I/O to the log.
>
> I don't see any reference to xlog_cil_push() anywhere in the traces here
> or in the bug referenced above, however..?
Well, it's prefaced with "Interesting is:"... Sergey (the original
reporter, CCed here) might still have the rest of them.
Thanks,
Ilya
On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <[email protected]> wrote:
> > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> >> [CC xfs guys]
> >>
> >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> >> [...]
> >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> >> > (linked in the changelog):
> >> >
> >> > Workqueue: ceph-msgr con_work [libceph]
> >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> >> > Call Trace:
> >> > [<ffffffff816dd629>] schedule+0x29/0x70
> >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> >> > [<ffffffff81086335>] flush_work+0x165/0x250
> >>
> >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> >> right? I kind of got lost where this waits on an IO.
> >>
> >
> > Yep. That means a CIL push is already in progress. We wait on that to
> > complete here. After that, the resulting task queues execution of
> > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> > submit I/O to the log.
> >
> > I don't see any reference to xlog_cil_push() anywhere in the traces here
> > or in the bug referenced above, however..?
>
> Well, it's prefaced with "Interesting is:"... Sergey (the original
> reporter, CCed here) might still have the rest of them.
JFTR
http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
[288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
[288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
[288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
[288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
[288420.754643] Call Trace:
[288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
[288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
[288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
[288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
[288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
[288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
[288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
[288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
[288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
[288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
[288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
[288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
--
Michal Hocko
SUSE Labs
On Wed, Mar 29, 2017 at 01:18:34PM +0200, Michal Hocko wrote:
> On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
> > On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <[email protected]> wrote:
> > > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> > >> [CC xfs guys]
> > >>
> > >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> > >> [...]
> > >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> > >> > (linked in the changelog):
> > >> >
> > >> > Workqueue: ceph-msgr con_work [libceph]
> > >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> > >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> > >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> > >> > Call Trace:
> > >> > [<ffffffff816dd629>] schedule+0x29/0x70
> > >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> > >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> > >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> > >> > [<ffffffff81086335>] flush_work+0x165/0x250
> > >>
> > >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> > >> right? I kind of got lost where this waits on an IO.
> > >>
> > >
> > > Yep. That means a CIL push is already in progress. We wait on that to
> > > complete here. After that, the resulting task queues execution of
> > > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> > > submit I/O to the log.
> > >
> > > I don't see any reference to xlog_cil_push() anywhere in the traces here
> > > or in the bug referenced above, however..?
> >
> > Well, it's prefaced with "Interesting is:"... Sergey (the original
> > reporter, CCed here) might still have the rest of them.
>
> JFTR
> http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
> [288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
> [288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
> [288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
> [288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
> [288420.754643] Call Trace:
> [288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
> [288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
> [288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
> [288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
> [288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
> [288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
> [288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
> [288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
> [288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
> [288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
Ah, thanks. According to the above, xfs-cil is waiting on log space to free
up. This means xfs-cil is probably in:
xlog_state_get_iclog_space()
->xlog_wait(&log->l_flush_wait, &log->l_icloglock);
l_flush_wait is awoken during log I/O completion handling via the
xfs-log workqueue. That guy is here:
[288420.773968] INFO: task kworker/6:3:420227 blocked for more than 300 seconds.
[288420.773986] Not tainted 3.18.43-40 #1
[288420.773997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[288420.774017] kworker/6:3 D ffff880103893650 0 420227 2 0x00000000
[288420.774027] Workqueue: xfs-log/rbd1 xfs_log_worker [xfs]
[288420.774028] ffff88010357fac8 0000000000000046 0000000000000000 ffff880103893240
[288420.774030] 0000000000012b00 ffff880146361128 ffff88010357ffd8 0000000000012b00
[288420.774031] ffff8810297b7540 ffff880103893240 ffff88010357fae8 ffff88010357fbf8
[288420.774033] Call Trace:
[288420.774035] [<ffffffff816dd629>] schedule+0x29/0x70
[288420.774036] [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
[288420.774038] [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[288420.774040] [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[288420.774042] [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
[288420.774043] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[288420.774044] [<ffffffff81086335>] flush_work+0x165/0x250
[288420.774046] [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
[288420.774054] [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
[288420.774056] [<ffffffff8109f3cc>] ? dequeue_entity+0x17c/0x520
[288420.774063] [<ffffffffa03b478e>] _xfs_log_force+0x6e/0x280 [xfs]
[288420.774065] [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
[288420.774067] [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
[288420.774068] [<ffffffff810145a2>] ? __switch_to+0xf2/0x5f0
[288420.774076] [<ffffffffa03b49d9>] xfs_log_force+0x39/0xe0 [xfs]
[288420.774083] [<ffffffffa03b4aa8>] xfs_log_worker+0x28/0x50 [xfs]
[288420.774085] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
[288420.774086] [<ffffffff8108561b>] worker_thread+0x11b/0x530
[288420.774088] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[288420.774089] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
[288420.774091] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[288420.774093] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[288420.774095] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
... which is back waiting on xfs-cil.
Ilya,
Have you looked into this[1] patch by any chance? Note that 7a29ac474
("xfs: give all workqueues rescuer threads") may also be a potential
band aid for this. Or IOW, the lack thereof in v3.18.z may make this
problem more likely.
Brian
[1] http://www.spinics.net/lists/linux-xfs/msg04886.html
> --
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <[email protected]> wrote:
> On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
>> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> > [...]
>> >> > ceph_con_workfn
>> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> > try_write
>> >> > ceph_tcp_connect
>> >> > sock_create_kern
>> >> > GFP_KERNEL allocation
>> >> > allocator recurses into XFS, more I/O is issued
>> >
>> > One more note. So what happens if this is a GFP_NOIO request which
>> > cannot make any progress? Your IO thread is blocked on con->mutex
>> > as you write below but the above thread cannot proceed as well. So I am
>> > _really_ not sure this actually helps.
>>
>> This is not the only I/O worker. A ceph cluster typically consists of
>> at least a few OSDs and can be as large as thousands of OSDs. This is
>> the reason we are calling sock_create_kern() on the writeback path in
>> the first place: pre-opening thousands of sockets isn't feasible.
>
> Sorry for being dense here but what actually guarantees the forward
> progress? My current understanding is that the deadlock is caused by
> con->mutex being held while the allocation cannot make forward
> progress. I can imagine this would be possible if the other io flushers
> depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
> much difference. What am I missing?
con->mutex is per-ceph_connection, osdc->request_mutex is global and is
the real problem here because we need both on the submit side, at least
in 3.18. You are correct that even with GFP_NOIO this code may lock up
in theory, however I think it's very unlikely in practice.
We got rid of osdc->request_mutex in 4.7, so these workers are almost
independent in newer kernels and should be able to free up memory for
those blocked on GFP_NOIO retries with their respective con->mutex
held. Using GFP_KERNEL and thus allowing the recursion is just asking
for an AA deadlock on con->mutex OTOH, so it does make a difference.
I'm a little confused by this discussion because for me this patch was
a no-brainer... Locking aside, you said it was the stack trace in the
changelog that got your attention -- are you saying it's OK for a block
device to recurse back into the filesystem when doing I/O, potentially
generating more I/O?
Thanks,
Ilya
On Wed, Mar 29, 2017 at 1:49 PM, Brian Foster <[email protected]> wrote:
> On Wed, Mar 29, 2017 at 01:18:34PM +0200, Michal Hocko wrote:
>> On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
>> > On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <[email protected]> wrote:
>> > > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
>> > >> [CC xfs guys]
>> > >>
>> > >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
>> > >> [...]
>> > >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
>> > >> > (linked in the changelog):
>> > >> >
>> > >> > Workqueue: ceph-msgr con_work [libceph]
>> > >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
>> > >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
>> > >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
>> > >> > Call Trace:
>> > >> > [<ffffffff816dd629>] schedule+0x29/0x70
>> > >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
>> > >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
>> > >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
>> > >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
>> > >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> > >> > [<ffffffff81086335>] flush_work+0x165/0x250
>> > >>
>> > >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
>> > >> right? I kind of got lost where this waits on an IO.
>> > >>
>> > >
>> > > Yep. That means a CIL push is already in progress. We wait on that to
>> > > complete here. After that, the resulting task queues execution of
>> > > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
>> > > submit I/O to the log.
>> > >
>> > > I don't see any reference to xlog_cil_push() anywhere in the traces here
>> > > or in the bug referenced above, however..?
>> >
>> > Well, it's prefaced with "Interesting is:"... Sergey (the original
>> > reporter, CCed here) might still have the rest of them.
>>
>> JFTR
>> http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
>> [288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
>> [288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
>> [288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
>> [288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
>> [288420.754643] Call Trace:
>> [288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
>> [288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
>> [288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
>> [288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
>> [288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> [288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
>> [288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
>> [288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
>> [288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
>> [288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
>> [288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
>> [288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
>> [288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
>> [288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>> [288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
>> [288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Ah, thanks. According to the above, xfs-cil is waiting on log space to free
> up. This means xfs-cil is probably in:
>
> xlog_state_get_iclog_space()
> ->xlog_wait(&log->l_flush_wait, &log->l_icloglock);
>
> l_flush_wait is awoken during log I/O completion handling via the
> xfs-log workqueue. That guy is here:
>
> [288420.773968] INFO: task kworker/6:3:420227 blocked for more than 300 seconds.
> [288420.773986] Not tainted 3.18.43-40 #1
> [288420.773997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [288420.774017] kworker/6:3 D ffff880103893650 0 420227 2 0x00000000
> [288420.774027] Workqueue: xfs-log/rbd1 xfs_log_worker [xfs]
> [288420.774028] ffff88010357fac8 0000000000000046 0000000000000000 ffff880103893240
> [288420.774030] 0000000000012b00 ffff880146361128 ffff88010357ffd8 0000000000012b00
> [288420.774031] ffff8810297b7540 ffff880103893240 ffff88010357fae8 ffff88010357fbf8
> [288420.774033] Call Trace:
> [288420.774035] [<ffffffff816dd629>] schedule+0x29/0x70
> [288420.774036] [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> [288420.774038] [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [288420.774040] [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [288420.774042] [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> [288420.774043] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [288420.774044] [<ffffffff81086335>] flush_work+0x165/0x250
> [288420.774046] [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> [288420.774054] [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> [288420.774056] [<ffffffff8109f3cc>] ? dequeue_entity+0x17c/0x520
> [288420.774063] [<ffffffffa03b478e>] _xfs_log_force+0x6e/0x280 [xfs]
> [288420.774065] [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> [288420.774067] [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> [288420.774068] [<ffffffff810145a2>] ? __switch_to+0xf2/0x5f0
> [288420.774076] [<ffffffffa03b49d9>] xfs_log_force+0x39/0xe0 [xfs]
> [288420.774083] [<ffffffffa03b4aa8>] xfs_log_worker+0x28/0x50 [xfs]
> [288420.774085] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [288420.774086] [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [288420.774088] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [288420.774089] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [288420.774091] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [288420.774093] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [288420.774095] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> ... which is back waiting on xfs-cil.
>
> Ilya,
>
> Have you looked into this[1] patch by any chance? Note that 7a29ac474
> ("xfs: give all workqueues rescuer threads") may also be a potential
> band aid for this. Or IOW, the lack thereof in v3.18.z may make this
> problem more likely.
No, I haven't -- this was a clear rbd/libceph bug to me.
Thanks,
Ilya
On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <[email protected]> wrote:
> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> >> > [...]
> >> >> > ceph_con_workfn
> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
> >> >> > try_write
> >> >> > ceph_tcp_connect
> >> >> > sock_create_kern
> >> >> > GFP_KERNEL allocation
> >> >> > allocator recurses into XFS, more I/O is issued
> >> >
> >> > One more note. So what happens if this is a GFP_NOIO request which
> >> > cannot make any progress? Your IO thread is blocked on con->mutex
> >> > as you write below but the above thread cannot proceed as well. So I am
> >> > _really_ not sure this actually helps.
> >>
> >> This is not the only I/O worker. A ceph cluster typically consists of
> >> at least a few OSDs and can be as large as thousands of OSDs. This is
> >> the reason we are calling sock_create_kern() on the writeback path in
> >> the first place: pre-opening thousands of sockets isn't feasible.
> >
> > Sorry for being dense here but what actually guarantees the forward
> > progress? My current understanding is that the deadlock is caused by
> > con->mutex being held while the allocation cannot make forward
> > progress. I can imagine this would be possible if the other io flushers
> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
> > much difference. What am I missing?
>
> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
> the real problem here because we need both on the submit side, at least
> in 3.18. You are correct that even with GFP_NOIO this code may lock up
> in theory, however I think it's very unlikely in practice.
No, it would just make such a bug more obscure. The real problem seems
to be that you rely on locks which cannot guarantee forward progress
in the IO path. And that is a bug IMHO.
> We got rid of osdc->request_mutex in 4.7, so these workers are almost
> independent in newer kernels and should be able to free up memory for
> those blocked on GFP_NOIO retries with their respective con->mutex
> held. Using GFP_KERNEL and thus allowing the recursion is just asking
> for an AA deadlock on con->mutex OTOH, so it does make a difference.
You keep saying this but so far I haven't heard how the AA deadlock is
possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
of time and that would cause you problems AFAIU.
> I'm a little confused by this discussion because for me this patch was
> a no-brainer...
No, it is a brainer, because recursion prevention should be carefully
thought through. The lack of such care is why we have thousands of
GFP_NOFS uses all over the kernel without a clear or proper
justification. Adding more on top doesn't help long-term
maintainability.
> Locking aside, you said it was the stack trace in the changelog that
> got your attention
No, it is the use of the scope GFP_NOIO API without a proper
explanation which caught my attention.
> are you saying it's OK for a block
> device to recurse back into the filesystem when doing I/O, potentially
> generating more I/O?
No, a block device has to make a forward-progress guarantee when
allocating, and so must use mempools or other means to achieve that.
--
Michal Hocko
SUSE Labs
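For reference, the forward-progress guarantee Michal is describing is
conventionally built on mempools: a driver reserves a minimum number of
objects up front, and mempool_alloc() falls back to that reserve when the
page allocator cannot make progress, so the I/O path never depends on
reclaim. A minimal sketch of the pattern (the my_* names are hypothetical,
not taken from libceph or rbd):

#include <linux/list.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct my_request {
	struct list_head list;
	/* ... per-request state ... */
};

static struct kmem_cache *my_req_cache;
static mempool_t *my_req_pool;

static int my_driver_init(void)
{
	my_req_cache = KMEM_CACHE(my_request, 0);
	if (!my_req_cache)
		return -ENOMEM;

	/*
	 * Reserve 16 objects. The pool refills from the slab cache when
	 * memory is plentiful and falls back to the reserve when it is
	 * not, so allocation here never recurses into the I/O path.
	 */
	my_req_pool = mempool_create_slab_pool(16, my_req_cache);
	if (!my_req_pool) {
		kmem_cache_destroy(my_req_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct my_request *my_req_alloc(void)
{
	/* May sleep, but is guaranteed to succeed eventually. */
	return mempool_alloc(my_req_pool, GFP_NOIO);
}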
On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <[email protected]> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <[email protected]> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> >> > try_write
>> >> >> > ceph_tcp_connect
>> >> >> > sock_create_kern
>> >> >> > GFP_KERNEL allocation
>> >> >> > allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your I/O thread is blocked on con->mutex
>> >> > as you write below, but the above thread cannot proceed either. So I am
>> >> > _really_ not sure this actually helps.
>> >>
>> >> This is not the only I/O worker. A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs. This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here, but what actually guarantees forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutex being held while the allocation cannot make forward
>> > progress. I can imagine this would be possible if the other I/O
>> > flushers depend on this lock. But then a NOIO vs. KERNEL allocation
>> > doesn't make much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection; osdc->request_mutex is global and is
>> the real problem here, because we need both on the submit side, at least
>> in 3.18. You are correct that even with GFP_NOIO this code may lock up
>> in theory; however, I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee forward progress
> in the I/O path. And that is a bug IMHO.
Just to be clear: the "may lock up" comment above goes for 3.18, which
is where these stack traces came from. osdc->request_mutex, which stood
in the way of other ceph_connection workers, is no more.
>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.
Suppose we have an I/O for OSD X, which means it's got to go through
ceph_connection X:
ceph_con_workfn
mutex_lock(&con->mutex)
try_write
ceph_tcp_connect
sock_create_kern
GFP_KERNEL allocation
Suppose that generates another I/O for OSD X and blocks on it. Well,
it's got to go through the same ceph_connection:
rbd_queue_workfn
ceph_osdc_start_request
ceph_con_send
mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
Now if that was a GFP_NOIO allocation, we would simply block in the
allocator. The placement algorithm distributes objects across the OSDs
in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
that OSD, some other I/Os for other OSDs would complete in the meantime
and free up memory. If we are under the kind of memory pressure that
makes GFP_NOIO allocations block for an extended period of time, we are
bound to have a lot of pre-open sockets, as we would have done at least
some flushing by then.
Thanks,
Ilya
On Thu 30-03-17 12:02:03, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <[email protected]> wrote:
> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
[...]
> >> We got rid of osdc->request_mutex in 4.7, so these workers are almost
> >> independent in newer kernels and should be able to free up memory for
> >> those blocked on GFP_NOIO retries with their respective con->mutex
> >> held. Using GFP_KERNEL and thus allowing the recursion is just asking
> >> for an AA deadlock on con->mutex OTOH, so it does make a difference.
> >
> > You keep saying this but so far I haven't heard how the AA deadlock is
> > possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> > of time and that would cause you problems AFAIU.
>
> Suppose we have an I/O for OSD X, which means it's got to go through
> ceph_connection X:
>
> ceph_con_workfn
> mutex_lock(&con->mutex)
> try_write
> ceph_tcp_connect
> sock_create_kern
> GFP_KERNEL allocation
>
> Suppose that generates another I/O for OSD X and blocks on it.
Yeah, I understand that, but I am asking _who_ is going to generate
that I/O. We do not do writeback from the direct reclaim path. I am not
familiar with Ceph at all, but do any of its (slab) shrinkers generate
I/O that recurses back?
> Well,
> it's got to go through the same ceph_connection:
>
> rbd_queue_workfn
> ceph_osdc_start_request
> ceph_con_send
> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>
> Now if that was a GFP_NOIO allocation, we would simply block in the
> allocator. The placement algorithm distributes objects across the OSDs
> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
> that OSD, some other I/Os for other OSDs would complete in the meantime
> and free up memory. If we are under the kind of memory pressure that
> makes GFP_NOIO allocations block for an extended period of time, we are
> bound to have a lot of pre-open sockets, as we would have done at least
> some flushing by then.
How is this any different from xfs waiting for its IO to be done?
--
Michal Hocko
SUSE Labs
On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <[email protected]> wrote:
> On Thu 30-03-17 12:02:03, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <[email protected]> wrote:
>> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
> [...]
>> >> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> >> independent in newer kernels and should be able to free up memory for
>> >> those blocked on GFP_NOIO retries with their respective con->mutex
>> >> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> >> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>> >
>> > You keep saying this but so far I haven't heard how the AA deadlock is
>> > possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
>> > of time and that would cause you problems AFAIU.
>>
>> Suppose we have an I/O for OSD X, which means it's got to go through
>> ceph_connection X:
>>
>> ceph_con_workfn
>> mutex_lock(&con->mutex)
>> try_write
>> ceph_tcp_connect
>> sock_create_kern
>> GFP_KERNEL allocation
>>
>> Suppose that generates another I/O for OSD X and blocks on it.
>
> Yeah, I understand that, but I am asking _who_ is going to generate
> that I/O. We do not do writeback from the direct reclaim path.
It doesn't have to be a newly issued I/O; it could also be a wait on
something that depends on another I/O to OSD X, but I can't back this
up with any actual stack traces because the ones we have are too old.
That's just one scenario, though. With such recursion allowed, we can
just as easily deadlock in the filesystem. Here are a couple of traces
circa 4.8, where it's the mutex in xfs_reclaim_inodes_ag():
cc1 D ffff92243fad8180 0 6772 6770 0x00000080
ffff9224d107b200 ffff922438de2f40 ffff922e8304fed8 ffff9224d107b200
ffff922ea7554000 ffff923034fb0618 0000000000000000 ffff9224d107b200
ffff9230368e5400 ffff92303788b000 ffffffff951eb4e1 0000003e00095bc0
Nov 28 18:21:23 dude kernel: Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffffc0ab0570>] ? _xfs_log_force_lsn+0x1b0/0x340 [xfs]
[<ffffffff94ca5790>] ? wake_up_q+0x60/0x60
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffffc0ab0730>] ? xfs_log_force_lsn+0x30/0xb0 [xfs]
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffff94cbcf80>] ? autoremove_wake_function+0x40/0x40
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a97442>] ? xfs_reclaim_inodes_ag+0x1c2/0x2d0 [xfs]
[<ffffffff94cb197c>] ? enqueue_task_fair+0x5c/0x920
[<ffffffff94c35895>] ? sched_clock+0x5/0x10
[<ffffffff94ca47e0>] ? check_preempt_curr+0x50/0x90
[<ffffffff94ca4834>] ? ttwu_do_wakeup+0x14/0xe0
[<ffffffff94ca53c3>] ? try_to_wake_up+0x53/0x3a0
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94dd918a>] ? alloc_pages_vma+0xba/0x280
[<ffffffff94db0f8b>] ? wp_page_copy+0x45b/0x6c0
[<ffffffff94db3e12>] ? alloc_set_pte+0x2e2/0x5f0
[<ffffffff94db2169>] ? do_wp_page+0x4a9/0x7e0
[<ffffffff94db4bd2>] ? handle_mm_fault+0x872/0x1250
[<ffffffff94c65a53>] ? __do_page_fault+0x1e3/0x500
[<ffffffff951f0cd8>] ? page_fault+0x28/0x30
kworker/9:3 D ffff92303f318180 0 20732 2 0x00000080
Workqueue: ceph-msgr ceph_con_workfn [libceph]
ffff923035dd4480 ffff923038f8a0c0 0000000000000001 000000009eb27318
ffff92269eb28000 ffff92269eb27338 ffff923036b145ac ffff923035dd4480
00000000ffffffff ffff923036b145b0 ffffffff951eb4e1 ffff923036b145a8
Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
[<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
[<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
[<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
[<ffffffff94d92ba5>] ? move_active_pages_to_lru+0x125/0x270
[<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
[<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94ddf42d>] ? cache_grow_begin+0x9d/0x560
[<ffffffff94ddfb88>] ? fallback_alloc+0x148/0x1c0
[<ffffffff94de09db>] ? __kmalloc+0x1eb/0x580
# a buggy ceph_connection worker doing a GFP_KERNEL allocation
xz D ffff92303f358180 0 5932 5928 0x00000084
ffff921a56201180 ffff923038f8ae00 ffff92303788b2c8 0000000000000001
ffff921e90234000 ffff921e90233820 ffff923036b14eac ffff921a56201180
00000000ffffffff ffff923036b14eb0 ffffffff951eb4e1 ffff923036b14ea8
Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
[<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
[<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
[<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
[<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
[<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94dd73b1>] ? alloc_pages_current+0x91/0x140
[<ffffffff94e0ab98>] ? pipe_write+0x208/0x3f0
[<ffffffff94e01b08>] ? new_sync_write+0xd8/0x130
[<ffffffff94e02293>] ? vfs_write+0xb3/0x1a0
[<ffffffff94e03672>] ? SyS_write+0x52/0xc0
[<ffffffff94c03b8a>] ? do_syscall_64+0x7a/0xd0
[<ffffffff951ef9a5>] ? entry_SYSCALL64_slow_path+0x25/0x25
We have since fixed that allocation site, but the point is it was
a combination of direct reclaim and GFP_KERNEL recursion.
> I am not familiar with Ceph at all, but do any of its (slab) shrinkers
> generate I/O that recurses back?
We don't register any custom shrinkers. This is XFS on top of rbd,
a ceph-backed block device.
>
>> Well,
>> it's got to go through the same ceph_connection:
>>
>> rbd_queue_workfn
>> ceph_osdc_start_request
>> ceph_con_send
>> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>>
>> Now if that was a GFP_NOIO allocation, we would simply block in the
>> allocator. The placement algorithm distributes objects across the OSDs
>> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> that OSD, some other I/Os for other OSDs would complete in the meantime
>> and free up memory. If we are under the kind of memory pressure that
>> makes GFP_NOIO allocations block for an extended period of time, we are
>> bound to have a lot of pre-open sockets, as we would have done at least
>> some flushing by then.
>
> How is this any different from xfs waiting for its IO to be done?
I feel like we are talking past each other here. If the worker in
question isn't deadlocked, it will eventually get its socket and start
flushing I/O. If it has deadlocked, it won't...
Thanks,
Ilya
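For readers following along: a "custom shrinker" here means a reclaim
callback registered by the subsystem itself, roughly as in the sketch
below (hypothetical names; libceph registers nothing like this). The
recursion in the traces above comes instead from the generic superblock
shrinker walking into XFS. A shrinker whose scan callback issued I/O
would be exactly the kind of recursion being discussed:

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/shrinker.h>

static atomic_long_t my_cached_objects = ATOMIC_LONG_INIT(0);

static unsigned long my_count_objects(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	return atomic_long_read(&my_cached_objects);
}

static unsigned long my_scan_objects(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	/*
	 * A real implementation would walk an LRU and free objects;
	 * crucially, it must not issue I/O when called on behalf of
	 * a GFP_NOIO allocation (sc->gfp_mask says what is safe).
	 */
	long nr = min_t(long, sc->nr_to_scan,
			atomic_long_read(&my_cached_objects));

	atomic_long_sub(nr, &my_cached_objects);
	return nr;
}

static struct shrinker my_shrinker = {
	.count_objects	= my_count_objects,
	.scan_objects	= my_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

/* registered at init time with register_shrinker(&my_shrinker) */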
On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <[email protected]> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <[email protected]> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <[email protected]> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> >> > try_write
>> >> >> > ceph_tcp_connect
>> >> >> > sock_create_kern
>> >> >> > GFP_KERNEL allocation
>> >> >> > allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your I/O thread is blocked on con->mutex
>> >> > as you write below, but the above thread cannot proceed either. So I am
>> >> > _really_ not sure this actually helps.
>> >>
>> >> This is not the only I/O worker. A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs. This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here, but what actually guarantees forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutex being held while the allocation cannot make forward
>> > progress. I can imagine this would be possible if the other I/O
>> > flushers depend on this lock. But then a NOIO vs. KERNEL allocation
>> > doesn't make much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection; osdc->request_mutex is global and is
>> the real problem here, because we need both on the submit side, at least
>> in 3.18. You are correct that even with GFP_NOIO this code may lock up
>> in theory; however, I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee forward progress
> in the I/O path. And that is a bug IMHO.
>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.
>
>> I'm a little confused by this discussion because for me this patch was
>> a no-brainer...
>
> No, it is a brainer, because recursion prevention should be carefully
> thought through. The lack of such care is why we have thousands of
> GFP_NOFS uses all over the kernel without a clear or proper
> justification. Adding more on top doesn't help long-term
> maintainability.
>
>> Locking aside, you said it was the stack trace in the changelog that
>> got your attention
>
> No, it is the use of the scope GFP_NOIO API without a proper
> explanation which caught my attention.
>
>> are you saying it's OK for a block
>> device to recurse back into the filesystem when doing I/O, potentially
>> generating more I/O?
>
> No, a block device has to make a forward-progress guarantee when
> allocating, and so must use mempools or other means to achieve that.
OK, let me put this differently. Do you agree that a block device
cannot make _any_ kind of progress guarantee if it does a GFP_KERNEL
allocation in the I/O path?
Thanks,
Ilya
On Thu 30-03-17 15:53:35, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <[email protected]> wrote:
> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
[...]
> >> are you saying it's OK for a block
> >> device to recurse back into the filesystem when doing I/O, potentially
> >> generating more I/O?
> >
> > No, a block device has to make a forward-progress guarantee when
> > allocating, and so must use mempools or other means to achieve that.
>
> OK, let me put this differently. Do you agree that a block device
> cannot make _any_ kind of progress guarantee if it does a GFP_KERNEL
> allocation in the I/O path?
Yes, that is correct. And the same is true for GFP_NOIO allocations as
well.
--
Michal Hocko
SUSE Labs
On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <[email protected]> wrote:
[...]
> > I am not familiar with Ceph at all, but do any of its (slab) shrinkers
> > generate I/O that recurses back?
>
> We don't register any custom shrinkers. This is XFS on top of rbd,
> a ceph-backed block device.
OK, that was the part I was missing. So you depend on XFS to make
forward progress here.
> >> Well,
> >> it's got to go through the same ceph_connection:
> >>
> >> rbd_queue_workfn
> >> ceph_osdc_start_request
> >> ceph_con_send
> >> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
> >>
> >> Now if that was a GFP_NOIO allocation, we would simply block in the
> >> allocator. The placement algorithm distributes objects across the OSDs
> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
> >> that OSD, some other I/Os for other OSDs would complete in the meantime
> >> and free up memory. If we are under the kind of memory pressure that
> >> makes GFP_NOIO allocations block for an extended period of time, we are
> >> bound to have a lot of pre-open sockets, as we would have done at least
> >> some flushing by then.
> >
> > How is this any different from xfs waiting for its IO to be done?
>
> I feel like we are talking past each other here. If the worker in
> question isn't deadlocked, it will eventually get its socket and start
> flushing I/O. If it has deadlocked, it won't...
But if the allocation is stuck then the holder of the lock cannot make
forward progress, and it is effectively deadlocked because other I/O
depends on the lock it holds. Maybe I just ask bad questions, but what
makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
might need to wait for an I/O to finish in the shrinker, but it itself
doesn't take the lock in question directly. The former depends on the
allocator's forward progress as well, and that in turn waits for somebody
else to proceed with the I/O. So to me, any blocking allocation while
holding a lock which blocks further I/O from completing is simply broken.
--
Michal Hocko
SUSE Labs
On Thu, Mar 30, 2017 at 4:36 PM, Michal Hocko <[email protected]> wrote:
> On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <[email protected]> wrote:
> [...]
>> > I am not familiar with Ceph at all, but do any of its (slab) shrinkers
>> > generate I/O that recurses back?
>>
>> We don't register any custom shrinkers. This is XFS on top of rbd,
>> a ceph-backed block device.
>
> OK, that was the part I was missing. So you depend on XFS to make
> forward progress here.
>
>> >> Well,
>> >> it's got to go through the same ceph_connection:
>> >>
>> >> rbd_queue_workfn
>> >> ceph_osdc_start_request
>> >> ceph_con_send
>> >> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>> >>
>> >> Now if that was a GFP_NOIO allocation, we would simply block in the
>> >> allocator. The placement algorithm distributes objects across the OSDs
>> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> >> that OSD, some other I/Os for other OSDs would complete in the meantime
>> >> and free up memory. If we are under the kind of memory pressure that
>> >> makes GFP_NOIO allocations block for an extended period of time, we are
>> >> bound to have a lot of pre-open sockets, as we would have done at least
>> >> some flushing by then.
>> >
>> > How is this any different from xfs waiting for its IO to be done?
>>
>> I feel like we are talking past each other here. If the worker in
>> question isn't deadlocked, it will eventually get its socket and start
>> flushing I/O. If it has deadlocked, it won't...
>
> But if the allocation is stuck then the holder of the lock cannot make
> forward progress, and it is effectively deadlocked because other I/O
> depends on the lock it holds.
Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
so there is plenty of room for other in-flight I/Os to finish and move
the allocator forward. The lock in question is per-ceph_connection
(read: per-OSD).
> Maybe I just ask bad questions, but what makes GFP_NOIO different from
> GFP_KERNEL here? We know that the latter might need to wait for an I/O
> to finish in the shrinker, but it itself doesn't take the lock in
> question directly. The former depends on the allocator's forward
> progress as well, and that in turn waits for somebody else to proceed
> with the I/O. So to me, any blocking allocation while holding a lock
> which blocks further I/O from completing is simply broken.
Right, with GFP_NOIO we simply wait -- there is nothing wrong with
a blocking allocation, at least in the general case. With GFP_KERNEL
we deadlock, either in rbd/libceph (less likely) or in the filesystem
above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
omitted in your quote).
Thanks,
Ilya
On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
[...]
> > But if the allocation is stuck then the holder of the lock cannot make
> > forward progress, and it is effectively deadlocked because other I/O
> > depends on the lock it holds.
>
> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
> so there is plenty of room for other in-flight I/Os to finish and move
> the allocator forward. The lock in question is per-ceph_connection
> (read: per-OSD).
>
> > Maybe I just ask bad questions, but what makes GFP_NOIO different from
> > GFP_KERNEL here? We know that the latter might need to wait for an I/O
> > to finish in the shrinker, but it itself doesn't take the lock in
> > question directly. The former depends on the allocator's forward
> > progress as well, and that in turn waits for somebody else to proceed
> > with the I/O. So to me, any blocking allocation while holding a lock
> > which blocks further I/O from completing is simply broken.
>
> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
> a blocking allocation, at least in the general case. With GFP_KERNEL
> we deadlock, either in rbd/libceph (less likely) or in the filesystem
> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
> omitted in your quote).
I am not convinced. It seems you are relying on something that is not
fundamentally guaranteed. AFAIU, all the I/O paths should _guarantee_
forward progress and use mempools for that purpose if they need to
allocate.
But, hey, I will not argue, as my understanding of ceph is close to
zero. You are the maintainer, so it is your call. I would just really
appreciate it if you could document this as much as possible (ideally
at the place where you call memalloc_noio_save, and describe the lock
dependency there).
Thanks!
--
Michal Hocko
SUSE Labs
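For reference, the fix under discussion ("libceph: force GFP_NOIO for
socket allocations") uses the scope API Michal mentions. A simplified
sketch of its shape, not the verbatim diff against net/ceph/messenger.c:

#include <linux/in.h>
#include <linux/net.h>
#include <linux/sched.h>

static int noio_socket_create(struct net *net, int family,
			      struct socket **sock)
{
	unsigned int noio_flag;
	int ret;

	/*
	 * sock_create_kern() allocates with GFP_KERNEL internally, so
	 * mark the whole scope NOIO; otherwise direct reclaim could
	 * recurse into the filesystem on top of rbd while the caller
	 * holds con->mutex.
	 */
	noio_flag = memalloc_noio_save();
	ret = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, sock);
	memalloc_noio_restore(noio_flag);
	return ret;
}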
On Thu, Mar 30, 2017 at 6:12 PM, Michal Hocko <[email protected]> wrote:
> On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
> [...]
>> > But if the allocation is stuck then the holder of the lock cannot make
>> > forward progress, and it is effectively deadlocked because other I/O
>> > depends on the lock it holds.
>>
>> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
>> so there is plenty of room for other in-flight I/Os to finish and move
>> the allocator forward. The lock in question is per-ceph_connection
>> (read: per-OSD).
>>
>> > Maybe I just ask bad questions, but what makes GFP_NOIO different from
>> > GFP_KERNEL here? We know that the latter might need to wait for an I/O
>> > to finish in the shrinker, but it itself doesn't take the lock in
>> > question directly. The former depends on the allocator's forward
>> > progress as well, and that in turn waits for somebody else to proceed
>> > with the I/O. So to me, any blocking allocation while holding a lock
>> > which blocks further I/O from completing is simply broken.
>>
>> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
>> a blocking allocation, at least in the general case. With GFP_KERNEL
>> we deadlock, either in rbd/libceph (less likely) or in the filesystem
>> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
>> omitted in your quote).
>
> I am not convinced. It seems you are relying on something that is not
> fundamentally guaranteed. AFAIU, all the I/O paths should _guarantee_
> forward progress and use mempools for that purpose if they need to
> allocate.
>
> But, hey, I will not argue, as my understanding of ceph is close to
> zero. You are the maintainer, so it is your call. I would just really
> appreciate it if you could document this as much as possible (ideally
> at the place where you call memalloc_noio_save, and describe the lock
> dependency there).
It's certainly not perfect (especially this socket case -- putting
together a pool of sockets is not easy) and I'm sure one could poke
some holes in the entire thing, but I'm convinced we are much better
off with the memalloc_noio_{save,restore}() pair in there.
I'll try to come up with a better comment, but the problem is that it
can be an arbitrary lock in an arbitrary filesystem, not just libceph's
con->mutex, so it's hard to be specific.
Do I have your OK to poke Greg to get the backports going?
Thanks,
Ilya
On Thu 30-03-17 19:19:59, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 6:12 PM, Michal Hocko <[email protected]> wrote:
> > On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
> > [...]
> >> > But if the allocation is stuck then the holder of the lock cannot make
> >> > forward progress, and it is effectively deadlocked because other I/O
> >> > depends on the lock it holds.
> >>
> >> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
> >> so there is plenty of room for other in-flight I/Os to finish and move
> >> the allocator forward. The lock in question is per-ceph_connection
> >> (read: per-OSD).
> >>
> >> > Maybe I just ask bad questions, but what makes GFP_NOIO different from
> >> > GFP_KERNEL here? We know that the latter might need to wait for an I/O
> >> > to finish in the shrinker, but it itself doesn't take the lock in
> >> > question directly. The former depends on the allocator's forward
> >> > progress as well, and that in turn waits for somebody else to proceed
> >> > with the I/O. So to me, any blocking allocation while holding a lock
> >> > which blocks further I/O from completing is simply broken.
> >>
> >> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
> >> a blocking allocation, at least in the general case. With GFP_KERNEL
> >> we deadlock, either in rbd/libceph (less likely) or in the filesystem
> >> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
> >> omitted in your quote).
> >
> > I am not convinced. It seems you are relying on something that is not
> > fundamentally guaranteed. AFAIU, all the I/O paths should _guarantee_
> > forward progress and use mempools for that purpose if they need to
> > allocate.
> >
> > But, hey, I will not argue, as my understanding of ceph is close to
> > zero. You are the maintainer, so it is your call. I would just really
> > appreciate it if you could document this as much as possible (ideally
> > at the place where you call memalloc_noio_save, and describe the lock
> > dependency there).
>
> It's certainly not perfect (especially this socket case -- putting
> together a pool of sockets is not easy) and I'm sure one could poke
> some holes in the entire thing,
I would recommend testing under heavy memory pressure (involving OOM
killer invocations) with a lot of I/O pressure to see what falls out.
> but I'm convinced we are much better
> off with the memalloc_noio_{save,restore}() pair in there.
>
> I'll try to come up with a better comment, but the problem is that it
> can be an arbitrary lock in an arbitrary filesystem, not just libceph's
> con->mutex, so it's hard to be specific.
But the particular path should describe what the deadlock scenario is,
regardless of the FS (xfs is likely not the only one that waits for
I/O to finish).
> Do I have your OK to poke Greg to get the backports going?
As I've said, it's your call; if you feel comfortable with this then I
will certainly not stand in the way.
--
Michal Hocko
SUSE Labs
On Tue, 2017-03-28 at 14:30 +0200, Greg Kroah-Hartman wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Adrian Hunter <[email protected]>
>
> commit e2ebfb2142acefecc2496e71360f50d25726040b upstream.
>
> Disabling interrupts for even a millisecond can cause problems for some
> devices. That can happen when sdhci changes clock frequency because it
> waits for the clock to become stable under a spin lock.
>
> The spin lock is not necessary here. Anything that is racing with changes
> to the I/O state is already broken. The mmc core already provides
> synchronization via "claiming" the host.
[...]
In mainline, drivers/mmc/host/sdhci-of-at91.c has a slightly different
version of this code that seems to have the same issue. In 4.4 there's
another (conditional) mdelay(1) further up this function that seems to
be related to that hardware, and probably ought to have an unlock/lock
around it.
Ben.
--
Ben Hutchings
Software Developer, Codethink Ltd.
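For what it's worth, the change being suggested for that remaining
mdelay(1) would presumably follow the usual unlock/delay/relock pattern,
sketched here against a stand-in host structure rather than the actual
sdhci code:

#include <linux/delay.h>
#include <linux/spinlock.h>

struct example_host {
	spinlock_t lock;
};

static void example_set_clock(struct example_host *host)
{
	unsigned long flags;

	spin_lock_irqsave(&host->lock, flags);
	/* ... program the clock divider ... */

	/*
	 * The controller needs about 1 ms before the clock is stable;
	 * don't keep interrupts disabled for that long.
	 */
	spin_unlock_irqrestore(&host->lock, flags);
	mdelay(1);
	spin_lock_irqsave(&host->lock, flags);

	/* ... re-check state and finish the clock change ... */
	spin_unlock_irqrestore(&host->lock, flags);
}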
On Tue, 2017-03-28 at 14:31 +0200, Greg Kroah-Hartman wrote:
[...]
>  static void serial8250_io_resume(struct pci_dev *dev)
>  {
>  	struct serial_private *priv = pci_get_drvdata(dev);
> +	const struct pciserial_board *board;
>  
> -	if (priv)
> -		pciserial_resume_ports(priv);
> +	if (!priv)
> +		return;
> +
> +	board = priv->board;
> +	kfree(priv);
> +	priv = pciserial_init_ports(dev, board);
> +
> +	if (!IS_ERR(priv)) {
> +		pci_set_drvdata(dev, priv);
> +	}
>  }
On error, this leaves drvdata as a dangling pointer. Removing the
device or driver will then cause a use-after-free. (And setting drvdata
to NULL isn't enough to fix this as there is no null pointer check in
pciserial_remove_ports().)
Ben.
--
Ben Hutchings
Software Developer, Codethink Ltd.
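One possible shape for a follow-up fix, sketched below as it would sit
inside drivers/tty/serial/8250/8250_pci.c (a sketch only, not an actual
submitted patch): clear drvdata when re-initialization fails, and, as
noted above, teach the removal path to tolerate a NULL pointer:

/* relies on the declarations already available inside 8250_pci.c */
#include <linux/err.h>
#include <linux/pci.h>
#include <linux/slab.h>

static void serial8250_io_resume(struct pci_dev *dev)
{
	struct serial_private *priv = pci_get_drvdata(dev);
	const struct pciserial_board *board;
	struct serial_private *new;

	if (!priv)
		return;

	board = priv->board;
	kfree(priv);
	new = pciserial_init_ports(dev, board);
	if (IS_ERR(new)) {
		/* don't leave a dangling pointer for remove/detach */
		pci_set_drvdata(dev, NULL);
		return;
	}
	pci_set_drvdata(dev, new);
}

pciserial_remove_ports() (or the driver's remove callback) would then
also need an explicit NULL check before touching priv.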
On Tue, Apr 04, 2017 at 05:50:50PM +0100, Ben Hutchings wrote:
> On Tue, 2017-03-28 at 14:30 +0200, Greg Kroah-Hartman wrote:
> > 4.4-stable review patch. If anyone has any objections, please let me know.
> >
> > ------------------
> >
> > From: Adrian Hunter <[email protected]>
> >
> > commit e2ebfb2142acefecc2496e71360f50d25726040b upstream.
> >
> > Disabling interrupts for even a millisecond can cause problems for some
> > devices. That can happen when sdhci changes clock frequency because it
> > waits for the clock to become stable under a spin lock.
> >
> > The spin lock is not necessary here. Anything that is racing with changes
> > to the I/O state is already broken. The mmc core already provides
> > synchronization via "claiming" the host.
> [...]
>
> In mainline, drivers/mmc/host/sdhci-of-at91.c has a slightly different
> version of this code that seems to have the same issue. In 4.4 there's
> another (conditional) mdelay(1) further up this function that seems to
> be related to that hardware, and probably ought to have an unlock/lock
> around it.
Right, how do you want to proceed? Do you want me to send a patch on top
of it to handle this extra mdelay?
Regards
Ludovic
On Thu, 2017-04-06 at 14:12 +0200, Ludovic Desroches wrote:
> On Tue, Apr 04, 2017 at 05:50:50PM +0100, Ben Hutchings wrote:
> > On Tue, 2017-03-28 at 14:30 +0200, Greg Kroah-Hartman wrote:
> > > 4.4-stable review patch. If anyone has any objections, please let me know.
> > >
> > > ------------------
> > >
> > > From: Adrian Hunter <[email protected]>
> > >
> > > commit e2ebfb2142acefecc2496e71360f50d25726040b upstream.
> > >
> > > Disabling interrupts for even a millisecond can cause problems for some
> > > devices. That can happen when sdhci changes clock frequency because it
> > > waits for the clock to become stable under a spin lock.
> > >
> > > The spin lock is not necessary here. Anything that is racing with changes
> > > to the I/O state is already broken. The mmc core already provides
> > > synchronization via "claiming" the host.
> > [...]
> >
> > In mainline, drivers/mmc/host/sdhci-of-at91.c has a slightly different
> > version of this code that seems to have the same issue. In 4.4 there's
> > another (conditional) mdelay(1) further up this function that seems to
> > be related to that hardware, and probably ought to have an unlock/lock
> > around it.
>
> Right, how do you want to proceed? Do you want me to send a patch on top
> of it to handle this extra mdelay?
This change doesn't appear to break anything; I'm just saying that it's
an incomplete fix. The other case where there's a delay with IRQs
disabled should be fixed with an additional patch.
Ben.
--
Ben Hutchings
Software Developer, Codethink Ltd.