2017-06-05 16:28:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 000/115] 4.11.4-stable review

This is the start of the stable review cycle for the 4.11.4 release.
There are 115 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.

Responses should be made by Wed Jun 7 15:30:31 UTC 2017.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.11.4-rc1.gz
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.11.y
and the diffstat can be found below.

thanks,

greg k-h

-------------
Pseudo-Shortlog of commits:

Greg Kroah-Hartman <[email protected]>
Linux 4.11.4-rc1

Jan Kara <[email protected]>
xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()

Eric Sandeen <[email protected]>
xfs: fix unaligned access in xfs_btree_visit_blocks

Darrick J. Wong <[email protected]>
xfs: avoid mount-time deadlock in CoW extent recovery

Christoph Hellwig <[email protected]>
xfs: xfs_trans_alloc_empty

Zorro Lang <[email protected]>
xfs: bad assertion for delalloc an extent that start at i_size

Darrick J. Wong <[email protected]>
xfs: BMAPX shouldn't barf on inline-format directories

Brian Foster <[email protected]>
xfs: fix indlen accounting error on partial delalloc conversion

Eryu Guan <[email protected]>
xfs: fix use-after-free in xfs_finish_page_writeback

Darrick J. Wong <[email protected]>
xfs: reserve enough blocks to handle btree splits when remapping

Brian Foster <[email protected]>
xfs: wait on new inodes during quotaoff dquot release

Brian Foster <[email protected]>
xfs: update ag iterator to support wait on new inodes

Brian Foster <[email protected]>
xfs: support ability to wait on new inodes

Brian Foster <[email protected]>
xfs: fix up quotacheck buffer list error handling

Brian Foster <[email protected]>
xfs: prevent multi-fsb dir readahead from reading random blocks

Eric Sandeen <[email protected]>
xfs: handle array index overrun in xfs_dir2_leaf_readbuf()

Christoph Hellwig <[email protected]>
xfs: fix integer truncation in xfs_bmap_remap_alloc

Brian Foster <[email protected]>
xfs: drop iolock from reclaim context to appease lockdep

Darrick J. Wong <[email protected]>
xfs: actually report xattr extents via iomap

Darrick J. Wong <[email protected]>
xfs: fix over-copying of getbmap parameters from userspace

Brian Foster <[email protected]>
xfs: use dedicated log worker wq to avoid deadlock with cil wq

Eryu Guan <[email protected]>
xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()

Brian Foster <[email protected]>
xfs: use ->b_state to fix buffer I/O accounting release race

Jan Kara <[email protected]>
xfs: Fix missed holes in SEEK_HOLE implementation

Patrik Jakobsson <[email protected]>
drm/gma500/psb: Actually use VBT mode when it is found

Thomas Gleixner <[email protected]>
slub/memcg: cure the brainless abuse of sysfs attributes

Andrea Arcangeli <[email protected]>
ksm: prevent crash after write_protect_page fails

Rob Landley <[email protected]>
x86/boot: Use CROSS_COMPILE prefix for readelf

Imre Deak <[email protected]>
PCI/PM: Add needs_resume flag to avoid suspend complete optimization

Mike Marciniszyn <[email protected]>
RDMA/qib,hfi1: Fix MR reference count leak on write with immediate

Israel Rukshin <[email protected]>
RDMA/srp: Fix NULL deref at srp_destroy_qp()

Michal Hocko <[email protected]>
mm: consider memblock reservations for deferred memory initialization sizing

James Morse <[email protected]>
mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified

Yisheng Xie <[email protected]>
mlock: fix mlock count can not decrease in race condition

Punit Agrawal <[email protected]>
mm/migrate: fix refcount handling when !hugepage_migration_supported()

Ross Zwisler <[email protected]>
dax: fix race between colliding PMD & PTE entries

Ross Zwisler <[email protected]>
mm: avoid spurious 'bad pmd' warning messages

Tetsuo Handa <[email protected]>
mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once

Takashi Iwai <[email protected]>
ALSA: usb: Fix a typo in Tascam US-16x08 mixer element

Takashi Iwai <[email protected]>
Revert "ALSA: usb-audio: purge needless variable length array"

Alexander Tsoy <[email protected]>
ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430

Takashi Iwai <[email protected]>
ALSA: hda - No loopback on ALC299 codec

Nicolas Iooss <[email protected]>
pcmcia: remove left-over %Z format

Lyude <[email protected]>
drm/radeon: Unbreak HPD handling for r600+

Alex Deucher <[email protected]>
drm/radeon/ci: disable mclk switching for high refresh rates (v2)

Alex Deucher <[email protected]>
drm/amd/powerplay/smu7: disable mclk switching for high refresh rates

Alex Deucher <[email protected]>
drm/amd/powerplay/smu7: add vblank check for mclk switching (v2)

Ming Lei <[email protected]>
nvme: avoid to use blk_mq_abort_requeue_list()

Ming Lei <[email protected]>
nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()

Marta Rybczynska <[email protected]>
nvme-rdma: support devices with queue size < 32

Jason Gerecke <[email protected]>
HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference

Bryant G. Ly <[email protected]>
ibmvscsis: Fix the incorrect req_lim_delta

Bryant G. Ly <[email protected]>
ibmvscsis: Clear left-over abort_cmd pointers

Artem Savkov <[email protected]>
scsi: scsi_dh_rdac: Use ctlr directly in rdac_failover_get()

Nicholas Bellinger <[email protected]>
iscsi-target: Fix initial login PDU asynchronous socket close OOPs

Jiang Yi <[email protected]>
iscsi-target: Always wait for kthread_should_stop() before kthread exit

Long Li <[email protected]>
scsi: zero per-cmd private driver data for each MQ I/O

Srinath Mannam <[email protected]>
mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read

Benjamin Tissoires <[email protected]>
Revert "ACPI / button: Change default behavior to lid_init_state=open"

Lv Zheng <[email protected]>
ACPICA: Tables: Fix regression introduced by a too early mechanism enabling

Dan Williams <[email protected]>
ACPI / sysfs: fix acpi_get_table() leak / acpi-sysfs denial of service

Vishal Verma <[email protected]>
acpi, nfit: Fix the memory error check in nfit_handle_mce()

Borislav Petkov <[email protected]>
x86/MCE: Export memory_error()

Lv Zheng <[email protected]>
Revert "ACPI / button: Remove lid_init_state=method mode"

Herbert Xu <[email protected]>
crypto: skcipher - Add missing API setkey checks

Sebastian Reichel <[email protected]>
i2c: i2c-tiny-usb: fix buffer not being DMA capable

Ard Biesheuvel <[email protected]>
drivers/tty: 8250: only call fintek_8250_probe when doing port I/O

Johan Hovold <[email protected]>
serdev: fix tty-port client deregistration

Johan Hovold <[email protected]>
Revert "tty_port: register tty ports with serdev bus"

Jeremy Kerr <[email protected]>
powerpc/spufs: Fix hash faults for kernel regions

Michael Neuling <[email protected]>
powerpc: Fix booting P9 hash with CONFIG_PPC_RADIX_MMU=N

Richard Narron <[email protected]>
fs/ufs: Set UFS default maximum bytes per file

Liam R. Howlett <[email protected]>
sparc/ftrace: Fix ftrace graph time measurement

Orlando Arias <[email protected]>
sparc: Fix -Wstringop-overflow warning

Nitin Gupta <[email protected]>
sparc64: Fix mapping of 64k pages with MAP_FIXED

Daniel Borkmann <[email protected]>
bpf: adjust verifier heuristics

Daniel Borkmann <[email protected]>
bpf: fix wrong exposure of map_flags into fdinfo for lpm

Daniel Borkmann <[email protected]>
bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data

Eric Dumazet <[email protected]>
ipv4: add reference counting to metrics

Peter Dawson <[email protected]>
ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets

Davide Caratti <[email protected]>
sctp: fix ICMP processing if skb is non-linear

Wei Wang <[email protected]>
tcp: avoid fastopen API to be used on AF_UNSPEC

Eric Garver <[email protected]>
geneve: fix fill_info when using collect_metadata

Vlad Yasevich <[email protected]>
virtio-net: enable TSO/checksum offloads for Q-in-Q vlans

Vlad Yasevich <[email protected]>
be2net: Fix offload features for Q-in-Q packets

Vlad Yasevich <[email protected]>
vlan: Fix tcp checksum offloads in Q-in-Q vlans

Andrew Lunn <[email protected]>
net: phy: marvell: Limit errata to 88m1101

Mohamad Haj Yahia <[email protected]>
net/mlx5: Avoid using pending command interface slots

Jarod Wilson <[email protected]>
bonding: fix accounting of active ports in 3ad

Eric Dumazet <[email protected]>
ipv6: fix out of bound writes in __ip6_append_data()

Xin Long <[email protected]>
bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

Bjørn Mork <[email protected]>
qmi_wwan: add another Lenovo EM74xx device ID

Tobias Jungel <[email protected]>
bridge: netlink: check vlan_default_pvid range

David S. Miller <[email protected]>
ipv6: Check ip6_find_1stfragopt() return value properly.

Craig Gallek <[email protected]>
ipv6: Prevent overrun when parsing v6 header options

David Ahern <[email protected]>
net: Improve handling of failures on link and route dumps

Christoph Hellwig <[email protected]>
net/smc: Add warning about remote memory exposure

Ursula Braun <[email protected]>
smc: switch to usage of IB_PD_UNSAFE_GLOBAL_RKEY

Soheil Hassas Yeganeh <[email protected]>
tcp: eliminate negative reordering in tcp_clean_rtx_queue

Gal Pressman <[email protected]>
net/mlx5e: Fix ethtool pause support and advertise reporting

Gal Pressman <[email protected]>
net/mlx5e: Use the correct pause values for ethtool advertising

Douglas Caetano dos Santos <[email protected]>
net/packet: fix missing net_device reference release

Eric Dumazet <[email protected]>
sctp: do not inherit ipv6_{mc|ac|fl}_list from parent

Xin Long <[email protected]>
sctp: fix src address selection if using secondary addresses for ipv6

Jon Paul Maloy <[email protected]>
tipc: make macro tipc_wait_for_cond() smp safe

Yuchung Cheng <[email protected]>
tcp: avoid fragmenting peculiar skbs in SACK

Eric Dumazet <[email protected]>
net: fix compile error in skb_orphan_partial()

Eric Dumazet <[email protected]>
netem: fix skb_orphan_partial()

Daniel Borkmann <[email protected]>
bpf, arm64: fix faulty emission of map access in tail calls

Ursula Braun <[email protected]>
s390/qeth: add missing hash table initializations

Julian Wiedmann <[email protected]>
s390/qeth: avoid null pointer dereference on OSN

Julian Wiedmann <[email protected]>
s390/qeth: unbreak OSM and OSN support

Ursula Braun <[email protected]>
s390/qeth: handle sysfs error during initialization

WANG Cong <[email protected]>
ipv6/dccp: do not inherit ipv6_mc_list from parent

Gao Feng <[email protected]>
driver: vrf: Fix one possible use-after-free issue

Eric Dumazet <[email protected]>
dccp/tcp: do not inherit mc_list from parent


-------------

Diffstat:

Documentation/acpi/acpi-lid.txt | 16 +-
Makefile | 4 +-
arch/arm64/net/bpf_jit_comp.c | 5 +-
arch/powerpc/kernel/prom.c | 2 +
arch/powerpc/platforms/cell/spu_base.c | 4 +-
arch/sparc/include/asm/hugetlb.h | 6 +-
arch/sparc/include/asm/pgtable_32.h | 4 +-
arch/sparc/include/asm/setup.h | 2 +-
arch/sparc/kernel/ftrace.c | 13 +-
arch/sparc/mm/init_32.c | 2 +-
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 11 +-
crypto/skcipher.c | 40 ++++-
drivers/acpi/acpica/tbutils.c | 4 -
drivers/acpi/button.c | 11 +-
drivers/acpi/nfit/mce.c | 2 +-
drivers/acpi/sysfs.c | 7 +-
drivers/char/pcmcia/cm4040_cs.c | 6 +-
drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c | 32 +++-
drivers/gpu/drm/gma500/psb_intel_lvds.c | 18 +-
drivers/gpu/drm/radeon/ci_dpm.c | 6 +
drivers/gpu/drm/radeon/cik.c | 4 +-
drivers/gpu/drm/radeon/evergreen.c | 4 +-
drivers/gpu/drm/radeon/r600.c | 2 +-
drivers/gpu/drm/radeon/si.c | 4 +-
drivers/hid/wacom_wac.c | 45 ++---
drivers/i2c/busses/i2c-tiny-usb.c | 25 ++-
drivers/infiniband/hw/hfi1/rc.c | 5 +-
drivers/infiniband/hw/qib/qib_rc.c | 4 +-
drivers/infiniband/ulp/srp/ib_srp.c | 2 +-
drivers/mmc/host/sdhci-iproc.c | 3 +-
drivers/net/bonding/bond_3ad.c | 2 +-
drivers/net/ethernet/emulex/benet/be_main.c | 4 +-
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 41 ++++-
.../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 9 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 +-
drivers/net/geneve.c | 8 +-
drivers/net/phy/marvell.c | 66 ++++---
drivers/net/usb/qmi_wwan.c | 2 +
drivers/net/virtio_net.c | 1 +
drivers/net/vrf.c | 3 +-
drivers/nvme/host/core.c | 13 +-
drivers/nvme/host/rdma.c | 18 +-
drivers/pci/pci.c | 3 +-
drivers/s390/net/qeth_core.h | 4 +
drivers/s390/net/qeth_core_main.c | 21 ++-
drivers/s390/net/qeth_core_sys.c | 24 ++-
drivers/s390/net/qeth_l2.h | 2 +
drivers/s390/net/qeth_l2_main.c | 26 ++-
drivers/s390/net/qeth_l2_sys.c | 8 +
drivers/s390/net/qeth_l3_main.c | 8 +-
drivers/scsi/device_handler/scsi_dh_rdac.c | 10 +-
drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 27 ++-
drivers/scsi/scsi_lib.c | 2 +-
drivers/target/iscsi/iscsi_target.c | 30 +++-
drivers/target/iscsi/iscsi_target_erl0.c | 6 +-
drivers/target/iscsi/iscsi_target_erl0.h | 2 +-
drivers/target/iscsi/iscsi_target_login.c | 4 +
drivers/target/iscsi/iscsi_target_nego.c | 194 ++++++++++++++-------
drivers/tty/serdev/serdev-ttyport.c | 15 +-
drivers/tty/serial/8250/8250_port.c | 2 +-
drivers/tty/tty_port.c | 12 --
fs/dax.c | 23 +++
fs/ufs/super.c | 5 +-
fs/xfs/libxfs/xfs_bmap.c | 9 +-
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/libxfs/xfs_refcount.c | 43 +++--
fs/xfs/libxfs/xfs_trans_space.h | 23 ++-
fs/xfs/xfs_aops.c | 4 +-
fs/xfs/xfs_bmap_item.c | 5 +-
fs/xfs/xfs_bmap_util.c | 18 +-
fs/xfs/xfs_buf.c | 62 +++++--
fs/xfs/xfs_buf.h | 6 +-
fs/xfs/xfs_dir2_readdir.c | 15 +-
fs/xfs/xfs_file.c | 33 ++--
fs/xfs/xfs_icache.c | 58 +++++-
fs/xfs/xfs_icache.h | 8 +
fs/xfs/xfs_inode.c | 9 +-
fs/xfs/xfs_inode.h | 4 +-
fs/xfs/xfs_ioctl.c | 5 +-
fs/xfs/xfs_iomap.c | 4 +-
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_qm.c | 7 +-
fs/xfs/xfs_qm_syscalls.c | 3 +-
fs/xfs/xfs_reflink.c | 18 +-
fs/xfs/xfs_super.c | 8 +
fs/xfs/xfs_trans.c | 22 +++
fs/xfs/xfs_trans.h | 2 +
include/linux/if_vlan.h | 18 +-
include/linux/memblock.h | 8 +
include/linux/mlx5/driver.h | 7 +-
include/linux/mm.h | 11 ++
include/linux/mmzone.h | 1 +
include/linux/pci.h | 5 +
include/net/dst.h | 8 +-
include/net/ip_fib.h | 10 +-
include/target/iscsi/iscsi_target_core.h | 1 +
kernel/bpf/arraymap.c | 1 +
kernel/bpf/lpm_trie.c | 1 +
kernel/bpf/stackmap.c | 1 +
kernel/bpf/verifier.c | 12 +-
mm/gup.c | 20 +--
mm/hugetlb.c | 5 +
mm/ksm.c | 3 +-
mm/memblock.c | 23 +++
mm/memory-failure.c | 8 +-
mm/memory.c | 40 +++--
mm/mlock.c | 5 +-
mm/page_alloc.c | 37 ++--
mm/slub.c | 6 +-
net/bridge/br_netlink.c | 7 +
net/bridge/br_stp_if.c | 1 +
net/bridge/br_stp_timer.c | 2 +-
net/core/dst.c | 23 ++-
net/core/filter.c | 1 +
net/core/rtnetlink.c | 36 ++--
net/core/sock.c | 23 +--
net/dccp/ipv6.c | 6 +
net/ipv4/fib_frontend.c | 15 +-
net/ipv4/fib_semantics.c | 17 +-
net/ipv4/fib_trie.c | 26 +--
net/ipv4/inet_connection_sock.c | 2 +
net/ipv4/route.c | 10 +-
net/ipv4/tcp.c | 7 +-
net/ipv4/tcp_input.c | 11 +-
net/ipv6/ip6_gre.c | 13 +-
net/ipv6/ip6_offload.c | 7 +-
net/ipv6/ip6_output.c | 20 ++-
net/ipv6/ip6_tunnel.c | 21 ++-
net/ipv6/output_core.c | 14 +-
net/ipv6/tcp_ipv6.c | 2 +
net/ipv6/udp_offload.c | 6 +-
net/packet/af_packet.c | 14 +-
net/sctp/input.c | 16 +-
net/sctp/ipv6.c | 49 ++++--
net/smc/Kconfig | 4 +
net/smc/smc_clc.c | 4 +-
net/smc/smc_core.c | 16 +-
net/smc/smc_core.h | 2 +-
net/smc/smc_ib.c | 21 +--
net/smc/smc_ib.h | 2 -
net/tipc/socket.c | 38 ++--
sound/pci/hda/patch_realtek.c | 3 +
sound/pci/hda/patch_sigmatel.c | 2 +
sound/usb/mixer_us16x08.c | 10 +-
148 files changed, 1323 insertions(+), 625 deletions(-)



2017-06-05 16:28:00

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 013/115] sctp: fix src address selection if using secondary addresses for ipv6

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Xin Long <[email protected]>


[ Upstream commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 ]

Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary
addresses") has fixed a src address selection issue when using secondary
addresses for ipv4.

Now sctp ipv6 also has the similar issue. When using a secondary address,
sctp_v6_get_dst tries to choose the saddr which has the most same bits
with the daddr by sctp_v6_addr_match_len. It may make some cases not work
as expected.

hostA:
[1] fd21:356b:459a:cf10::11 (eth1)
[2] fd21:356b:459a:cf20::11 (eth2)

hostB:
[a] fd21:356b:459a:cf30::2 (eth1)
[b] fd21:356b:459a:cf40::2 (eth2)

route from hostA to hostB:
fd21:356b:459a:cf30::/64 dev eth1 metric 1024 mtu 1500

The expected path should be:
fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
But addr[2] matches addr[a] more bits than addr[1] does, according to
sctp_v6_addr_match_len. It causes the path to be:
fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2

This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
if the saddr is in a dev instead.

Note that for backwards compatibility, it will still do the addr_match_len
check here when no optimal is found.

Reported-by: Patrick Talbert <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/sctp/ipv6.c | 46 +++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)

--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -240,12 +240,10 @@ static void sctp_v6_get_dst(struct sctp_
struct sctp_bind_addr *bp;
struct ipv6_pinfo *np = inet6_sk(sk);
struct sctp_sockaddr_entry *laddr;
- union sctp_addr *baddr = NULL;
union sctp_addr *daddr = &t->ipaddr;
union sctp_addr dst_saddr;
struct in6_addr *final_p, final;
__u8 matchlen = 0;
- __u8 bmatchlen;
sctp_scope_t scope;

memset(fl6, 0, sizeof(struct flowi6));
@@ -312,23 +310,37 @@ static void sctp_v6_get_dst(struct sctp_
*/
rcu_read_lock();
list_for_each_entry_rcu(laddr, &bp->address_list, list) {
- if (!laddr->valid)
+ struct dst_entry *bdst;
+ __u8 bmatchlen;
+
+ if (!laddr->valid ||
+ laddr->state != SCTP_ADDR_SRC ||
+ laddr->a.sa.sa_family != AF_INET6 ||
+ scope > sctp_scope(&laddr->a))
continue;
- if ((laddr->state == SCTP_ADDR_SRC) &&
- (laddr->a.sa.sa_family == AF_INET6) &&
- (scope <= sctp_scope(&laddr->a))) {
- bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
- if (!baddr || (matchlen < bmatchlen)) {
- baddr = &laddr->a;
- matchlen = bmatchlen;
- }
- }
- }
- if (baddr) {
- fl6->saddr = baddr->v6.sin6_addr;
- fl6->fl6_sport = baddr->v6.sin6_port;
+
+ fl6->saddr = laddr->a.v6.sin6_addr;
+ fl6->fl6_sport = laddr->a.v6.sin6_port;
final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &final);
- dst = ip6_dst_lookup_flow(sk, fl6, final_p);
+ bdst = ip6_dst_lookup_flow(sk, fl6, final_p);
+
+ if (!IS_ERR(bdst) &&
+ ipv6_chk_addr(dev_net(bdst->dev),
+ &laddr->a.v6.sin6_addr, bdst->dev, 1)) {
+ if (!IS_ERR_OR_NULL(dst))
+ dst_release(dst);
+ dst = bdst;
+ break;
+ }
+
+ bmatchlen = sctp_v6_addr_match_len(daddr, &laddr->a);
+ if (matchlen > bmatchlen)
+ continue;
+
+ if (!IS_ERR_OR_NULL(dst))
+ dst_release(dst);
+ dst = bdst;
+ matchlen = bmatchlen;
}
rcu_read_unlock();



2017-06-05 16:28:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 014/115] sctp: do not inherit ipv6_{mc|ac|fl}_list from parent

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <[email protected]>


[ Upstream commit fdcee2cbb8438702ea1b328fb6e0ac5e9a40c7f8 ]

SCTP needs fixes similar to 83eaddab4378 ("ipv6/dccp: do not inherit
ipv6_mc_list from parent"), otherwise bad things can happen.

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/sctp/ipv6.c | 3 +++
1 file changed, 3 insertions(+)

--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -677,6 +677,9 @@ static struct sock *sctp_v6_create_accep
newnp = inet6_sk(newsk);

memcpy(newnp, np, sizeof(struct ipv6_pinfo));
+ newnp->ipv6_mc_list = NULL;
+ newnp->ipv6_ac_list = NULL;
+ newnp->ipv6_fl_list = NULL;

rcu_read_lock();
opt = rcu_dereference(np->opt);


2017-06-05 16:28:12

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 002/115] driver: vrf: Fix one possible use-after-free issue

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Gao Feng <[email protected]>


[ Upstream commit 1a4a5bf52a4adb477adb075e5afce925824ad132 ]

The current codes only deal with the case that the skb is dropped, it
may meet one use-after-free issue when NF_HOOK returns 0 that means
the skb is stolen by one netfilter rule or hook.

When one netfilter rule or hook stoles the skb and return NF_STOLEN,
it means the skb is taken by the rule, and other modules should not
touch this skb ever. Maybe the skb is queued or freed directly by the
rule.

Now uses the nf_hook instead of NF_HOOK to get the result of netfilter,
and check the return value of nf_hook. Only when its value equals 1, it
means the skb could go ahead. Or reset the skb as NULL.

BTW, because vrf_rcv_finish is empty function, so needn't invoke it
even though nf_hook returns 1. But we need to modify vrf_rcv_finish
to deal with the NF_STOLEN case.

There are two cases when skb is stolen.
1. The skb is stolen and freed directly.
There is nothing we need to do, and vrf_rcv_finish isn't invoked.
2. The skb is queued and reinjected again.
The vrf_rcv_finish would be invoked as okfn, so need to free the
skb in it.

Signed-off-by: Gao Feng <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/vrf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -851,6 +851,7 @@ static u32 vrf_fib_table(const struct ne

static int vrf_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
+ kfree_skb(skb);
return 0;
}

@@ -860,7 +861,7 @@ static struct sk_buff *vrf_rcv_nfhook(u8
{
struct net *net = dev_net(dev);

- if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
+ if (nf_hook(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) != 1)
skb = NULL; /* kfree_skb(skb) handled by nf code */

return skb;


2017-06-05 16:28:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 006/115] s390/qeth: avoid null pointer dereference on OSN

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Julian Wiedmann <[email protected]>


[ Upstream commit 25e2c341e7818a394da9abc403716278ee646014 ]

Access card->dev only after checking whether's its valid.

Signed-off-by: Julian Wiedmann <[email protected]>
Reviewed-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/s390/net/qeth_l2_main.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -1091,7 +1091,6 @@ static int qeth_l2_setup_netdev(struct q
case QETH_CARD_TYPE_OSN:
card->dev = alloc_netdev(0, "osn%d", NET_NAME_UNKNOWN,
ether_setup);
- card->dev->flags |= IFF_NOARP;
break;
default:
card->dev = alloc_etherdev(0);
@@ -1106,9 +1105,12 @@ static int qeth_l2_setup_netdev(struct q
card->dev->min_mtu = 64;
card->dev->max_mtu = ETH_MAX_MTU;
card->dev->netdev_ops = &qeth_l2_netdev_ops;
- card->dev->ethtool_ops =
- (card->info.type != QETH_CARD_TYPE_OSN) ?
- &qeth_l2_ethtool_ops : &qeth_l2_osn_ops;
+ if (card->info.type == QETH_CARD_TYPE_OSN) {
+ card->dev->ethtool_ops = &qeth_l2_osn_ops;
+ card->dev->flags |= IFF_NOARP;
+ } else {
+ card->dev->ethtool_ops = &qeth_l2_ethtool_ops;
+ }
card->dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
if (card->info.type == QETH_CARD_TYPE_OSD && !card->info.guestlan) {
card->dev->hw_features = NETIF_F_SG;


2017-06-05 16:28:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 016/115] net/mlx5e: Use the correct pause values for ethtool advertising

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Gal Pressman <[email protected]>


[ Upstream commit b383b544f2666d67446b951a9a97af239dafed5d ]

Query the operational pause from firmware (PFCC register) instead of
always passing zeros.

Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <[email protected]>
Cc: [email protected]
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -828,6 +828,8 @@ static int mlx5e_get_link_ksettings(stru
struct mlx5e_priv *priv = netdev_priv(netdev);
struct mlx5_core_dev *mdev = priv->mdev;
u32 out[MLX5_ST_SZ_DW(ptys_reg)] = {0};
+ u32 rx_pause = 0;
+ u32 tx_pause = 0;
u32 eth_proto_cap;
u32 eth_proto_admin;
u32 eth_proto_lp;
@@ -850,11 +852,13 @@ static int mlx5e_get_link_ksettings(stru
an_disable_admin = MLX5_GET(ptys_reg, out, an_disable_admin);
an_status = MLX5_GET(ptys_reg, out, an_status);

+ mlx5_query_port_pause(mdev, &rx_pause, &tx_pause);
+
ethtool_link_ksettings_zero_link_mode(link_ksettings, supported);
ethtool_link_ksettings_zero_link_mode(link_ksettings, advertising);

get_supported(eth_proto_cap, link_ksettings);
- get_advertising(eth_proto_admin, 0, 0, link_ksettings);
+ get_advertising(eth_proto_admin, tx_pause, rx_pause, link_ksettings);
get_speed_duplex(netdev, eth_proto_oper, link_ksettings);

eth_proto_oper = eth_proto_oper ? eth_proto_oper : eth_proto_cap;


2017-06-05 16:28:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 028/115] bonding: fix accounting of active ports in 3ad

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jarod Wilson <[email protected]>


[ Upstream commit 751da2a69b7cc82d83dc310ed7606225f2d6e014 ]

As of 7bb11dc9f59d and 0622cab0341c, bond slaves in a 3ad bond are not
removed from the aggregator when they are down, and the active slave count
is NOT equal to number of ports in the aggregator, but rather the number
of ports in the aggregator that are still enabled. The sysfs spew for
bonding_show_ad_num_ports() has a comment that says "Show number of active
802.3ad ports.", but it's currently showing total number of ports, both
active and inactive. Remedy it by using the same logic introduced in
0622cab0341c in __bond_3ad_get_active_agg_info(), so sysfs, procfs and
netlink all report the number of active ports. Note that this means that
IFLA_BOND_AD_INFO_NUM_PORTS really means NUM_ACTIVE_PORTS instead of
NUM_PORTS, and thus perhaps should be renamed for clarity.

Lightly tested on a dual i40e lacp bond, simulating link downs with an ip
link set dev <slave2> down, was able to produce the state where I could
see both in the same aggregator, but a number of ports count of 1.

MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2 <---
Slave Interface: ens10
MII Status: up <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1

MII Status: up
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1 <---
Slave Interface: ens10
MII Status: down <---
Aggregator ID: 1
Slave Interface: ens11
MII Status: up
Aggregator ID: 1

CC: Jay Vosburgh <[email protected]>
CC: Veaceslav Falico <[email protected]>
CC: Andy Gospodarek <[email protected]>
CC: [email protected]
Signed-off-by: Jarod Wilson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/bonding/bond_3ad.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2573,7 +2573,7 @@ int __bond_3ad_get_active_agg_info(struc
return -1;

ad_info->aggregator_id = aggregator->aggregator_identifier;
- ad_info->ports = aggregator->num_of_ports;
+ ad_info->ports = __agg_active_ports(aggregator);
ad_info->actor_key = aggregator->actor_oper_aggregator_key;
ad_info->partner_key = aggregator->partner_oper_aggregator_key;
ether_addr_copy(ad_info->partner_system,


2017-06-05 16:28:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 026/115] bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Xin Long <[email protected]>


[ Upstream commit 6d18c732b95c0a9d35e9f978b4438bba15412284 ]

Since commit 76b91c32dd86 ("bridge: stp: when using userspace stp stop
kernel hello and hold timers"), bridge would not start hello_timer if
stp_enabled is not KERNEL_STP when br_dev_open.

The problem is even if users set stp_enabled with KERNEL_STP later,
the timer will still not be started. It causes that KERNEL_STP can
not really work. Users have to re-ifup the bridge to avoid this.

This patch is to fix it by starting br->hello_timer when enabling
KERNEL_STP in br_stp_start.

As an improvement, it's also to start hello_timer again only when
br->stp_enabled is KERNEL_STP in br_hello_timer_expired, there is
no reason to start the timer again when it's NO_STP.

Fixes: 76b91c32dd86 ("bridge: stp: when using userspace stp stop kernel hello and hold timers")
Reported-by: Haidong Li <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Ivan Vecera <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/bridge/br_stp_if.c | 1 +
net/bridge/br_stp_timer.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -179,6 +179,7 @@ static void br_stp_start(struct net_brid
br_debug(br, "using kernel STP\n");

/* To start timers on any ports left in blocking */
+ mod_timer(&br->hello_timer, jiffies + br->hello_time);
br_port_state_selection(br);
}

--- a/net/bridge/br_stp_timer.c
+++ b/net/bridge/br_stp_timer.c
@@ -40,7 +40,7 @@ static void br_hello_timer_expired(unsig
if (br->dev->flags & IFF_UP) {
br_config_bpdu_generation(br);

- if (br->stp_enabled != BR_USER_STP)
+ if (br->stp_enabled == BR_KERNEL_STP)
mod_timer(&br->hello_timer,
round_jiffies(jiffies + br->hello_time));
}


2017-06-05 16:29:00

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 031/115] vlan: Fix tcp checksum offloads in Q-in-Q vlans

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vlad Yasevich <[email protected]>


[ Upstream commit 35d2f80b07bbe03fb358afb0bdeff7437a7d67ff ]

It appears that TCP checksum offloading has been broken for
Q-in-Q vlans. The behavior was execerbated by the
series
commit afb0bc972b52 ("Merge branch 'stacked_vlan_tso'")
that that enabled accleleration features on stacked vlans.

However, event without that series, it is possible to trigger
this issue. It just requires a lot more specialized configuration.

The root cause is the interaction between how
netdev_intersect_features() works, the features actually set on
the vlan devices and HW having the ability to run checksum with
longer headers.

The issue starts when netdev_interesect_features() replaces
NETIF_F_HW_CSUM with a combination of NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM,
if the HW advertises IP|IPV6 specific checksums. This happens
for tagged and multi-tagged packets. However, HW that enables
IP|IPV6 checksum offloading doesn't gurantee that packets with
arbitrarily long headers can be checksummed.

This patch disables IP|IPV6 checksums on the packet for multi-tagged
packets.

CC: Toshiaki Makita <[email protected]>
CC: Michal Kubecek <[email protected]>
Signed-off-by: Vladislav Yasevich <[email protected]>
Acked-by: Toshiaki Makita <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
include/linux/if_vlan.h | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -614,14 +614,16 @@ static inline bool skb_vlan_tagged_multi
static inline netdev_features_t vlan_features_check(const struct sk_buff *skb,
netdev_features_t features)
{
- if (skb_vlan_tagged_multi(skb))
- features = netdev_intersect_features(features,
- NETIF_F_SG |
- NETIF_F_HIGHDMA |
- NETIF_F_FRAGLIST |
- NETIF_F_HW_CSUM |
- NETIF_F_HW_VLAN_CTAG_TX |
- NETIF_F_HW_VLAN_STAG_TX);
+ if (skb_vlan_tagged_multi(skb)) {
+ /* In the case of multi-tagged packets, use a direct mask
+ * instead of using netdev_interesect_features(), to make
+ * sure that only devices supporting NETIF_F_HW_CSUM will
+ * have checksum offloading support.
+ */
+ features &= NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_HW_CSUM |
+ NETIF_F_FRAGLIST | NETIF_F_HW_VLAN_CTAG_TX |
+ NETIF_F_HW_VLAN_STAG_TX;
+ }

return features;
}


2017-06-05 16:29:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 030/115] net: phy: marvell: Limit errata to 88m1101

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Andrew Lunn <[email protected]>


[ Upstream commit f2899788353c13891412b273fdff5f02d49aa40f ]

The 88m1101 has an errata when configuring autoneg. However, it was
being applied to many other Marvell PHYs as well. Limit its scope to
just the 88m1101.

Fixes: 76884679c644 ("phylib: Add support for Marvell 88e1111S and 88e1145")
Reported-by: Daniel Walker <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Acked-by: Harini Katakam <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/phy/marvell.c | 66 +++++++++++++++++++++++++---------------------
1 file changed, 37 insertions(+), 29 deletions(-)

--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -255,34 +255,6 @@ static int marvell_config_aneg(struct ph
{
int err;

- /* The Marvell PHY has an errata which requires
- * that certain registers get written in order
- * to restart autonegotiation */
- err = phy_write(phydev, MII_BMCR, BMCR_RESET);
-
- if (err < 0)
- return err;
-
- err = phy_write(phydev, 0x1d, 0x1f);
- if (err < 0)
- return err;
-
- err = phy_write(phydev, 0x1e, 0x200c);
- if (err < 0)
- return err;
-
- err = phy_write(phydev, 0x1d, 0x5);
- if (err < 0)
- return err;
-
- err = phy_write(phydev, 0x1e, 0);
- if (err < 0)
- return err;
-
- err = phy_write(phydev, 0x1e, 0x100);
- if (err < 0)
- return err;
-
err = marvell_set_polarity(phydev, phydev->mdix_ctrl);
if (err < 0)
return err;
@@ -316,6 +288,42 @@ static int marvell_config_aneg(struct ph
return 0;
}

+static int m88e1101_config_aneg(struct phy_device *phydev)
+{
+ int err;
+
+ /* This Marvell PHY has an errata which requires
+ * that certain registers get written in order
+ * to restart autonegotiation
+ */
+ err = phy_write(phydev, MII_BMCR, BMCR_RESET);
+
+ if (err < 0)
+ return err;
+
+ err = phy_write(phydev, 0x1d, 0x1f);
+ if (err < 0)
+ return err;
+
+ err = phy_write(phydev, 0x1e, 0x200c);
+ if (err < 0)
+ return err;
+
+ err = phy_write(phydev, 0x1d, 0x5);
+ if (err < 0)
+ return err;
+
+ err = phy_write(phydev, 0x1e, 0);
+ if (err < 0)
+ return err;
+
+ err = phy_write(phydev, 0x1e, 0x100);
+ if (err < 0)
+ return err;
+
+ return marvell_config_aneg(phydev);
+}
+
static int m88e1111_config_aneg(struct phy_device *phydev)
{
int err;
@@ -1892,7 +1900,7 @@ static struct phy_driver marvell_drivers
.flags = PHY_HAS_INTERRUPT,
.probe = marvell_probe,
.config_init = &marvell_config_init,
- .config_aneg = &marvell_config_aneg,
+ .config_aneg = &m88e1101_config_aneg,
.read_status = &genphy_read_status,
.ack_interrupt = &marvell_ack_interrupt,
.config_intr = &marvell_config_intr,


2017-06-05 16:29:21

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 019/115] smc: switch to usage of IB_PD_UNSAFE_GLOBAL_RKEY

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ursula Braun <[email protected]>


[ Upstream commit 263eec9b2a82e8697d064709414914b5b10ac538 ]

Currently, SMC enables remote access to physical memory when a user
has successfully configured and established an SMC-connection until ten
minutes after the last SMC connection is closed. Because this is considered
a security risk, drivers are supposed to use IB_PD_UNSAFE_GLOBAL_RKEY in
such a case.

This patch changes the current SMC code to use IB_PD_UNSAFE_GLOBAL_RKEY.
This improves user awareness, but does not remove the security risk itself.

Signed-off-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/smc/smc_clc.c | 4 ++--
net/smc/smc_core.c | 16 +++-------------
net/smc/smc_core.h | 2 +-
net/smc/smc_ib.c | 21 ++-------------------
net/smc/smc_ib.h | 2 --
5 files changed, 8 insertions(+), 37 deletions(-)

--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -204,7 +204,7 @@ int smc_clc_send_confirm(struct smc_sock
memcpy(&cclc.lcl.mac, &link->smcibdev->mac[link->ibport - 1], ETH_ALEN);
hton24(cclc.qpn, link->roce_qp->qp_num);
cclc.rmb_rkey =
- htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
+ htonl(conn->rmb_desc->rkey[SMC_SINGLE_LINK]);
cclc.conn_idx = 1; /* for now: 1 RMB = 1 RMBE */
cclc.rmbe_alert_token = htonl(conn->alert_token_local);
cclc.qp_mtu = min(link->path_mtu, link->peer_mtu);
@@ -256,7 +256,7 @@ int smc_clc_send_accept(struct smc_sock
memcpy(&aclc.lcl.mac, link->smcibdev->mac[link->ibport - 1], ETH_ALEN);
hton24(aclc.qpn, link->roce_qp->qp_num);
aclc.rmb_rkey =
- htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
+ htonl(conn->rmb_desc->rkey[SMC_SINGLE_LINK]);
aclc.conn_idx = 1; /* as long as 1 RMB = 1 RMBE */
aclc.rmbe_alert_token = htonl(conn->alert_token_local);
aclc.qp_mtu = link->path_mtu;
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -613,19 +613,8 @@ int smc_rmb_create(struct smc_sock *smc)
rmb_desc = NULL;
continue; /* if mapping failed, try smaller one */
}
- rc = smc_ib_get_memory_region(lgr->lnk[SMC_SINGLE_LINK].roce_pd,
- IB_ACCESS_REMOTE_WRITE |
- IB_ACCESS_LOCAL_WRITE,
- &rmb_desc->mr_rx[SMC_SINGLE_LINK]);
- if (rc) {
- smc_ib_buf_unmap(lgr->lnk[SMC_SINGLE_LINK].smcibdev,
- tmp_bufsize, rmb_desc,
- DMA_FROM_DEVICE);
- kfree(rmb_desc->cpu_addr);
- kfree(rmb_desc);
- rmb_desc = NULL;
- continue;
- }
+ rmb_desc->rkey[SMC_SINGLE_LINK] =
+ lgr->lnk[SMC_SINGLE_LINK].roce_pd->unsafe_global_rkey;
rmb_desc->used = 1;
write_lock_bh(&lgr->rmbs_lock);
list_add(&rmb_desc->list,
@@ -668,6 +657,7 @@ int smc_rmb_rtoken_handling(struct smc_c

for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
if ((lgr->rtokens[i][SMC_SINGLE_LINK].rkey == rkey) &&
+ (lgr->rtokens[i][SMC_SINGLE_LINK].dma_addr == dma_addr) &&
test_bit(i, lgr->rtokens_used_mask)) {
conn->rtoken_idx = i;
return 0;
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -93,7 +93,7 @@ struct smc_buf_desc {
u64 dma_addr[SMC_LINKS_PER_LGR_MAX];
/* mapped address of buffer */
void *cpu_addr; /* virtual address of buffer */
- struct ib_mr *mr_rx[SMC_LINKS_PER_LGR_MAX];
+ u32 rkey[SMC_LINKS_PER_LGR_MAX];
/* for rmb only:
* rkey provided to peer
*/
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -37,24 +37,6 @@ u8 local_systemid[SMC_SYSTEMID_LEN] = SM
* identifier
*/

-int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
- struct ib_mr **mr)
-{
- int rc;
-
- if (*mr)
- return 0; /* already done */
-
- /* obtain unique key -
- * next invocation of get_dma_mr returns a different key!
- */
- *mr = pd->device->get_dma_mr(pd, access_flags);
- rc = PTR_ERR_OR_ZERO(*mr);
- if (IS_ERR(*mr))
- *mr = NULL;
- return rc;
-}
-
static int smc_ib_modify_qp_init(struct smc_link *lnk)
{
struct ib_qp_attr qp_attr;
@@ -213,7 +195,8 @@ int smc_ib_create_protection_domain(stru
{
int rc;

- lnk->roce_pd = ib_alloc_pd(lnk->smcibdev->ibdev, 0);
+ lnk->roce_pd = ib_alloc_pd(lnk->smcibdev->ibdev,
+ IB_PD_UNSAFE_GLOBAL_RKEY);
rc = PTR_ERR_OR_ZERO(lnk->roce_pd);
if (IS_ERR(lnk->roce_pd))
lnk->roce_pd = NULL;
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -60,8 +60,6 @@ void smc_ib_dealloc_protection_domain(st
int smc_ib_create_protection_domain(struct smc_link *lnk);
void smc_ib_destroy_queue_pair(struct smc_link *lnk);
int smc_ib_create_queue_pair(struct smc_link *lnk);
-int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
- struct ib_mr **mr);
int smc_ib_ready_link(struct smc_link *lnk);
int smc_ib_modify_qp_rts(struct smc_link *lnk);
int smc_ib_modify_qp_reset(struct smc_link *lnk);


2017-06-05 16:29:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 015/115] net/packet: fix missing net_device reference release

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Douglas Caetano dos Santos <[email protected]>


[ Upstream commit d19b183cdc1fa3d70d6abe2a4c369e748cd7ebb8 ]

When using a TX ring buffer, if an error occurs processing a control
message (e.g. invalid message), the net_device reference is not
released.

Fixes c14ac9451c348 ("sock: enable timestamping using control messages")
Signed-off-by: Douglas Caetano dos Santos <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/packet/af_packet.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2614,13 +2614,6 @@ static int tpacket_snd(struct packet_soc
dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex);
}

- sockc.tsflags = po->sk.sk_tsflags;
- if (msg->msg_controllen) {
- err = sock_cmsg_send(&po->sk, msg, &sockc);
- if (unlikely(err))
- goto out;
- }
-
err = -ENXIO;
if (unlikely(dev == NULL))
goto out;
@@ -2628,6 +2621,13 @@ static int tpacket_snd(struct packet_soc
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;

+ sockc.tsflags = po->sk.sk_tsflags;
+ if (msg->msg_controllen) {
+ err = sock_cmsg_send(&po->sk, msg, &sockc);
+ if (unlikely(err))
+ goto out_put;
+ }
+
if (po->sk.sk_socket->type == SOCK_RAW)
reserve = dev->hard_header_len;
size_max = po->tx_ring.frame_size


2017-06-05 16:29:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 042/115] sparc64: Fix mapping of 64k pages with MAP_FIXED

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Nitin Gupta <[email protected]>


[ Upstream commit b6c41cb050d5debc7e4eaa0a81cbdbad72588891 ]

An incorrect huge page alignment check caused
mmap failure for 64K pages when MAP_FIXED is used
with address not aligned to HPAGE_SIZE.

Orabug: 25885991

Fixes: dcd1912d21a0 ("sparc64: Add 64K page size support")
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/sparc/include/asm/hugetlb.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -24,9 +24,11 @@ static inline int is_hugepage_only_range
static inline int prepare_hugepage_range(struct file *file,
unsigned long addr, unsigned long len)
{
- if (len & ~HPAGE_MASK)
+ struct hstate *h = hstate_file(file);
+
+ if (len & ~huge_page_mask(h))
return -EINVAL;
- if (addr & ~HPAGE_MASK)
+ if (addr & ~huge_page_mask(h))
return -EINVAL;
return 0;
}


2017-06-05 16:29:51

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 045/115] fs/ufs: Set UFS default maximum bytes per file

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Richard Narron <[email protected]>

commit 239e250e4acbc0104d514307029c0839e834a51a upstream.

This fixes a problem with reading files larger than 2GB from a UFS-2
file system:

https://bugzilla.kernel.org/show_bug.cgi?id=195721

The incorrect UFS s_maxsize limit became a problem as of commit
c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
which started using s_maxbytes to avoid a page index overflow in
do_generic_file_read().

That caused files to be truncated on UFS-2 file systems because the
default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.

Here I simply increase the default to a common value used by other file
systems.

Signed-off-by: Richard Narron <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Will B <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ufs/super.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -812,9 +812,8 @@ static int ufs_fill_super(struct super_b
uspi->s_dirblksize = UFS_SECTOR_SIZE;
super_block_offset=UFS_SBLOCK;

- /* Keep 2Gig file limit. Some UFS variants need to override
- this but as I don't know which I'll let those in the know loosen
- the rules */
+ sb->s_maxbytes = MAX_LFS_FILESIZE;
+
switch (sbi->s_mount_opt & UFS_MOUNT_UFSTYPE) {
case UFS_MOUNT_UFSTYPE_44BSD:
UFSD("ufstype=44bsd\n");


2017-06-05 16:29:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 046/115] powerpc: Fix booting P9 hash with CONFIG_PPC_RADIX_MMU=N

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Michael Neuling <[email protected]>

commit d957fb4d173647640a2b83e7c7e56a580e7fc7e7 upstream.

Currently if you disable CONFIG_PPC_RADIX_MMU you'll crash on boot on
a P9. This is because we still set MMU_FTR_TYPE_RADIX via
ibm,pa-features and MMU_FTR_TYPE_RADIX is what's used for code patching
in much of the asm code (ie. slb_miss_realmode)

This patch fixes the problem by stopping MMU_FTR_TYPE_RADIX from being
set from ibm.pa-features.

We may eventually end up removing the CONFIG_PPC_RADIX_MMU option
completely but until then this fixes the issue.

Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to enable Radix MMU")
Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/powerpc/kernel/prom.c | 2 ++
1 file changed, 2 insertions(+)

--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -161,7 +161,9 @@ static struct ibm_pa_feature {
{ .pabyte = 0, .pabit = 3, .cpu_features = CPU_FTR_CTRL },
{ .pabyte = 0, .pabit = 6, .cpu_features = CPU_FTR_NOEXECUTE },
{ .pabyte = 1, .pabit = 2, .mmu_features = MMU_FTR_CI_LARGE_PAGE },
+#ifdef CONFIG_PPC_RADIX_MMU
{ .pabyte = 40, .pabit = 0, .mmu_features = MMU_FTR_TYPE_RADIX },
+#endif
{ .pabyte = 1, .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
{ .pabyte = 5, .pabit = 0, .cpu_features = CPU_FTR_REAL_LE,
.cpu_user_ftrs = PPC_FEATURE_TRUE_LE },


2017-06-05 16:29:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 044/115] sparc/ftrace: Fix ftrace graph time measurement

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: "Liam R. Howlett" <[email protected]>


[ Upstream commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14 ]

The ftrace function_graph time measurements of a given function is not
accurate according to those recorded by ftrace using the function
filters. This change pulls the x86_64 fix from 'commit 722b3c746953
("ftrace/graph: Trace function entry before updating index")' into the
sparc specific prepare_ftrace_return which stops ftrace from
counting interrupted tasks in the time measurement.

Example measurements for select_task_rq_fair running "hackbench 100
process 1000":

| tracing/trace_stat/function0 | function_graph
Before patch | 2.802 us | 4.255 us
After patch | 2.749 us | 3.094 us

Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/sparc/kernel/ftrace.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

--- a/arch/sparc/kernel/ftrace.c
+++ b/arch/sparc/kernel/ftrace.c
@@ -130,17 +130,16 @@ unsigned long prepare_ftrace_return(unsi
if (unlikely(atomic_read(&current->tracing_graph_pause)))
return parent + 8UL;

- if (ftrace_push_return_trace(parent, self_addr, &trace.depth,
- frame_pointer, NULL) == -EBUSY)
- return parent + 8UL;
-
trace.func = self_addr;
+ trace.depth = current->curr_ret_stack + 1;

/* Only trace if the calling function expects to */
- if (!ftrace_graph_entry(&trace)) {
- current->curr_ret_stack--;
+ if (!ftrace_graph_entry(&trace))
+ return parent + 8UL;
+
+ if (ftrace_push_return_trace(parent, self_addr, &trace.depth,
+ frame_pointer, NULL) == -EBUSY)
return parent + 8UL;
- }

return return_hooker;
}


2017-06-05 16:30:03

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 037/115] ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Peter Dawson <[email protected]>


[ Upstream commit 0e9a709560dbcfbace8bf4019dc5298619235891 ]

This fix addresses two problems in the way the DSCP field is formulated
on the encapsulating header of IPv6 tunnels.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195661

1) The IPv6 tunneling code was manipulating the DSCP field of the
encapsulating packet using the 32b flowlabel. Since the flowlabel is
only the lower 20b it was incorrect to assume that the upper 12b
containing the DSCP and ECN fields would remain intact when formulating
the encapsulating header. This fix handles the 'inherit' and
'fixed-value' DSCP cases explicitly using the extant dsfield u8 variable.

2) The use of INET_ECN_encapsulate(0, dsfield) in ip6_tnl_xmit was
incorrect and resulted in the DSCP value always being set to 0.

Commit 90427ef5d2a4 ("ipv6: fix flow labels when the traffic class
is non-0") caused the regression by masking out the flowlabel
which exposed the incorrect handling of the DSCP portion of the
flowlabel in ip6_tunnel and ip6_gre.

Fixes: 90427ef5d2a4 ("ipv6: fix flow labels when the traffic class is non-0")
Signed-off-by: Peter Dawson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv6/ip6_gre.c | 13 +++++++------
net/ipv6/ip6_tunnel.c | 21 +++++++++++++--------
2 files changed, 20 insertions(+), 14 deletions(-)

--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -537,11 +537,10 @@ static inline int ip6gre_xmit_ipv4(struc

memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));

- dsfield = ipv4_get_dsfield(iph);
-
if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
- fl6.flowlabel |= htonl((__u32)iph->tos << IPV6_TCLASS_SHIFT)
- & IPV6_TCLASS_MASK;
+ dsfield = ipv4_get_dsfield(iph);
+ else
+ dsfield = ip6_tclass(t->parms.flowinfo);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;

@@ -596,9 +595,11 @@ static inline int ip6gre_xmit_ipv6(struc

memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));

- dsfield = ipv6_get_dsfield(ipv6h);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
- fl6.flowlabel |= (*(__be32 *) ipv6h & IPV6_TCLASS_MASK);
+ dsfield = ipv6_get_dsfield(ipv6h);
+ else
+ dsfield = ip6_tclass(t->parms.flowinfo);
+
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FLOWLABEL)
fl6.flowlabel |= ip6_flowlabel(ipv6h);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1196,7 +1196,7 @@ route_lookup:
skb_push(skb, sizeof(struct ipv6hdr));
skb_reset_network_header(skb);
ipv6h = ipv6_hdr(skb);
- ip6_flow_hdr(ipv6h, INET_ECN_encapsulate(0, dsfield),
+ ip6_flow_hdr(ipv6h, dsfield,
ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6));
ipv6h->hop_limit = hop_limit;
ipv6h->nexthdr = proto;
@@ -1231,8 +1231,6 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, str
if (tproto != IPPROTO_IPIP && tproto != 0)
return -1;

- dsfield = ipv4_get_dsfield(iph);
-
if (t->parms.collect_md) {
struct ip_tunnel_info *tun_info;
const struct ip_tunnel_key *key;
@@ -1246,6 +1244,7 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, str
fl6.flowi6_proto = IPPROTO_IPIP;
fl6.daddr = key->u.ipv6.dst;
fl6.flowlabel = key->label;
+ dsfield = ip6_tclass(key->label);
} else {
if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
encap_limit = t->parms.encap_limit;
@@ -1254,8 +1253,9 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, str
fl6.flowi6_proto = IPPROTO_IPIP;

if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
- fl6.flowlabel |= htonl((__u32)iph->tos << IPV6_TCLASS_SHIFT)
- & IPV6_TCLASS_MASK;
+ dsfield = ipv4_get_dsfield(iph);
+ else
+ dsfield = ip6_tclass(t->parms.flowinfo);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
fl6.flowi6_mark = skb->mark;
}
@@ -1265,6 +1265,8 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, str
if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
return -1;

+ dsfield = INET_ECN_encapsulate(dsfield, ipv4_get_dsfield(iph));
+
skb_set_inner_ipproto(skb, IPPROTO_IPIP);

err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
@@ -1298,8 +1300,6 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, str
ip6_tnl_addr_conflict(t, ipv6h))
return -1;

- dsfield = ipv6_get_dsfield(ipv6h);
-
if (t->parms.collect_md) {
struct ip_tunnel_info *tun_info;
const struct ip_tunnel_key *key;
@@ -1313,6 +1313,7 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, str
fl6.flowi6_proto = IPPROTO_IPV6;
fl6.daddr = key->u.ipv6.dst;
fl6.flowlabel = key->label;
+ dsfield = ip6_tclass(key->label);
} else {
offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb));
/* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */
@@ -1335,7 +1336,9 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, str
fl6.flowi6_proto = IPPROTO_IPV6;

if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
- fl6.flowlabel |= (*(__be32 *)ipv6h & IPV6_TCLASS_MASK);
+ dsfield = ipv6_get_dsfield(ipv6h);
+ else
+ dsfield = ip6_tclass(t->parms.flowinfo);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FLOWLABEL)
fl6.flowlabel |= ip6_flowlabel(ipv6h);
if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK)
@@ -1347,6 +1350,8 @@ ip6ip6_tnl_xmit(struct sk_buff *skb, str
if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
return -1;

+ dsfield = INET_ECN_encapsulate(dsfield, ipv6_get_dsfield(ipv6h));
+
skb_set_inner_ipproto(skb, IPPROTO_IPV6);

err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,


2017-06-05 16:30:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 039/115] bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_data

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Daniel Borkmann <[email protected]>


[ Upstream commit 41703a731066fde79c3e5ccf3391cf77a98aeda5 ]

The bpf_clone_redirect() still needs to be listed in
bpf_helper_changes_pkt_data() since we call into
bpf_try_make_head_writable() from there, thus we need
to invalidate prior pkt regs as well.

Fixes: 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/core/filter.c | 1 +
1 file changed, 1 insertion(+)

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2266,6 +2266,7 @@ bool bpf_helper_changes_pkt_data(void *f
func == bpf_skb_change_head ||
func == bpf_skb_change_tail ||
func == bpf_skb_pull_data ||
+ func == bpf_clone_redirect ||
func == bpf_l3_csum_replace ||
func == bpf_l4_csum_replace ||
func == bpf_xdp_adjust_head)


2017-06-05 16:30:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 040/115] bpf: fix wrong exposure of map_flags into fdinfo for lpm

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Daniel Borkmann <[email protected]>


[ Upstream commit a316338cb71a3260201490e615f2f6d5c0d8fb2c ]

trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and if not, bails out, which is currently the case for lpm
since it exposes always 0 as flags.

Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure to not
miss to copy them over at a later point in time when we add actual
flags for them to use.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
kernel/bpf/arraymap.c | 1 +
kernel/bpf/lpm_trie.c | 1 +
kernel/bpf/stackmap.c | 1 +
3 files changed, 3 insertions(+)

--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -83,6 +83,7 @@ static struct bpf_map *array_map_alloc(u
array->map.key_size = attr->key_size;
array->map.value_size = attr->value_size;
array->map.max_entries = attr->max_entries;
+ array->map.map_flags = attr->map_flags;
array->elem_size = elem_size;

if (!percpu)
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -432,6 +432,7 @@ static struct bpf_map *trie_alloc(union
trie->map.key_size = attr->key_size;
trie->map.value_size = attr->value_size;
trie->map.max_entries = attr->max_entries;
+ trie->map.map_flags = attr->map_flags;
trie->data_size = attr->key_size -
offsetof(struct bpf_lpm_trie_key, data);
trie->max_prefixlen = trie->data_size * 8;
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -88,6 +88,7 @@ static struct bpf_map *stack_map_alloc(u
smap->map.key_size = attr->key_size;
smap->map.value_size = value_size;
smap->map.max_entries = attr->max_entries;
+ smap->map.map_flags = attr->map_flags;
smap->n_buckets = n_buckets;
smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;



2017-06-05 16:30:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 032/115] be2net: Fix offload features for Q-in-Q packets

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vlad Yasevich <[email protected]>


[ Upstream commit cc6e9de62a7f84c9293a2ea41bc412b55bb46e85 ]

At least some of the be2net cards do not seem to be capabled
of performing checksum offload computions on Q-in-Q packets.
In these case, the recevied checksum on the remote is invalid
and TCP syn packets are dropped.

This patch adds a call to check disbled acceleration features
on Q-in-Q tagged traffic.

CC: Sathya Perla <[email protected]>
CC: Ajit Khaparde <[email protected]>
CC: Sriharsha Basavapatna <[email protected]>
CC: Somnath Kotur <[email protected]>
Signed-off-by: Vladislav Yasevich <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/emulex/benet/be_main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -5027,9 +5027,11 @@ static netdev_features_t be_features_che
struct be_adapter *adapter = netdev_priv(dev);
u8 l4_hdr = 0;

- /* The code below restricts offload features for some tunneled packets.
+ /* The code below restricts offload features for some tunneled and
+ * Q-in-Q packets.
* Offload features for normal (non tunnel) packets are unchanged.
*/
+ features = vlan_features_check(skb, features);
if (!skb->encapsulation ||
!(adapter->flags & BE_FLAGS_VXLAN_OFFLOADS))
return features;


2017-06-05 16:30:47

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 050/115] drivers/tty: 8250: only call fintek_8250_probe when doing port I/O

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ard Biesheuvel <[email protected]>

commit 4c4fc90964b1cf205a67df566cc82ea1731bcb00 upstream.

Commit fa01e2ca9f53 ("serial: 8250: Integrate Fintek into 8250_base")
modified the probing logic for PNP0501 devices, to remove a collision
between the generic 16550A driver and the Fintek driver, which reused
the same ACPI _HID.

The Fintek device probe is now incorporated into the common 8250 probe
path, and gets called for all discovered 16550A compatible devices,
including ones that are MMIO mapped rather than IO mapped. However,
the Fintek driver assumes the port base is a I/O address, and proceeds
to probe some arbitrary offsets above it.

This is generally a wrong thing to do, but on ARM systems (having no
native port I/O), this may result in faulting accesses of completely
unrelated MMIO regions in the PCI I/O space. Given that this is at
serial probe time, this results in hard to diagnose crashes at boot.

So let's restrict the Fintek probe to devices that we know are using
port I/O in the first place.

Fixes: fa01e2ca9f53 ("serial: 8250: Integrate Fintek into 8250_base")
Suggested-by: Arnd Bergmann <[email protected]>
Reviewed-by: Ricardo Ribalda <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/tty/serial/8250/8250_port.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1337,7 +1337,7 @@ out_lock:
/*
* Check if the device is a Fintek F81216A
*/
- if (port->type == PORT_16550A)
+ if (port->type == PORT_16550A && port->iotype == UPIO_PORT)
fintek_8250_probe(up);

if (up->capabilities != old_capabilities) {


2017-06-05 16:30:41

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 059/115] mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Srinath Mannam <[email protected]>

commit f5f968f2371ccdebb8a365487649673c9af68d09 upstream.

The stingray SDHCI hardware supports ACMD12 and automatically
issues after multi block transfer completed.

If ACMD12 in SDHCI is disabled, spurious tx done interrupts are seen
on multi block read command with below error message:

Got data interrupt 0x00000002 even though no data
operation was in progress.

This patch uses SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 to enable
ACM12 support in SDHCI hardware and suppress spurious interrupt.

Signed-off-by: Srinath Mannam <[email protected]>
Reviewed-by: Ray Jui <[email protected]>
Reviewed-by: Scott Branden <[email protected]>
Acked-by: Adrian Hunter <[email protected]>
Fixes: b580c52d58d9 ("mmc: sdhci-iproc: add IPROC SDHCI driver")
Signed-off-by: Ulf Hansson <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/mmc/host/sdhci-iproc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/mmc/host/sdhci-iproc.c
+++ b/drivers/mmc/host/sdhci-iproc.c
@@ -187,7 +187,8 @@ static const struct sdhci_iproc_data ipr
};

static const struct sdhci_pltfm_data sdhci_iproc_pltfm_data = {
- .quirks = SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK,
+ .quirks = SDHCI_QUIRK_DATA_TIMEOUT_USES_SDCLK |
+ SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12,
.quirks2 = SDHCI_QUIRK2_ACMD23_BROKEN,
.ops = &sdhci_iproc_ops,
};


2017-06-05 16:30:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 052/115] crypto: skcipher - Add missing API setkey checks

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Herbert Xu <[email protected]>

commit 9933e113c2e87a9f46a40fde8dafbf801dca1ab9 upstream.

The API setkey checks for key sizes and alignment went AWOL during the
skcipher conversion. This patch restores them.

Fixes: 4e6c3df4d729 ("crypto: skcipher - Add low-level skcipher...")
Reported-by: Baozeng <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
crypto/skcipher.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -764,6 +764,44 @@ static int crypto_init_skcipher_ops_ablk
return 0;
}

+static int skcipher_setkey_unaligned(struct crypto_skcipher *tfm,
+ const u8 *key, unsigned int keylen)
+{
+ unsigned long alignmask = crypto_skcipher_alignmask(tfm);
+ struct skcipher_alg *cipher = crypto_skcipher_alg(tfm);
+ u8 *buffer, *alignbuffer;
+ unsigned long absize;
+ int ret;
+
+ absize = keylen + alignmask;
+ buffer = kmalloc(absize, GFP_ATOMIC);
+ if (!buffer)
+ return -ENOMEM;
+
+ alignbuffer = (u8 *)ALIGN((unsigned long)buffer, alignmask + 1);
+ memcpy(alignbuffer, key, keylen);
+ ret = cipher->setkey(tfm, alignbuffer, keylen);
+ kzfree(buffer);
+ return ret;
+}
+
+static int skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ struct skcipher_alg *cipher = crypto_skcipher_alg(tfm);
+ unsigned long alignmask = crypto_skcipher_alignmask(tfm);
+
+ if (keylen < cipher->min_keysize || keylen > cipher->max_keysize) {
+ crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
+ return -EINVAL;
+ }
+
+ if ((unsigned long)key & alignmask)
+ return skcipher_setkey_unaligned(tfm, key, keylen);
+
+ return cipher->setkey(tfm, key, keylen);
+}
+
static void crypto_skcipher_exit_tfm(struct crypto_tfm *tfm)
{
struct crypto_skcipher *skcipher = __crypto_skcipher_cast(tfm);
@@ -784,7 +822,7 @@ static int crypto_skcipher_init_tfm(stru
tfm->__crt_alg->cra_type == &crypto_givcipher_type)
return crypto_init_skcipher_ops_ablkcipher(tfm);

- skcipher->setkey = alg->setkey;
+ skcipher->setkey = skcipher_setkey;
skcipher->encrypt = alg->encrypt;
skcipher->decrypt = alg->decrypt;
skcipher->ivsize = alg->ivsize;


2017-06-05 16:31:02

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 055/115] acpi, nfit: Fix the memory error check in nfit_handle_mce()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vishal Verma <[email protected]>

commit fc08a4703a418a398bbb575ac311d36d110ac786 upstream.

The check for an MCE being a memory error in the NFIT mce handler was
bogus. Use the new mce_is_memory_error() helper to detect the error
properly.

Reported-by: Tony Luck <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/nfit/mce.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifi
struct nfit_spa *nfit_spa;

/* We only care about memory errors */
- if (!(mce->status & MCACOD))
+ if (!mce_is_memory_error(mce))
return NOTIFY_DONE;

/*


2017-06-05 16:31:09

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 047/115] powerpc/spufs: Fix hash faults for kernel regions

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jeremy Kerr <[email protected]>

commit d75e4919cc0b6fbcbc8d6654ef66d87a9dbf1526 upstream.

Commit ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with
_PAGE_PRIVILEGED") swapped _PAGE_USER for _PAGE_PRIVILEGED, and
introduced check_pte_access() which denied kernel access to
non-_PAGE_PRIVILEGED pages.

However, it didn't add _PAGE_PRIVILEGED to the hash fault handler
for spufs' kernel accesses, so the DMAs required to establish SPE
memory no longer work.

This change adds _PAGE_PRIVILEGED to the hash fault handler for
kernel accesses.

Fixes: ac29c64089b7 ("powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED")
Signed-off-by: Jeremy Kerr <[email protected]>
Reported-by: Sombat Tragolgosol <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/powerpc/platforms/cell/spu_base.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/arch/powerpc/platforms/cell/spu_base.c
+++ b/arch/powerpc/platforms/cell/spu_base.c
@@ -197,7 +197,9 @@ static int __spu_trap_data_map(struct sp
(REGION_ID(ea) != USER_REGION_ID)) {

spin_unlock(&spu->register_lock);
- ret = hash_page(ea, _PAGE_PRESENT | _PAGE_READ, 0x300, dsisr);
+ ret = hash_page(ea,
+ _PAGE_PRESENT | _PAGE_READ | _PAGE_PRIVILEGED,
+ 0x300, dsisr);
spin_lock(&spu->register_lock);

if (!ret) {


2017-06-05 16:30:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 051/115] i2c: i2c-tiny-usb: fix buffer not being DMA capable

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Sebastian Reichel <[email protected]>

commit 5165da5923d6c7df6f2927b0113b2e4d9288661e upstream.

Since v4.9 i2c-tiny-usb generates the below call trace
and longer works, since it can't communicate with the
USB device. The reason is, that since v4.9 the USB
stack checks, that the buffer it should transfer is DMA
capable. This was a requirement since v2.2 days, but it
usually worked nevertheless.

[ 17.504959] ------------[ cut here ]------------
[ 17.505488] WARNING: CPU: 0 PID: 93 at drivers/usb/core/hcd.c:1587 usb_hcd_map_urb_for_dma+0x37c/0x570
[ 17.506545] transfer buffer not dma capable
[ 17.507022] Modules linked in:
[ 17.507370] CPU: 0 PID: 93 Comm: i2cdetect Not tainted 4.11.0-rc8+ #10
[ 17.508103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 17.509039] Call Trace:
[ 17.509320] ? dump_stack+0x5c/0x78
[ 17.509714] ? __warn+0xbe/0xe0
[ 17.510073] ? warn_slowpath_fmt+0x5a/0x80
[ 17.510532] ? nommu_map_sg+0xb0/0xb0
[ 17.510949] ? usb_hcd_map_urb_for_dma+0x37c/0x570
[ 17.511482] ? usb_hcd_submit_urb+0x336/0xab0
[ 17.511976] ? wait_for_completion_timeout+0x12f/0x1a0
[ 17.512549] ? wait_for_completion_timeout+0x65/0x1a0
[ 17.513125] ? usb_start_wait_urb+0x65/0x160
[ 17.513604] ? usb_control_msg+0xdc/0x130
[ 17.514061] ? usb_xfer+0xa4/0x2a0
[ 17.514445] ? __i2c_transfer+0x108/0x3c0
[ 17.514899] ? i2c_transfer+0x57/0xb0
[ 17.515310] ? i2c_smbus_xfer_emulated+0x12f/0x590
[ 17.515851] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 17.516408] ? i2c_smbus_xfer+0x125/0x330
[ 17.516876] ? i2c_smbus_xfer+0x125/0x330
[ 17.517329] ? i2cdev_ioctl_smbus+0x1c1/0x2b0
[ 17.517824] ? i2cdev_ioctl+0x75/0x1c0
[ 17.518248] ? do_vfs_ioctl+0x9f/0x600
[ 17.518671] ? vfs_write+0x144/0x190
[ 17.519078] ? SyS_ioctl+0x74/0x80
[ 17.519463] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[ 17.519959] ---[ end trace d047c04982f5ac50 ]---

Signed-off-by: Sebastian Reichel <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Till Harbaum <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/i2c/busses/i2c-tiny-usb.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

--- a/drivers/i2c/busses/i2c-tiny-usb.c
+++ b/drivers/i2c/busses/i2c-tiny-usb.c
@@ -178,22 +178,39 @@ static int usb_read(struct i2c_adapter *
int value, int index, void *data, int len)
{
struct i2c_tiny_usb *dev = (struct i2c_tiny_usb *)adapter->algo_data;
+ void *dmadata = kmalloc(len, GFP_KERNEL);
+ int ret;
+
+ if (!dmadata)
+ return -ENOMEM;

/* do control transfer */
- return usb_control_msg(dev->usb_dev, usb_rcvctrlpipe(dev->usb_dev, 0),
+ ret = usb_control_msg(dev->usb_dev, usb_rcvctrlpipe(dev->usb_dev, 0),
cmd, USB_TYPE_VENDOR | USB_RECIP_INTERFACE |
- USB_DIR_IN, value, index, data, len, 2000);
+ USB_DIR_IN, value, index, dmadata, len, 2000);
+
+ memcpy(data, dmadata, len);
+ kfree(dmadata);
+ return ret;
}

static int usb_write(struct i2c_adapter *adapter, int cmd,
int value, int index, void *data, int len)
{
struct i2c_tiny_usb *dev = (struct i2c_tiny_usb *)adapter->algo_data;
+ void *dmadata = kmemdup(data, len, GFP_KERNEL);
+ int ret;
+
+ if (!dmadata)
+ return -ENOMEM;

/* do control transfer */
- return usb_control_msg(dev->usb_dev, usb_sndctrlpipe(dev->usb_dev, 0),
+ ret = usb_control_msg(dev->usb_dev, usb_sndctrlpipe(dev->usb_dev, 0),
cmd, USB_TYPE_VENDOR | USB_RECIP_INTERFACE,
- value, index, data, len, 2000);
+ value, index, dmadata, len, 2000);
+
+ kfree(dmadata);
+ return ret;
}

static void i2c_tiny_usb_free(struct i2c_tiny_usb *dev)


2017-06-05 16:31:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 056/115] ACPI / sysfs: fix acpi_get_table() leak / acpi-sysfs denial of service

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Dan Williams <[email protected]>

commit 0de0e198bc7191a0e46cf71f66fec4d07ca91396 upstream.

Reading an ACPI table through the /sys/firmware/acpi/tables interface
more than 65,536 times leads to the following log message:

ACPI Error: Table ffff88033595eaa8, Validation count is zero after increment
(20170119/tbutils-423)

...and the table being unavailable until the next reboot. Add the
missing acpi_put_table() so the table ->validation_count is decremented
after each read.

Reported-by: Anush Seetharaman <[email protected]>
Fixes: 174cc7187e6f "ACPICA: Tables: Back port acpi_get_table_with_size() ..."
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/sysfs.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--- a/drivers/acpi/sysfs.c
+++ b/drivers/acpi/sysfs.c
@@ -333,14 +333,17 @@ static ssize_t acpi_table_show(struct fi
container_of(bin_attr, struct acpi_table_attr, attr);
struct acpi_table_header *table_header = NULL;
acpi_status status;
+ ssize_t rc;

status = acpi_get_table(table_attr->name, table_attr->instance,
&table_header);
if (ACPI_FAILURE(status))
return -ENODEV;

- return memory_read_from_buffer(buf, count, &offset,
- table_header, table_header->length);
+ rc = memory_read_from_buffer(buf, count, &offset, table_header,
+ table_header->length);
+ acpi_put_table(table_header);
+ return rc;
}

static int acpi_table_attr_init(struct kobject *tables_obj,


2017-06-05 16:31:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 063/115] scsi: scsi_dh_rdac: Use ctlr directly in rdac_failover_get()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Artem Savkov <[email protected]>

commit 0648a07c9b22acc33ead0645cf8f607b0c9c7e32 upstream.

rdac_failover_get references struct rdac_controller as
ctlr->ms_sdev->handler_data->ctlr for no apparent reason. Besides being
inefficient this also introduces a null-pointer dereference as
send_mode_select() sets ctlr->ms_sdev to NULL before calling
rdac_failover_get():

[ 18.432550] device-mapper: multipath service-time: version 0.3.0 loaded
[ 18.436124] BUG: unable to handle kernel NULL pointer dereference at 0000000000000790
[ 18.436129] IP: send_mode_select+0xca/0x560
[ 18.436129] PGD 0
[ 18.436130] P4D 0
[ 18.436130]
[ 18.436132] Oops: 0000 [#1] SMP
[ 18.436133] Modules linked in: dm_service_time sd_mod dm_multipath amdkfd amd_iommu_v2 radeon(+) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm qla2xxx drm serio_raw scsi_transport_fc bnx2 i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 18.436143] CPU: 4 PID: 443 Comm: kworker/u16:2 Not tainted 4.12.0-rc1.1.el7.test.x86_64 #1
[ 18.436144] Hardware name: IBM BladeCenter LS22 -[79013SG]-/Server Blade, BIOS -[L8E164AUS-1.07]- 05/25/2011
[ 18.436145] Workqueue: kmpath_rdacd send_mode_select
[ 18.436146] task: ffff880225116a40 task.stack: ffffc90002bd8000
[ 18.436148] RIP: 0010:send_mode_select+0xca/0x560
[ 18.436148] RSP: 0018:ffffc90002bdbda8 EFLAGS: 00010246
[ 18.436149] RAX: 0000000000000000 RBX: ffffc90002bdbe08 RCX: ffff88017ef04a80
[ 18.436150] RDX: ffffc90002bdbe08 RSI: ffff88017ef04a80 RDI: ffff8802248e4388
[ 18.436151] RBP: ffffc90002bdbe48 R08: 0000000000000000 R09: ffffffff81c104c0
[ 18.436151] R10: 00000000000001ff R11: 000000000000035a R12: ffffc90002bdbdd8
[ 18.436152] R13: ffff8802248e4390 R14: ffff880225152800 R15: ffff8802248e4400
[ 18.436153] FS: 0000000000000000(0000) GS:ffff880227d00000(0000) knlGS:0000000000000000
[ 18.436154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.436154] CR2: 0000000000000790 CR3: 000000042535b000 CR4: 00000000000006e0
[ 18.436155] Call Trace:
[ 18.436159] ? rdac_activate+0x14e/0x150
[ 18.436161] ? refcount_dec_and_test+0x11/0x20
[ 18.436162] ? kobject_put+0x1c/0x50
[ 18.436165] ? scsi_dh_activate+0x6f/0xd0
[ 18.436168] process_one_work+0x149/0x360
[ 18.436170] worker_thread+0x4d/0x3c0
[ 18.436172] kthread+0x109/0x140
[ 18.436173] ? rescuer_thread+0x380/0x380
[ 18.436174] ? kthread_park+0x60/0x60
[ 18.436176] ret_from_fork+0x2c/0x40
[ 18.436177] Code: 49 c7 46 20 00 00 00 00 4c 89 ef c6 07 00 0f 1f 40 00 45 31 ed c7 45 b0 05 00 00 00 44 89 6d b4 4d 89 f5 4c 8b 75 a8 49 8b 45 20 <48> 8b b0 90 07 00 00 48 8b 56 10 8b 42 10 48 8d 7a 28 85 c0 0f
[ 18.436192] RIP: send_mode_select+0xca/0x560 RSP: ffffc90002bdbda8
[ 18.436192] CR2: 0000000000000790
[ 18.436198] ---[ end trace 40f3e4dca1ffabdd ]---
[ 18.436199] Kernel panic - not syncing: Fatal exception
[ 18.436222] Kernel Offset: disabled
[-- MARK -- Thu May 18 11:45:00 2017]

Fixes: 327825574132 scsi_dh_rdac: switch to scsi_execute_req_flags()
Signed-off-by: Artem Savkov <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/scsi/device_handler/scsi_dh_rdac.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

--- a/drivers/scsi/device_handler/scsi_dh_rdac.c
+++ b/drivers/scsi/device_handler/scsi_dh_rdac.c
@@ -265,18 +265,16 @@ static unsigned int rdac_failover_get(st
struct list_head *list,
unsigned char *cdb)
{
- struct scsi_device *sdev = ctlr->ms_sdev;
- struct rdac_dh_data *h = sdev->handler_data;
struct rdac_mode_common *common;
unsigned data_size;
struct rdac_queue_data *qdata;
u8 *lun_table;

- if (h->ctlr->use_ms10) {
+ if (ctlr->use_ms10) {
struct rdac_pg_expanded *rdac_pg;

data_size = sizeof(struct rdac_pg_expanded);
- rdac_pg = &h->ctlr->mode_select.expanded;
+ rdac_pg = &ctlr->mode_select.expanded;
memset(rdac_pg, 0, data_size);
common = &rdac_pg->common;
rdac_pg->page_code = RDAC_PAGE_CODE_REDUNDANT_CONTROLLER + 0x40;
@@ -288,7 +286,7 @@ static unsigned int rdac_failover_get(st
struct rdac_pg_legacy *rdac_pg;

data_size = sizeof(struct rdac_pg_legacy);
- rdac_pg = &h->ctlr->mode_select.legacy;
+ rdac_pg = &ctlr->mode_select.legacy;
memset(rdac_pg, 0, data_size);
common = &rdac_pg->common;
rdac_pg->page_code = RDAC_PAGE_CODE_REDUNDANT_CONTROLLER;
@@ -304,7 +302,7 @@ static unsigned int rdac_failover_get(st
}

/* Prepare the command. */
- if (h->ctlr->use_ms10) {
+ if (ctlr->use_ms10) {
cdb[0] = MODE_SELECT_10;
cdb[7] = data_size >> 8;
cdb[8] = data_size & 0xff;


2017-06-05 16:31:34

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 077/115] Revert "ALSA: usb-audio: purge needless variable length array"

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Takashi Iwai <[email protected]>

commit 64188cfbe5245d412de2139a3864e4e00b4136f0 upstream.

This reverts commit 89b593c30e83 ("ALSA: usb-audio: purge needless
variable length array"). The patch turned out to cause a severe
regression, triggering an Oops at snd_usb_ctl_msg(). It was overseen
that snd_usb_ctl_msg() writes back the response to the given buffer,
while the patch changed it to a read-only const buffer. (One should
always double-check when an extra pointer cast is present...)

As a simple fix, just revert the affected commit. It was merely a
cleanup. Although it brings VLA again, it's clearer as a fix. We'll
address the VLA later in another patch.

Fixes: 89b593c30e83 ("ALSA: usb-audio: purge needless variable length array")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
sound/usb/mixer_us16x08.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

--- a/sound/usb/mixer_us16x08.c
+++ b/sound/usb/mixer_us16x08.c
@@ -698,12 +698,12 @@ static int snd_us16x08_meter_get(struct
struct snd_usb_audio *chip = elem->head.mixer->chip;
struct snd_us16x08_meter_store *store = elem->private_data;
u8 meter_urb[64];
- char tmp[sizeof(mix_init_msg2)] = {0};
+ char tmp[max(sizeof(mix_init_msg1), sizeof(mix_init_msg2))];

switch (kcontrol->private_value) {
case 0:
- snd_us16x08_send_urb(chip, (char *)mix_init_msg1,
- sizeof(mix_init_msg1));
+ memcpy(tmp, mix_init_msg1, sizeof(mix_init_msg1));
+ snd_us16x08_send_urb(chip, tmp, 4);
snd_us16x08_recv_urb(chip, meter_urb,
sizeof(meter_urb));
kcontrol->private_value++;
@@ -721,7 +721,7 @@ static int snd_us16x08_meter_get(struct
case 3:
memcpy(tmp, mix_init_msg2, sizeof(mix_init_msg2));
tmp[2] = snd_get_meter_comp_index(store);
- snd_us16x08_send_urb(chip, tmp, sizeof(mix_init_msg2));
+ snd_us16x08_send_urb(chip, tmp, 10);
snd_us16x08_recv_urb(chip, meter_urb,
sizeof(meter_urb));
kcontrol->private_value = 0;


2017-06-05 16:31:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 064/115] ibmvscsis: Clear left-over abort_cmd pointers

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Bryant G. Ly <[email protected]>

commit 98883f1b5415ea9dce60d5178877d15f4faa10b8 upstream.

With the addition of ibmvscsis->abort_cmd pointer within
commit 25e78531268e ("ibmvscsis: Do not send aborted task response"),
make sure to explicitly NULL these pointers when clearing
DELAY_SEND flag.

Do this for two cases, when getting the new new ibmvscsis
descriptor in ibmvscsis_get_free_cmd() and before posting
the response completion in ibmvscsis_send_messages().

Signed-off-by: Bryant G. Ly <[email protected]>
Reviewed-by: Michael Cyr <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 3 +++
1 file changed, 3 insertions(+)

--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
@@ -1170,6 +1170,8 @@ static struct ibmvscsis_cmd *ibmvscsis_g
cmd = list_first_entry_or_null(&vscsi->free_cmd,
struct ibmvscsis_cmd, list);
if (cmd) {
+ if (cmd->abort_cmd)
+ cmd->abort_cmd = NULL;
cmd->flags &= ~(DELAY_SEND);
list_del(&cmd->list);
cmd->iue = iue;
@@ -1774,6 +1776,7 @@ static void ibmvscsis_send_messages(stru
if (cmd->abort_cmd) {
retry = true;
cmd->abort_cmd->flags &= ~(DELAY_SEND);
+ cmd->abort_cmd = NULL;
}

/*


2017-06-05 16:31:55

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 065/115] ibmvscsis: Fix the incorrect req_lim_delta

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Bryant G. Ly <[email protected]>

commit 75dbf2d36f6b122ad3c1070fe4bf95f71bbff321 upstream.

The current code is not correctly calculating the req_lim_delta.

We want to make sure vscsi->credit is always incremented when
we do not send a response for the scsi op. Thus for the case where
there is a successfully aborted task we need to make sure the
vscsi->credit is incremented.

v2 - Moves the original location of the vscsi->credit increment
to a better spot. Since if we increment credit, the next command
we send back will have increased req_lim_delta. But we probably
shouldn't be doing that until the aborted cmd is actually released.
Otherwise the client will think that it can send a new command, and
we could find ourselves short of command elements. Not likely, but could
happen.

This patch depends on both:
commit 25e78531268e ("ibmvscsis: Do not send aborted task response")
commit 98883f1b5415 ("ibmvscsis: Clear left-over abort_cmd pointers")

Signed-off-by: Bryant G. Ly <[email protected]>
Reviewed-by: Michael Cyr <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
@@ -1791,6 +1791,25 @@ static void ibmvscsis_send_messages(stru
list_del(&cmd->list);
ibmvscsis_free_cmd_resources(vscsi,
cmd);
+ /*
+ * With a successfully aborted op
+ * through LIO we want to increment the
+ * the vscsi credit so that when we dont
+ * send a rsp to the original scsi abort
+ * op (h_send_crq), but the tm rsp to
+ * the abort is sent, the credit is
+ * correctly sent with the abort tm rsp.
+ * We would need 1 for the abort tm rsp
+ * and 1 credit for the aborted scsi op.
+ * Thus we need to increment here.
+ * Also we want to increment the credit
+ * here because we want to make sure
+ * cmd is actually released first
+ * otherwise the client will think it
+ * it can send a new cmd, and we could
+ * find ourselves short of cmd elements.
+ */
+ vscsi->credit += 1;
} else {
iue = cmd->iue;

@@ -2965,10 +2984,7 @@ static long srp_build_response(struct sc

rsp->opcode = SRP_RSP;

- if (vscsi->credit > 0 && vscsi->state == SRP_PROCESSING)
- rsp->req_lim_delta = cpu_to_be32(vscsi->credit);
- else
- rsp->req_lim_delta = cpu_to_be32(1 + vscsi->credit);
+ rsp->req_lim_delta = cpu_to_be32(1 + vscsi->credit);
rsp->tag = cmd->rsp.tag;
rsp->flags = 0;



2017-06-05 16:32:13

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 080/115] mm: avoid spurious bad pmd warning messages

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ross Zwisler <[email protected]>

commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream.

When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
pages, they were all added to the end of if() statements after existing
pmd_trans_huge() checks. So, things like:

- if (pmd_trans_huge(*pmd))
+ if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

When further checks were added after pmd_trans_unstable() checks by
commit 7267ec008b5c ("mm: postpone page table allocation until we have
page to map") they were also added at the end of the conditional:

+ if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

This ordering is fine for pmd_trans_huge(), but doesn't work for
pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
pmd_trans_unstable()), which prints out a warning and returns 1. So, we
do end up doing the right thing, but only after spamming dmesg with
suspicious looking messages:

mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

Reorder these checks in a helper so that pmd_devmap() is checked first,
avoiding the error messages, and add a comment explaining why the
ordering is important.

Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Pawel Lebioda <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Xiong Zhou <[email protected]>
Cc: Eryu Guan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3029,6 +3029,17 @@ static int __do_fault(struct vm_fault *v
return ret;
}

+/*
+ * The ordering of these checks is important for pmds with _PAGE_DEVMAP set.
+ * If we check pmd_trans_unstable() first we will trip the bad_pmd() check
+ * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly
+ * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
+ */
+static int pmd_devmap_trans_unstable(pmd_t *pmd)
+{
+ return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
+}
+
static int pte_alloc_one_map(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -3052,18 +3063,27 @@ static int pte_alloc_one_map(struct vm_f
map_pte:
/*
* If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
+ * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of
+ * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge
+ * under us and then back to pmd_none, as a result of MADV_DONTNEED
+ * running immediately after a huge pmd fault in a different thread of
+ * this mm, in turn leading to a misleading pmd_trans_huge() retval.
+ * All we have to ensure is that it is a regular pmd that we can walk
+ * with pte_offset_map() and we can do that through an atomic read in
+ * C, which is what pmd_trans_unstable() provides.
*/
- if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+ if (pmd_devmap_trans_unstable(vmf->pmd))
return VM_FAULT_NOPAGE;

+ /*
+ * At this point we know that our vmf->pmd points to a page of ptes
+ * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge()
+ * for the duration of the fault. If a racing MADV_DONTNEED runs and
+ * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still
+ * be valid and we will re-check to make sure the vmf->pte isn't
+ * pte_none() under vmf->ptl protection when we return to
+ * alloc_set_pte().
+ */
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
return 0;
@@ -3690,7 +3710,7 @@ static int handle_pte_fault(struct vm_fa
vmf->pte = NULL;
} else {
/* See comment in pte_alloc_one_map() */
- if (pmd_trans_unstable(vmf->pmd) || pmd_devmap(*vmf->pmd))
+ if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
/*
* A regular pmd is established and it can't morph into a huge


2017-06-05 16:32:06

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 069/115] nvme: avoid to use blk_mq_abort_requeue_list()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ming Lei <[email protected]>

commit 986f75c876dbafed98eba7cb516c5118f155db23 upstream.

NVMe may add request into requeue list simply and not kick off the
requeue if hw queues are stopped. Then blk_mq_abort_requeue_list()
is called in both nvme_kill_queues() and nvme_ns_remove() for
dealing with this issue.

Unfortunately blk_mq_abort_requeue_list() is absolutely a
race maker, for example, one request may be requeued during
the aborting. So this patch just calls blk_mq_kick_requeue_list() in
nvme_kill_queues() to handle this issue like what nvme_start_queues()
does. Now all requests in requeue list when queues are stopped will be
handled by blk_mq_kick_requeue_list() when queues are restarted, either
in nvme_start_queues() or in nvme_kill_queues().

Reported-by: Zhang Yi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/nvme/host/core.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2006,7 +2006,6 @@ static void nvme_ns_remove(struct nvme_n
if (ns->ndev)
nvme_nvm_unregister_sysfs(ns);
del_gendisk(ns->disk);
- blk_mq_abort_requeue_list(ns->queue);
blk_cleanup_queue(ns->queue);
}

@@ -2344,7 +2343,6 @@ void nvme_kill_queues(struct nvme_ctrl *
continue;
revalidate_disk(ns->disk);
blk_set_queue_dying(ns->queue);
- blk_mq_abort_requeue_list(ns->queue);

/*
* Forcibly start all queues to avoid having stuck requests.
@@ -2352,6 +2350,9 @@ void nvme_kill_queues(struct nvme_ctrl *
* when the final removal happens.
*/
blk_mq_start_hw_queues(ns->queue);
+
+ /* draining requests in requeue list */
+ blk_mq_kick_requeue_list(ns->queue);
}
mutex_unlock(&ctrl->namespaces_mutex);
}


2017-06-05 16:32:21

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 092/115] drm/gma500/psb: Actually use VBT mode when it is found

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Patrik Jakobsson <[email protected]>

commit 82bc9a42cf854fdf63155759c0aa790bd1f361b0 upstream.

With LVDS we were incorrectly picking the pre-programmed mode instead of
the prefered mode provided by VBT. Make sure we pick the VBT mode if
one is provided. It is likely that the mode read-out code is still wrong
but this patch fixes the immediate problem on most machines.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=78562
Signed-off-by: Patrik Jakobsson <[email protected]>
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/gpu/drm/gma500/psb_intel_lvds.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

--- a/drivers/gpu/drm/gma500/psb_intel_lvds.c
+++ b/drivers/gpu/drm/gma500/psb_intel_lvds.c
@@ -760,20 +760,23 @@ void psb_intel_lvds_init(struct drm_devi
if (scan->type & DRM_MODE_TYPE_PREFERRED) {
mode_dev->panel_fixed_mode =
drm_mode_duplicate(dev, scan);
+ DRM_DEBUG_KMS("Using mode from DDC\n");
goto out; /* FIXME: check for quirks */
}
}

/* Failed to get EDID, what about VBT? do we need this? */
- if (mode_dev->vbt_mode)
+ if (dev_priv->lfp_lvds_vbt_mode) {
mode_dev->panel_fixed_mode =
- drm_mode_duplicate(dev, mode_dev->vbt_mode);
+ drm_mode_duplicate(dev, dev_priv->lfp_lvds_vbt_mode);

- if (!mode_dev->panel_fixed_mode)
- if (dev_priv->lfp_lvds_vbt_mode)
- mode_dev->panel_fixed_mode =
- drm_mode_duplicate(dev,
- dev_priv->lfp_lvds_vbt_mode);
+ if (mode_dev->panel_fixed_mode) {
+ mode_dev->panel_fixed_mode->type |=
+ DRM_MODE_TYPE_PREFERRED;
+ DRM_DEBUG_KMS("Using mode from VBT\n");
+ goto out;
+ }
+ }

/*
* If we didn't get EDID, try checking if the panel is already turned
@@ -790,6 +793,7 @@ void psb_intel_lvds_init(struct drm_devi
if (mode_dev->panel_fixed_mode) {
mode_dev->panel_fixed_mode->type |=
DRM_MODE_TYPE_PREFERRED;
+ DRM_DEBUG_KMS("Using pre-programmed mode\n");
goto out; /* FIXME: check for quirks */
}
}


2017-06-05 16:32:32

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 081/115] dax: fix race between colliding PMD & PTE entries

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ross Zwisler <[email protected]>

commit e2093926a098a8ccf0f1d10f6df8dad452cb28d3 upstream.

We currently have two related PMD vs PTE races in the DAX code. These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

CPU 0 CPU 1

(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
handle_pte_fault()
passes check for pmd_devmap()

(private mapping read)
__handle_mm_fault()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD

dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
installed in our page tables at this spot.

Here's the second race:

CPU 0 CPU 1

(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()
dax_iomap_pmd_fault() inserts PMD

(private mapping write)
__handle_mm_fault()
create_huge_pmd() - FALLBACK
(private mapping read)
__handle_mm_fault()
passes check for pmd_none()
create_huge_pmd()

handle_pte_fault()
dax_iomap_pte_fault() inserts PTE
dax_iomap_pmd_fault() inserts PMD,
but we already have a PTE at
this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c. This
means for instance that this code in __handle_mm_fault() can run:

if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent. This is the cause of
the 2nd race. The first race is similar - there is the following check
in handle_pte_fault():

} else {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault. This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[[email protected]: improve fix for colliding PMD & PTE entries]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ross Zwisler <[email protected]>
Reported-by: Pawel Lebioda <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Pawel Lebioda <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Xiong Zhou <[email protected]>
Cc: Eryu Guan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/dax.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1129,6 +1129,17 @@ static int dax_iomap_pte_fault(struct vm
return dax_fault_return(PTR_ERR(entry));

/*
+ * It is possible, particularly with mixed reads & writes to private
+ * mappings, that we have raced with a PMD fault that overlaps with
+ * the PTE we need to set up. If so just return and the fault will be
+ * retried.
+ */
+ if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
+ vmf_ret = VM_FAULT_NOPAGE;
+ goto unlock_entry;
+ }
+
+ /*
* Note that we don't bother to use iomap_apply here: DAX required
* the file system block size to be equal the page size, which means
* that we never have to deal with more than a single extent here.
@@ -1363,6 +1374,18 @@ static int dax_iomap_pmd_fault(struct vm
goto fallback;

/*
+ * It is possible, particularly with mixed reads & writes to private
+ * mappings, that we have raced with a PTE fault that overlaps with
+ * the PMD we need to set up. If so just return and the fault will be
+ * retried.
+ */
+ if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
+ !pmd_devmap(*vmf->pmd)) {
+ result = 0;
+ goto unlock_entry;
+ }
+
+ /*
* Note that we don't use iomap_apply here. We aren't doing I/O, only
* setting up a mapping, so really we're using iomap_begin() as a way
* to look up our filesystem block.


2017-06-05 16:32:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 089/115] x86/boot: Use CROSS_COMPILE prefix for readelf

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Rob Landley <[email protected]>

commit 3780578761921f094179c6289072a74b2228c602 upstream.

The boot code Makefile contains a straight 'readelf' invocation. This
causes build warnings in cross compile environments, when there is no
unprefixed readelf accessible via $PATH.

Add the missing $(CROSS_COMPILE) prefix.

[ tglx: Rewrote changelog ]

Fixes: 98f78525371b ("x86/boot: Refuse to build with data relocations")
Signed-off-by: Rob Landley <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Paul Bolle <[email protected]>
Cc: "H.J. Lu" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/boot/compressed/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -94,7 +94,7 @@ vmlinux-objs-$(CONFIG_EFI_MIXED) += $(ob
quiet_cmd_check_data_rel = DATAREL $@
define cmd_check_data_rel
for obj in $(filter %.o,$^); do \
- readelf -S $$obj | grep -qF .rel.local && { \
+ ${CROSS_COMPILE}readelf -S $$obj | grep -qF .rel.local && { \
echo "error: $$obj has data relocations!" >&2; \
exit 1; \
} || true; \


2017-06-05 16:32:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 084/115] mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: James Morse <[email protected]>

commit 9a291a7c9428155e8e623e4a3989f8be47134df5 upstream.

KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
special case. (check_user_page_hwpoison())

When huge pages are involved, this doesn't work so well.
get_user_pages() calls follow_hugetlb_page(), which stops early if it
receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
-EFAULT to the caller. The step to map this to -EHWPOISON based on the
FOLL_ flags is missing. The hwpoison special case is skipped, and
-EFAULT is returned to user-space, causing Qemu or kvmtool to exit.

Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page() and follow_hugetlb_page().

With this, KVM works as expected.

This isn't a problem for arm64 today as we haven't enabled
MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
too, so I think this should be a fix. This doesn't apply earlier than
stable's v4.11.1 due to all sorts of cleanup.

[[email protected]: add vm_fault_to_errno() call to faultin_page()]
suggested.
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: James Morse <[email protected]>
Acked-by: Punit Agrawal <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/mm.h | 11 +++++++++++
mm/gup.c | 20 ++++++++------------
mm/hugetlb.c | 5 +++++
3 files changed, 24 insertions(+), 12 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2315,6 +2315,17 @@ static inline struct page *follow_page(s
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */

+static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
+{
+ if (vm_fault & VM_FAULT_OOM)
+ return -ENOMEM;
+ if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+ return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+ if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
+ return -EFAULT;
+ return 0;
+}
+
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -407,12 +407,10 @@ static int faultin_page(struct task_stru

ret = handle_mm_fault(vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
- if (ret & VM_FAULT_OOM)
- return -ENOMEM;
- if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
- return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
- if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
- return -EFAULT;
+ int err = vm_fault_to_errno(ret, *flags);
+
+ if (err)
+ return err;
BUG();
}

@@ -723,12 +721,10 @@ retry:
ret = handle_mm_fault(vma, address, fault_flags);
major |= ret & VM_FAULT_MAJOR;
if (ret & VM_FAULT_ERROR) {
- if (ret & VM_FAULT_OOM)
- return -ENOMEM;
- if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
- return -EHWPOISON;
- if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
- return -EFAULT;
+ int err = vm_fault_to_errno(ret, 0);
+
+ if (err)
+ return err;
BUG();
}

--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4170,6 +4170,11 @@ long follow_hugetlb_page(struct mm_struc
}
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
if (ret & VM_FAULT_ERROR) {
+ int err = vm_fault_to_errno(ret, flags);
+
+ if (err)
+ return err;
+
remainder = 0;
break;
}


2017-06-05 16:32:55

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 088/115] PCI/PM: Add needs_resume flag to avoid suspend complete optimization

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Imre Deak <[email protected]>

commit 4d071c3238987325b9e50e33051a40d1cce311cc upstream.

Some drivers - like i915 - may not support the system suspend direct
complete optimization due to differences in their runtime and system
suspend sequence. Add a flag that when set resumes the device before
calling the driver's system suspend handlers which effectively disables
the optimization.

Needed by a future patch fixing suspend/resume on i915.

Suggested by Rafael.

Signed-off-by: Imre Deak <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/pci.c | 3 ++-
include/linux/pci.h | 5 +++++
2 files changed, 7 insertions(+), 1 deletion(-)

--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2142,7 +2142,8 @@ bool pci_dev_keep_suspended(struct pci_d

if (!pm_runtime_suspended(dev)
|| pci_target_state(pci_dev) != pci_dev->current_state
- || platform_pci_need_resume(pci_dev))
+ || platform_pci_need_resume(pci_dev)
+ || (pci_dev->dev_flags & PCI_DEV_FLAGS_NEEDS_RESUME))
return false;

/*
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -178,6 +178,11 @@ enum pci_dev_flags {
PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
/* Get VPD from function 0 VPD */
PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
+ /*
+ * Resume before calling the driver's system suspend hooks, disabling
+ * the direct_complete optimization.
+ */
+ PCI_DEV_FLAGS_NEEDS_RESUME = (__force pci_dev_flags_t) (1 << 11),
};

enum pci_irq_reroute_variant {


2017-06-05 16:32:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 082/115] mm/migrate: fix refcount handling when !hugepage_migration_supported()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Punit Agrawal <[email protected]>

commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream.

On failing to migrate a page, soft_offline_huge_page() performs the
necessary update to the hugepage ref-count.

But when !hugepage_migration_supported() , unmap_and_move_hugepage()
also decrements the page ref-count for the hugepage. The combined
behaviour leaves the ref-count in an inconsistent state.

This leads to soft lockups when running the overcommitted hugepage test
from mce-tests suite.

Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
INFO: rcu_preempt detected stalls on CPUs/tasks:
Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
(detected by 7, t=5254 jiffies, g=963, c=962, q=321)
thugetlb_overco R running task 0 2715 2685 0x00000008
Call trace:
dump_backtrace+0x0/0x268
show_stack+0x24/0x30
sched_show_task+0x134/0x180
rcu_print_detail_task_stall_rnp+0x54/0x7c
rcu_check_callbacks+0xa74/0xb08
update_process_times+0x34/0x60
tick_sched_handle.isra.7+0x38/0x70
tick_sched_timer+0x4c/0x98
__hrtimer_run_queues+0xc0/0x300
hrtimer_interrupt+0xac/0x228
arch_timer_handler_phys+0x3c/0x50
handle_percpu_devid_irq+0x8c/0x290
generic_handle_irq+0x34/0x50
__handle_domain_irq+0x68/0xc0
gic_handle_irq+0x5c/0xb0

Address this by changing the putback_active_hugepage() in
soft_offline_huge_page() to putback_movable_pages().

This only triggers on systems that enable memory failure handling
(ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
(!ARCH_ENABLE_HUGEPAGE_MIGRATION).

I imagine this wasn't triggered as there aren't many systems running
this configuration.

[[email protected]: remove dead comment, per Naoya]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Manoj Iyer <[email protected]>
Tested-by: Manoj Iyer <[email protected]>
Suggested-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Punit Agrawal <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/memory-failure.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1587,12 +1587,8 @@ static int soft_offline_huge_page(struct
if (ret) {
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
- /*
- * We know that soft_offline_huge_page() tries to migrate
- * only one hugepage pointed to by hpage, so we need not
- * run through the pagelist here.
- */
- putback_active_hugepage(hpage);
+ if (!list_empty(&pagelist))
+ putback_movable_pages(&pagelist);
if (ret > 0)
ret = -EIO;
} else {


2017-06-05 16:32:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 087/115] RDMA/qib,hfi1: Fix MR reference count leak on write with immediate

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Mike Marciniszyn <[email protected]>

commit 1feb40067cf04ae48d65f728d62ca255c9449178 upstream.

The handling of IB_RDMA_WRITE_ONLY_WITH_IMMEDIATE will leak a memory
reference when a buffer cannot be allocated for returning the immediate
data.

The issue is that the rkey validation has already occurred and the RNR
nak fails to release the reference that was fruitlessly gotten. The
the peer will send the identical single packet request when its RNR
timer pops.

The fix is to release the held reference prior to the rnr nak exit.
This is the only sequence the requires both rkey validation and the
buffer allocation on the same packet.

Tested-by: Tadeusz Struk <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Mike Marciniszyn <[email protected]>
Signed-off-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/infiniband/hw/hfi1/rc.c | 5 ++++-
drivers/infiniband/hw/qib/qib_rc.c | 4 +++-
2 files changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/infiniband/hw/hfi1/rc.c
+++ b/drivers/infiniband/hw/hfi1/rc.c
@@ -2149,8 +2149,11 @@ send_last:
ret = hfi1_rvt_get_rwqe(qp, 1);
if (ret < 0)
goto nack_op_err;
- if (!ret)
+ if (!ret) {
+ /* peer will send again */
+ rvt_put_ss(&qp->r_sge);
goto rnr_nak;
+ }
wc.ex.imm_data = ohdr->u.rc.imm_data;
wc.wc_flags = IB_WC_WITH_IMM;
goto send_last;
--- a/drivers/infiniband/hw/qib/qib_rc.c
+++ b/drivers/infiniband/hw/qib/qib_rc.c
@@ -1947,8 +1947,10 @@ send_last:
ret = qib_get_rwqe(qp, 1);
if (ret < 0)
goto nack_op_err;
- if (!ret)
+ if (!ret) {
+ rvt_put_ss(&qp->r_sge);
goto rnr_nak;
+ }
wc.ex.imm_data = ohdr->u.rc.imm_data;
hdrsize += 4;
wc.wc_flags = IB_WC_WITH_IMM;


2017-06-05 16:33:07

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 106/115] xfs: wait on new inodes during quotaoff dquot release

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 upstream.

The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
dquot(s) remain elevated and the dqpurge scan spins indefinitely.

To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked on the XFS_INEW state. We wait on the
inodes explicitly rather than skip and retry to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.

Reported-by: Eryu Guan <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_qm_syscalls.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -759,5 +759,6 @@ xfs_qm_dqrele_all_inodes(
uint flags)
{
ASSERT(mp->m_quotainfo);
- xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
+ xfs_inode_ag_iterator_flags(mp, xfs_dqrele_inode, flags, NULL,
+ XFS_AGITER_INEW_WAIT);
}


2017-06-05 16:33:18

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 108/115] xfs: fix use-after-free in xfs_finish_page_writeback

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eryu Guan <[email protected]>

commit 161f55efba5ddccc690139fae9373cafc3447a97 upstream.

Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after
end_page_writeback") fixed one use-after-free issue by
pre-calculating the loop conditionals before calling bh->b_end_io()
in the end_io processing loop, but it assigned 'next' pointer before
checking end offset boundary & breaking the loop, at which point the
bh might be freed already, and caused use-after-free.

This is caught by KASAN when running fstests generic/127 on sub-page
block size XFS.

[ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
[ 2747.868840] ==================================================================
[ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
...
[ 2747.918245] Call Trace:
[ 2747.920975] dump_stack+0x63/0x84
[ 2747.924673] kasan_object_err+0x21/0x70
[ 2747.928950] kasan_report+0x271/0x530
[ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.938409] ? end_page_writeback+0xce/0x110
[ 2747.943171] __asan_report_load8_noabort+0x19/0x20
[ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs]
[ 2747.958197] process_one_work+0x5ff/0x1000
[ 2747.962766] worker_thread+0xe4/0x10e0
[ 2747.966946] kthread+0x2d3/0x3d0
[ 2747.970546] ? process_one_work+0x1000/0x1000
[ 2747.975405] ? kthread_create_on_node+0xc0/0xc0
[ 2747.980457] ? syscall_return_slowpath+0xe6/0x140
[ 2747.985706] ? do_page_fault+0x30/0x80
[ 2747.989887] ret_from_fork+0x2c/0x40
[ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
[ 2748.001155] Allocated:
[ 2748.003782] PID = 8327
[ 2748.006411] save_stack_trace+0x1b/0x20
[ 2748.010688] save_stack+0x46/0xd0
[ 2748.014383] kasan_kmalloc+0xad/0xe0
[ 2748.018370] kasan_slab_alloc+0x12/0x20
[ 2748.022648] kmem_cache_alloc+0xb8/0x1b0
[ 2748.027024] alloc_buffer_head+0x22/0xc0
[ 2748.031399] alloc_page_buffers+0xd1/0x250
[ 2748.035968] create_empty_buffers+0x30/0x410
[ 2748.040730] create_page_buffers+0x120/0x1b0
[ 2748.045493] __block_write_begin_int+0x17a/0x1800
[ 2748.050740] iomap_write_begin+0x100/0x2f0
[ 2748.055308] iomap_zero_range_actor+0x253/0x5c0
[ 2748.060362] iomap_apply+0x157/0x270
[ 2748.064347] iomap_zero_range+0x5a/0x80
[ 2748.068624] iomap_truncate_page+0x6b/0xa0
[ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs]
[ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs]
[ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs]
[ 2748.088838] vfs_fallocate+0x2cf/0x780
[ 2748.093021] SyS_fallocate+0x48/0x80
[ 2748.097006] do_syscall_64+0x18a/0x430
[ 2748.101186] return_from_SYSCALL_64+0x0/0x6a
[ 2748.105948] Freed:
[ 2748.108189] PID = 8327
[ 2748.110816] save_stack_trace+0x1b/0x20
[ 2748.115093] save_stack+0x46/0xd0
[ 2748.118788] kasan_slab_free+0x73/0xc0
[ 2748.122969] kmem_cache_free+0x7a/0x200
[ 2748.127247] free_buffer_head+0x41/0x80
[ 2748.131524] try_to_free_buffers+0x178/0x250
[ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
[ 2748.141563] try_to_release_page+0x100/0x180
[ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0
[ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs]
[ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs]
[ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs]
[ 2748.168462] vfs_fallocate+0x2cf/0x780
[ 2748.172642] SyS_fallocate+0x48/0x80
[ 2748.176629] do_syscall_64+0x18a/0x430
[ 2748.180810] return_from_SYSCALL_64+0x0/0x6a

Fixed it by checking on offset against end & breaking out first,
dereference bh only if there're still bufferheads to process.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_aops.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -111,11 +111,11 @@ xfs_finish_page_writeback(

bsize = bh->b_size;
do {
+ if (off > end)
+ break;
next = bh->b_this_page;
if (off < bvec->bv_offset)
goto next_bh;
- if (off > end)
- break;
bh->b_end_io(bh, !error);
next_bh:
off += bsize;


2017-06-05 16:33:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 099/115] xfs: drop iolock from reclaim context to appease lockdep

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 3b4683c294095b5f777c03307ef8c60f47320e12 upstream.

Lockdep complains about use of the iolock in inode reclaim context
because it doesn't understand that reclaim has the last reference to
the inode, and thus an iolock->reclaim->iolock deadlock is not
possible.

The iolock is technically not necessary in xfs_inactive() and was
only added to appease an assert in xfs_free_eofblocks(), which can
be called from other non-reclaim contexts. Therefore, just kill the
assert and drop the use of the iolock from reclaim context to quiet
lockdep.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_bmap_util.c | 8 +++-----
fs/xfs/xfs_inode.c | 9 +++++----
2 files changed, 8 insertions(+), 9 deletions(-)

--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -904,9 +904,9 @@ xfs_can_free_eofblocks(struct xfs_inode
}

/*
- * This is called by xfs_inactive to free any blocks beyond eof
- * when the link count isn't zero and by xfs_dm_punch_hole() when
- * punching a hole to EOF.
+ * This is called to free any blocks beyond eof. The caller must hold
+ * IOLOCK_EXCL unless we are in the inode reclaim path and have the only
+ * reference to the inode.
*/
int
xfs_free_eofblocks(
@@ -921,8 +921,6 @@ xfs_free_eofblocks(
struct xfs_bmbt_irec imap;
struct xfs_mount *mp = ip->i_mount;

- ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
-
/*
* Figure out if there are any blocks beyond the end
* of the file. If not, then there is nothing to do.
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1906,12 +1906,13 @@ xfs_inactive(
* force is true because we are evicting an inode from the
* cache. Post-eof blocks must be freed, lest we end up with
* broken free space accounting.
+ *
+ * Note: don't bother with iolock here since lockdep complains
+ * about acquiring it in reclaim context. We have the only
+ * reference to the inode at this point anyways.
*/
- if (xfs_can_free_eofblocks(ip, true)) {
- xfs_ilock(ip, XFS_IOLOCK_EXCL);
+ if (xfs_can_free_eofblocks(ip, true))
xfs_free_eofblocks(ip);
- xfs_iunlock(ip, XFS_IOLOCK_EXCL);
- }

return;
}


2017-06-05 16:33:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 097/115] xfs: fix over-copying of getbmap parameters from userspace

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Darrick J. Wong <[email protected]>

commit be6324c00c4d1e0e665f03ed1fc18863a88da119 upstream.

In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
from userspace, or else we end up copying random stack contents into the
kernel. struct getbmap is a strict subset of getbmapx, so a partial
structure copy should work fine.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_ioctl.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1543,10 +1543,11 @@ xfs_ioc_getbmap(
unsigned int cmd,
void __user *arg)
{
- struct getbmapx bmx;
+ struct getbmapx bmx = { 0 };
int error;

- if (copy_from_user(&bmx, arg, sizeof(struct getbmapx)))
+ /* struct getbmap is a strict subset of struct getbmapx. */
+ if (copy_from_user(&bmx, arg, offsetof(struct getbmapx, bmv_iflags)))
return -EFAULT;

if (bmx.bmv_count < 2)


2017-06-05 16:33:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 100/115] xfs: fix integer truncation in xfs_bmap_remap_alloc

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Christoph Hellwig <[email protected]>

commit 52813fb13ff90bd9c39a93446cbf1103c290b6e9 upstream.

bno should be a xfs_fsblock_t, which is 64-bit wides instead of a
xfs_aglock_t, which truncates the value to 32 bits.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/libxfs/xfs_bmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3863,7 +3863,7 @@ xfs_bmap_remap_alloc(
{
struct xfs_trans *tp = ap->tp;
struct xfs_mount *mp = tp->t_mountp;
- xfs_agblock_t bno;
+ xfs_fsblock_t bno;
struct xfs_alloc_arg args;
int error;



2017-06-05 16:33:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 102/115] xfs: prevent multi-fsb dir readahead from reading random blocks

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit cb52ee334a45ae6c78a3999e4b473c43ddc528f4 upstream.

Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.

This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: 88..95 0 (88..95) 8
1: [8..15]: 80..87 0 (80..87) 8
2: [16..39]: 168..191 0 (168..191) 24
3: [40..63]: 5242952..5242975 1 (72..95) 24

Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.

Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.

Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_dir2_readdir.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -405,7 +405,8 @@ xfs_dir2_leaf_readbuf(
* Read-ahead a contiguous directory block.
*/
if (i > mip->ra_current &&
- map[mip->ra_index].br_blockcount >= geo->fsbcount) {
+ (map[mip->ra_index].br_blockcount - mip->ra_offset) >=
+ geo->fsbcount) {
xfs_dir3_data_readahead(dp,
map[mip->ra_index].br_startoff + mip->ra_offset,
XFS_FSB_TO_DADDR(dp->i_mount,
@@ -438,7 +439,7 @@ xfs_dir2_leaf_readbuf(
* The rest of this extent but not more than a dir
* block.
*/
- length = min_t(int, geo->fsbcount,
+ length = min_t(int, geo->fsbcount - j,
map[mip->ra_index].br_blockcount -
mip->ra_offset);
mip->ra_offset += length;


2017-06-05 16:33:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 113/115] xfs: avoid mount-time deadlock in CoW extent recovery

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Darrick J. Wong <[email protected]>

commit 3ecb3ac7b950ff8f6c6a61e8b7b0d6e3546429a0 upstream.

If a malicious user corrupts the refcount btree to cause a cycle between
different levels of the tree, the next mount attempt will deadlock in
the CoW recovery routine while grabbing buffer locks. We can use the
ability to re-grab a buffer that was previous locked to a transaction to
avoid deadlocks, so do that here.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/libxfs/xfs_refcount.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)

--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1629,13 +1629,28 @@ xfs_refcount_recover_cow_leftovers(
if (mp->m_sb.sb_agblocks >= XFS_REFC_COW_START)
return -EOPNOTSUPP;

- error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+ INIT_LIST_HEAD(&debris);
+
+ /*
+ * In this first part, we use an empty transaction to gather up
+ * all the leftover CoW extents so that we can subsequently
+ * delete them. The empty transaction is used to avoid
+ * a buffer lock deadlock if there happens to be a loop in the
+ * refcountbt because we're allowed to re-grab a buffer that is
+ * already attached to our transaction. When we're done
+ * recording the CoW debris we cancel the (empty) transaction
+ * and everything goes away cleanly.
+ */
+ error = xfs_trans_alloc_empty(mp, &tp);
if (error)
return error;
- cur = xfs_refcountbt_init_cursor(mp, NULL, agbp, agno, NULL);
+
+ error = xfs_alloc_read_agf(mp, tp, agno, 0, &agbp);
+ if (error)
+ goto out_trans;
+ cur = xfs_refcountbt_init_cursor(mp, tp, agbp, agno, NULL);

/* Find all the leftover CoW staging extents. */
- INIT_LIST_HEAD(&debris);
memset(&low, 0, sizeof(low));
memset(&high, 0, sizeof(high));
low.rc.rc_startblock = XFS_REFC_COW_START;
@@ -1645,10 +1660,11 @@ xfs_refcount_recover_cow_leftovers(
if (error)
goto out_cursor;
xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
- xfs_buf_relse(agbp);
+ xfs_trans_brelse(tp, agbp);
+ xfs_trans_cancel(tp);

/* Now iterate the list to free the leftovers */
- list_for_each_entry(rr, &debris, rr_list) {
+ list_for_each_entry_safe(rr, n, &debris, rr_list) {
/* Set up transaction. */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
if (error)
@@ -1676,8 +1692,16 @@ xfs_refcount_recover_cow_leftovers(
error = xfs_trans_commit(tp);
if (error)
goto out_free;
+
+ list_del(&rr->rr_list);
+ kmem_free(rr);
}

+ return error;
+out_defer:
+ xfs_defer_cancel(&dfops);
+out_trans:
+ xfs_trans_cancel(tp);
out_free:
/* Free the leftover list */
list_for_each_entry_safe(rr, n, &debris, rr_list) {
@@ -1688,11 +1712,6 @@ out_free:

out_cursor:
xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
- xfs_buf_relse(agbp);
- goto out_free;
-
-out_defer:
- xfs_defer_cancel(&dfops);
- xfs_trans_cancel(tp);
- goto out_free;
+ xfs_trans_brelse(tp, agbp);
+ goto out_trans;
}


2017-06-05 16:34:04

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 114/115] xfs: fix unaligned access in xfs_btree_visit_blocks

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Sandeen <[email protected]>

commit a4d768e702de224cc85e0c8eac9311763403b368 upstream.

This structure copy was throwing unaligned access warnings on sparc64:

Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]

xfs_btree_copy_ptrs does a memcpy, which avoids it.

Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/libxfs/xfs_btree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4395,7 +4395,7 @@ xfs_btree_visit_blocks(
xfs_btree_readahead_ptr(cur, ptr, 1);

/* save for the next iteration of the loop */
- lptr = *ptr;
+ xfs_btree_copy_ptrs(cur, &lptr, ptr, 1);
}

/* for each buffer in the level */


2017-06-05 16:33:46

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 094/115] xfs: use ->b_state to fix buffer I/O accounting release race

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 63db7c815bc0997c29e484d2409684fdd9fcd93b upstream.

We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.

The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads was used to
successfully manufacture this problem.

Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. It turns out that we already have separate buffer
state bits and associated serialization for dealing with buffer LRU
state in the form of ->b_state and ->b_lock. Therefore, replace the
_XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
accounting wrappers appropriately and make sure they are used with
the correct locking. This ensures that buffer in-flight state can be
modified at buffer release time without racing with modifications
from a buffer lock holder.

Fixes: 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against unmount")
Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Tested-by: Libor Pechacek <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_buf.c | 38 ++++++++++++++++++++++++++------------
fs/xfs/xfs_buf.h | 5 ++---
2 files changed, 28 insertions(+), 15 deletions(-)

--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -97,12 +97,16 @@ static inline void
xfs_buf_ioacct_inc(
struct xfs_buf *bp)
{
- if (bp->b_flags & (XBF_NO_IOACCT|_XBF_IN_FLIGHT))
+ if (bp->b_flags & XBF_NO_IOACCT)
return;

ASSERT(bp->b_flags & XBF_ASYNC);
- bp->b_flags |= _XBF_IN_FLIGHT;
- percpu_counter_inc(&bp->b_target->bt_io_count);
+ spin_lock(&bp->b_lock);
+ if (!(bp->b_state & XFS_BSTATE_IN_FLIGHT)) {
+ bp->b_state |= XFS_BSTATE_IN_FLIGHT;
+ percpu_counter_inc(&bp->b_target->bt_io_count);
+ }
+ spin_unlock(&bp->b_lock);
}

/*
@@ -110,14 +114,24 @@ xfs_buf_ioacct_inc(
* freed and unaccount from the buftarg.
*/
static inline void
-xfs_buf_ioacct_dec(
+__xfs_buf_ioacct_dec(
struct xfs_buf *bp)
{
- if (!(bp->b_flags & _XBF_IN_FLIGHT))
- return;
+ ASSERT(spin_is_locked(&bp->b_lock));

- bp->b_flags &= ~_XBF_IN_FLIGHT;
- percpu_counter_dec(&bp->b_target->bt_io_count);
+ if (bp->b_state & XFS_BSTATE_IN_FLIGHT) {
+ bp->b_state &= ~XFS_BSTATE_IN_FLIGHT;
+ percpu_counter_dec(&bp->b_target->bt_io_count);
+ }
+}
+
+static inline void
+xfs_buf_ioacct_dec(
+ struct xfs_buf *bp)
+{
+ spin_lock(&bp->b_lock);
+ __xfs_buf_ioacct_dec(bp);
+ spin_unlock(&bp->b_lock);
}

/*
@@ -149,9 +163,9 @@ xfs_buf_stale(
* unaccounted (released to LRU) before that occurs. Drop in-flight
* status now to preserve accounting consistency.
*/
- xfs_buf_ioacct_dec(bp);
-
spin_lock(&bp->b_lock);
+ __xfs_buf_ioacct_dec(bp);
+
atomic_set(&bp->b_lru_ref, 0);
if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
(list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
@@ -979,12 +993,12 @@ xfs_buf_rele(
* ensures the decrement occurs only once per-buf.
*/
if ((atomic_read(&bp->b_hold) == 1) && !list_empty(&bp->b_lru))
- xfs_buf_ioacct_dec(bp);
+ __xfs_buf_ioacct_dec(bp);
goto out_unlock;
}

/* the last reference has been dropped ... */
- xfs_buf_ioacct_dec(bp);
+ __xfs_buf_ioacct_dec(bp);
if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
/*
* If the buffer is added to the LRU take a new reference to the
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -63,7 +63,6 @@ typedef enum {
#define _XBF_KMEM (1 << 21)/* backed by heap memory */
#define _XBF_DELWRI_Q (1 << 22)/* buffer on a delwri queue */
#define _XBF_COMPOUND (1 << 23)/* compound buffer */
-#define _XBF_IN_FLIGHT (1 << 25) /* I/O in flight, for accounting purposes */

typedef unsigned int xfs_buf_flags_t;

@@ -84,14 +83,14 @@ typedef unsigned int xfs_buf_flags_t;
{ _XBF_PAGES, "PAGES" }, \
{ _XBF_KMEM, "KMEM" }, \
{ _XBF_DELWRI_Q, "DELWRI_Q" }, \
- { _XBF_COMPOUND, "COMPOUND" }, \
- { _XBF_IN_FLIGHT, "IN_FLIGHT" }
+ { _XBF_COMPOUND, "COMPOUND" }


/*
* Internal state flags.
*/
#define XFS_BSTATE_DISPOSE (1 << 0) /* buffer being discarded */
+#define XFS_BSTATE_IN_FLIGHT (1 << 1) /* I/O in flight */

/*
* The xfs_buftarg contains 2 notions of "sector size" -


2017-06-05 16:34:08

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 115/115] xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jan Kara <[email protected]>

commit d7fd24257aa60316bf81093f7f909dc9475ae974 upstream.

There is an off-by-one error in loop termination conditions in
xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1043,7 +1043,7 @@ xfs_find_get_desired_pgoff(

index = startoff >> PAGE_SHIFT;
endoff = XFS_FSB_TO_B(mp, map->br_startoff + map->br_blockcount);
- end = endoff >> PAGE_SHIFT;
+ end = (endoff - 1) >> PAGE_SHIFT;
do {
int want;
unsigned nr_pages;


2017-06-05 16:33:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 112/115] xfs: xfs_trans_alloc_empty

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Christoph Hellwig <[email protected]>

This is a partial cherry-pick of commit e89c041338
("xfs: implement the GETFSMAP ioctl"), which also adds this helper, and
a great example of why feature patches should be properly split into
their parts.

Signed-off-by: Darrick J. Wong <[email protected]>
[hch: split from the larger patch for -stable]
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_trans.c | 22 ++++++++++++++++++++++
fs/xfs/xfs_trans.h | 2 ++
2 files changed, 24 insertions(+)

--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -263,6 +263,28 @@ xfs_trans_alloc(
}

/*
+ * Create an empty transaction with no reservation. This is a defensive
+ * mechanism for routines that query metadata without actually modifying
+ * them -- if the metadata being queried is somehow cross-linked (think a
+ * btree block pointer that points higher in the tree), we risk deadlock.
+ * However, blocks grabbed as part of a transaction can be re-grabbed.
+ * The verifiers will notice the corrupt block and the operation will fail
+ * back to userspace without deadlocking.
+ *
+ * Note the zero-length reservation; this transaction MUST be cancelled
+ * without any dirty data.
+ */
+int
+xfs_trans_alloc_empty(
+ struct xfs_mount *mp,
+ struct xfs_trans **tpp)
+{
+ struct xfs_trans_res resv = {0};
+
+ return xfs_trans_alloc(mp, &resv, 0, 0, XFS_TRANS_NO_WRITECOUNT, tpp);
+}
+
+/*
* Record the indicated change to the given field for application
* to the file system's superblock when the transaction commits.
* For now, just store the change in the transaction structure.
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -158,6 +158,8 @@ typedef struct xfs_trans {
int xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
uint blocks, uint rtextents, uint flags,
struct xfs_trans **tpp);
+int xfs_trans_alloc_empty(struct xfs_mount *mp,
+ struct xfs_trans **tpp);
void xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);

struct xfs_buf *xfs_trans_get_buf_map(struct xfs_trans *tp,


2017-06-05 16:34:28

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 110/115] xfs: BMAPX shouldnt barf on inline-format directories

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Darrick J. Wong <[email protected]>

commit 6eadbf4c8ba816c10d1c97bed9aa861d9fd17809 upstream.

When we're fulfilling a BMAPX request, jump out early if the data fork
is in local format. This prevents us from hitting a debugging check in
bmapi_read and barfing errors back to userspace. The on-disk extent
count check later isn't sufficient for IF_DELALLOC mode because da
extents are in memory and not on disk.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_bmap_util.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -583,9 +583,13 @@ xfs_getbmap(
}
break;
default:
+ /* Local format data forks report no extents. */
+ if (ip->i_d.di_format == XFS_DINODE_FMT_LOCAL) {
+ bmv->bmv_entries = 0;
+ return 0;
+ }
if (ip->i_d.di_format != XFS_DINODE_FMT_EXTENTS &&
- ip->i_d.di_format != XFS_DINODE_FMT_BTREE &&
- ip->i_d.di_format != XFS_DINODE_FMT_LOCAL)
+ ip->i_d.di_format != XFS_DINODE_FMT_BTREE)
return -EINVAL;

if (xfs_get_extsz_hint(ip) ||


2017-06-05 16:34:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 103/115] xfs: fix up quotacheck buffer list error handling

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 20e8a063786050083fe05b4f45be338c60b49126 upstream.

The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.

Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_buf.c | 24 ++++++++++++++++++++++++
fs/xfs/xfs_buf.h | 1 +
fs/xfs/xfs_qm.c | 7 +------
3 files changed, 26 insertions(+), 6 deletions(-)

--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1093,6 +1093,8 @@ void
xfs_buf_unlock(
struct xfs_buf *bp)
{
+ ASSERT(xfs_buf_islocked(bp));
+
XB_CLEAR_OWNER(bp);
up(&bp->b_sema);

@@ -1829,6 +1831,28 @@ error:
}

/*
+ * Cancel a delayed write list.
+ *
+ * Remove each buffer from the list, clear the delwri queue flag and drop the
+ * associated buffer reference.
+ */
+void
+xfs_buf_delwri_cancel(
+ struct list_head *list)
+{
+ struct xfs_buf *bp;
+
+ while (!list_empty(list)) {
+ bp = list_first_entry(list, struct xfs_buf, b_list);
+
+ xfs_buf_lock(bp);
+ bp->b_flags &= ~_XBF_DELWRI_Q;
+ list_del_init(&bp->b_list);
+ xfs_buf_relse(bp);
+ }
+}
+
+/*
* Add a buffer to the delayed write list.
*
* This queues a buffer for writeout if it hasn't already been. Note that
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -329,6 +329,7 @@ extern void *xfs_buf_offset(struct xfs_b
extern void xfs_buf_stale(struct xfs_buf *bp);

/* Delayed Write Buffer Routines */
+extern void xfs_buf_delwri_cancel(struct list_head *);
extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
extern int xfs_buf_delwri_submit(struct list_head *);
extern int xfs_buf_delwri_submit_nowait(struct list_head *);
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1384,12 +1384,7 @@ xfs_qm_quotacheck(
mp->m_qflags |= flags;

error_return:
- while (!list_empty(&buffer_list)) {
- struct xfs_buf *bp =
- list_first_entry(&buffer_list, struct xfs_buf, b_list);
- list_del_init(&bp->b_list);
- xfs_buf_relse(bp);
- }
+ xfs_buf_delwri_cancel(&buffer_list);

if (error) {
xfs_warn(mp,


2017-06-05 16:35:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 111/115] xfs: bad assertion for delalloc an extent that start at i_size

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Zorro Lang <[email protected]>

commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream.

By run fsstress long enough time enough in RHEL-7, I find an
assertion failure (harder to reproduce on linux-4.11, but problem
is still there):

XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c

The assertion is in xfs_getbmap() funciton:

if (map[i].br_startblock == DELAYSTARTBLOCK &&
--> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
ASSERT((iflags & BMV_IF_DELALLOC) != 0);

When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
startoff is just at EOF. But we only need to make sure delalloc
extents that are within EOF, not include EOF.

Signed-off-by: Zorro Lang <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_bmap_util.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -717,7 +717,7 @@ xfs_getbmap(
* extents.
*/
if (map[i].br_startblock == DELAYSTARTBLOCK &&
- map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
+ map[i].br_startoff < XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
ASSERT((iflags & BMV_IF_DELALLOC) != 0);

if (map[i].br_startblock == HOLESTARTBLOCK &&


2017-06-05 16:35:57

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 101/115] xfs: handle array index overrun in xfs_dir2_leaf_readbuf()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Sandeen <[email protected]>

commit 023cc840b40fad95c6fe26fff1d380a8c9d45939 upstream.

Carlos had a case where "find" seemed to start spinning
forever and never return.

This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:

0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...

i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.

At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size. During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes. If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.

At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.

The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block. This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.

The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.

There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.

Huge appreciation for Carlos for debugging and isolating
the problem.

Debugged-and-analyzed-by: Carlos Maiolino <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
Tested-by: Carlos Maiolino <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Bill O'Donnell <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_dir2_readdir.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -394,6 +394,7 @@ xfs_dir2_leaf_readbuf(

/*
* Do we need more readahead?
+ * Each loop tries to process 1 full dir blk; last may be partial.
*/
blk_start_plug(&plug);
for (mip->ra_index = mip->ra_offset = i = 0;
@@ -425,9 +426,14 @@ xfs_dir2_leaf_readbuf(
}

/*
- * Advance offset through the mapping table.
+ * Advance offset through the mapping table, processing a full
+ * dir block even if it is fragmented into several extents.
+ * But stop if we have consumed all valid mappings, even if
+ * it's not yet a full directory block.
*/
- for (j = 0; j < geo->fsbcount; j += length ) {
+ for (j = 0;
+ j < geo->fsbcount && mip->ra_index < mip->map_valid;
+ j += length ) {
/*
* The rest of this extent but not more than a dir
* block.


2017-06-05 16:36:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 098/115] xfs: actually report xattr extents via iomap

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Darrick J. Wong <[email protected]>

commit 84358536dc355a9c8978ee425f87e116186bed16 upstream.

Apparently FIEMAP for xattrs has been broken since we switched to
the iomap backend because of an incorrect check for xattr presence.
Also fix the broken locking.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_iomap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1170,10 +1170,10 @@ xfs_xattr_iomap_begin(
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;

- lockmode = xfs_ilock_data_map_shared(ip);
+ lockmode = xfs_ilock_attr_map_shared(ip);

/* if there are no attribute fork or extents, return ENOENT */
- if (XFS_IFORK_Q(ip) || !ip->i_d.di_anextents) {
+ if (!XFS_IFORK_Q(ip) || !ip->i_d.di_anextents) {
error = -ENOENT;
goto out_unlock;
}


2017-06-05 16:36:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 096/115] xfs: use dedicated log worker wq to avoid deadlock with cil wq

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 696a562072e3c14bcd13ae5acc19cdf27679e865 upstream.

The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.

Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():

xfs_log_worker()
xfs_log_force()
_xfs_log_force()
xlog_cil_force()
xlog_cil_force_lsn()
xlog_cil_push_now()
flush_work()

The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:

xlog_cil_push_work()
xlog_cil_push()
xlog_write()
xlog_state_get_iclog_space()
xlog_wait(&log->l_flush_wait, ...)

The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on eachother.

Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.

Diagnosed-by: David Jeffery <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_log.c | 2 +-
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 8 ++++++++
3 files changed, 10 insertions(+), 1 deletion(-)

--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1293,7 +1293,7 @@ void
xfs_log_work_queue(
struct xfs_mount *mp)
{
- queue_delayed_work(mp->m_log_workqueue, &mp->m_log->l_work,
+ queue_delayed_work(mp->m_sync_workqueue, &mp->m_log->l_work,
msecs_to_jiffies(xfs_syncd_centisecs * 10));
}

--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -183,6 +183,7 @@ typedef struct xfs_mount {
struct workqueue_struct *m_reclaim_workqueue;
struct workqueue_struct *m_log_workqueue;
struct workqueue_struct *m_eofblocks_workqueue;
+ struct workqueue_struct *m_sync_workqueue;

/*
* Generation of the filesysyem layout. This is incremented by each
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -877,8 +877,15 @@ xfs_init_mount_workqueues(
if (!mp->m_eofblocks_workqueue)
goto out_destroy_log;

+ mp->m_sync_workqueue = alloc_workqueue("xfs-sync/%s", WQ_FREEZABLE, 0,
+ mp->m_fsname);
+ if (!mp->m_sync_workqueue)
+ goto out_destroy_eofb;
+
return 0;

+out_destroy_eofb:
+ destroy_workqueue(mp->m_eofblocks_workqueue);
out_destroy_log:
destroy_workqueue(mp->m_log_workqueue);
out_destroy_reclaim:
@@ -899,6 +906,7 @@ STATIC void
xfs_destroy_mount_workqueues(
struct xfs_mount *mp)
{
+ destroy_workqueue(mp->m_sync_workqueue);
destroy_workqueue(mp->m_eofblocks_workqueue);
destroy_workqueue(mp->m_log_workqueue);
destroy_workqueue(mp->m_reclaim_workqueue);


2017-06-05 16:33:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 079/115] mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Tetsuo Handa <[email protected]>

commit c288983dddf714216428774e022ad78f48dd8cb1 upstream.

Roman Gushchin has reported that the OOM killer can trivially selects
next OOM victim when a thread doing memory allocation from page fault
path was selected as first OOM victim.

allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
oom_kill_process+0x219/0x3e0
out_of_memory+0x11d/0x480
__alloc_pages_slowpath+0xc84/0xd40
__alloc_pages_nodemask+0x245/0x260
alloc_pages_vma+0xa2/0x270
__handle_mm_fault+0xca9/0x10c0
handle_mm_fault+0xf3/0x210
__do_page_fault+0x240/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
__alloc_pages_slowpath+0xd32/0xd40
__alloc_pages_nodemask+0x245/0x260
alloc_pages_vma+0xa2/0x270
__handle_mm_fault+0xca9/0x10c0
handle_mm_fault+0xf3/0x210
__do_page_fault+0x240/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
...
allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
allocate cpuset=/ mems_allowed=0
CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
oom_kill_process+0x219/0x3e0
out_of_memory+0x11d/0x480
pagefault_out_of_memory+0x68/0x80
mm_fault_error+0x8f/0x190
? handle_mm_fault+0xf3/0x210
__do_page_fault+0x4b2/0x4e0
trace_do_page_fault+0x37/0xe0
do_async_page_fault+0x19/0x70
async_page_fault+0x28/0x30
...
Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

There is a race window that the OOM reaper completes reclaiming the
first victim's memory while nothing but mutex_trylock() prevents the
first victim from calling out_of_memory() from pagefault_out_of_memory()
after memory allocation for page fault path failed due to being selected
as an OOM victim.

This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
GFP_NOFAIL checks in the allocator slowpath") because that commit
silently changed the behavior from

/* Avoid allocations with no watermarks from looping endlessly */

to

/*
* Give up allocations without trying memory reserves if selected
* as an OOM victim
*/

in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
flag. I have noticed this change but I didn't post a patch because I
thought it is an acceptable change other than noise by warn_alloc()
because !__GFP_NOFAIL allocations are allowed to fail. But we
overlooked that failing memory allocation from page fault path makes
difference due to the race window explained above.

While it might be possible to add a check to pagefault_out_of_memory()
that prevents the first victim from calling out_of_memory() or remove
out_of_memory() from pagefault_out_of_memory(), changing
pagefault_out_of_memory() does not suppress noise by warn_alloc() when
allocating thread was selected as an OOM victim. There is little point
with printing similar backtraces and memory information from both
out_of_memory() and warn_alloc().

Instead, if we guarantee that current thread can try allocations with no
watermarks once when current thread looping inside
__alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
can use memory reserves" rules and suppress noise by warn_alloc() and
prevent memory allocations from page fault path from calling
pagefault_out_of_memory().

If we take the comment literally, this patch would do

- if (test_thread_flag(TIF_MEMDIE))
- goto nopage;
+ if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
+ goto nopage;

because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
given. But if I recall correctly (I couldn't find the message), the
condition is meant to apply to only OOM victims despite the comment.
Therefore, this patch preserves TIF_MEMDIE check.

Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: Roman Gushchin <[email protected]>
Tested-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3834,7 +3834,9 @@ retry:
goto got_pg;

/* Avoid allocations with no watermarks from looping endlessly */
- if (test_thread_flag(TIF_MEMDIE))
+ if (test_thread_flag(TIF_MEMDIE) &&
+ (alloc_flags == ALLOC_NO_WATERMARKS ||
+ (gfp_mask & __GFP_NOMEMALLOC)))
goto nopage;

/* Retry as long as the OOM killer is making progress */


2017-06-05 16:37:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 109/115] xfs: fix indlen accounting error on partial delalloc conversion

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream.

The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.

If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).

The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.

Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/libxfs/xfs_bmap.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2106,8 +2106,10 @@ xfs_bmap_add_extent_delay_real(
}
temp = xfs_bmap_worst_indlen(bma->ip, temp);
temp2 = xfs_bmap_worst_indlen(bma->ip, temp2);
- diff = (int)(temp + temp2 - startblockval(PREV.br_startblock) -
- (bma->cur ? bma->cur->bc_private.b.allocated : 0));
+ diff = (int)(temp + temp2 -
+ (startblockval(PREV.br_startblock) -
+ (bma->cur ?
+ bma->cur->bc_private.b.allocated : 0)));
if (diff > 0) {
error = xfs_mod_fdblocks(bma->ip->i_mount,
-((int64_t)diff), false);
@@ -2164,7 +2166,6 @@ xfs_bmap_add_extent_delay_real(
temp = da_new;
if (bma->cur)
temp += bma->cur->bc_private.b.allocated;
- ASSERT(temp <= da_old);
if (temp < da_old)
xfs_mod_fdblocks(bma->ip->i_mount,
(int64_t)(da_old - temp), false);


2017-06-05 16:37:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 107/115] xfs: reserve enough blocks to handle btree splits when remapping

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Darrick J. Wong <[email protected]>

commit fe0be23e68200573de027de9b8cc2b27e7fce35e upstream.

In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent. This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.

Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.

With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/libxfs/xfs_trans_space.h | 23 +++++++++++++++++------
fs/xfs/xfs_bmap_item.c | 5 ++++-
fs/xfs/xfs_reflink.c | 18 ++++++++++++++++--
3 files changed, 37 insertions(+), 9 deletions(-)

--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -21,8 +21,20 @@
/*
* Components of space reservations.
*/
+
+/* Worst case number of rmaps that can be held in a block. */
#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
+
+/* Adding one rmap could split every level up to the top of the tree. */
+#define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
+
+/* Blocks we might need to add "b" rmaps to a tree. */
+#define XFS_NRMAPADD_SPACE_RES(mp, b)\
+ (((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
+ XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) * \
+ XFS_RMAPADD_SPACE_RES(mp))
+
#define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) \
(((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
#define XFS_EXTENTADD_SPACE_RES(mp,w) (XFS_BM_MAXLEVELS(mp,w) - 1)
@@ -30,13 +42,12 @@
(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
XFS_EXTENTADD_SPACE_RES(mp,w))
+
+/* Blocks we might need to add "b" mappings & rmappings to a file. */
#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
- (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
- XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
- XFS_EXTENTADD_SPACE_RES(mp,w) + \
- ((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
- XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) * \
- (mp)->m_rmap_maxlevels)
+ (XFS_NEXTENTADD_SPACE_RES((mp), (b), (w)) + \
+ XFS_NRMAPADD_SPACE_RES((mp), (b)))
+
#define XFS_DAENTER_1B(mp,w) \
((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
#define XFS_DAENTER_DBS(mp,w) \
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -34,6 +34,8 @@
#include "xfs_bmap.h"
#include "xfs_icache.h"
#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"


kmem_zone_t *xfs_bui_zone;
@@ -446,7 +448,8 @@ xfs_bui_recover(
return -EIO;
}

- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+ XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
if (error)
return error;
budp = xfs_trans_get_bud(tp, buip);
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -709,8 +709,22 @@ xfs_reflink_end_cow(
offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);

- /* Start a rolling transaction to switch the mappings */
- resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+ /*
+ * Start a rolling transaction to switch the mappings. We're
+ * unlikely ever to have to remap 16T worth of single-block
+ * extents, so just cap the worst case extent count to 2^32-1.
+ * Stick a warning in just in case, and avoid 64-bit division.
+ */
+ BUILD_BUG_ON(MAX_RW_COUNT > UINT_MAX);
+ if (end_fsb - offset_fsb > UINT_MAX) {
+ error = -EFSCORRUPTED;
+ xfs_force_shutdown(ip->i_mount, SHUTDOWN_CORRUPT_INCORE);
+ ASSERT(0);
+ goto out;
+ }
+ resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
+ (unsigned int)(end_fsb - offset_fsb),
+ XFS_DATA_FORK);
error = xfs_trans_alloc(ip->i_mount, &M_RES(ip->i_mount)->tr_write,
resblks, 0, 0, &tp);
if (error)


2017-06-05 16:37:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 105/115] xfs: update ag iterator to support wait on new inodes

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 upstream.

The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.

Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
radix tree inode lookup and wait on them before the callback is
executed.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_icache.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--------
fs/xfs/xfs_icache.h | 8 +++++++
2 files changed, 53 insertions(+), 8 deletions(-)

--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -262,6 +262,22 @@ xfs_inode_clear_reclaim_tag(
xfs_perag_clear_reclaim_tag(pag);
}

+static void
+xfs_inew_wait(
+ struct xfs_inode *ip)
+{
+ wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_INEW_BIT);
+ DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
+
+ do {
+ prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+ if (!xfs_iflags_test(ip, XFS_INEW))
+ break;
+ schedule();
+ } while (true);
+ finish_wait(wq, &wait.wait);
+}
+
/*
* When we recycle a reclaimable inode, we need to re-initialise the VFS inode
* part of the structure. This is made more complex by the fact we store
@@ -626,9 +642,11 @@ out_error_or_again:

STATIC int
xfs_inode_ag_walk_grab(
- struct xfs_inode *ip)
+ struct xfs_inode *ip,
+ int flags)
{
struct inode *inode = VFS_I(ip);
+ bool newinos = !!(flags & XFS_AGITER_INEW_WAIT);

ASSERT(rcu_read_lock_held());

@@ -646,7 +664,8 @@ xfs_inode_ag_walk_grab(
goto out_unlock_noent;

/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
- if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
+ if ((!newinos && __xfs_iflags_test(ip, XFS_INEW)) ||
+ __xfs_iflags_test(ip, XFS_IRECLAIMABLE | XFS_IRECLAIM))
goto out_unlock_noent;
spin_unlock(&ip->i_flags_lock);

@@ -674,7 +693,8 @@ xfs_inode_ag_walk(
void *args),
int flags,
void *args,
- int tag)
+ int tag,
+ int iter_flags)
{
uint32_t first_index;
int last_error = 0;
@@ -716,7 +736,7 @@ restart:
for (i = 0; i < nr_found; i++) {
struct xfs_inode *ip = batch[i];

- if (done || xfs_inode_ag_walk_grab(ip))
+ if (done || xfs_inode_ag_walk_grab(ip, iter_flags))
batch[i] = NULL;

/*
@@ -744,6 +764,9 @@ restart:
for (i = 0; i < nr_found; i++) {
if (!batch[i])
continue;
+ if ((iter_flags & XFS_AGITER_INEW_WAIT) &&
+ xfs_iflags_test(batch[i], XFS_INEW))
+ xfs_inew_wait(batch[i]);
error = execute(batch[i], flags, args);
IRELE(batch[i]);
if (error == -EAGAIN) {
@@ -823,12 +846,13 @@ xfs_cowblocks_worker(
}

int
-xfs_inode_ag_iterator(
+xfs_inode_ag_iterator_flags(
struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, int flags,
void *args),
int flags,
- void *args)
+ void *args,
+ int iter_flags)
{
struct xfs_perag *pag;
int error = 0;
@@ -838,7 +862,8 @@ xfs_inode_ag_iterator(
ag = 0;
while ((pag = xfs_perag_get(mp, ag))) {
ag = pag->pag_agno + 1;
- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
+ error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1,
+ iter_flags);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -850,6 +875,17 @@ xfs_inode_ag_iterator(
}

int
+xfs_inode_ag_iterator(
+ struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags,
+ void *args),
+ int flags,
+ void *args)
+{
+ return xfs_inode_ag_iterator_flags(mp, execute, flags, args, 0);
+}
+
+int
xfs_inode_ag_iterator_tag(
struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, int flags,
@@ -866,7 +902,8 @@ xfs_inode_ag_iterator_tag(
ag = 0;
while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
ag = pag->pag_agno + 1;
- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
+ error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag,
+ 0);
xfs_perag_put(pag);
if (error) {
last_error = error;
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -48,6 +48,11 @@ struct xfs_eofblocks {
#define XFS_IGET_UNTRUSTED 0x2
#define XFS_IGET_DONTCACHE 0x4

+/*
+ * flags for AG inode iterator
+ */
+#define XFS_AGITER_INEW_WAIT 0x1 /* wait on new inodes */
+
int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
uint flags, uint lock_flags, xfs_inode_t **ipp);

@@ -79,6 +84,9 @@ void xfs_cowblocks_worker(struct work_st
int xfs_inode_ag_iterator(struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, int flags, void *args),
int flags, void *args);
+int xfs_inode_ag_iterator_flags(struct xfs_mount *mp,
+ int (*execute)(struct xfs_inode *ip, int flags, void *args),
+ int flags, void *args, int iter_flags);
int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, int flags, void *args),
int flags, void *args, int tag);


2017-06-05 16:33:05

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 104/115] xfs: support ability to wait on new inodes

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Brian Foster <[email protected]>

commit 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f upstream.

Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.

The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the appropriate bit when XFS_INEW is
cleared.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_icache.c | 5 ++++-
fs/xfs/xfs_inode.h | 4 +++-
2 files changed, 7 insertions(+), 2 deletions(-)

--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -366,14 +366,17 @@ xfs_iget_cache_hit(

error = xfs_reinit_inode(mp, inode);
if (error) {
+ bool wake;
/*
* Re-initializing the inode failed, and we are in deep
* trouble. Try to re-add it to the reclaim list.
*/
rcu_read_lock();
spin_lock(&ip->i_flags_lock);
-
+ wake = !!__xfs_iflags_test(ip, XFS_INEW);
ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
+ if (wake)
+ wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
trace_xfs_iget_reclaim_fail(ip);
goto out_error;
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -216,7 +216,8 @@ static inline bool xfs_is_reflink_inode(
#define XFS_IRECLAIM (1 << 0) /* started reclaiming this inode */
#define XFS_ISTALE (1 << 1) /* inode has been staled */
#define XFS_IRECLAIMABLE (1 << 2) /* inode can be reclaimed */
-#define XFS_INEW (1 << 3) /* inode has just been allocated */
+#define __XFS_INEW_BIT 3 /* inode has just been allocated */
+#define XFS_INEW (1 << __XFS_INEW_BIT)
#define XFS_ITRUNCATED (1 << 5) /* truncated down so flush-on-close */
#define XFS_IDIRTY_RELEASE (1 << 6) /* dirty release already seen */
#define __XFS_IFLOCK_BIT 7 /* inode is being flushed right now */
@@ -464,6 +465,7 @@ static inline void xfs_finish_inode_setu
xfs_iflags_clear(ip, XFS_INEW);
barrier();
unlock_new_inode(VFS_I(ip));
+ wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
}

static inline void xfs_setup_existing_inode(struct xfs_inode *ip)


2017-06-05 16:38:41

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 095/115] xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eryu Guan <[email protected]>

commit 8affebe16d79ebefb1d9d6d56a46dc89716f9453 upstream.

xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.

# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/xfs/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
Whence Result
DATA EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is uncovered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1049,7 +1049,7 @@ xfs_find_get_desired_pgoff(
unsigned nr_pages;
unsigned int i;

- want = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
+ want = min_t(pgoff_t, end - index, PAGEVEC_SIZE - 1) + 1;
nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
want);
/*


2017-06-05 16:38:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 086/115] RDMA/srp: Fix NULL deref at srp_destroy_qp()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Israel Rukshin <[email protected]>

commit 95c2ef50c726a51d580c35ae8dccd383abaa8701 upstream.

If srp_init_qp() fails at srp_create_ch_ib() then ch->send_cq
may be NULL.
Calling directly to ib_destroy_qp() is sufficient because
no work requests were posted on the created qp.

Fixes: 9294000d6d89 ("IB/srp: Drain the send queue before destroying a QP")
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Bart van Assche <[email protected]>--
Signed-off-by: Doug Ledford <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/infiniband/ulp/srp/ib_srp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -570,7 +570,7 @@ static int srp_create_ch_ib(struct srp_r
return 0;

err_qp:
- srp_destroy_qp(ch, qp);
+ ib_destroy_qp(qp);

err_send_cq:
ib_free_cq(send_cq);


2017-06-05 16:39:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 083/115] mlock: fix mlock count can not decrease in race condition

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Yisheng Xie <[email protected]>

commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream.

Kefeng reported that when running the follow test, the mlock count in
meminfo will increase permanently:

[1] testcase
linux:~ # cat test_mlockal
grep Mlocked /proc/meminfo
for j in `seq 0 10`
do
for i in `seq 4 15`
do
./p_mlockall >> log &
done
sleep 0.2
done
# wait some time to let mlock counter decrease and 5s may not enough
sleep 5
grep Mlocked /proc/meminfo

linux:~ # cat p_mlockall.c
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>

#define SPACE_LEN 4096

int main(int argc, char ** argv)
{
int ret;
void *adr = malloc(SPACE_LEN);
if (!adr)
return -1;

ret = mlockall(MCL_CURRENT | MCL_FUTURE);
printf("mlcokall ret = %d\n", ret);

ret = munlockall();
printf("munlcokall ret = %d\n", ret);

free(adr);
return 0;
}

In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g. when the pages are on some other
cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.

Fix it by counting the number of page whose PageMlock flag is cleared.

Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yisheng Xie <[email protected]>
Reported-by: Kefeng Wang <[email protected]>
Tested-by: Kefeng Wang <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Joern Engel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: zhongjiang <[email protected]>
Cc: Hanjun Guo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/mlock.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -286,7 +286,7 @@ static void __munlock_pagevec(struct pag
{
int i;
int nr = pagevec_count(pvec);
- int delta_munlocked;
+ int delta_munlocked = -nr;
struct pagevec pvec_putback;
int pgrescued = 0;

@@ -306,6 +306,8 @@ static void __munlock_pagevec(struct pag
continue;
else
__munlock_isolation_failed(page);
+ } else {
+ delta_munlocked++;
}

/*
@@ -317,7 +319,6 @@ static void __munlock_pagevec(struct pag
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
}
- delta_munlocked = -nr + pagevec_count(&pvec_putback);
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
spin_unlock_irq(zone_lru_lock(zone));



2017-06-05 16:39:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 085/115] mm: consider memblock reservations for deferred memory initialization sizing

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Michal Hocko <[email protected]>

commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream.

We have seen an early OOM killer invocation on ppc64 systems with
crashkernel=4096M:

kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
kthreadd cpuset=/ mems_allowed=7
CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
Call Trace:
dump_stack+0xb0/0xf0 (unreliable)
dump_header+0xb0/0x258
out_of_memory+0x5f0/0x640
__alloc_pages_nodemask+0xa8c/0xc80
kmem_getpages+0x84/0x1a0
fallback_alloc+0x2a4/0x320
kmem_cache_alloc_node+0xc0/0x2e0
copy_process.isra.25+0x260/0x1b30
_do_fork+0x94/0x470
kernel_thread+0x48/0x60
kthreadd+0x264/0x330
ret_from_kernel_thread+0x5c/0xa4

Mem-Info:
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:5 slab_unreclaimable:73
mapped:0 shmem:0 pagetables:0 bounce:0
free:0 free_pcp:0 free_cma:0
Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
0 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
819200 pages RAM
0 pages HighMem/MovableOnly
817481 pages reserved
0 pages cma reserved
0 pages hwpoisoned

the reason is that the managed memory is too low (only 110MB) while the
rest of the the 50GB is still waiting for the deferred intialization to
be done. update_defer_init estimates the initial memoty to initialize
to 2GB at least but it doesn't consider any memory allocated in that
range. In this particular case we've had

Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)

so the low 2GB is mostly depleted.

Fix this by considering memblock allocations in the initial static
initialization estimation. Move the max_initialise to
reset_deferred_meminit and implement a simple memblock_reserved_memory
helper which iterates all reserved blocks and sums the size of all that
start below the given address. The cumulative size is than added on top
of the initial estimation. This is still not ideal because
reset_deferred_meminit doesn't consider holes and so reservation might
be above the initial estimation whihch we ignore but let's make the
logic simpler until we really need to handle more complicated cases.

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/memblock.h | 8 ++++++++
include/linux/mmzone.h | 1 +
mm/memblock.c | 23 +++++++++++++++++++++++
mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
4 files changed, 54 insertions(+), 11 deletions(-)

--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -423,11 +423,19 @@ static inline void early_memtest(phys_ad
}
#endif

+extern unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
+ phys_addr_t end_addr);
#else
static inline phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align)
{
return 0;
}
+
+static inline unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
+ phys_addr_t end_addr)
+{
+ return 0;
+}

#endif /* CONFIG_HAVE_MEMBLOCK */

--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -672,6 +672,7 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
+ unsigned long static_init_size;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1713,6 +1713,29 @@ static void __init_memblock memblock_dum
}
}

+extern unsigned long __init_memblock
+memblock_reserved_memory_within(phys_addr_t start_addr, phys_addr_t end_addr)
+{
+ struct memblock_region *rgn;
+ unsigned long size = 0;
+ int idx;
+
+ for_each_memblock_type((&memblock.reserved), rgn) {
+ phys_addr_t start, end;
+
+ if (rgn->base + rgn->size < start_addr)
+ continue;
+ if (rgn->base > end_addr)
+ continue;
+
+ start = rgn->base;
+ end = start + rgn->size;
+ size += end - start;
+ }
+
+ return size;
+}
+
void __init_memblock __memblock_dump_all(void)
{
pr_info("MEMBLOCK configuration:\n");
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -291,6 +291,26 @@ int page_group_by_mobility_disabled __re
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
+ unsigned long max_initialise;
+ unsigned long reserved_lowmem;
+
+ /*
+ * Initialise at least 2G of a node but also take into account that
+ * two large system hashes that can take up 1GB for 0.25TB/node.
+ */
+ max_initialise = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
+
+ /*
+ * Compensate the all the memblock reservations (e.g. crash kernel)
+ * from the initial estimation to make sure we will initialize enough
+ * memory to boot.
+ */
+ reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
+ pgdat->node_start_pfn + max_initialise);
+ max_initialise += reserved_lowmem;
+
+ pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}

@@ -313,20 +333,11 @@ static inline bool update_defer_init(pg_
unsigned long pfn, unsigned long zone_end,
unsigned long *nr_initialised)
{
- unsigned long max_initialise;
-
/* Always populate low zones for address-contrained allocations */
if (zone_end < pgdat_end_pfn(pgdat))
return true;
- /*
- * Initialise at least 2G of a node but also take into account that
- * two large system hashes that can take up 1GB for 0.25TB/node.
- */
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
-
(*nr_initialised)++;
- if ((*nr_initialised > max_initialise) &&
+ if ((*nr_initialised > pgdat->static_init_size) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
@@ -6100,7 +6111,6 @@ void __paginginit free_area_init_node(in
/* pg_data_t should be reset to zero when it's allocated */
WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);

- reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
pgdat->per_cpu_nodestats = NULL;
@@ -6122,6 +6132,7 @@ void __paginginit free_area_init_node(in
(unsigned long)pgdat->node_mem_map);
#endif

+ reset_deferred_meminit(pgdat);
free_area_init_core(pgdat);
}



2017-06-05 16:32:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 093/115] xfs: Fix missed holes in SEEK_HOLE implementation

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jan Kara <[email protected]>

commit 5375023ae1266553a7baa0845e82917d8803f48c upstream.

XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
can be seen by the following command:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence Result
HOLE 139264

Where we can see that hole at offset 56k was just ignored by SEEK_HOLE
implementation. The bug is in xfs_find_get_desired_pgoff() which does
not properly detect the case when pages are not contiguous.

Fix the problem by properly detecting when found page has larger offset
than expected.

Fixes: d126d43f631f996daeee5006714fed914be32368
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/xfs/xfs_file.c | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)

--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1076,17 +1076,6 @@ xfs_find_get_desired_pgoff(
break;
}

- /*
- * At lease we found one page. If this is the first time we
- * step into the loop, and if the first page index offset is
- * greater than the given search offset, a hole was found.
- */
- if (type == HOLE_OFF && lastoff == startoff &&
- lastoff < page_offset(pvec.pages[0])) {
- found = true;
- break;
- }
-
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
loff_t b_offset;
@@ -1098,18 +1087,18 @@ xfs_find_get_desired_pgoff(
* file mapping. However, page->index will not change
* because we have a reference on the page.
*
- * Searching done if the page index is out of range.
- * If the current offset is not reaches the end of
- * the specified search range, there should be a hole
- * between them.
+ * If current page offset is beyond where we've ended,
+ * we've found a hole.
*/
- if (page->index > end) {
- if (type == HOLE_OFF && lastoff < endoff) {
- *offset = lastoff;
- found = true;
- }
+ if (type == HOLE_OFF && lastoff < endoff &&
+ lastoff < page_offset(pvec.pages[i])) {
+ found = true;
+ *offset = lastoff;
goto out;
}
+ /* Searching done if the page index is out of range. */
+ if (page->index > end)
+ goto out;

lock_page(page);
/*


2017-06-05 16:40:40

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 091/115] slub/memcg: cure the brainless abuse of sysfs attributes

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Thomas Gleixner <[email protected]>

commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 upstream.

memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
to propagate settings from the root kmem_cache to a newly created
kmem_cache. It does that with:

attr->show(root, buf);
attr->store(new, buf, strlen(bug);

Aside of being a lazy and absurd hackery this is broken because it does
not check the return value of the show() function.

Some of the show() functions return 0 w/o touching the buffer. That
means in such a case the store function is called with the stale content
of the previous show(). That causes nonsense like invoking
kmem_cache_shrink() on a newly created kmem_cache. In the worst case it
would cause handing in an uninitialized buffer.

This should be rewritten proper by adding a propagate() callback to
those slub_attributes which must be propagated and avoid that insane
conversion to and from ASCII, but that's too large for a hot fix.

Check at least the return value of the show() function, so calling
store() with stale content is prevented.

Steven said:
"It can cause a deadlock with get_online_cpus() that has been uncovered
by recent cpu hotplug and lockdep changes that Thomas and Peter have
been doing.

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(cpu_hotplug.lock);
lock(slab_mutex);
lock(cpu_hotplug.lock);
lock(slab_mutex);

*** DEADLOCK ***"

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
Signed-off-by: Thomas Gleixner <[email protected]>
Reported-by: Steven Rostedt <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/slub.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5512,6 +5512,7 @@ static void memcg_propagate_slab_attrs(s
char mbuf[64];
char *buf;
struct slab_attribute *attr = to_slab_attr(slab_attrs[i]);
+ ssize_t len;

if (!attr || !attr->store || !attr->show)
continue;
@@ -5536,8 +5537,9 @@ static void memcg_propagate_slab_attrs(s
buf = buffer;
}

- attr->show(root_cache, buf);
- attr->store(s, buf, strlen(buf));
+ len = attr->show(root_cache, buf);
+ if (len > 0)
+ attr->store(s, buf, len);
}

if (buffer)


2017-06-05 16:32:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 062/115] iscsi-target: Fix initial login PDU asynchronous socket close OOPs

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Nicholas Bellinger <[email protected]>

commit 25cdda95fda78d22d44157da15aa7ea34be3c804 upstream.

This patch fixes a OOPs originally introduced by:

commit bb048357dad6d604520c91586334c9c230366a14
Author: Nicholas Bellinger <[email protected]>
Date: Thu Sep 5 14:54:04 2013 -0700

iscsi-target: Add sk->sk_state_change to cleanup after TCP failure

which would trigger a NULL pointer dereference when a TCP connection
was closed asynchronously via iscsi_target_sk_state_change(), but only
when the initial PDU processing in iscsi_target_do_login() from iscsi_np
process context was blocked waiting for backend I/O to complete.

To address this issue, this patch makes the following changes.

First, it introduces some common helper functions used for checking
socket closing state, checking login_flags, and atomically checking
socket closing state + setting login_flags.

Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
connection has dropped via iscsi_target_sk_state_change(), but the
initial PDU processing within iscsi_target_do_login() in iscsi_np
context is still running. For this case, it sets LOGIN_FLAGS_CLOSED,
but doesn't invoke schedule_delayed_work().

The original NULL pointer dereference case reported by MNC is now handled
by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
transitioning to FFP to determine when the socket has already closed,
or iscsi_target_start_negotiation() if the login needs to exchange
more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
closed. For both of these cases, the cleanup up of remaining connection
resources will occur in iscsi_target_start_negotiation() from iscsi_np
process context once the failure is detected.

Finally, to handle to case where iscsi_target_sk_state_change() is
called after the initial PDU procesing is complete, it now invokes
conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
existing iscsi_target_sk_check_close() checks detect connection failure.
For this case, the cleanup of remaining connection resources will occur
in iscsi_target_do_login_rx() from delayed workqueue process context
once the failure is detected.

Reported-by: Mike Christie <[email protected]>
Reviewed-by: Mike Christie <[email protected]>
Tested-by: Mike Christie <[email protected]>
Cc: Mike Christie <[email protected]>
Reported-by: Hannes Reinecke <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Varun Prakash <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/target/iscsi/iscsi_target_nego.c | 194 +++++++++++++++++++++----------
include/target/iscsi/iscsi_target_core.h | 1
2 files changed, 133 insertions(+), 62 deletions(-)

--- a/drivers/target/iscsi/iscsi_target_nego.c
+++ b/drivers/target/iscsi/iscsi_target_nego.c
@@ -493,14 +493,60 @@ static void iscsi_target_restore_sock_ca

static int iscsi_target_do_login(struct iscsi_conn *, struct iscsi_login *);

-static bool iscsi_target_sk_state_check(struct sock *sk)
+static bool __iscsi_target_sk_check_close(struct sock *sk)
{
if (sk->sk_state == TCP_CLOSE_WAIT || sk->sk_state == TCP_CLOSE) {
- pr_debug("iscsi_target_sk_state_check: TCP_CLOSE_WAIT|TCP_CLOSE,"
+ pr_debug("__iscsi_target_sk_check_close: TCP_CLOSE_WAIT|TCP_CLOSE,"
"returning FALSE\n");
- return false;
+ return true;
}
- return true;
+ return false;
+}
+
+static bool iscsi_target_sk_check_close(struct iscsi_conn *conn)
+{
+ bool state = false;
+
+ if (conn->sock) {
+ struct sock *sk = conn->sock->sk;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ state = (__iscsi_target_sk_check_close(sk) ||
+ test_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags));
+ read_unlock_bh(&sk->sk_callback_lock);
+ }
+ return state;
+}
+
+static bool iscsi_target_sk_check_flag(struct iscsi_conn *conn, unsigned int flag)
+{
+ bool state = false;
+
+ if (conn->sock) {
+ struct sock *sk = conn->sock->sk;
+
+ read_lock_bh(&sk->sk_callback_lock);
+ state = test_bit(flag, &conn->login_flags);
+ read_unlock_bh(&sk->sk_callback_lock);
+ }
+ return state;
+}
+
+static bool iscsi_target_sk_check_and_clear(struct iscsi_conn *conn, unsigned int flag)
+{
+ bool state = false;
+
+ if (conn->sock) {
+ struct sock *sk = conn->sock->sk;
+
+ write_lock_bh(&sk->sk_callback_lock);
+ state = (__iscsi_target_sk_check_close(sk) ||
+ test_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags));
+ if (!state)
+ clear_bit(flag, &conn->login_flags);
+ write_unlock_bh(&sk->sk_callback_lock);
+ }
+ return state;
}

static void iscsi_target_login_drop(struct iscsi_conn *conn, struct iscsi_login *login)
@@ -540,6 +586,20 @@ static void iscsi_target_do_login_rx(str

pr_debug("entering iscsi_target_do_login_rx, conn: %p, %s:%d\n",
conn, current->comm, current->pid);
+ /*
+ * If iscsi_target_do_login_rx() has been invoked by ->sk_data_ready()
+ * before initial PDU processing in iscsi_target_start_negotiation()
+ * has completed, go ahead and retry until it's cleared.
+ *
+ * Otherwise if the TCP connection drops while this is occuring,
+ * iscsi_target_start_negotiation() will detect the failure, call
+ * cancel_delayed_work_sync(&conn->login_work), and cleanup the
+ * remaining iscsi connection resources from iscsi_np process context.
+ */
+ if (iscsi_target_sk_check_flag(conn, LOGIN_FLAGS_INITIAL_PDU)) {
+ schedule_delayed_work(&conn->login_work, msecs_to_jiffies(10));
+ return;
+ }

spin_lock(&tpg->tpg_state_lock);
state = (tpg->tpg_state == TPG_STATE_ACTIVE);
@@ -547,26 +607,12 @@ static void iscsi_target_do_login_rx(str

if (!state) {
pr_debug("iscsi_target_do_login_rx: tpg_state != TPG_STATE_ACTIVE\n");
- iscsi_target_restore_sock_callbacks(conn);
- iscsi_target_login_drop(conn, login);
- iscsit_deaccess_np(np, tpg, tpg_np);
- return;
+ goto err;
}

- if (conn->sock) {
- struct sock *sk = conn->sock->sk;
-
- read_lock_bh(&sk->sk_callback_lock);
- state = iscsi_target_sk_state_check(sk);
- read_unlock_bh(&sk->sk_callback_lock);
-
- if (!state) {
- pr_debug("iscsi_target_do_login_rx, TCP state CLOSE\n");
- iscsi_target_restore_sock_callbacks(conn);
- iscsi_target_login_drop(conn, login);
- iscsit_deaccess_np(np, tpg, tpg_np);
- return;
- }
+ if (iscsi_target_sk_check_close(conn)) {
+ pr_debug("iscsi_target_do_login_rx, TCP state CLOSE\n");
+ goto err;
}

conn->login_kworker = current;
@@ -584,34 +630,29 @@ static void iscsi_target_do_login_rx(str
flush_signals(current);
conn->login_kworker = NULL;

- if (rc < 0) {
- iscsi_target_restore_sock_callbacks(conn);
- iscsi_target_login_drop(conn, login);
- iscsit_deaccess_np(np, tpg, tpg_np);
- return;
- }
+ if (rc < 0)
+ goto err;

pr_debug("iscsi_target_do_login_rx after rx_login_io, %p, %s:%d\n",
conn, current->comm, current->pid);

rc = iscsi_target_do_login(conn, login);
if (rc < 0) {
- iscsi_target_restore_sock_callbacks(conn);
- iscsi_target_login_drop(conn, login);
- iscsit_deaccess_np(np, tpg, tpg_np);
+ goto err;
} else if (!rc) {
- if (conn->sock) {
- struct sock *sk = conn->sock->sk;
-
- write_lock_bh(&sk->sk_callback_lock);
- clear_bit(LOGIN_FLAGS_READ_ACTIVE, &conn->login_flags);
- write_unlock_bh(&sk->sk_callback_lock);
- }
+ if (iscsi_target_sk_check_and_clear(conn, LOGIN_FLAGS_READ_ACTIVE))
+ goto err;
} else if (rc == 1) {
iscsi_target_nego_release(conn);
iscsi_post_login_handler(np, conn, zero_tsih);
iscsit_deaccess_np(np, tpg, tpg_np);
}
+ return;
+
+err:
+ iscsi_target_restore_sock_callbacks(conn);
+ iscsi_target_login_drop(conn, login);
+ iscsit_deaccess_np(np, tpg, tpg_np);
}

static void iscsi_target_do_cleanup(struct work_struct *work)
@@ -659,31 +700,54 @@ static void iscsi_target_sk_state_change
orig_state_change(sk);
return;
}
+ state = __iscsi_target_sk_check_close(sk);
+ pr_debug("__iscsi_target_sk_close_change: state: %d\n", state);
+
if (test_bit(LOGIN_FLAGS_READ_ACTIVE, &conn->login_flags)) {
pr_debug("Got LOGIN_FLAGS_READ_ACTIVE=1 sk_state_change"
" conn: %p\n", conn);
+ if (state)
+ set_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags);
write_unlock_bh(&sk->sk_callback_lock);
orig_state_change(sk);
return;
}
- if (test_and_set_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags)) {
+ if (test_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags)) {
pr_debug("Got LOGIN_FLAGS_CLOSED=1 sk_state_change conn: %p\n",
conn);
write_unlock_bh(&sk->sk_callback_lock);
orig_state_change(sk);
return;
}
+ /*
+ * If the TCP connection has dropped, go ahead and set LOGIN_FLAGS_CLOSED,
+ * but only queue conn->login_work -> iscsi_target_do_login_rx()
+ * processing if LOGIN_FLAGS_INITIAL_PDU has already been cleared.
+ *
+ * When iscsi_target_do_login_rx() runs, iscsi_target_sk_check_close()
+ * will detect the dropped TCP connection from delayed workqueue context.
+ *
+ * If LOGIN_FLAGS_INITIAL_PDU is still set, which means the initial
+ * iscsi_target_start_negotiation() is running, iscsi_target_do_login()
+ * via iscsi_target_sk_check_close() or iscsi_target_start_negotiation()
+ * via iscsi_target_sk_check_and_clear() is responsible for detecting the
+ * dropped TCP connection in iscsi_np process context, and cleaning up
+ * the remaining iscsi connection resources.
+ */
+ if (state) {
+ pr_debug("iscsi_target_sk_state_change got failed state\n");
+ set_bit(LOGIN_FLAGS_CLOSED, &conn->login_flags);
+ state = test_bit(LOGIN_FLAGS_INITIAL_PDU, &conn->login_flags);
+ write_unlock_bh(&sk->sk_callback_lock);

- state = iscsi_target_sk_state_check(sk);
- write_unlock_bh(&sk->sk_callback_lock);
-
- pr_debug("iscsi_target_sk_state_change: state: %d\n", state);
+ orig_state_change(sk);

- if (!state) {
- pr_debug("iscsi_target_sk_state_change got failed state\n");
- schedule_delayed_work(&conn->login_cleanup_work, 0);
+ if (!state)
+ schedule_delayed_work(&conn->login_work, 0);
return;
}
+ write_unlock_bh(&sk->sk_callback_lock);
+
orig_state_change(sk);
}

@@ -946,6 +1010,15 @@ static int iscsi_target_do_login(struct
if (iscsi_target_handle_csg_one(conn, login) < 0)
return -1;
if (login_rsp->flags & ISCSI_FLAG_LOGIN_TRANSIT) {
+ /*
+ * Check to make sure the TCP connection has not
+ * dropped asynchronously while session reinstatement
+ * was occuring in this kthread context, before
+ * transitioning to full feature phase operation.
+ */
+ if (iscsi_target_sk_check_close(conn))
+ return -1;
+
login->tsih = conn->sess->tsih;
login->login_complete = 1;
iscsi_target_restore_sock_callbacks(conn);
@@ -972,21 +1045,6 @@ static int iscsi_target_do_login(struct
break;
}

- if (conn->sock) {
- struct sock *sk = conn->sock->sk;
- bool state;
-
- read_lock_bh(&sk->sk_callback_lock);
- state = iscsi_target_sk_state_check(sk);
- read_unlock_bh(&sk->sk_callback_lock);
-
- if (!state) {
- pr_debug("iscsi_target_do_login() failed state for"
- " conn: %p\n", conn);
- return -1;
- }
- }
-
return 0;
}

@@ -1255,10 +1313,22 @@ int iscsi_target_start_negotiation(

write_lock_bh(&sk->sk_callback_lock);
set_bit(LOGIN_FLAGS_READY, &conn->login_flags);
+ set_bit(LOGIN_FLAGS_INITIAL_PDU, &conn->login_flags);
write_unlock_bh(&sk->sk_callback_lock);
}
-
+ /*
+ * If iscsi_target_do_login returns zero to signal more PDU
+ * exchanges are required to complete the login, go ahead and
+ * clear LOGIN_FLAGS_INITIAL_PDU but only if the TCP connection
+ * is still active.
+ *
+ * Otherwise if TCP connection dropped asynchronously, go ahead
+ * and perform connection cleanup now.
+ */
ret = iscsi_target_do_login(conn, login);
+ if (!ret && iscsi_target_sk_check_and_clear(conn, LOGIN_FLAGS_INITIAL_PDU))
+ ret = -1;
+
if (ret < 0) {
cancel_delayed_work_sync(&conn->login_work);
cancel_delayed_work_sync(&conn->login_cleanup_work);
--- a/include/target/iscsi/iscsi_target_core.h
+++ b/include/target/iscsi/iscsi_target_core.h
@@ -557,6 +557,7 @@ struct iscsi_conn {
#define LOGIN_FLAGS_READ_ACTIVE 1
#define LOGIN_FLAGS_CLOSED 2
#define LOGIN_FLAGS_READY 4
+#define LOGIN_FLAGS_INITIAL_PDU 8
unsigned long login_flags;
struct delayed_work login_work;
struct delayed_work login_cleanup_work;


2017-06-05 16:42:38

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 090/115] ksm: prevent crash after write_protect_page fails

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Andrea Arcangeli <[email protected]>

commit a7306c3436e9c8e584a4b9fad5f3dc91be2a6076 upstream.

"err" needs to be left set to -EFAULT if split_huge_page succeeds.
Otherwise if "err" gets clobbered with zero and write_protect_page
fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
and then try_to_merge_with_ksm_page() will continue thinking kpage is a
PageKsm when in fact it's still an anonymous page. Eventually it'll
crash in page_add_anon_rmap.

This has been reproduced on Fedora25 kernel but I can reproduce with
upstream too.

The bug was introduced in commit f765f540598a ("ksm: prepare to new THP
semantics") introduced in v4.5.

page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffffa09674bf0000
------------[ cut here ]------------
kernel BUG at mm/rmap.c:1222!
CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
RIP: do_page_add_anon_rmap+0x1c4/0x240
Call Trace:
page_add_anon_rmap+0x18/0x20
try_to_merge_with_ksm_page+0x50b/0x780
ksm_scan_thread+0x1211/0x1410
? prepare_to_wait_event+0x100/0x100
? try_to_merge_with_ksm_page+0x780/0x780
kthread+0xd9/0xf0
? kthread_park+0x60/0x60
ret_from_fork+0x25/0x30

Fixes: f765f54059 ("ksm: prepare to new THP semantics")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Federico Simoncelli <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
mm/ksm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1028,8 +1028,7 @@ static int try_to_merge_one_page(struct
goto out;

if (PageTransCompound(page)) {
- err = split_huge_page(page);
- if (err)
+ if (split_huge_page(page))
goto out_unlock;
}



2017-06-05 16:32:04

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 066/115] HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jason Gerecke <[email protected]>

commit 2ac97f0f6654da14312d125005c77a6010e0ea38 upstream.

The following Smatch complaint was generated in response to commit
2a6cdbd ("HID: wacom: Introduce new 'touch_input' device"):

drivers/hid/wacom_wac.c:1586 wacom_tpc_irq()
error: we previously assumed 'wacom->touch_input' could be null (see line 1577)

The 'touch_input' and 'pen_input' variables point to the 'struct input_dev'
used for relaying touch and pen events to userspace, respectively. If a
device does not have a touch interface or pen interface, the associated
input variable is NULL. The 'wacom_tpc_irq()' function is responsible for
forwarding input reports to a more-specific IRQ handler function. An
unknown report could theoretically be mistaken as e.g. a touch report
on a device which does not have a touch interface. This can be prevented
by only calling the pen/touch functions are called when the pen/touch
pointers are valid.

Fixes: 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device")
Signed-off-by: Jason Gerecke <[email protected]>
Reviewed-by: Ping Cheng <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/hid/wacom_wac.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)

--- a/drivers/hid/wacom_wac.c
+++ b/drivers/hid/wacom_wac.c
@@ -1571,37 +1571,38 @@ static int wacom_tpc_irq(struct wacom_wa
{
unsigned char *data = wacom->data;

- if (wacom->pen_input)
+ if (wacom->pen_input) {
dev_dbg(wacom->pen_input->dev.parent,
"%s: received report #%d\n", __func__, data[0]);
- else if (wacom->touch_input)
+
+ if (len == WACOM_PKGLEN_PENABLED ||
+ data[0] == WACOM_REPORT_PENABLED)
+ return wacom_tpc_pen(wacom);
+ }
+ else if (wacom->touch_input) {
dev_dbg(wacom->touch_input->dev.parent,
"%s: received report #%d\n", __func__, data[0]);

- switch (len) {
- case WACOM_PKGLEN_TPC1FG:
- return wacom_tpc_single_touch(wacom, len);
-
- case WACOM_PKGLEN_TPC2FG:
- return wacom_tpc_mt_touch(wacom);
-
- case WACOM_PKGLEN_PENABLED:
- return wacom_tpc_pen(wacom);
-
- default:
- switch (data[0]) {
- case WACOM_REPORT_TPC1FG:
- case WACOM_REPORT_TPCHID:
- case WACOM_REPORT_TPCST:
- case WACOM_REPORT_TPC1FGE:
+ switch (len) {
+ case WACOM_PKGLEN_TPC1FG:
return wacom_tpc_single_touch(wacom, len);

- case WACOM_REPORT_TPCMT:
- case WACOM_REPORT_TPCMT2:
- return wacom_mt_touch(wacom);
+ case WACOM_PKGLEN_TPC2FG:
+ return wacom_tpc_mt_touch(wacom);

- case WACOM_REPORT_PENABLED:
- return wacom_tpc_pen(wacom);
+ default:
+ switch (data[0]) {
+ case WACOM_REPORT_TPC1FG:
+ case WACOM_REPORT_TPCHID:
+ case WACOM_REPORT_TPCST:
+ case WACOM_REPORT_TPC1FGE:
+ return wacom_tpc_single_touch(wacom, len);
+
+ case WACOM_REPORT_TPCMT:
+ case WACOM_REPORT_TPCMT2:
+ return wacom_mt_touch(wacom);
+
+ }
}
}



2017-06-05 16:31:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 068/115] nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ming Lei <[email protected]>

commit 806f026f9b901eaf1a6baeb48b5da18d6a4f818e upstream.

Inside nvme_kill_queues(), we have to start hw queues for
draining requests in sw queues, .dispatch list and requeue list,
so use blk_mq_start_hw_queues() instead of blk_mq_start_stopped_hw_queues()
which only run queues if queues are stopped, but the queues may have
been started already, for example nvme_start_queues() is called in reset work
function.

blk_mq_start_hw_queues() run hw queues in current context, instead
of running asynchronously like before. Given nvme_kill_queues() is
run from either remove context or reset worker context, both are fine
to run hw queue directly. And the mutex of namespaces_mutex isn't a
problem too becasue nvme_start_freeze() runs hw queue in this way
already.

Reported-by: Zhang Yi <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/nvme/host/core.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2345,7 +2345,13 @@ void nvme_kill_queues(struct nvme_ctrl *
revalidate_disk(ns->disk);
blk_set_queue_dying(ns->queue);
blk_mq_abort_requeue_list(ns->queue);
- blk_mq_start_stopped_hw_queues(ns->queue, true);
+
+ /*
+ * Forcibly start all queues to avoid having stuck requests.
+ * Note that we must ensure the queues are not stopped
+ * when the final removal happens.
+ */
+ blk_mq_start_hw_queues(ns->queue);
}
mutex_unlock(&ctrl->namespaces_mutex);
}


2017-06-05 16:31:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 067/115] nvme-rdma: support devices with queue size < 32

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Marta Rybczynska <[email protected]>

commit 0544f5494a03b8846db74e02be5685d1f32b06c9 upstream.

In the case of small NVMe-oF queue size (<32) we may enter a deadlock
caused by the fact that the IB completions aren't sent waiting for 32
and the send queue will fill up.

The error is seen as (using mlx5):
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12

This patch changes the way the signaling is done so that it depends on
the queue depth now. The magic define has been removed completely.

Signed-off-by: Marta Rybczynska <[email protected]>
Signed-off-by: Samuel Jones <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/nvme/host/rdma.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1029,6 +1029,19 @@ static void nvme_rdma_send_done(struct i
nvme_rdma_wr_error(cq, wc, "SEND");
}

+static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+ int sig_limit;
+
+ /*
+ * We signal completion every queue depth/2 and also handle the
+ * degenerated case of a device with queue_depth=1, where we
+ * would need to signal every message.
+ */
+ sig_limit = max(queue->queue_size / 2, 1);
+ return (++queue->sig_count % sig_limit) == 0;
+}
+
static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
struct ib_send_wr *first, bool flush)
@@ -1056,9 +1069,6 @@ static int nvme_rdma_post_send(struct nv
* Would have been way to obvious to handle this in hardware or
* at least the RDMA stack..
*
- * This messy and racy code sniplet is copy and pasted from the iSER
- * initiator, and the magic '32' comes from there as well.
- *
* Always signal the flushes. The magic request used for the flush
* sequencer is not allocated in our driver's tagset and it's
* triggered to be freed by blk_cleanup_queue(). So we need to
@@ -1066,7 +1076,7 @@ static int nvme_rdma_post_send(struct nv
* embedded in request's payload, is not freed when __ib_process_cq()
* calls wr_cqe->done().
*/
- if ((++queue->sig_count % 32) == 0 || flush)
+ if (nvme_rdma_queue_sig_limit(queue) || flush)
wr.send_flags |= IB_SEND_SIGNALED;

if (first)


2017-06-05 16:43:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 075/115] ALSA: hda - No loopback on ALC299 codec

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Takashi Iwai <[email protected]>

commit fa16b69f1299004b60b625f181143500a246e5cb upstream.

ALC299 has no loopback mixer, but the driver still tries to add a beep
control over the mixer NID which leads to the error at accessing it.
This patch fixes it by properly declaring mixer_nid=0 for this codec.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195775
Fixes: 28f1f9b26cee ("ALSA: hda/realtek - Add new codec ID ALC299")
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
sound/pci/hda/patch_realtek.c | 3 +++
1 file changed, 3 insertions(+)

--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -6276,8 +6276,11 @@ static int patch_alc269(struct hda_codec
break;
case 0x10ec0225:
case 0x10ec0295:
+ spec->codec_variant = ALC269_TYPE_ALC225;
+ break;
case 0x10ec0299:
spec->codec_variant = ALC269_TYPE_ALC225;
+ spec->gen.mixer_nid = 0; /* no loopback on ALC299 */
break;
case 0x10ec0234:
case 0x10ec0274:


2017-06-05 16:31:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 074/115] pcmcia: remove left-over %Z format

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Nicolas Iooss <[email protected]>

commit ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 upstream.

Commit 5b5e0928f742 ("lib/vsprintf.c: remove %Z support") removed some
usages of format %Z but forgot "%.2Zx". This makes clang 4.0 reports a
-Wformat-extra-args warning because it does not know about %Z.

Replace %Z with %z.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nicolas Iooss <[email protected]>
Cc: Harald Welte <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/char/pcmcia/cm4040_cs.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- a/drivers/char/pcmcia/cm4040_cs.c
+++ b/drivers/char/pcmcia/cm4040_cs.c
@@ -374,7 +374,7 @@ static ssize_t cm4040_write(struct file

rc = write_sync_reg(SCR_HOST_TO_READER_START, dev);
if (rc <= 0) {
- DEBUGP(5, dev, "write_sync_reg c=%.2Zx\n", rc);
+ DEBUGP(5, dev, "write_sync_reg c=%.2zx\n", rc);
DEBUGP(2, dev, "<- cm4040_write (failed)\n");
if (rc == -ERESTARTSYS)
return rc;
@@ -387,7 +387,7 @@ static ssize_t cm4040_write(struct file
for (i = 0; i < bytes_to_write; i++) {
rc = wait_for_bulk_out_ready(dev);
if (rc <= 0) {
- DEBUGP(5, dev, "wait_for_bulk_out_ready rc=%.2Zx\n",
+ DEBUGP(5, dev, "wait_for_bulk_out_ready rc=%.2zx\n",
rc);
DEBUGP(2, dev, "<- cm4040_write (failed)\n");
if (rc == -ERESTARTSYS)
@@ -403,7 +403,7 @@ static ssize_t cm4040_write(struct file
rc = write_sync_reg(SCR_HOST_TO_READER_DONE, dev);

if (rc <= 0) {
- DEBUGP(5, dev, "write_sync_reg c=%.2Zx\n", rc);
+ DEBUGP(5, dev, "write_sync_reg c=%.2zx\n", rc);
DEBUGP(2, dev, "<- cm4040_write (failed)\n");
if (rc == -ERESTARTSYS)
return rc;


2017-06-05 16:44:05

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 078/115] ALSA: usb: Fix a typo in Tascam US-16x08 mixer element

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Takashi Iwai <[email protected]>

commit 617163fc2580da3d489b6c1bacb6312e0e2aac02 upstream.

A mixer element created in a quirk for Tascam US-16x08 contains a
typo: it should be "EQ MidLow Q" instead of "EQ MidQLow Q".

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195875
Fixes: d2bb390a2081 ("ALSA: usb-audio: Tascam US-16x08 DSP mixer quirk")
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
sound/usb/mixer_us16x08.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/sound/usb/mixer_us16x08.c
+++ b/sound/usb/mixer_us16x08.c
@@ -1135,7 +1135,7 @@ static const struct snd_us16x08_control_
.control_id = SND_US16X08_ID_EQLOWMIDWIDTH,
.type = USB_MIXER_U8,
.num_channels = 16,
- .name = "EQ MidQLow Q",
+ .name = "EQ MidLow Q",
},
{ /* EQ mid high gain */
.kcontrol_new = &snd_us16x08_eq_gain_ctl,


2017-06-05 16:44:27

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 076/115] ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Alexander Tsoy <[email protected]>

commit 1fc2e41f7af4572b07190f9dec28396b418e9a36 upstream.

This model is actually called 92XXM2-8 in Windows driver. But since pin
configs for M22 and M28 are identical, just reuse M22 quirk.

Fixes external microphone (tested) and probably docking station ports
(not tested).

Signed-off-by: Alexander Tsoy <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
sound/pci/hda/patch_sigmatel.c | 2 ++
1 file changed, 2 insertions(+)

--- a/sound/pci/hda/patch_sigmatel.c
+++ b/sound/pci/hda/patch_sigmatel.c
@@ -1559,6 +1559,8 @@ static const struct snd_pci_quirk stac92
"Dell Inspiron 1501", STAC_9200_DELL_M26),
SND_PCI_QUIRK(PCI_VENDOR_ID_DELL, 0x01f6,
"unknown Dell", STAC_9200_DELL_M26),
+ SND_PCI_QUIRK(PCI_VENDOR_ID_DELL, 0x0201,
+ "Dell Latitude D430", STAC_9200_DELL_M22),
/* Panasonic */
SND_PCI_QUIRK(0x10f7, 0x8338, "Panasonic CF-74", STAC_9200_PANASONIC),
/* Gateway machines needs EAPD to be set on resume */


2017-06-05 16:45:05

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 054/115] x86/MCE: Export memory_error()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Borislav Petkov <[email protected]>

commit 2d1f406139ec20320bf38bcd2461aa8e358084b5 upstream.

Export the function which checks whether an MCE is a memory error to
other users so that we can reuse the logic. Drop the boot_cpu_data use,
while at it, as mce.cpuvendor already has the CPU vendor in there.

Integrate a piece from a patch from Vishal Verma
<[email protected]> to export it for modules (nfit).

The main reason we're exporting it is that the nfit handler
nfit_handle_mce() needs to detect a memory error properly before doing
its recovery actions.

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Verma <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/include/asm/mce.h | 1 +
arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++------
2 files changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -264,6 +264,7 @@ static inline int umc_normaddr_to_sysadd
#endif

int mce_available(struct cpuinfo_x86 *c);
+bool mce_is_memory_error(struct mce *m);

DECLARE_PER_CPU(unsigned, mce_exception_count);
DECLARE_PER_CPU(unsigned, mce_poll_count);
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -643,16 +643,14 @@ static void mce_read_aux(struct mce *m,
}
}

-static bool memory_error(struct mce *m)
+bool mce_is_memory_error(struct mce *m)
{
- struct cpuinfo_x86 *c = &boot_cpu_data;
-
- if (c->x86_vendor == X86_VENDOR_AMD) {
+ if (m->cpuvendor == X86_VENDOR_AMD) {
/* ErrCodeExt[20:16] */
u8 xec = (m->status >> 16) & 0x1f;

return (xec == 0x0 || xec == 0x8);
- } else if (c->x86_vendor == X86_VENDOR_INTEL) {
+ } else if (m->cpuvendor == X86_VENDOR_INTEL) {
/*
* Intel SDM Volume 3B - 15.9.2 Compound Error Codes
*
@@ -673,6 +671,7 @@ static bool memory_error(struct mce *m)

return false;
}
+EXPORT_SYMBOL_GPL(mce_is_memory_error);

DEFINE_PER_CPU(unsigned, mce_poll_count);

@@ -734,7 +733,7 @@ bool machine_check_poll(enum mcp_flags f

severity = mce_severity(&m, mca_cfg.tolerant, NULL, false);

- if (severity == MCE_DEFERRED_SEVERITY && memory_error(&m))
+ if (severity == MCE_DEFERRED_SEVERITY && mce_is_memory_error(&m))
if (m.status & MCI_STATUS_ADDRV)
m.severity = severity;



2017-06-05 16:45:29

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 053/115] Revert "ACPI / button: Remove lid_init_state=method mode"

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Lv Zheng <[email protected]>

commit f369fdf4f661322b73f3307e9f3cd55fb3a20123 upstream.

This reverts commit ecb10b694b72ca5ea51b3c90a71ff2a11963425a.

The only expected ACPI control method lid device's usage model is

1. Listen to the lid notification,
2. Evaluate _LID after being notified by BIOS,
3. Suspend the system (if users configure to do so) after seeing "close".

It's not ensured that BIOS will notify OS after boot/resume, and
it's not ensured that BIOS will always generate "open" event upon
opening the lid.

But there are 2 wrong usage models:

1. When the lid device is responsible for suspend/resume the system,
userspace requires to see "open" event to be paired with "close" after
the system is resumed, or it will suspend the system again.

2. When an external monitor connects to the laptop attached docks,
userspace requires to see "close" event after the system is resumed so
that it can determine whether the internal display should remain dark
and the external display should be lit on.

After we made default kernel behavior to be suitable for usage model 1,
users of usage model 2 start to report regressions for such behavior
change.

Reversion of button.lid_init_state=method doesn't actually reverts to old
default behavior as doing so can enter a regression loop, but facilitates
users to work the reported regressions around with
button.lid_init_state=method.

Fixes: ecb10b694b72 (ACPI / button: Remove lid_init_state=method mode)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=195455
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1430259
Tested-by: Steffen Weber <[email protected]>
Tested-by: Julian Wiedmann <[email protected]>
Reported-by: Joachim Frieben <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
Documentation/acpi/acpi-lid.txt | 16 ++++++++++++----
drivers/acpi/button.c | 9 +++++++++
2 files changed, 21 insertions(+), 4 deletions(-)

--- a/Documentation/acpi/acpi-lid.txt
+++ b/Documentation/acpi/acpi-lid.txt
@@ -59,20 +59,28 @@ button driver uses the following 3 modes
If the userspace hasn't been prepared to ignore the unreliable "opened"
events and the unreliable initial state notification, Linux users can use
the following kernel parameters to handle the possible issues:
-A. button.lid_init_state=open:
+A. button.lid_init_state=method:
+ When this option is specified, the ACPI button driver reports the
+ initial lid state using the returning value of the _LID control method
+ and whether the "opened"/"closed" events are paired fully relies on the
+ firmware implementation.
+ This option can be used to fix some platforms where the returning value
+ of the _LID control method is reliable but the initial lid state
+ notification is missing.
+ This option is the default behavior during the period the userspace
+ isn't ready to handle the buggy AML tables.
+B. button.lid_init_state=open:
When this option is specified, the ACPI button driver always reports the
initial lid state as "opened" and whether the "opened"/"closed" events
are paired fully relies on the firmware implementation.
This may fix some platforms where the returning value of the _LID
control method is not reliable and the initial lid state notification is
missing.
- This option is the default behavior during the period the userspace
- isn't ready to handle the buggy AML tables.

If the userspace has been prepared to ignore the unreliable "opened" events
and the unreliable initial state notification, Linux users should always
use the following kernel parameter:
-B. button.lid_init_state=ignore:
+C. button.lid_init_state=ignore:
When this option is specified, the ACPI button driver never reports the
initial lid state and there is a compensation mechanism implemented to
ensure that the reliable "closed" notifications can always be delievered
--- a/drivers/acpi/button.c
+++ b/drivers/acpi/button.c
@@ -57,6 +57,7 @@

#define ACPI_BUTTON_LID_INIT_IGNORE 0x00
#define ACPI_BUTTON_LID_INIT_OPEN 0x01
+#define ACPI_BUTTON_LID_INIT_METHOD 0x02

#define _COMPONENT ACPI_BUTTON_COMPONENT
ACPI_MODULE_NAME("button");
@@ -376,6 +377,9 @@ static void acpi_lid_initialize_state(st
case ACPI_BUTTON_LID_INIT_OPEN:
(void)acpi_lid_notify_state(device, 1);
break;
+ case ACPI_BUTTON_LID_INIT_METHOD:
+ (void)acpi_lid_update_state(device);
+ break;
case ACPI_BUTTON_LID_INIT_IGNORE:
default:
break;
@@ -559,6 +563,9 @@ static int param_set_lid_init_state(cons
if (!strncmp(val, "open", sizeof("open") - 1)) {
lid_init_state = ACPI_BUTTON_LID_INIT_OPEN;
pr_info("Notify initial lid state as open\n");
+ } else if (!strncmp(val, "method", sizeof("method") - 1)) {
+ lid_init_state = ACPI_BUTTON_LID_INIT_METHOD;
+ pr_info("Notify initial lid state with _LID return value\n");
} else if (!strncmp(val, "ignore", sizeof("ignore") - 1)) {
lid_init_state = ACPI_BUTTON_LID_INIT_IGNORE;
pr_info("Do not notify initial lid state\n");
@@ -572,6 +579,8 @@ static int param_get_lid_init_state(char
switch (lid_init_state) {
case ACPI_BUTTON_LID_INIT_OPEN:
return sprintf(buffer, "open");
+ case ACPI_BUTTON_LID_INIT_METHOD:
+ return sprintf(buffer, "method");
case ACPI_BUTTON_LID_INIT_IGNORE:
return sprintf(buffer, "ignore");
default:


2017-06-05 16:45:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 049/115] serdev: fix tty-port client deregistration

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Johan Hovold <[email protected]>

commit aee5da7838787f8ed47f825dbe09e2812acdf97b upstream.

The port client data must be set when registering the serdev controller
or client deregistration will fail (and the serdev devices are left
registered and allocated) if the port was never opened in between.

Make sure to clear the port client data on any probe errors to avoid a
use-after-free when the client is later deregistered unconditionally
(e.g. in a tty-port deregistration helper).

Also move port client operation initialisation to registration. Note
that the client ops must be restored on failed probe.

Fixes: bed35c6dfa6a ("serdev: add a tty port controller driver")
Signed-off-by: Johan Hovold <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/tty/serdev/serdev-ttyport.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

--- a/drivers/tty/serdev/serdev-ttyport.c
+++ b/drivers/tty/serdev/serdev-ttyport.c
@@ -101,9 +101,6 @@ static int ttyport_open(struct serdev_co
return PTR_ERR(tty);
serport->tty = tty;

- serport->port->client_ops = &client_ops;
- serport->port->client_data = ctrl;
-
if (tty->ops->open)
tty->ops->open(serport->tty, NULL);
else
@@ -181,6 +178,7 @@ struct device *serdev_tty_port_register(
struct device *parent,
struct tty_driver *drv, int idx)
{
+ const struct tty_port_client_operations *old_ops;
struct serdev_controller *ctrl;
struct serport *serport;
int ret;
@@ -199,15 +197,22 @@ struct device *serdev_tty_port_register(

ctrl->ops = &ctrl_ops;

+ old_ops = port->client_ops;
+ port->client_ops = &client_ops;
+ port->client_data = ctrl;
+
ret = serdev_controller_add(ctrl);
if (ret)
- goto err_controller_put;
+ goto err_reset_data;

dev_info(&ctrl->dev, "tty port %s%d registered\n", drv->name, idx);
return &ctrl->dev;

-err_controller_put:
+err_reset_data:
+ port->client_data = NULL;
+ port->client_ops = old_ops;
serdev_controller_put(ctrl);
+
return ERR_PTR(ret);
}



2017-06-05 16:46:16

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 057/115] ACPICA: Tables: Fix regression introduced by a too early mechanism enabling

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Lv Zheng <[email protected]>

commit 2ea65321b83539afc1d45c1bea39c55ab42af62b upstream.

In the Linux kernel, acpi_get_table() "clones" haven't been fully
balanced by acpi_put_table() invocations. In upstream ACPICA, due to
the design change, there are also unbalanced acpi_get_table_by_index()
invocations requiring special care.

acpi_get_table() reference counting mismatches may occor due to that
and printing error messages related to them is not useful at this
point. The strict balanced validation count check should only be
enabled after confirming that all invocations are safe and aligned
with their designed purposes.

Thus this patch removes the error value returned by acpi_tb_get_table()
in that case along with the accompanying error message to fix the
issue.

Fixes: 174cc7187e6f (ACPICA: Tables: Back port acpi_get_table_with_size() and early_acpi_os_unmap_memory() from Linux kernel)
Reported-by: Anush Seetharaman <[email protected]>
Reported-by: Dan Williams <[email protected]>
Signed-off-by: Lv Zheng <[email protected]>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/acpica/tbutils.c | 4 ----
1 file changed, 4 deletions(-)

--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -418,11 +418,7 @@ acpi_tb_get_table(struct acpi_table_desc

table_desc->validation_count++;
if (table_desc->validation_count == 0) {
- ACPI_ERROR((AE_INFO,
- "Table %p, Validation count is zero after increment\n",
- table_desc));
table_desc->validation_count--;
- return_ACPI_STATUS(AE_LIMIT);
}

*out_table = table_desc->pointer;


2017-06-05 16:30:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 048/115] Revert "tty_port: register tty ports with serdev bus"

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Johan Hovold <[email protected]>

commit d3ba126a226a6b6da021ebfea444a2a807cde945 upstream.

This reverts commit 8ee3fde047589dc9c201251f07d0ca1dc776feca.

The new serdev bus hooked into the tty layer in
tty_port_register_device() by registering a serdev controller instead of
a tty device whenever a serdev client is present, and by deregistering
the controller in the tty-port destructor. This is broken in several
ways:

Firstly, it leads to a NULL-pointer dereference whenever a tty driver
later deregisters its devices as no corresponding character device will
exist.

Secondly, far from every tty driver uses tty-port refcounting (e.g.
serial core) so the serdev devices might never be deregistered or
deallocated.

Thirdly, deregistering at tty-port destruction is too late as the
underlying device and structures may be long gone by then. A port is not
released before an open tty device is closed, something which a
registered serdev client can prevent from ever happening. A driver
callback while the device is gone typically also leads to crashes.

Many tty drivers even keep their ports around until the driver is
unloaded (e.g. serial core), something which even if a late callback
never happens, leads to leaks if a device is unbound from its driver and
is later rebound.

The right solution here is to add a new tty_port_unregister_device()
helper and to never call tty_device_unregister() whenever the port has
been claimed by serdev, but since this requires modifying just about
every tty driver (and multiple subsystems) it will need to be done
incrementally.

Reverting the offending patch is the first step in fixing the broken
lifetime assumptions. A follow-up patch will add a new pair of
tty-device registration helpers, which a vetted tty driver can use to
support serdev (initially serial core). When every tty driver uses the
serdev helpers (at least for deregistration), we can add serdev
registration to tty_port_register_device() again.

Note that this also fixes another issue with serdev, which currently
allocates and registers a serdev controller for every tty device
registered using tty_port_device_register() only to immediately
deregister and deallocate it when the corresponding OF node or serdev
child node is missing. This should be addressed before enabling serdev
for hot-pluggable buses.

Signed-off-by: Johan Hovold <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/tty/tty_port.c | 12 ------------
1 file changed, 12 deletions(-)

--- a/drivers/tty/tty_port.c
+++ b/drivers/tty/tty_port.c
@@ -16,7 +16,6 @@
#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/module.h>
-#include <linux/serdev.h>

static int tty_port_default_receive_buf(struct tty_port *port,
const unsigned char *p,
@@ -129,15 +128,7 @@ struct device *tty_port_register_device_
struct device *device, void *drvdata,
const struct attribute_group **attr_grp)
{
- struct device *dev;
-
tty_port_link_device(port, driver, index);
-
- dev = serdev_tty_port_register(port, device, driver, index);
- if (PTR_ERR(dev) != -ENODEV)
- /* Skip creating cdev if we registered a serdev device */
- return dev;
-
return tty_register_device_attr(driver, index, device, drvdata,
attr_grp);
}
@@ -189,9 +180,6 @@ static void tty_port_destructor(struct k
/* check if last port ref was dropped before tty release */
if (WARN_ON(port->itty))
return;
-
- serdev_tty_port_unregister(port);
-
if (port->xmit_buf)
free_page((unsigned long)port->xmit_buf);
tty_port_destroy(port);


2017-06-05 16:46:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 061/115] iscsi-target: Always wait for kthread_should_stop() before kthread exit

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jiang Yi <[email protected]>

commit 5e0cf5e6c43b9e19fc0284f69e5cd2b4a47523b0 upstream.

There are three timing problems in the kthread usages of iscsi_target_mod:

- np_thread of struct iscsi_np
- rx_thread and tx_thread of struct iscsi_conn

In iscsit_close_connection(), it calls

send_sig(SIGINT, conn->tx_thread, 1);
kthread_stop(conn->tx_thread);

In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive
SIGINT the kthread will exit without checking the return value of
kthread_should_stop().

So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...)
and kthread_stop(...), the kthread_stop() will try to stop an already
stopped kthread.

This is invalid according to the documentation of kthread_stop().

(Fix -ECONNRESET logout handling in iscsi_target_tx_thread and
early iscsi_target_rx_thread failure case - nab)

Signed-off-by: Jiang Yi <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/target/iscsi/iscsi_target.c | 30 ++++++++++++++++++++++++------
drivers/target/iscsi/iscsi_target_erl0.c | 6 +++++-
drivers/target/iscsi/iscsi_target_erl0.h | 2 +-
drivers/target/iscsi/iscsi_target_login.c | 4 ++++
4 files changed, 34 insertions(+), 8 deletions(-)

--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -3810,6 +3810,8 @@ int iscsi_target_tx_thread(void *arg)
{
int ret = 0;
struct iscsi_conn *conn = arg;
+ bool conn_freed = false;
+
/*
* Allow ourselves to be interrupted by SIGINT so that a
* connection recovery / failure event can be triggered externally.
@@ -3835,12 +3837,14 @@ get_immediate:
goto transport_err;

ret = iscsit_handle_response_queue(conn);
- if (ret == 1)
+ if (ret == 1) {
goto get_immediate;
- else if (ret == -ECONNRESET)
+ } else if (ret == -ECONNRESET) {
+ conn_freed = true;
goto out;
- else if (ret < 0)
+ } else if (ret < 0) {
goto transport_err;
+ }
}

transport_err:
@@ -3850,8 +3854,13 @@ transport_err:
* responsible for cleaning up the early connection failure.
*/
if (conn->conn_state != TARG_CONN_STATE_IN_LOGIN)
- iscsit_take_action_for_connection_exit(conn);
+ iscsit_take_action_for_connection_exit(conn, &conn_freed);
out:
+ if (!conn_freed) {
+ while (!kthread_should_stop()) {
+ msleep(100);
+ }
+ }
return 0;
}

@@ -4024,6 +4033,7 @@ int iscsi_target_rx_thread(void *arg)
{
int rc;
struct iscsi_conn *conn = arg;
+ bool conn_freed = false;

/*
* Allow ourselves to be interrupted by SIGINT so that a
@@ -4036,7 +4046,7 @@ int iscsi_target_rx_thread(void *arg)
*/
rc = wait_for_completion_interruptible(&conn->rx_login_comp);
if (rc < 0 || iscsi_target_check_conn_state(conn))
- return 0;
+ goto out;

if (!conn->conn_transport->iscsit_get_rx_pdu)
return 0;
@@ -4045,7 +4055,15 @@ int iscsi_target_rx_thread(void *arg)

if (!signal_pending(current))
atomic_set(&conn->transport_failed, 1);
- iscsit_take_action_for_connection_exit(conn);
+ iscsit_take_action_for_connection_exit(conn, &conn_freed);
+
+out:
+ if (!conn_freed) {
+ while (!kthread_should_stop()) {
+ msleep(100);
+ }
+ }
+
return 0;
}

--- a/drivers/target/iscsi/iscsi_target_erl0.c
+++ b/drivers/target/iscsi/iscsi_target_erl0.c
@@ -930,8 +930,10 @@ static void iscsit_handle_connection_cle
}
}

-void iscsit_take_action_for_connection_exit(struct iscsi_conn *conn)
+void iscsit_take_action_for_connection_exit(struct iscsi_conn *conn, bool *conn_freed)
{
+ *conn_freed = false;
+
spin_lock_bh(&conn->state_lock);
if (atomic_read(&conn->connection_exit)) {
spin_unlock_bh(&conn->state_lock);
@@ -942,6 +944,7 @@ void iscsit_take_action_for_connection_e
if (conn->conn_state == TARG_CONN_STATE_IN_LOGOUT) {
spin_unlock_bh(&conn->state_lock);
iscsit_close_connection(conn);
+ *conn_freed = true;
return;
}

@@ -955,4 +958,5 @@ void iscsit_take_action_for_connection_e
spin_unlock_bh(&conn->state_lock);

iscsit_handle_connection_cleanup(conn);
+ *conn_freed = true;
}
--- a/drivers/target/iscsi/iscsi_target_erl0.h
+++ b/drivers/target/iscsi/iscsi_target_erl0.h
@@ -15,6 +15,6 @@ extern int iscsit_stop_time2retain_timer
extern void iscsit_connection_reinstatement_rcfr(struct iscsi_conn *);
extern void iscsit_cause_connection_reinstatement(struct iscsi_conn *, int);
extern void iscsit_fall_back_to_erl0(struct iscsi_session *);
-extern void iscsit_take_action_for_connection_exit(struct iscsi_conn *);
+extern void iscsit_take_action_for_connection_exit(struct iscsi_conn *, bool *);

#endif /*** ISCSI_TARGET_ERL0_H ***/
--- a/drivers/target/iscsi/iscsi_target_login.c
+++ b/drivers/target/iscsi/iscsi_target_login.c
@@ -1464,5 +1464,9 @@ int iscsi_target_login_thread(void *arg)
break;
}

+ while (!kthread_should_stop()) {
+ msleep(100);
+ }
+
return 0;
}


2017-06-05 16:30:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 060/115] scsi: zero per-cmd private driver data for each MQ I/O

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Long Li <[email protected]>

commit 1bad6c4a57efda0d5f5bf8a2403b21b1ed24875c upstream.

In lower layer driver's (LLD) scsi_host_template, the driver may
optionally ask SCSI to allocate its private driver memory for each
command, by specifying cmd_size. This memory is allocated at the end of
scsi_cmnd by SCSI. Later when SCSI queues a command, the LLD can use
scsi_cmd_priv to get to its private data.

Some LLD, e.g. hv_storvsc, doesn't clear its private data before use. In
this case, the LLD may get to stale or uninitialized data in its private
driver memory. This may result in unexpected driver and hardware
behavior.

Fix this problem by also zeroing the private driver memory before
passing them to LLD.

Signed-off-by: Long Li <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: KY Srinivasan <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/scsi/scsi_lib.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1850,7 +1850,7 @@ static int scsi_mq_prep_fn(struct reques

/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
- sizeof(*cmd) - sizeof(cmd->req));
+ sizeof(*cmd) - sizeof(cmd->req) + shost->hostt->cmd_size);

req->special = cmd;



2017-06-05 16:47:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 038/115] ipv4: add reference counting to metrics

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <[email protected]>


[ Upstream commit 3fb07daff8e99243366a081e5129560734de4ada ]

Andrey Konovalov reported crashes in ipv4_mtu()

I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :

1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &

2) At the same time run following loop :
while :
do
ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done

Cong Wang attempted to add back rt->fi in commit
82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.

Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)

I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.

Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Andrey Konovalov <[email protected]>
Reviewed-by: Julian Anastasov <[email protected]>
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
include/net/dst.h | 8 +++++++-
include/net/ip_fib.h | 10 +++++-----
net/core/dst.c | 23 ++++++++++++++---------
net/ipv4/fib_semantics.c | 17 ++++++++++-------
net/ipv4/route.c | 10 +++++++++-
5 files changed, 45 insertions(+), 23 deletions(-)

--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -107,10 +107,16 @@ struct dst_entry {
};
};

+struct dst_metrics {
+ u32 metrics[RTAX_MAX];
+ atomic_t refcnt;
+};
+extern const struct dst_metrics dst_default_metrics;
+
u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old);
-extern const u32 dst_default_metrics[];

#define DST_METRICS_READ_ONLY 0x1UL
+#define DST_METRICS_REFCOUNTED 0x2UL
#define DST_METRICS_FLAGS 0x3UL
#define __DST_METRICS_PTR(Y) \
((u32 *)((Y) & ~DST_METRICS_FLAGS))
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -114,11 +114,11 @@ struct fib_info {
__be32 fib_prefsrc;
u32 fib_tb_id;
u32 fib_priority;
- u32 *fib_metrics;
-#define fib_mtu fib_metrics[RTAX_MTU-1]
-#define fib_window fib_metrics[RTAX_WINDOW-1]
-#define fib_rtt fib_metrics[RTAX_RTT-1]
-#define fib_advmss fib_metrics[RTAX_ADVMSS-1]
+ struct dst_metrics *fib_metrics;
+#define fib_mtu fib_metrics->metrics[RTAX_MTU-1]
+#define fib_window fib_metrics->metrics[RTAX_WINDOW-1]
+#define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
+#define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
int fib_nhs;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
int fib_weight;
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -151,13 +151,13 @@ int dst_discard_out(struct net *net, str
}
EXPORT_SYMBOL(dst_discard_out);

-const u32 dst_default_metrics[RTAX_MAX + 1] = {
+const struct dst_metrics dst_default_metrics = {
/* This initializer is needed to force linker to place this variable
* into const section. Otherwise it might end into bss section.
* We really want to avoid false sharing on this variable, and catch
* any writes on it.
*/
- [RTAX_MAX] = 0xdeadbeef,
+ .refcnt = ATOMIC_INIT(1),
};

void dst_init(struct dst_entry *dst, struct dst_ops *ops,
@@ -169,7 +169,7 @@ void dst_init(struct dst_entry *dst, str
if (dev)
dev_hold(dev);
dst->ops = ops;
- dst_init_metrics(dst, dst_default_metrics, true);
+ dst_init_metrics(dst, dst_default_metrics.metrics, true);
dst->expires = 0UL;
dst->path = dst;
dst->from = NULL;
@@ -314,25 +314,30 @@ EXPORT_SYMBOL(dst_release);

u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
{
- u32 *p = kmalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+ struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);

if (p) {
- u32 *old_p = __DST_METRICS_PTR(old);
+ struct dst_metrics *old_p = (struct dst_metrics *)__DST_METRICS_PTR(old);
unsigned long prev, new;

- memcpy(p, old_p, sizeof(u32) * RTAX_MAX);
+ atomic_set(&p->refcnt, 1);
+ memcpy(p->metrics, old_p->metrics, sizeof(p->metrics));

new = (unsigned long) p;
prev = cmpxchg(&dst->_metrics, old, new);

if (prev != old) {
kfree(p);
- p = __DST_METRICS_PTR(prev);
+ p = (struct dst_metrics *)__DST_METRICS_PTR(prev);
if (prev & DST_METRICS_READ_ONLY)
p = NULL;
+ } else if (prev & DST_METRICS_REFCOUNTED) {
+ if (atomic_dec_and_test(&old_p->refcnt))
+ kfree(old_p);
}
}
- return p;
+ BUILD_BUG_ON(offsetof(struct dst_metrics, metrics) != 0);
+ return (u32 *)p;
}
EXPORT_SYMBOL(dst_cow_metrics_generic);

@@ -341,7 +346,7 @@ void __dst_destroy_metrics_generic(struc
{
unsigned long prev, new;

- new = ((unsigned long) dst_default_metrics) | DST_METRICS_READ_ONLY;
+ new = ((unsigned long) &dst_default_metrics) | DST_METRICS_READ_ONLY;
prev = cmpxchg(&dst->_metrics, old, new);
if (prev == old)
kfree(__DST_METRICS_PTR(old));
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -204,6 +204,7 @@ static void rt_fibinfo_free_cpus(struct
static void free_fib_info_rcu(struct rcu_head *head)
{
struct fib_info *fi = container_of(head, struct fib_info, rcu);
+ struct dst_metrics *m;

change_nexthops(fi) {
if (nexthop_nh->nh_dev)
@@ -214,8 +215,9 @@ static void free_fib_info_rcu(struct rcu
rt_fibinfo_free(&nexthop_nh->nh_rth_input);
} endfor_nexthops(fi);

- if (fi->fib_metrics != (u32 *) dst_default_metrics)
- kfree(fi->fib_metrics);
+ m = fi->fib_metrics;
+ if (m != &dst_default_metrics && atomic_dec_and_test(&m->refcnt))
+ kfree(m);
kfree(fi);
}

@@ -975,11 +977,11 @@ fib_convert_metrics(struct fib_info *fi,
val = 255;
if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
return -EINVAL;
- fi->fib_metrics[type - 1] = val;
+ fi->fib_metrics->metrics[type - 1] = val;
}

if (ecn_ca)
- fi->fib_metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
+ fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;

return 0;
}
@@ -1037,11 +1039,12 @@ struct fib_info *fib_create_info(struct
goto failure;
fib_info_cnt++;
if (cfg->fc_mx) {
- fi->fib_metrics = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
+ fi->fib_metrics = kzalloc(sizeof(*fi->fib_metrics), GFP_KERNEL);
if (!fi->fib_metrics)
goto failure;
+ atomic_set(&fi->fib_metrics->refcnt, 1);
} else
- fi->fib_metrics = (u32 *) dst_default_metrics;
+ fi->fib_metrics = (struct dst_metrics *)&dst_default_metrics;

fi->fib_net = net;
fi->fib_protocol = cfg->fc_protocol;
@@ -1242,7 +1245,7 @@ int fib_dump_info(struct sk_buff *skb, u
if (fi->fib_priority &&
nla_put_u32(skb, RTA_PRIORITY, fi->fib_priority))
goto nla_put_failure;
- if (rtnetlink_put_metrics(skb, fi->fib_metrics) < 0)
+ if (rtnetlink_put_metrics(skb, fi->fib_metrics->metrics) < 0)
goto nla_put_failure;

if (fi->fib_prefsrc &&
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1389,8 +1389,12 @@ static void rt_add_uncached_list(struct

static void ipv4_dst_destroy(struct dst_entry *dst)
{
+ struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
struct rtable *rt = (struct rtable *) dst;

+ if (p != &dst_default_metrics && atomic_dec_and_test(&p->refcnt))
+ kfree(p);
+
if (!list_empty(&rt->rt_uncached)) {
struct uncached_list *ul = rt->rt_uncached_list;

@@ -1442,7 +1446,11 @@ static void rt_set_nexthop(struct rtable
rt->rt_gateway = nh->nh_gw;
rt->rt_uses_gateway = 1;
}
- dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+ dst_init_metrics(&rt->dst, fi->fib_metrics->metrics, true);
+ if (fi->fib_metrics != &dst_default_metrics) {
+ rt->dst._metrics |= DST_METRICS_REFCOUNTED;
+ atomic_inc(&fi->fib_metrics->refcnt);
+ }
#ifdef CONFIG_IP_ROUTE_CLASSID
rt->dst.tclassid = nh->nh_tclassid;
#endif


2017-06-05 16:30:18

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 041/115] bpf: adjust verifier heuristics

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Daniel Borkmann <[email protected]>


[ Upstream commit 3c2ce60bdd3d57051bf85615deec04a694473840 ]

Current limits with regards to processing program paths do not
really reflect today's needs anymore due to programs becoming
more complex and verifier smarter, keeping track of more data
such as const ALU operations, alignment tracking, spilling of
PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for
smarter matching of what LLVM generates.

This also comes with the side-effect that we result in fewer
opportunities to prune search states and thus often need to do
more work to prove safety than in the past due to different
register states and stack layout where we mismatch. Generally,
it's quite hard to determine what caused a sudden increase in
complexity, it could be caused by something as trivial as a
single branch somewhere at the beginning of the program where
LLVM assigned a stack slot that is marked differently throughout
other branches and thus causing a mismatch, where verifier
then needs to prove safety for the whole rest of the program.
Subsequently, programs with even less than half the insn size
limit can get rejected. We noticed that while some programs
load fine under pre 4.11, they get rejected due to hitting
limits on more recent kernels. We saw that in the vast majority
of cases (90+%) pruning failed due to register mismatches. In
case of stack mismatches, majority of cases failed due to
different stack slot types (invalid, spill, misc) rather than
differences in spilled registers.

This patch makes pruning more aggressive by also adding markers
that sit at conditional jumps as well. Currently, we only mark
jump targets for pruning. For example in direct packet access,
these are usually error paths where we bail out. We found that
adding these markers, it can reduce number of processed insns
by up to 30%. Another option is to ignore reg->id in probing
PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning
slightly as well by up to 7% observed complexity reduction as
stand-alone. Meaning, if a previous path with register type
PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then
in the current state a PTR_TO_MAP_VALUE_OR_NULL register for
the same map X must be safe as well. Last but not least the
patch also adds a scheduling point and bumps the current limit
for instructions to be processed to a more adequate value.

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
kernel/bpf/verifier.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -140,7 +140,7 @@ struct bpf_verifier_stack_elem {
struct bpf_verifier_stack_elem *next;
};

-#define BPF_COMPLEXITY_LIMIT_INSNS 65536
+#define BPF_COMPLEXITY_LIMIT_INSNS 98304
#define BPF_COMPLEXITY_LIMIT_STACK 1024

struct bpf_call_arg_meta {
@@ -2546,6 +2546,7 @@ peek_stack:
env->explored_states[t + 1] = STATE_LIST_MARK;
} else {
/* conditional jump with two edges */
+ env->explored_states[t] = STATE_LIST_MARK;
ret = push_insn(t, t + 1, FALLTHROUGH, env);
if (ret == 1)
goto peek_stack;
@@ -2704,6 +2705,12 @@ static bool states_equal(struct bpf_veri
rcur->type != NOT_INIT))
continue;

+ /* Don't care about the reg->id in this case. */
+ if (rold->type == PTR_TO_MAP_VALUE_OR_NULL &&
+ rcur->type == PTR_TO_MAP_VALUE_OR_NULL &&
+ rold->map_ptr == rcur->map_ptr)
+ continue;
+
if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
compare_ptrs_to_packet(rold, rcur))
continue;
@@ -2838,6 +2845,9 @@ static int do_check(struct bpf_verifier_
goto process_bpf_exit;
}

+ if (need_resched())
+ cond_resched();
+
if (log_level && do_print_state) {
verbose("\nfrom %d to %d:", prev_insn_idx, insn_idx);
print_verifier_state(&env->cur_state);


2017-06-05 16:47:37

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 058/115] Revert "ACPI / button: Change default behavior to lid_init_state=open"

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Benjamin Tissoires <[email protected]>

commit 878d8db039daac0938238e9a40a5bd6e50ee3c9b upstream.

Revert commit 77e9a4aa9de1 (ACPI / button: Change default behavior to
lid_init_state=open) which changed the kernel's behavior on laptops
that boot with closed lids and expect the lid switch state to be
reported accurately by the kernel.

If you boot or resume your laptop with the lid closed on a docking
station while using an external monitor connected to it, both internal
and external displays will light on, while only the external should.

There is a design choice in gdm to only provide the greeter on the
internal display when lit on, so users only see a gray area on the
external monitor. Also, the cursor will not show up as it's by
default on the internal display too.

To "fix" that, users have to open the laptop once and close it once
again to sync the state of the switch with the hardware state.

Even if the "method" operation mode implementation can be buggy on
some platforms, the "open" choice is worse. It breaks docking
stations basically and there is no way to have a user-space hwdb to
fix that.

On the contrary, it's rather easy in user-space to have a hwdb
with the problematic platforms. Then, libinput (1.7.0+) can fix
the state of the lid switch for us: you need to set the udev
property LIBINPUT_ATTR_LID_SWITCH_RELIABILITY to 'write_open'.

When libinput detects internal keyboard events, it will overwrite the
state of the switch to open, making it reliable again. Given that
logind only checks the lid switch value after a timeout, we can
assume the user will use the internal keyboard before this timeout
expires.

For example, such a hwdb entry is:

libinput:name:*Lid Switch*:dmi:*svnMicrosoftCorporation:pnSurface3:*
LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=write_open

Link: https://bugzilla.gnome.org/show_bug.cgi?id=782380
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/button.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/acpi/button.c
+++ b/drivers/acpi/button.c
@@ -113,7 +113,7 @@ struct acpi_button {

static BLOCKING_NOTIFIER_HEAD(acpi_lid_notifier);
static struct acpi_device *lid_device;
-static u8 lid_init_state = ACPI_BUTTON_LID_INIT_OPEN;
+static u8 lid_init_state = ACPI_BUTTON_LID_INIT_METHOD;

static unsigned long lid_report_interval __read_mostly = 500;
module_param(lid_report_interval, ulong, 0644);


2017-06-05 16:48:57

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 035/115] tcp: avoid fastopen API to be used on AF_UNSPEC

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Wei Wang <[email protected]>


[ Upstream commit ba615f675281d76fd19aa03558777f81fb6b6084 ]

Fastopen API should be used to perform fastopen operations on the TCP
socket. It does not make sense to use fastopen API to perform disconnect
by calling it with AF_UNSPEC. The fastopen data path is also prone to
race conditions and bugs when using with AF_UNSPEC.

One issue reported and analyzed by Vegard Nossum is as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Thread A: Thread B:
------------------------------------------------------------------------
sendto()
- tcp_sendmsg()
- sk_stream_memory_free() = 0
- goto wait_for_sndbuf
- sk_stream_wait_memory()
- sk_wait_event() // sleep
| sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
| - tcp_sendmsg()
| - tcp_sendmsg_fastopen()
| - __inet_stream_connect()
| - tcp_disconnect() //because of AF_UNSPEC
| - tcp_transmit_skb()// send RST
| - return 0; // no reconnect!
| - sk_stream_wait_connect()
| - sock_error()
| - xchg(&sk->sk_err, 0)
| - return -ECONNRESET
- ... // wake up, see sk->sk_err == 0
- skb_entail() on TCP_CLOSE socket

If the connection is reopened then we will send a brand new SYN packet
after thread A has already queued a buffer. At this point I think the
socket internal state (sequence numbers etc.) becomes messed up.

When the new connection is closed, the FIN-ACK is rejected because the
sequence number is outside the window. The other side tries to
retransmit,
but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
and return EOPNOTSUPP to user if such case happens.

Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Reported-by: Vegard Nossum <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/tcp.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1084,9 +1084,12 @@ static int tcp_sendmsg_fastopen(struct s
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_sock *inet = inet_sk(sk);
+ struct sockaddr *uaddr = msg->msg_name;
int err, flags;

- if (!(sysctl_tcp_fastopen & TFO_CLIENT_ENABLE))
+ if (!(sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) ||
+ (uaddr && msg->msg_namelen >= sizeof(uaddr->sa_family) &&
+ uaddr->sa_family == AF_UNSPEC))
return -EOPNOTSUPP;
if (tp->fastopen_req)
return -EALREADY; /* Another Fast Open is in progress */
@@ -1108,7 +1111,7 @@ static int tcp_sendmsg_fastopen(struct s
}
}
flags = (msg->msg_flags & MSG_DONTWAIT) ? O_NONBLOCK : 0;
- err = __inet_stream_connect(sk->sk_socket, msg->msg_name,
+ err = __inet_stream_connect(sk->sk_socket, uaddr,
msg->msg_namelen, flags, 1);
/* fastopen_req could already be freed in __inet_stream_connect
* if the connection times out or gets rst


2017-06-05 16:30:00

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 036/115] sctp: fix ICMP processing if skb is non-linear

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Davide Caratti <[email protected]>


[ Upstream commit 804ec7ebe8ea003999ca8d1bfc499edc6a9e07df ]

sometimes ICMP replies to INIT chunks are ignored by the client, even if
the encapsulated SCTP headers match an open socket. This happens when the
ICMP packet is carried by a paged skb: use skb_header_pointer() to read
packet contents beyond the SCTP header, so that chunk header and initiate
tag are validated correctly.

v2:
- don't use skb_header_pointer() to read the transport header, since
icmp_socket_deliver() already puts these 8 bytes in the linear area.
- change commit message to make specific reference to INIT chunks.

Signed-off-by: Davide Caratti <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Vlad Yasevich <[email protected]>
Reviewed-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/sctp/input.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -473,15 +473,14 @@ struct sock *sctp_err_lookup(struct net
struct sctp_association **app,
struct sctp_transport **tpp)
{
+ struct sctp_init_chunk *chunkhdr, _chunkhdr;
union sctp_addr saddr;
union sctp_addr daddr;
struct sctp_af *af;
struct sock *sk = NULL;
struct sctp_association *asoc;
struct sctp_transport *transport = NULL;
- struct sctp_init_chunk *chunkhdr;
__u32 vtag = ntohl(sctphdr->vtag);
- int len = skb->len - ((void *)sctphdr - (void *)skb->data);

*app = NULL; *tpp = NULL;

@@ -516,13 +515,16 @@ struct sock *sctp_err_lookup(struct net
* discard the packet.
*/
if (vtag == 0) {
- chunkhdr = (void *)sctphdr + sizeof(struct sctphdr);
- if (len < sizeof(struct sctphdr) + sizeof(sctp_chunkhdr_t)
- + sizeof(__be32) ||
+ /* chunk header + first 4 octects of init header */
+ chunkhdr = skb_header_pointer(skb, skb_transport_offset(skb) +
+ sizeof(struct sctphdr),
+ sizeof(struct sctp_chunkhdr) +
+ sizeof(__be32), &_chunkhdr);
+ if (!chunkhdr ||
chunkhdr->chunk_hdr.type != SCTP_CID_INIT ||
- ntohl(chunkhdr->init_hdr.init_tag) != asoc->c.my_vtag) {
+ ntohl(chunkhdr->init_hdr.init_tag) != asoc->c.my_vtag)
goto out;
- }
+
} else if (vtag != asoc->c.peer_vtag) {
goto out;
}


2017-06-05 16:49:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 034/115] geneve: fix fill_info when using collect_metadata

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Garver <[email protected]>


[ Upstream commit 11387fe4a98f75d1f4cdb3efe3b42b19205c9df5 ]

Since 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") fill_info
does not return UDP_ZERO_CSUM6_RX when using COLLECT_METADATA. This is
because it uses ip_tunnel_info_af() with the device level info, which is
not valid for COLLECT_METADATA.

Fix by checking for the presence of the actual sockets.

Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.")
Signed-off-by: Eric Garver <[email protected]>
Acked-by: Pravin B Shelar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/geneve.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1293,7 +1293,7 @@ static int geneve_fill_info(struct sk_bu
if (nla_put_u32(skb, IFLA_GENEVE_ID, vni))
goto nla_put_failure;

- if (ip_tunnel_info_af(info) == AF_INET) {
+ if (rtnl_dereference(geneve->sock4)) {
if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
info->key.u.ipv4.dst))
goto nla_put_failure;
@@ -1302,8 +1302,10 @@ static int geneve_fill_info(struct sk_bu
!!(info->key.tun_flags & TUNNEL_CSUM)))
goto nla_put_failure;

+ }
+
#if IS_ENABLED(CONFIG_IPV6)
- } else {
+ if (rtnl_dereference(geneve->sock6)) {
if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
&info->key.u.ipv6.dst))
goto nla_put_failure;
@@ -1315,8 +1317,8 @@ static int geneve_fill_info(struct sk_bu
if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
!geneve->use_udp6_rx_checksums))
goto nla_put_failure;
-#endif
}
+#endif

if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) ||
nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) ||


2017-06-05 16:50:11

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 023/115] ipv6: Check ip6_find_1stfragopt() return value properly.

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: "David S. Miller" <[email protected]>


[ Upstream commit 7dd7eb9513bd02184d45f000ab69d78cb1fa1531 ]

Do not use unsigned variables to see if it returns a negative
error or not.

Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Julia Lawall <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv6/ip6_offload.c | 9 ++++-----
net/ipv6/ip6_output.c | 7 +++----
net/ipv6/udp_offload.c | 8 +++++---
3 files changed, 12 insertions(+), 12 deletions(-)

--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -63,7 +63,6 @@ static struct sk_buff *ipv6_gso_segment(
const struct net_offload *ops;
int proto;
struct frag_hdr *fptr;
- unsigned int unfrag_ip6hlen;
unsigned int payload_len;
u8 *prevhdr;
int offset = 0;
@@ -116,10 +115,10 @@ static struct sk_buff *ipv6_gso_segment(
skb->network_header = (u8 *)ipv6h - skb->head;

if (udpfrag) {
- unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
- if (unfrag_ip6hlen < 0)
- return ERR_PTR(unfrag_ip6hlen);
- fptr = (struct frag_hdr *)((u8 *)ipv6h + unfrag_ip6hlen);
+ int err = ip6_find_1stfragopt(skb, &prevhdr);
+ if (err < 0)
+ return ERR_PTR(err);
+ fptr = (struct frag_hdr *)((u8 *)ipv6h + err);
fptr->frag_off = htons(offset);
if (skb->next)
fptr->frag_off |= htons(IP6_MF);
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -597,11 +597,10 @@ int ip6_fragment(struct net *net, struct
int ptr, offset = 0, err = 0;
u8 *prevhdr, nexthdr = 0;

- hlen = ip6_find_1stfragopt(skb, &prevhdr);
- if (hlen < 0) {
- err = hlen;
+ err = ip6_find_1stfragopt(skb, &prevhdr);
+ if (err < 0)
goto fail;
- }
+ hlen = err;
nexthdr = *prevhdr;

mtu = ip6_skb_dst_mtu(skb);
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -29,6 +29,7 @@ static struct sk_buff *udp6_ufo_fragment
u8 frag_hdr_sz = sizeof(struct frag_hdr);
__wsum csum;
int tnl_hlen;
+ int err;

mss = skb_shinfo(skb)->gso_size;
if (unlikely(skb->len <= mss))
@@ -90,9 +91,10 @@ static struct sk_buff *udp6_ufo_fragment
/* Find the unfragmentable header and shift it left by frag_hdr_sz
* bytes to insert fragment header.
*/
- unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
- if (unfrag_ip6hlen < 0)
- return ERR_PTR(unfrag_ip6hlen);
+ err = ip6_find_1stfragopt(skb, &prevhdr);
+ if (err < 0)
+ return ERR_PTR(err);
+ unfrag_ip6hlen = err;
nexthdr = *prevhdr;
*prevhdr = NEXTHDR_FRAGMENT;
unfrag_len = (skb_network_header(skb) - skb_mac_header(skb)) +


2017-06-05 16:29:38

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 024/115] bridge: netlink: check vlan_default_pvid range

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Tobias Jungel <[email protected]>


[ Upstream commit a285860211bf257b0e6d522dac6006794be348af ]

Currently it is allowed to set the default pvid of a bridge to a value
above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and
returns -EINVAL in case the pvid is out of bounds.

Reproduce by calling:

[root@test ~]# ip l a type bridge
[root@test ~]# ip l a type dummy
[root@test ~]# ip l s bridge0 type bridge vlan_filtering 1
[root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999
[root@test ~]# ip l s dummy0 master bridge0
[root@test ~]# bridge vlan
port vlan ids
bridge0 9999 PVID Egress Untagged

dummy0 9999 PVID Egress Untagged

Fixes: 0f963b7592ef ("bridge: netlink: add support for default_pvid")
Acked-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Tobias Jungel <[email protected]>
Acked-by: Sabrina Dubroca <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/bridge/br_netlink.c | 7 +++++++
1 file changed, 7 insertions(+)

--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -828,6 +828,13 @@ static int br_validate(struct nlattr *tb
return -EPROTONOSUPPORT;
}
}
+
+ if (data[IFLA_BR_VLAN_DEFAULT_PVID]) {
+ __u16 defpvid = nla_get_u16(data[IFLA_BR_VLAN_DEFAULT_PVID]);
+
+ if (defpvid >= VLAN_VID_MASK)
+ return -EINVAL;
+ }
#endif

return 0;


2017-06-05 16:50:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 043/115] sparc: Fix -Wstringop-overflow warning

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Orlando Arias <[email protected]>


[ Upstream commit deba804c90642c8ed0f15ac1083663976d578f54 ]

Greetings,

GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows
in calls to string handling functions [1][2]. Due to the way
``empty_zero_page'' is declared in arch/sparc/include/setup.h, this
causes a warning to trigger at compile time in the function mem_init(),
which is subsequently converted to an error. The ensuing patch fixes
this issue and aligns the declaration of empty_zero_page to that of
other architectures. Thank you.

Cheers,
Orlando.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html
[2] https://gcc.gnu.org/gcc-7/changes.html

Signed-off-by: Orlando Arias <[email protected]>

--------------------------------------------------------------------------------
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/sparc/include/asm/pgtable_32.h | 4 ++--
arch/sparc/include/asm/setup.h | 2 +-
arch/sparc/mm/init_32.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)

--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -91,9 +91,9 @@ extern unsigned long pfn_base;
* ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc..
*/
-extern unsigned long empty_zero_page;
+extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];

-#define ZERO_PAGE(vaddr) (virt_to_page(&empty_zero_page))
+#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

/*
* In general all page table modifications should use the V8 atomic
--- a/arch/sparc/include/asm/setup.h
+++ b/arch/sparc/include/asm/setup.h
@@ -16,7 +16,7 @@ extern char reboot_command[];
*/
extern unsigned char boot_cpu_id;

-extern unsigned long empty_zero_page;
+extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];

extern int serial_console;
static inline int con_is_present(void)
--- a/arch/sparc/mm/init_32.c
+++ b/arch/sparc/mm/init_32.c
@@ -290,7 +290,7 @@ void __init mem_init(void)


/* Saves us work later. */
- memset((void *)&empty_zero_page, 0, PAGE_SIZE);
+ memset((void *)empty_zero_page, 0, PAGE_SIZE);

i = last_valid_pfn >> ((20 - PAGE_SHIFT) + 5);
i += 1;


2017-06-05 16:29:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 033/115] virtio-net: enable TSO/checksum offloads for Q-in-Q vlans

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Vlad Yasevich <[email protected]>


[ Upstream commit 2836b4f224d4fd7d1a2b23c3eecaf0f0ae199a74 ]

Since virtio does not provide it's own ndo_features_check handler,
TSO, and now checksum offload, are disabled for stacked vlans.
Re-enable the support and let the host take care of it. This
restores/improves Guest-to-Guest performance over Q-in-Q vlans.

Acked-by: Jason Wang <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Vladislav Yasevich <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/virtio_net.c | 1 +
1 file changed, 1 insertion(+)

--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1894,6 +1894,7 @@ static const struct net_device_ops virtn
.ndo_poll_controller = virtnet_netpoll,
#endif
.ndo_xdp = virtnet_xdp,
+ .ndo_features_check = passthru_features_check,
};

static void virtnet_config_changed_work(struct work_struct *work)


2017-06-05 16:29:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 022/115] ipv6: Prevent overrun when parsing v6 header options

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Craig Gallek <[email protected]>


[ Upstream commit 2423496af35d94a87156b063ea5cedffc10a70a1 ]

The KASAN warning repoted below was discovered with a syzkaller
program. The reproducer is basically:
int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
send(s, &one_byte_of_data, 1, MSG_MORE);
send(s, &more_than_mtu_bytes_data, 2000, 0);

The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non zero
byte of data, and the second send call triggers the fragmentation path.

The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option. Since nexthdr points
to an invalid option, the calculation of the size of the network header
can made to be much larger than the linear section of the skb and data
is read outside of it.

This fix makes ip6_find_1stfrag return an error if it detects
running out-of-bounds.

[ 42.361487] ==================================================================
[ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
[ 42.366469]
[ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[ 42.368824] Call Trace:
[ 42.369183] dump_stack+0xb3/0x10b
[ 42.369664] print_address_description+0x73/0x290
[ 42.370325] kasan_report+0x252/0x370
[ 42.370839] ? ip6_fragment+0x11c8/0x3730
[ 42.371396] check_memory_region+0x13c/0x1a0
[ 42.371978] memcpy+0x23/0x50
[ 42.372395] ip6_fragment+0x11c8/0x3730
[ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110
[ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0
[ 42.374263] ? ip6_forward+0x2e30/0x2e30
[ 42.374803] ip6_finish_output+0x584/0x990
[ 42.375350] ip6_output+0x1b7/0x690
[ 42.375836] ? ip6_finish_output+0x990/0x990
[ 42.376411] ? ip6_fragment+0x3730/0x3730
[ 42.376968] ip6_local_out+0x95/0x160
[ 42.377471] ip6_send_skb+0xa1/0x330
[ 42.377969] ip6_push_pending_frames+0xb3/0xe0
[ 42.378589] rawv6_sendmsg+0x2051/0x2db0
[ 42.379129] ? rawv6_bind+0x8b0/0x8b0
[ 42.379633] ? _copy_from_user+0x84/0xe0
[ 42.380193] ? debug_check_no_locks_freed+0x290/0x290
[ 42.380878] ? ___sys_sendmsg+0x162/0x930
[ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120
[ 42.382074] ? sock_has_perm+0x1f6/0x290
[ 42.382614] ? ___sys_sendmsg+0x167/0x930
[ 42.383173] ? lock_downgrade+0x660/0x660
[ 42.383727] inet_sendmsg+0x123/0x500
[ 42.384226] ? inet_sendmsg+0x123/0x500
[ 42.384748] ? inet_recvmsg+0x540/0x540
[ 42.385263] sock_sendmsg+0xca/0x110
[ 42.385758] SYSC_sendto+0x217/0x380
[ 42.386249] ? SYSC_connect+0x310/0x310
[ 42.386783] ? __might_fault+0x110/0x1d0
[ 42.387324] ? lock_downgrade+0x660/0x660
[ 42.387880] ? __fget_light+0xa1/0x1f0
[ 42.388403] ? __fdget+0x18/0x20
[ 42.388851] ? sock_common_setsockopt+0x95/0xd0
[ 42.389472] ? SyS_setsockopt+0x17f/0x260
[ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe
[ 42.390650] SyS_sendto+0x40/0x50
[ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.391731] RIP: 0033:0x7fbbb711e383
[ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
[ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
[ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
[ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
[ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
[ 42.397257]
[ 42.397411] Allocated by task 3789:
[ 42.397702] save_stack_trace+0x16/0x20
[ 42.398005] save_stack+0x46/0xd0
[ 42.398267] kasan_kmalloc+0xad/0xe0
[ 42.398548] kasan_slab_alloc+0x12/0x20
[ 42.398848] __kmalloc_node_track_caller+0xcb/0x380
[ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0
[ 42.399654] __alloc_skb+0xf8/0x580
[ 42.400003] sock_wmalloc+0xab/0xf0
[ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0
[ 42.400813] ip6_append_data+0x1a8/0x2f0
[ 42.401122] rawv6_sendmsg+0x11ee/0x2db0
[ 42.401505] inet_sendmsg+0x123/0x500
[ 42.401860] sock_sendmsg+0xca/0x110
[ 42.402209] ___sys_sendmsg+0x7cb/0x930
[ 42.402582] __sys_sendmsg+0xd9/0x190
[ 42.402941] SyS_sendmsg+0x2d/0x50
[ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.403718]
[ 42.403871] Freed by task 1794:
[ 42.404146] save_stack_trace+0x16/0x20
[ 42.404515] save_stack+0x46/0xd0
[ 42.404827] kasan_slab_free+0x72/0xc0
[ 42.405167] kfree+0xe8/0x2b0
[ 42.405462] skb_free_head+0x74/0xb0
[ 42.405806] skb_release_data+0x30e/0x3a0
[ 42.406198] skb_release_all+0x4a/0x60
[ 42.406563] consume_skb+0x113/0x2e0
[ 42.406910] skb_free_datagram+0x1a/0xe0
[ 42.407288] netlink_recvmsg+0x60d/0xe40
[ 42.407667] sock_recvmsg+0xd7/0x110
[ 42.408022] ___sys_recvmsg+0x25c/0x580
[ 42.408395] __sys_recvmsg+0xd6/0x190
[ 42.408753] SyS_recvmsg+0x2d/0x50
[ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 42.409513]
[ 42.409665] The buggy address belongs to the object at ffff88000969e780
[ 42.409665] which belongs to the cache kmalloc-512 of size 512
[ 42.410846] The buggy address is located 24 bytes inside of
[ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980)
[ 42.411941] The buggy address belongs to the page:
[ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
[ 42.413298] flags: 0x100000000008100(slab|head)
[ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
[ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
[ 42.415074] page dumped because: kasan: bad access detected
[ 42.415604]
[ 42.415757] Memory state around the buggy address:
[ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 42.418273] ^
[ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419273] ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 42.419882] ==================================================================

Reported-by: Andrey Konovalov <[email protected]>
Signed-off-by: Craig Gallek <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv6/ip6_offload.c | 2 ++
net/ipv6/ip6_output.c | 4 ++++
net/ipv6/output_core.c | 14 ++++++++------
net/ipv6/udp_offload.c | 2 ++
4 files changed, 16 insertions(+), 6 deletions(-)

--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -117,6 +117,8 @@ static struct sk_buff *ipv6_gso_segment(

if (udpfrag) {
unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
+ if (unfrag_ip6hlen < 0)
+ return ERR_PTR(unfrag_ip6hlen);
fptr = (struct frag_hdr *)((u8 *)ipv6h + unfrag_ip6hlen);
fptr->frag_off = htons(offset);
if (skb->next)
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -598,6 +598,10 @@ int ip6_fragment(struct net *net, struct
u8 *prevhdr, nexthdr = 0;

hlen = ip6_find_1stfragopt(skb, &prevhdr);
+ if (hlen < 0) {
+ err = hlen;
+ goto fail;
+ }
nexthdr = *prevhdr;

mtu = ip6_skb_dst_mtu(skb);
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -79,14 +79,13 @@ EXPORT_SYMBOL(ipv6_select_ident);
int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr)
{
u16 offset = sizeof(struct ipv6hdr);
- struct ipv6_opt_hdr *exthdr =
- (struct ipv6_opt_hdr *)(ipv6_hdr(skb) + 1);
unsigned int packet_len = skb_tail_pointer(skb) -
skb_network_header(skb);
int found_rhdr = 0;
*nexthdr = &ipv6_hdr(skb)->nexthdr;

- while (offset + 1 <= packet_len) {
+ while (offset <= packet_len) {
+ struct ipv6_opt_hdr *exthdr;

switch (**nexthdr) {

@@ -107,13 +106,16 @@ int ip6_find_1stfragopt(struct sk_buff *
return offset;
}

- offset += ipv6_optlen(exthdr);
- *nexthdr = &exthdr->nexthdr;
+ if (offset + sizeof(struct ipv6_opt_hdr) > packet_len)
+ return -EINVAL;
+
exthdr = (struct ipv6_opt_hdr *)(skb_network_header(skb) +
offset);
+ offset += ipv6_optlen(exthdr);
+ *nexthdr = &exthdr->nexthdr;
}

- return offset;
+ return -EINVAL;
}
EXPORT_SYMBOL(ip6_find_1stfragopt);

--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -91,6 +91,8 @@ static struct sk_buff *udp6_ufo_fragment
* bytes to insert fragment header.
*/
unfrag_ip6hlen = ip6_find_1stfragopt(skb, &prevhdr);
+ if (unfrag_ip6hlen < 0)
+ return ERR_PTR(unfrag_ip6hlen);
nexthdr = *prevhdr;
*prevhdr = NEXTHDR_FRAGMENT;
unfrag_len = (skb_network_header(skb) - skb_mac_header(skb)) +


2017-06-05 16:51:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 021/115] net: Improve handling of failures on link and route dumps

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: David Ahern <[email protected]>


[ Upstream commit f6c5775ff0bfa62b072face6bf1d40f659f194b2 ]

In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.

netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.

Fix route and link dump functions to return the errors if no object is
added to an skb (detected by skb->len != 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be ajusted as well.

Reported-by: Jan Moskyto Matejka <[email protected]>
Signed-off-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/core/rtnetlink.c | 36 ++++++++++++++++++++++++------------
net/ipv4/fib_frontend.c | 15 +++++++++++----
net/ipv4/fib_trie.c | 26 ++++++++++++++------------
3 files changed, 49 insertions(+), 28 deletions(-)

--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1620,13 +1620,13 @@ static int rtnl_dump_ifinfo(struct sk_bu
cb->nlh->nlmsg_seq, 0,
flags,
ext_filter_mask);
- /* If we ran out of room on the first message,
- * we're in trouble
- */
- WARN_ON((err == -EMSGSIZE) && (skb->len == 0));

- if (err < 0)
- goto out;
+ if (err < 0) {
+ if (likely(skb->len))
+ goto out;
+
+ goto out_err;
+ }

nl_dump_check_consistent(cb, nlmsg_hdr(skb));
cont:
@@ -1634,10 +1634,12 @@ cont:
}
}
out:
+ err = skb->len;
+out_err:
cb->args[1] = idx;
cb->args[0] = h;

- return skb->len;
+ return err;
}

int rtnl_nla_parse_ifla(struct nlattr **tb, const struct nlattr *head, int len)
@@ -3427,8 +3429,12 @@ static int rtnl_bridge_getlink(struct sk
err = br_dev->netdev_ops->ndo_bridge_getlink(
skb, portid, seq, dev,
filter_mask, NLM_F_MULTI);
- if (err < 0 && err != -EOPNOTSUPP)
- break;
+ if (err < 0 && err != -EOPNOTSUPP) {
+ if (likely(skb->len))
+ break;
+
+ goto out_err;
+ }
}
idx++;
}
@@ -3439,16 +3445,22 @@ static int rtnl_bridge_getlink(struct sk
seq, dev,
filter_mask,
NLM_F_MULTI);
- if (err < 0 && err != -EOPNOTSUPP)
- break;
+ if (err < 0 && err != -EOPNOTSUPP) {
+ if (likely(skb->len))
+ break;
+
+ goto out_err;
+ }
}
idx++;
}
}
+ err = skb->len;
+out_err:
rcu_read_unlock();
cb->args[0] = idx;

- return skb->len;
+ return err;
}

static inline size_t bridge_nlmsg_size(void)
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -760,7 +760,7 @@ static int inet_dump_fib(struct sk_buff
unsigned int e = 0, s_e;
struct fib_table *tb;
struct hlist_head *head;
- int dumped = 0;
+ int dumped = 0, err;

if (nlmsg_len(cb->nlh) >= sizeof(struct rtmsg) &&
((struct rtmsg *) nlmsg_data(cb->nlh))->rtm_flags & RTM_F_CLONED)
@@ -780,20 +780,27 @@ static int inet_dump_fib(struct sk_buff
if (dumped)
memset(&cb->args[2], 0, sizeof(cb->args) -
2 * sizeof(cb->args[0]));
- if (fib_table_dump(tb, skb, cb) < 0)
- goto out;
+ err = fib_table_dump(tb, skb, cb);
+ if (err < 0) {
+ if (likely(skb->len))
+ goto out;
+
+ goto out_err;
+ }
dumped = 1;
next:
e++;
}
}
out:
+ err = skb->len;
+out_err:
rcu_read_unlock();

cb->args[1] = e;
cb->args[0] = h;

- return skb->len;
+ return err;
}

/* Prepare and feed intra-kernel routing request.
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2079,6 +2079,8 @@ static int fn_trie_dump_leaf(struct key_

/* rcu_read_lock is hold by caller */
hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+ int err;
+
if (i < s_i) {
i++;
continue;
@@ -2089,17 +2091,14 @@ static int fn_trie_dump_leaf(struct key_
continue;
}

- if (fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
- RTM_NEWROUTE,
- tb->tb_id,
- fa->fa_type,
- xkey,
- KEYLENGTH - fa->fa_slen,
- fa->fa_tos,
- fa->fa_info, NLM_F_MULTI) < 0) {
+ err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+ tb->tb_id, fa->fa_type,
+ xkey, KEYLENGTH - fa->fa_slen,
+ fa->fa_tos, fa->fa_info, NLM_F_MULTI);
+ if (err < 0) {
cb->args[4] = i;
- return -1;
+ return err;
}
i++;
}
@@ -2121,10 +2120,13 @@ int fib_table_dump(struct fib_table *tb,
t_key key = cb->args[3];

while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
- if (fn_trie_dump_leaf(l, tb, skb, cb) < 0) {
+ int err;
+
+ err = fn_trie_dump_leaf(l, tb, skb, cb);
+ if (err < 0) {
cb->args[3] = key;
cb->args[2] = count;
- return -1;
+ return err;
}

++count;


2017-06-05 16:29:17

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 020/115] net/smc: Add warning about remote memory exposure

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Christoph Hellwig <[email protected]>


[ Upstream commit 19a0f7e37c0761a0a1cbf550705a6063c9675223 ]

The driver explicitly bypasses APIs to register all memory once a
connection is made, and thus allows remote access to memory.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Ursula Braun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/smc/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -8,6 +8,10 @@ config SMC
The Linux implementation of the SMC-R solution is designed as
a separate socket family SMC.

+ Warning: SMC will expose all memory for remote reads and writes
+ once a connection is established. Don't enable this option except
+ for tightly controlled lab environment.
+
Select this option if you want to run SMC socket applications

config SMC_DIAG


2017-06-05 16:29:15

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 017/115] net/mlx5e: Fix ethtool pause support and advertise reporting

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Gal Pressman <[email protected]>


[ Upstream commit e3c19503712d6360239b19c14cded56dd63c40d7 ]

Pause bit should set when RX pause is on, not TX pause.
Also, setting Asym_Pause is incorrect, and should be turned off.

Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API")
Signed-off-by: Gal Pressman <[email protected]>
Cc: [email protected]
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -773,7 +773,6 @@ static void get_supported(u32 eth_proto_
ptys2ethtool_supported_port(link_ksettings, eth_proto_cap);
ptys2ethtool_supported_link(supported, eth_proto_cap);
ethtool_link_ksettings_add_link_mode(link_ksettings, supported, Pause);
- ethtool_link_ksettings_add_link_mode(link_ksettings, supported, Asym_Pause);
}

static void get_advertising(u32 eth_proto_cap, u8 tx_pause,
@@ -783,7 +782,7 @@ static void get_advertising(u32 eth_prot
unsigned long *advertising = link_ksettings->link_modes.advertising;

ptys2ethtool_adver_link(advertising, eth_proto_cap);
- if (tx_pause)
+ if (rx_pause)
ethtool_link_ksettings_add_link_mode(link_ksettings, advertising, Pause);
if (tx_pause ^ rx_pause)
ethtool_link_ksettings_add_link_mode(link_ksettings, advertising, Asym_Pause);


2017-06-05 16:52:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 018/115] tcp: eliminate negative reordering in tcp_clean_rtx_queue

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Soheil Hassas Yeganeh <[email protected]>


[ Upstream commit bafbb9c73241760023d8981191ddd30bb1c6dbac ]

tcp_ack() can call tcp_fragment() which may dededuct the
value tp->fackets_out when MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reodering values higher than
sysctl_tcp_max_reordering.

Note that tcp_update_reordering indeeds sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.

Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <[email protected]>
Signed-off-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: Neal Cardwell <[email protected]>
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/tcp_input.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3189,7 +3189,7 @@ static int tcp_clean_rtx_queue(struct so
int delta;

/* Non-retransmitted hole got filled? That's reordering */
- if (reord < prior_fackets)
+ if (reord < prior_fackets && reord <= tp->fackets_out)
tcp_update_reordering(sk, tp->fackets_out - reord, 0);

delta = tcp_is_fack(tp) ? pkts_acked :


2017-06-05 16:28:52

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 027/115] ipv6: fix out of bound writes in __ip6_append_data()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <[email protected]>


[ Upstream commit 232cd35d0804cc241eb887bb8d4d9b3b9881c64a ]

Andrey Konovalov and [email protected] reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()

Andrey program lead to following state :

copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info

Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.

Once again, many thanks to Andrey and syzkaller team.

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Reported-by: <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv6/ip6_output.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1466,6 +1466,11 @@ alloc_new_skb:
*/
alloclen += sizeof(struct frag_hdr);

+ copy = datalen - transhdrlen - fraggap;
+ if (copy < 0) {
+ err = -EINVAL;
+ goto error;
+ }
if (transhdrlen) {
skb = sock_alloc_send_skb(sk,
alloclen + hh_len,
@@ -1515,13 +1520,9 @@ alloc_new_skb:
data += fraggap;
pskb_trim_unique(skb_prev, maxfraglen);
}
- copy = datalen - transhdrlen - fraggap;
-
- if (copy < 0) {
- err = -EINVAL;
- kfree_skb(skb);
- goto error;
- } else if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
+ if (copy > 0 &&
+ getfrag(from, data + transhdrlen, offset,
+ copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
goto error;


2017-06-05 16:53:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 029/115] net/mlx5: Avoid using pending command interface slots

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Mohamad Haj Yahia <[email protected]>


[ Upstream commit 73dd3a4839c1d27c36d4dcc92e1ff44225ecbeb7 ]

Currently when firmware command gets stuck or it takes long time to
complete, the driver command will get timeout and the command slot is
freed and can be used for new commands, and if the firmware receive new
command on the old busy slot its behavior is unexpected and this could
be harmful.
To fix this when the driver command gets timeout we return failure,
but we don't free the command slot and we wait for the firmware to
explicitly respond to that command.
Once all the entries are busy we will stop processing new firmware
commands.

Fixes: 9cba4ebcf374 ('net/mlx5: Fix potential deadlock in command mode change')
Signed-off-by: Mohamad Haj Yahia <[email protected]>
Cc: [email protected]
Signed-off-by: Saeed Mahameed <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 41 ++++++++++++++++++++---
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 -
drivers/net/ethernet/mellanox/mlx5/core/health.c | 2 -
include/linux/mlx5/driver.h | 7 +++
4 files changed, 44 insertions(+), 8 deletions(-)

--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -770,7 +770,7 @@ static void cb_timeout_handler(struct wo
mlx5_core_warn(dev, "%s(0x%x) timeout. Will cause a leak of a command resource\n",
mlx5_command_str(msg_to_opcode(ent->in)),
msg_to_opcode(ent->in));
- mlx5_cmd_comp_handler(dev, 1UL << ent->idx);
+ mlx5_cmd_comp_handler(dev, 1UL << ent->idx, true);
}

static void cmd_work_handler(struct work_struct *work)
@@ -800,6 +800,7 @@ static void cmd_work_handler(struct work
}

cmd->ent_arr[ent->idx] = ent;
+ set_bit(MLX5_CMD_ENT_STATE_PENDING_COMP, &ent->state);
lay = get_inst(cmd, ent->idx);
ent->lay = lay;
memset(lay, 0, sizeof(*lay));
@@ -821,6 +822,20 @@ static void cmd_work_handler(struct work
if (ent->callback)
schedule_delayed_work(&ent->cb_timeout_work, cb_timeout);

+ /* Skip sending command to fw if internal error */
+ if (pci_channel_offline(dev->pdev) ||
+ dev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR) {
+ u8 status = 0;
+ u32 drv_synd;
+
+ ent->ret = mlx5_internal_err_ret_value(dev, msg_to_opcode(ent->in), &drv_synd, &status);
+ MLX5_SET(mbox_out, ent->out, status, status);
+ MLX5_SET(mbox_out, ent->out, syndrome, drv_synd);
+
+ mlx5_cmd_comp_handler(dev, 1UL << ent->idx, true);
+ return;
+ }
+
/* ring doorbell after the descriptor is valid */
mlx5_core_dbg(dev, "writing 0x%x to command doorbell\n", 1 << ent->idx);
wmb();
@@ -831,7 +846,7 @@ static void cmd_work_handler(struct work
poll_timeout(ent);
/* make sure we read the descriptor after ownership is SW */
rmb();
- mlx5_cmd_comp_handler(dev, 1UL << ent->idx);
+ mlx5_cmd_comp_handler(dev, 1UL << ent->idx, (ent->ret == -ETIMEDOUT));
}
}

@@ -875,7 +890,7 @@ static int wait_func(struct mlx5_core_de
wait_for_completion(&ent->done);
} else if (!wait_for_completion_timeout(&ent->done, timeout)) {
ent->ret = -ETIMEDOUT;
- mlx5_cmd_comp_handler(dev, 1UL << ent->idx);
+ mlx5_cmd_comp_handler(dev, 1UL << ent->idx, true);
}

err = ent->ret;
@@ -1371,7 +1386,7 @@ static void free_msg(struct mlx5_core_de
}
}

-void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec)
+void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced)
{
struct mlx5_cmd *cmd = &dev->cmd;
struct mlx5_cmd_work_ent *ent;
@@ -1391,6 +1406,19 @@ void mlx5_cmd_comp_handler(struct mlx5_c
struct semaphore *sem;

ent = cmd->ent_arr[i];
+
+ /* if we already completed the command, ignore it */
+ if (!test_and_clear_bit(MLX5_CMD_ENT_STATE_PENDING_COMP,
+ &ent->state)) {
+ /* only real completion can free the cmd slot */
+ if (!forced) {
+ mlx5_core_err(dev, "Command completion arrived after timeout (entry idx = %d).\n",
+ ent->idx);
+ free_ent(cmd, ent->idx);
+ }
+ continue;
+ }
+
if (ent->callback)
cancel_delayed_work(&ent->cb_timeout_work);
if (ent->page_queue)
@@ -1413,7 +1441,10 @@ void mlx5_cmd_comp_handler(struct mlx5_c
mlx5_core_dbg(dev, "command completed. ret 0x%x, delivery status %s(0x%x)\n",
ent->ret, deliv_status_to_str(ent->status), ent->status);
}
- free_ent(cmd, ent->idx);
+
+ /* only real completion will free the entry slot */
+ if (!forced)
+ free_ent(cmd, ent->idx);

if (ent->callback) {
ds = ent->ts2 - ent->ts1;
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -422,7 +422,7 @@ static irqreturn_t mlx5_eq_int(int irq,
break;

case MLX5_EVENT_TYPE_CMD:
- mlx5_cmd_comp_handler(dev, be32_to_cpu(eqe->data.cmd.vector));
+ mlx5_cmd_comp_handler(dev, be32_to_cpu(eqe->data.cmd.vector), false);
break;

case MLX5_EVENT_TYPE_PORT_CHANGE:
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -90,7 +90,7 @@ static void trigger_cmd_completions(stru
spin_unlock_irqrestore(&dev->cmd.alloc_lock, flags);

mlx5_core_dbg(dev, "vector 0x%llx\n", vector);
- mlx5_cmd_comp_handler(dev, vector);
+ mlx5_cmd_comp_handler(dev, vector, true);
return;

no_trig:
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -785,7 +785,12 @@ enum {

typedef void (*mlx5_cmd_cbk_t)(int status, void *context);

+enum {
+ MLX5_CMD_ENT_STATE_PENDING_COMP,
+};
+
struct mlx5_cmd_work_ent {
+ unsigned long state;
struct mlx5_cmd_msg *in;
struct mlx5_cmd_msg *out;
void *uout;
@@ -979,7 +984,7 @@ void mlx5_cq_completion(struct mlx5_core
void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type);
void mlx5_srq_event(struct mlx5_core_dev *dev, u32 srqn, int event_type);
struct mlx5_core_srq *mlx5_core_get_srq(struct mlx5_core_dev *dev, u32 srqn);
-void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec);
+void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced);
void mlx5_cq_event(struct mlx5_core_dev *dev, u32 cqn, int event_type);
int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
int nent, u64 mask, const char *name,


2017-06-05 16:28:35

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 009/115] netem: fix skb_orphan_partial()

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <[email protected]>


[ Upstream commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 ]

I should have known that lowering skb->truesize was dangerous :/

In case packets are not leaving the host via a standard Ethernet device,
but looped back to local sockets, bad things can happen, as reported
by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )

So instead of tweaking skb->truesize, lets change skb->destructor
and keep a reference on the owner socket via its sk_refcnt.

Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Michael Madsen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/core/sock.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1699,28 +1699,24 @@ EXPORT_SYMBOL(skb_set_owner_w);
* delay queue. We want to allow the owner socket to send more
* packets, as if they were already TX completed by a typical driver.
* But we also want to keep skb->sk set because some packet schedulers
- * rely on it (sch_fq for example). So we set skb->truesize to a small
- * amount (1) and decrease sk_wmem_alloc accordingly.
+ * rely on it (sch_fq for example).
*/
void skb_orphan_partial(struct sk_buff *skb)
{
- /* If this skb is a TCP pure ACK or already went here,
- * we have nothing to do. 2 is already a very small truesize.
- */
- if (skb->truesize <= 2)
+ if (skb_is_tcp_pure_ack(skb))
return;

- /* TCP stack sets skb->ooo_okay based on sk_wmem_alloc,
- * so we do not completely orphan skb, but transfert all
- * accounted bytes but one, to avoid unexpected reorders.
- */
if (skb->destructor == sock_wfree
#ifdef CONFIG_INET
|| skb->destructor == tcp_wfree
#endif
) {
- atomic_sub(skb->truesize - 1, &skb->sk->sk_wmem_alloc);
- skb->truesize = 1;
+ struct sock *sk = skb->sk;
+
+ if (atomic_inc_not_zero(&sk->sk_refcnt)) {
+ atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+ skb->destructor = sock_efree;
+ }
} else {
skb_orphan(skb);
}


2017-06-05 16:28:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 008/115] bpf, arm64: fix faulty emission of map access in tail calls

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Daniel Borkmann <[email protected]>


[ Upstream commit d8b54110ee944de522ccd3531191f39986ec20f9 ]

Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

prog = array->ptrs[index];
if (prog == NULL)
goto out;

[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: f86a682a ldr x10, [x1,x10]
00000068: f862694b ldr x11, [x10,x2]
0000006c: b40000ab cbz x11, 0x00000080
[...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

[...]
00000060: d2800e0a mov x10, #0x70 // #112
00000064: 8b0a002a add x10, x1, x10
00000068: d37df04b lsl x11, x2, #3
0000006c: f86b694b ldr x11, [x10,x11]
00000070: b40000ab cbz x11, 0x00000084
[...]

This basically adds the offset to ptrs to the base address of the bpf
array we got and we later on access the map with an index * 8 offset
relative to that. The tail call map itself is basically one large area
with meta data at the head followed by the array of prog pointers.
This makes tail calls working again, tested on Cavium ThunderX ARMv8.

Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
arch/arm64/net/bpf_jit_comp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -252,8 +252,9 @@ static int emit_bpf_tail_call(struct jit
*/
off = offsetof(struct bpf_array, ptrs);
emit_a64_mov_i64(tmp, off, ctx);
- emit(A64_LDR64(tmp, r2, tmp), ctx);
- emit(A64_LDR64(prg, tmp, r3), ctx);
+ emit(A64_ADD(1, tmp, r2, tmp), ctx);
+ emit(A64_LSL(1, prg, r3, 3), ctx);
+ emit(A64_LDR64(prg, tmp, prg), ctx);
emit(A64_CBZ(1, prg, jmp_offset), ctx);

/* goto *(prog->bpf_func + prologue_size); */


2017-06-05 16:28:25

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 007/115] s390/qeth: add missing hash table initializations

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ursula Braun <[email protected]>


[ Upstream commit ebccc7397e4a49ff64c8f44a54895de9d32fe742 ]

commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
added new hash tables, but missed to initialize them.

Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
Signed-off-by: Ursula Braun <[email protected]>
Reviewed-by: Julian Wiedmann <[email protected]>
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/s390/net/qeth_l3_main.c | 2 ++
1 file changed, 2 insertions(+)

--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -3158,6 +3158,8 @@ static int qeth_l3_probe_device(struct c
rc = qeth_l3_create_device_attributes(&gdev->dev);
if (rc)
return rc;
+ hash_init(card->ip_htable);
+ hash_init(card->ip_mc_htable);
card->options.layer2 = 0;
card->info.hwtrap = 0;
return 0;


2017-06-05 16:28:21

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 003/115] ipv6/dccp: do not inherit ipv6_mc_list from parent

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: WANG Cong <[email protected]>


[ Upstream commit 83eaddab4378db256d00d295bda6ca997cd13a52 ]

Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
we should clear ipv6_mc_list etc. for IPv6 sockets too.

Cc: Eric Dumazet <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/dccp/ipv6.c | 6 ++++++
net/ipv6/tcp_ipv6.c | 2 ++
2 files changed, 8 insertions(+)

--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -426,6 +426,9 @@ static struct sock *dccp_v6_request_recv
newsk->sk_backlog_rcv = dccp_v4_do_rcv;
newnp->pktoptions = NULL;
newnp->opt = NULL;
+ newnp->ipv6_mc_list = NULL;
+ newnp->ipv6_ac_list = NULL;
+ newnp->ipv6_fl_list = NULL;
newnp->mcast_oif = inet6_iif(skb);
newnp->mcast_hops = ipv6_hdr(skb)->hop_limit;

@@ -490,6 +493,9 @@ static struct sock *dccp_v6_request_recv
/* Clone RX bits */
newnp->rxopt.all = np->rxopt.all;

+ newnp->ipv6_mc_list = NULL;
+ newnp->ipv6_ac_list = NULL;
+ newnp->ipv6_fl_list = NULL;
newnp->pktoptions = NULL;
newnp->opt = NULL;
newnp->mcast_oif = inet6_iif(skb);
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1070,6 +1070,7 @@ static struct sock *tcp_v6_syn_recv_sock
newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
#endif

+ newnp->ipv6_mc_list = NULL;
newnp->ipv6_ac_list = NULL;
newnp->ipv6_fl_list = NULL;
newnp->pktoptions = NULL;
@@ -1139,6 +1140,7 @@ static struct sock *tcp_v6_syn_recv_sock
First: no IPv4 options.
*/
newinet->inet_opt = NULL;
+ newnp->ipv6_mc_list = NULL;
newnp->ipv6_ac_list = NULL;
newnp->ipv6_fl_list = NULL;



2017-06-05 16:55:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 004/115] s390/qeth: handle sysfs error during initialization

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Ursula Braun <[email protected]>


[ Upstream commit 9111e7880ccf419548c7b0887df020b08eadb075 ]

When setting up the device from within the layer discipline's
probe routine, creating the layer-specific sysfs attributes can fail.
Report this error back to the caller, and handle it by
releasing the layer discipline.

Signed-off-by: Ursula Braun <[email protected]>
[jwi: updated commit msg, moved an OSN change to a subsequent patch]
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/s390/net/qeth_core_main.c | 4 +++-
drivers/s390/net/qeth_core_sys.c | 2 ++
drivers/s390/net/qeth_l2_main.c | 5 ++++-
drivers/s390/net/qeth_l3_main.c | 5 ++++-
4 files changed, 13 insertions(+), 3 deletions(-)

--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5661,8 +5661,10 @@ static int qeth_core_set_online(struct c
if (rc)
goto err;
rc = card->discipline->setup(card->gdev);
- if (rc)
+ if (rc) {
+ qeth_core_free_discipline(card);
goto err;
+ }
}
rc = card->discipline->set_online(gdev);
err:
--- a/drivers/s390/net/qeth_core_sys.c
+++ b/drivers/s390/net/qeth_core_sys.c
@@ -426,6 +426,8 @@ static ssize_t qeth_dev_layer2_store(str
goto out;

rc = card->discipline->setup(card->gdev);
+ if (rc)
+ qeth_core_free_discipline(card);
out:
mutex_unlock(&card->discipline_mutex);
return rc ? rc : count;
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -1009,8 +1009,11 @@ static int qeth_l2_stop(struct net_devic
static int qeth_l2_probe_device(struct ccwgroup_device *gdev)
{
struct qeth_card *card = dev_get_drvdata(&gdev->dev);
+ int rc;

- qeth_l2_create_device_attributes(&gdev->dev);
+ rc = qeth_l2_create_device_attributes(&gdev->dev);
+ if (rc)
+ return rc;
INIT_LIST_HEAD(&card->vid_list);
hash_init(card->mac_htable);
card->options.layer2 = 1;
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -3153,8 +3153,11 @@ static int qeth_l3_setup_netdev(struct q
static int qeth_l3_probe_device(struct ccwgroup_device *gdev)
{
struct qeth_card *card = dev_get_drvdata(&gdev->dev);
+ int rc;

- qeth_l3_create_device_attributes(&gdev->dev);
+ rc = qeth_l3_create_device_attributes(&gdev->dev);
+ if (rc)
+ return rc;
card->options.layer2 = 0;
card->info.hwtrap = 0;
return 0;


2017-06-05 16:55:42

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 005/115] s390/qeth: unbreak OSM and OSN support

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Julian Wiedmann <[email protected]>


[ Upstream commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82 ]

commit b4d72c08b358 ("qeth: bridgeport support - basic control")
broke the support for OSM and OSN devices as follows:

As OSM and OSN are L2 only, qeth_core_probe_device() does an early
setup by loading the l2 discipline and calling qeth_l2_probe_device().
In this context, adding the l2-specific bridgeport sysfs attributes
via qeth_l2_create_device_attributes() hits a BUG_ON in fs/sysfs/group.c,
since the basic sysfs infrastructure for the device hasn't been
established yet.

Note that OSN actually has its own unique sysfs attributes
(qeth_osn_devtype), so the additional attributes shouldn't be created
at all.
For OSM, add a new qeth_l2_devtype that contains all the common
and l2-specific sysfs attributes.
When qeth_core_probe_device() does early setup for OSM or OSN, assign
the corresponding devtype so that the ccwgroup probe code creates the
full set of sysfs attributes.
This allows us to skip qeth_l2_create_device_attributes() in case
of an early setup.

Any device that can't do early setup will initially have only the
generic sysfs attributes, and when it's probed later
qeth_l2_probe_device() adds the l2-specific attributes.

If an early-setup device is removed (by calling ccwgroup_ungroup()),
device_unregister() will - using the devtype - delete the
l2-specific attributes before qeth_l2_remove_device() is called.
So make sure to not remove them twice.

What complicates the issue is that qeth_l2_probe_device() and
qeth_l2_remove_device() is also called on a device when its
layer2 attribute changes (ie. its layer mode is switched).
For early-setup devices this wouldn't work properly - we wouldn't
remove the l2-specific attributes when switching to L3.
But switching the layer mode doesn't actually make any sense;
we already decided that the device can only operate in L2!
So just refuse to switch the layer mode on such devices. Note that
OSN doesn't have a layer2 attribute, so we only need to special-case
OSM.

Based on an initial patch by Ursula Braun.

Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
Signed-off-by: Julian Wiedmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
drivers/s390/net/qeth_core.h | 4 ++++
drivers/s390/net/qeth_core_main.c | 17 +++++++++--------
drivers/s390/net/qeth_core_sys.c | 22 ++++++++++++++--------
drivers/s390/net/qeth_l2.h | 2 ++
drivers/s390/net/qeth_l2_main.c | 17 +++++++++++++----
drivers/s390/net/qeth_l2_sys.c | 8 ++++++++
drivers/s390/net/qeth_l3_main.c | 1 +
7 files changed, 51 insertions(+), 20 deletions(-)

--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -714,6 +714,7 @@ enum qeth_discipline_id {
};

struct qeth_discipline {
+ const struct device_type *devtype;
void (*start_poll)(struct ccw_device *, int, unsigned long);
qdio_handler_t *input_handler;
qdio_handler_t *output_handler;
@@ -889,6 +890,9 @@ extern struct qeth_discipline qeth_l2_di
extern struct qeth_discipline qeth_l3_discipline;
extern const struct attribute_group *qeth_generic_attr_groups[];
extern const struct attribute_group *qeth_osn_attr_groups[];
+extern const struct attribute_group qeth_device_attr_group;
+extern const struct attribute_group qeth_device_blkt_group;
+extern const struct device_type qeth_generic_devtype;
extern struct workqueue_struct *qeth_wq;

int qeth_card_hw_is_reachable(struct qeth_card *);
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5460,10 +5460,12 @@ void qeth_core_free_discipline(struct qe
card->discipline = NULL;
}

-static const struct device_type qeth_generic_devtype = {
+const struct device_type qeth_generic_devtype = {
.name = "qeth_generic",
.groups = qeth_generic_attr_groups,
};
+EXPORT_SYMBOL_GPL(qeth_generic_devtype);
+
static const struct device_type qeth_osn_devtype = {
.name = "qeth_osn",
.groups = qeth_osn_attr_groups,
@@ -5589,23 +5591,22 @@ static int qeth_core_probe_device(struct
goto err_card;
}

- if (card->info.type == QETH_CARD_TYPE_OSN)
- gdev->dev.type = &qeth_osn_devtype;
- else
- gdev->dev.type = &qeth_generic_devtype;
-
switch (card->info.type) {
case QETH_CARD_TYPE_OSN:
case QETH_CARD_TYPE_OSM:
rc = qeth_core_load_discipline(card, QETH_DISCIPLINE_LAYER2);
if (rc)
goto err_card;
+
+ gdev->dev.type = (card->info.type != QETH_CARD_TYPE_OSN)
+ ? card->discipline->devtype
+ : &qeth_osn_devtype;
rc = card->discipline->setup(card->gdev);
if (rc)
goto err_disc;
- case QETH_CARD_TYPE_OSD:
- case QETH_CARD_TYPE_OSX:
+ break;
default:
+ gdev->dev.type = &qeth_generic_devtype;
break;
}

--- a/drivers/s390/net/qeth_core_sys.c
+++ b/drivers/s390/net/qeth_core_sys.c
@@ -413,12 +413,16 @@ static ssize_t qeth_dev_layer2_store(str

if (card->options.layer2 == newdis)
goto out;
- else {
- card->info.mac_bits = 0;
- if (card->discipline) {
- card->discipline->remove(card->gdev);
- qeth_core_free_discipline(card);
- }
+ if (card->info.type == QETH_CARD_TYPE_OSM) {
+ /* fixed layer, can't switch */
+ rc = -EOPNOTSUPP;
+ goto out;
+ }
+
+ card->info.mac_bits = 0;
+ if (card->discipline) {
+ card->discipline->remove(card->gdev);
+ qeth_core_free_discipline(card);
}

rc = qeth_core_load_discipline(card, newdis);
@@ -705,10 +709,11 @@ static struct attribute *qeth_blkt_devic
&dev_attr_inter_jumbo.attr,
NULL,
};
-static struct attribute_group qeth_device_blkt_group = {
+const struct attribute_group qeth_device_blkt_group = {
.name = "blkt",
.attrs = qeth_blkt_device_attrs,
};
+EXPORT_SYMBOL_GPL(qeth_device_blkt_group);

static struct attribute *qeth_device_attrs[] = {
&dev_attr_state.attr,
@@ -728,9 +733,10 @@ static struct attribute *qeth_device_att
&dev_attr_switch_attrs.attr,
NULL,
};
-static struct attribute_group qeth_device_attr_group = {
+const struct attribute_group qeth_device_attr_group = {
.attrs = qeth_device_attrs,
};
+EXPORT_SYMBOL_GPL(qeth_device_attr_group);

const struct attribute_group *qeth_generic_attr_groups[] = {
&qeth_device_attr_group,
--- a/drivers/s390/net/qeth_l2.h
+++ b/drivers/s390/net/qeth_l2.h
@@ -8,6 +8,8 @@

#include "qeth_core.h"

+extern const struct attribute_group *qeth_l2_attr_groups[];
+
int qeth_l2_create_device_attributes(struct device *);
void qeth_l2_remove_device_attributes(struct device *);
void qeth_l2_setup_bridgeport_attrs(struct qeth_card *card);
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -1006,14 +1006,21 @@ static int qeth_l2_stop(struct net_devic
return 0;
}

+static const struct device_type qeth_l2_devtype = {
+ .name = "qeth_layer2",
+ .groups = qeth_l2_attr_groups,
+};
+
static int qeth_l2_probe_device(struct ccwgroup_device *gdev)
{
struct qeth_card *card = dev_get_drvdata(&gdev->dev);
int rc;

- rc = qeth_l2_create_device_attributes(&gdev->dev);
- if (rc)
- return rc;
+ if (gdev->dev.type == &qeth_generic_devtype) {
+ rc = qeth_l2_create_device_attributes(&gdev->dev);
+ if (rc)
+ return rc;
+ }
INIT_LIST_HEAD(&card->vid_list);
hash_init(card->mac_htable);
card->options.layer2 = 1;
@@ -1025,7 +1032,8 @@ static void qeth_l2_remove_device(struct
{
struct qeth_card *card = dev_get_drvdata(&cgdev->dev);

- qeth_l2_remove_device_attributes(&cgdev->dev);
+ if (cgdev->dev.type == &qeth_generic_devtype)
+ qeth_l2_remove_device_attributes(&cgdev->dev);
qeth_set_allowed_threads(card, 0, 1);
wait_event(card->wait_q, qeth_threads_running(card, 0xffffffff) == 0);

@@ -1409,6 +1417,7 @@ static int qeth_l2_control_event(struct
}

struct qeth_discipline qeth_l2_discipline = {
+ .devtype = &qeth_l2_devtype,
.start_poll = qeth_qdio_start_poll,
.input_handler = (qdio_handler_t *) qeth_qdio_input_handler,
.output_handler = (qdio_handler_t *) qeth_qdio_output_handler,
--- a/drivers/s390/net/qeth_l2_sys.c
+++ b/drivers/s390/net/qeth_l2_sys.c
@@ -272,3 +272,11 @@ void qeth_l2_setup_bridgeport_attrs(stru
} else
qeth_bridgeport_an_set(card, 0);
}
+
+const struct attribute_group *qeth_l2_attr_groups[] = {
+ &qeth_device_attr_group,
+ &qeth_device_blkt_group,
+ /* l2 specific, see l2_{create,remove}_device_attributes(): */
+ &qeth_l2_bridgeport_attr_group,
+ NULL,
+};
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -3434,6 +3434,7 @@ static int qeth_l3_control_event(struct
}

struct qeth_discipline qeth_l3_discipline = {
+ .devtype = &qeth_generic_devtype,
.start_poll = qeth_qdio_start_poll,
.input_handler = (qdio_handler_t *) qeth_qdio_input_handler,
.output_handler = (qdio_handler_t *) qeth_qdio_output_handler,


2017-06-05 16:27:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 001/115] dccp/tcp: do not inherit mc_list from parent

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <[email protected]>


[ Upstream commit 657831ffc38e30092a2d5f03d385d710eb88b09a ]

syzkaller found a way to trigger double frees from ip_mc_drop_socket()

It turns out that leave a copy of parent mc_list at accept() time,
which is very bad.

Very similar to commit 8b485ce69876 ("tcp: do not inherit
fastopen_req from parent")

Initial report from Pray3r, completed by Andrey one.
Thanks a lot to them !

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Pray3r <[email protected]>
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/inet_connection_sock.c | 2 ++
1 file changed, 2 insertions(+)

--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -794,6 +794,8 @@ struct sock *inet_csk_clone_lock(const s
/* listeners have SOCK_RCU_FREE, not the children */
sock_reset_flag(newsk, SOCK_RCU_FREE);

+ inet_sk(newsk)->mc_list = NULL;
+
newsk->sk_mark = inet_rsk(req)->ir_mark;
atomic64_set(&newsk->sk_cookie,
atomic64_read(&inet_rsk(req)->ir_cookie));


2017-06-05 16:56:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 011/115] tcp: avoid fragmenting peculiar skbs in SACK

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Yuchung Cheng <[email protected]>


[ Upstream commit b451e5d24ba6687c6f0e7319c727a709a1846c06 ]

This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.

The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size. Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.

Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/ipv4/tcp_input.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1174,13 +1174,14 @@ static int tcp_match_skb_to_sack(struct
*/
if (pkt_len > mss) {
unsigned int new_len = (pkt_len / mss) * mss;
- if (!in_sack && new_len < pkt_len) {
+ if (!in_sack && new_len < pkt_len)
new_len += mss;
- if (new_len >= skb->len)
- return 0;
- }
pkt_len = new_len;
}
+
+ if (pkt_len >= skb->len && !in_sack)
+ return 0;
+
err = tcp_fragment(sk, skb, pkt_len, mss, GFP_ATOMIC);
if (err < 0)
return err;


2017-06-05 16:56:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: [PATCH 4.11 012/115] tipc: make macro tipc_wait_for_cond() smp safe

4.11-stable review patch. If anyone has any objections, please let me know.

------------------

From: Jon Paul Maloy <[email protected]>


[ Upstream commit 844cf763fba654436d3a4279b6a672c196cf1901 ]

The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
to fulfil its task. The latter, in turn, is evaluating the stated
condition outside the socket lock context. This is problematic if
the condition is accessing non-trivial data structures which may be
altered by incoming interrupts, as is the case with the cong_links()
linked list, used by socket to keep track of the current set of
congested links. We sometimes see crashes when this list is accessed
by a condition function at the same time as a SOCK_WAKEUP interrupt
is removing an element from the list.

We fix this by expanding selected parts of sk_wait_event() into the
outer macro, while ensuring that all evaluations of a given condition
are performed under socket lock protection.

Fixes: commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Reviewed-by: Parthasarathy Bhuvaragan <[email protected]>
Signed-off-by: Jon Maloy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
---
net/tipc/socket.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)

--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -361,25 +361,25 @@ static int tipc_sk_sock_err(struct socke
return 0;
}

-#define tipc_wait_for_cond(sock_, timeout_, condition_) \
-({ \
- int rc_ = 0; \
- int done_ = 0; \
- \
- while (!(condition_) && !done_) { \
- struct sock *sk_ = sock->sk; \
- DEFINE_WAIT_FUNC(wait_, woken_wake_function); \
- \
- rc_ = tipc_sk_sock_err(sock_, timeout_); \
- if (rc_) \
- break; \
- prepare_to_wait(sk_sleep(sk_), &wait_, \
- TASK_INTERRUPTIBLE); \
- done_ = sk_wait_event(sk_, timeout_, \
- (condition_), &wait_); \
- remove_wait_queue(sk_sleep(sk_), &wait_); \
- } \
- rc_; \
+#define tipc_wait_for_cond(sock_, timeo_, condition_) \
+({ \
+ struct sock *sk_; \
+ int rc_; \
+ \
+ while ((rc_ = !(condition_))) { \
+ DEFINE_WAIT_FUNC(wait_, woken_wake_function); \
+ sk_ = (sock_)->sk; \
+ rc_ = tipc_sk_sock_err((sock_), timeo_); \
+ if (rc_) \
+ break; \
+ prepare_to_wait(sk_sleep(sk_), &wait_, TASK_INTERRUPTIBLE); \
+ release_sock(sk_); \
+ *(timeo_) = wait_woken(&wait_, TASK_INTERRUPTIBLE, *(timeo_)); \
+ sched_annotate_sleep(); \
+ lock_sock(sk_); \
+ remove_wait_queue(sk_sleep(sk_), &wait_); \
+ } \
+ rc_; \
})

/**


2017-06-05 20:33:45

by Shuah Khan

[permalink] [raw]
Subject: Re: [PATCH 4.11 000/115] 4.11.4-stable review

On 06/05/2017 10:16 AM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.11.4 release.
> There are 115 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Wed Jun 7 15:30:31 UTC 2017.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.11.4-rc1.gz
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.11.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

Compiled and booted on my test system, No dmesg regressions.

thanks,
-- Shuah

2017-06-05 22:27:04

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 4.11 000/115] 4.11.4-stable review

On Mon, Jun 05, 2017 at 06:16:33PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.11.4 release.
> There are 115 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Wed Jun 7 15:30:31 UTC 2017.
> Anything received after that time might be too late.
>

Build results:
total: 145 pass: 145 fail: 0
Qemu test results:
total: 122 pass: 122 fail: 0

Details are available at http://kerneltests.org/builders.

Guenter

2017-06-06 07:20:29

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 4.11 000/115] 4.11.4-stable review

On Mon, Jun 05, 2017 at 03:26:59PM -0700, Guenter Roeck wrote:
> On Mon, Jun 05, 2017 at 06:16:33PM +0200, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.11.4 release.
> > There are 115 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Wed Jun 7 15:30:31 UTC 2017.
> > Anything received after that time might be too late.
> >
>
> Build results:
> total: 145 pass: 145 fail: 0
> Qemu test results:
> total: 122 pass: 122 fail: 0
>
> Details are available at http://kerneltests.org/builders.
>
> Guenter

Thanks for testing all of these and letting me know.

greg k-h

2017-06-06 07:20:55

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 4.11 000/115] 4.11.4-stable review

On Mon, Jun 05, 2017 at 02:33:31PM -0600, Shuah Khan wrote:
> On 06/05/2017 10:16 AM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.11.4 release.
> > There are 115 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Wed Jun 7 15:30:31 UTC 2017.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.11.4-rc1.gz
> > or in the git tree and branch at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.11.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
> >
>
> Compiled and booted on my test system, No dmesg regressions.

Thanks for testing all of these and letting me know.

greg k-h