2008-12-03 19:48:19

by Greg KH

Subject: [patch 000/104] 2.6.27-stable review

This is the start of the stable review cycle for the 2.6.27.8 release.
There are 104 patches in this series, all of which will be posted as
responses to this one. If anyone has any issues with these being applied,
please let us know. If anyone is a maintainer of the proper subsystem, and
wants to add a Signed-off-by: line to the patch, please respond with it.

And yes, there are a lot of patches here. The big series are:
 - cifs data corruption patches
 - pci hotplug slot patches to fix the most common warning
   showing up on kerneloops.org
 - ext4 bugfixes

These patches are sent out with a number of different people on the Cc:
line. If you wish to be a reviewer, please email [email protected] to
add your name to the list. If you want to be off the reviewer list,
also email us.

Responses should be made by Friday, December 5, 20:00:00 UTC. Anything
received after that time might be too late.

The whole patch series can be found in one patch at:
kernel.org/pub/linux/kernel/v2.6/stable-review/patch-2.6.27.8-rc1.gz
and the diffstat can be found below.


thanks,

greg k-h

Documentation/filesystems/proc.txt | 27 +
Makefile | 2 +-
arch/ia64/kernel/acpi.c | 29 +-
arch/ia64/kernel/setup.c | 7 +-
arch/parisc/kernel/traps.c | 43 +-
arch/powerpc/platforms/cell/spufs/file.c | 3 +
arch/powerpc/platforms/cell/spufs/inode.c | 2 +
arch/x86/kernel/acpi/boot.c | 8 +
arch/x86/kernel/cpu/cpufreq/powernow-k8.c | 18 +-
arch/x86/kernel/cpu/cpufreq/powernow-k8.h | 17 +-
arch/x86/kernel/early-quirks.c | 55 ++-
arch/x86/kernel/setup.c | 2 +-
arch/x86/mm/discontig_32.c | 35 ++
arch/x86/power/hibernate_32.c | 4 +
arch/x86/xen/enlighten.c | 2 +-
drivers/acpi/ec.c | 3 +-
drivers/acpi/pci_slot.c | 2 +-
drivers/ata/libata-core.c | 25 +-
drivers/ata/libata-sff.c | 13 +-
drivers/firewire/fw-sbp2.c | 5 +
drivers/gpio/gpiolib.c | 2 +-
drivers/ieee1394/sbp2.c | 5 +
drivers/infiniband/hw/mlx4/cq.c | 5 +
drivers/input/keyboard/atkbd.c | 25 +
drivers/media/video/compat_ioctl32.c | 3 +
drivers/net/atl1e/atl1e_hw.c | 4 -
drivers/net/e1000/e1000_ethtool.c | 8 +-
drivers/net/e1000/e1000_main.c | 1 +
drivers/net/e1000e/ethtool.c | 8 +-
drivers/net/e1000e/netdev.c | 1 +
drivers/net/igb/igb_ethtool.c | 8 +-
drivers/net/igb/igb_main.c | 1 +
drivers/net/pcmcia/axnet_cs.c | 1 +
drivers/net/pcmcia/pcnet_cs.c | 1 -
drivers/net/wireless/ath9k/recv.c | 10 +-
drivers/net/wireless/rtl8187_dev.c | 3 +
drivers/parport/parport_serial.c | 2 +
drivers/pci/hotplug/acpiphp.h | 9 +-
drivers/pci/hotplug/acpiphp_core.c | 32 +-
drivers/pci/hotplug/cpci_hotplug.h | 6 +
drivers/pci/hotplug/cpci_hotplug_core.c | 75 +--
drivers/pci/hotplug/cpci_hotplug_pci.c | 4 +-
drivers/pci/hotplug/cpqphp.h | 13 +-
drivers/pci/hotplug/cpqphp_core.c | 43 +-
drivers/pci/hotplug/fakephp.c | 18 +-
drivers/pci/hotplug/ibmphp.h | 5 +-
drivers/pci/hotplug/ibmphp_ebda.c | 19 +-
drivers/pci/hotplug/pci_hotplug_core.c | 64 +--
drivers/pci/hotplug/pciehp.h | 9 +-
drivers/pci/hotplug/pciehp_core.c | 49 +-
drivers/pci/hotplug/pciehp_ctrl.c | 48 +-
drivers/pci/hotplug/pciehp_hpc.c | 1 -
drivers/pci/hotplug/rpaphp_slot.c | 10 +-
drivers/pci/hotplug/sgi_hotplug.c | 18 +-
drivers/pci/hotplug/shpchp.h | 9 +-
drivers/pci/hotplug/shpchp_core.c | 52 +--
drivers/pci/hotplug/shpchp_ctrl.c | 48 +-
drivers/pci/slot.c | 143 ++++-
drivers/spi/pxa2xx_spi.c | 24 +-
drivers/usb/gadget/f_rndis.c | 3 +-
drivers/usb/host/ehci-pci.c | 24 +
drivers/usb/mon/mon_bin.c | 5 +-
drivers/video/fbmem.c | 2 +-
drivers/watchdog/hpwdt.c | 5 +-
fs/cifs/cifs_debug.c | 277 +++++-----
fs/cifs/cifs_spnego.c | 3 +-
fs/cifs/cifsfs.c | 30 +-
fs/cifs/cifsglob.h | 43 +-
fs/cifs/cifsproto.h | 2 +-
fs/cifs/cifssmb.c | 97 ++--
fs/cifs/connect.c | 869 +++++++++++++++--------------
fs/cifs/file.c | 9 +-
fs/cifs/misc.c | 90 ++--
fs/cifs/transport.c | 48 ++-
fs/ecryptfs/keystore.c | 31 +-
fs/eventpoll.c | 85 +++-
fs/ext2/balloc.c | 3 +-
fs/ext3/balloc.c | 3 +-
fs/ext3/dir.c | 22 +-
fs/ext3/resize.c | 3 +-
fs/ext4/balloc.c | 4 +-
fs/ext4/dir.c | 20 +-
fs/ext4/ext4.h | 10 +-
fs/ext4/ialloc.c | 6 +-
fs/ext4/inode.c | 7 +-
fs/ext4/ioctl.c | 21 +-
fs/ext4/mballoc.c | 22 +-
fs/ext4/migrate.c | 10 +-
fs/ext4/resize.c | 9 +
fs/ext4/super.c | 77 ++-
fs/ext4/xattr.c | 6 +
fs/inotify.c | 150 +++++-
fs/jbd/transaction.c | 16 +-
fs/jbd2/checkpoint.c | 41 ++-
fs/jbd2/commit.c | 5 +-
fs/jbd2/journal.c | 49 ++-
include/asm-x86/mmzone_32.h | 4 +
include/asm-x86/pci_64.h | 14 -
include/linux/idr.h | 3 +-
include/linux/inotify.h | 11 +
include/linux/jbd2.h | 3 +-
include/linux/libata.h | 1 +
include/linux/pci.h | 8 +-
include/linux/pci_hotplug.h | 11 +-
include/linux/sched.h | 4 +
include/net/af_unix.h | 1 +
ipc/util.c | 14 +-
kernel/Makefile | 4 +-
kernel/audit_tree.c | 91 ++--
kernel/auditfilter.c | 14 +-
kernel/cgroup.c | 7 +-
kernel/cpuset.c | 12 +-
kernel/sched.c | 13 +-
kernel/sysctl.c | 10 +
lib/idr.c | 14 +-
lib/scatterlist.c | 2 +-
net/unix/af_unix.c | 2 +
net/unix/garbage.c | 13 +-
118 files changed, 2118 insertions(+), 1334 deletions(-)


2008-12-03 19:50:24

by Greg KH

Subject: [patch 001/104] USB: gadget rndis: send notifications

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Richard Röjfors <[email protected]>

commit ff3495052af48f7a2bf7961b131dc9e161dae19c upstream.

It turns out that atomic_inc_return() returns the *new* value
not the original one, so the logic in rndis_response_available()
kept the first RNDIS response notification from getting out.
This prevented interoperation with MS-Windows (but not Linux).

Fix this to make RNDIS behave again.

Signed-off-by: Richard Röjfors <[email protected]>
Signed-off-by: David Brownell <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/usb/gadget/f_rndis.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/usb/gadget/f_rndis.c
+++ b/drivers/usb/gadget/f_rndis.c
@@ -303,7 +303,7 @@ static void rndis_response_available(voi
__le32 *data = req->buf;
int status;

- if (atomic_inc_return(&rndis->notify_count))
+ if (atomic_inc_return(&rndis->notify_count) != 1)
return;

/* Send RNDIS RESPONSE_AVAILABLE notification; a
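
Reviewer note: a minimal userspace sketch of the logic error described in the changelog, assuming C11 <stdatomic.h> as a stand-in for the kernel's atomic_t. The helper inc_return() and the two *_should_send() functions are hypothetical names used only for illustration; they are not part of f_rndis.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for atomic_inc_return(): returns the NEW value. */
static int inc_return(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1;
}

/* Old logic: "if (atomic_inc_return(...)) return;" sends only on 0. */
static bool old_should_send(atomic_int *notify_count)
{
	return inc_return(notify_count) == 0;
}

/* Fixed logic: send only when this call moved the count from 0 to 1. */
static bool new_should_send(atomic_int *notify_count)
{
	return inc_return(notify_count) == 1;
}

/* Small self-check showing the difference on a fresh counter. */
void rndis_notify_demo(void)
{
	atomic_int old_count, new_count;

	atomic_init(&old_count, 0);
	atomic_init(&new_count, 0);

	assert(!old_should_send(&old_count)); /* first notification is lost */
	assert(new_should_send(&new_count));  /* first notification goes out */
	assert(!new_should_send(&new_count)); /* a second call while one is pending is suppressed */
}
```

With a counter that starts at zero, the old test never lets a notification out, while the fixed test lets exactly the first caller through.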

2008-12-03 19:50:51

by Greg KH

Subject: [patch 002/104] USB: gadget rndis: stop windows self-immolation

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: David Brownell <[email protected]>

commit 9c264521a9f836541c122b00f505cfd60cc5bbb5 upstream.

Somewhere in the conversion of the RNDIS gadget code to the new
framework, the descriptor of its data interface seems to have
been copied from the CDC Ethernet driver. Unfortunately that
means it got a nonzero altsetting ... which is incorrect. Issue
uncovered by Richard Röjfors <[email protected]>.

This patch fixes that problem, and resolves at least some cases
of Windows XP bluescreening itself.

Tested-by: Richard Röjfors <[email protected]>.
Signed-off-by: David Brownell <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/usb/gadget/f_rndis.c | 1 -
1 file changed, 1 deletion(-)

--- a/drivers/usb/gadget/f_rndis.c
+++ b/drivers/usb/gadget/f_rndis.c
@@ -172,7 +172,6 @@ static struct usb_interface_descriptor r
.bDescriptorType = USB_DT_INTERFACE,

/* .bInterfaceNumber = DYNAMIC */
- .bAlternateSetting = 1,
.bNumEndpoints = 2,
.bInterfaceClass = USB_CLASS_CDC_DATA,
.bInterfaceSubClass = 0,

2008-12-03 19:51:32

by Greg KH

Subject: [patch 004/104] USB: fix SB700 usb subsystem hang bug

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Andiry Xu <[email protected]>

commit b09bc6cbae4dd3a2d35722668ef2c502a7b8b093 upstream.

This patch is required for AMD SB700 south bridge revisions A12 and A13 to
avoid a USB subsystem hang. The hang is observed when the system has multiple
USB devices connected to it. In some cases a USB hub may be required to
observe it.

This patch works around the problem by correcting an internal register setting,
which changes the behavior of the internal logic so that the hang no longer
occurs. The change in the behavior of the logic does not impact the normal
operation of the USB subsystem.

Reported-by: Volker Armin Hemmann <[email protected]>
Tested-by: Volker Armin Hemmann <[email protected]>
Signed-off-by: Andiry Xu <[email protected]>
Signed-off-by: Libin Yang <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/usb/host/ehci-pci.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

--- a/drivers/usb/host/ehci-pci.c
+++ b/drivers/usb/host/ehci-pci.c
@@ -66,6 +66,8 @@ static int ehci_pci_setup(struct usb_hcd
{
struct ehci_hcd *ehci = hcd_to_ehci(hcd);
struct pci_dev *pdev = to_pci_dev(hcd->self.controller);
+ struct pci_dev *p_smbus;
+ u8 rev;
u32 temp;
int retval;

@@ -166,6 +168,25 @@ static int ehci_pci_setup(struct usb_hcd
pci_write_config_byte(pdev, 0x4b, tmp | 0x20);
}
break;
+ case PCI_VENDOR_ID_ATI:
+ /* SB700 old version has a bug in EHCI controller,
+ * which causes usb devices lose response in some cases.
+ */
+ if (pdev->device == 0x4396) {
+ p_smbus = pci_get_device(PCI_VENDOR_ID_ATI,
+ PCI_DEVICE_ID_ATI_SBX00_SMBUS,
+ NULL);
+ if (!p_smbus)
+ break;
+ rev = p_smbus->revision;
+ if ((rev == 0x3a) || (rev == 0x3b)) {
+ u8 tmp;
+ pci_read_config_byte(pdev, 0x53, &tmp);
+ pci_write_config_byte(pdev, 0x53, tmp | (1<<3));
+ }
+ pci_dev_put(p_smbus);
+ }
+ break;
}

ehci_reset(ehci);

2008-12-03 19:52:16

by Greg KH

Subject: [patch 006/104] atl1e: fix broken multicast by removing unnecessary crc inversion

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: J. K. Cliburn <[email protected]>

commit 7ee0fddfe05f105d3346aa8774695e7130697836 upstream.

Inverting the crc after calling ether_crc_le() is unnecessary and breaks
multicast. Remove it.

Tested-by: David Madore <[email protected]>
Signed-off-by: Jay Cliburn <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/atl1e/atl1e_hw.c | 4 ----
1 file changed, 4 deletions(-)

--- a/drivers/net/atl1e/atl1e_hw.c
+++ b/drivers/net/atl1e/atl1e_hw.c
@@ -163,9 +163,6 @@ int atl1e_read_mac_addr(struct atl1e_hw
* atl1e_hash_mc_addr
* purpose
* set hash value for a multicast address
- * hash calcu processing :
- * 1. calcu 32bit CRC for multicast address
- * 2. reverse crc with MSB to LSB
*/
u32 atl1e_hash_mc_addr(struct atl1e_hw *hw, u8 *mc_addr)
{
@@ -174,7 +171,6 @@ u32 atl1e_hash_mc_addr(struct atl1e_hw *
int i;

crc32 = ether_crc_le(6, mc_addr);
- crc32 = ~crc32;
for (i = 0; i < 32; i++)
value |= (((crc32 >> i) & 1) << (31 - i));
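
Reviewer note: a minimal userspace sketch of what the removed inversion does to the hash value. It reuses the bit-reversal loop from the hunk above and takes the CRC as a plain parameter instead of calling ether_crc_le(). The function names and the example value are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit-reversal loop as in atl1e_hash_mc_addr(). */
static uint32_t reverse_bits32(uint32_t crc32)
{
	uint32_t value = 0;
	int i;

	for (i = 0; i < 32; i++)
		value |= ((crc32 >> i) & 1u) << (31 - i);
	return value;
}

/* Hash as computed before the patch: invert, then reverse. */
static uint32_t hash_with_inversion(uint32_t crc32)
{
	return reverse_bits32(~crc32);
}

/* Hash as computed after the patch: reverse only. */
static uint32_t hash_without_inversion(uint32_t crc32)
{
	return reverse_bits32(crc32);
}

void atl1e_hash_demo(void)
{
	uint32_t crc = 0x12345678u; /* arbitrary example value, not a real CRC */

	assert(reverse_bits32(1u) == 0x80000000u);
	/* Inverting first flips every bit of the resulting hash value. */
	assert(hash_with_inversion(crc) == (uint32_t)~hash_without_inversion(crc));
}
```

Because bit reversal only permutes bits, inverting the input inverts every bit of the output, so any hash-table position derived from the old value would differ from the one derived from the plain CRC.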

2008-12-03 19:52:41

by Greg KH

Subject: [patch 007/104] cpuset: fix regression when failed to generate sched domains

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Li Zefan <[email protected]>

commit 700018e0a77b4113172257fcdaa1c58e27a5074f upstream.

Impact: properly rebuild sched-domains on kmalloc() failure

When cpuset fails to generate sched domains due to a kmalloc()
failure, the scheduler should fall back to the single partition
'fallback_doms' and rebuild sched domains, but now it only
destroys the sched domains and does not rebuild them.

The regression was introduced by:

| commit dfb512ec4834116124da61d6c1ee10fd0aa32bd6
| Author: Max Krasnyansky <[email protected]>
| Date: Fri Aug 29 13:11:41 2008 -0700
|
| sched: arch_reinit_sched_domains() must destroy domains to force rebuild

After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.

Signed-off-by: Li Zefan <[email protected]>
Cc: Max Krasnyansky <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
kernel/cpuset.c | 12 ++++++++----
kernel/sched.c | 13 +++++++------
2 files changed, 15 insertions(+), 10 deletions(-)

--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -587,7 +587,6 @@ static int generate_sched_domains(cpumas
int ndoms; /* number of sched domains in result */
int nslot; /* next empty doms[] cpumask_t slot */

- ndoms = 0;
doms = NULL;
dattr = NULL;
csa = NULL;
@@ -674,10 +673,8 @@ restart:
* Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
*/
doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
- if (!doms) {
- ndoms = 0;
+ if (!doms)
goto done;
- }

/*
* The rest of the code, including the scheduler, can deal with
@@ -732,6 +729,13 @@ restart:
done:
kfree(csa);

+ /*
+ * Fallback to the default domain if kmalloc() failed.
+ * See comments in partition_sched_domains().
+ */
+ if (doms == NULL)
+ ndoms = 1;
+
*domains = doms;
*attributes = dattr;
return ndoms;
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7692,13 +7692,14 @@ static int dattrs_equal(struct sched_dom
*
* The passed in 'doms_new' should be kmalloc'd. This routine takes
* ownership of it and will kfree it when done with it. If the caller
- * failed the kmalloc call, then it can pass in doms_new == NULL,
- * and partition_sched_domains() will fallback to the single partition
- * 'fallback_doms', it also forces the domains to be rebuilt.
+ * failed the kmalloc call, then it can pass in doms_new == NULL &&
+ * ndoms_new == 1, and partition_sched_domains() will fallback to
+ * the single partition 'fallback_doms', it also forces the domains
+ * to be rebuilt.
*
- * If doms_new==NULL it will be replaced with cpu_online_map.
- * ndoms_new==0 is a special case for destroying existing domains.
- * It will not create the default domain.
+ * If doms_new == NULL it will be replaced with cpu_online_map.
+ * ndoms_new == 0 is a special case for destroying existing domains,
+ * and it will not create the default domain.
*
* Call with hotplug lock held
*/

2008-12-03 19:51:15

by Greg KH

Subject: [patch 003/104] USB: usbmon: fix read(2)

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Pete Zaitcev <[email protected]>

commit f1c0a2a3aff53698f4855968d576464041d49b39 upstream.

There's a bug in the usbmon binary reader: when read() is used to fetch
the packets and a packet's data is only partially read, the next read call
will once again return up to len_cap bytes of data. The b_read counter
is not taken into account when determining the remaining chunk size.

So, when dumping USB data with "cat /dev/usbmon0 > usbmon.trace" while
reading from a USB storage device and analyzing the dump file
afterwards it will get out of sync after a couple of packets.

Signed-off-by: Ingo van Lil <[email protected]>
Signed-off-by: Pete Zaitcev <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/usb/mon/mon_bin.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/usb/mon/mon_bin.c
+++ b/drivers/usb/mon/mon_bin.c
@@ -687,7 +687,10 @@ static ssize_t mon_bin_read(struct file
}

if (rp->b_read >= sizeof(struct mon_bin_hdr)) {
- step_len = min(nbytes, (size_t)ep->len_cap);
+ step_len = ep->len_cap;
+ step_len -= rp->b_read - sizeof(struct mon_bin_hdr);
+ if (step_len > nbytes)
+ step_len = nbytes;
offset = rp->b_out + PKT_SIZE;
offset += rp->b_read - sizeof(struct mon_bin_hdr);
if (offset >= rp->b_size)
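
Reviewer note: a minimal userspace sketch of the chunk-size arithmetic changed by this hunk. HDR_SIZE is a hypothetical constant standing in for sizeof(struct mon_bin_hdr), and the two helper functions are illustrative names, not part of mon_bin.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value standing in for sizeof(struct mon_bin_hdr). */
#define HDR_SIZE ((size_t)48)

/* Old logic: ignores how much of the packet data was already read. */
static size_t old_step_len(size_t nbytes, size_t len_cap, size_t b_read)
{
	(void)b_read;
	return nbytes < len_cap ? nbytes : len_cap;
}

/* Patched logic: only the unread remainder of the data may be returned. */
static size_t new_step_len(size_t nbytes, size_t len_cap, size_t b_read)
{
	size_t step_len;

	assert(b_read >= HDR_SIZE);           /* header fully consumed */
	assert(b_read - HDR_SIZE <= len_cap); /* cannot have read past the data */

	step_len = len_cap - (b_read - HDR_SIZE);
	if (step_len > nbytes)
		step_len = nbytes;
	return step_len;
}
```

For example, with len_cap = 100, 60 data bytes already consumed, and a 64-byte read request, the old computation hands out 64 bytes even though only 40 remain; the patched computation returns 40.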

2008-12-03 19:53:09

by Greg KH

Subject: [patch 008/104] cgroups: fix a serious bug in cgroupstats

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Li Zefan <[email protected]>

commit 33d283bef23132c48195eafc21449f8ba88fce6b upstream.

Try this, and you'll get oops immediately:
# cd Documentation/accounting/
# gcc -o getdelays getdelays.c
# mount -t cgroup -o debug xxx /mnt
# ./getdelays -C /mnt/tasks

Because a normal file's dentry->d_fsdata is a pointer to struct cftype,
not struct cgroup.

After the patch, it returns EINVAL if we try to get cgroupstats
from a normal file.

Cc: Balbir Singh <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
Acked-by: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
kernel/cgroup.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2045,10 +2045,13 @@ int cgroupstats_build(struct cgroupstats
struct cgroup *cgrp;
struct cgroup_iter it;
struct task_struct *tsk;
+
/*
- * Validate dentry by checking the superblock operations
+ * Validate dentry by checking the superblock operations,
+ * and make sure it's a directory.
*/
- if (dentry->d_sb->s_op != &cgroup_ops)
+ if (dentry->d_sb->s_op != &cgroup_ops ||
+ !S_ISDIR(dentry->d_inode->i_mode))
goto err;

ret = 0;

2008-12-03 19:51:49

by Greg KH

Subject: [patch 005/104] USB: fix SB600 USB subsystem hang bug

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Shane Huang <[email protected]>

commit 0a99e8ac430a27825bd055719765fd0d65cd797f upstream.

This patch is required for all AMD SB600 revisions to avoid a USB subsystem
hang. The hang is observed when the system has multiple USB devices connected
to it. In some cases a USB hub may be required to observe it.

Reported in bugzilla as #11599. The similar patch for old SB700 revisions is:
commit b09bc6cbae4dd3a2d35722668ef2c502a7b8b093

Reported-by: raffaele <[email protected]>
Tested-by: Roman Mamedov <[email protected]>
Signed-off-by: Shane Huang <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/usb/host/ehci-pci.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

--- a/drivers/usb/host/ehci-pci.c
+++ b/drivers/usb/host/ehci-pci.c
@@ -169,18 +169,21 @@ static int ehci_pci_setup(struct usb_hcd
}
break;
case PCI_VENDOR_ID_ATI:
- /* SB700 old version has a bug in EHCI controller,
+ /* SB600 and old version of SB700 have a bug in EHCI controller,
* which causes usb devices lose response in some cases.
*/
- if (pdev->device == 0x4396) {
+ if ((pdev->device == 0x4386) || (pdev->device == 0x4396)) {
p_smbus = pci_get_device(PCI_VENDOR_ID_ATI,
PCI_DEVICE_ID_ATI_SBX00_SMBUS,
NULL);
if (!p_smbus)
break;
rev = p_smbus->revision;
- if ((rev == 0x3a) || (rev == 0x3b)) {
+ if ((pdev->device == 0x4386) || (rev == 0x3a)
+ || (rev == 0x3b)) {
u8 tmp;
+ ehci_info(ehci, "applying AMD SB600/SB700 USB "
+ "freeze workaround\n");
pci_read_config_byte(pdev, 0x53, &tmp);
pci_write_config_byte(pdev, 0x53, tmp | (1<<3));
}

2008-12-03 19:54:43

by Greg KH

Subject: [patch 011/104] fbdev: clean the penguins dirty feet

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Clemens Ladisch <[email protected]>

commit cf7ee554f3a324e98181b0ea249d9d5be3a0acb8 upstream.

When booting in a direct color mode, the penguin has dirty feet, i.e.,
some pixels have the wrong color. This is caused by
fb_set_logo_directpalette() which does not initialize the last 32 palette
entries.

Signed-off-by: Clemens Ladisch <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Cc: Krzysztof Helt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/video/fbmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/video/fbmem.c
+++ b/drivers/video/fbmem.c
@@ -232,7 +232,7 @@ static void fb_set_logo_directpalette(st
greenshift = info->var.green.offset;
blueshift = info->var.blue.offset;

- for (i = 32; i < logo->clutsize; i++)
+ for (i = 32; i < 32 + logo->clutsize; i++)
palette[i] = i << redshift | i << greenshift | i << blueshift;
}
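
Reviewer note: a minimal userspace sketch of the loop-bound change. It assumes a 256-entry palette, a 224-color logo, and fixed example shifts; fill_palette() is a hypothetical helper that mirrors the loop body above and is not part of fbmem.c.

```c
#include <assert.h>
#include <stdint.h>

#define PALETTE_ENTRIES 256 /* assumed palette size for this sketch */

/* Fills palette[32..end-1] the way the loop above does, using fixed
 * example shifts, and returns the number of entries written. */
static unsigned int fill_palette(uint32_t *palette, unsigned int end)
{
	const unsigned int redshift = 16, greenshift = 8, blueshift = 0;
	unsigned int i, written = 0;

	assert(end <= PALETTE_ENTRIES);
	for (i = 32; i < end; i++) {
		palette[i] = i << redshift | i << greenshift | i << blueshift;
		written++;
	}
	return written;
}
```

With clutsize = 224, passing clutsize as the bound (old code) writes 192 entries and leaves slots 224..255 untouched, while passing 32 + clutsize (patched code) writes all 224 entries the logo uses.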

2008-12-03 19:54:10

by Greg KH

Subject: [patch 010/104] pxa2xx_spi: bugfix full duplex dma data corruption

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Ned Forrester <[email protected]>

commit 393df744e056ba24e9531d0657d09fc3c7c0dd22 upstream.

Fixes a data corruption bug in pxa2xx_spi.c when operating in full duplex
mode with DMA and using buffers that overlap.

SPI transmit and receive buffers are allowed to be the same or to overlap.
However, this driver fails if such overlap is attempted in DMA mode
because it maps the rx and tx buffers in the wrong order. By mapping
DMA_FROM_DEVICE (read) before DMA_TO_DEVICE (write), it invalidates the
cache before flushing it, thus discarding data which should have been
transmitted.

The patch corrects the order of mapping. This bug exists in all versions
of pxa2xx_spi.c; similar bugs are in the drivers for two other SPI
controllers (au1500, imx).

A version of this patch has been tested on kernel 2.6.20 using
verification of loopback data with: random transfer length, random
bits-per-word, random positive offsets (both larger and smaller than
transfer length) between the start of the rx and tx buffers, and varying
clock rates.

Signed-off-by: Ned Forrester <[email protected]>
Cc: Vernon Sauder <[email protected]>
Cc: J. Scott Merritt <[email protected]>
Signed-off-by: David Brownell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/spi/pxa2xx_spi.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

--- a/drivers/spi/pxa2xx_spi.c
+++ b/drivers/spi/pxa2xx_spi.c
@@ -348,21 +348,21 @@ static int map_dma_buffers(struct driver
} else
drv_data->tx_map_len = drv_data->len;

- /* Stream map the rx buffer */
- drv_data->rx_dma = dma_map_single(dev, drv_data->rx,
- drv_data->rx_map_len,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(dev, drv_data->rx_dma))
- return 0;
-
- /* Stream map the tx buffer */
+ /* Stream map the tx buffer. Always do DMA_TO_DEVICE first
+ * so we flush the cache *before* invalidating it, in case
+ * the tx and rx buffers overlap.
+ */
drv_data->tx_dma = dma_map_single(dev, drv_data->tx,
- drv_data->tx_map_len,
- DMA_TO_DEVICE);
+ drv_data->tx_map_len, DMA_TO_DEVICE);
+ if (dma_mapping_error(dev, drv_data->tx_dma))
+ return 0;

- if (dma_mapping_error(dev, drv_data->tx_dma)) {
- dma_unmap_single(dev, drv_data->rx_dma,
+ /* Stream map the rx buffer */
+ drv_data->rx_dma = dma_map_single(dev, drv_data->rx,
drv_data->rx_map_len, DMA_FROM_DEVICE);
+ if (dma_mapping_error(dev, drv_data->rx_dma)) {
+ dma_unmap_single(dev, drv_data->tx_dma,
+ drv_data->tx_map_len, DMA_TO_DEVICE);
return 0;
}

2008-12-03 19:53:46

by Greg KH

Subject: [patch 009/104] eCryptfs: Allocate up to two scatterlists for crypto ops on keys

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Michael Halcrow <[email protected]>

commit ac97b9f9a2d0b83488e0bbcb8517b229d5c9b142 upstream.

I have received some reports of out-of-memory errors on some older AMD
architectures. These errors are what I would expect to see if
crypt_stat->key were split between two separate pages. eCryptfs should
not assume that any of the memory sent through virt_to_scatterlist() is
all contained in a single page, and so this patch allocates two
scatterlist structs instead of one when processing keys. I have received
confirmation from one person affected by this bug that this patch resolves
the issue for him, and so I am submitting it for inclusion in a future
stable release.

Note that virt_to_scatterlist() runs sg_init_table() on the scatterlist
structs passed to it, so the calls to sg_init_table() in
decrypt_passphrase_encrypted_session_key() are redundant.

Signed-off-by: Michael Halcrow <[email protected]>
Reported-by: Paulo J. S. Silva <[email protected]>
Cc: "Leon Woestenberg" <[email protected]>
Cc: Tim Gardner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ecryptfs/keystore.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)

--- a/fs/ecryptfs/keystore.c
+++ b/fs/ecryptfs/keystore.c
@@ -1037,17 +1037,14 @@ static int
decrypt_passphrase_encrypted_session_key(struct ecryptfs_auth_tok *auth_tok,
struct ecryptfs_crypt_stat *crypt_stat)
{
- struct scatterlist dst_sg;
- struct scatterlist src_sg;
+ struct scatterlist dst_sg[2];
+ struct scatterlist src_sg[2];
struct mutex *tfm_mutex;
struct blkcipher_desc desc = {
.flags = CRYPTO_TFM_REQ_MAY_SLEEP
};
int rc = 0;

- sg_init_table(&dst_sg, 1);
- sg_init_table(&src_sg, 1);
-
if (unlikely(ecryptfs_verbosity > 0)) {
ecryptfs_printk(
KERN_DEBUG, "Session key encryption key (size [%d]):\n",
@@ -1066,8 +1063,8 @@ decrypt_passphrase_encrypted_session_key
}
rc = virt_to_scatterlist(auth_tok->session_key.encrypted_key,
auth_tok->session_key.encrypted_key_size,
- &src_sg, 1);
- if (rc != 1) {
+ src_sg, 2);
+ if (rc < 1 || rc > 2) {
printk(KERN_ERR "Internal error whilst attempting to convert "
"auth_tok->session_key.encrypted_key to scatterlist; "
"expected rc = 1; got rc = [%d]. "
@@ -1079,8 +1076,8 @@ decrypt_passphrase_encrypted_session_key
auth_tok->session_key.encrypted_key_size;
rc = virt_to_scatterlist(auth_tok->session_key.decrypted_key,
auth_tok->session_key.decrypted_key_size,
- &dst_sg, 1);
- if (rc != 1) {
+ dst_sg, 2);
+ if (rc < 1 || rc > 2) {
printk(KERN_ERR "Internal error whilst attempting to convert "
"auth_tok->session_key.decrypted_key to scatterlist; "
"expected rc = 1; got rc = [%d]\n", rc);
@@ -1096,7 +1093,7 @@ decrypt_passphrase_encrypted_session_key
rc = -EINVAL;
goto out;
}
- rc = crypto_blkcipher_decrypt(&desc, &dst_sg, &src_sg,
+ rc = crypto_blkcipher_decrypt(&desc, dst_sg, src_sg,
auth_tok->session_key.encrypted_key_size);
mutex_unlock(tfm_mutex);
if (unlikely(rc)) {
@@ -1541,8 +1538,8 @@ write_tag_3_packet(char *dest, size_t *r
size_t i;
size_t encrypted_session_key_valid = 0;
char session_key_encryption_key[ECRYPTFS_MAX_KEY_BYTES];
- struct scatterlist dst_sg;
- struct scatterlist src_sg;
+ struct scatterlist dst_sg[2];
+ struct scatterlist src_sg[2];
struct mutex *tfm_mutex = NULL;
u8 cipher_code;
size_t packet_size_length;
@@ -1621,8 +1618,8 @@ write_tag_3_packet(char *dest, size_t *r
ecryptfs_dump_hex(session_key_encryption_key, 16);
}
rc = virt_to_scatterlist(crypt_stat->key, key_rec->enc_key_size,
- &src_sg, 1);
- if (rc != 1) {
+ src_sg, 2);
+ if (rc < 1 || rc > 2) {
ecryptfs_printk(KERN_ERR, "Error generating scatterlist "
"for crypt_stat session key; expected rc = 1; "
"got rc = [%d]. key_rec->enc_key_size = [%d]\n",
@@ -1631,8 +1628,8 @@ write_tag_3_packet(char *dest, size_t *r
goto out;
}
rc = virt_to_scatterlist(key_rec->enc_key, key_rec->enc_key_size,
- &dst_sg, 1);
- if (rc != 1) {
+ dst_sg, 2);
+ if (rc < 1 || rc > 2) {
ecryptfs_printk(KERN_ERR, "Error generating scatterlist "
"for crypt_stat encrypted session key; "
"expected rc = 1; got rc = [%d]. "
@@ -1653,7 +1650,7 @@ write_tag_3_packet(char *dest, size_t *r
rc = 0;
ecryptfs_printk(KERN_DEBUG, "Encrypting [%d] bytes of the key\n",
crypt_stat->key_size);
- rc = crypto_blkcipher_encrypt(&desc, &dst_sg, &src_sg,
+ rc = crypto_blkcipher_encrypt(&desc, dst_sg, src_sg,
(*key_rec).enc_key_size);
mutex_unlock(tfm_mutex);
if (rc) {
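
Reviewer note: a minimal userspace sketch of why a small key buffer may need two scatterlist entries. It assumes a 4096-byte page and one entry per page touched; pages_spanned() is a hypothetical helper and does not reproduce virt_to_scatterlist().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096) /* assumed page size */

/* Number of pages (and hence scatterlist entries, under the assumption
 * above) touched by a buffer of `size` bytes starting at `addr`. */
static unsigned int pages_spanned(uintptr_t addr, size_t size)
{
	uintptr_t first, last;

	assert(size > 0);
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + size - 1) / SKETCH_PAGE_SIZE;
	return (unsigned int)(last - first + 1);
}
```

A 64-byte buffer that starts on a page boundary touches one page, but the same buffer starting 16 bytes before a boundary touches two, which matches the "rc < 1 || rc > 2" checks introduced by the patch.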

2008-12-03 19:55:05

by Greg KH

Subject: [patch 012/104] gpiolib: extend gpio label column width in debugfs file

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jarkko Nikula <[email protected]>

commit 6e8ba729b6332f2a75572e02480936d2b51665aa upstream.

Various drivers already have labels longer than 12 bytes. Most
of them fit well within 20 bytes, but make the column width exact so that
oversized labels don't mess up output alignment.

Signed-off-by: Jarkko Nikula <[email protected]>
Acked-by: David Brownell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/gpio/gpiolib.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1049,7 +1049,7 @@ static void gpiolib_dbg_show(struct seq_
continue;

is_out = test_bit(FLAG_IS_OUT, &gdesc->flags);
- seq_printf(s, " gpio-%-3d (%-12s) %s %s",
+ seq_printf(s, " gpio-%-3d (%-20.20s) %s %s",
gpio, gdesc->label,
is_out ? "out" : "in ",
chip->get
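
Reviewer note: a minimal userspace sketch of the two format strings, using snprintf() in place of seq_printf() and a fixed "out" direction. format_old() and format_new() are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Old format: minimum width 12 and no maximum, so long labels push the
 * following columns to the right. */
static int format_old(char *buf, size_t len, int gpio, const char *label)
{
	return snprintf(buf, len, " gpio-%-3d (%-12s) %s", gpio, label, "out");
}

/* New format: pad to 20 and truncate at 20, so the column width is exact. */
static int format_new(char *buf, size_t len, int gpio, const char *label)
{
	return snprintf(buf, len, " gpio-%-3d (%-20.20s) %s", gpio, label, "out");
}
```

With the new format, a short label and an oversized one produce lines of the same length; with the old format the oversized label makes the line longer and shifts everything after it.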

2008-12-03 19:56:22

by Greg KH

Subject: [patch 015/104] parisc: fix kernel crash when unwinding a userspace process

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Helge Deller <[email protected]>

commit 7a3f5134a8f5bd7fa38b5645eef05e8a4eb62951 upstream.

Any user on existing parisc 32- and 64-bit kernels can easily crash
the kernel and thereby cause a DoS.
A simple testcase is available here:
http://gsyprf10.external.hp.com/~deller/crash.tgz

The problem arises because the handle_interruption() crash handler
calls the show_regs() function, which in turn tries to unwind the
stack by calling parisc_show_stack(). Since the stack contains
userspace addresses, trying to unwind it is dangerous and useless,
and leads to the crash.

The fix is trivial: For userspace processes
a) avoid unwinding the stack, and
b) avoid resolving userspace addresses to kernel symbol names.

While touching this code, I converted print_symbol() to %pS
printk formats and made parisc_show_stack() static.

An initial patch for this was written by Kyle McMartin back in August:
http://marc.info/?l=linux-parisc&m=121805168830283&w=2

Compile and run-tested with a 64bit parisc kernel.

Signed-off-by: Helge Deller <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Kyle McMartin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/parisc/kernel/traps.c | 41 ++++++++++++++++++++---------------------
1 file changed, 20 insertions(+), 21 deletions(-)

--- a/arch/parisc/kernel/traps.c
+++ b/arch/parisc/kernel/traps.c
@@ -24,7 +24,6 @@
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/console.h>
-#include <linux/kallsyms.h>
#include <linux/bug.h>

#include <asm/assembly.h>
@@ -51,7 +50,7 @@
DEFINE_SPINLOCK(pa_dbit_lock);
#endif

-void parisc_show_stack(struct task_struct *t, unsigned long *sp,
+static void parisc_show_stack(struct task_struct *task, unsigned long *sp,
struct pt_regs *regs);

static int printbinary(char *buf, unsigned long x, int nbits)
@@ -121,18 +120,19 @@ static void print_fr(char *level, struct

void show_regs(struct pt_regs *regs)
{
- int i;
+ int i, user;
char *level;
unsigned long cr30, cr31;

- level = user_mode(regs) ? KERN_DEBUG : KERN_CRIT;
+ user = user_mode(regs);
+ level = user ? KERN_DEBUG : KERN_CRIT;

print_gr(level, regs);

for (i = 0; i < 8; i += 4)
PRINTREGS(level, regs->sr, "sr", RFMT, i);

- if (user_mode(regs))
+ if (user)
print_fr(level, regs);

cr30 = mfctl(30);
@@ -145,14 +145,18 @@ void show_regs(struct pt_regs *regs)
printk("%s CPU: %8d CR30: " RFMT " CR31: " RFMT "\n",
level, current_thread_info()->cpu, cr30, cr31);
printk("%s ORIG_R28: " RFMT "\n", level, regs->orig_r28);
- printk(level);
- print_symbol(" IAOQ[0]: %s\n", regs->iaoq[0]);
- printk(level);
- print_symbol(" IAOQ[1]: %s\n", regs->iaoq[1]);
- printk(level);
- print_symbol(" RP(r2): %s\n", regs->gr[2]);

- parisc_show_stack(current, NULL, regs);
+ if (user) {
+ printk("%s IAOQ[0]: " RFMT "\n", level, regs->iaoq[0]);
+ printk("%s IAOQ[1]: " RFMT "\n", level, regs->iaoq[1]);
+ printk("%s RP(r2): " RFMT "\n", level, regs->gr[2]);
+ } else {
+ printk("%s IAOQ[0]: %pS\n", level, (void *) regs->iaoq[0]);
+ printk("%s IAOQ[1]: %pS\n", level, (void *) regs->iaoq[1]);
+ printk("%s RP(r2): %pS\n", level, (void *) regs->gr[2]);
+
+ parisc_show_stack(current, NULL, regs);
+ }
}


@@ -173,20 +177,15 @@ static void do_show_stack(struct unwind_
break;

if (__kernel_text_address(info->ip)) {
- printk("%s [<" RFMT ">] ", (i&0x3)==1 ? KERN_CRIT : "", info->ip);
-#ifdef CONFIG_KALLSYMS
- print_symbol("%s\n", info->ip);
-#else
- if ((i & 0x03) == 0)
- printk("\n");
-#endif
+ printk(KERN_CRIT " [<" RFMT ">] %pS\n",
+ info->ip, (void *) info->ip);
i++;
}
}
- printk("\n");
+ printk(KERN_CRIT "\n");
}

-void parisc_show_stack(struct task_struct *task, unsigned long *sp,
+static void parisc_show_stack(struct task_struct *task, unsigned long *sp,
struct pt_regs *regs)
{
struct unwind_frame_info info;

2008-12-03 19:56:43

by Greg KH

Subject: [patch 016/104] epoll: introduce resource usage limits

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Davide Libenzi <[email protected]>

commit 7ef9964e6d1b911b78709f144000aacadd0ebc21 upstream.

It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that lets a normal user request a pretty large
amount of kernel memory, well within its maximum number of fds. To
solve this problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/, and inside it there are two configuration
points:

max_user_instances = Maximum number of devices - per user

max_user_watches = Maximum number of "watched" fds - per user

The current default for "max_user_watches" limits the memory used by epoll
to store "watches" to 1/32 of the amount of low RAM. As an example, a
256MB 32bit machine will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, which should be
enough too.
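
The arithmetic behind that default can be sketched in userspace. This is
only an illustration, not the kernel code: the helper name is hypothetical,
it works in bytes rather than pages, and the per-watch costs (roughly 90
bytes on 32bit, roughly 160 bytes on 64bit) are the approximate figures
quoted in the proc.txt text added by this patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): 1/32 of low memory,
 * divided by the approximate cost of one watch in bytes.
 */
static unsigned long default_max_user_watches(unsigned long lowmem_bytes,
					      unsigned long watch_cost)
{
	assert(watch_cost != 0);	/* avoid dividing by zero */
	return (lowmem_bytes / 32) / watch_cost;
}
```

With 256MB of low memory and a 90-byte watch this gives 93206, which is
in line with the "roughly 90000" figure above.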

This also affects userspace, because a new error code (-ENOSPC) can now
come out of EPOLL_CTL_ADD. The EMFILE from epoll_create() was already
listed, so that should be ok.
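
From the application side, the new error might be handled along these
lines. This is a minimal sketch: watch_fd() is a hypothetical wrapper, not
part of any library, and it only uses the standard epoll_ctl() call.

```c
#include <errno.h>
#include <sys/epoll.h>

/*
 * Hypothetical wrapper: register fd for read events on epfd.
 * Returns 0 on success or a negative errno value on failure.
 */
static int watch_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
		return 0;

	/*
	 * With this patch applied, -ENOSPC here means the per-user
	 * max_user_watches limit has been reached.
	 */
	return -errno;
}
```

A caller that sees -ENOSPC could close idle watches or report that
/proc/sys/fs/epoll/max_user_watches needs to be raised.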

[[email protected]: use get_current_user()]
Signed-off-by: Davide Libenzi <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Reported-by: Vegard Nossum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
Documentation/filesystems/proc.txt | 27 +++++++++++
fs/eventpoll.c | 85 +++++++++++++++++++++++++++++++++----
include/linux/sched.h | 4 +
kernel/sysctl.c | 10 ++++
4 files changed, 118 insertions(+), 8 deletions(-)

--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -44,6 +44,7 @@ Table of Contents
2.14 /proc/<pid>/io - Display the IO accounting fields
2.15 /proc/<pid>/coredump_filter - Core dump filtering settings
2.16 /proc/<pid>/mountinfo - Information about mounts
+ 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface

------------------------------------------------------------------------------
Preface
@@ -2471,4 +2472,30 @@ For more information on mount propagatio

Documentation/filesystems/sharedsubtree.txt

+2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
+--------------------------------------------------------
+
+This directory contains configuration options for the epoll(7) interface.
+
+max_user_instances
+------------------
+
+This is the maximum number of epoll file descriptors that a single user can
+have open at a given time. The default value is 128, and should be enough
+for normal users.
+
+max_user_watches
+----------------
+
+Every epoll file descriptor can store a number of files to be monitored
+for event readiness. Each one of these monitored files constitutes a "watch".
+This configuration option sets the maximum number of "watches" that are
+allowed for each user.
+Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
+on a 64bit one.
+The current default value for max_user_watches is the 1/32 of the available
+low memory, divided for the "watch" cost in bytes.
+
+
------------------------------------------------------------------------------
+
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -102,6 +102,8 @@

#define EP_UNACTIVE_PTR ((void *) -1L)

+#define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry))
+
struct epoll_filefd {
struct file *file;
int fd;
@@ -200,6 +202,9 @@ struct eventpoll {
* holding ->lock.
*/
struct epitem *ovflist;
+
+ /* The user that created the eventpoll descriptor */
+ struct user_struct *user;
};

/* Wait structure used by the poll hooks */
@@ -227,9 +232,17 @@ struct ep_pqueue {
};

/*
+ * Configuration options available inside /proc/sys/fs/epoll/
+ */
+/* Maximum number of epoll devices, per user */
+static int max_user_instances __read_mostly;
+/* Maximum number of epoll watched descriptors, per user */
+static int max_user_watches __read_mostly;
+
+/*
* This mutex is used to serialize ep_free() and eventpoll_release_file().
*/
-static struct mutex epmutex;
+static DEFINE_MUTEX(epmutex);

/* Safe wake up implementation */
static struct poll_safewake psw;
@@ -240,6 +253,33 @@ static struct kmem_cache *epi_cache __re
/* Slab cache used to allocate "struct eppoll_entry" */
static struct kmem_cache *pwq_cache __read_mostly;

+#ifdef CONFIG_SYSCTL
+
+#include <linux/sysctl.h>
+
+static int zero;
+
+ctl_table epoll_table[] = {
+ {
+ .procname = "max_user_instances",
+ .data = &max_user_instances,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
+ .procname = "max_user_watches",
+ .data = &max_user_watches,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ { .ctl_name = 0 }
+};
+#endif /* CONFIG_SYSCTL */
+

/* Setup the structure that is used as key for the RB tree */
static inline void ep_set_ffd(struct epoll_filefd *ffd,
@@ -402,6 +442,8 @@ static int ep_remove(struct eventpoll *e
/* At this point it is safe to free the eventpoll item */
kmem_cache_free(epi_cache, epi);

+ atomic_dec(&ep->user->epoll_watches);
+
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_remove(%p, %p)\n",
current, ep, file));

@@ -449,6 +491,8 @@ static void ep_free(struct eventpoll *ep

mutex_unlock(&epmutex);
mutex_destroy(&ep->mtx);
+ atomic_dec(&ep->user->epoll_devs);
+ free_uid(ep->user);
kfree(ep);
}

@@ -532,10 +576,19 @@ void eventpoll_release_file(struct file

static int ep_alloc(struct eventpoll **pep)
{
- struct eventpoll *ep = kzalloc(sizeof(*ep), GFP_KERNEL);
+ int error;
+ struct user_struct *user;
+ struct eventpoll *ep;

- if (!ep)
- return -ENOMEM;
+ user = get_current_user();
+ error = -EMFILE;
+ if (unlikely(atomic_read(&user->epoll_devs) >=
+ max_user_instances))
+ goto free_uid;
+ error = -ENOMEM;
+ ep = kzalloc(sizeof(*ep), GFP_KERNEL);
+ if (unlikely(!ep))
+ goto free_uid;

spin_lock_init(&ep->lock);
mutex_init(&ep->mtx);
@@ -544,12 +597,17 @@ static int ep_alloc(struct eventpoll **p
INIT_LIST_HEAD(&ep->rdllist);
ep->rbr = RB_ROOT;
ep->ovflist = EP_UNACTIVE_PTR;
+ ep->user = user;

*pep = ep;

DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_alloc() ep=%p\n",
current, ep));
return 0;
+
+free_uid:
+ free_uid(user);
+ return error;
}

/*
@@ -703,9 +761,11 @@ static int ep_insert(struct eventpoll *e
struct epitem *epi;
struct ep_pqueue epq;

- error = -ENOMEM;
+ if (unlikely(atomic_read(&ep->user->epoll_watches) >=
+ max_user_watches))
+ return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
- goto error_return;
+ return -ENOMEM;

/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
@@ -735,6 +795,7 @@ static int ep_insert(struct eventpoll *e
* install process. Namely an allocation for a wait queue failed due
* high memory pressure.
*/
+ error = -ENOMEM;
if (epi->nwait < 0)
goto error_unregister;

@@ -765,6 +826,8 @@ static int ep_insert(struct eventpoll *e

spin_unlock_irqrestore(&ep->lock, flags);

+ atomic_inc(&ep->user->epoll_watches);
+
/* We have to call this outside the lock */
if (pwake)
ep_poll_safewake(&psw, &ep->poll_wait);
@@ -789,7 +852,7 @@ error_unregister:
spin_unlock_irqrestore(&ep->lock, flags);

kmem_cache_free(epi_cache, epi);
-error_return:
+
return error;
}

@@ -1074,6 +1137,7 @@ asmlinkage long sys_epoll_create1(int fl
flags & O_CLOEXEC);
if (fd < 0)
ep_free(ep);
+ atomic_inc(&ep->user->epoll_devs);

error_return:
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d) = %d\n",
@@ -1295,7 +1359,12 @@ asmlinkage long sys_epoll_pwait(int epfd

static int __init eventpoll_init(void)
{
- mutex_init(&epmutex);
+ struct sysinfo si;
+
+ si_meminfo(&si);
+ max_user_instances = 128;
+ max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
+ EP_ITEM_COST;

/* Initialize the structure used to perform safe poll wait head wake ups */
ep_poll_safewake_init(&psw);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -587,6 +587,10 @@ struct user_struct {
atomic_t inotify_watches; /* How many inotify watches does this user have? */
atomic_t inotify_devs; /* How many inotify devs does this user have opened? */
#endif
+#ifdef CONFIG_EPOLL
+ atomic_t epoll_devs; /* The number of epoll descriptors currently open */
+ atomic_t epoll_watches; /* The number of file descriptors currently watched */
+#endif
#ifdef CONFIG_POSIX_MQUEUE
/* protected by mq_lock */
unsigned long mq_bytes; /* How many bytes can be allocated to mqueue? */
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -179,6 +179,9 @@ extern struct ctl_table random_table[];
#ifdef CONFIG_INOTIFY_USER
extern struct ctl_table inotify_table[];
#endif
+#ifdef CONFIG_EPOLL
+extern struct ctl_table epoll_table[];
+#endif

#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
int sysctl_legacy_va_layout;
@@ -1313,6 +1316,13 @@ static struct ctl_table fs_table[] = {
.child = inotify_table,
},
#endif
+#ifdef CONFIG_EPOLL
+ {
+ .procname = "epoll",
+ .mode = 0555,
+ .child = epoll_table,
+ },
+#endif
#endif
{
.ctl_name = KERN_SETUID_DUMPABLE,

2008-12-03 19:57:05

by Greg KH

Subject: [patch 017/104] Fix inotify watch removal/umount races

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Al Viro <[email protected]>

commit 8f7b0ba1c853919b85b54774775f567f30006107 upstream.

Inotify watch removals suck violently.

To kick the watch out we need (in this order) inode->inotify_mutex and
ih->mutex. That's fine if we have a hold on the inode; however, for all
other cases we need to make damn sure we don't race with umount. We can
*NOT* just grab a reference to a watch - inotify_unmount_inodes() will
happily sail past it and we'll end up with a reference to an inode
potentially outliving its superblock.

Ideally we just want to grab an active reference to the superblock if we
can; that will make sure we won't go into inotify_unmount_inodes() until
we are done. Cleanup is just deactivate_super().

However, that leaves a messy case - what if we *are* racing with
umount() and active references to the superblock can't be acquired anymore?
We can bump ->s_count, grab ->s_umount, which will almost certainly wait
until the superblock is shut down and the watch in question is pining
for fjords. That's fine, but there is a problem - we might have hit the
window between ->s_active getting to 0 / ->s_count dropping below S_BIAS
(i.e. the moment when the superblock is past the point of no return and is
heading for shutdown) and the moment when deactivate_super() acquires
->s_umount.

We could just do drop_super(), yield() and retry, but that's rather
antisocial and this stuff is luser-triggerable. OTOH, having grabbed
->s_umount and having found that we'd got there first (i.e. that
->s_root is non-NULL) we know that we won't race with
inotify_unmount_inodes().

So we could grab a reference to watch and do the rest as above, just
with drop_super() instead of deactivate_super(), right? Wrong. We had
to drop ih->mutex before we could grab ->s_umount. So the watch
could've been gone already.

That still can be dealt with - we need to save watch->wd, do idr_find()
and compare its result with our pointer. If they match, we either have
the damn thing still alive or we'd lost not one but two races at once,
the watch had been killed and a new one got created with the same ->wd
at the same address. That couldn't have happened in inotify_destroy(),
but inotify_rm_wd() could run into that. Still, "new one got created"
is not a problem - we have every right to kill it or leave it alone,
whatever's more convenient.

So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
"grab it and kill it" check. If it's been our original watch, we are
fine; if it's a newcomer - never mind, just pretend that we'd won the
race and kill the fscker anyway; we are safe since we know that its
superblock won't be going away.

And yes, this is far beyond mere "not very pretty"; so's the entire
concept of inotify to start with.

Signed-off-by: Al Viro <[email protected]>
Acked-by: Greg KH <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/inotify.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/inotify.h | 11 +++
kernel/audit_tree.c | 91 +++++++++++++++++------------
kernel/auditfilter.c | 14 ++--
4 files changed, 218 insertions(+), 48 deletions(-)

--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -106,6 +106,20 @@ void get_inotify_watch(struct inotify_wa
}
EXPORT_SYMBOL_GPL(get_inotify_watch);

+int pin_inotify_watch(struct inotify_watch *watch)
+{
+ struct super_block *sb = watch->inode->i_sb;
+ spin_lock(&sb_lock);
+ if (sb->s_count >= S_BIAS) {
+ atomic_inc(&sb->s_active);
+ spin_unlock(&sb_lock);
+ atomic_inc(&watch->count);
+ return 1;
+ }
+ spin_unlock(&sb_lock);
+ return 0;
+}
+
/**
* put_inotify_watch - decrements the ref count on a given watch. cleans up
* watch references if the count reaches zero. inotify_watch is freed by
@@ -124,6 +138,13 @@ void put_inotify_watch(struct inotify_wa
}
EXPORT_SYMBOL_GPL(put_inotify_watch);

+void unpin_inotify_watch(struct inotify_watch *watch)
+{
+ struct super_block *sb = watch->inode->i_sb;
+ put_inotify_watch(watch);
+ deactivate_super(sb);
+}
+
/*
* inotify_handle_get_wd - returns the next WD for use by the given handle
*
@@ -479,6 +500,112 @@ void inotify_init_watch(struct inotify_w
}
EXPORT_SYMBOL_GPL(inotify_init_watch);

+/*
+ * Watch removals suck violently. To kick the watch out we need (in this
+ * order) inode->inotify_mutex and ih->mutex. That's fine if we have
+ * a hold on inode; however, for all other cases we need to make damn sure
+ * we don't race with umount. We can *NOT* just grab a reference to a
+ * watch - inotify_unmount_inodes() will happily sail past it and we'll end
+ * with reference to inode potentially outliving its superblock. Ideally
+ * we just want to grab an active reference to superblock if we can; that
+ * will make sure we won't go into inotify_umount_inodes() until we are
+ * done. Cleanup is just deactivate_super(). However, that leaves a messy
+ * case - what if we *are* racing with umount() and active references to
+ * superblock can't be acquired anymore? We can bump ->s_count, grab
+ * ->s_umount, which will almost certainly wait until the superblock is shut
+ * down and the watch in question is pining for fjords. That's fine, but
+ * there is a problem - we might have hit the window between ->s_active
+ * getting to 0 / ->s_count - below S_BIAS (i.e. the moment when superblock
+ * is past the point of no return and is heading for shutdown) and the
+ * moment when deactivate_super() acquires ->s_umount. We could just do
+ * drop_super() yield() and retry, but that's rather antisocial and this
+ * stuff is luser-triggerable. OTOH, having grabbed ->s_umount and having
+ * found that we'd got there first (i.e. that ->s_root is non-NULL) we know
+ * that we won't race with inotify_umount_inodes(). So we could grab a
+ * reference to watch and do the rest as above, just with drop_super() instead
+ * of deactivate_super(), right? Wrong. We had to drop ih->mutex before we
+ * could grab ->s_umount. So the watch could've been gone already.
+ *
+ * That still can be dealt with - we need to save watch->wd, do idr_find()
+ * and compare its result with our pointer. If they match, we either have
+ * the damn thing still alive or we'd lost not one but two races at once,
+ * the watch had been killed and a new one got created with the same ->wd
+ * at the same address. That couldn't have happened in inotify_destroy(),
+ * but inotify_rm_wd() could run into that. Still, "new one got created"
+ * is not a problem - we have every right to kill it or leave it alone,
+ * whatever's more convenient.
+ *
+ * So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
+ * "grab it and kill it" check. If it's been our original watch, we are
+ * fine, if it's a newcomer - nevermind, just pretend that we'd won the
+ * race and kill the fscker anyway; we are safe since we know that its
+ * superblock won't be going away.
+ *
+ * And yes, this is far beyond mere "not very pretty"; so's the entire
+ * concept of inotify to start with.
+ */
+
+/**
+ * pin_to_kill - pin the watch down for removal
+ * @ih: inotify handle
+ * @watch: watch to kill
+ *
+ * Called with ih->mutex held, drops it. Possible return values:
+ * 0 - nothing to do, it has died
+ * 1 - remove it, drop the reference and deactivate_super()
+ * 2 - remove it, drop the reference and drop_super(); we tried hard to avoid
+ * that variant, since it involved a lot of PITA, but that's the best that
+ * could've been done.
+ */
+static int pin_to_kill(struct inotify_handle *ih, struct inotify_watch *watch)
+{
+ struct super_block *sb = watch->inode->i_sb;
+ s32 wd = watch->wd;
+
+ spin_lock(&sb_lock);
+ if (sb->s_count >= S_BIAS) {
+ atomic_inc(&sb->s_active);
+ spin_unlock(&sb_lock);
+ get_inotify_watch(watch);
+ mutex_unlock(&ih->mutex);
+ return 1; /* the best outcome */
+ }
+ sb->s_count++;
+ spin_unlock(&sb_lock);
+ mutex_unlock(&ih->mutex); /* can't grab ->s_umount under it */
+ down_read(&sb->s_umount);
+ if (likely(!sb->s_root)) {
+ /* fs is already shut down; the watch is dead */
+ drop_super(sb);
+ return 0;
+ }
+ /* raced with the final deactivate_super() */
+ mutex_lock(&ih->mutex);
+ if (idr_find(&ih->idr, wd) != watch || watch->inode->i_sb != sb) {
+ /* the watch is dead */
+ mutex_unlock(&ih->mutex);
+ drop_super(sb);
+ return 0;
+ }
+ /* still alive or freed and reused with the same sb and wd; kill */
+ get_inotify_watch(watch);
+ mutex_unlock(&ih->mutex);
+ return 2;
+}
+
+static void unpin_and_kill(struct inotify_watch *watch, int how)
+{
+ struct super_block *sb = watch->inode->i_sb;
+ put_inotify_watch(watch);
+ switch (how) {
+ case 1:
+ deactivate_super(sb);
+ break;
+ case 2:
+ drop_super(sb);
+ }
+}
+
/**
* inotify_destroy - clean up and destroy an inotify instance
* @ih: inotify handle
@@ -490,11 +617,15 @@ void inotify_destroy(struct inotify_hand
* pretty. We cannot do a simple iteration over the list, because we
* do not know the inode until we iterate to the watch. But we need to
* hold inode->inotify_mutex before ih->mutex. The following works.
+ *
+ * AV: it had to become even uglier to start working ;-/
*/
while (1) {
struct inotify_watch *watch;
struct list_head *watches;
+ struct super_block *sb;
struct inode *inode;
+ int how;

mutex_lock(&ih->mutex);
watches = &ih->watches;
@@ -503,8 +634,10 @@ void inotify_destroy(struct inotify_hand
break;
}
watch = list_first_entry(watches, struct inotify_watch, h_list);
- get_inotify_watch(watch);
- mutex_unlock(&ih->mutex);
+ sb = watch->inode->i_sb;
+ how = pin_to_kill(ih, watch);
+ if (!how)
+ continue;

inode = watch->inode;
mutex_lock(&inode->inotify_mutex);
@@ -518,7 +651,7 @@ void inotify_destroy(struct inotify_hand

mutex_unlock(&ih->mutex);
mutex_unlock(&inode->inotify_mutex);
- put_inotify_watch(watch);
+ unpin_and_kill(watch, how);
}

/* free this handle: the put matching the get in inotify_init() */
@@ -719,7 +852,9 @@ void inotify_evict_watch(struct inotify_
int inotify_rm_wd(struct inotify_handle *ih, u32 wd)
{
struct inotify_watch *watch;
+ struct super_block *sb;
struct inode *inode;
+ int how;

mutex_lock(&ih->mutex);
watch = idr_find(&ih->idr, wd);
@@ -727,9 +862,12 @@ int inotify_rm_wd(struct inotify_handle
mutex_unlock(&ih->mutex);
return -EINVAL;
}
- get_inotify_watch(watch);
+ sb = watch->inode->i_sb;
+ how = pin_to_kill(ih, watch);
+ if (!how)
+ return 0;
+
inode = watch->inode;
- mutex_unlock(&ih->mutex);

mutex_lock(&inode->inotify_mutex);
mutex_lock(&ih->mutex);
@@ -740,7 +878,7 @@ int inotify_rm_wd(struct inotify_handle

mutex_unlock(&ih->mutex);
mutex_unlock(&inode->inotify_mutex);
- put_inotify_watch(watch);
+ unpin_and_kill(watch, how);

return 0;
}
--- a/include/linux/inotify.h
+++ b/include/linux/inotify.h
@@ -134,6 +134,8 @@ extern void inotify_remove_watch_locked(
struct inotify_watch *);
extern void get_inotify_watch(struct inotify_watch *);
extern void put_inotify_watch(struct inotify_watch *);
+extern int pin_inotify_watch(struct inotify_watch *);
+extern void unpin_inotify_watch(struct inotify_watch *);

#else

@@ -228,6 +230,15 @@ static inline void put_inotify_watch(str
{
}

+extern inline int pin_inotify_watch(struct inotify_watch *watch)
+{
+ return 0;
+}
+
+extern inline void unpin_inotify_watch(struct inotify_watch *watch)
+{
+}
+
#endif /* CONFIG_INOTIFY */

#endif /* __KERNEL __ */
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -1094,8 +1094,8 @@ static void audit_inotify_unregister(str
list_for_each_entry_safe(p, n, in_list, ilist) {
list_del(&p->ilist);
inotify_rm_watch(audit_ih, &p->wdata);
- /* the put matching the get in audit_do_del_rule() */
- put_inotify_watch(&p->wdata);
+ /* the unpin matching the pin in audit_do_del_rule() */
+ unpin_inotify_watch(&p->wdata);
}
}

@@ -1389,9 +1389,13 @@ static inline int audit_del_rule(struct
/* Put parent on the inotify un-registration
* list. Grab a reference before releasing
* audit_filter_mutex, to be released in
- * audit_inotify_unregister(). */
- list_add(&parent->ilist, &inotify_list);
- get_inotify_watch(&parent->wdata);
+ * audit_inotify_unregister().
+ * If filesystem is going away, just leave
+ * the sucker alone, eviction will take
+ * care of it.
+ */
+ if (pin_inotify_watch(&parent->wdata))
+ list_add(&parent->ilist, &inotify_list);
}
}
}
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -24,6 +24,7 @@ struct audit_chunk {
struct list_head trees; /* with root here */
int dead;
int count;
+ atomic_long_t refs;
struct rcu_head head;
struct node {
struct list_head list;
@@ -56,7 +57,8 @@ static LIST_HEAD(prune_list);
* tree is refcounted; one reference for "some rules on rules_list refer to
* it", one for each chunk with pointer to it.
*
- * chunk is refcounted by embedded inotify_watch.
+ * chunk is refcounted by embedded inotify_watch + .refs (non-zero refcount
+ * of watch contributes 1 to .refs).
*
* node.index allows to get from node.list to containing chunk.
* MSB of that sucker is stolen to mark taggings that we might have to
@@ -121,6 +123,7 @@ static struct audit_chunk *alloc_chunk(i
INIT_LIST_HEAD(&chunk->hash);
INIT_LIST_HEAD(&chunk->trees);
chunk->count = count;
+ atomic_long_set(&chunk->refs, 1);
for (i = 0; i < count; i++) {
INIT_LIST_HEAD(&chunk->owners[i].list);
chunk->owners[i].index = i;
@@ -129,9 +132,8 @@ static struct audit_chunk *alloc_chunk(i
return chunk;
}

-static void __free_chunk(struct rcu_head *rcu)
+static void free_chunk(struct audit_chunk *chunk)
{
- struct audit_chunk *chunk = container_of(rcu, struct audit_chunk, head);
int i;

for (i = 0; i < chunk->count; i++) {
@@ -141,14 +143,16 @@ static void __free_chunk(struct rcu_head
kfree(chunk);
}

-static inline void free_chunk(struct audit_chunk *chunk)
+void audit_put_chunk(struct audit_chunk *chunk)
{
- call_rcu(&chunk->head, __free_chunk);
+ if (atomic_long_dec_and_test(&chunk->refs))
+ free_chunk(chunk);
}

-void audit_put_chunk(struct audit_chunk *chunk)
+static void __put_chunk(struct rcu_head *rcu)
{
- put_inotify_watch(&chunk->watch);
+ struct audit_chunk *chunk = container_of(rcu, struct audit_chunk, head);
+ audit_put_chunk(chunk);
}

enum {HASH_SIZE = 128};
@@ -176,7 +180,7 @@ struct audit_chunk *audit_tree_lookup(co

list_for_each_entry_rcu(p, list, hash) {
if (p->watch.inode == inode) {
- get_inotify_watch(&p->watch);
+ atomic_long_inc(&p->refs);
return p;
}
}
@@ -194,17 +198,49 @@ int audit_tree_match(struct audit_chunk

/* tagging and untagging inodes with trees */

-static void untag_chunk(struct audit_chunk *chunk, struct node *p)
+static struct audit_chunk *find_chunk(struct node *p)
+{
+ int index = p->index & ~(1U<<31);
+ p -= index;
+ return container_of(p, struct audit_chunk, owners[0]);
+}
+
+static void untag_chunk(struct node *p)
{
+ struct audit_chunk *chunk = find_chunk(p);
struct audit_chunk *new;
struct audit_tree *owner;
int size = chunk->count - 1;
int i, j;

+ if (!pin_inotify_watch(&chunk->watch)) {
+ /*
+ * Filesystem is shutting down; all watches are getting
+ * evicted, just take it off the node list for this
+ * tree and let the eviction logics take care of the
+ * rest.
+ */
+ owner = p->owner;
+ if (owner->root == chunk) {
+ list_del_init(&owner->same_root);
+ owner->root = NULL;
+ }
+ list_del_init(&p->list);
+ p->owner = NULL;
+ put_tree(owner);
+ return;
+ }
+
+ spin_unlock(&hash_lock);
+
+ /*
+ * pin_inotify_watch() succeeded, so the watch won't go away
+ * from under us.
+ */
mutex_lock(&chunk->watch.inode->inotify_mutex);
if (chunk->dead) {
mutex_unlock(&chunk->watch.inode->inotify_mutex);
- return;
+ goto out;
}

owner = p->owner;
@@ -221,7 +257,7 @@ static void untag_chunk(struct audit_chu
inotify_evict_watch(&chunk->watch);
mutex_unlock(&chunk->watch.inode->inotify_mutex);
put_inotify_watch(&chunk->watch);
- return;
+ goto out;
}

new = alloc_chunk(size);
@@ -263,7 +299,7 @@ static void untag_chunk(struct audit_chu
inotify_evict_watch(&chunk->watch);
mutex_unlock(&chunk->watch.inode->inotify_mutex);
put_inotify_watch(&chunk->watch);
- return;
+ goto out;

Fallback:
// do the best we can
@@ -277,6 +313,9 @@ Fallback:
put_tree(owner);
spin_unlock(&hash_lock);
mutex_unlock(&chunk->watch.inode->inotify_mutex);
+out:
+ unpin_inotify_watch(&chunk->watch);
+ spin_lock(&hash_lock);
}

static int create_chunk(struct inode *inode, struct audit_tree *tree)
@@ -387,13 +426,6 @@ static int tag_chunk(struct inode *inode
return 0;
}

-static struct audit_chunk *find_chunk(struct node *p)
-{
- int index = p->index & ~(1U<<31);
- p -= index;
- return container_of(p, struct audit_chunk, owners[0]);
-}
-
static void kill_rules(struct audit_tree *tree)
{
struct audit_krule *rule, *next;
@@ -431,17 +463,10 @@ static void prune_one(struct audit_tree
spin_lock(&hash_lock);
while (!list_empty(&victim->chunks)) {
struct node *p;
- struct audit_chunk *chunk;

p = list_entry(victim->chunks.next, struct node, list);
- chunk = find_chunk(p);
- get_inotify_watch(&chunk->watch);
- spin_unlock(&hash_lock);
-
- untag_chunk(chunk, p);

- put_inotify_watch(&chunk->watch);
- spin_lock(&hash_lock);
+ untag_chunk(p);
}
spin_unlock(&hash_lock);
put_tree(victim);
@@ -469,7 +494,6 @@ static void trim_marked(struct audit_tre

while (!list_empty(&tree->chunks)) {
struct node *node;
- struct audit_chunk *chunk;

node = list_entry(tree->chunks.next, struct node, list);

@@ -477,14 +501,7 @@ static void trim_marked(struct audit_tre
if (!(node->index & (1U<<31)))
break;

- chunk = find_chunk(node);
- get_inotify_watch(&chunk->watch);
- spin_unlock(&hash_lock);
-
- untag_chunk(chunk, node);
-
- put_inotify_watch(&chunk->watch);
- spin_lock(&hash_lock);
+ untag_chunk(node);
}
if (!tree->root && !tree->goner) {
tree->goner = 1;
@@ -878,7 +895,7 @@ static void handle_event(struct inotify_
static void destroy_watch(struct inotify_watch *watch)
{
struct audit_chunk *chunk = container_of(watch, struct audit_chunk, watch);
- free_chunk(chunk);
+ call_rcu(&chunk->head, __put_chunk);
}

static const struct inotify_operations rtree_inotify_ops = {

2008-12-03 19:58:01

by Greg KH

Subject: [patch 019/104] V4L/DVB (9352): Add some missing compat32 ioctls

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Gregor Jasny <[email protected]>

commit c7f09db6852d85e7f76322815051aad1c88d08cf upstream.

This patch adds the missing compat ioctls that are needed to
operate Skype in combination with libv4l and a MJPEG only camera.

If you think it's trivial enough please submit it to -stable, too.

Signed-off-by: Gregor Jasny <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/media/video/compat_ioctl32.c | 3 +++
1 file changed, 3 insertions(+)

--- a/drivers/media/video/compat_ioctl32.c
+++ b/drivers/media/video/compat_ioctl32.c
@@ -867,6 +867,7 @@ long v4l_compat_ioctl32(struct file *fil
case VIDIOC_STREAMON32:
case VIDIOC_STREAMOFF32:
case VIDIOC_G_PARM:
+ case VIDIOC_S_PARM:
case VIDIOC_G_STD:
case VIDIOC_S_STD:
case VIDIOC_G_TUNER:
@@ -885,6 +886,8 @@ long v4l_compat_ioctl32(struct file *fil
case VIDIOC_S_INPUT32:
case VIDIOC_TRY_FMT32:
case VIDIOC_S_HW_FREQ_SEEK:
+ case VIDIOC_ENUM_FRAMESIZES:
+ case VIDIOC_ENUM_FRAMEINTERVALS:
ret = do_video_ioctl(file, cmd, arg);
break;

2008-12-03 19:58:41

by Greg KH

Subject: [patch 020/104] Input: atkbd - add keymap quirk for Inventec Symphony systems

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Matthew Garrett <[email protected]>

commit a8215b81cc31cf267506bc6a4a4bfe93f4ca1652 upstream.

The Zepto 6615WD laptop (rebranded Inventec Symphony system) needs a
key release quirk for its volume keys to work. The attached patch adds
the quirk to the atkbd driver.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=460237

Signed-off-by: Matthew Garrett <[email protected]>
Signed-off-by: Adel Gadllah <[email protected]>
Signed-off-by: Dmitry Torokhov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/input/keyboard/atkbd.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

--- a/drivers/input/keyboard/atkbd.c
+++ b/drivers/input/keyboard/atkbd.c
@@ -868,6 +868,22 @@ static void atkbd_hp_keymap_fixup(struct
}

/*
+ * Inventec system with broken key release on volume keys
+ */
+static void atkbd_inventec_keymap_fixup(struct atkbd *atkbd)
+{
+ const unsigned int forced_release_keys[] = {
+ 0xae, 0xb0,
+ };
+ int i;
+
+ if (atkbd->set == 2)
+ for (i = 0; i < ARRAY_SIZE(forced_release_keys); i++)
+ __set_bit(forced_release_keys[i],
+ atkbd->force_release_mask);
+}
+
+/*
* atkbd_set_keycode_table() initializes keyboard's keycode table
* according to the selected scancode set
*/
@@ -1478,6 +1494,15 @@ static struct dmi_system_id atkbd_dmi_qu
.callback = atkbd_setup_fixup,
.driver_data = atkbd_hp_keymap_fixup,
},
+ {
+ .ident = "Inventec Symphony",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "INVENTEC"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "SYMPHONY 6.0/7.0"),
+ },
+ .callback = atkbd_setup_fixup,
+ .driver_data = atkbd_inventec_keymap_fixup,
+ },
{ }
};

2008-12-03 19:59:28

by Greg KH

Subject: [patch 022/104] parport_serial: fix array overflow

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Takashi Iwai <[email protected]>

commit 36be47d6d8d98f54b6c4f891e9f54fb2bf554584 upstream.

The netmos_9xx5_combo type assumes that the PCI SSID always provides the
correct number of parallel and serial ports, but some broken devices
report wrong numbers, which may result in an Oops.

This patch simply adds a range check against the array size.

Reference: Novell bnc#447067
https://bugzilla.novell.com/show_bug.cgi?id=447067

Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/parport/parport_serial.c | 2 ++
1 file changed, 2 insertions(+)

--- a/drivers/parport/parport_serial.c
+++ b/drivers/parport/parport_serial.c
@@ -70,6 +70,8 @@ static int __devinit netmos_parallel_ini
* parallel ports and <S> is the number of serial ports.
*/
card->numports = (dev->subsystem_device & 0xf0) >> 4;
+ if (card->numports > ARRAY_SIZE(card->addr))
+ card->numports = ARRAY_SIZE(card->addr);
return 0;
}
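
The clamp in the hunk above can be sketched in isolation. The struct below is a trimmed-down, hypothetical stand-in for the driver's card data, and the array capacity of 2 is an assumption chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, trimmed-down stand-in for the driver's per-card data. */
struct card_info {
	int numports;
	int addr[2]; /* assumed capacity, for illustration only */
};

/* Decode the parallel-port count from the subsystem ID, then clamp it. */
static void decode_numports(struct card_info *card, unsigned int subsystem_device)
{
	card->numports = (int)((subsystem_device & 0xf0) >> 4);
	if ((size_t)card->numports > ARRAY_SIZE(card->addr))
		card->numports = (int)ARRAY_SIZE(card->addr);
}
```

A subsystem ID of 0x0012 decodes to one port and is left alone; a bogus 0x00f0 would decode to 15 and is clamped to the array size.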

2008-12-03 19:59:46

by Greg KH

Subject: [patch 023/104] x86: more general identifier for Phoenix BIOS

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Philipp Kohlbecher <[email protected]>

commit 0af40a4b1050c050e62eb1dc30b82d5ab22bf221 upstream.

Impact: widen the reach of the low-memory-protect DMI quirk

Phoenix BIOSes variously identify their vendor as "Phoenix Technologies,
LTD" or "Phoenix Technologies LTD" (without the comma.)

This patch makes the identification string in the bad_bios_dmi_table
more general (following a suggestion by Ingo Molnar), so that both
versions are handled.

Again, the patched file compiles cleanly and the patch has been tested
successfully on my machine.

Signed-off-by: Philipp Kohlbecher <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/kernel/setup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -604,7 +604,7 @@ static struct dmi_system_id __initdata b
.callback = dmi_low_memory_corruption,
.ident = "Phoenix BIOS",
.matches = {
- DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies, LTD"),
+ DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies"),
},
},
#endif
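
The shortened string works because DMI match entries are compared as substrings rather than as exact strings. A small userspace sketch of that comparison follows; `vendor_matches()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <string.h>

/* Substring match, in the spirit of how DMI_MATCH entries are compared. */
static int vendor_matches(const char *bios_vendor, const char *substr)
{
	return bios_vendor != NULL && strstr(bios_vendor, substr) != NULL;
}
```

Both "Phoenix Technologies, LTD" and "Phoenix Technologies LTD" contain "Phoenix Technologies", whereas the old pattern with the comma only matches the first spelling.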

2008-12-03 20:00:04

by Greg KH

Subject: [patch 024/104] x86: always define DECLARE_PCI_UNMAP* macros

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Joerg Roedel <[email protected]>

commit b627c8b17ccacba38c975bc0f69a49fc4e5261c9 upstream.

Impact: fix boot crash on AMD IOMMU if CONFIG_GART_IOMMU is off

Currently these macros evaluate to a no-op unless the kernel is compiled
with GART or Calgary support. But we also need these macros when we have
SWIOTLB, VT-d or AMD IOMMU in the kernel. Since we always compile with at
least SWIOTLB, we can always define these macros.

This patch is also for stable backport for the same reason the SWIOTLB
default selection patch is.

Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/asm-x86/pci_64.h | 14 --------------
1 file changed, 14 deletions(-)

--- a/include/asm-x86/pci_64.h
+++ b/include/asm-x86/pci_64.h
@@ -34,8 +34,6 @@ extern void pci_iommu_alloc(void);
*/
#define PCI_DMA_BUS_IS_PHYS (dma_ops->is_phys)

-#if defined(CONFIG_GART_IOMMU) || defined(CONFIG_CALGARY_IOMMU)
-
#define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME) \
dma_addr_t ADDR_NAME;
#define DECLARE_PCI_UNMAP_LEN(LEN_NAME) \
@@ -49,18 +47,6 @@ extern void pci_iommu_alloc(void);
#define pci_unmap_len_set(PTR, LEN_NAME, VAL) \
(((PTR)->LEN_NAME) = (VAL))

-#else
-/* No IOMMU */
-
-#define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME)
-#define DECLARE_PCI_UNMAP_LEN(LEN_NAME)
-#define pci_unmap_addr(PTR, ADDR_NAME) (0)
-#define pci_unmap_addr_set(PTR, ADDR_NAME, VAL) do { } while (0)
-#define pci_unmap_len(PTR, LEN_NAME) (0)
-#define pci_unmap_len_set(PTR, LEN_NAME, VAL) do { } while (0)
-
-#endif
-
#endif /* __KERNEL__ */

#endif /* __x8664_PCI_H */
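
To see why the no-op variants are dangerous once any IOMMU or bounce-buffer layer is active, here is a userspace sketch. The `dma_addr_t` width, the `struct rx_slot` bookkeeping structure and the `noop_*` names are assumptions for illustration; the removed macros are renamed so that both variants can coexist in one file.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t dma_addr_t; /* assumed width, illustration only */

/* The kind of definitions the patch makes unconditional. */
#define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME) dma_addr_t ADDR_NAME;
#define pci_unmap_addr(PTR, ADDR_NAME) ((PTR)->ADDR_NAME)
#define pci_unmap_addr_set(PTR, ADDR_NAME, VAL) (((PTR)->ADDR_NAME) = (VAL))

/* The removed "No IOMMU" variants, renamed here so both can coexist. */
#define noop_unmap_addr(PTR, ADDR_NAME) (0)
#define noop_unmap_addr_set(PTR, ADDR_NAME, VAL) do { } while (0)

/* Hypothetical driver bookkeeping structure. */
struct rx_slot {
	void *buf;
	DECLARE_PCI_UNMAP_ADDR(mapping)
};
```

With the real macros a driver gets back the address it stored; with the no-op variants the store is discarded and the later read yields 0, so the unmap call would be handed a bogus bus address.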

2008-12-03 19:57:31

by Greg KH

Subject: [patch 018/104] IA64: fix boot panic caused by offline CPUs

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Doug Chapman <[email protected]>

commit 62ee0540f5e5a804b79cae8b3c0185a85f02436b upstream.

This fixes a regression introduced by 2c6e6db41f01b6b4eb98809350827c9678996698
"Minimize per_cpu reservations." That patch incorrectly used information about
which CPUs are possible before it had been initialized by ACPI. The end result
was that per_cpu structures for offline CPUs were not initialized, causing a
NULL pointer dereference.

Since we cannot do the full acpi_boot_init() call any earlier, the simplest
fix is to just parse the MADT for SAPIC entries early to find the CPU
info. This should also allow for some cleanup of the code added by the
"Minimize per_cpu reservations" patch. This patch just fixes the regression;
the cleanup will come in a later patch.

Signed-off-by: Doug Chapman <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
CC: Robin Holt <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/ia64/kernel/acpi.c | 29 ++++++++++++++++++++++++-----
arch/ia64/kernel/setup.c | 7 ++++---
2 files changed, 28 insertions(+), 8 deletions(-)

--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -656,6 +656,30 @@ static int __init acpi_parse_fadt(struct
return 0;
}

+int __init early_acpi_boot_init(void)
+{
+ int ret;
+
+ /*
+ * do a partial walk of MADT to determine how many CPUs
+ * we have including offline CPUs
+ */
+ if (acpi_table_parse(ACPI_SIG_MADT, acpi_parse_madt)) {
+ printk(KERN_ERR PREFIX "Can't find MADT\n");
+ return 0;
+ }
+
+ ret = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_SAPIC,
+ acpi_parse_lsapic, NR_CPUS);
+ if (ret < 1)
+ printk(KERN_ERR PREFIX
+ "Error parsing MADT - no LAPIC entries\n");
+
+ return 0;
+}
+
+
+
int __init acpi_boot_init(void)
{

@@ -679,11 +703,6 @@ int __init acpi_boot_init(void)
printk(KERN_ERR PREFIX
"Error parsing LAPIC address override entry\n");

- if (acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_SAPIC, acpi_parse_lsapic, NR_CPUS)
- < 1)
- printk(KERN_ERR PREFIX
- "Error parsing MADT - no LAPIC entries\n");
-
if (acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI, acpi_parse_lapic_nmi, 0)
< 0)
printk(KERN_ERR PREFIX "Error parsing LAPIC NMI entry\n");
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -549,8 +549,12 @@ setup_arch (char **cmdline_p)
#ifdef CONFIG_ACPI
/* Initialize the ACPI boot-time table parser */
acpi_table_init();
+ early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
acpi_numa_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+ prefill_possible_map();
+#endif
per_cpu_scan_finalize((cpus_weight(early_cpu_possible_map) == 0 ?
32 : cpus_weight(early_cpu_possible_map)),
additional_cpus > 0 ? additional_cpus : 0);
@@ -841,9 +845,6 @@ void __init
setup_per_cpu_areas (void)
{
/* start_kernel() requires this... */
-#ifdef CONFIG_ACPI_HOTPLUG_CPU
- prefill_possible_map();
-#endif
}

/*

2008-12-03 20:00:37

by Greg KH

Subject: [patch 025/104] ath9k: Fix SW-IOMMU bounce buffer starvation

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Luis R. Rodriguez <[email protected]>

commit ca0c7e5101fd4f37fed8e851709f08580b92fbb3 upstream.

This should fix the SW-IOMMU bounce buffer starvation
seen on kernel.org bugzilla 11811:

http://bugzilla.kernel.org/show_bug.cgi?id=11811

Users on MacBook Pro 3.1/MacBook v2 would see something like:

DMA: Out of SW-IOMMU space for 4224 bytes at device 0000:0b:00.0

Unfortunately it's only easy to trigger on MacBook Pro 3.1/MacBook v2
so far, so it's difficult to debug (even with swiotlb=force).

We were pci_unmap_single()'ing fewer bytes than we had mapped
with pci_map_single(), and as such we were starving
the swiotlb of its 64MB of bounce buffers. We remain
consistent and now always use sc->sc_rxbufsize for RX. While at
it we update the beacon DMA maps as well to only use the data
portion of the skb; previously we were pci_map_single()'ing
more data for beaconing than what we tell the hardware it can use,
therefore pushing more iotlb abuse.

Still not sure why this is so easily triggerable on
MacBook Pro 3.1; it may be that the hardware configuration
tends to use more memory above the 3GB mark for DMA.

Signed-off-by: Maciej Zenczykowski <[email protected]>
Signed-off-by: Bennyam Malavazi <[email protected]>
Signed-off-by: Luis R. Rodriguez <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/wireless/ath9k/recv.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

--- a/drivers/net/wireless/ath9k/recv.c
+++ b/drivers/net/wireless/ath9k/recv.c
@@ -1011,7 +1011,7 @@ int ath_rx_tasklet(struct ath_softc *sc,

pci_dma_sync_single_for_cpu(sc->pdev,
bf->bf_buf_addr,
- skb_tailroom(skb),
+ sc->sc_rxbufsize,
PCI_DMA_FROMDEVICE);
pci_unmap_single(sc->pdev,
bf->bf_buf_addr,
@@ -1303,8 +1303,7 @@ dma_addr_t ath_skb_map_single(struct ath
* NB: do NOT use skb->len, which is 0 on initialization.
* Use skb's entire data area instead.
*/
- *pa = pci_map_single(sc->pdev, skb->data,
- skb_end_pointer(skb) - skb->head, direction);
+ *pa = pci_map_single(sc->pdev, skb->data, sc->sc_rxbufsize, direction);
return *pa;
}

@@ -1314,6 +1313,5 @@ void ath_skb_unmap_single(struct ath_sof
dma_addr_t *pa)
{
/* Unmap skb's entire data area */
- pci_unmap_single(sc->pdev, *pa,
- skb_end_pointer(skb) - skb->head, direction);
+ pci_unmap_single(sc->pdev, *pa, sc->sc_rxbufsize, direction);
}
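
The starvation described in the changelog can be illustrated with a toy model. This is not how swiotlb is implemented; the pool below only counts reserved bytes, and every name in it (`bounce_pool`, `toy_map()`, `toy_unmap()`, `run_cycles()`) is hypothetical. It shows that releasing less than was reserved on every cycle eventually exhausts the pool.

```c
#include <assert.h>
#include <stddef.h>

/* Toy bounce-buffer pool: only tracks how many bytes are reserved. */
struct bounce_pool {
	size_t capacity;
	size_t in_use;
};

/* Reserve 'len' bytes; returns 0 on success, -1 when the pool is exhausted. */
static int toy_map(struct bounce_pool *pool, size_t len)
{
	if (pool->capacity - pool->in_use < len)
		return -1;
	pool->in_use += len;
	return 0;
}

/* Release 'len' bytes; a smaller 'len' than was mapped leaves bytes stranded. */
static void toy_unmap(struct bounce_pool *pool, size_t len)
{
	pool->in_use -= (len < pool->in_use) ? len : pool->in_use;
}

/* Run up to 'rounds' map/unmap cycles; return how many maps succeeded. */
static size_t run_cycles(struct bounce_pool *pool, size_t map_len,
			 size_t unmap_len, size_t rounds)
{
	size_t done = 0;

	while (done < rounds && toy_map(pool, map_len) == 0) {
		toy_unmap(pool, unmap_len);
		done++;
	}
	return done;
}
```

Mapping 4224 bytes and unmapping 4096 strands 128 bytes per cycle, so a small pool runs dry after a bounded number of cycles, while matching sizes can cycle indefinitely.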

2008-12-03 20:01:20

by Greg KH

Subject: [patch 027/104] axnet_cs / pcnet_cs: moving PCMCIA_DEVICE_PROD_ID for Netgear FA411

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Cord Walter <[email protected]>

commit 208fbec5bec1de4fce48aab41efde11ba25ab04c upstream.

Hi,

after noticing that my Netgear FA411 (PCMCIA-NIC) [1] stopped working with
the release of the 2.6.25 kernel (sidux-version), I checked the
respective driver sources and noticed that the pcnet_cs driver bailed
out with "use axnet_cs instead" for the Netgear FA411, but axnet_cs
doesn't claim this ID.

I compiled a kernel with the PCMCIA-ID for the netgear card moved to
axnet_cs from pcnet_cs which worked. I then contacted sidux-kernel
maintainer Stefan Lippers-Hollmann who turned the info into this patch
and integrated it into the kernel:

<http://svn.berlios.de/svnroot/repos/fullstory/linux-sidux-2.6/trunk/debian/patches/features/2.6.27.4_PCMCIA_move-PCMCIA-ID-for-Netgear-FA411-from-pcnet_cs-to-axnet_cs.patch>

This works for me and AFAIK there were no reports of any breakage for
other devices on sidux-support.

This looks like a trivial patch, but since I have very limited
experience with kernel modifications I might be woefully wrong there.
But if there are no side effects of this patch, is it possible to get it
into the official kernel?

I can provide more detailed information on the affected hardware if
necessary.

-cord

[1]
Socket 1 Device 0: [axnet_cs] (bus ID: 1.0)
Configuration: state: on
Product Name: NETGEAR FA411 Fast Ethernet
Identification: manf_id: 0x0149 card_id: 0x0411
function: 6 (network)
prod_id(1): "NETGEAR" (0x9aa79dc3)
prod_id(2): "FA411" (0x40fad875)
prod_id(3): "Fast Ethernet" (0xb4be14e3)
prod_id(4): --- (---)

From: Stefan Lippers-Hollmann <[email protected]>
Date: Sat, 1 Nov 2008 23:53:04 +0000
Subject: [patch 027/104] PCMCIA: move PCMCIA ID for Netgear FA411 from pcnet_cs to axnet_cs:

Since kernel 2.6.25, commit 61da96be07ec860e260ca4af0199b9d48d000b80
(pcnet_cs: if AX88190-based card, printk "use axnet_cs instead" message.),
pcnet_cs bails out with "use axnet_cs instead" for the Netgear FA411, but
axnet_cs doesn't claim this ID.

Socket 1 Device 0: [axnet_cs] (bus ID: 1.0)
Configuration: state: on
Product Name: NETGEAR FA411 Fast Ethernet
Identification: manf_id: 0x0149 card_id: 0x0411
function: 6 (network)
prod_id(1): "NETGEAR" (0x9aa79dc3)
prod_id(2): "FA411" (0x40fad875)
prod_id(3): "Fast Ethernet" (0xb4be14e3)
prod_id(4): --- (---)

Signed-off-by: Stefan Lippers-Hollmann <[email protected]>
Signed-off-by: Cord Walter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/pcmcia/axnet_cs.c | 1 +
drivers/net/pcmcia/pcnet_cs.c | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/pcmcia/axnet_cs.c
+++ b/drivers/net/pcmcia/axnet_cs.c
@@ -787,6 +787,7 @@ static struct pcmcia_device_id axnet_ids
PCMCIA_DEVICE_PROD_ID12("IO DATA", "ETXPCM", 0x547e66dc, 0x233adac2),
PCMCIA_DEVICE_PROD_ID12("Linksys", "EtherFast 10/100 PC Card (PCMPC100 V3)", 0x0733cc81, 0x232019a8),
PCMCIA_DEVICE_PROD_ID12("MELCO", "LPC3-TX", 0x481e0094, 0xf91af609),
+ PCMCIA_DEVICE_PROD_ID12("NETGEAR", "FA411", 0x9aa79dc3, 0x40fad875),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "100BASE", 0x281f1c5d, 0x7c2add04),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "FastEtherCard", 0x281f1c5d, 0x7ef26116),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "FEP501", 0x281f1c5d, 0x2e272058),
--- a/drivers/net/pcmcia/pcnet_cs.c
+++ b/drivers/net/pcmcia/pcnet_cs.c
@@ -1697,7 +1697,6 @@ static struct pcmcia_device_id pcnet_ids
PCMCIA_DEVICE_PROD_ID12("National Semiconductor", "InfoMover NE4100", 0x36e1191f, 0xa6617ec8),
PCMCIA_DEVICE_PROD_ID12("NEC", "PC-9801N-J12", 0x18df0ba0, 0xbc912d76),
PCMCIA_DEVICE_PROD_ID12("NETGEAR", "FA410TX", 0x9aa79dc3, 0x60e5bc0e),
- PCMCIA_DEVICE_PROD_ID12("NETGEAR", "FA411", 0x9aa79dc3, 0x40fad875),
PCMCIA_DEVICE_PROD_ID12("Network Everywhere", "Fast Ethernet 10/100 PC Card", 0x820a67b6, 0x31ed1a5f),
PCMCIA_DEVICE_PROD_ID12("NextCom K.K.", "Next Hawk", 0xaedaec74, 0xad050ef1),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "10/100Mbps Ethernet Card", 0x281f1c5d, 0x6e41773b),

2008-12-03 20:00:55

by Greg KH

Subject: [patch 026/104] ath9k: correct expected max RX buffer size

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Luis R. Rodriguez <[email protected]>

commit b4b6cda2298b0c9a0af902312184b775b8867c65 upstream

We should tell the hardware it is capable of DMA'ing to us
only as much as we asked dev_alloc_skb() for. Prior to this,
a large RX'd frame could have corrupted DMA data,
and we were saved only because we
were previously also pci_map_single()'ing the same large
value. The issue prior to this, though, was that we were unmapping
a smaller amount, which the prior DMA patch fixed.

Signed-off-by: Bennyam Malavazi <[email protected]>
Signed-off-by: Luis R. Rodriguez <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/wireless/ath9k/recv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/wireless/ath9k/recv.c
+++ b/drivers/net/wireless/ath9k/recv.c
@@ -52,7 +52,7 @@ static void ath_rx_buf_link(struct ath_s
/* setup rx descriptors */
ath9k_hw_setuprxdesc(ah,
ds,
- skb_tailroom(skb), /* buffer size */
+ sc->sc_rxbufsize,
0);

if (sc->sc_rxlink == NULL)

2008-12-03 19:59:01

by Greg KH

Subject: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Manfred Spraul <[email protected]>

commit 6ff2d39b91aec3dcae951afa982059e3dd9b49dc upstream.

2nd part of the fixes needed for
http://bugzilla.kernel.org/show_bug.cgi?id=11796.

When the idr tree is either grown or shrunk, the updates to the number
of layers and the top pointer were not atomic. This race caused crashes.

The attached patch fixes that by replicating the layers counter in each
layer, thus idr_find doesn't need idp->layers anymore.

Signed-off-by: Manfred Spraul <[email protected]>
Cc: Clement Calmels <[email protected]>
Cc: Nadia Derbey <[email protected]>
Cc: Pierre Peiffer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/linux/idr.h | 3 ++-
lib/idr.c | 14 ++++++++++++--
2 files changed, 14 insertions(+), 3 deletions(-)

--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -52,13 +52,14 @@ struct idr_layer {
unsigned long bitmap; /* A zero bit means "space here" */
struct idr_layer *ary[1<<IDR_BITS];
int count; /* When zero, we can release it */
+ int layer; /* distance from leaf */
struct rcu_head rcu_head;
};

struct idr {
struct idr_layer *top;
struct idr_layer *id_free;
- int layers;
+ int layers; /* only valid without concurrent changes */
int id_free_cnt;
spinlock_t lock;
};
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -185,6 +185,7 @@ static int sub_alloc(struct idr *idp, in
new = get_from_free_list(idp);
if (!new)
return -1;
+ new->layer = l-1;
rcu_assign_pointer(p->ary[m], new);
p->count++;
}
@@ -210,6 +211,7 @@ build_up:
if (unlikely(!p)) {
if (!(p = get_from_free_list(idp)))
return -1;
+ p->layer = 0;
layers = 1;
}
/*
@@ -237,6 +239,7 @@ build_up:
}
new->ary[0] = p;
new->count = 1;
+ new->layer = layers-1;
if (p->bitmap == IDR_FULL)
__set_bit(0, &new->bitmap);
p = new;
@@ -493,17 +496,21 @@ void *idr_find(struct idr *idp, int id)
int n;
struct idr_layer *p;

- n = idp->layers * IDR_BITS;
p = rcu_dereference(idp->top);
+ if (!p)
+ return NULL;
+ n = (p->layer+1) * IDR_BITS;

/* Mask off upper bits we don't use for the search. */
id &= MAX_ID_MASK;

if (id >= (1 << n))
return NULL;
+ BUG_ON(n == 0);

while (n > 0 && p) {
n -= IDR_BITS;
+ BUG_ON(n != p->layer*IDR_BITS);
p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
}
return((void *)p);
@@ -582,8 +589,11 @@ void *idr_replace(struct idr *idp, void
int n;
struct idr_layer *p, *old_p;

- n = idp->layers * IDR_BITS;
p = idp->top;
+ if (!p)
+ return ERR_PTR(-EINVAL);
+
+ n = (p->layer+1) * IDR_BITS;

id &= MAX_ID_MASK;
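
The lookup change can be sketched outside the kernel. In the toy radix tree below, the depth is read from the top node itself instead of from a separate counter that might be updated at a different moment than the top pointer. The `toy_*` names and the 2-bit fan-out are assumptions for brevity, and RCU is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BITS 2 /* fan-out of 4 per node, for brevity */
#define TOY_SIZE (1 << TOY_BITS)
#define TOY_MASK (TOY_SIZE - 1)

struct toy_layer {
	void *ary[TOY_SIZE];
	int layer; /* distance from leaf, as in the patch */
};

struct toy_idr {
	struct toy_layer *top;
};

/* Look up 'id', deriving the tree depth from the top node itself. */
static void *toy_find(const struct toy_idr *idp, int id)
{
	struct toy_layer *p = idp->top;
	int n;

	if (!p)
		return NULL;
	n = (p->layer + 1) * TOY_BITS;
	if (id < 0 || id >= (1 << n))
		return NULL;
	/* Walk down one layer per iteration; the last hop yields the data. */
	while (n > 0 && p) {
		n -= TOY_BITS;
		p = p->ary[(id >> n) & TOY_MASK];
	}
	return p;
}
```

Because the depth travels with the node, a reader that picks up a new top pointer always pairs it with the matching number of layers.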

2008-12-03 20:01:38

by Greg KH

Subject: [patch 028/104] PCI Hotplug core: add name param pci_hp_register interface

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 1359f2701b96abd9bb69c1273fb995a093b6409a upstream.

Update pci_hp_register() to take a const char *name parameter.

The motivation for this is to clean up the individual hotplug
drivers so that each one does not have to manage its own name.
The PCI core should be the place where we manage the name.

We update the interface and all callsites first, in a
"no functional change" manner, and clean up the drivers later.

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/acpiphp_core.c | 3 ++-
drivers/pci/hotplug/cpci_hotplug_core.c | 3 ++-
drivers/pci/hotplug/cpqphp_core.c | 3 ++-
drivers/pci/hotplug/fakephp.c | 3 ++-
drivers/pci/hotplug/ibmphp_ebda.c | 3 ++-
drivers/pci/hotplug/pci_hotplug_core.c | 15 ++++++++-------
drivers/pci/hotplug/pciehp_core.c | 3 ++-
drivers/pci/hotplug/rpaphp_slot.c | 2 +-
drivers/pci/hotplug/sgi_hotplug.c | 3 ++-
drivers/pci/hotplug/shpchp_core.c | 3 ++-
include/linux/pci_hotplug.h | 3 ++-
11 files changed, 27 insertions(+), 17 deletions(-)

--- a/drivers/pci/hotplug/acpiphp_core.c
+++ b/drivers/pci/hotplug/acpiphp_core.c
@@ -340,7 +340,8 @@ int acpiphp_register_hotplug_slot(struct

retval = pci_hp_register(slot->hotplug_slot,
acpiphp_slot->bridge->pci_bus,
- acpiphp_slot->device);
+ acpiphp_slot->device,
+ slot->name);
if (retval == -EBUSY)
goto error_hpslot;
if (retval) {
--- a/drivers/pci/hotplug/cpci_hotplug_core.c
+++ b/drivers/pci/hotplug/cpci_hotplug_core.c
@@ -285,7 +285,8 @@ cpci_hp_register_bus(struct pci_bus *bus
info->attention_status = cpci_get_attention_status(slot);

dbg("registering slot %s", slot->hotplug_slot->name);
- status = pci_hp_register(slot->hotplug_slot, bus, i);
+ status = pci_hp_register(slot->hotplug_slot, bus, i,
+ slot->hotplug_slot->name);
if (status) {
err("pci_hp_register failed with error %d", status);
goto error_name;
--- a/drivers/pci/hotplug/cpqphp_core.c
+++ b/drivers/pci/hotplug/cpqphp_core.c
@@ -436,7 +436,8 @@ static int ctrl_slot_setup(struct contro
slot_number);
result = pci_hp_register(hotplug_slot,
ctrl->pci_dev->bus,
- slot->device);
+ slot->device,
+ hotplug_slot->name);
if (result) {
err("pci_hp_register failed with error %d\n", result);
goto error_name;
--- a/drivers/pci/hotplug/fakephp.c
+++ b/drivers/pci/hotplug/fakephp.c
@@ -126,7 +126,8 @@ static int add_slot(struct pci_dev *dev)
slot->release = &dummy_release;
slot->private = dslot;

- retval = pci_hp_register(slot, dev->bus, PCI_SLOT(dev->devfn));
+ retval = pci_hp_register(slot, dev->bus, PCI_SLOT(dev->devfn),
+ slot->name);
if (retval) {
err("pci_hp_register failed with error %d\n", retval);
goto error_dslot;
--- a/drivers/pci/hotplug/ibmphp_ebda.c
+++ b/drivers/pci/hotplug/ibmphp_ebda.c
@@ -1002,7 +1002,8 @@ static int __init ebda_rsrc_controller (

snprintf (tmp_slot->hotplug_slot->name, 30, "%s", create_file_name (tmp_slot));
pci_hp_register(tmp_slot->hotplug_slot,
- pci_find_bus(0, tmp_slot->bus), tmp_slot->device);
+ pci_find_bus(0, tmp_slot->bus), tmp_slot->device,
+ tmp_slot->hotplug_slot->name);
}

print_ebda_hpc ();
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -221,7 +221,8 @@ static int init_slots(struct controller
duplicate_name:
retval = pci_hp_register(hotplug_slot,
ctrl->pci_dev->subordinate,
- slot->device);
+ slot->device,
+ slot->name);
if (retval) {
/*
* If slot N already exists, we'll try to create
--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -547,13 +547,15 @@ out:
* @bus: bus this slot is on
* @slot: pointer to the &struct hotplug_slot to register
* @slot_nr: slot number
+ * @name: name registered with kobject core
*
* Registers a hotplug slot with the pci hotplug subsystem, which will allow
* userspace interaction to the slot.
*
* Returns 0 if successful, anything else for an error.
*/
-int pci_hp_register(struct hotplug_slot *slot, struct pci_bus *bus, int slot_nr)
+int pci_hp_register(struct hotplug_slot *slot, struct pci_bus *bus, int slot_nr,
+ const char *name)
{
int result;
struct pci_slot *pci_slot;
@@ -569,7 +571,7 @@ int pci_hp_register(struct hotplug_slot
}

/* Check if we have already registered a slot with the same name. */
- if (get_slot_from_name(slot->name))
+ if (get_slot_from_name(name))
return -EEXIST;

/*
@@ -577,7 +579,7 @@ int pci_hp_register(struct hotplug_slot
* driver and call it here again. If we've already created the
* pci_slot, the interface will simply bump the refcount.
*/
- pci_slot = pci_create_slot(bus, slot_nr, slot->name);
+ pci_slot = pci_create_slot(bus, slot_nr, name);
if (IS_ERR(pci_slot))
return PTR_ERR(pci_slot);

@@ -593,8 +595,8 @@ int pci_hp_register(struct hotplug_slot
/*
* Allow pcihp drivers to override the ACPI_PCI_SLOT name.
*/
- if (strcmp(kobject_name(&pci_slot->kobj), slot->name)) {
- result = kobject_rename(&pci_slot->kobj, slot->name);
+ if (strcmp(kobject_name(&pci_slot->kobj), name)) {
+ result = kobject_rename(&pci_slot->kobj, name);
if (result) {
pci_destroy_slot(pci_slot);
return result;
@@ -607,8 +609,7 @@ int pci_hp_register(struct hotplug_slot

result = fs_add_slot(pci_slot);
kobject_uevent(&pci_slot->kobj, KOBJ_ADD);
- dbg("Added slot %s to the list\n", slot->name);
-
+ dbg("Added slot %s to the list\n", name);

return result;
}
--- a/drivers/pci/hotplug/rpaphp_slot.c
+++ b/drivers/pci/hotplug/rpaphp_slot.c
@@ -137,7 +137,7 @@ int rpaphp_register_slot(struct slot *sl
slotno = PCI_SLOT(PCI_DN(slot->dn->child)->devfn);
else
slotno = -1;
- retval = pci_hp_register(php_slot, slot->bus, slotno);
+ retval = pci_hp_register(php_slot, slot->bus, slotno, slot->name);
if (retval) {
err("pci_hp_register failed with error %d\n", retval);
return retval;
--- a/drivers/pci/hotplug/sgi_hotplug.c
+++ b/drivers/pci/hotplug/sgi_hotplug.c
@@ -653,7 +653,8 @@ static int sn_hotplug_slot_register(stru
bss_hotplug_slot->ops = &sn_hotplug_slot_ops;
bss_hotplug_slot->release = &sn_release_slot;

- rc = pci_hp_register(bss_hotplug_slot, pci_bus, device);
+ rc = pci_hp_register(bss_hotplug_slot, pci_bus, device,
+ bss_hotplug_slot->name);
if (rc)
goto register_err;

--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -146,7 +146,8 @@ static int init_slots(struct controller
slot->hp_slot, slot->number, ctrl->slot_device_offset);
duplicate_name:
retval = pci_hp_register(slot->hotplug_slot,
- ctrl->pci_dev->subordinate, slot->device);
+ ctrl->pci_dev->subordinate, slot->device,
+ hotplug_slot->name);
if (retval) {
/*
* If slot N already exists, we'll try to create
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -165,7 +165,8 @@ struct hotplug_slot {
};
#define to_hotplug_slot(n) container_of(n, struct hotplug_slot, kobj)

-extern int pci_hp_register(struct hotplug_slot *, struct pci_bus *, int nr);
+extern int pci_hp_register(struct hotplug_slot *, struct pci_bus *, int nr,
+ const char *name);
extern int pci_hp_deregister(struct hotplug_slot *slot);
extern int __must_check pci_hp_change_slot_info (struct hotplug_slot *slot,
struct hotplug_slot_info *info);

2008-12-03 19:55:58

by Greg KH

Subject: [patch 014/104] sysvipc: fix the ipc structures initialization

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Nadia Derbey <[email protected]>

commit e00b4ff7ebf098b11b11be403921c1cf41d9e321 upstream.

A problem was found while reviewing the code after Bugzilla bug
http://bugzilla.kernel.org/show_bug.cgi?id=11796.

In ipc_addid(), the newly allocated ipc structure is inserted into the
ipcs tree (i.e. made visible to readers) without locking it. This is not
correct since its initialization continues after it has been inserted in
the tree.

This patch moves the ipc structure lock initialization + locking before
the actual insertion.

Signed-off-by: Nadia Derbey <[email protected]>
Reported-by: Clement Calmels <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
ipc/util.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

--- a/ipc/util.c
+++ b/ipc/util.c
@@ -266,9 +266,17 @@ int ipc_addid(struct ipc_ids* ids, struc
if (ids->in_use >= size)
return -ENOSPC;

+ spin_lock_init(&new->lock);
+ new->deleted = 0;
+ rcu_read_lock();
+ spin_lock(&new->lock);
+
err = idr_get_new(&ids->ipcs_idr, new, &id);
- if (err)
+ if (err) {
+ spin_unlock(&new->lock);
+ rcu_read_unlock();
return err;
+ }

ids->in_use++;

@@ -280,10 +288,6 @@ int ipc_addid(struct ipc_ids* ids, struc
ids->seq = 0;

new->id = ipc_buildid(id, new->seq);
- spin_lock_init(&new->lock);
- new->deleted = 0;
- rcu_read_lock();
- spin_lock(&new->lock);
return id;
}
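
The ordering the patch establishes (take the object's lock first, publish second, drop the lock on the error path) can be shown with a single-threaded toy. The integer `locked` flag merely stands in for a real spinlock, and `toy_addid()` and `toy_trylock()` are hypothetical names; this sketches the ordering only and is not a working concurrency primitive.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 4

struct toy_ipc {
	int locked; /* stand-in for a real spinlock */
	int id;     /* filled in only after insertion */
};

static struct toy_ipc *toy_table[TOY_SLOTS];

/*
 * Insert 'new' into the table with its lock already held.
 * Returns the slot index, or -1 (lock dropped) when the table is full.
 */
static int toy_addid(struct toy_ipc *new)
{
	int i;

	new->locked = 1; /* lock before the object becomes visible */
	for (i = 0; i < TOY_SLOTS; i++) {
		if (!toy_table[i]) {
			toy_table[i] = new; /* now visible to readers */
			new->id = 100 + i;  /* initialization continues afterwards */
			return i;           /* caller unlocks when fully set up */
		}
	}
	new->locked = 0; /* error path must drop the lock */
	return -1;
}

/* Reader: returns the object only if its lock can be taken. */
static struct toy_ipc *toy_trylock(int slot)
{
	struct toy_ipc *obj = toy_table[slot];

	if (!obj || obj->locked)
		return NULL;
	obj->locked = 1;
	return obj;
}
```

A reader that finds the freshly inserted object cannot take it until the creator releases the lock, so it never sees a half-initialized structure.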

2008-12-03 19:55:37

by Greg KH

Subject: [patch 013/104] lib/scatterlist.c: fix kunmap() argument in sg_miter_stop()

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Arjan van de Ven <[email protected]>

commit f652c521e0bec2e70cf123f47e80117a7e6ed139 upstream.

kunmap() takes as argument the struct page that originally got kmap()'d,
however the sg_miter_stop() function passed it the kernel virtual address
instead, resulting in weird stuff.

Somehow I ended up fixing this bug by accident while looking for a bug in
the same area.

Reported-by: kerneloops.org
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
lib/scatterlist.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -395,7 +395,7 @@ void sg_miter_stop(struct sg_mapping_ite
WARN_ON(!irqs_disabled());
kunmap_atomic(miter->addr, KM_BIO_SRC_IRQ);
} else
- kunmap(miter->addr);
+ kunmap(miter->page);

miter->page = NULL;
miter->addr = NULL;

2008-12-03 20:01:55

by Greg KH

Subject: [patch 029/104] PCI: update pci_create_slot() to take a hotplug param

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 828f37683e6d3ab5912989df0d04201db7ad798e upstream.

Slot detection drivers can co-exist with hotplug drivers. The names
of the detected/claimed slots may be different depending on module
load order.

For legacy reasons, we need to allow hotplug drivers to override
the slot name if a detection driver is loaded first (and they find
the same slots).

Creating and overriding slot names should be an atomic operation,
otherwise you get a locking nightmare as various drivers race to
call pci_create_slot().

pci_create_slot() is already serialized by grabbing the pci_bus_sem.

We update the API and add a 'hotplug' param, which is:

set if the caller is a hotplug driver
NULL if the caller is a detection driver

pci_create_slot() does not actually use the 'hotplug' parameter in this
patch. A later patch will add the logic that uses it.

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/pci_slot.c | 2 +-
drivers/pci/hotplug/pci_hotplug_core.c | 2 +-
drivers/pci/slot.c | 4 +++-
include/linux/pci.h | 3 ++-
4 files changed, 7 insertions(+), 4 deletions(-)

--- a/drivers/acpi/pci_slot.c
+++ b/drivers/acpi/pci_slot.c
@@ -150,7 +150,7 @@ register_slot(acpi_handle handle, u32 lv
}

snprintf(name, sizeof(name), "%u", (u32)sun);
- pci_slot = pci_create_slot(pci_bus, device, name);
+ pci_slot = pci_create_slot(pci_bus, device, name, NULL);
if (IS_ERR(pci_slot)) {
err("pci_create_slot returned %ld\n", PTR_ERR(pci_slot));
kfree(slot);
--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -579,7 +579,7 @@ int pci_hp_register(struct hotplug_slot
* driver and call it here again. If we've already created the
* pci_slot, the interface will simply bump the refcount.
*/
- pci_slot = pci_create_slot(bus, slot_nr, name);
+ pci_slot = pci_create_slot(bus, slot_nr, name, slot);
if (IS_ERR(pci_slot))
return PTR_ERR(pci_slot);

--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -78,6 +78,7 @@ static struct kobj_type pci_slot_ktype =
* @parent: struct pci_bus of parent bridge
* @slot_nr: PCI_SLOT(pci_dev->devfn) or -1 for placeholder
* @name: user visible string presented in /sys/bus/pci/slots/<name>
+ * @hotplug: set if caller is hotplug driver, NULL otherwise
*
* PCI slots have first class attributes such as address, speed, width,
* and a &struct pci_slot is used to manage them. This interface will
@@ -106,7 +107,8 @@ static struct kobj_type pci_slot_ktype =
*/

struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
- const char *name)
+ const char *name,
+ struct hotplug_slot *hotplug)
{
struct pci_slot *slot;
int err;
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -509,7 +509,8 @@ struct pci_bus *pci_create_bus(struct de
struct pci_bus *pci_add_new_bus(struct pci_bus *parent, struct pci_dev *dev,
int busnr);
struct pci_slot *pci_create_slot(struct pci_bus *parent, int slot_nr,
- const char *name);
+ const char *name,
+ struct hotplug_slot *hotplug);
void pci_destroy_slot(struct pci_slot *slot);
void pci_update_slot_number(struct pci_slot *slot, int slot_nr);
int pci_scan_slot(struct pci_bus *bus, int devfn);

2008-12-03 20:02:44

by Greg KH

Subject: [patch 031/104] PCI: prevent duplicate slot names

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 5fe6cc60680d29740b85278e17a002fa27b7e642 upstream.

Prevent callers of pci_create_slot() from registering slots with
duplicate names. This condition occurs most often when PCI hotplug
drivers are loaded on platforms with broken firmware that assigns
identical names to multiple slots.

We now rename these duplicate slots on behalf of the user.

If firmware assigns the name N to multiple slots, then:

The first registered slot is assigned N
The second registered slot is assigned N-1
The third registered slot is assigned N-2
etc.
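
The naming scheme above can be sketched in userspace C. This is only an
illustration: the registered[] table and name_taken() are hypothetical
stand-ins for the kernel's slot kset lookup, and the buffer is sized once
up front instead of being grown inside the loop.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the set of slot names already in sysfs. */
#define MAX_SLOTS 16
static char *registered[MAX_SLOTS];
static int nr_registered;

static int name_taken(const char *name)
{
	for (int i = 0; i < nr_registered; i++)
		if (strcmp(registered[i], name) == 0)
			return 1;
	return 0;
}

/* Return a malloc'd name that is not yet taken: N, then N-1, N-2, ... */
static char *unique_slot_name(const char *name)
{
	/* Room for name, '-', up to 10 digits of a positive int, and NUL. */
	size_t len = strlen(name) + 1 + 10 + 1;
	char *new_name = malloc(len);
	int dup = 1;

	if (!new_name)
		return NULL;

	snprintf(new_name, len, "%s", name);
	while (name_taken(new_name))
		snprintf(new_name, len, "%s-%d", name, dup++);

	return new_name;
}

/* Record the name actually assigned and hand it back to the caller. */
static const char *register_slot_name(const char *name)
{
	char *unique;

	if (nr_registered == MAX_SLOTS)
		return NULL;

	unique = unique_slot_name(name);
	if (!unique)
		return NULL;

	registered[nr_registered++] = unique;
	return unique;
}
```

Registering the name "5" three times in this sketch yields "5", "5-1"
and "5-2", in that order.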

This is the permanent fix mentioned in earlier commits d6a9e9b4 and
167e782e (shpchp/pciehp: Rename duplicate slot name...).

We take advantage of the new 'hotplug' parameter in pci_create_slot()
to prevent a slot create/rename race between hotplug drivers and
detection drivers.

Scenario A:
  hotplug driver                  detection driver
  --------------                  ----------------
  pci_create_slot(hotplug=set)
                                  pci_create_slot(hotplug=NULL)

The hotplug driver creates the slot with its desired name, and then
releases the semaphore. Now, the detection driver tries to create
the same slot, but it already exists. We don't care about renaming,
so return the existing slot.

Scenario B:
  hotplug driver                  detection driver
  --------------                  ----------------
                                  pci_create_slot(hotplug=NULL)
  pci_create_slot(hotplug=set)

The detection driver creates the slot with name "X". Then the hotplug
driver tries to create the same slot, but wants the name "Y" instead.
We detect that we're trying to create the same slot and that we also
want a rename, so rename the slot to "Y" and return.

Scenario C:
  hotplug driver                  hotplug driver
  --------------                  ----------------
  pci_create_slot(hotplug=set)
                                  pci_create_slot(hotplug=set)

Two separate hotplug drivers are attempting to claim the slot and
are passing valid hotplug_slot args to pci_create_slot(). We detect
that the slot already has a ->hotplug callback, prevent a rename,
and return -EBUSY.
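
The three scenarios can be modelled with a small userspace sketch. The
types and the create_slot() helper here are hypothetical simplifications:
the slot is claimed inside create_slot() itself (the patch leaves that to
pci_hp_register()), no locking is shown, and the rename does not go
through duplicate-name handling.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct hotplug_slot. */
struct hp_owner {
	int id;
};

/* Hypothetical, simplified stand-in for struct pci_slot. */
struct slot_sketch {
	int in_use;
	int number;
	char name[32];
	const struct hp_owner *hotplug;	/* non-NULL once claimed */
};

#define NR_SLOTS 8
static struct slot_sketch slot_table[NR_SLOTS];

static struct slot_sketch *find_slot(int number)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (slot_table[i].in_use && slot_table[i].number == number)
			return &slot_table[i];
	return NULL;
}

/* Returns 0 and sets *out on success, or a negative errno value. */
static int create_slot(int number, const char *name,
		       const struct hp_owner *hotplug,
		       struct slot_sketch **out)
{
	struct slot_sketch *slot = find_slot(number);

	if (slot) {
		if (hotplug) {
			/* Scenario C: a hotplug driver already owns it. */
			if (slot->hotplug)
				return -EBUSY;
			/* Scenario B: the hotplug driver's name wins. */
			snprintf(slot->name, sizeof(slot->name), "%s", name);
			slot->hotplug = hotplug;
		}
		/* Scenario A: detection driver reuses the slot as-is. */
		*out = slot;
		return 0;
	}

	/* No such slot yet: create it with the requested name. */
	for (int i = 0; i < NR_SLOTS; i++) {
		if (slot_table[i].in_use)
			continue;
		slot = &slot_table[i];
		slot->in_use = 1;
		slot->number = number;
		slot->hotplug = hotplug;
		snprintf(slot->name, sizeof(slot->name), "%s", name);
		*out = slot;
		return 0;
	}
	return -ENOMEM;
}
```

In this model a detection-driver call never changes the name of an
existing slot, and only the first hotplug caller is allowed to rename it.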

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/pci_hotplug_core.c | 26 ------
drivers/pci/hotplug/pciehp_core.c | 14 ---
drivers/pci/hotplug/shpchp_core.c | 15 ---
drivers/pci/slot.c | 139 ++++++++++++++++++++++++++-------
4 files changed, 114 insertions(+), 80 deletions(-)

--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -191,7 +191,6 @@ static int init_slots(struct controller
struct slot *slot;
struct hotplug_slot *hotplug_slot;
struct hotplug_slot_info *info;
- int len, dup = 1;
int retval = -ENOMEM;

list_for_each_entry(slot, &ctrl->slot_list, slot_list) {
@@ -218,24 +217,11 @@ static int init_slots(struct controller
dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
"slot_device_offset=%x\n", slot->bus, slot->device,
slot->hp_slot, slot->number, ctrl->slot_device_offset);
-duplicate_name:
retval = pci_hp_register(hotplug_slot,
ctrl->pci_dev->subordinate,
slot->device,
slot->name);
if (retval) {
- /*
- * If slot N already exists, we'll try to create
- * slot N-1, N-2 ... N-M, until we overflow.
- */
- if (retval == -EEXIST) {
- len = snprintf(slot->name, SLOT_NAME_SIZE,
- "%d-%d", slot->number, dup++);
- if (len < SLOT_NAME_SIZE)
- goto duplicate_name;
- else
- err("duplicate slot name overflow\n");
- }
err("pci_hp_register failed with error %d\n", retval);
goto error_info;
}
--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -569,12 +569,6 @@ int pci_hp_register(struct hotplug_slot

mutex_lock(&pci_hp_mutex);

- /* Check if we have already registered a slot with the same name. */
- if (get_slot_from_name(name)) {
- result = -EEXIST;
- goto out;
- }
-
/*
* No problems if we call this interface from both ACPI_PCI_SLOT
* driver and call it here again. If we've already created the
@@ -583,27 +577,12 @@ int pci_hp_register(struct hotplug_slot
pci_slot = pci_create_slot(bus, slot_nr, name, slot);
if (IS_ERR(pci_slot)) {
result = PTR_ERR(pci_slot);
- goto cleanup;
- }
-
- if (pci_slot->hotplug) {
- dbg("%s: already claimed\n", __func__);
- result = -EBUSY;
- goto cleanup;
+ goto out;
}

slot->pci_slot = pci_slot;
pci_slot->hotplug = slot;

- /*
- * Allow pcihp drivers to override the ACPI_PCI_SLOT name.
- */
- if (strcmp(kobject_name(&pci_slot->kobj), name)) {
- result = kobject_rename(&pci_slot->kobj, name);
- if (result)
- goto cleanup;
- }
-
list_add(&slot->slot_list, &pci_hotplug_slot_list);

result = fs_add_slot(pci_slot);
@@ -612,9 +591,6 @@ int pci_hp_register(struct hotplug_slot
out:
mutex_unlock(&pci_hp_mutex);
return result;
-cleanup:
- pci_destroy_slot(pci_slot);
- goto out;
}

/**
--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -102,7 +102,7 @@ static int init_slots(struct controller
struct hotplug_slot *hotplug_slot;
struct hotplug_slot_info *info;
int retval = -ENOMEM;
- int i, len, dup = 1;
+ int i;

for (i = 0; i < ctrl->num_slots; i++) {
slot = kzalloc(sizeof(*slot), GFP_KERNEL);
@@ -144,23 +144,10 @@ static int init_slots(struct controller
dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
"slot_device_offset=%x\n", slot->bus, slot->device,
slot->hp_slot, slot->number, ctrl->slot_device_offset);
-duplicate_name:
retval = pci_hp_register(slot->hotplug_slot,
ctrl->pci_dev->subordinate, slot->device,
hotplug_slot->name);
if (retval) {
- /*
- * If slot N already exists, we'll try to create
- * slot N-1, N-2 ... N-M, until we overflow.
- */
- if (retval == -EEXIST) {
- len = snprintf(slot->name, SLOT_NAME_SIZE,
- "%d-%d", slot->number, dup++);
- if (len < SLOT_NAME_SIZE)
- goto duplicate_name;
- else
- err("duplicate slot name overflow\n");
- }
err("pci_hp_register failed with error %d\n", retval);
goto error_info;
}
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -73,6 +73,77 @@ static struct kobj_type pci_slot_ktype =
.default_attrs = pci_slot_default_attrs,
};

+static char *make_slot_name(const char *name)
+{
+ char *new_name;
+ int len, max, dup;
+
+ new_name = kstrdup(name, GFP_KERNEL);
+ if (!new_name)
+ return NULL;
+
+ /*
+ * Make sure we hit the realloc case the first time through the
+ * loop. 'len' will be strlen(name) + 3 at that point which is
+ * enough space for "name-X" and the trailing NUL.
+ */
+ len = strlen(name) + 2;
+ max = 1;
+ dup = 1;
+
+ for (;;) {
+ struct kobject *dup_slot;
+ dup_slot = kset_find_obj(pci_slots_kset, new_name);
+ if (!dup_slot)
+ break;
+ kobject_put(dup_slot);
+ if (dup == max) {
+ len++;
+ max *= 10;
+ kfree(new_name);
+ new_name = kmalloc(len, GFP_KERNEL);
+ if (!new_name)
+ break;
+ }
+ sprintf(new_name, "%s-%d", name, dup++);
+ }
+
+ return new_name;
+}
+
+static int rename_slot(struct pci_slot *slot, const char *name)
+{
+ int result = 0;
+ char *slot_name;
+
+ if (strcmp(kobject_name(&slot->kobj), name) == 0)
+ return result;
+
+ slot_name = make_slot_name(name);
+ if (!slot_name)
+ return -ENOMEM;
+
+ result = kobject_rename(&slot->kobj, slot_name);
+ kfree(slot_name);
+
+ return result;
+}
+
+static struct pci_slot *get_slot(struct pci_bus *parent, int slot_nr)
+{
+ struct pci_slot *slot;
+ /*
+ * We already hold pci_bus_sem so don't worry
+ */
+ list_for_each_entry(slot, &parent->slots, list)
+ if (slot->number == slot_nr) {
+ kobject_get(&slot->kobj);
+ return slot;
+ }
+
+ return NULL;
+}
+
/**
* pci_create_slot - create or increment refcount for physical PCI slot
* @parent: struct pci_bus of parent bridge
@@ -85,7 +156,17 @@ static struct kobj_type pci_slot_ktype =
* either return a new &struct pci_slot to the caller, or if the pci_slot
* already exists, its refcount will be incremented.
*
- * Slots are uniquely identified by a @pci_bus, @slot_nr, @name tuple.
+ * Slots are uniquely identified by a @pci_bus, @slot_nr tuple.
+ *
+ * There are known platforms with broken firmware that assign the same
+ * name to multiple slots. Workaround these broken platforms by renaming
+ * the slots on behalf of the caller. If firmware assigns name N to
+ * multiple slots:
+ *
+ * The first slot is assigned N
+ * The second slot is assigned N-1
+ * The third slot is assigned N-2
+ * etc.
*
* Placeholder slots:
* In most cases, @pci_bus, @slot_nr will be sufficient to uniquely identify
@@ -94,12 +175,8 @@ static struct kobj_type pci_slot_ktype =
* the slot. In this scenario, the caller may pass -1 for @slot_nr.
*
* The following semantics are imposed when the caller passes @slot_nr ==
- * -1. First, the check for existing %struct pci_slot is skipped, as the
- * caller may know about several unpopulated slots on a given %struct
- * pci_bus, and each slot would have a @slot_nr of -1. Uniqueness for
- * these slots is then determined by the @name parameter. We expect
- * kobject_init_and_add() to warn us if the caller attempts to create
- * multiple slots with the same name. The other change in semantics is
+ * -1. First, we no longer check for an existing %struct pci_slot, as there
+ * may be many slots with @slot_nr of -1. The other change in semantics is
* user-visible, which is the 'address' parameter presented in sysfs will
* consist solely of a dddd:bb tuple, where dddd is the PCI domain of the
* %struct pci_bus and bb is the bus number. In other words, the devfn of
@@ -111,44 +188,53 @@ struct pci_slot *pci_create_slot(struct
struct hotplug_slot *hotplug)
{
struct pci_slot *slot;
- int err;
+ int err = 0;
+ char *slot_name = NULL;

down_write(&pci_bus_sem);

if (slot_nr == -1)
goto placeholder;

- /* If we've already created this slot, bump refcount and return. */
- list_for_each_entry(slot, &parent->slots, list) {
- if (slot->number == slot_nr) {
- kobject_get(&slot->kobj);
- pr_debug("%s: inc refcount to %d on %04x:%02x:%02x\n",
- __func__,
- atomic_read(&slot->kobj.kref.refcount),
- pci_domain_nr(parent), parent->number,
- slot_nr);
- goto out;
+ /*
+ * Hotplug drivers are allowed to rename an existing slot,
+ * but only if not already claimed.
+ */
+ slot = get_slot(parent, slot_nr);
+ if (slot) {
+ if (hotplug) {
+ if ((err = slot->hotplug ? -EBUSY : 0)
+ || (err = rename_slot(slot, name))) {
+ kobject_put(&slot->kobj);
+ slot = NULL;
+ goto err;
+ }
}
+ goto out;
}

placeholder:
slot = kzalloc(sizeof(*slot), GFP_KERNEL);
if (!slot) {
- slot = ERR_PTR(-ENOMEM);
- goto out;
+ err = -ENOMEM;
+ goto err;
}

slot->bus = parent;
slot->number = slot_nr;

slot->kobj.kset = pci_slots_kset;
- err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL,
- "%s", name);
- if (err) {
- printk(KERN_ERR "Unable to register kobject %s\n", name);
+ slot_name = make_slot_name(name);
+ if (!slot_name) {
+ err = -ENOMEM;
goto err;
}

+ err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL,
+ "%s", slot_name);
+ if (err)
+ goto err;
+
INIT_LIST_HEAD(&slot->list);
list_add(&slot->list, &parent->slots);

@@ -156,10 +242,10 @@ placeholder:
pr_debug("%s: created pci_slot on %04x:%02x:%02x\n",
__func__, pci_domain_nr(parent), parent->number, slot_nr);

- out:
+out:
up_write(&pci_bus_sem);
return slot;
- err:
+err:
kfree(slot);
slot = ERR_PTR(err);
goto out;
@@ -205,7 +291,6 @@ EXPORT_SYMBOL_GPL(pci_update_slot_number
* just call kobject_put on its kobj and let our release methods do the
* rest.
*/
-
void pci_destroy_slot(struct pci_slot *slot)
{
pr_debug("%s: dec refcount to %d on %04x:%02x:%02x\n", __func__,

2008-12-03 20:02:29

by Greg KH

Subject: [patch 030/104] PCI Hotplug: serialize pci_hp_register and pci_hp_deregister


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Kenji Kaneshige <[email protected]>

commit 95cb9093960b6249fdbe7417bf513a1358aaa51a upstream.

Convert the pci_hotplug_slot_list_lock, which only protected the
list of hotplug slots, to a pci_hp_mutex which now protects both
interfaces.
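
As a rough userspace analogue of this locking change, the sketch below
holds a single mutex across both the lookup and the list update, so the
check and the modification cannot be interleaved with another caller.
It uses POSIX threads rather than the kernel mutex API, and hp_slot,
hp_register() and hp_deregister() are hypothetical names.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct hotplug_slot. */
struct hp_slot {
	const char *name;
	struct hp_slot *next;
};

static struct hp_slot *slot_list;
static pthread_mutex_t hp_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold hp_mutex. */
static struct hp_slot *find_by_name(const char *name)
{
	struct hp_slot *slot;

	for (slot = slot_list; slot; slot = slot->next)
		if (strcmp(slot->name, name) == 0)
			return slot;
	return NULL;
}

static int hp_register(struct hp_slot *slot)
{
	int result = 0;

	pthread_mutex_lock(&hp_mutex);
	if (find_by_name(slot->name)) {
		result = -EEXIST;
		goto out;
	}
	slot->next = slot_list;
	slot_list = slot;
out:
	pthread_mutex_unlock(&hp_mutex);
	return result;
}

static int hp_deregister(struct hp_slot *slot)
{
	struct hp_slot **pp;
	int result = -ENODEV;

	pthread_mutex_lock(&hp_mutex);
	for (pp = &slot_list; *pp; pp = &(*pp)->next) {
		if (*pp == slot) {
			*pp = slot->next;
			slot->next = NULL;
			result = 0;
			break;
		}
	}
	pthread_mutex_unlock(&hp_mutex);
	return result;
}
```

With a lock that only covered the list update, two callers could both
pass the name check before either one added its entry.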

Signed-off-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/pci_hotplug_core.c | 51 ++++++++++++++++++---------------
1 file changed, 28 insertions(+), 23 deletions(-)

--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -37,6 +37,7 @@
#include <linux/init.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/pci_hotplug.h>
#include <asm/uaccess.h>
@@ -61,7 +62,7 @@ static int debug;
//////////////////////////////////////////////////////////////////

static LIST_HEAD(pci_hotplug_slot_list);
-static DEFINE_SPINLOCK(pci_hotplug_slot_list_lock);
+static DEFINE_MUTEX(pci_hp_mutex);

/* these strings match up with the values in pci_bus_speed */
static char *pci_bus_speed_strings[] = {
@@ -530,16 +531,12 @@ static struct hotplug_slot *get_slot_fro
struct hotplug_slot *slot;
struct list_head *tmp;

- spin_lock(&pci_hotplug_slot_list_lock);
list_for_each (tmp, &pci_hotplug_slot_list) {
slot = list_entry (tmp, struct hotplug_slot, slot_list);
if (strcmp(slot->name, name) == 0)
- goto out;
+ return slot;
}
- slot = NULL;
-out:
- spin_unlock(&pci_hotplug_slot_list_lock);
- return slot;
+ return NULL;
}

/**
@@ -570,9 +567,13 @@ int pci_hp_register(struct hotplug_slot
return -EINVAL;
}

+ mutex_lock(&pci_hp_mutex);
+
/* Check if we have already registered a slot with the same name. */
- if (get_slot_from_name(name))
- return -EEXIST;
+ if (get_slot_from_name(name)) {
+ result = -EEXIST;
+ goto out;
+ }

/*
* No problems if we call this interface from both ACPI_PCI_SLOT
@@ -580,13 +581,15 @@ int pci_hp_register(struct hotplug_slot
* pci_slot, the interface will simply bump the refcount.
*/
pci_slot = pci_create_slot(bus, slot_nr, name, slot);
- if (IS_ERR(pci_slot))
- return PTR_ERR(pci_slot);
+ if (IS_ERR(pci_slot)) {
+ result = PTR_ERR(pci_slot);
+ goto cleanup;
+ }

if (pci_slot->hotplug) {
dbg("%s: already claimed\n", __func__);
- pci_destroy_slot(pci_slot);
- return -EBUSY;
+ result = -EBUSY;
+ goto cleanup;
}

slot->pci_slot = pci_slot;
@@ -597,21 +600,21 @@ int pci_hp_register(struct hotplug_slot
*/
if (strcmp(kobject_name(&pci_slot->kobj), name)) {
result = kobject_rename(&pci_slot->kobj, name);
- if (result) {
- pci_destroy_slot(pci_slot);
- return result;
- }
+ if (result)
+ goto cleanup;
}

- spin_lock(&pci_hotplug_slot_list_lock);
list_add(&slot->slot_list, &pci_hotplug_slot_list);
- spin_unlock(&pci_hotplug_slot_list_lock);

result = fs_add_slot(pci_slot);
kobject_uevent(&pci_slot->kobj, KOBJ_ADD);
dbg("Added slot %s to the list\n", name);
-
+out:
+ mutex_unlock(&pci_hp_mutex);
return result;
+cleanup:
+ pci_destroy_slot(pci_slot);
+ goto out;
}

/**
@@ -631,13 +634,14 @@ int pci_hp_deregister(struct hotplug_slo
if (!hotplug)
return -ENODEV;

+ mutex_lock(&pci_hp_mutex);
temp = get_slot_from_name(hotplug->name);
- if (temp != hotplug)
+ if (temp != hotplug) {
+ mutex_unlock(&pci_hp_mutex);
return -ENODEV;
+ }

- spin_lock(&pci_hotplug_slot_list_lock);
list_del(&hotplug->slot_list);
- spin_unlock(&pci_hotplug_slot_list_lock);

slot = hotplug->pci_slot;
fs_remove_slot(slot);
@@ -646,6 +650,7 @@ int pci_hp_deregister(struct hotplug_slo
hotplug->release(hotplug);
slot->hotplug = NULL;
pci_destroy_slot(slot);
+ mutex_unlock(&pci_hp_mutex);

return 0;
}

2008-12-03 20:03:08

by Greg KH

Subject: [patch 032/104] PCI, PCI Hotplug: introduce slot_name helpers

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 0ad772ec464d3fcf9d210836b97e654f393606c4 upstream

In preparation for cleaning up the various hotplug drivers
such that they don't have to manage their own 'name' parameters
anymore, we provide the following convenience functions:

pci_slot_name()
hotplug_slot_name()

These helpers will be used by individual hotplug drivers.
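
A minimal userspace sketch of why such accessors help: if the name lives
in exactly one place and every reader goes through a helper, a rename is
visible everywhere at once. All types and sketch_* names below are
hypothetical stand-ins, not the kernel structures.

```c
#include <stdio.h>

/* Hypothetical stand-in for struct kobject: it owns the name. */
struct kobj_sketch {
	char name[32];
};

struct pci_slot_sketch {
	struct kobj_sketch kobj;
};

struct hotplug_slot_sketch {
	struct pci_slot_sketch *pci_slot;
};

static inline const char *sketch_pci_slot_name(const struct pci_slot_sketch *slot)
{
	return slot->kobj.name;
}

static inline const char *sketch_hotplug_slot_name(const struct hotplug_slot_sketch *slot)
{
	return sketch_pci_slot_name(slot->pci_slot);
}

/* Renaming touches only the single stored copy of the name. */
static void sketch_rename(struct pci_slot_sketch *slot, const char *name)
{
	snprintf(slot->kobj.name, sizeof(slot->kobj.name), "%s", name);
}
```

A driver that kept a private copy of the name would go stale after a
rename; one that calls the helper would not.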

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/slot.c | 2 +-
include/linux/pci.h | 5 +++++
include/linux/pci_hotplug.h | 5 +++++
3 files changed, 11 insertions(+), 1 deletion(-)

--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -116,7 +116,7 @@ static int rename_slot(struct pci_slot *
int result = 0;
char *slot_name;

- if (strcmp(kobject_name(&slot->kobj), name) == 0)
+ if (strcmp(pci_slot_name(slot), name) == 0)
return result;

slot_name = make_slot_name(name);
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -64,6 +64,11 @@ struct pci_slot {
struct kobject kobj;
};

+static inline const char *pci_slot_name(const struct pci_slot *slot)
+{
+ return kobject_name(&slot->kobj);
+}
+
/* File state for mmap()s on /proc/bus/pci/X/Y */
enum pci_mmap_state {
pci_mmap_io,
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -165,6 +165,11 @@ struct hotplug_slot {
};
#define to_hotplug_slot(n) container_of(n, struct hotplug_slot, kobj)

+static inline const char *hotplug_slot_name(const struct hotplug_slot *slot)
+{
+ return pci_slot_name(slot->pci_slot);
+}
+
extern int pci_hp_register(struct hotplug_slot *, struct pci_bus *, int nr,
const char *name);
extern int pci_hp_deregister(struct hotplug_slot *slot);

2008-12-03 20:03:45

by Greg KH

Subject: [patch 034/104] PCI: cpci_hotplug: stop managing hotplug_slot->name

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit d6c479e0b777afcd7a26ca62e122e3f878ccc830 upstream.

We no longer need to manage our version of hotplug_slot->name
since the PCI and hotplug core manage it on our behalf.

Now, we simply advise the PCI core of the name that we would
like, and let the core take care of the rest.
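
For illustration, the requested name can be built in a small stack
buffer, as in the userspace sketch below. format_cpci_slot_name() is a
hypothetical helper, and taking both numbers as unsigned char is an
assumption made so that the "bb:ss" form always fits in SLOT_NAME_SIZE.

```c
#include <stdio.h>

#define SLOT_NAME_SIZE 6	/* "bb:ss" plus the trailing NUL */

/* Returns 0 on success, -1 if the formatted name did not fit. */
static int format_cpci_slot_name(char *buf, size_t size,
				 unsigned char bus_number,
				 unsigned char slot_number)
{
	int len = snprintf(buf, size, "%02x:%02x", bus_number, slot_number);

	return (len < 0 || (size_t)len >= size) ? -1 : 0;
}
```

The buffer only holds the request; the name the slot ends up with may
differ if the core has to resolve a duplicate.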

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/cpci_hotplug.h | 6 ++
drivers/pci/hotplug/cpci_hotplug_core.c | 76 ++++++++++++--------------------
drivers/pci/hotplug/cpci_hotplug_pci.c | 4 -
3 files changed, 37 insertions(+), 49 deletions(-)

--- a/drivers/pci/hotplug/cpci_hotplug_core.c
+++ b/drivers/pci/hotplug/cpci_hotplug_core.c
@@ -108,7 +108,7 @@ enable_slot(struct hotplug_slot *hotplug
struct slot *slot = hotplug_slot->private;
int retval = 0;

- dbg("%s - physical_slot = %s", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s", __func__, slot_name(slot));

if (controller->ops->set_power)
retval = controller->ops->set_power(slot, 1);
@@ -121,25 +121,23 @@ disable_slot(struct hotplug_slot *hotplu
struct slot *slot = hotplug_slot->private;
int retval = 0;

- dbg("%s - physical_slot = %s", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s", __func__, slot_name(slot));

down_write(&list_rwsem);

/* Unconfigure device */
- dbg("%s - unconfiguring slot %s",
- __func__, slot->hotplug_slot->name);
+ dbg("%s - unconfiguring slot %s", __func__, slot_name(slot));
if ((retval = cpci_unconfigure_slot(slot))) {
err("%s - could not unconfigure slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));
goto disable_error;
}
- dbg("%s - finished unconfiguring slot %s",
- __func__, slot->hotplug_slot->name);
+ dbg("%s - finished unconfiguring slot %s", __func__, slot_name(slot));

/* Clear EXT (by setting it) */
if (cpci_clear_ext(slot)) {
err("%s - could not clear EXT for slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));
retval = -ENODEV;
goto disable_error;
}
@@ -214,7 +212,6 @@ static void release_slot(struct hotplug_
struct slot *slot = hotplug_slot->private;

kfree(slot->hotplug_slot->info);
- kfree(slot->hotplug_slot->name);
kfree(slot->hotplug_slot);
if (slot->dev)
pci_dev_put(slot->dev);
@@ -222,12 +219,6 @@ static void release_slot(struct hotplug_
}

#define SLOT_NAME_SIZE 6
-static void
-make_slot_name(struct slot *slot)
-{
- snprintf(slot->hotplug_slot->name,
- SLOT_NAME_SIZE, "%02x:%02x", slot->bus->number, slot->number);
-}

int
cpci_hp_register_bus(struct pci_bus *bus, u8 first, u8 last)
@@ -235,7 +226,7 @@ cpci_hp_register_bus(struct pci_bus *bus
struct slot *slot;
struct hotplug_slot *hotplug_slot;
struct hotplug_slot_info *info;
- char *name;
+ char name[SLOT_NAME_SIZE];
int status = -ENOMEM;
int i;

@@ -262,35 +253,31 @@ cpci_hp_register_bus(struct pci_bus *bus
goto error_hpslot;
hotplug_slot->info = info;

- name = kmalloc(SLOT_NAME_SIZE, GFP_KERNEL);
- if (!name)
- goto error_info;
- hotplug_slot->name = name;
-
slot->bus = bus;
slot->number = i;
slot->devfn = PCI_DEVFN(i, 0);

+ snprintf(name, SLOT_NAME_SIZE, "%02x:%02x", bus->number, i);
+
hotplug_slot->private = slot;
hotplug_slot->release = &release_slot;
- make_slot_name(slot);
hotplug_slot->ops = &cpci_hotplug_slot_ops;

/*
* Initialize the slot info structure with some known
* good values.
*/
- dbg("initializing slot %s", slot->hotplug_slot->name);
+ dbg("initializing slot %s", name);
info->power_status = cpci_get_power_status(slot);
info->attention_status = cpci_get_attention_status(slot);

- dbg("registering slot %s", slot->hotplug_slot->name);
- status = pci_hp_register(slot->hotplug_slot, bus, i,
- slot->hotplug_slot->name);
+ dbg("registering slot %s", name);
+ status = pci_hp_register(slot->hotplug_slot, bus, i, name);
if (status) {
err("pci_hp_register failed with error %d", status);
- goto error_name;
+ goto error_info;
}
+ dbg("slot registered with name: %s", slot_name(slot));

/* Add slot to our internal list */
down_write(&list_rwsem);
@@ -299,8 +286,6 @@ cpci_hp_register_bus(struct pci_bus *bus
up_write(&list_rwsem);
}
return 0;
-error_name:
- kfree(name);
error_info:
kfree(info);
error_hpslot:
@@ -328,7 +313,7 @@ cpci_hp_unregister_bus(struct pci_bus *b
list_del(&slot->slot_list);
slots--;

- dbg("deregistering slot %s", slot->hotplug_slot->name);
+ dbg("deregistering slot %s", slot_name(slot));
status = pci_hp_deregister(slot->hotplug_slot);
if (status) {
err("pci_hp_deregister failed with error %d",
@@ -380,11 +365,10 @@ init_slots(int clear_ins)
return -1;
}
list_for_each_entry(slot, &slot_list, slot_list) {
- dbg("%s - looking at slot %s",
- __func__, slot->hotplug_slot->name);
+ dbg("%s - looking at slot %s", __func__, slot_name(slot));
if (clear_ins && cpci_check_and_clear_ins(slot))
dbg("%s - cleared INS for slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));
dev = pci_get_slot(slot->bus, PCI_DEVFN(slot->number, 0));
if (dev) {
if (update_adapter_status(slot->hotplug_slot, 1))
@@ -415,8 +399,7 @@ check_slots(void)
}
extracted = inserted = 0;
list_for_each_entry(slot, &slot_list, slot_list) {
- dbg("%s - looking at slot %s",
- __func__, slot->hotplug_slot->name);
+ dbg("%s - looking at slot %s", __func__, slot_name(slot));
if (cpci_check_and_clear_ins(slot)) {
/*
* Some broken hardware (e.g. PLX 9054AB) asserts
@@ -424,35 +407,34 @@ check_slots(void)
*/
if (slot->dev) {
warn("slot %s already inserted",
- slot->hotplug_slot->name);
+ slot_name(slot));
inserted++;
continue;
}

/* Process insertion */
- dbg("%s - slot %s inserted",
- __func__, slot->hotplug_slot->name);
+ dbg("%s - slot %s inserted", __func__, slot_name(slot));

/* GSM, debug */
hs_csr = cpci_get_hs_csr(slot);
dbg("%s - slot %s HS_CSR (1) = %04x",
- __func__, slot->hotplug_slot->name, hs_csr);
+ __func__, slot_name(slot), hs_csr);

/* Configure device */
dbg("%s - configuring slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));
if (cpci_configure_slot(slot)) {
err("%s - could not configure slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));
continue;
}
dbg("%s - finished configuring slot %s",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));

/* GSM, debug */
hs_csr = cpci_get_hs_csr(slot);
dbg("%s - slot %s HS_CSR (2) = %04x",
- __func__, slot->hotplug_slot->name, hs_csr);
+ __func__, slot_name(slot), hs_csr);

if (update_latch_status(slot->hotplug_slot, 1))
warn("failure to update latch file");
@@ -465,18 +447,18 @@ check_slots(void)
/* GSM, debug */
hs_csr = cpci_get_hs_csr(slot);
dbg("%s - slot %s HS_CSR (3) = %04x",
- __func__, slot->hotplug_slot->name, hs_csr);
+ __func__, slot_name(slot), hs_csr);

inserted++;
} else if (cpci_check_ext(slot)) {
/* Process extraction request */
dbg("%s - slot %s extracted",
- __func__, slot->hotplug_slot->name);
+ __func__, slot_name(slot));

/* GSM, debug */
hs_csr = cpci_get_hs_csr(slot);
dbg("%s - slot %s HS_CSR = %04x",
- __func__, slot->hotplug_slot->name, hs_csr);
+ __func__, slot_name(slot), hs_csr);

if (!slot->extracting) {
if (update_latch_status(slot->hotplug_slot, 0)) {
@@ -494,7 +476,7 @@ check_slots(void)
* bother trying to tell the driver or not?
*/
err("card in slot %s was improperly removed",
- slot->hotplug_slot->name);
+ slot_name(slot));
if (update_adapter_status(slot->hotplug_slot, 0))
warn("failure to update adapter file");
slot->extracting = 0;
--- a/drivers/pci/hotplug/cpci_hotplug.h
+++ b/drivers/pci/hotplug/cpci_hotplug.h
@@ -30,6 +30,7 @@

#include <linux/types.h>
#include <linux/pci.h>
+#include <linux/pci_hotplug.h>

/* PICMG 2.1 R2.0 HS CSR bits: */
#define HS_CSR_INS 0x0080
@@ -69,6 +70,11 @@ struct cpci_hp_controller {
struct cpci_hp_controller_ops *ops;
};

+static inline const char *slot_name(struct slot *slot)
+{
+ return hotplug_slot_name(slot->hotplug_slot);
+}
+
extern int cpci_hp_register_controller(struct cpci_hp_controller *controller);
extern int cpci_hp_unregister_controller(struct cpci_hp_controller *controller);
extern int cpci_hp_register_bus(struct pci_bus *bus, u8 first, u8 last);
--- a/drivers/pci/hotplug/cpci_hotplug_pci.c
+++ b/drivers/pci/hotplug/cpci_hotplug_pci.c
@@ -209,7 +209,7 @@ int cpci_led_on(struct slot* slot)
hs_cap + 2,
hs_csr)) {
err("Could not set LOO for slot %s",
- slot->hotplug_slot->name);
+ hotplug_slot_name(slot->hotplug_slot));
return -ENODEV;
}
}
@@ -238,7 +238,7 @@ int cpci_led_off(struct slot* slot)
hs_cap + 2,
hs_csr)) {
err("Could not clear LOO for slot %s",
- slot->hotplug_slot->name);
+ hotplug_slot_name(slot->hotplug_slot));
return -ENODEV;
}
}

2008-12-03 20:03:27

by Greg KH

Subject: [patch 033/104] PCI: acpiphp: remove name parameter

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit df77cd10078e36e1b89964e5e8c206add399a98d upstream.

We do not need to manage our own name parameter, especially since
the PCI core can change it on our behalf, in the case of duplicate
slot names.

Remove 'name' from acpiphp's version of struct slot.

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/acpiphp.h | 9 +++++----
drivers/pci/hotplug/acpiphp_core.c | 31 ++++++++++++++++---------------
2 files changed, 21 insertions(+), 19 deletions(-)

--- a/drivers/pci/hotplug/acpiphp_core.c
+++ b/drivers/pci/hotplug/acpiphp_core.c
@@ -44,6 +44,9 @@

#define MY_NAME "acpiphp"

+/* name size which is used for entries in pcihpfs */
+#define SLOT_NAME_SIZE 21 /* {_SUN} */
+
static int debug;
int acpiphp_debug;

@@ -84,7 +87,6 @@ static struct hotplug_slot_ops acpi_hotp
.get_adapter_status = get_adapter_status,
};

-
/**
* acpiphp_register_attention - set attention LED callback
* @info: must be completely filled with LED callbacks
@@ -136,7 +138,7 @@ static int enable_slot(struct hotplug_sl
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

/* enable the specified slot */
return acpiphp_enable_slot(slot->acpi_slot);
@@ -154,7 +156,7 @@ static int disable_slot(struct hotplug_s
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

/* disable the specified slot */
retval = acpiphp_disable_slot(slot->acpi_slot);
@@ -177,7 +179,7 @@ static int disable_slot(struct hotplug_s
{
int retval = -ENODEV;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, hotplug_slot_name(hotplug_slot));

if (attention_info && try_module_get(attention_info->owner)) {
retval = attention_info->set_attn(hotplug_slot, status);
@@ -200,7 +202,7 @@ static int get_power_status(struct hotpl
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = acpiphp_get_power_status(slot->acpi_slot);

@@ -222,7 +224,7 @@ static int get_attention_status(struct h
{
int retval = -EINVAL;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, hotplug_slot_name(hotplug_slot));

if (attention_info && try_module_get(attention_info->owner)) {
retval = attention_info->get_attn(hotplug_slot, value);
@@ -245,7 +247,7 @@ static int get_latch_status(struct hotpl
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = acpiphp_get_latch_status(slot->acpi_slot);

@@ -265,7 +267,7 @@ static int get_adapter_status(struct hot
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = acpiphp_get_adapter_status(slot->acpi_slot);

@@ -299,7 +301,7 @@ static void release_slot(struct hotplug_
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

kfree(slot->hotplug_slot);
kfree(slot);
@@ -310,6 +312,7 @@ int acpiphp_register_hotplug_slot(struct
{
struct slot *slot;
int retval = -ENOMEM;
+ char name[SLOT_NAME_SIZE];

slot = kzalloc(sizeof(*slot), GFP_KERNEL);
if (!slot)
@@ -321,8 +324,6 @@ int acpiphp_register_hotplug_slot(struct

slot->hotplug_slot->info = &slot->info;

- slot->hotplug_slot->name = slot->name;
-
slot->hotplug_slot->private = slot;
slot->hotplug_slot->release = &release_slot;
slot->hotplug_slot->ops = &acpi_hotplug_slot_ops;
@@ -336,12 +337,12 @@ int acpiphp_register_hotplug_slot(struct
slot->hotplug_slot->info->cur_bus_speed = PCI_SPEED_UNKNOWN;

acpiphp_slot->slot = slot;
- snprintf(slot->name, sizeof(slot->name), "%u", slot->acpi_slot->sun);
+ snprintf(name, SLOT_NAME_SIZE, "%u", slot->acpi_slot->sun);

retval = pci_hp_register(slot->hotplug_slot,
acpiphp_slot->bridge->pci_bus,
acpiphp_slot->device,
- slot->name);
+ name);
if (retval == -EBUSY)
goto error_hpslot;
if (retval) {
@@ -349,7 +350,7 @@ int acpiphp_register_hotplug_slot(struct
goto error_hpslot;
}

- info("Slot [%s] registered\n", slot->hotplug_slot->name);
+ info("Slot [%s] registered\n", slot_name(slot));

return 0;
error_hpslot:
@@ -366,7 +367,7 @@ void acpiphp_unregister_hotplug_slot(str
struct slot *slot = acpiphp_slot->slot;
int retval = 0;

- info ("Slot [%s] unregistered\n", slot->hotplug_slot->name);
+ info("Slot [%s] unregistered\n", slot_name(slot));

retval = pci_hp_deregister(slot->hotplug_slot);
if (retval)
--- a/drivers/pci/hotplug/acpiphp.h
+++ b/drivers/pci/hotplug/acpiphp.h
@@ -50,9 +50,6 @@
#define info(format, arg...) printk(KERN_INFO "%s: " format, MY_NAME , ## arg)
#define warn(format, arg...) printk(KERN_WARNING "%s: " format, MY_NAME , ## arg)

-/* name size which is used for entries in pcihpfs */
-#define SLOT_NAME_SIZE 20 /* {_SUN} */
-
struct acpiphp_bridge;
struct acpiphp_slot;

@@ -63,9 +60,13 @@ struct slot {
struct hotplug_slot *hotplug_slot;
struct acpiphp_slot *acpi_slot;
struct hotplug_slot_info info;
- char name[SLOT_NAME_SIZE];
};

+static inline const char *slot_name(struct slot *slot)
+{
+ return hotplug_slot_name(slot->hotplug_slot);
+}
+
/*
* struct acpiphp_bridge - PCI bridge information
*

2008-12-03 20:04:09

by Greg KH

Subject: [patch 035/104] PCI: cpqphp: stop managing hotplug_slot->name

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 30ac7acd05d1449ac784de144c4b5237be25b0b4 upstream.

We no longer need to manage our version of hotplug_slot->name
since the PCI and hotplug core manage it on our behalf.

Now, we simply advise the PCI core of the name that we would
like, and let the core take care of the rest.
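
The ownership change described above can be illustrated with a small
user-space sketch. This is not the kernel API: every mock_* name and
CORE_NAME_MAX are hypothetical stand-ins, and only SLOT_NAME_SIZE and the
snprintf() call mirror the patch. The driver formats its preferred name
into a stack buffer, hands it to the registration call, and afterwards
reads the name back only through an accessor.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SLOT_NAME_SIZE 10	/* same value the cpqphp patch uses */
#define CORE_NAME_MAX  32	/* assumption: arbitrary size for the mock core */

/* Hypothetical stand-in for the core-owned slot object. */
struct mock_core_slot {
	char name[CORE_NAME_MAX];
};

/* Hypothetical stand-in for a hotplug slot that has no name field of its own. */
struct mock_hotplug_slot {
	struct mock_core_slot core;
};

/* Mock registration: the "core" copies the requested name into its own storage. */
static int mock_hp_register(struct mock_hotplug_slot *slot, const char *name)
{
	assert(slot && name);
	snprintf(slot->core.name, sizeof(slot->core.name), "%s", name);
	return 0;
}

/* Mock accessor, playing the role hotplug_slot_name() plays in the patch. */
static const char *mock_hotplug_slot_name(const struct mock_hotplug_slot *slot)
{
	return slot->core.name;
}

/* Driver side: the name lives on the stack only until registration returns. */
static int mock_driver_setup(struct mock_hotplug_slot *slot, unsigned int number)
{
	char name[SLOT_NAME_SIZE];

	snprintf(name, sizeof(name), "%u", number);
	return mock_hp_register(slot, name);
}
```

Because the mock core keeps its own copy, the driver has nothing to
allocate or free for the name, which is why the error_name label and the
kfree() of the name can go away in the diff below.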

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/cpqphp.h | 13 ++++-------
drivers/pci/hotplug/cpqphp_core.c | 42 +++++++++++++++++---------------------
2 files changed, 24 insertions(+), 31 deletions(-)

--- a/drivers/pci/hotplug/cpqphp_core.c
+++ b/drivers/pci/hotplug/cpqphp_core.c
@@ -315,14 +315,15 @@ static void release_slot(struct hotplug_
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

kfree(slot->hotplug_slot->info);
- kfree(slot->hotplug_slot->name);
kfree(slot->hotplug_slot);
kfree(slot);
}

+#define SLOT_NAME_SIZE 10
+
static int ctrl_slot_setup(struct controller *ctrl,
void __iomem *smbios_start,
void __iomem *smbios_table)
@@ -335,6 +336,7 @@ static int ctrl_slot_setup(struct contro
u8 slot_number;
u8 ctrl_slot;
u32 tempdword;
+ char name[SLOT_NAME_SIZE];
void __iomem *slot_entry= NULL;
int result = -ENOMEM;

@@ -363,16 +365,12 @@ static int ctrl_slot_setup(struct contro
if (!hotplug_slot->info)
goto error_hpslot;
hotplug_slot_info = hotplug_slot->info;
- hotplug_slot->name = kmalloc(SLOT_NAME_SIZE, GFP_KERNEL);
-
- if (!hotplug_slot->name)
- goto error_info;

slot->ctrl = ctrl;
slot->bus = ctrl->bus;
slot->device = slot_device;
slot->number = slot_number;
- dbg("slot->number = %d\n", slot->number);
+ dbg("slot->number = %u\n", slot->number);

slot_entry = get_SMBIOS_entry(smbios_start, smbios_table, 9,
slot_entry);
@@ -418,9 +416,9 @@ static int ctrl_slot_setup(struct contro
/* register this slot with the hotplug pci core */
hotplug_slot->release = &release_slot;
hotplug_slot->private = slot;
- make_slot_name(hotplug_slot->name, SLOT_NAME_SIZE, slot);
+ snprintf(name, SLOT_NAME_SIZE, "%u", slot->number);
hotplug_slot->ops = &cpqphp_hotplug_slot_ops;
-
+
hotplug_slot_info->power_status = get_slot_enabled(ctrl, slot);
hotplug_slot_info->attention_status =
cpq_get_attention_status(ctrl, slot);
@@ -437,10 +435,10 @@ static int ctrl_slot_setup(struct contro
result = pci_hp_register(hotplug_slot,
ctrl->pci_dev->bus,
slot->device,
- hotplug_slot->name);
+ name);
if (result) {
err("pci_hp_register failed with error %d\n", result);
- goto error_name;
+ goto error_info;
}

slot->next = ctrl->slot;
@@ -452,8 +450,6 @@ static int ctrl_slot_setup(struct contro
}

return 0;
-error_name:
- kfree(hotplug_slot->name);
error_info:
kfree(hotplug_slot_info);
error_hpslot:
@@ -639,7 +635,7 @@ static int set_attention_status (struct
u8 device;
u8 function;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

if (cpqhp_get_bus_dev(ctrl, &bus, &devfn, slot->number) == -1)
return -ENODEV;
@@ -666,7 +662,7 @@ static int process_SI(struct hotplug_slo
u8 device;
u8 function;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

if (cpqhp_get_bus_dev(ctrl, &bus, &devfn, slot->number) == -1)
return -ENODEV;
@@ -698,7 +694,7 @@ static int process_SS(struct hotplug_slo
u8 device;
u8 function;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

if (cpqhp_get_bus_dev(ctrl, &bus, &devfn, slot->number) == -1)
return -ENODEV;
@@ -721,7 +717,7 @@ static int hardware_test(struct hotplug_
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

return cpqhp_hardware_test(ctrl, value);
}
@@ -732,7 +728,7 @@ static int get_power_status(struct hotpl
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = get_slot_enabled(ctrl, slot);
return 0;
@@ -743,7 +739,7 @@ static int get_attention_status(struct h
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = cpq_get_attention_status(ctrl, slot);
return 0;
@@ -754,7 +750,7 @@ static int get_latch_status(struct hotpl
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = cpq_get_latch_status(ctrl, slot);

@@ -766,7 +762,7 @@ static int get_adapter_status(struct hot
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = get_presence_status(ctrl, slot);

@@ -778,7 +774,7 @@ static int get_max_bus_speed (struct hot
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = ctrl->speed_capability;

@@ -790,7 +786,7 @@ static int get_cur_bus_speed (struct hot
struct slot *slot = hotplug_slot->private;
struct controller *ctrl = slot->ctrl;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

*value = ctrl->speed;

--- a/drivers/pci/hotplug/cpqphp.h
+++ b/drivers/pci/hotplug/cpqphp.h
@@ -449,6 +449,11 @@ extern u8 cpqhp_disk_irq;

/* inline functions */

+static inline char *slot_name(struct slot *slot)
+{
+ return hotplug_slot_name(slot->hotplug_slot);
+}
+
/*
* return_resource
*
@@ -696,14 +701,6 @@ static inline int get_presence_status(st
return presence_save;
}

-#define SLOT_NAME_SIZE 10
-
-static inline void make_slot_name(char *buffer, int buffer_size, struct slot *slot)
-{
- snprintf(buffer, buffer_size, "%d", slot->number);
-}
-
-
static inline int wait_for_ctrl_irq(struct controller *ctrl)
{
DECLARE_WAITQUEUE(wait, current);

2008-12-03 20:04:31

by Greg KH

Subject: [patch 036/104] PCI: fakephp: remove name parameter

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 43caae884b5a5e2eacb4879225341cb49700e129 upstream.

Remove 'name' from fakephp's struct dummy_slot, as the PCI core
will now manage our slot name for us.

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/fakephp.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

--- a/drivers/pci/hotplug/fakephp.c
+++ b/drivers/pci/hotplug/fakephp.c
@@ -66,7 +66,6 @@ struct dummy_slot {
struct pci_dev *dev;
struct work_struct remove_work;
unsigned long removed;
- char name[8];
};

static int debug;
@@ -96,10 +95,13 @@ static void dummy_release(struct hotplug
kfree(dslot);
}

+#define SLOT_NAME_SIZE 8
+
static int add_slot(struct pci_dev *dev)
{
struct dummy_slot *dslot;
struct hotplug_slot *slot;
+ char name[SLOT_NAME_SIZE];
int retval = -ENOMEM;
static int count = 1;

@@ -119,20 +121,18 @@ static int add_slot(struct pci_dev *dev)
if (!dslot)
goto error_info;

- slot->name = dslot->name;
- snprintf(slot->name, sizeof(dslot->name), "fake%d", count++);
- dbg("slot->name = %s\n", slot->name);
+ snprintf(name, SLOT_NAME_SIZE, "fake%d", count++);
slot->ops = &dummy_hotplug_slot_ops;
slot->release = &dummy_release;
slot->private = dslot;

- retval = pci_hp_register(slot, dev->bus, PCI_SLOT(dev->devfn),
- slot->name);
+ retval = pci_hp_register(slot, dev->bus, PCI_SLOT(dev->devfn), name);
if (retval) {
err("pci_hp_register failed with error %d\n", retval);
goto error_dslot;
}

+ dbg("slot->name = %s\n", hotplug_slot_name(slot));
dslot->slot = slot;
dslot->dev = pci_dev_get(dev);
list_add (&dslot->node, &slot_list);
@@ -168,10 +168,11 @@ static void remove_slot(struct dummy_slo
{
int retval;

- dbg("removing slot %s\n", dslot->slot->name);
+ dbg("removing slot %s\n", hotplug_slot_name(dslot->slot));
retval = pci_hp_deregister(dslot->slot);
if (retval)
- err("Problem unregistering a slot %s\n", dslot->slot->name);
+ err("Problem unregistering a slot %s\n",
+ hotplug_slot_name(dslot->slot));
}

/* called from the single-threaded workqueue handler to remove a slot */
@@ -309,7 +310,7 @@ static int disable_slot(struct hotplug_s
return -ENODEV;
dslot = slot->private;

- dbg("%s - physical_slot = %s\n", __func__, slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, hotplug_slot_name(slot));

for (func = 7; func >= 0; func--) {
dev = pci_get_slot(dslot->dev->bus, dslot->dev->devfn + func);

2008-12-03 20:05:12

by Greg KH

Subject: [patch 038/104] PCI: pciehp: remove name parameter

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit e1acb24f059defdaa0264e925f19cc21b0a3e592 upstream.

We do not need to manage our own name parameter, especially since
the PCI core can change it on our behalf, in the case of duplicate
slot names.
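
To see why a driver-private copy of the name can go stale, consider the
following user-space sketch of a registry that renames duplicates. It is
an illustration only: the mock_* names are hypothetical, and the "-N"
suffix scheme is an assumption made for the example rather than something
this patch states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MOCK_MAX_SLOTS 8
#define MOCK_NAME_MAX  32

/* Hypothetical registry standing in for the core's list of slot names. */
struct mock_registry {
	char names[MOCK_MAX_SLOTS][MOCK_NAME_MAX];
	int count;
};

static int mock_name_in_use(const struct mock_registry *reg, const char *name)
{
	for (int i = 0; i < reg->count; i++)
		if (strcmp(reg->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Register a slot under the requested name. On a collision, append an
 * increasing "-N" suffix (assumed scheme) until the name is unique.
 * Returns the final name, owned by the registry, or NULL if it is full.
 */
static const char *mock_register_slot(struct mock_registry *reg,
				      const char *requested)
{
	char candidate[MOCK_NAME_MAX];
	int suffix = 0;

	assert(reg && requested);
	if (reg->count >= MOCK_MAX_SLOTS)
		return NULL;

	snprintf(candidate, sizeof(candidate), "%s", requested);
	while (mock_name_in_use(reg, candidate))
		snprintf(candidate, sizeof(candidate), "%s-%d",
			 requested, ++suffix);

	strcpy(reg->names[reg->count], candidate);
	return reg->names[reg->count++];
}
```

A driver that cached its requested string would keep printing the
original name for a slot the registry had renamed. Reading the name back
through an accessor, as slot_name() does in this patch, avoids that
mismatch.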

Remove 'name' from pciehp's version of struct slot, and remove
the unused 'task_event' timer as well.

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/pciehp.h | 9 ++++---
drivers/pci/hotplug/pciehp_core.c | 34 ++++++++++++++------------
drivers/pci/hotplug/pciehp_ctrl.c | 48 +++++++++++++++++++-------------------
drivers/pci/hotplug/pciehp_hpc.c | 1
4 files changed, 48 insertions(+), 44 deletions(-)

--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -180,7 +180,8 @@ static struct hotplug_slot_attribute hot
*/
static void release_slot(struct hotplug_slot *hotplug_slot)
{
- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__,
+ hotplug_slot_name(hotplug_slot));

kfree(hotplug_slot->info);
kfree(hotplug_slot);
@@ -191,6 +192,7 @@ static int init_slots(struct controller
struct slot *slot;
struct hotplug_slot *hotplug_slot;
struct hotplug_slot_info *info;
+ char name[SLOT_NAME_SIZE];
int retval = -ENOMEM;

list_for_each_entry(slot, &ctrl->slot_list, slot_list) {
@@ -204,15 +206,11 @@ static int init_slots(struct controller

/* register this slot with the hotplug pci core */
hotplug_slot->info = info;
- hotplug_slot->name = slot->name;
hotplug_slot->private = slot;
hotplug_slot->release = &release_slot;
hotplug_slot->ops = &pciehp_hotplug_slot_ops;
- get_power_status(hotplug_slot, &info->power_status);
- get_attention_status(hotplug_slot, &info->attention_status);
- get_latch_status(hotplug_slot, &info->latch_status);
- get_adapter_status(hotplug_slot, &info->adapter_status);
slot->hotplug_slot = hotplug_slot;
+ snprintf(name, SLOT_NAME_SIZE, "%u", slot->number);

dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
"slot_device_offset=%x\n", slot->bus, slot->device,
@@ -220,11 +218,15 @@ static int init_slots(struct controller
retval = pci_hp_register(hotplug_slot,
ctrl->pci_dev->subordinate,
slot->device,
- slot->name);
+ name);
if (retval) {
err("pci_hp_register failed with error %d\n", retval);
goto error_info;
}
+ get_power_status(hotplug_slot, &info->power_status);
+ get_attention_status(hotplug_slot, &info->attention_status);
+ get_latch_status(hotplug_slot, &info->latch_status);
+ get_adapter_status(hotplug_slot, &info->adapter_status);
/* create additional sysfs entries */
if (EMI(ctrl)) {
retval = sysfs_create_file(&hotplug_slot->pci_slot->kobj,
@@ -265,7 +267,7 @@ static int set_attention_status(struct h
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

hotplug_slot->info->attention_status = status;

@@ -280,7 +282,7 @@ static int enable_slot(struct hotplug_sl
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

return pciehp_sysfs_enable_slot(slot);
}
@@ -290,7 +292,7 @@ static int disable_slot(struct hotplug_s
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

return pciehp_sysfs_disable_slot(slot);
}
@@ -300,7 +302,7 @@ static int get_power_status(struct hotpl
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_power_status(slot, value);
if (retval < 0)
@@ -314,7 +316,7 @@ static int get_attention_status(struct h
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_attention_status(slot, value);
if (retval < 0)
@@ -328,7 +330,7 @@ static int get_latch_status(struct hotpl
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_latch_status(slot, value);
if (retval < 0)
@@ -342,7 +344,7 @@ static int get_adapter_status(struct hot
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_adapter_status(slot, value);
if (retval < 0)
@@ -357,7 +359,7 @@ static int get_max_bus_speed(struct hotp
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_max_bus_speed(slot, value);
if (retval < 0)
@@ -371,7 +373,7 @@ static int get_cur_bus_speed(struct hotp
struct slot *slot = hotplug_slot->private;
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_cur_bus_speed(slot, value);
if (retval < 0)
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -65,7 +65,7 @@ u8 pciehp_handle_attention_button(struct
/*
* Button pressed - See if need to TAKE ACTION!!!
*/
- info("Button pressed on Slot(%s)\n", p_slot->name);
+ info("Button pressed on Slot(%s)\n", slot_name(p_slot));
event_type = INT_BUTTON_PRESS;

queue_interrupt_event(p_slot, event_type);
@@ -86,13 +86,13 @@ u8 pciehp_handle_switch_change(struct sl
/*
* Switch opened
*/
- info("Latch open on Slot(%s)\n", p_slot->name);
+ info("Latch open on Slot(%s)\n", slot_name(p_slot));
event_type = INT_SWITCH_OPEN;
} else {
/*
* Switch closed
*/
- info("Latch close on Slot(%s)\n", p_slot->name);
+ info("Latch close on Slot(%s)\n", slot_name(p_slot));
event_type = INT_SWITCH_CLOSE;
}

@@ -117,13 +117,13 @@ u8 pciehp_handle_presence_change(struct
/*
* Card Present
*/
- info("Card present on Slot(%s)\n", p_slot->name);
+ info("Card present on Slot(%s)\n", slot_name(p_slot));
event_type = INT_PRESENCE_ON;
} else {
/*
* Not Present
*/
- info("Card not present on Slot(%s)\n", p_slot->name);
+ info("Card not present on Slot(%s)\n", slot_name(p_slot));
event_type = INT_PRESENCE_OFF;
}

@@ -143,13 +143,13 @@ u8 pciehp_handle_power_fault(struct slot
/*
* power fault Cleared
*/
- info("Power fault cleared on Slot(%s)\n", p_slot->name);
+ info("Power fault cleared on Slot(%s)\n", slot_name(p_slot));
event_type = INT_POWER_FAULT_CLEAR;
} else {
/*
* power fault
*/
- info("Power fault on Slot(%s)\n", p_slot->name);
+ info("Power fault on Slot(%s)\n", slot_name(p_slot));
event_type = INT_POWER_FAULT;
info("power fault bit %x set\n", 0);
}
@@ -404,11 +404,11 @@ static void handle_button_press_event(st
if (getstatus) {
p_slot->state = BLINKINGOFF_STATE;
info("PCI slot #%s - powering off due to button "
- "press.\n", p_slot->name);
+ "press.\n", slot_name(p_slot));
} else {
p_slot->state = BLINKINGON_STATE;
info("PCI slot #%s - powering on due to button "
- "press.\n", p_slot->name);
+ "press.\n", slot_name(p_slot));
}
/* blink green LED and turn off amber */
if (PWR_LED(ctrl))
@@ -425,7 +425,7 @@ static void handle_button_press_event(st
* press the attention again before the 5 sec. limit
* expires to cancel hot-add or hot-remove
*/
- info("Button cancel on Slot(%s)\n", p_slot->name);
+ info("Button cancel on Slot(%s)\n", slot_name(p_slot));
dbg("%s: button cancel\n", __func__);
cancel_delayed_work(&p_slot->work);
if (p_slot->state == BLINKINGOFF_STATE) {
@@ -438,7 +438,7 @@ static void handle_button_press_event(st
if (ATTN_LED(ctrl))
p_slot->hpc_ops->set_attention_status(p_slot, 0);
info("PCI slot #%s - action canceled due to button press\n",
- p_slot->name);
+ slot_name(p_slot));
p_slot->state = STATIC_STATE;
break;
case POWEROFF_STATE:
@@ -448,7 +448,7 @@ static void handle_button_press_event(st
* this means that the previous attention button action
* to hot-add or hot-remove is undergoing
*/
- info("Button ignore on Slot(%s)\n", p_slot->name);
+ info("Button ignore on Slot(%s)\n", slot_name(p_slot));
update_slot_info(p_slot);
break;
default:
@@ -529,7 +529,7 @@ int pciehp_enable_slot(struct slot *p_sl
rc = p_slot->hpc_ops->get_adapter_status(p_slot, &getstatus);
if (rc || !getstatus) {
info("%s: no adapter on slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -ENODEV;
}
@@ -537,7 +537,7 @@ int pciehp_enable_slot(struct slot *p_sl
rc = p_slot->hpc_ops->get_latch_status(p_slot, &getstatus);
if (rc || getstatus) {
info("%s: latch open on slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -ENODEV;
}
@@ -547,7 +547,7 @@ int pciehp_enable_slot(struct slot *p_sl
rc = p_slot->hpc_ops->get_power_status(p_slot, &getstatus);
if (rc || getstatus) {
info("%s: already enabled on slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -EINVAL;
}
@@ -582,7 +582,7 @@ int pciehp_disable_slot(struct slot *p_s
ret = p_slot->hpc_ops->get_adapter_status(p_slot, &getstatus);
if (ret || !getstatus) {
info("%s: no adapter on slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -ENODEV;
}
@@ -592,7 +592,7 @@ int pciehp_disable_slot(struct slot *p_s
ret = p_slot->hpc_ops->get_latch_status(p_slot, &getstatus);
if (ret || getstatus) {
info("%s: latch open on slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -ENODEV;
}
@@ -602,7 +602,7 @@ int pciehp_disable_slot(struct slot *p_s
ret = p_slot->hpc_ops->get_power_status(p_slot, &getstatus);
if (ret || !getstatus) {
info("%s: already disabled slot(%s)\n", __func__,
- p_slot->name);
+ slot_name(p_slot));
mutex_unlock(&p_slot->ctrl->crit_sect);
return -EINVAL;
}
@@ -632,14 +632,14 @@ int pciehp_sysfs_enable_slot(struct slot
break;
case POWERON_STATE:
info("Slot %s is already in powering on state\n",
- p_slot->name);
+ slot_name(p_slot));
break;
case BLINKINGOFF_STATE:
case POWEROFF_STATE:
- info("Already enabled on slot %s\n", p_slot->name);
+ info("Already enabled on slot %s\n", slot_name(p_slot));
break;
default:
- err("Not a valid state on slot %s\n", p_slot->name);
+ err("Not a valid state on slot %s\n", slot_name(p_slot));
break;
}
mutex_unlock(&p_slot->lock);
@@ -664,14 +664,14 @@ int pciehp_sysfs_disable_slot(struct slo
break;
case POWEROFF_STATE:
info("Slot %s is already in powering off state\n",
- p_slot->name);
+ slot_name(p_slot));
break;
case BLINKINGON_STATE:
case POWERON_STATE:
- info("Already disabled on slot %s\n", p_slot->name);
+ info("Already disabled on slot %s\n", slot_name(p_slot));
break;
default:
- err("Not a valid state on slot %s\n", p_slot->name);
+ err("Not a valid state on slot %s\n", slot_name(p_slot));
break;
}
mutex_unlock(&p_slot->lock);
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -61,15 +61,13 @@ extern struct workqueue_struct *pciehp_w
struct slot {
u8 bus;
u8 device;
- u32 number;
u8 state;
- struct timer_list task_event;
u8 hp_slot;
+ u32 number;
struct controller *ctrl;
struct hpc_ops *hpc_ops;
struct hotplug_slot *hotplug_slot;
struct list_head slot_list;
- char name[SLOT_NAME_SIZE];
unsigned long last_emi_toggle;
struct delayed_work work; /* work for button event */
struct mutex lock;
@@ -161,6 +159,11 @@ int pciehp_enable_slot(struct slot *p_sl
int pciehp_disable_slot(struct slot *p_slot);
int pcie_enable_notification(struct controller *ctrl);

+static inline const char *slot_name(struct slot *slot)
+{
+ return hotplug_slot_name(slot->hotplug_slot);
+}
+
static inline struct slot *pciehp_find_slot(struct controller *ctrl, u8 device)
{
struct slot *slot;
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -1044,7 +1044,6 @@ static int pcie_init_slot(struct control
slot->device = ctrl->slot_device_offset + slot->hp_slot;
slot->hpc_ops = ctrl->hpc_ops;
slot->number = ctrl->first_slot;
- snprintf(slot->name, SLOT_NAME_SIZE, "%d", slot->number);
mutex_init(&slot->lock);
INIT_DELAYED_WORK(&slot->work, pciehp_queue_pushbutton_work);
list_add(&slot->slot_list, &ctrl->slot_list);

2008-12-03 20:04:49

by Greg KH

Subject: [patch 037/104] PCI: ibmphp: stop managing hotplug_slot->name

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit a32615a1a661f83661e8a26c3bc7763f716da8f3 upstream.

We no longer need to manage our version of hotplug_slot->name
since the PCI and hotplug core manage it on our behalf.

Now, we simply advise the PCI core of the name that we would
like, and let the core take care of the rest.

Additionally, slightly rearrange the members of struct slot
so they are naturally aligned to eliminate holes.
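
The effect of grouping the one-byte members can be checked with a reduced
user-space sketch. The two structures below are hypothetical cut-down
versions of the layout, not the real struct slot, and the exact sizes
depend on the ABI; on a typical LP64 target the second ordering is
expected to be smaller because the u8 members share one padded region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced "before" layout: single bytes scattered between wider members. */
struct mock_slot_before {
	uint8_t supported_bus_mode;
	void *hotplug_slot;
	uint8_t irq[4];
	uint8_t flag;
	int bit_mode;
	uint8_t ctlr_index;
	void *bus_on;
};

/* Reduced "after" layout: the single bytes are grouped together. */
struct mock_slot_after {
	uint8_t supported_bus_mode;
	uint8_t flag;
	uint8_t ctlr_index;
	void *hotplug_slot;
	uint8_t irq[4];
	int bit_mode;
	void *bus_on;
};

/* Bytes of padding saved by the reordering on the current ABI. */
static size_t mock_bytes_saved(void)
{
	assert(sizeof(struct mock_slot_after) <= sizeof(struct mock_slot_before));
	return sizeof(struct mock_slot_before) - sizeof(struct mock_slot_after);
}
```

Inside the kernel tree, a tool such as pahole reports the same kind of
hole information for the real structure.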

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/ibmphp.h | 5 ++---
drivers/pci/hotplug/ibmphp_ebda.c | 20 +++++++-------------
2 files changed, 9 insertions(+), 16 deletions(-)

--- a/drivers/pci/hotplug/ibmphp_ebda.c
+++ b/drivers/pci/hotplug/ibmphp_ebda.c
@@ -620,11 +620,14 @@ static u8 calculate_first_slot (u8 slot_
return first_slot + 1;

}
+
+#define SLOT_NAME_SIZE 30
+
static char *create_file_name (struct slot * slot_cur)
{
struct opt_rio *opt_vg_ptr = NULL;
struct opt_rio_lo *opt_lo_ptr = NULL;
- static char str[30];
+ static char str[SLOT_NAME_SIZE];
int which = 0; /* rxe = 1, chassis = 0 */
u8 number = 1; /* either chassis or rxe # */
u8 first_slot = 1;
@@ -736,7 +739,6 @@ static void release_slot(struct hotplug_

slot = hotplug_slot->private;
kfree(slot->hotplug_slot->info);
- kfree(slot->hotplug_slot->name);
kfree(slot->hotplug_slot);
slot->ctrl = NULL;
slot->bus_on = NULL;
@@ -768,6 +770,7 @@ static int __init ebda_rsrc_controller (
int rc;
struct slot *tmp_slot;
struct list_head *list;
+ char name[SLOT_NAME_SIZE];

addr = hpc_list_ptr->phys_addr;
for (ctlr = 0; ctlr < hpc_list_ptr->num_ctlrs; ctlr++) {
@@ -931,12 +934,6 @@ static int __init ebda_rsrc_controller (
goto error_no_hp_info;
}

- hp_slot_ptr->name = kmalloc(30, GFP_KERNEL);
- if (!hp_slot_ptr->name) {
- rc = -ENOMEM;
- goto error_no_hp_name;
- }
-
tmp_slot = kzalloc(sizeof(*tmp_slot), GFP_KERNEL);
if (!tmp_slot) {
rc = -ENOMEM;
@@ -1000,10 +997,9 @@ static int __init ebda_rsrc_controller (
list_for_each (list, &ibmphp_slot_head) {
tmp_slot = list_entry (list, struct slot, ibm_slot_list);

- snprintf (tmp_slot->hotplug_slot->name, 30, "%s", create_file_name (tmp_slot));
+ snprintf(name, SLOT_NAME_SIZE, "%s", create_file_name(tmp_slot));
pci_hp_register(tmp_slot->hotplug_slot,
- pci_find_bus(0, tmp_slot->bus), tmp_slot->device,
- tmp_slot->hotplug_slot->name);
+ pci_find_bus(0, tmp_slot->bus), tmp_slot->device, name);
}

print_ebda_hpc ();
@@ -1013,8 +1009,6 @@ static int __init ebda_rsrc_controller (
error:
kfree (hp_slot_ptr->private);
error_no_slot:
- kfree (hp_slot_ptr->name);
-error_no_hp_name:
kfree (hp_slot_ptr->info);
error_no_hp_info:
kfree (hp_slot_ptr);
--- a/drivers/pci/hotplug/ibmphp.h
+++ b/drivers/pci/hotplug/ibmphp.h
@@ -707,17 +707,16 @@ struct slot {
u8 device;
u8 number;
u8 real_physical_slot_num;
- char name[100];
u32 capabilities;
u8 supported_speed;
u8 supported_bus_mode;
+ u8 flag; /* this is for disable slot and polling */
+ u8 ctlr_index;
struct hotplug_slot *hotplug_slot;
struct controller *ctrl;
struct pci_func *func;
u8 irq[4];
- u8 flag; /* this is for disable slot and polling */
int bit_mode; /* 0 = 32, 1 = 64 */
- u8 ctlr_index;
struct bus_info *bus_on;
struct list_head ibm_slot_list;
u8 status;

2008-12-03 20:06:24

by Greg KH

Subject: [patch 041/104] PCI: shpchp: remove name parameter

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 66f1705580f796a3f52c092e9dc92cbe5df41dd6 upstream.

We do not need to manage our own name parameter, especially since
the PCI core can change it on our behalf, in the case of duplicate
slot names.

Remove 'name' from shpchp's version of struct slot.

This change also removes the unused struct task_event from the
slot structure.

Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/shpchp.h | 9 ++++---
drivers/pci/hotplug/shpchp_core.c | 38 ++++++++++++++----------------
drivers/pci/hotplug/shpchp_ctrl.c | 48 +++++++++++++++++++-------------------
3 files changed, 48 insertions(+), 47 deletions(-)

--- a/drivers/pci/hotplug/shpchp_core.c
+++ b/drivers/pci/hotplug/shpchp_core.c
@@ -89,7 +89,7 @@ static void release_slot(struct hotplug_
{
struct slot *slot = hotplug_slot->private;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

kfree(slot->hotplug_slot->info);
kfree(slot->hotplug_slot);
@@ -101,6 +101,7 @@ static int init_slots(struct controller
struct slot *slot;
struct hotplug_slot *hotplug_slot;
struct hotplug_slot_info *info;
+ char name[SLOT_NAME_SIZE];
int retval = -ENOMEM;
int i;

@@ -119,8 +120,6 @@ static int init_slots(struct controller
goto error_hpslot;
hotplug_slot->info = info;

- hotplug_slot->name = slot->name;
-
slot->hp_slot = i;
slot->ctrl = ctrl;
slot->bus = ctrl->pci_dev->subordinate->number;
@@ -133,25 +132,24 @@ static int init_slots(struct controller
/* register this slot with the hotplug pci core */
hotplug_slot->private = slot;
hotplug_slot->release = &release_slot;
- snprintf(slot->name, SLOT_NAME_SIZE, "%d", slot->number);
+ snprintf(name, SLOT_NAME_SIZE, "%d", slot->number);
hotplug_slot->ops = &shpchp_hotplug_slot_ops;

- get_power_status(hotplug_slot, &info->power_status);
- get_attention_status(hotplug_slot, &info->attention_status);
- get_latch_status(hotplug_slot, &info->latch_status);
- get_adapter_status(hotplug_slot, &info->adapter_status);
-
dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
"slot_device_offset=%x\n", slot->bus, slot->device,
slot->hp_slot, slot->number, ctrl->slot_device_offset);
retval = pci_hp_register(slot->hotplug_slot,
- ctrl->pci_dev->subordinate, slot->device,
- hotplug_slot->name);
+ ctrl->pci_dev->subordinate, slot->device, name);
if (retval) {
err("pci_hp_register failed with error %d\n", retval);
goto error_info;
}

+ get_power_status(hotplug_slot, &info->power_status);
+ get_attention_status(hotplug_slot, &info->attention_status);
+ get_latch_status(hotplug_slot, &info->latch_status);
+ get_adapter_status(hotplug_slot, &info->adapter_status);
+
list_add(&slot->slot_list, &ctrl->slot_list);
}

@@ -189,7 +187,7 @@ static int set_attention_status (struct
{
struct slot *slot = get_slot(hotplug_slot);

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

hotplug_slot->info->attention_status = status;
slot->hpc_ops->set_attention_status(slot, status);
@@ -201,7 +199,7 @@ static int enable_slot (struct hotplug_s
{
struct slot *slot = get_slot(hotplug_slot);

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

return shpchp_sysfs_enable_slot(slot);
}
@@ -210,7 +208,7 @@ static int disable_slot (struct hotplug_
{
struct slot *slot = get_slot(hotplug_slot);

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

return shpchp_sysfs_disable_slot(slot);
}
@@ -220,7 +218,7 @@ static int get_power_status (struct hotp
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_power_status(slot, value);
if (retval < 0)
@@ -234,7 +232,7 @@ static int get_attention_status (struct
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_attention_status(slot, value);
if (retval < 0)
@@ -248,7 +246,7 @@ static int get_latch_status (struct hotp
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_latch_status(slot, value);
if (retval < 0)
@@ -262,7 +260,7 @@ static int get_adapter_status (struct ho
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_adapter_status(slot, value);
if (retval < 0)
@@ -277,7 +275,7 @@ static int get_max_bus_speed(struct hotp
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_max_bus_speed(slot, value);
if (retval < 0)
@@ -291,7 +289,7 @@ static int get_cur_bus_speed (struct hot
struct slot *slot = get_slot(hotplug_slot);
int retval;

- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot->name);
+ dbg("%s - physical_slot = %s\n", __func__, slot_name(slot));

retval = slot->hpc_ops->get_cur_bus_speed(slot, value);
if (retval < 0)
--- a/drivers/pci/hotplug/shpchp_ctrl.c
+++ b/drivers/pci/hotplug/shpchp_ctrl.c
@@ -70,7 +70,7 @@ u8 shpchp_handle_attention_button(u8 hp_
/*
* Button pressed - See if need to TAKE ACTION!!!
*/
- info("Button pressed on Slot(%s)\n", p_slot->name);
+ info("Button pressed on Slot(%s)\n", slot_name(p_slot));
event_type = INT_BUTTON_PRESS;

queue_interrupt_event(p_slot, event_type);
@@ -98,7 +98,7 @@ u8 shpchp_handle_switch_change(u8 hp_slo
/*
* Switch opened
*/
- info("Latch open on Slot(%s)\n", p_slot->name);
+ info("Latch open on Slot(%s)\n", slot_name(p_slot));
event_type = INT_SWITCH_OPEN;
if (p_slot->pwr_save && p_slot->presence_save) {
event_type = INT_POWER_FAULT;
@@ -108,7 +108,7 @@ u8 shpchp_handle_switch_change(u8 hp_slo
/*
* Switch closed
*/
- info("Latch close on Slot(%s)\n", p_slot->name);
+ info("Latch close on Slot(%s)\n", slot_name(p_slot));
event_type = INT_SWITCH_CLOSE;
}

@@ -135,13 +135,13 @@ u8 shpchp_handle_presence_change(u8 hp_s
/*
* Card Present
*/
- info("Card present on Slot(%s)\n", p_slot->name);
+ info("Card present on Slot(%s)\n", slot_name(p_slot));
event_type = INT_PRESENCE_ON;
} else {
/*
* Not Present
*/
- info("Card not present on Slot(%s)\n", p_slot->name);
+ info("Card not present on Slot(%s)\n", slot_name(p_slot));
event_type = INT_PRESENCE_OFF;
}

@@ -164,14 +164,14 @@ u8 shpchp_handle_power_fault(u8 hp_slot,
/*
* Power fault Cleared
*/
- info("Power fault cleared on Slot(%s)\n", p_slot->name);
+ info("Power fault cleared on Slot(%s)\n", slot_name(p_slot));
p_slot->status = 0x00;
event_type = INT_POWER_FAULT_CLEAR;
} else {
/*
* Power fault
*/
- info("Power fault on Slot(%s)\n", p_slot->name);
+ info("Power fault on Slot(%s)\n", slot_name(p_slot));
event_type = INT_POWER_FAULT;
/* set power fault status for this board */
p_slot->status = 0xFF;
@@ -493,11 +493,11 @@ static void handle_button_press_event(st
if (getstatus) {
p_slot->state = BLINKINGOFF_STATE;
info("PCI slot #%s - powering off due to button "
- "press.\n", p_slot->name);
+ "press.\n", slot_name(p_slot));
} else {
p_slot->state = BLINKINGON_STATE;
info("PCI slot #%s - powering on due to button "
- "press.\n", p_slot->name);
+ "press.\n", slot_name(p_slot));
}
/* blink green LED and turn off amber */
p_slot->hpc_ops->green_led_blink(p_slot);
@@ -512,7 +512,7 @@ static void handle_button_press_event(st
* press the attention again before the 5 sec. limit
* expires to cancel hot-add or hot-remove
*/
- info("Button cancel on Slot(%s)\n", p_slot->name);
+ info("Button cancel on Slot(%s)\n", slot_name(p_slot));
dbg("%s: button cancel\n", __func__);
cancel_delayed_work(&p_slot->work);
if (p_slot->state == BLINKINGOFF_STATE)
@@ -521,7 +521,7 @@ static void handle_button_press_event(st
p_slot->hpc_ops->green_led_off(p_slot);
p_slot->hpc_ops->set_attention_status(p_slot, 0);
info("PCI slot #%s - action canceled due to button press\n",
- p_slot->name);
+ slot_name(p_slot));
p_slot->state = STATIC_STATE;
break;
case POWEROFF_STATE:
@@ -531,7 +531,7 @@ static void handle_button_press_event(st
* this means that the previous attention button action
* to hot-add or hot-remove is undergoing
*/
- info("Button ignore on Slot(%s)\n", p_slot->name);
+ info("Button ignore on Slot(%s)\n", slot_name(p_slot));
update_slot_info(p_slot);
break;
default:
@@ -574,17 +574,17 @@ static int shpchp_enable_slot (struct sl
mutex_lock(&p_slot->ctrl->crit_sect);
rc = p_slot->hpc_ops->get_adapter_status(p_slot, &getstatus);
if (rc || !getstatus) {
- info("No adapter on slot(%s)\n", p_slot->name);
+ info("No adapter on slot(%s)\n", slot_name(p_slot));
goto out;
}
rc = p_slot->hpc_ops->get_latch_status(p_slot, &getstatus);
if (rc || getstatus) {
- info("Latch open on slot(%s)\n", p_slot->name);
+ info("Latch open on slot(%s)\n", slot_name(p_slot));
goto out;
}
rc = p_slot->hpc_ops->get_power_status(p_slot, &getstatus);
if (rc || getstatus) {
- info("Already enabled on slot(%s)\n", p_slot->name);
+ info("Already enabled on slot(%s)\n", slot_name(p_slot));
goto out;
}

@@ -633,17 +633,17 @@ static int shpchp_disable_slot (struct s

rc = p_slot->hpc_ops->get_adapter_status(p_slot, &getstatus);
if (rc || !getstatus) {
- info("No adapter on slot(%s)\n", p_slot->name);
+ info("No adapter on slot(%s)\n", slot_name(p_slot));
goto out;
}
rc = p_slot->hpc_ops->get_latch_status(p_slot, &getstatus);
if (rc || getstatus) {
- info("Latch open on slot(%s)\n", p_slot->name);
+ info("Latch open on slot(%s)\n", slot_name(p_slot));
goto out;
}
rc = p_slot->hpc_ops->get_power_status(p_slot, &getstatus);
if (rc || !getstatus) {
- info("Already disabled slot(%s)\n", p_slot->name);
+ info("Already disabled slot(%s)\n", slot_name(p_slot));
goto out;
}

@@ -671,14 +671,14 @@ int shpchp_sysfs_enable_slot(struct slot
break;
case POWERON_STATE:
info("Slot %s is already in powering on state\n",
- p_slot->name);
+ slot_name(p_slot));
break;
case BLINKINGOFF_STATE:
case POWEROFF_STATE:
- info("Already enabled on slot %s\n", p_slot->name);
+ info("Already enabled on slot %s\n", slot_name(p_slot));
break;
default:
- err("Not a valid state on slot %s\n", p_slot->name);
+ err("Not a valid state on slot %s\n", slot_name(p_slot));
break;
}
mutex_unlock(&p_slot->lock);
@@ -703,14 +703,14 @@ int shpchp_sysfs_disable_slot(struct slo
break;
case POWEROFF_STATE:
info("Slot %s is already in powering off state\n",
- p_slot->name);
+ slot_name(p_slot));
break;
case BLINKINGON_STATE:
case POWERON_STATE:
- info("Already disabled on slot %s\n", p_slot->name);
+ info("Already disabled on slot %s\n", slot_name(p_slot));
break;
default:
- err("Not a valid state on slot %s\n", p_slot->name);
+ err("Not a valid state on slot %s\n", slot_name(p_slot));
break;
}
mutex_unlock(&p_slot->lock);
--- a/drivers/pci/hotplug/shpchp.h
+++ b/drivers/pci/hotplug/shpchp.h
@@ -69,15 +69,13 @@ struct slot {
u8 state;
u8 presence_save;
u8 pwr_save;
- struct timer_list task_event;
- u8 hp_slot;
struct controller *ctrl;
struct hpc_ops *hpc_ops;
struct hotplug_slot *hotplug_slot;
struct list_head slot_list;
- char name[SLOT_NAME_SIZE];
struct delayed_work work; /* work for button event */
struct mutex lock;
+ u8 hp_slot;
};

struct event_info {
@@ -169,6 +167,11 @@ extern void cleanup_slots(struct control
extern void shpchp_queue_pushbutton_work(struct work_struct *work);
extern int shpc_init( struct controller *ctrl, struct pci_dev *pdev);

+static inline const char *slot_name(struct slot *slot)
+{
+ return hotplug_slot_name(slot->hotplug_slot);
+}
+
#ifdef CONFIG_ACPI
#include <linux/pci-acpi.h>
static inline int get_hp_params_from_firmware(struct pci_dev *dev,

2008-12-03 20:05:58

by Greg KH

Subject: [patch 040/104] PCI: SGI Hotplug: stop managing bss_hotplug_slot->name

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 85234ce86dfa62b779faa19a70364a06e3f7fc32 upstream.

We no longer need to manage our version of hotplug_slot->name
since the PCI and hotplug core manage it on our behalf.

Update the sn_hp_slot_private_alloc() interface to fill in
the correct name for us, as that function already has all
the parameters needed to determine the name.
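
As a rough user-space sketch of the pattern (not the driver code itself),
the caller can own a fixed-size buffer and let the allocation helper fill
it in. The function name format_slot_name() and its parameters are
hypothetical; only the "%04x:%02x:%02x" layout and the 1-based device
number come from the patch:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the name formatting done in
 * sn_hp_slot_private_alloc(); the caller owns the buffer. */
int format_slot_name(char *name, size_t len, unsigned int domain,
		     unsigned int busnum, int device)
{
	/* Same "%04x:%02x:%02x" layout as the patch; device is 1-based. */
	return snprintf(name, len, "%04x:%02x:%02x",
			domain, busnum & 0xffffu, (unsigned int)(device + 1));
}
```

With this shape there is nothing for the release path to free, which is
why the kfree() of the name can go away.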

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/sgi_hotplug.c | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

--- a/drivers/pci/hotplug/sgi_hotplug.c
+++ b/drivers/pci/hotplug/sgi_hotplug.c
@@ -161,7 +161,8 @@ static int sn_pci_bus_valid(struct pci_b
}

static int sn_hp_slot_private_alloc(struct hotplug_slot *bss_hotplug_slot,
- struct pci_bus *pci_bus, int device)
+ struct pci_bus *pci_bus, int device,
+ char *name)
{
struct pcibus_info *pcibus_info;
struct slot *slot;
@@ -173,15 +174,9 @@ static int sn_hp_slot_private_alloc(stru
return -ENOMEM;
bss_hotplug_slot->private = slot;

- bss_hotplug_slot->name = kmalloc(SN_SLOT_NAME_SIZE, GFP_KERNEL);
- if (!bss_hotplug_slot->name) {
- kfree(bss_hotplug_slot->private);
- return -ENOMEM;
- }
-
slot->device_num = device;
slot->pci_bus = pci_bus;
- sprintf(bss_hotplug_slot->name, "%04x:%02x:%02x",
+ sprintf(name, "%04x:%02x:%02x",
pci_domain_nr(pci_bus),
((u16)pcibus_info->pbi_buscommon.bs_persist_busnum),
device + 1);
@@ -608,7 +603,6 @@ static inline int get_power_status(struc
static void sn_release_slot(struct hotplug_slot *bss_hotplug_slot)
{
kfree(bss_hotplug_slot->info);
- kfree(bss_hotplug_slot->name);
kfree(bss_hotplug_slot->private);
kfree(bss_hotplug_slot);
}
@@ -618,6 +612,7 @@ static int sn_hotplug_slot_register(stru
int device;
struct pci_slot *pci_slot;
struct hotplug_slot *bss_hotplug_slot;
+ char name[SN_SLOT_NAME_SIZE];
int rc = 0;

/*
@@ -645,16 +640,14 @@ static int sn_hotplug_slot_register(stru
}

if (sn_hp_slot_private_alloc(bss_hotplug_slot,
- pci_bus, device)) {
+ pci_bus, device, name)) {
rc = -ENOMEM;
goto alloc_err;
}
-
bss_hotplug_slot->ops = &sn_hotplug_slot_ops;
bss_hotplug_slot->release = &sn_release_slot;

- rc = pci_hp_register(bss_hotplug_slot, pci_bus, device,
- bss_hotplug_slot->name);
+ rc = pci_hp_register(bss_hotplug_slot, pci_bus, device, name);
if (rc)
goto register_err;

2008-12-03 20:05:41

by Greg KH

Subject: [patch 039/104] PCI: rpaphp: kmalloc/kfree slot->name directly

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit b2132fecca02fa05d509ba4c8c1e51dee6ccd003 upstream.

rpaphp tends to use slot->name directly everywhere, and doesn't
ever need slot->hotplug_slot->name.

struct hotplug_slot->name is going away, so convert rpaphp to directly
manipulate its own slot->name everywhere, and don't bother touching
slot->hotplug_slot->name.
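
The kmalloc() + strcpy() pair collapses into a single duplicate-string
call. A minimal user-space analogue, assuming malloc() in place of the
kernel allocator (dup_name() is a hypothetical stand-in for kstrdup(),
which cannot be called outside the kernel):

```c
#include <stdlib.h>
#include <string.h>

/* User-space stand-in for kstrdup(): allocate and copy in one step. */
char *dup_name(const char *src)
{
	size_t len = strlen(src) + 1;	/* include the trailing NUL */
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, src, len);
	return copy;			/* NULL on allocation failure */
}
```

The caller keeps the single NULL check and frees the copy exactly once,
which mirrors the kfree(slot->name) in dealloc_slot_struct().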

Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/rpaphp_slot.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

--- a/drivers/pci/hotplug/rpaphp_slot.c
+++ b/drivers/pci/hotplug/rpaphp_slot.c
@@ -43,7 +43,7 @@ static void rpaphp_release_slot(struct h
void dealloc_slot_struct(struct slot *slot)
{
kfree(slot->hotplug_slot->info);
- kfree(slot->hotplug_slot->name);
+ kfree(slot->name);
kfree(slot->hotplug_slot);
kfree(slot);
}
@@ -63,11 +63,9 @@ struct slot *alloc_slot_struct(struct de
GFP_KERNEL);
if (!slot->hotplug_slot->info)
goto error_hpslot;
- slot->hotplug_slot->name = kmalloc(strlen(drc_name) + 1, GFP_KERNEL);
- if (!slot->hotplug_slot->name)
+ slot->name = kstrdup(drc_name, GFP_KERNEL);
+ if (!slot->name)
goto error_info;
- slot->name = slot->hotplug_slot->name;
- strcpy(slot->name, drc_name);
slot->dn = dn;
slot->index = drc_index;
slot->power_domain = power_domain;

2008-12-03 20:06:40

by Greg KH

Subject: [patch 042/104] PCI: Hotplug core: remove name

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alex Chiang <[email protected]>

commit 58319b802a614f10f1b5238fbde7a4b2e9a60069 upstream.

Now that the PCI core manages the 'name' for each individual
hotplug driver, and all drivers (except rpaphp) have been converted
to use hotplug_slot_name(), there is no need for the PCI hotplug
core to drag around its own copy of name either.
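
A simplified sketch of the resulting lookup, assuming the accessor
resolves the name through the associated PCI slot. All demo_* names are
hypothetical, and a plain array stands in for pci_hotplug_slot_list:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for struct pci_slot and
 * struct hotplug_slot; the real structures carry much more state. */
struct demo_pci_slot {
	const char *name;
};

struct demo_hotplug_slot {
	struct demo_pci_slot *pci_slot;	/* the name lives here only */
};

/* Accessor in the spirit of hotplug_slot_name(). */
const char *demo_slot_name(const struct demo_hotplug_slot *slot)
{
	return slot->pci_slot->name;
}

/* Lookup in the spirit of get_slot_from_name(), over a plain array. */
struct demo_hotplug_slot *demo_find_slot(struct demo_hotplug_slot *slots,
					 size_t count, const char *name)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (strcmp(demo_slot_name(&slots[i]), name) == 0)
			return &slots[i];
	return NULL;
}
```

Keeping a single owner for the string means the hotplug structure cannot
drift out of sync with the name the PCI core exposes.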

Cc: [email protected]
Cc: [email protected]
Acked-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
Signed-off-by: Jesse Barnes <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/pci/hotplug/pci_hotplug_core.c | 6 +++---
include/linux/pci_hotplug.h | 3 ---
2 files changed, 3 insertions(+), 6 deletions(-)

--- a/drivers/pci/hotplug/pci_hotplug_core.c
+++ b/drivers/pci/hotplug/pci_hotplug_core.c
@@ -533,7 +533,7 @@ static struct hotplug_slot *get_slot_fro

list_for_each (tmp, &pci_hotplug_slot_list) {
slot = list_entry (tmp, struct hotplug_slot, slot_list);
- if (strcmp(slot->name, name) == 0)
+ if (strcmp(hotplug_slot_name(slot), name) == 0)
return slot;
}
return NULL;
@@ -611,7 +611,7 @@ int pci_hp_deregister(struct hotplug_slo
return -ENODEV;

mutex_lock(&pci_hp_mutex);
- temp = get_slot_from_name(hotplug->name);
+ temp = get_slot_from_name(hotplug_slot_name(hotplug));
if (temp != hotplug) {
mutex_unlock(&pci_hp_mutex);
return -ENODEV;
@@ -621,7 +621,7 @@ int pci_hp_deregister(struct hotplug_slo

slot = hotplug->pci_slot;
fs_remove_slot(slot);
- dbg("Removed slot %s from the list\n", hotplug->name);
+ dbg("Removed slot %s from the list\n", hotplug_slot_name(hotplug));

hotplug->release(hotplug);
slot->hotplug = NULL;
--- a/include/linux/pci_hotplug.h
+++ b/include/linux/pci_hotplug.h
@@ -142,8 +142,6 @@ struct hotplug_slot_info {

/**
* struct hotplug_slot - used to register a physical slot with the hotplug pci core
- * @name: the name of the slot being registered. This string must
- * be unique amoung slots registered on this system.
* @ops: pointer to the &struct hotplug_slot_ops to be used for this slot
* @info: pointer to the &struct hotplug_slot_info for the initial values for
* this slot.
@@ -153,7 +151,6 @@ struct hotplug_slot_info {
* needs.
*/
struct hotplug_slot {
- char *name;
struct hotplug_slot_ops *ops;
struct hotplug_slot_info *info;
void (*release) (struct hotplug_slot *slot);

2008-12-03 20:06:57

by Greg KH

Subject: [patch 043/104] CPUFREQ: powernow-k8: ignore out-of-range PstateStatus value

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Andreas Herrmann <[email protected]>

commit a266d9f1253a38ec2d5655ebcd6846298b0554f4 upstream.

A workaround for AMD CPU family 11h erratum 311 can cause the
P-state Status Register to show a "current P-state" that is larger than
the "current P-state limit" in the P-state Current Limit Register. No
ACPI _PSS object is defined for the wrong P-state value, so
powernow-k8/cpufreq can't determine the proper CPU frequency for that
state.

As a consequence this can cause a panic during boot (potentially with
all recent kernel versions -- at least I have reproduced it with
various 2.6.27 kernels and with the current .28 series), as an
example:

powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82 processors (2 \
)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
BUG: unable to handle kernel paging request at ffff88086e7528b8
IP: [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
PGD 202063 PUD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.28-rc3-dirty #16
RIP: 0010:[<ffffffff80486361>] [<ffffffff80486361>] cpufreq_stats_update+0x4a/0\
f
Synaptics claims to have extended capabilities, but I'm not able to read them.<6\
6
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88006e7528c0
RDX: 00000000ffffffff RSI: ffff88006e54af00 RDI: ffffffff808f056c
RBP: 00000000fffee697 R08: 0000000000000003 R09: ffff88006e73f080
R10: 0000000000000001 R11: 00000000002191c0 R12: ffff88006fb83c10
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88006fb50740(0000) knlGS:0000000000000000
Unable to initialize Synaptics hardware.
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88086e7528b8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88006fb82000, task ffff88006fb816d0)
Stack:
ffff88006e74da50 0000000000000000 ffff88006e54af00 ffffffff804863c7
ffff88006e74da50 0000000000000000 00000000ffffffff 0000000000000000
ffff88006fb83c10 ffffffff8024b46c ffffffff808f0560 ffff88006fb83c10
Call Trace:
[<ffffffff804863c7>] ? cpufreq_stat_notifier_trans+0x51/0x83
[<ffffffff8024b46c>] ? notifier_call_chain+0x29/0x4c
[<ffffffff8024b561>] ? __srcu_notifier_call_chain+0x46/0x61
[<ffffffff8048496d>] ? cpufreq_notify_transition+0x93/0xa9
[<ffffffff8021ab8d>] ? powernowk8_target+0x1e8/0x5f3
[<ffffffff80486687>] ? cpufreq_governor_performance+0x1b/0x20
[<ffffffff80484886>] ? __cpufreq_governor+0x71/0xa8
[<ffffffff80484b21>] ? __cpufreq_set_policy+0x101/0x13e
[<ffffffff80485bcd>] ? cpufreq_add_dev+0x3f0/0x4cd
[<ffffffff8048577a>] ? handle_update+0x0/0x8
[<ffffffff803c2062>] ? sysdev_driver_register+0xb6/0x10d
[<ffffffff8056592c>] ? powernowk8_init+0x0/0x7e
[<ffffffff8048604c>] ? cpufreq_register_driver+0x8f/0x140
[<ffffffff80209056>] ? _stext+0x56/0x14f
[<ffffffff802c2234>] ? proc_register+0x122/0x17d
[<ffffffff802c23a0>] ? create_proc_entry+0x73/0x8a
[<ffffffff8025c259>] ? register_irq_proc+0x92/0xaa
[<ffffffff8025c2c8>] ? init_irq_proc+0x57/0x69
[<ffffffff807fc85f>] ? kernel_init+0x116/0x169
[<ffffffff8020cc79>] ? child_rip+0xa/0x11
[<ffffffff807fc749>] ? kernel_init+0x0/0x169
[<ffffffff8020cc6f>] ? child_rip+0x0/0x11
Code: 05 c5 83 36 00 48 c7 c2 48 5d 86 80 48 8b 04 d8 48 8b 40 08 48 8b 34 02 48\

RIP [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
RSP <ffff88006fb83b20>
CR2: ffff88086e7528b8
---[ end trace 0678bac75e67a2f7 ]---
Kernel panic - not syncing: Attempted to kill init!

In short, the aftereffect of the wrong P-state is that
cpufreq_stats_update() uses "-1" as an array index in

cpufreq_stats_update (unsigned int cpu)
{
...
if (stat->time_in_state)
stat->time_in_state[stat->last_index] =
cputime64_add(stat->time_in_state[stat->last_index],
cputime_sub(cur_time, stat->last_time));
...
}

Fortunately, the wrong P-state value is returned only if the core is
in P-state 0. This fix solves the problem by detecting the
out-of-range P-state, ignoring it, and using "0" instead.
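
The check itself is small. A stand-alone sketch, assuming a 3-bit
P-state field (DEMO_PSTATE_MASK and demo_sanitize_pstate() are
hypothetical names; numps is the number of P-states known from ACPI):

```c
#include <stdint.h>

#define DEMO_PSTATE_MASK 0x7u	/* assumed width of the P-state field */

/* Return the P-state to record, given the raw low word of the status
 * register and the number of P-states described by ACPI _PSS. */
uint32_t demo_sanitize_pstate(uint32_t status_lo, uint32_t numps)
{
	uint32_t pstate = status_lo & DEMO_PSTATE_MASK;

	/* Out-of-range values only occur while the core is in P-state 0. */
	if (pstate >= numps)
		return 0;
	return pstate;
}
```

With three P-states defined, as in the log above, a reported value of 3
is mapped back to 0 while 0, 1 and 2 pass through unchanged.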

Cc: Mark Langsdorf <[email protected]>
Signed-off-by: Andreas Herrmann <[email protected]>
Signed-off-by: Dave Jones <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/kernel/cpu/cpufreq/powernow-k8.c | 18 +++++++++++++++---
arch/x86/kernel/cpu/cpufreq/powernow-k8.h | 17 ++++++++++++++++-
2 files changed, 31 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
+++ b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
@@ -116,9 +116,20 @@ static int query_current_values_with_pen
u32 i = 0;

if (cpu_family == CPU_HW_PSTATE) {
- rdmsr(MSR_PSTATE_STATUS, lo, hi);
- i = lo & HW_PSTATE_MASK;
- data->currpstate = i;
+ if (data->currpstate == HW_PSTATE_INVALID) {
+ /* read (initial) hw pstate if not yet set */
+ rdmsr(MSR_PSTATE_STATUS, lo, hi);
+ i = lo & HW_PSTATE_MASK;
+
+ /*
+ * a workaround for family 11h erratum 311 might cause
+ * an "out-of-range Pstate if the core is in Pstate-0
+ */
+ if (i >= data->numps)
+ data->currpstate = HW_PSTATE_0;
+ else
+ data->currpstate = i;
+ }
return 0;
}
do {
@@ -1117,6 +1128,7 @@ static int __cpuinit powernowk8_cpu_init
}

data->cpu = pol->cpu;
+ data->currpstate = HW_PSTATE_INVALID;

if (powernow_k8_cpu_init_acpi(data)) {
/*
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.h
+++ b/arch/x86/kernel/cpu/cpufreq/powernow-k8.h
@@ -5,6 +5,19 @@
* http://www.gnu.org/licenses/gpl.html
*/

+
+enum pstate {
+ HW_PSTATE_INVALID = 0xff,
+ HW_PSTATE_0 = 0,
+ HW_PSTATE_1 = 1,
+ HW_PSTATE_2 = 2,
+ HW_PSTATE_3 = 3,
+ HW_PSTATE_4 = 4,
+ HW_PSTATE_5 = 5,
+ HW_PSTATE_6 = 6,
+ HW_PSTATE_7 = 7,
+};
+
struct powernow_k8_data {
unsigned int cpu;

@@ -23,7 +36,9 @@ struct powernow_k8_data {
u32 exttype; /* extended interface = 1 */

/* keep track of the current fid / vid or pstate */
- u32 currvid, currfid, currpstate;
+ u32 currvid;
+ u32 currfid;
+ enum pstate currpstate;

/* the powernow_table includes all frequency and vid/fid pairings:
* fid are the lower 8 bits of the index, vid are the upper 8 bits.

2008-12-03 20:07:52

by Greg KH

Subject: [patch 044/104] xen: do not reserve 2 pages of padding between hypervisor and fixmap.

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Ian Campbell <[email protected]>

commit 5dc64a3442b98eaa0e3730c35fcf00cf962a93e7 upstream.

When reserving space for the hypervisor the Xen paravirt backend adds
an extra two pages (this was carried forward from the 2.6.18-xen tree
which had them "for safety"). Depending on various CONFIG options this
can cause the boot time fixmaps to span multiple PMDs which is not
supported and triggers a WARN in early_ioremap_init().

This was exposed by 2216d199b1430d1c0affb1498a9ebdbd9c0de439, which
moved the DMI table parsing earlier:

    x86: fix CONFIG_X86_RESERVE_LOW_64K=y

    The bad_bios_dmi_table() quirk never triggered because we do DMI setup
    too late. Move it a bit earlier.

There is no real reason to reserve these two extra pages and the
fixmap already incorporates FIX_HOLE which serves the same
purpose. None of the other callers of reserve_top_address do this.
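
To make the arithmetic concrete, here is a small sketch of how "-top"
behaves as a 32-bit size and how two extra pages can push a range across
a PMD boundary. The helper names, the 4 KiB page size and the 21-bit PMD
shift (PAE) are assumptions for illustration only:

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Size of the region from virt_start up to the 4 GiB boundary, plus
 * optional padding pages, using 32-bit wraparound like "-top". */
uint32_t demo_top_reserve(uint32_t virt_start, uint32_t pad_pages)
{
	return (uint32_t)((0u - virt_start) + pad_pages * DEMO_PAGE_SIZE);
}

/* Nonzero if [start, last] touches more than one PMD-sized region. */
int demo_spans_pmds(uint32_t start, uint32_t last, unsigned int pmd_shift)
{
	return (start >> pmd_shift) != (last >> pmd_shift);
}
```

For an illustrative virt_start of 0xF5800000, the reserve is 0x0A800000
bytes without padding and 0x0A802000 with it, so everything placed below
the reserve shifts down by two pages.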

Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/xen/enlighten.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1413,7 +1413,7 @@ static void __init xen_reserve_top(void)
if (HYPERVISOR_xen_version(XENVER_platform_parameters, &pp) == 0)
top = pp.virt_start;

- reserve_top_address(-top + 2 * PAGE_SIZE);
+ reserve_top_address(-top);
#endif /* CONFIG_X86_32 */
}

2008-12-03 20:08:15

by Greg KH

Subject: [patch 045/104] x86: Hibernate: Fix breakage on x86_32 with CONFIG_NUMA set

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Rafael J. Wysocki <[email protected]>

backport of commit 97a70e548bd97d5a46ae9d44f24aafcc013fd701 to the 2.6.27 kernel.

The NUMA code on x86_32 creates a special memory mapping that allows
each node's pgdat to be located in this node's memory. For this
purpose it allocates a memory area at the end of each node's memory
and maps this area so that it is accessible with virtual addresses
belonging to low memory. As a result, if there is high memory,
these NUMA-allocated areas are physically located in high memory,
although they are mapped to low memory addresses.

Our hibernation code does not take that into account and for this
reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
with high memory present. Fix this by adding a special mapping for
the NUMA-allocated memory areas to the temporary page tables created
during the last phase of resume.
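
The address arithmetic behind the extra mapping can be sketched in user
space as follows. This only computes which virtual address is paired with
which page frame, one entry per large page; it does not touch page
tables. The demo_* names and the constants (4 KiB pages, 512 entries per
page table as with PAE) are assumptions:

```c
#include <stddef.h>

#define DEMO_PAGE_SHIFT   12
#define DEMO_PTRS_PER_PTE 512UL	/* assumes PAE; 1024 without it */

struct demo_mapping {
	unsigned long vaddr;	/* low-memory virtual address */
	unsigned long pfn;	/* first page frame backing it */
};

/* Fill out[] with one entry per large page needed to cover a node's
 * remap area; returns the number of entries written. */
size_t demo_numa_kva_mappings(unsigned long start_va, unsigned long start_pfn,
			      unsigned long size_pages,
			      struct demo_mapping *out, size_t max)
{
	unsigned long pfn;
	size_t n = 0;

	for (pfn = 0; pfn < size_pages && n < max; pfn += DEMO_PTRS_PER_PTE) {
		out[n].vaddr = start_va + (pfn << DEMO_PAGE_SHIFT);
		out[n].pfn = start_pfn + pfn;
		n++;
	}
	return n;
}
```

A remap area of 1024 pages therefore needs two large-page entries under
these assumptions, 2 MiB apart in virtual address space.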

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/mm/discontig_32.c | 35 +++++++++++++++++++++++++++++++++++
arch/x86/power/hibernate_32.c | 4 ++++
include/asm-x86/mmzone_32.h | 4 ++++
3 files changed, 43 insertions(+)

--- a/arch/x86/mm/discontig_32.c
+++ b/arch/x86/mm/discontig_32.c
@@ -222,6 +222,41 @@ static void __init remap_numa_kva(void)
}
}

+#ifdef CONFIG_HIBERNATION
+/**
+ * resume_map_numa_kva - add KVA mapping to the temporary page tables created
+ * during resume from hibernation
+ * @pgd_base - temporary resume page directory
+ */
+void resume_map_numa_kva(pgd_t *pgd_base)
+{
+ int node;
+
+ for_each_online_node(node) {
+ unsigned long start_va, start_pfn, size, pfn;
+
+ start_va = (unsigned long)node_remap_start_vaddr[node];
+ start_pfn = node_remap_start_pfn[node];
+ size = node_remap_size[node];
+
+ printk(KERN_DEBUG "%s: node %d\n", __FUNCTION__, node);
+
+ for (pfn = 0; pfn < size; pfn += PTRS_PER_PTE) {
+ unsigned long vaddr = start_va + (pfn << PAGE_SHIFT);
+ pgd_t *pgd = pgd_base + pgd_index(vaddr);
+ pud_t *pud = pud_offset(pgd, vaddr);
+ pmd_t *pmd = pmd_offset(pud, vaddr);
+
+ set_pmd(pmd, pfn_pmd(start_pfn + pfn,
+ PAGE_KERNEL_LARGE_EXEC));
+
+ printk(KERN_DEBUG "%s: %08lx -> pfn %08lx\n",
+ __FUNCTION__, vaddr, start_pfn + pfn);
+ }
+ }
+}
+#endif
+
static unsigned long calculate_numa_remap_pages(void)
{
int nid;
--- a/arch/x86/power/hibernate_32.c
+++ b/arch/x86/power/hibernate_32.c
@@ -12,6 +12,7 @@
#include <asm/system.h>
#include <asm/page.h>
#include <asm/pgtable.h>
+#include <asm/mmzone.h>

/* Defined in hibernate_asm_32.S */
extern int restore_image(void);
@@ -127,6 +128,9 @@ static int resume_physical_mapping_init(
}
}
}
+
+ resume_map_numa_kva(pgd_base);
+
return 0;
}

--- a/include/asm-x86/mmzone_32.h
+++ b/include/asm-x86/mmzone_32.h
@@ -34,10 +34,14 @@ static inline void get_memcfg_numa(void)

extern int early_pfn_to_nid(unsigned long pfn);

+extern void resume_map_numa_kva(pgd_t *pgd);
+
#else /* !CONFIG_NUMA */

#define get_memcfg_numa get_memcfg_numa_flat

+static inline void resume_map_numa_kva(pgd_t *pgd) {}
+
#endif /* CONFIG_NUMA */

#ifdef CONFIG_DISCONTIGMEM

2008-12-03 20:08:33

by Greg KH

Subject: [patch 046/104] x86: SB600: skip ACPI IRQ0 override if it is not routed to INT2 of IOAPIC

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Andreas Herrmann <[email protected]>

commit 26adcfbf00e0726b4469070aa2f530dcf963f484 upstream.

On some more HP laptops the BIOS reports an IRQ0 override,
but the SB600 chipset is configured such that timer
interrupts go to INT0 of IOAPIC.

Check IRQ0 routing and if it is routed to INT0 of IOAPIC skip the
timer override.

http://bugzilla.kernel.org/show_bug.cgi?id=11715
http://bugzilla.kernel.org/show_bug.cgi?id=11516
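
The decision can be summarized as a pure function. This is a sketch of
the logic only, with hypothetical demo_* names; the register offset, bit
position and revision limit are the ones used in the patch below, and the
real code reads them from PCI config space:

```c
#include <stdint.h>

#define DEMO_SB600_MAX_REV 0x13u	/* revisions above this are left alone */
#define DEMO_IRQ0_SWAP_BIT (1u << 14)	/* bit tested in config register 0x64 */

/* Returns 1 if this quirk would ask for the timer override to be skipped. */
int demo_sb600_skip_override(uint32_t reg64, uint32_t rev, int use_override)
{
	if (use_override)		/* user asked to honor the override */
		return 0;
	if (rev > DEMO_SB600_MAX_REV)
		return 0;
	/* Bit clear: IRQ0 is not swapped, so the BIOS override is bogus. */
	return !(reg64 & DEMO_IRQ0_SWAP_BIT);
}
```

Note that a 0 result here only means this quirk leaves the skip flag
untouched; another quirk may already have set it.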

Signed-off-by: Andreas Herrmann <[email protected]>
Signed-off-by: Len Brown <[email protected]>
Cc: Chuck Ebbert <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/kernel/early-quirks.c | 55 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -95,7 +95,8 @@ static void __init nvidia_bugs(int num,

}

-static u32 ati_ixp4x0_rev(int num, int slot, int func)
+#if defined(CONFIG_ACPI) && defined(CONFIG_X86_IO_APIC)
+static u32 __init ati_ixp4x0_rev(int num, int slot, int func)
{
u32 d;
u8 b;
@@ -115,7 +116,6 @@ static u32 ati_ixp4x0_rev(int num, int s

static void __init ati_bugs(int num, int slot, int func)
{
-#if defined(CONFIG_ACPI) && defined (CONFIG_X86_IO_APIC)
u32 d;
u8 b;

@@ -138,9 +138,56 @@ static void __init ati_bugs(int num, int
printk(KERN_INFO "If you got timer trouble "
"try acpi_use_timer_override\n");
}
-#endif
}

+static u32 __init ati_sbx00_rev(int num, int slot, int func)
+{
+ u32 old, d;
+
+ d = read_pci_config(num, slot, func, 0x70);
+ old = d;
+ d &= ~(1<<8);
+ write_pci_config(num, slot, func, 0x70, d);
+ d = read_pci_config(num, slot, func, 0x8);
+ d &= 0xff;
+ write_pci_config(num, slot, func, 0x70, old);
+
+ return d;
+}
+
+static void __init ati_bugs_contd(int num, int slot, int func)
+{
+ u32 d, rev;
+
+ if (acpi_use_timer_override)
+ return;
+
+ rev = ati_sbx00_rev(num, slot, func);
+ if (rev > 0x13)
+ return;
+
+ /* check for IRQ0 interrupt swap */
+ d = read_pci_config(num, slot, func, 0x64);
+ if (!(d & (1<<14)))
+ acpi_skip_timer_override = 1;
+
+ if (acpi_skip_timer_override) {
+ printk(KERN_INFO "SB600 revision 0x%x\n", rev);
+ printk(KERN_INFO "Ignoring ACPI timer override.\n");
+ printk(KERN_INFO "If you got timer trouble "
+ "try acpi_use_timer_override\n");
+ }
+}
+#else
+static void __init ati_bugs(int num, int slot, int func)
+{
+}
+
+static void __init ati_bugs_contd(int num, int slot, int func)
+{
+}
+#endif
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -162,6 +209,8 @@ static struct chipset early_qrk[] __init
PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, fix_hypertransport_config },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_IXP400_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
+ { PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
+ PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
{}
};

2008-12-03 20:08:53

by Greg KH

Subject: [patch 047/104] libata: Avoid overflow in libata when tf->hba_lbal > 127

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Chuck Ebbert <[email protected]>

Combination of these two upstream patches:

ba14a9c291aa867896a90b3571fcc1c3759942ff
libata: Avoid overflow in ata_tf_to_lba48() when tf->hba_lbal > 127

44901a96847b9967c057832b185e2f34ee6a14e5
libata: Avoid overflow in ata_tf_read_block() when tf->hba_lbal > 127

Originally written by Roland Dreier, but backported by Chuck.
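
Why values above 127 matter: the u8 field is promoted to a 32-bit int
before the shift, so bit 7 lands in the sign bit and the result is
sign-extended when it is OR-ed into the u64. The sketch below models
that with explicit casts (so it does not rely on signed overflow) and
compares it with the widened form used by the fix. The demo_* names are
hypothetical, and the int32_t conversion assumes the usual
two's-complement wraparound of GCC and Clang:

```c
#include <stdint.h>

/* Models the pre-patch expression "block |= tf->hob_lbal << 24".
 * The u8 is promoted to a 32-bit int before the shift, so bit 31 ends up
 * as the sign bit and is sign-extended when widened to 64 bits. */
uint64_t demo_shift_promoted(uint8_t hob_lbal)
{
	int32_t as_int = (int32_t)((uint32_t)hob_lbal << 24);

	return (uint64_t)(int64_t)as_int;
}

/* Post-patch form: widen first, then shift. */
uint64_t demo_shift_widened(uint8_t hob_lbal)
{
	return (uint64_t)hob_lbal << 24;
}
```

For 0x80 the promoted form yields 0xFFFFFFFF80000000, which would
overwrite the upper LBA bits already stored in the block number, while
the widened form yields 0x80000000. For values up to 0x7f both agree.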


Cc: Roland Dreier <[email protected]>
Cc: Jeff Garzik <[email protected]>
Signed-off-by: Chuck Ebbert <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/ata/libata-core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -551,7 +551,7 @@ u64 ata_tf_read_block(struct ata_taskfil
if (tf->flags & ATA_TFLAG_LBA48) {
block |= (u64)tf->hob_lbah << 40;
block |= (u64)tf->hob_lbam << 32;
- block |= tf->hob_lbal << 24;
+ block |= (u64)tf->hob_lbal << 24;
} else
block |= (tf->device & 0xf) << 24;

@@ -1207,7 +1207,7 @@ u64 ata_tf_to_lba48(const struct ata_tas

sectors |= ((u64)(tf->hob_lbah & 0xff)) << 40;
sectors |= ((u64)(tf->hob_lbam & 0xff)) << 32;
- sectors |= (tf->hob_lbal & 0xff) << 24;
+ sectors |= ((u64)(tf->hob_lbal & 0xff)) << 24;
sectors |= (tf->lbah & 0xff) << 16;
sectors |= (tf->lbam & 0xff) << 8;
sectors |= (tf->lbal & 0xff);

2008-12-03 20:09:19

by Greg KH

Subject: [patch 048/104] x86: call dmi-quirks for HP Laptops after early-quirks are executed

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Andreas Herrmann <[email protected]>

commit 35af28219e684a36cc8b1ff456c370ce22be157d upstream.

Impact: make warning message disappear - functionality unchanged

Problems with bogus IRQ0 override of those laptops should be fixed
with commits

x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: SB450: skip IRQ0 override if it is not routed to INT2 of IOAPIC

that introduce early-quirks based on chipset configuration.

For further information, see
http://bugzilla.kernel.org/show_bug.cgi?id=11516

Instead of removing the related dmi-quirks completely we'd like to
keep them for (at least) one kernel version -- to double-check whether
the early-quirks really took effect. But the dmi-quirks need to be
called after early-quirks are executed. With this patch the calling
sequence for dmi-quirks is changed as follows:

acpi_boot_table_init() (dmi-quirks)
...
early_quirks() (detect bogus IRQ0 override)
...
acpi_boot_init() (late dmi-quirks and setup IO APIC)

Note: The plan is to remove the "late dmi-quirks" with the next kernel version.

Signed-off-by: Andreas Herrmann <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/x86/kernel/acpi/boot.c | 8 ++++++++
1 file changed, 8 insertions(+)

--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1593,6 +1593,11 @@ static struct dmi_system_id __initdata a
DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate 360"),
},
},
+ {}
+};
+
+/* second table for DMI checks that should run after early-quirks */
+static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
/*
* HP laptops which use a DSDT reporting as HP/SB400/10000,
* which includes some code which overrides all temperature
@@ -1721,6 +1726,9 @@ int __init early_acpi_boot_init(void)

int __init acpi_boot_init(void)
{
+ /* those are executed after early-quirks are executed */
+ dmi_check_system(acpi_dmi_table_late);
+
/*
* If acpi_disabled, bail out
* One exception: acpi=ht continues far enough to enumerate LAPICs

2008-12-03 20:09:37

by Greg KH

Subject: [patch 049/104] igb: Use device_set_wakeup_enable

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Rafael J. Wysocki <[email protected]>

commit e1b86d8479f90aadee57a3d07d8e61c815c202d9 upstream.

Since dev->power.should_wakeup bit is used by the PCI core to
decide whether the device should wake up the system from sleep
states, set/unset this bit whenever WOL is enabled/disabled using
igb_set_wol(). Accordingly, use device_can_wakeup() for checking
if wake-up is supported by the device.
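
As a rough illustration of the two bits involved, the following user-space sketch models a "can wake up" capability and a "should wake up" request, and a set_wol-style path that refuses the request when the capability is missing. Everything named fake_* or FAKE_* is made up for the example; this is a simplified model of the idea in the commit message, not the kernel's struct device or its PM helpers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two power-management bits. */
struct fake_dev {
	bool can_wakeup;	/* hardware is able to wake the system */
	bool should_wakeup;	/* wake-up has been requested */
};

#define FAKE_WAKE_MAGIC 0x20	/* illustrative option bit */

static bool fake_can_wakeup(const struct fake_dev *d)
{
	return d->can_wakeup;
}

static void fake_set_wakeup_enable(struct fake_dev *d, bool enable)
{
	d->should_wakeup = enable;
}

/* What the core would consult before arming wake-up. */
static bool fake_may_wakeup(const struct fake_dev *d)
{
	return d->can_wakeup && d->should_wakeup;
}

/* Same shape as the patched set_wol path: bail out if the device
 * cannot wake the system, otherwise record the request. */
static int fake_set_wol(struct fake_dev *d, unsigned int wolopts)
{
	if (!fake_can_wakeup(d))
		return wolopts ? -EOPNOTSUPP : 0;

	fake_set_wakeup_enable(d, !!(wolopts & FAKE_WAKE_MAGIC));
	return 0;
}
```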

Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/igb/igb_ethtool.c | 8 ++++++--
drivers/net/igb/igb_main.c | 1 +
2 files changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/net/igb/igb_ethtool.c
+++ b/drivers/net/igb/igb_ethtool.c
@@ -1776,7 +1776,8 @@ static void igb_get_wol(struct net_devic

/* this function will set ->supported = 0 and return 1 if wol is not
* supported by this hardware */
- if (igb_wol_exclusion(adapter, wol))
+ if (igb_wol_exclusion(adapter, wol) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return;

/* apply any specific unsupported masks here */
@@ -1805,7 +1806,8 @@ static int igb_set_wol(struct net_device
if (wol->wolopts & (WAKE_PHY | WAKE_ARP | WAKE_MAGICSECURE))
return -EOPNOTSUPP;

- if (igb_wol_exclusion(adapter, wol))
+ if (igb_wol_exclusion(adapter, wol) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return wol->wolopts ? -EOPNOTSUPP : 0;

switch (hw->device_id) {
@@ -1825,6 +1827,8 @@ static int igb_set_wol(struct net_device
if (wol->wolopts & WAKE_MAGIC)
adapter->wol |= E1000_WUFC_MAG;

+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
+
return 0;
}

--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -1220,6 +1220,7 @@ static int __devinit igb_probe(struct pc

/* initialize the wol settings based on the eeprom settings */
adapter->wol = adapter->eeprom_wol;
+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);

/* reset the hardware with the new settings */
igb_reset(adapter);

2008-12-03 20:09:53

by Greg KH

Subject: [patch 050/104] e1000: Use device_set_wakeup_enable

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Rafael J. Wysocki <[email protected]>

commit de1264896c8012a261c1cba17e6a61199c276ad3 upstream.

Since dev->power.should_wakeup bit is used by the PCI core to
decide whether the device should wake up the system from sleep
states, set/unset this bit whenever WOL is enabled/disabled using
e1000_set_wol(). Accordingly, use device_can_wakeup() for checking
if wake-up is supported by the device.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/e1000/e1000_ethtool.c | 8 ++++++--
drivers/net/e1000/e1000_main.c | 1 +
2 files changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -1774,7 +1774,8 @@ static void e1000_get_wol(struct net_dev

/* this function will set ->supported = 0 and return 1 if wol is not
* supported by this hardware */
- if (e1000_wol_exclusion(adapter, wol))
+ if (e1000_wol_exclusion(adapter, wol) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return;

/* apply any specific unsupported masks here */
@@ -1811,7 +1812,8 @@ static int e1000_set_wol(struct net_devi
if (wol->wolopts & (WAKE_PHY | WAKE_ARP | WAKE_MAGICSECURE))
return -EOPNOTSUPP;

- if (e1000_wol_exclusion(adapter, wol))
+ if (e1000_wol_exclusion(adapter, wol) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return wol->wolopts ? -EOPNOTSUPP : 0;

switch (hw->device_id) {
@@ -1838,6 +1840,8 @@ static int e1000_set_wol(struct net_devi
if (wol->wolopts & WAKE_MAGIC)
adapter->wol |= E1000_WUFC_MAG;

+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
+
return 0;
}

--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1180,6 +1180,7 @@ static int __devinit e1000_probe(struct

/* initialize the wol settings based on the eeprom settings */
adapter->wol = adapter->eeprom_wol;
+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);

/* print bus type/speed/width info */
DPRINTK(PROBE, INFO, "(PCI%s:%s:%s) ",

2008-12-03 20:10:25

by Greg KH

Subject: [patch 051/104] e1000e: Use device_set_wakeup_enable

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Rafael J. Wysocki <[email protected]>

commit 6ff68026f4757d68461b7fbeca5c944e1f5f8b44 upstream.

Since dev->power.should_wakeup bit is used by the PCI core to
decide whether the device should wake up the system from sleep
states, set/unset this bit whenever WOL is enabled/disabled using
e1000_set_wol(). Accordingly, use device_can_wakeup() for checking
if wake-up is supported by the device.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/e1000e/ethtool.c | 8 ++++++--
drivers/net/e1000e/netdev.c | 1 +
2 files changed, 7 insertions(+), 2 deletions(-)

--- a/drivers/net/e1000e/ethtool.c
+++ b/drivers/net/e1000e/ethtool.c
@@ -1681,7 +1681,8 @@ static void e1000_get_wol(struct net_dev
wol->supported = 0;
wol->wolopts = 0;

- if (!(adapter->flags & FLAG_HAS_WOL))
+ if (!(adapter->flags & FLAG_HAS_WOL) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return;

wol->supported = WAKE_UCAST | WAKE_MCAST |
@@ -1719,7 +1720,8 @@ static int e1000_set_wol(struct net_devi
if (wol->wolopts & WAKE_MAGICSECURE)
return -EOPNOTSUPP;

- if (!(adapter->flags & FLAG_HAS_WOL))
+ if (!(adapter->flags & FLAG_HAS_WOL) ||
+ !device_can_wakeup(&adapter->pdev->dev))
return wol->wolopts ? -EOPNOTSUPP : 0;

/* these settings will always override what we currently have */
@@ -1738,6 +1740,8 @@ static int e1000_set_wol(struct net_devi
if (wol->wolopts & WAKE_ARP)
adapter->wol |= E1000_WUFC_ARP;

+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);
+
return 0;
}

--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4616,6 +4616,7 @@ static int __devinit e1000_probe(struct

/* initialize the wol settings based on the eeprom settings */
adapter->wol = adapter->eeprom_wol;
+ device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);

/* reset the hardware with the new settings */
e1000e_reset(adapter);

2008-12-03 20:10:46

by Greg KH

Subject: [patch 052/104] libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQ

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Tejun Heo <[email protected]>

commit ac70a964b0e22a95af3628c344815857a01461b7 upstream.

Some recent Seagate hard drives have a firmware bug which causes FLUSH
CACHE to time out under certain circumstances if NCQ is being used.
This can be worked around by disabling NCQ and fixed by updating the
firmware. Implement ATA_HORKAGE_FIRMWARE_WARN and blacklist these
devices.

The wiki page has been updated to contain information on this issue.

http://ata.wiki.kernel.org/index.php/Known_issues
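
To make the blacklist mechanism concrete, here is a minimal user-space sketch of a model/revision table that yields a set of quirk flags. The demo_* names are hypothetical, the value of DEMO_NONCQ is illustrative (only the 1 << 12 for the firmware warning is taken from the hunk below), and the lookup uses plain exact string matching, which is a simplification and not necessarily how libata compares entries.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NONCQ		(1UL << 2)	/* illustrative value */
#define DEMO_FIRMWARE_WARN	(1UL << 12)	/* as in the libata.h hunk */

struct demo_blacklist_entry {
	const char *model;
	const char *rev;	/* NULL matches any revision */
	unsigned long horkage;
};

static const struct demo_blacklist_entry demo_blacklist[] = {
	{ "ST380817AS",   "3.42",   DEMO_NONCQ },
	{ "ST31500341AS", "9JU138", DEMO_NONCQ | DEMO_FIRMWARE_WARN },
	{ "ST31000333AS", "9FZ136", DEMO_NONCQ | DEMO_FIRMWARE_WARN },
	{ NULL, NULL, 0 }	/* terminator */
};

/* Return the quirk flags for a device, or 0 if it is not listed. */
static unsigned long demo_horkage(const char *model, const char *rev)
{
	const struct demo_blacklist_entry *e;

	for (e = demo_blacklist; e->model; e++) {
		if (strcmp(e->model, model) != 0)
			continue;
		if (e->rev && strcmp(e->rev, rev) != 0)
			continue;
		return e->horkage;
	}
	return 0;
}
```

A caller would then disable NCQ when DEMO_NONCQ is set and print the warning once when DEMO_FIRMWARE_WARN is set.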

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/ata/libata-core.c | 21 +++++++++++++++++++++
include/linux/libata.h | 1 +
2 files changed, 22 insertions(+)

--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2428,6 +2428,13 @@ int ata_dev_configure(struct ata_device
}
}

+ if ((dev->horkage & ATA_HORKAGE_FIRMWARE_WARN) && print_info) {
+ ata_dev_printk(dev, KERN_WARNING, "WARNING: device requires "
+ "firmware update to be fully functional.\n");
+ ata_dev_printk(dev, KERN_WARNING, " contact the vendor "
+ "or visit http://ata.wiki.kernel.org.\n");
+ }
+
return 0;

err_out_nosup:
@@ -3971,6 +3978,20 @@ static const struct ata_blacklist_entry
{ "ST380817AS", "3.42", ATA_HORKAGE_NONCQ },
{ "ST3160023AS", "3.42", ATA_HORKAGE_NONCQ },

+ /* Seagate NCQ + FLUSH CACHE firmware bug */
+ { "ST31500341AS", "9JU138", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+ { "ST31000333AS", "9FZ136", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+ { "ST3640623AS", "9FZ164", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+ { "ST3640323AS", "9FZ134", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+ { "ST3320813AS", "9FZ182", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+ { "ST3320613AS", "9FZ162", ATA_HORKAGE_NONCQ |
+ ATA_HORKAGE_FIRMWARE_WARN },
+
/* Blacklist entries taken from Silicon Image 3124/3132
Windows driver .inf file - also several Linux problem reports */
{ "HTS541060G9SA00", "MB3OC60D", ATA_HORKAGE_NONCQ, },
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -364,6 +364,7 @@ enum {
ATA_HORKAGE_IPM = (1 << 7), /* Link PM problems */
ATA_HORKAGE_IVB = (1 << 8), /* cbl det validity bit bugs */
ATA_HORKAGE_STUCK_ERR = (1 << 9), /* stuck ERR on next PACKET */
+ ATA_HORKAGE_FIRMWARE_WARN = (1 << 12), /* firmware update warning */

/* DMA mask for user DMA control: User visible values; DO NOT
renumber */

2008-12-03 20:11:11

by Greg KH

Subject: [patch 053/104] rtl8187: add device ID 0bda:8198

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: John W. Linville <[email protected]>

commit 746db510395e32ff57b9f8582e520df6b3fac618 upstream.

Reported by [email protected] to work here:

http://bugzilla.kernel.org/show_bug.cgi?id=11728

Signed-off-by: John W. Linville <[email protected]>
Cc: Zoomer <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/wireless/rtl8187_dev.c | 1 +
1 file changed, 1 insertion(+)

--- a/drivers/net/wireless/rtl8187_dev.c
+++ b/drivers/net/wireless/rtl8187_dev.c
@@ -37,6 +37,7 @@ static struct usb_device_id rtl8187_tabl
{USB_DEVICE(0x0bda, 0x8187), .driver_info = DEVICE_RTL8187},
{USB_DEVICE(0x0bda, 0x8189), .driver_info = DEVICE_RTL8187B},
{USB_DEVICE(0x0bda, 0x8197), .driver_info = DEVICE_RTL8187B},
+ {USB_DEVICE(0x0bda, 0x8198), .driver_info = DEVICE_RTL8187B},
/* Netgear */
{USB_DEVICE(0x0846, 0x6100), .driver_info = DEVICE_RTL8187},
{USB_DEVICE(0x0846, 0x6a00), .driver_info = DEVICE_RTL8187},

2008-12-03 20:11:45

by Greg KH

Subject: [patch 054/104] rtl8187: Add USB ID for Belkin F5D7050 with RTL8187B chip

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Florent Fourcot <[email protected]>

commit eaca90dab6ab9853223029deffdd226f41b2028c upstream.

The Belkin F5D7050rev5000de (id 050d:705e) has the Realtek RTL8187B chip
and works with the 2.6.27 driver.

Signed-off-by: Larry Finger <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/net/wireless/rtl8187_dev.c | 2 ++
1 file changed, 2 insertions(+)

--- a/drivers/net/wireless/rtl8187_dev.c
+++ b/drivers/net/wireless/rtl8187_dev.c
@@ -33,6 +33,8 @@ MODULE_LICENSE("GPL");
static struct usb_device_id rtl8187_table[] __devinitdata = {
/* Asus */
{USB_DEVICE(0x0b05, 0x171d), .driver_info = DEVICE_RTL8187},
+ /* Belkin */
+ {USB_DEVICE(0x050d, 0x705e), .driver_info = DEVICE_RTL8187B},
/* Realtek */
{USB_DEVICE(0x0bda, 0x8187), .driver_info = DEVICE_RTL8187},
{USB_DEVICE(0x0bda, 0x8189), .driver_info = DEVICE_RTL8187B},

2008-12-03 20:12:06

by Greg KH

Subject: [patch 055/104] cifs: Reduce number of socket retries in large write path

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

Backport of upstream commit edf1ae403896cb7750800508b14996ba6be39a53
for -stable.

[CIFS] Reduce number of socket retries in large write path

In some heavy stress conditions cifs could get EAGAIN
repeatedly in smb_send2, which led to repeated retries and eventually
failure of large writes, which could lead to data corruption.

There are three changes that were suggested by various network
developers:

1) convert cifs from non-blocking to blocking tcp sendmsg
(we left in the retry on failure)
2) change cifs to not set sendbuf and rcvbuf size for the socket
(let tcp autotune the buffer sizes since that works much better
in the TCP stack now)
3) if we have a partial frame sent in smb_send2, mark the tcp
session as invalid (close the socket and reconnect) so we do
not corrupt the remaining part of the SMB with the beginning
of the next SMB.

This does not appear to hurt performance measurably and has
been run in various scenarios, but it definitely removes
a corruption that we were seeing in some high stress
test cases.
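
The third change (treating a partially sent frame as fatal for the connection) can be sketched in user space as follows. This is only a model under stated assumptions: demo_send_fn, demo_conn, demo_sink and the retry bound are hypothetical, and need_reconnect merely stands in for setting tcpStatus to CifsNeedReconnect.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport hook: bytes accepted, or a negative error. */
typedef int (*demo_send_fn)(void *ctx, const char *buf, size_t len);

struct demo_conn {
	bool need_reconnect;	/* stands in for CifsNeedReconnect */
};

/* Send one frame. If only part of it went out, flag the connection so
 * the next frame cannot be taken as the remainder of this one. */
static int demo_send_frame(struct demo_conn *c, demo_send_fn send, void *ctx,
			   const char *frame, size_t frame_len, int max_tries)
{
	size_t remaining = frame_len;
	int rc = 0;

	while (remaining > 0 && max_tries-- > 0) {
		rc = send(ctx, frame + (frame_len - remaining), remaining);
		if (rc < 0)
			break;
		if ((size_t)rc > remaining) {	/* transport misbehaved */
			rc = -1;
			break;
		}
		remaining -= (size_t)rc;
	}

	if (remaining > 0 && remaining != frame_len)
		c->need_reconnect = true;	/* partial frame on the wire */

	if (rc < 0)
		return rc;
	return remaining == 0 ? 0 : -1;
}

/* Test double: accepts `chunk` bytes per call, fails after `good_calls`. */
struct demo_sink {
	size_t chunk;
	int good_calls;
};

static int demo_sink_send(void *ctx, const char *buf, size_t len)
{
	struct demo_sink *s = ctx;

	(void)buf;
	if (s->good_calls-- <= 0)
		return -1;
	return (int)(len < s->chunk ? len : s->chunk);
}
```

Note that a send which fails before any byte has gone out leaves the connection usable; only the in-between case forces a reconnect.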

Acked-by: Shirish Pargaonkar <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifsglob.h | 2 +
fs/cifs/cifsproto.h | 2 -
fs/cifs/connect.c | 58 ++++++++++++++++++++++++++++++++++++++--------------
fs/cifs/transport.c | 41 +++++++++++++++++++++++++++---------
4 files changed, 77 insertions(+), 26 deletions(-)

--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -141,6 +141,8 @@ struct TCP_Server_Info {
char versionMajor;
char versionMinor;
bool svlocal:1; /* local server or remote */
+ bool noblocksnd; /* use blocking sendmsg */
+ bool noautotune; /* do not autotune send buf sizes */
atomic_t socketUseCount; /* number of open cifs sessions on socket */
atomic_t inFlight; /* number of requests on the wire to server */
#ifdef CONFIG_CIFS_STATS2
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -36,7 +36,7 @@ extern void cifs_buf_release(void *);
extern struct smb_hdr *cifs_small_buf_get(void);
extern void cifs_small_buf_release(void *);
extern int smb_send(struct socket *, struct smb_hdr *,
- unsigned int /* length */ , struct sockaddr *);
+ unsigned int /* length */ , struct sockaddr *, bool);
extern unsigned int _GetXid(void);
extern void _FreeXid(unsigned int);
#define GetXid() (int)_GetXid(); cFYI(1,("CIFS VFS: in %s as Xid: %d with uid: %d",__func__, xid,current->fsuid));
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -90,6 +90,8 @@ struct smb_vol {
bool nocase:1; /* request case insensitive filenames */
bool nobrl:1; /* disable sending byte range locks to srv */
bool seal:1; /* request transport encryption on share */
+ bool noblocksnd:1;
+ bool noautotune:1;
unsigned int rsize;
unsigned int wsize;
unsigned int sockopt;
@@ -100,9 +102,11 @@ struct smb_vol {
static int ipv4_connect(struct sockaddr_in *psin_server,
struct socket **csocket,
char *netb_name,
- char *server_netb_name);
+ char *server_netb_name,
+ bool noblocksnd,
+ bool nosndbuf); /* ipv6 never set sndbuf size */
static int ipv6_connect(struct sockaddr_in6 *psin_server,
- struct socket **csocket);
+ struct socket **csocket, bool noblocksnd);


/*
@@ -188,12 +192,13 @@ cifs_reconnect(struct TCP_Server_Info *s
try_to_freeze();
if (server->protocolType == IPV6) {
rc = ipv6_connect(&server->addr.sockAddr6,
- &server->ssocket);
+ &server->ssocket, server->noautotune);
} else {
rc = ipv4_connect(&server->addr.sockAddr,
&server->ssocket,
server->workstation_RFC1001_name,
- server->server_RFC1001_name);
+ server->server_RFC1001_name,
+ server->noblocksnd, server->noautotune);
}
if (rc) {
cFYI(1, ("reconnect error %d", rc));
@@ -409,8 +414,14 @@ incomplete_rcv:
msleep(1); /* minimum sleep to prevent looping
allowing socket to clear and app threads to set
tcpStatus CifsNeedReconnect if server hung */
- if (pdu_length < 4)
+ if (pdu_length < 4) {
+ iov.iov_base = (4 - pdu_length) +
+ (char *)smb_buffer;
+ iov.iov_len = pdu_length;
+ smb_msg.msg_control = NULL;
+ smb_msg.msg_controllen = 0;
goto incomplete_rcv;
+ }
else
continue;
} else if (length <= 0) {
@@ -1186,6 +1197,10 @@ cifs_parse_mount_options(char *options,
/* ignore */
} else if (strnicmp(data, "rw", 2) == 0) {
vol->rw = true;
+ } else if (strnicmp(data, "noblocksnd", 11) == 0) {
+ vol->noblocksnd = true;
+ } else if (strnicmp(data, "noautotune", 10) == 0) {
+ vol->noautotune = true;
} else if ((strnicmp(data, "suid", 4) == 0) ||
(strnicmp(data, "nosuid", 6) == 0) ||
(strnicmp(data, "exec", 4) == 0) ||
@@ -1506,7 +1521,8 @@ static void rfc1002mangle(char *target,

static int
ipv4_connect(struct sockaddr_in *psin_server, struct socket **csocket,
- char *netbios_name, char *target_name)
+ char *netbios_name, char *target_name,
+ bool noblocksnd, bool noautotune)
{
int rc = 0;
int connected = 0;
@@ -1578,11 +1594,15 @@ ipv4_connect(struct sockaddr_in *psin_se
(*csocket)->sk->sk_sndbuf,
(*csocket)->sk->sk_rcvbuf, (*csocket)->sk->sk_rcvtimeo));
(*csocket)->sk->sk_rcvtimeo = 7 * HZ;
+ if (!noblocksnd)
+ (*csocket)->sk->sk_sndtimeo = 3 * HZ;
/* make the bufsizes depend on wsize/rsize and max requests */
- if ((*csocket)->sk->sk_sndbuf < (200 * 1024))
- (*csocket)->sk->sk_sndbuf = 200 * 1024;
- if ((*csocket)->sk->sk_rcvbuf < (140 * 1024))
- (*csocket)->sk->sk_rcvbuf = 140 * 1024;
+ if (noautotune) {
+ if ((*csocket)->sk->sk_sndbuf < (200 * 1024))
+ (*csocket)->sk->sk_sndbuf = 200 * 1024;
+ if ((*csocket)->sk->sk_rcvbuf < (140 * 1024))
+ (*csocket)->sk->sk_rcvbuf = 140 * 1024;
+ }

/* send RFC1001 sessinit */
if (psin_server->sin_port == htons(RFC1001_PORT)) {
@@ -1619,7 +1639,7 @@ ipv4_connect(struct sockaddr_in *psin_se
/* sizeof RFC1002_SESSION_REQUEST with no scope */
smb_buf->smb_buf_length = 0x81000044;
rc = smb_send(*csocket, smb_buf, 0x44,
- (struct sockaddr *)psin_server);
+ (struct sockaddr *)psin_server, noblocksnd);
kfree(ses_init_buf);
msleep(1); /* RFC1001 layer in at least one server
requires very short break before negprot
@@ -1639,7 +1659,8 @@ ipv4_connect(struct sockaddr_in *psin_se
}

static int
-ipv6_connect(struct sockaddr_in6 *psin_server, struct socket **csocket)
+ipv6_connect(struct sockaddr_in6 *psin_server, struct socket **csocket,
+ bool noblocksnd)
{
int rc = 0;
int connected = 0;
@@ -1708,6 +1729,8 @@ ipv6_connect(struct sockaddr_in6 *psin_s
the default. sock_setsockopt not used because it expects
user space buffer */
(*csocket)->sk->sk_rcvtimeo = 7 * HZ;
+ if (!noblocksnd)
+ (*csocket)->sk->sk_sndtimeo = 3 * HZ;

return rc;
}
@@ -1961,11 +1984,14 @@ cifs_mount(struct super_block *sb, struc
cFYI(1, ("attempting ipv6 connect"));
/* BB should we allow ipv6 on port 139? */
/* other OS never observed in Wild doing 139 with v6 */
- rc = ipv6_connect(&sin_server6, &csocket);
+ rc = ipv6_connect(&sin_server6, &csocket,
+ volume_info.noblocksnd);
} else
rc = ipv4_connect(&sin_server, &csocket,
- volume_info.source_rfc1001_name,
- volume_info.target_rfc1001_name);
+ volume_info.source_rfc1001_name,
+ volume_info.target_rfc1001_name,
+ volume_info.noblocksnd,
+ volume_info.noautotune);
if (rc < 0) {
cERROR(1, ("Error connecting to IPv4 socket. "
"Aborting operation"));
@@ -1980,6 +2006,8 @@ cifs_mount(struct super_block *sb, struc
sock_release(csocket);
goto out;
} else {
+ srvTcp->noblocksnd = volume_info.noblocksnd;
+ srvTcp->noautotune = volume_info.noautotune;
memcpy(&srvTcp->addr.sockAddr, &sin_server,
sizeof(struct sockaddr_in));
atomic_set(&srvTcp->inFlight, 0);
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -162,7 +162,7 @@ void DeleteTconOplockQEntries(struct cif

int
smb_send(struct socket *ssocket, struct smb_hdr *smb_buffer,
- unsigned int smb_buf_length, struct sockaddr *sin)
+ unsigned int smb_buf_length, struct sockaddr *sin, bool noblocksnd)
{
int rc = 0;
int i = 0;
@@ -179,7 +179,10 @@ smb_send(struct socket *ssocket, struct
smb_msg.msg_namelen = sizeof(struct sockaddr);
smb_msg.msg_control = NULL;
smb_msg.msg_controllen = 0;
- smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL; /* BB add more flags?*/
+ if (noblocksnd)
+ smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL;
+ else
+ smb_msg.msg_flags = MSG_NOSIGNAL;

/* smb header is converted in header_assemble. bcc and rest of SMB word
area, and byte area if necessary, is converted to littleendian in
@@ -230,8 +233,8 @@ smb_send(struct socket *ssocket, struct
}

static int
-smb_send2(struct socket *ssocket, struct kvec *iov, int n_vec,
- struct sockaddr *sin)
+smb_send2(struct TCP_Server_Info *server, struct kvec *iov, int n_vec,
+ struct sockaddr *sin, bool noblocksnd)
{
int rc = 0;
int i = 0;
@@ -241,6 +244,7 @@ smb_send2(struct socket *ssocket, struct
unsigned int total_len;
int first_vec = 0;
unsigned int smb_buf_length = smb_buffer->smb_buf_length;
+ struct socket *ssocket = server->ssocket;

if (ssocket == NULL)
return -ENOTSOCK; /* BB eventually add reconnect code here */
@@ -249,7 +253,10 @@ smb_send2(struct socket *ssocket, struct
smb_msg.msg_namelen = sizeof(struct sockaddr);
smb_msg.msg_control = NULL;
smb_msg.msg_controllen = 0;
- smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL; /* BB add more flags?*/
+ if (noblocksnd)
+ smb_msg.msg_flags = MSG_DONTWAIT + MSG_NOSIGNAL;
+ else
+ smb_msg.msg_flags = MSG_NOSIGNAL;

/* smb header is converted in header_assemble. bcc and rest of SMB word
area, and byte area if necessary, is converted to littleendian in
@@ -313,6 +320,16 @@ smb_send2(struct socket *ssocket, struct
i = 0; /* in case we get ENOSPC on the next send */
}

+ if ((total_len > 0) && (total_len != smb_buf_length + 4)) {
+ cFYI(1, ("partial send (%d remaining), terminating session",
+ total_len));
+ /* If we have only sent part of an SMB then the next SMB
+ could be taken as the remainder of this one. We need
+ to kill the socket so the server throws away the partial
+ SMB */
+ server->tcpStatus = CifsNeedReconnect;
+ }
+
if (rc < 0) {
cERROR(1, ("Error %d sending data on socket to server", rc));
} else
@@ -519,8 +536,9 @@ SendReceive2(const unsigned int xid, str
#ifdef CONFIG_CIFS_STATS2
atomic_inc(&ses->server->inSend);
#endif
- rc = smb_send2(ses->server->ssocket, iov, n_vec,
- (struct sockaddr *) &(ses->server->addr.sockAddr));
+ rc = smb_send2(ses->server, iov, n_vec,
+ (struct sockaddr *) &(ses->server->addr.sockAddr),
+ ses->server->noblocksnd);
#ifdef CONFIG_CIFS_STATS2
atomic_dec(&ses->server->inSend);
midQ->when_sent = jiffies;
@@ -712,7 +730,8 @@ SendReceive(const unsigned int xid, stru
atomic_inc(&ses->server->inSend);
#endif
rc = smb_send(ses->server->ssocket, in_buf, in_buf->smb_buf_length,
- (struct sockaddr *) &(ses->server->addr.sockAddr));
+ (struct sockaddr *) &(ses->server->addr.sockAddr),
+ ses->server->noblocksnd);
#ifdef CONFIG_CIFS_STATS2
atomic_dec(&ses->server->inSend);
midQ->when_sent = jiffies;
@@ -852,7 +871,8 @@ send_nt_cancel(struct cifsTconInfo *tcon
return rc;
}
rc = smb_send(ses->server->ssocket, in_buf, in_buf->smb_buf_length,
- (struct sockaddr *) &(ses->server->addr.sockAddr));
+ (struct sockaddr *) &(ses->server->addr.sockAddr),
+ ses->server->noblocksnd);
up(&ses->server->tcpSem);
return rc;
}
@@ -942,7 +962,8 @@ SendReceiveBlockingLock(const unsigned i
atomic_inc(&ses->server->inSend);
#endif
rc = smb_send(ses->server->ssocket, in_buf, in_buf->smb_buf_length,
- (struct sockaddr *) &(ses->server->addr.sockAddr));
+ (struct sockaddr *) &(ses->server->addr.sockAddr),
+ ses->server->noblocksnd);
#ifdef CONFIG_CIFS_STATS2
atomic_dec(&ses->server->inSend);
midQ->when_sent = jiffies;

2008-12-03 20:12:33

by Greg KH

Subject: [patch 056/104] cifs: Fix error in smb_send2

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

Backport of upstream commit 61de800d33af585cb7e6f27b5cdd51029c6855cb
for -stable.

[CIFS] fix error in smb_send2

smb_send2 exit logic was strange, and with the previous change
could cause us to fail large smb writes when all of the smb was
not sent as one chunk.

Acked-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifssmb.c | 2 +-
fs/cifs/file.c | 2 +-
fs/cifs/transport.c | 7 +++++--
3 files changed, 7 insertions(+), 4 deletions(-)

--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1534,7 +1534,7 @@ CIFSSMBWrite(const int xid, struct cifsT
__u32 bytes_sent;
__u16 byte_count;

- /* cFYI(1,("write at %lld %d bytes",offset,count));*/
+ /* cFYI(1, ("write at %lld %d bytes",offset,count));*/
if (tcon->ses == NULL)
return -ECONNABORTED;

--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1813,7 +1813,7 @@ static int cifs_readpages(struct file *f
pTcon = cifs_sb->tcon;

pagevec_init(&lru_pvec, 0);
- cFYI(DBG2, ("rpages: num pages %d", num_pages));
+ cFYI(DBG2, ("rpages: num pages %d", num_pages));
for (i = 0; i < num_pages; ) {
unsigned contig_pages;
struct page *tmp_page;
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -291,8 +291,11 @@ smb_send2(struct TCP_Server_Info *server
if (rc < 0)
break;

- if (rc >= total_len) {
- WARN_ON(rc > total_len);
+ if (rc == total_len) {
+ total_len = 0;
+ break;
+ } else if (rc > total_len) {
+ cERROR(1, ("sent %d requested %d", rc, total_len));
break;
}
if (rc == 0) {

2008-12-03 20:13:17

by Greg KH

Subject: [patch 058/104] powerpc/spufs: add a missing mutex_unlock

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Kou Ishizaki <[email protected]>

commit 6747c2ee8abf749e63fee8cd01a9ee293e6a4247 upstream.

A mutex_unlock(&gang->aff_mutex) in spufs_create_context() is missing
in case spufs_context_open() fails. As a result, spu_create syscall
and spu_get_idle() may block.

This patch adds the mutex_unlock.
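
The pattern being fixed is the usual one: every lock taken before a failure point has to be dropped on the error path as well. A toy user-space sketch, using a hypothetical flag-based lock (demo_lock) in place of a real mutex, with the two locks named after the ones in the hunk below:

```c
#include <stdbool.h>

/* Toy lock: just records whether it is held. Not a real mutex. */
struct demo_lock {
	bool held;
};

static void demo_acquire(struct demo_lock *l) { l->held = true; }
static void demo_release(struct demo_lock *l) { l->held = false; }

/* Returns 0 on success, -1 if the (simulated) open step fails.
 * Both paths release everything that was acquired. */
static int demo_create_context(struct demo_lock *i_mutex,
			       struct demo_lock *aff_mutex,
			       bool affinity, bool open_fails)
{
	demo_acquire(i_mutex);
	if (affinity)
		demo_acquire(aff_mutex);

	if (open_fails) {
		/* error path: without this release, aff_mutex stays held */
		if (affinity)
			demo_release(aff_mutex);
		demo_release(i_mutex);
		return -1;
	}

	if (affinity)
		demo_release(aff_mutex);
	demo_release(i_mutex);
	return 0;
}
```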

Signed-off-by: Kou Ishizaki <[email protected]>
Signed-off-by: Jeremy Kerr <[email protected]>
Acked-by: Andre Detsch <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/powerpc/platforms/cell/spufs/inode.c | 2 ++
1 file changed, 2 insertions(+)

--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -496,6 +496,8 @@ spufs_create_context(struct inode *inode
ret = spufs_context_open(dget(dentry), mntget(mnt));
if (ret < 0) {
WARN_ON(spufs_rmdir(inode, dentry));
+ if (affinity)
+ mutex_unlock(&gang->aff_mutex);
mutex_unlock(&inode->i_mutex);
spu_forget(SPUFS_I(dentry->d_inode)->i_ctx);
goto out;

2008-12-03 20:12:53

by Greg KH

Subject: [patch 057/104] powerpc/spufs: Fix spinning in spufs_ps_fault on signal

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jeremy Kerr <[email protected]>

commit 606572634c3faa5b32a8fc430266e6e9d78d2179 upstream.

Currently, we can end up in an infinite loop if we get a signal
while the kernel has faulted in spufs_ps_fault. Eg:

alarm(1);

write(fd, some_spu_psmap_register_address, 4);

- the write's copy_from_user will fault on the ps mapping, and
signal_pending will be non-zero. Because returning from the fault
handler will never clear TIF_SIGPENDING, we'll just keep faulting,
resulting in an unkillable process using 100% of CPU.

This change returns VM_FAULT_SIGBUS if there's a fatal signal pending,
letting us escape the loop.
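
A very reduced model of the loop, for illustration only. It assumes a handler that cannot make progress while a signal is pending and simply asks for the access to be retried; the demo_* names and the iteration bound are hypothetical, and the bound exists only so the unpatched case terminates in the example.

```c
#include <stdbool.h>

enum demo_fault {
	DEMO_FAULT_RETRY,	/* return to the faulting access and try again */
	DEMO_FAULT_SIGBUS	/* give up and report an error */
};

struct demo_task {
	bool fatal_signal_pending;
};

/* `check_signal` selects the patched behaviour. This models only the
 * stuck case, where the handler never manages to map the page. */
static enum demo_fault demo_ps_fault(const struct demo_task *t,
				     bool check_signal)
{
	if (check_signal && t->fatal_signal_pending)
		return DEMO_FAULT_SIGBUS;
	return DEMO_FAULT_RETRY;
}

/* Re-run the faulting access until the handler stops asking for a
 * retry or the bound is hit. Returns the number of faults taken. */
static int demo_access(const struct demo_task *t, bool check_signal,
		       int max_iters)
{
	int faults = 0;

	while (faults < max_iters) {
		faults++;
		if (demo_ps_fault(t, check_signal) == DEMO_FAULT_SIGBUS)
			break;
	}
	return faults;
}
```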

Signed-off-by: Jeremy Kerr <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
arch/powerpc/platforms/cell/spufs/file.c | 3 +++
1 file changed, 3 insertions(+)

--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -390,6 +390,9 @@ static int spufs_ps_fault(struct vm_area
if (offset >= ps_size)
return VM_FAULT_SIGBUS;

+ if (fatal_signal_pending(current))
+ return VM_FAULT_SIGBUS;
+
/*
* Because we release the mmap_sem, the context may be destroyed while
* we're in spu_wait. Grab an extra reference so it isn't destroyed

2008-12-03 20:13:36

by Greg KH

Subject: [patch 060/104] WATCHDOG: hpwdt: Fix kdump when using hpwdt

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Bernhard Walle <[email protected]>

commit 290172e79036fc25a22aaf3da4835ee634886183 upstream.

When the "hpwdt" module is loaded (even if the /dev/watchdog device is not
opened), then kdump does not work. The panic kernel either does not start at
all or crashes in various places.

The problem is that hpwdt_pretimeout is registered with register_die_notifier()
with the highest possible priority. Because it returns NOTIFY_STOP, the
crash_nmi_callback which is also registered with register_die_notifier()
is never executed. This causes the shutdown of other CPUs to fail.

Reversing the order is not an option: the crash_nmi_callback executes HLT
and so never returns normally. Because of that, it must be executed as
the last notifier, which is currently the case.

So this patch returns NOTIFY_OK, which lets the crash_nmi_callback run too.
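
The effect of the return value can be seen in a small user-space model of a notifier chain. The DEMO_NOTIFY_* values mirror the kernel's NOTIFY_* constants, but the chain walker, the callbacks and the state struct are hypothetical stand-ins, and the array is assumed to be already sorted by priority.

```c
#include <stddef.h>

#define DEMO_NOTIFY_DONE	0x0000
#define DEMO_NOTIFY_OK		0x0001
#define DEMO_NOTIFY_STOP_MASK	0x8000
#define DEMO_NOTIFY_STOP	(DEMO_NOTIFY_STOP_MASK | DEMO_NOTIFY_OK)

struct demo_notifier {
	int (*call)(void *data);
	int priority;		/* higher runs first; array is pre-sorted */
};

/* Walk the callbacks in order; stop once one sets the STOP bit. */
static int demo_call_chain(const struct demo_notifier *chain, size_t n,
			   void *data)
{
	int ret = DEMO_NOTIFY_DONE;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = chain[i].call(data);
		if (ret & DEMO_NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

struct demo_state {
	int watchdog_ret;	/* what the watchdog callback returns */
	int watchdog_ran;
	int crash_cb_ran;
};

static int demo_watchdog_cb(void *data)
{
	struct demo_state *s = data;

	s->watchdog_ran++;
	return s->watchdog_ret;
}

static int demo_crash_cb(void *data)
{
	struct demo_state *s = data;

	s->crash_cb_ran++;
	return DEMO_NOTIFY_STOP;
}

static const struct demo_notifier demo_chain[] = {
	{ demo_watchdog_cb, 0x7fffffff },	/* highest priority */
	{ demo_crash_cb, 0 },			/* must run last */
};
```

With watchdog_ret set to DEMO_NOTIFY_STOP the second callback is never reached; with DEMO_NOTIFY_OK it is.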

Signed-off-by: Bernhard Walle <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
Signed-off-by: Thomas Mingarelli <[email protected]>
Cc: Vivek Goyal <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/watchdog/hpwdt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -485,7 +485,7 @@ static int hpwdt_pretimeout(struct notif
"Management Log for details.\n");
}

- return NOTIFY_STOP;
+ return NOTIFY_OK;
}

/*

2008-12-03 20:14:17

by Greg KH

Subject: [patch 061/104] Remove -mno-spe flags as they dont belong

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Kumar Gala <[email protected]>

commit 65ecc14a30ad21bed9aabdfd6a2ae1a1aaaa6a00 upstream, tweaked to get
it to apply to 2.6.27

For some unknown reason Steven Rostedt added disabling of SPE
instruction generation for e500 based PPC cores in commit
6ec562328fda585be2d7f472cfac99d3b44d362a.

We are removing it because:

1. It generates e500 kernels that don't work
2. It's not the correct set of flags to do this
3. We handle this in the arch/powerpc/Makefile already
4. It's unknown, in talking to Steven, why he did this

Signed-off-by: Kumar Gala <[email protected]>
Tested-and-Acked-by: Steven Rostedt <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
kernel/Makefile | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -11,8 +11,6 @@ obj-y = sched.o fork.o exec_domain.o
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o

-CFLAGS_REMOVE_sched.o = -mno-spe
-
ifdef CONFIG_FTRACE
# Do not trace debug files and internal ftrace files
CFLAGS_REMOVE_lockdep.o = -pg
@@ -21,7 +19,7 @@ CFLAGS_REMOVE_mutex-debug.o = -pg
CFLAGS_REMOVE_rtmutex-debug.o = -pg
CFLAGS_REMOVE_cgroup-debug.o = -pg
CFLAGS_REMOVE_sched_clock.o = -pg
-CFLAGS_REMOVE_sched.o = -mno-spe -pg
+CFLAGS_REMOVE_sched.o = -pg
endif

obj-$(CONFIG_PROFILING) += profile.o

2008-12-03 20:13:51

by Greg KH

Subject: [patch 059/104] WATCHDOG: hpwdt: set the mapped BIOS address space as executable

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Bernhard Walle <[email protected]>

commit 060264133b946786b4b28a1ba79e6725eaf258f3 upstream.

The address provided by the SMBIOS/DMI CRU information is mapped via
ioremap() in the virtual address space. However, since the address is
executed (i.e. call'd), we need to set those pages as executable.

Without that, I get the following oops on an HP ProLiant DL385 G2
machine with BIOS from 05/29/2008 when I trigger a crashdump:

BUG: unable to handle kernel paging request at ffffc20011090c00
IP: [<ffffc20011090c00>] 0xffffc20011090c00
PGD 12f813067 PUD 7fe6a067 PMD 7effe067 PTE 80000000fffd3173
Oops: 0011 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1
Modules linked in: autofs4 ipv6 af_packet cpufreq_conservative cpufreq_userspace
cpufreq_powersave powernow_k8 fuse loop dm_mod rtc_cmos ipmi_si sg rtc_core i2c
_piix4 ipmi_msghandler bnx2 sr_mod container button i2c_core hpilo joydev pcspkr
rtc_lib shpchp hpwdt cdrom pci_hotplug usbhid hid ff_memless ohci_hcd ehci_hcd
uhci_hcd usbcore edd ext3 mbcache jbd fan ide_pci_generic serverworks ide_core p
ata_serverworks pata_acpi cciss ata_generic libata scsi_mod dock thermal process
or thermal_sys hwmon
Supported: Yes
Pid: 0, comm: swapper Not tainted 2.6.27.5-HEAD_20081111100657-default #1
RIP: 0010:[<ffffc20011090c00>] [<ffffc20011090c00>] 0xffffc20011090c00
RSP: 0018:ffff88012f6f9e68 EFLAGS: 00010046
RAX: 0000000000000d02 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88012f6f9e98 R08: 666666666666660a R09: ffffffffa1006fc0
R10: 0000000000000000 R11: ffff88012f6f3ea8 R12: ffffc20011090c00
R13: ffff88012f6f9ee8 R14: 000000000000000e R15: 0000000000000000
FS: 00007ff70b29a6f0(0000) GS:ffff88012f6512c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffffc20011090c00 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88012f6f2000, task ffff88007fa8a1c0)
Stack: ffffffffa0f8502b 0000000000000002 ffffffff80738d50 0000000000000000
0000000000000046 0000000000000046 00000000fffffffe ffffffffa0f852ec
0000000000000000 ffffffff804ad9a6 0000000000000000 0000000000000000
Call Trace:
Inexact backtrace:

<NMI> [<ffffffffa0f8502b>] ? asminline_call+0x2b/0x55 [hpwdt]
[<ffffffffa0f852ec>] hpwdt_pretimeout+0x3c/0xa0 [hpwdt]
[<ffffffff804ad9a6>] ? notifier_call_chain+0x29/0x4c
[<ffffffff802587e4>] ? notify_die+0x2d/0x32
[<ffffffff804abbdc>] ? default_do_nmi+0x53/0x1d9
[<ffffffff804abd90>] ? do_nmi+0x2e/0x43
[<ffffffff804ab552>] ? nmi+0xa2/0xd0
[<ffffffff80221ef9>] ? native_safe_halt+0x2/0x3
<<EOE>> [<ffffffff8021345d>] ? default_idle+0x38/0x54
[<ffffffff8021359a>] ? c1e_idle+0x118/0x11c
[<ffffffff8020b3b5>] ? cpu_idle+0xa9/0xf1

Code: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff <55> 50 e8 00 00 00 00 58 48 2d 07 10 40 00 48 8b e8 58 e9 68 02
RIP [<ffffc20011090c00>] 0xffffc20011090c00
RSP <ffff88012f6f9e68>
CR2: ffffc20011090c00
Kernel panic - not syncing: Fatal exception

Signed-off-by: Bernhard Walle <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: "H. Peter Anvin" <[email protected]>
Signed-off-by: Thomas Mingarelli <[email protected]>
Cc: Alan Cox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/watchdog/hpwdt.c | 3 +++
1 file changed, 3 insertions(+)

--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -40,6 +40,7 @@
#include <linux/bootmem.h>
#include <linux/slab.h>
#include <asm/desc.h>
+#include <asm/cacheflush.h>

#define PCI_BIOS32_SD_VALUE 0x5F32335F /* "_32_" */
#define CRU_BIOS_SIGNATURE_VALUE 0x55524324
@@ -394,6 +395,8 @@ static void __devinit dmi_find_cru(const
smbios_cru64_ptr->double_offset;
cru_rom_addr = ioremap(cru_physical_address,
smbios_cru64_ptr->double_length);
+ set_memory_x((unsigned long)cru_rom_addr & PAGE_MASK,
+ smbios_cru64_ptr->double_length >> PAGE_SHIFT);
}
}
}

2008-12-03 20:14:40

by Greg KH

Subject: [patch 062/104] ACPI: EC: count interrupts only if called from interrupt handler.

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alexey Starikovskiy <[email protected]>

commit 7b4d469228a92a00e412675817cedd60133de38a upstream.

fix 2.6.28 EC interrupt storm regression

Signed-off-by: Alexey Starikovskiy <[email protected]>
Signed-off-by: Len Brown <[email protected]>
Cc: Alan Jenkins <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/acpi/ec.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -219,7 +219,8 @@ static void gpe_transaction(struct acpi_
goto unlock;
err:
/* false interrupt, state didn't change */
- ++ec->curr->irq_count;
+ if (in_interrupt())
+ ++ec->curr->irq_count;
unlock:
spin_unlock_irqrestore(&ec->curr_lock, flags);
}

2008-12-03 20:14:58

by Greg KH

Subject: [patch 063/104] ieee1394: sbp2: another iPod mini quirk entry

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Stefan Richter <[email protected]>

commit 9e0de91011ef6fe6eb3bb63f7ea15f586955660a upstream.

Add another model ID of a broken firmware to prevent early I/O errors
by accesses at the end of the disk. Reported at linux1394-user,
http://marc.info/?t=122670842900002
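
How such a quirk entry takes effect can be sketched as a table lookup. This
is a simplified illustration only: it matches firmware revision and model ID
exactly, whereas the driver's real matching rules are not shown in this
patch. The flag value and the names quirk_table and lookup_workarounds are
assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define WORKAROUND_FIX_CAPACITY 0x8u    /* illustrative flag value */

struct quirk {
    uint32_t firmware_revision;
    uint32_t model_id;
    unsigned workarounds;
};

static const struct quirk quirk_table[] = {
    { 0x0a2700, 0x000022, WORKAROUND_FIX_CAPACITY }, /* iPod mini, new entry */
    { 0x0a2700, 0x000023, WORKAROUND_FIX_CAPACITY }, /* iPod mini, existing */
};

/* Return the workaround flags for a device, or 0 if no entry matches. */
unsigned lookup_workarounds(uint32_t firmware_revision, uint32_t model_id)
{
    for (size_t i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
        if (quirk_table[i].firmware_revision == firmware_revision &&
            quirk_table[i].model_id == model_id)
            return quirk_table[i].workarounds;
    }
    return 0;
}
```

Before the new entry, a device reporting model ID 0x000022 would fall
through to the "no workaround" case in a lookup of this kind.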

Signed-off-by: Stefan Richter <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/ieee1394/sbp2.c | 5 +++++
1 file changed, 5 insertions(+)

--- a/drivers/ieee1394/sbp2.c
+++ b/drivers/ieee1394/sbp2.c
@@ -402,6 +402,11 @@ static const struct {
},
/* iPod mini */ {
.firmware_revision = 0x0a2700,
+ .model_id = 0x000022,
+ .workarounds = SBP2_WORKAROUND_FIX_CAPACITY,
+ },
+ /* iPod mini */ {
+ .firmware_revision = 0x0a2700,
.model_id = 0x000023,
.workarounds = SBP2_WORKAROUND_FIX_CAPACITY,
},

2008-12-03 20:15:29

by Greg KH

Subject: [patch 064/104] firewire: fw-sbp2: another iPod mini quirk entry

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Stefan Richter <[email protected]>

commit 031bb27c4bf77c2f60b3f3dea8cce63ef0d1fba9 upstream.

Add another model ID of a broken firmware to prevent early I/O errors
by accesses at the end of the disk. Reported at linux1394-user,
http://marc.info/?t=122670842900002

Signed-off-by: Stefan Richter <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/firewire/fw-sbp2.c | 5 +++++
1 file changed, 5 insertions(+)

--- a/drivers/firewire/fw-sbp2.c
+++ b/drivers/firewire/fw-sbp2.c
@@ -365,6 +365,11 @@ static const struct {
},
/* iPod mini */ {
.firmware_revision = 0x0a2700,
+ .model = 0x000022,
+ .workarounds = SBP2_WORKAROUND_FIX_CAPACITY,
+ },
+ /* iPod mini */ {
+ .firmware_revision = 0x0a2700,
.model = 0x000023,
.workarounds = SBP2_WORKAROUND_FIX_CAPACITY,
},

2008-12-03 20:16:17

by Greg KH

Subject: [patch 066/104] net: Fix soft lockups/OOM issues w/ unix garbage collector (CVE-2008-5300)

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: dann frazier <[email protected]>

commit 5f23b734963ec7eaa3ebcd9050da0c9b7d143dd3 upstream.

This is an implementation of David Miller's suggested fix in:
https://bugzilla.redhat.com/show_bug.cgi?id=470201

It has been updated to use wait_event() instead of
wait_event_interruptible().

Paraphrasing the description from the above report, it makes sendmsg()
block while UNIX garbage collection is in progress. This avoids a
situation where child processes continue to queue new FDs over an
AF_UNIX socket to a parent which is in the exit path and running
garbage collection on these FDs. This contention can result in soft
lockups and oom-killing of unrelated processes.
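
The blocking behaviour described above can be modeled in user space with a
mutex and a condition variable standing in for wait_event() and wake_up().
This is a rough analogue under that assumption, not the kernel code; the
names gc_begin, gc_end, wait_for_gc and run_demo are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gc_done = PTHREAD_COND_INITIALIZER;
static bool gc_in_progress;
static bool sender_saw_gc_running;

/* Mark garbage collection as running. */
void gc_begin(void)
{
    pthread_mutex_lock(&gc_lock);
    gc_in_progress = true;
    pthread_mutex_unlock(&gc_lock);
}

/* Mark garbage collection as finished and wake every waiting sender. */
void gc_end(void)
{
    pthread_mutex_lock(&gc_lock);
    gc_in_progress = false;
    pthread_cond_broadcast(&gc_done);
    pthread_mutex_unlock(&gc_lock);
}

/* Block the caller until no garbage collection is in progress. */
void wait_for_gc(void)
{
    pthread_mutex_lock(&gc_lock);
    while (gc_in_progress)
        pthread_cond_wait(&gc_done, &gc_lock);
    pthread_mutex_unlock(&gc_lock);
}

/* Sender thread: waits for GC, then records whether GC was still running. */
static void *sender(void *arg)
{
    (void)arg;
    wait_for_gc();
    pthread_mutex_lock(&gc_lock);
    sender_saw_gc_running = gc_in_progress;
    pthread_mutex_unlock(&gc_lock);
    return NULL;
}

/* Returns 1 if the sender only proceeded once GC had finished, -1 on error. */
int run_demo(void)
{
    pthread_t t;

    gc_begin();
    if (pthread_create(&t, NULL, sender, NULL) != 0)
        return -1;
    gc_end();
    pthread_join(t, NULL);
    return sender_saw_gc_running ? 0 : 1;
}
```

The point of the design is that senders stop adding work while the
collector runs, instead of racing with it.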

Signed-off-by: dann frazier <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
include/net/af_unix.h | 1 +
net/unix/af_unix.c | 2 ++
net/unix/garbage.c | 13 ++++++++++---
3 files changed, 13 insertions(+), 3 deletions(-)

--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -9,6 +9,7 @@
extern void unix_inflight(struct file *fp);
extern void unix_notinflight(struct file *fp);
extern void unix_gc(void);
+extern void wait_for_unix_gc(void);

#define UNIX_HASH_SIZE 256

--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1341,6 +1341,7 @@ static int unix_dgram_sendmsg(struct kio

if (NULL == siocb->scm)
siocb->scm = &tmp_scm;
+ wait_for_unix_gc();
err = scm_send(sock, msg, siocb->scm);
if (err < 0)
return err;
@@ -1491,6 +1492,7 @@ static int unix_stream_sendmsg(struct ki

if (NULL == siocb->scm)
siocb->scm = &tmp_scm;
+ wait_for_unix_gc();
err = scm_send(sock, msg, siocb->scm);
if (err < 0)
return err;
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -80,6 +80,7 @@
#include <linux/file.h>
#include <linux/proc_fs.h>
#include <linux/mutex.h>
+#include <linux/wait.h>

#include <net/sock.h>
#include <net/af_unix.h>
@@ -91,6 +92,7 @@
static LIST_HEAD(gc_inflight_list);
static LIST_HEAD(gc_candidates);
static DEFINE_SPINLOCK(unix_gc_lock);
+static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);

unsigned int unix_tot_inflight;

@@ -266,12 +268,16 @@ static void inc_inflight_move_tail(struc
list_move_tail(&u->link, &gc_candidates);
}

-/* The external entry point: unix_gc() */
+static bool gc_in_progress = false;

-void unix_gc(void)
+void wait_for_unix_gc(void)
{
- static bool gc_in_progress = false;
+ wait_event(unix_gc_wait, gc_in_progress == false);
+}

+/* The external entry point: unix_gc() */
+void unix_gc(void)
+{
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -376,6 +382,7 @@ void unix_gc(void)
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
gc_in_progress = false;
+ wake_up(&unix_gc_wait);

out:
spin_unlock(&unix_gc_lock);

2008-12-03 20:15:50

by Greg KH

Subject: [patch 065/104] IB/mlx4: Fix MTT leakage in resize CQ


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jack Morgenstein <[email protected]>

commit 42ab01c31526ac1d06d193f81a498bf3cf2acfe4 upstream.

When resizing a CQ, MTTs associated with the old CQE buffer were not
freed. As a result, if any app used resize CQ repeatedly, all MTTs
were eventually exhausted, which led to all memory registration
operations failing until the driver was reloaded.

Once the RESIZE_CQ command returns successfully from FW, FW no longer
accesses the old CQ buffer, so it is safe to deallocate the MTT
entries used by the old CQ buffer.

Finally, if the RESIZE_CQ command fails, the MTTs allocated for the
new CQE buffer also need to be de-allocated.
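
The accounting rule in the three paragraphs above can be sketched with a
simple counter. This is a toy model under stated assumptions, not the mlx4
driver: struct buf, mtt_alloc, mtt_free, resize and mtt_count_after are all
hypothetical names, and one "MTT range" is reduced to a single count.

```c
#include <stdbool.h>

static int mtt_in_use;      /* translation-table ranges currently held */

struct buf {
    bool has_mtt;
};

static void mtt_alloc(struct buf *b)
{
    b->has_mtt = true;
    mtt_in_use++;
}

static void mtt_free(struct buf *b)
{
    if (b->has_mtt) {
        b->has_mtt = false;
        mtt_in_use--;
    }
}

/*
 * Resize a queue. On success the old buffer's entries are released;
 * on failure the entries of the new buffer are released instead.
 */
int resize(struct buf *cq, bool hw_resize_ok)
{
    struct buf new_buf = { false };
    struct buf old;

    mtt_alloc(&new_buf);
    old = *cq;              /* remember the old entries before switching */

    if (!hw_resize_ok) {
        mtt_free(&new_buf);
        return -1;
    }

    mtt_free(&old);
    *cq = new_buf;
    return 0;
}

/* Run a number of resizes and report how many ranges remain allocated. */
int mtt_count_after(int resizes, bool hw_resize_ok)
{
    struct buf cq = { false };

    mtt_in_use = 0;
    mtt_alloc(&cq);
    for (int i = 0; i < resizes; i++)
        resize(&cq, hw_resize_ok);
    return mtt_in_use;
}
```

With both cleanup paths in place, the count stays at one no matter how many
resizes succeed or fail; dropping either mtt_free() call makes it grow.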

This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1416>.

Signed-off-by: Jack Morgenstein <[email protected]>
Signed-off-by: Roland Dreier <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/infiniband/hw/mlx4/cq.c | 5 +++++
1 file changed, 5 insertions(+)

--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -343,6 +343,7 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq
{
struct mlx4_ib_dev *dev = to_mdev(ibcq->device);
struct mlx4_ib_cq *cq = to_mcq(ibcq);
+ struct mlx4_mtt mtt;
int outst_cqe;
int err;

@@ -376,10 +377,13 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq
goto out;
}

+ mtt = cq->buf.mtt;
+
err = mlx4_cq_resize(dev->dev, &cq->mcq, entries, &cq->resize_buf->buf.mtt);
if (err)
goto err_buf;

+ mlx4_mtt_cleanup(dev->dev, &mtt);
if (ibcq->uobject) {
cq->buf = cq->resize_buf->buf;
cq->ibcq.cqe = cq->resize_buf->cqe;
@@ -406,6 +410,7 @@ int mlx4_ib_resize_cq(struct ib_cq *ibcq
goto out;

err_buf:
+ mlx4_mtt_cleanup(dev->dev, &cq->resize_buf->buf.mtt);
if (!ibcq->uobject)
mlx4_ib_free_cq_buf(dev, &cq->resize_buf->buf,
cq->resize_buf->cqe);

2008-12-03 20:16:36

by Greg KH

Subject: [patch 067/104] libata: improve phantom device detection

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Tejun Heo <[email protected]>

commit 6a6b97d360702b98c02c7fca4c4e088dcf3a2985 upstream.

Currently libata uses four methods to detect device presence.

1. PHY status if available.
2. TF register R/W test (only promotes presence, never demotes)
3. device signature after reset
4. IDENTIFY failure detection in SFF state machine

The combination of the above works well in most cases, but recently there
have been a few reports where a phantom device causes unnecessary
delay during probe. In both cases, PHY status wasn't available. In
one case, the device passed #2 and #3 and failed IDENTIFY with ATA_ERR,
which didn't qualify as #4. The other failed #2 but, as it passed #3 and
#4, it still caused failure.

In both cases, the phantom device reported diagnostic failure, so these
cases can be safely worked around by considering any !ATA_DRQ IDENTIFY
failure as NODEV_HINT if diagnostic failure is set.
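
The resulting decision for the "no DRQ" case can be sketched as a small
pure function. The status bit positions are the standard ATA ones; the
error-mask values and the name classify_no_drq are assumptions for
illustration, and the sketch does not check whether the command is
IDENTIFY. The final else branch follows the "Mark hint" comment visible in
the hunk below.

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA status register bits */
#define ATA_ERR (1u << 0)
#define ATA_DRQ (1u << 3)
#define ATA_DF  (1u << 5)

/* Error-mask flags; the values are illustrative only */
#define ERR_DEV        (1u << 0)
#define ERR_HSM        (1u << 1)
#define ERR_NODEV_HINT (1u << 2)

/* Classify a PIO command whose status shows BSY=0 and DRQ=0. */
unsigned classify_no_drq(uint8_t status, bool diagnostic_failed)
{
    unsigned err_mask = 0;

    if (status & ATA_DRQ)
        return 0;       /* data phase continues; not handled here */

    if (status & (ATA_ERR | ATA_DF)) {
        /* the device reported an abort or error */
        err_mask |= ERR_DEV;
        /* with a failed diagnostic, it is likely a phantom device */
        if (diagnostic_failed)
            err_mask |= ERR_NODEV_HINT;
    } else {
        /* protocol violation; phantom devices also trigger this */
        err_mask |= ERR_HSM | ERR_NODEV_HINT;
    }
    return err_mask;
}
```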

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jeff Garzik <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
drivers/ata/libata-sff.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -1227,10 +1227,19 @@ fsm_start:
/* ATA PIO protocol */
if (unlikely((status & ATA_DRQ) == 0)) {
/* handle BSY=0, DRQ=0 as error */
- if (likely(status & (ATA_ERR | ATA_DF)))
+ if (likely(status & (ATA_ERR | ATA_DF))) {
/* device stops HSM for abort/error */
qc->err_mask |= AC_ERR_DEV;
- else {
+
+ /* If diagnostic failed and this is
+ * IDENTIFY, it's likely a phantom
+ * device. Mark hint.
+ */
+ if (qc->dev->horkage &
+ ATA_HORKAGE_DIAGNOSTIC)
+ qc->err_mask |=
+ AC_ERR_NODEV_HINT;
+ } else {
/* HSM violation. Let EH handle this.
* Phantom devices also trigger this
* condition. Mark hint.

2008-12-03 20:17:20

by Greg KH

Subject: [patch 069/104] cifs: remove unused list, add new cifs sock list to prepare for mount/umount fix

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit fb396016647ae9de5b3bd8c4ee4f7b9cc7148bd5 upstream.

Also adds two lines missing from the previous patch (for the need reconnect flag in the
/proc/fs/cifs/DebugData handling)

The new global_cifs_sock_list is added, and initialized in init_cifs but not used yet.
Jeff Layton will be adding code in to use that and to remove the GlobalTcon and GlobalSMBSession
lists.

CC: Jeff Layton <[email protected]>
CC: Shirish Pargaonkar <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifs_debug.c | 4 ++--
fs/cifs/cifsfs.c | 6 +++---
fs/cifs/cifsglob.h | 23 ++++++++---------------
3 files changed, 13 insertions(+), 20 deletions(-)

--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -204,7 +204,7 @@ static int cifs_debug_data_proc_show(str
else
seq_printf(m, " type: %d ", dev_type);

- if (tcon->tidStatus == CifsNeedReconnect)
+ if (tcon->need_reconnect)
seq_puts(m, "\tDISCONNECTED ");
}
read_unlock(&GlobalSMBSeslock);
@@ -311,7 +311,7 @@ static int cifs_stats_proc_show(struct s
i++;
tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
seq_printf(m, "\n%d) %s", i, tcon->treeName);
- if (tcon->tidStatus == CifsNeedReconnect)
+ if (tcon->need_reconnect)
seq_puts(m, "\tDISCONNECTED ");
seq_printf(m, "\nSMBs: %d Oplock Breaks: %d",
atomic_read(&tcon->num_smbs_sent),
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1013,9 +1013,9 @@ init_cifs(void)
{
int rc = 0;
cifs_proc_init();
-/* INIT_LIST_HEAD(&GlobalServerList);*/ /* BB not implemented yet */
- INIT_LIST_HEAD(&GlobalSMBSessionList);
- INIT_LIST_HEAD(&GlobalTreeConnectionList);
+ INIT_LIST_HEAD(&global_cifs_sock_list);
+ INIT_LIST_HEAD(&GlobalSMBSessionList); /* BB to be removed by jl */
+ INIT_LIST_HEAD(&GlobalTreeConnectionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalOplock_Q);
#ifdef CONFIG_CIFS_EXPERIMENTAL
INIT_LIST_HEAD(&GlobalDnotifyReqList);
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -590,22 +590,15 @@ require use of the stronger protocol */
#define GLOBAL_EXTERN extern
#endif

-/*
- * The list of servers that did not respond with NT LM 0.12.
- * This list helps improve performance and eliminate the messages indicating
- * that we had a communications error talking to the server in this list.
- */
-/* Feature not supported */
-/* GLOBAL_EXTERN struct servers_not_supported *NotSuppList; */
-
-/*
- * The following is a hash table of all the users we know about.
- */
-GLOBAL_EXTERN struct smbUidInfo *GlobalUidList[UID_HASH];

-/* GLOBAL_EXTERN struct list_head GlobalServerList; BB not implemented yet */
-GLOBAL_EXTERN struct list_head GlobalSMBSessionList;
-GLOBAL_EXTERN struct list_head GlobalTreeConnectionList;
+/* the list of TCP_Server_Info structures, ie each of the sockets
+ * connecting our client to a distinct server (ip address), is
+ * chained together by global_cifs_sock_list. The list of all our SMB
+ * sessions (and from that the tree connections) can be found
+ * by iterating over global_cifs_sock_list */
+GLOBAL_EXTERN struct list_head global_cifs_sock_list;
+GLOBAL_EXTERN struct list_head GlobalSMBSessionList; /* BB to be removed by jl*/
+GLOBAL_EXTERN struct list_head GlobalTreeConnectionList; /* BB to be removed */
GLOBAL_EXTERN rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */

GLOBAL_EXTERN struct list_head GlobalOplock_Q;

2008-12-03 20:16:54

by Greg KH

Subject: [patch 068/104] cifs: Fix cifs reconnection flags

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit 3b7952109361c684caf0c50474da8662ecc81019 upstream

[CIFS] Fix cifs reconnection flags

In preparation for Jeff's big umount/mount fixes to remove the possibility of
various races in cifs mount and linked list handling of sessions, sockets and
tree connections, this patch cleans up some repetitive code in cifs_mount,
and addresses a problem with ses->status and tcon->tidStatus in which we
were overloading the "need_reconnect" state with other status in that
field. So the "need_reconnect" flag has been broken out from those
two state fields (need reconnect was not mutually exclusive from some of the
other possible tid and ses states). In addition, a few exit cases in
cifs_mount were cleaned up, and a problem in which a tcon flag (for lease
support) was not being set consistently for the 2nd mount of the same
share was addressed.
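
Why a separate flag helps can be shown with a small model. It is only a
sketch under assumed names (conn_status, struct tcon_model and the helper
functions are hypothetical, not the cifs structures): with a single status
field, marking "needs reconnect" would overwrite a state such as "exiting",
whereas a separate boolean lets both facts be recorded.

```c
#include <stdbool.h>

/* Hypothetical lifecycle states, kept apart from the reconnect flag. */
enum conn_status { CONN_NEW, CONN_GOOD, CONN_EXITING };

struct tcon_model {
    enum conn_status status;
    bool need_reconnect;    /* connection reset, id no longer valid */
};

/* Record a connection reset without touching the lifecycle state. */
void mark_connection_reset(struct tcon_model *t)
{
    t->need_reconnect = true;
}

/* Returns true if a reconnect was pending, and clears the flag. */
bool reconnect_if_needed(struct tcon_model *t)
{
    if (!t->need_reconnect)
        return false;
    t->need_reconnect = false;
    return true;
}

/* Returns 1 if an "exiting" state survives a reset being recorded. */
int exiting_survives_reset(void)
{
    struct tcon_model t = { CONN_EXITING, false };

    mark_connection_reset(&t);
    return t.status == CONN_EXITING && t.need_reconnect;
}

/* Returns 1 if a pending reconnect is handled exactly once. */
int reconnect_handled_once(void)
{
    struct tcon_model t = { CONN_GOOD, false };

    mark_connection_reset(&t);
    return reconnect_if_needed(&t) && !reconnect_if_needed(&t);
}
```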

CC: Jeff Layton <[email protected]>
CC: Shirish Pargaonkar <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifsfs.c | 2
fs/cifs/cifsglob.h | 5 +
fs/cifs/cifssmb.c | 40 ++++----
fs/cifs/connect.c | 252 ++++++++++++++++++++++++++---------------------------
fs/cifs/file.c | 2
5 files changed, 155 insertions(+), 146 deletions(-)

--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -967,7 +967,7 @@ static int cifs_oplock_thread(void *dumm
not bother sending an oplock release if session
to server still is disconnected since oplock
already released by the server in that case */
- if (pTcon->tidStatus != CifsNeedReconnect) {
+ if (!pTcon->need_reconnect) {
rc = CIFSSMBLock(0, pTcon, netfid,
0 /* len */ , 0 /* offset */, 0,
0, LOCKING_ANDX_OPLOCK_RELEASE,
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -122,6 +122,8 @@ struct cifs_cred {
*/

struct TCP_Server_Info {
+ struct list_head tcp_ses_list;
+ struct list_head smb_ses_list;
/* 15 character server name + 0x20 16th byte indicating type = srv */
char server_RFC1001_name[SERVER_NAME_LEN_WITH_NULL];
char unicode_server_Name[SERVER_NAME_LEN_WITH_NULL * 2];
@@ -195,6 +197,7 @@ struct cifsUidInfo {
*/
struct cifsSesInfo {
struct list_head cifsSessionList;
+ struct list_head tcon_list;
struct semaphore sesSem;
#if 0
struct cifsUidInfo *uidInfo; /* pointer to user info */
@@ -216,6 +219,7 @@ struct cifsSesInfo {
char userName[MAX_USERNAME_SIZE + 1];
char *domainName;
char *password;
+ bool need_reconnect:1; /* connection reset, uid now invalid */
};
/* no more than one of the following three session flags may be set */
#define CIFS_SES_NT4 1
@@ -287,6 +291,7 @@ struct cifsTconInfo {
bool seal:1; /* transport encryption for this mounted share */
bool unix_ext:1; /* if false disable Linux extensions to CIFS protocol
for this mount even if server would support */
+ bool need_reconnect:1; /* connection reset, tid now invalid */
/* BB add field for back pointer to sb struct(s)? */
};

--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -190,10 +190,10 @@ small_smb_init(int smb_command, int wct,
/* need to prevent multiple threads trying to
simultaneously reconnect the same SMB session */
down(&tcon->ses->sesSem);
- if (tcon->ses->status == CifsNeedReconnect)
+ if (tcon->ses->need_reconnect)
rc = cifs_setup_session(0, tcon->ses,
nls_codepage);
- if (!rc && (tcon->tidStatus == CifsNeedReconnect)) {
+ if (!rc && (tcon->need_reconnect)) {
mark_open_files_invalid(tcon);
rc = CIFSTCon(0, tcon->ses, tcon->treeName,
tcon, nls_codepage);
@@ -295,7 +295,7 @@ smb_init(int smb_command, int wct, struc
check for tcp and smb session status done differently
for those three - in the calling routine */
if (tcon) {
- if (tcon->tidStatus == CifsExiting) {
+ if (tcon->need_reconnect) {
/* only tree disconnect, open, and write,
(and ulogoff which does not have tcon)
are allowed as we start force umount */
@@ -337,10 +337,10 @@ smb_init(int smb_command, int wct, struc
/* need to prevent multiple threads trying to
simultaneously reconnect the same SMB session */
down(&tcon->ses->sesSem);
- if (tcon->ses->status == CifsNeedReconnect)
+ if (tcon->ses->need_reconnect)
rc = cifs_setup_session(0, tcon->ses,
nls_codepage);
- if (!rc && (tcon->tidStatus == CifsNeedReconnect)) {
+ if (!rc && (tcon->need_reconnect)) {
mark_open_files_invalid(tcon);
rc = CIFSTCon(0, tcon->ses, tcon->treeName,
tcon, nls_codepage);
@@ -759,7 +759,7 @@ CIFSSMBTDis(const int xid, struct cifsTc

/* No need to return error on this operation if tid invalidated and
closed on server already e.g. due to tcp session crashing */
- if (tcon->tidStatus == CifsNeedReconnect) {
+ if (tcon->need_reconnect) {
up(&tcon->tconSem);
return 0;
}
@@ -806,32 +806,36 @@ CIFSSMBLogoff(const int xid, struct cifs
up(&ses->sesSem);
return -EBUSY;
}
+
+ if (ses->server == NULL)
+ return -EIO;
+
+ if (ses->need_reconnect)
+ goto session_already_dead; /* no need to send SMBlogoff if uid
+ already closed due to reconnect */
rc = small_smb_init(SMB_COM_LOGOFF_ANDX, 2, NULL, (void **)&pSMB);
if (rc) {
up(&ses->sesSem);
return rc;
}

- if (ses->server) {
- pSMB->hdr.Mid = GetNextMid(ses->server);
+ pSMB->hdr.Mid = GetNextMid(ses->server);

- if (ses->server->secMode &
+ if (ses->server->secMode &
(SECMODE_SIGN_REQUIRED | SECMODE_SIGN_ENABLED))
pSMB->hdr.Flags2 |= SMBFLG2_SECURITY_SIGNATURE;
- }

pSMB->hdr.Uid = ses->Suid;

pSMB->AndXCommand = 0xFF;
rc = SendReceiveNoRsp(xid, ses, (struct smb_hdr *) pSMB, 0);
- if (ses->server) {
- atomic_dec(&ses->server->socketUseCount);
- if (atomic_read(&ses->server->socketUseCount) == 0) {
- spin_lock(&GlobalMid_Lock);
- ses->server->tcpStatus = CifsExiting;
- spin_unlock(&GlobalMid_Lock);
- rc = -ESHUTDOWN;
- }
+session_already_dead:
+ atomic_dec(&ses->server->socketUseCount);
+ if (atomic_read(&ses->server->socketUseCount) == 0) {
+ spin_lock(&GlobalMid_Lock);
+ ses->server->tcpStatus = CifsExiting;
+ spin_unlock(&GlobalMid_Lock);
+ rc = -ESHUTDOWN;
}
up(&ses->sesSem);

--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -147,7 +147,7 @@ cifs_reconnect(struct TCP_Server_Info *s
ses = list_entry(tmp, struct cifsSesInfo, cifsSessionList);
if (ses->server) {
if (ses->server == server) {
- ses->status = CifsNeedReconnect;
+ ses->need_reconnect = true;
ses->ipc_tid = 0;
}
}
@@ -156,7 +156,7 @@ cifs_reconnect(struct TCP_Server_Info *s
list_for_each(tmp, &GlobalTreeConnectionList) {
tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
if ((tcon->ses) && (tcon->ses->server == server))
- tcon->tidStatus = CifsNeedReconnect;
+ tcon->need_reconnect = true;
}
read_unlock(&GlobalSMBSeslock);
/* do not want to be sending data on a socket we are freeing */
@@ -1868,6 +1868,92 @@ convert_delimiter(char *path, char delim
}
}

+static void setup_cifs_sb(struct smb_vol *pvolume_info,
+ struct cifs_sb_info *cifs_sb)
+{
+ if (pvolume_info->rsize > CIFSMaxBufSize) {
+ cERROR(1, ("rsize %d too large, using MaxBufSize",
+ pvolume_info->rsize));
+ cifs_sb->rsize = CIFSMaxBufSize;
+ } else if ((pvolume_info->rsize) &&
+ (pvolume_info->rsize <= CIFSMaxBufSize))
+ cifs_sb->rsize = pvolume_info->rsize;
+ else /* default */
+ cifs_sb->rsize = CIFSMaxBufSize;
+
+ if (pvolume_info->wsize > PAGEVEC_SIZE * PAGE_CACHE_SIZE) {
+ cERROR(1, ("wsize %d too large, using 4096 instead",
+ pvolume_info->wsize));
+ cifs_sb->wsize = 4096;
+ } else if (pvolume_info->wsize)
+ cifs_sb->wsize = pvolume_info->wsize;
+ else
+ cifs_sb->wsize = min_t(const int,
+ PAGEVEC_SIZE * PAGE_CACHE_SIZE,
+ 127*1024);
+ /* old default of CIFSMaxBufSize was too small now
+ that SMB Write2 can send multiple pages in kvec.
+ RFC1001 does not describe what happens when frame
+ bigger than 128K is sent so use that as max in
+ conjunction with 52K kvec constraint on arch with 4K
+ page size */
+
+ if (cifs_sb->rsize < 2048) {
+ cifs_sb->rsize = 2048;
+ /* Windows ME may prefer this */
+ cFYI(1, ("readsize set to minimum: 2048"));
+ }
+ /* calculate prepath */
+ cifs_sb->prepath = pvolume_info->prepath;
+ if (cifs_sb->prepath) {
+ cifs_sb->prepathlen = strlen(cifs_sb->prepath);
+ /* we can not convert the / to \ in the path
+ separators in the prefixpath yet because we do not
+ know (until reset_cifs_unix_caps is called later)
+ whether POSIX PATH CAP is available. We normalize
+ the / to \ after reset_cifs_unix_caps is called */
+ pvolume_info->prepath = NULL;
+ } else
+ cifs_sb->prepathlen = 0;
+ cifs_sb->mnt_uid = pvolume_info->linux_uid;
+ cifs_sb->mnt_gid = pvolume_info->linux_gid;
+ cifs_sb->mnt_file_mode = pvolume_info->file_mode;
+ cifs_sb->mnt_dir_mode = pvolume_info->dir_mode;
+ cFYI(1, ("file mode: 0x%x dir mode: 0x%x",
+ cifs_sb->mnt_file_mode, cifs_sb->mnt_dir_mode));
+
+ if (pvolume_info->noperm)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_PERM;
+ if (pvolume_info->setuids)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_SET_UID;
+ if (pvolume_info->server_ino)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_SERVER_INUM;
+ if (pvolume_info->remap)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_MAP_SPECIAL_CHR;
+ if (pvolume_info->no_xattr)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_XATTR;
+ if (pvolume_info->sfu_emul)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_UNX_EMUL;
+ if (pvolume_info->nobrl)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_BRL;
+ if (pvolume_info->cifs_acl)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_CIFS_ACL;
+ if (pvolume_info->override_uid)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_OVERR_UID;
+ if (pvolume_info->override_gid)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_OVERR_GID;
+ if (pvolume_info->dynperm)
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_DYNPERM;
+ if (pvolume_info->direct_io) {
+ cFYI(1, ("mounting share using direct i/o"));
+ cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_DIRECT_IO;
+ }
+
+ if ((pvolume_info->cifs_acl) && (pvolume_info->dynperm))
+ cERROR(1, ("mount option dynperm ignored if cifsacl "
+ "mount option supported"));
+}
+
int
cifs_mount(struct super_block *sb, struct cifs_sb_info *cifs_sb,
char *mount_data, const char *devname)
@@ -1973,9 +2059,7 @@ cifs_mount(struct super_block *sb, struc
goto out;
}

- if (srvTcp) {
- cFYI(1, ("Existing tcp session with server found"));
- } else { /* create socket */
+ if (!srvTcp) { /* create socket */
if (volume_info.port)
sin_server.sin_port = htons(volume_info.port);
else
@@ -2051,7 +2135,7 @@ cifs_mount(struct super_block *sb, struc
cFYI(1, ("Existing smb sess found (status=%d)",
pSesInfo->status));
down(&pSesInfo->sesSem);
- if (pSesInfo->status == CifsNeedReconnect) {
+ if (pSesInfo->need_reconnect) {
cFYI(1, ("Session needs reconnect"));
rc = cifs_setup_session(xid, pSesInfo,
cifs_sb->local_nls);
@@ -2101,139 +2185,52 @@ cifs_mount(struct super_block *sb, struc

/* search for existing tcon to this server share */
if (!rc) {
- if (volume_info.rsize > CIFSMaxBufSize) {
- cERROR(1, ("rsize %d too large, using MaxBufSize",
- volume_info.rsize));
- cifs_sb->rsize = CIFSMaxBufSize;
- } else if ((volume_info.rsize) &&
- (volume_info.rsize <= CIFSMaxBufSize))
- cifs_sb->rsize = volume_info.rsize;
- else /* default */
- cifs_sb->rsize = CIFSMaxBufSize;
-
- if (volume_info.wsize > PAGEVEC_SIZE * PAGE_CACHE_SIZE) {
- cERROR(1, ("wsize %d too large, using 4096 instead",
- volume_info.wsize));
- cifs_sb->wsize = 4096;
- } else if (volume_info.wsize)
- cifs_sb->wsize = volume_info.wsize;
- else
- cifs_sb->wsize =
- min_t(const int, PAGEVEC_SIZE * PAGE_CACHE_SIZE,
- 127*1024);
- /* old default of CIFSMaxBufSize was too small now
- that SMB Write2 can send multiple pages in kvec.
- RFC1001 does not describe what happens when frame
- bigger than 128K is sent so use that as max in
- conjunction with 52K kvec constraint on arch with 4K
- page size */
-
- if (cifs_sb->rsize < 2048) {
- cifs_sb->rsize = 2048;
- /* Windows ME may prefer this */
- cFYI(1, ("readsize set to minimum: 2048"));
- }
- /* calculate prepath */
- cifs_sb->prepath = volume_info.prepath;
- if (cifs_sb->prepath) {
- cifs_sb->prepathlen = strlen(cifs_sb->prepath);
- /* we can not convert the / to \ in the path
- separators in the prefixpath yet because we do not
- know (until reset_cifs_unix_caps is called later)
- whether POSIX PATH CAP is available. We normalize
- the / to \ after reset_cifs_unix_caps is called */
- volume_info.prepath = NULL;
- } else
- cifs_sb->prepathlen = 0;
- cifs_sb->mnt_uid = volume_info.linux_uid;
- cifs_sb->mnt_gid = volume_info.linux_gid;
- cifs_sb->mnt_file_mode = volume_info.file_mode;
- cifs_sb->mnt_dir_mode = volume_info.dir_mode;
- cFYI(1, ("file mode: 0x%x dir mode: 0x%x",
- cifs_sb->mnt_file_mode, cifs_sb->mnt_dir_mode));
-
- if (volume_info.noperm)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_PERM;
- if (volume_info.setuids)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_SET_UID;
- if (volume_info.server_ino)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_SERVER_INUM;
- if (volume_info.remap)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_MAP_SPECIAL_CHR;
- if (volume_info.no_xattr)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_XATTR;
- if (volume_info.sfu_emul)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_UNX_EMUL;
- if (volume_info.nobrl)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NO_BRL;
- if (volume_info.cifs_acl)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_CIFS_ACL;
- if (volume_info.override_uid)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_OVERR_UID;
- if (volume_info.override_gid)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_OVERR_GID;
- if (volume_info.dynperm)
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_DYNPERM;
- if (volume_info.direct_io) {
- cFYI(1, ("mounting share using direct i/o"));
- cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_DIRECT_IO;
- }
-
- if ((volume_info.cifs_acl) && (volume_info.dynperm))
- cERROR(1, ("mount option dynperm ignored if cifsacl "
- "mount option supported"));
-
+ setup_cifs_sb(&volume_info, cifs_sb);
tcon =
find_unc(sin_server.sin_addr.s_addr, volume_info.UNC,
volume_info.username);
if (tcon) {
cFYI(1, ("Found match on UNC path"));
- /* we can have only one retry value for a connection
- to a share so for resources mounted more than once
- to the same server share the last value passed in
- for the retry flag is used */
- tcon->retry = volume_info.retry;
- tcon->nocase = volume_info.nocase;
if (tcon->seal != volume_info.seal)
cERROR(1, ("transport encryption setting "
"conflicts with existing tid"));
} else {
tcon = tconInfoAlloc();
- if (tcon == NULL)
+ if (tcon == NULL) {
rc = -ENOMEM;
- else {
- /* check for null share name ie connecting to
- * dfs root */
-
- /* BB check if this works for exactly length
- * three strings */
- if ((strchr(volume_info.UNC + 3, '\\') == NULL)
- && (strchr(volume_info.UNC + 3, '/') ==
- NULL)) {
-/* rc = connect_to_dfs_path(xid, pSesInfo,
- "", cifs_sb->local_nls,
- cifs_sb->mnt_cifs_flags &
- CIFS_MOUNT_MAP_SPECIAL_CHR);*/
- cFYI(1, ("DFS root not supported"));
- rc = -ENODEV;
- goto out;
- } else {
- /* BB Do we need to wrap sesSem around
- * this TCon call and Unix SetFS as
- * we do on SessSetup and reconnect? */
- rc = CIFSTCon(xid, pSesInfo,
- volume_info.UNC,
- tcon, cifs_sb->local_nls);
- cFYI(1, ("CIFS Tcon rc = %d", rc));
- }
- if (!rc) {
- atomic_inc(&pSesInfo->inUse);
- tcon->retry = volume_info.retry;
- tcon->nocase = volume_info.nocase;
- tcon->seal = volume_info.seal;
- }
+ goto mount_fail_check;
}
+
+ /* check for null share name ie connect to dfs root */
+
+ /* BB check if works for exactly length 3 strings */
+ if ((strchr(volume_info.UNC + 3, '\\') == NULL)
+ && (strchr(volume_info.UNC + 3, '/') == NULL)) {
+ /* rc = connect_to_dfs_path(...) */
+ cFYI(1, ("DFS root not supported"));
+ rc = -ENODEV;
+ goto mount_fail_check;
+ } else {
+ /* BB Do we need to wrap sesSem around
+ * this TCon call and Unix SetFS as
+ * we do on SessSetup and reconnect? */
+ rc = CIFSTCon(xid, pSesInfo, volume_info.UNC,
+ tcon, cifs_sb->local_nls);
+ cFYI(1, ("CIFS Tcon rc = %d", rc));
+ }
+ if (!rc) {
+ atomic_inc(&pSesInfo->inUse);
+ tcon->seal = volume_info.seal;
+ } else
+ goto mount_fail_check;
}
+
+ /* we can have only one retry value for a connection
+ to a share so for resources mounted more than once
+ to the same server share the last value passed in
+ for the retry flag is used */
+ tcon->retry = volume_info.retry;
+ tcon->nocase = volume_info.nocase;
}
if (pSesInfo) {
if (pSesInfo->capabilities & CAP_LARGE_FILES) {
@@ -2246,6 +2243,7 @@ cifs_mount(struct super_block *sb, struc
sb->s_time_gran = 100;

/* on error free sesinfo and tcon struct if needed */
+mount_fail_check:
if (rc) {
/* if session setup failed, use count is zero but
we still need to free cifsd thread */
@@ -3499,6 +3497,7 @@ CIFSTCon(unsigned int xid, struct cifsSe
/* above now done in SendReceive */
if ((rc == 0) && (tcon != NULL)) {
tcon->tidStatus = CifsGood;
+ tcon->need_reconnect = false;
tcon->tid = smb_buffer_response->Tid;
bcc_ptr = pByteArea(smb_buffer_response);
length = strnlen(bcc_ptr, BCC(smb_buffer_response) - 2);
@@ -3730,6 +3729,7 @@ int cifs_setup_session(unsigned int xid,
} else {
cFYI(1, ("CIFS Session Established successfully"));
pSesInfo->status = CifsGood;
+ pSesInfo->need_reconnect = false;
}

ss_err_exit:
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -493,7 +493,7 @@ int cifs_close(struct inode *inode, stru
if (pTcon) {
/* no sense reconnecting to close a file that is
already closed */
- if (pTcon->tidStatus != CifsNeedReconnect) {
+ if (!pTcon->need_reconnect) {
timeout = 2;
while ((atomic_read(&pSMBFile->wrtPending) != 0)
&& (timeout <= 2048)) {

2008-12-03 20:17:39

by Greg KH

Subject: [patch 070/104] cifs: clean up server protocol handling

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit 3ec332ef7a38c2327e18d087d4120a8e3bd3dc6e upstream.

We're currently declaring both a sockaddr_in and sockaddr_in6 on the
stack, but we really only need storage for one of them. Declare a
sockaddr struct and cast it to the proper type. Also, eliminate the
protocolType field in the TCP_Server_Info struct. It's redundant since
we have a sa_family field in the sockaddr anyway.

We may need to revisit this if SCTP is ever implemented, but for now
this will simplify the code.

CIFS over IPv6 also has a number of problems currently. This fixes all
of them that I found. Eventually, it would be nice to move more of the
code to be protocol independent, but this is a start.
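
As an illustration only (this is not code from the patch), the idea of keeping one address buffer and dispatching on its family field can be sketched in user space as below. Two things here are assumptions of the sketch: the helper name parse_server_addr is hypothetical, and the buffer is a struct sockaddr_storage rather than the plain struct sockaddr the patch declares, because on common Linux ABIs struct sockaddr is 16 bytes while struct sockaddr_in6 is 28 bytes, so the larger type is the safe choice when an IPv6 address may be written through the cast pointer.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Parse a textual address into a single buffer: try IPv4 first, then IPv6.
 * Returns the address family on success, or -1 if neither parser accepts
 * the string. parse_server_addr is a hypothetical helper, not a CIFS
 * function.
 */
static int parse_server_addr(const char *ip, unsigned short port,
                             struct sockaddr_storage *addr)
{
    /* Two typed views of the same storage, as in the patch description. */
    struct sockaddr_in *sin = (struct sockaddr_in *)addr;
    struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)addr;

    memset(addr, 0, sizeof(*addr));

    if (inet_pton(AF_INET, ip, &sin->sin_addr) > 0) {
        sin->sin_family = AF_INET;
        sin->sin_port = htons(port);
        return AF_INET;
    }
    if (inet_pton(AF_INET6, ip, &sin6->sin6_addr) > 0) {
        sin6->sin6_family = AF_INET6;
        sin6->sin6_port = htons(port);
        return AF_INET6;
    }
    return -1;
}
```

A caller would then branch on the returned family (or on the ss_family field) instead of carrying a separate protocol-type variable, which is the redundancy the patch removes.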

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/cifs_spnego.c | 3 +-
fs/cifs/cifsglob.h | 3 --
fs/cifs/connect.c | 57 ++++++++++++++++++++++++++------------------------
3 files changed, 33 insertions(+), 30 deletions(-)

--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -85,8 +85,7 @@ enum securityEnum {
};

enum protocolEnum {
- IPV4 = 0,
- IPV6,
+ TCP = 0,
SCTP
/* Netbios frames protocol not supported at this time */
};
--- a/fs/cifs/cifs_spnego.c
+++ b/fs/cifs/cifs_spnego.c
@@ -70,7 +70,8 @@ struct key_type cifs_spnego_key_type = {
strlen("ver=0xFF") */
#define MAX_MECH_STR_LEN 13 /* length of longest security mechanism name, eg
in future could have strlen(";sec=ntlmsspi") */
-#define MAX_IPV6_ADDR_LEN 42 /* eg FEDC:BA98:7654:3210:FEDC:BA98:7654:3210/60 */
+/* max possible addr len eg FEDC:BA98:7654:3210:FEDC:BA98:7654:3210/128 */
+#define MAX_IPV6_ADDR_LEN 43
/* get a key struct with a SPNEGO security blob, suitable for session setup */
struct key *
cifs_get_spnego_key(struct cifsSesInfo *sesInfo)
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -190,7 +190,7 @@ cifs_reconnect(struct TCP_Server_Info *s

while ((!kthread_should_stop()) && (server->tcpStatus != CifsGood)) {
try_to_freeze();
- if (server->protocolType == IPV6) {
+ if (server->addr.sockAddr6.sin6_family == AF_INET6) {
rc = ipv6_connect(&server->addr.sockAddr6,
&server->ssocket, server->noautotune);
} else {
@@ -1960,10 +1960,10 @@ cifs_mount(struct super_block *sb, struc
{
int rc = 0;
int xid;
- int address_type = AF_INET;
struct socket *csocket = NULL;
- struct sockaddr_in sin_server;
- struct sockaddr_in6 sin_server6;
+ struct sockaddr addr;
+ struct sockaddr_in *sin_server = (struct sockaddr_in *) &addr;
+ struct sockaddr_in6 *sin_server6 = (struct sockaddr_in6 *) &addr;
struct smb_vol volume_info;
struct cifsSesInfo *pSesInfo = NULL;
struct cifsSesInfo *existingCifsSes = NULL;
@@ -1974,6 +1974,7 @@ cifs_mount(struct super_block *sb, struc

/* cFYI(1, ("Entering cifs_mount. Xid: %d with: %s", xid, mount_data)); */

+ memset(&addr, 0, sizeof(struct sockaddr));
memset(&volume_info, 0, sizeof(struct smb_vol));
if (cifs_parse_mount_options(mount_data, devname, &volume_info)) {
rc = -EINVAL;
@@ -1996,16 +1997,16 @@ cifs_mount(struct super_block *sb, struc

if (volume_info.UNCip && volume_info.UNC) {
rc = cifs_inet_pton(AF_INET, volume_info.UNCip,
- &sin_server.sin_addr.s_addr);
+ &sin_server->sin_addr.s_addr);

if (rc <= 0) {
/* not ipv4 address, try ipv6 */
rc = cifs_inet_pton(AF_INET6, volume_info.UNCip,
- &sin_server6.sin6_addr.in6_u);
+ &sin_server6->sin6_addr.in6_u);
if (rc > 0)
- address_type = AF_INET6;
+ addr.sa_family = AF_INET6;
} else {
- address_type = AF_INET;
+ addr.sa_family = AF_INET;
}

if (rc <= 0) {
@@ -2045,39 +2046,38 @@ cifs_mount(struct super_block *sb, struc
}
}

- if (address_type == AF_INET)
- existingCifsSes = cifs_find_tcp_session(&sin_server.sin_addr,
+ if (addr.sa_family == AF_INET)
+ existingCifsSes = cifs_find_tcp_session(&sin_server->sin_addr,
NULL /* no ipv6 addr */,
volume_info.username, &srvTcp);
- else if (address_type == AF_INET6) {
+ else if (addr.sa_family == AF_INET6) {
cFYI(1, ("looking for ipv6 address"));
existingCifsSes = cifs_find_tcp_session(NULL /* no ipv4 addr */,
- &sin_server6.sin6_addr,
+ &sin_server6->sin6_addr,
volume_info.username, &srvTcp);
} else {
rc = -EINVAL;
goto out;
}

- if (!srvTcp) { /* create socket */
- if (volume_info.port)
- sin_server.sin_port = htons(volume_info.port);
- else
- sin_server.sin_port = 0;
- if (address_type == AF_INET6) {
+ if (!srvTcp) {
+ if (addr.sa_family == AF_INET6) {
cFYI(1, ("attempting ipv6 connect"));
/* BB should we allow ipv6 on port 139? */
/* other OS never observed in Wild doing 139 with v6 */
- rc = ipv6_connect(&sin_server6, &csocket,
+ sin_server6->sin6_port = htons(volume_info.port);
+ rc = ipv6_connect(sin_server6, &csocket,
volume_info.noblocksnd);
- } else
- rc = ipv4_connect(&sin_server, &csocket,
+ } else {
+ sin_server->sin_port = htons(volume_info.port);
+ rc = ipv4_connect(sin_server, &csocket,
volume_info.source_rfc1001_name,
volume_info.target_rfc1001_name,
volume_info.noblocksnd,
volume_info.noautotune);
+ }
if (rc < 0) {
- cERROR(1, ("Error connecting to IPv4 socket. "
+ cERROR(1, ("Error connecting to socket. "
"Aborting operation"));
if (csocket != NULL)
sock_release(csocket);
@@ -2092,12 +2092,15 @@ cifs_mount(struct super_block *sb, struc
} else {
srvTcp->noblocksnd = volume_info.noblocksnd;
srvTcp->noautotune = volume_info.noautotune;
- memcpy(&srvTcp->addr.sockAddr, &sin_server,
- sizeof(struct sockaddr_in));
+ if (addr.sa_family == AF_INET6)
+ memcpy(&srvTcp->addr.sockAddr6, sin_server6,
+ sizeof(struct sockaddr_in6));
+ else
+ memcpy(&srvTcp->addr.sockAddr, sin_server,
+ sizeof(struct sockaddr_in));
atomic_set(&srvTcp->inFlight, 0);
/* BB Add code for ipv6 case too */
srvTcp->ssocket = csocket;
- srvTcp->protocolType = IPV4;
srvTcp->hostname = extract_hostname(volume_info.UNC);
if (IS_ERR(srvTcp->hostname)) {
rc = PTR_ERR(srvTcp->hostname);
@@ -2149,7 +2152,7 @@ cifs_mount(struct super_block *sb, struc
else {
pSesInfo->server = srvTcp;
sprintf(pSesInfo->serverName, "%u.%u.%u.%u",
- NIPQUAD(sin_server.sin_addr.s_addr));
+ NIPQUAD(sin_server->sin_addr.s_addr));
}

if (!rc) {
@@ -2187,7 +2190,7 @@ cifs_mount(struct super_block *sb, struc
if (!rc) {
setup_cifs_sb(&volume_info, cifs_sb);
tcon =
- find_unc(sin_server.sin_addr.s_addr, volume_info.UNC,
+ find_unc(sin_server->sin_addr.s_addr, volume_info.UNC,
volume_info.username);
if (tcon) {
cFYI(1, ("Found match on UNC path"));

2008-12-03 20:18:29

by Greg KH

Subject: [patch 072/104] cifs: reinstate sharing of SMB sessions sans races

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jeff Layton <[email protected]>

commit 14fbf50d695207754daeb96270b3027a3821121f upstream

We do this by abandoning the global list of SMB sessions and instead
moving to a per-server list. This entails adding a new list head to the
TCP_Server_Info struct. The refcounting for the cifsSesInfo is moved to
a non-atomic variable. We have to protect it by a lock anyway, so there's
no benefit to making it an atomic. The list and refcount are protected
by the global cifs_tcp_ses_lock.

The patch also adds new routines to find and put SMB sessions that
properly take and put references under the lock.
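
The find/put pattern described above might look roughly like the following user-space sketch. It is a simplified model, not the kernel code: a pthread mutex stands in for the kernel rwlock, a hand-rolled singly linked list stands in for list_head, and the names session_add, session_find, session_put and USER_LEN are all hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define USER_LEN 32  /* illustrative size, not MAX_USERNAME_SIZE */

struct session {
    struct session *next;
    char user[USER_LEN];
    int ref_count;              /* plain int: only touched under list_lock */
};

struct server {
    struct session *sessions;   /* per-server session list */
};

/* One lock guards every server's list and every session's ref_count. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate a session holding one reference and link it to the server. */
static struct session *session_add(struct server *srv, const char *user)
{
    struct session *ses = calloc(1, sizeof(*ses));

    if (!ses)
        return NULL;
    strncpy(ses->user, user, USER_LEN - 1);  /* calloc keeps it terminated */
    ses->ref_count = 1;
    pthread_mutex_lock(&list_lock);
    ses->next = srv->sessions;
    srv->sessions = ses;
    pthread_mutex_unlock(&list_lock);
    return ses;
}

/* Look up by user name; take the reference before dropping the lock. */
static struct session *session_find(struct server *srv, const char *user)
{
    struct session *ses;

    pthread_mutex_lock(&list_lock);
    for (ses = srv->sessions; ses; ses = ses->next) {
        if (strncmp(ses->user, user, USER_LEN) == 0) {
            ses->ref_count++;
            break;
        }
    }
    pthread_mutex_unlock(&list_lock);
    return ses;
}

/* Drop a reference; the last put unlinks and frees. Returns 1 if freed. */
static int session_put(struct server *srv, struct session *ses)
{
    struct session **pp;

    pthread_mutex_lock(&list_lock);
    if (--ses->ref_count > 0) {
        pthread_mutex_unlock(&list_lock);
        return 0;
    }
    for (pp = &srv->sessions; *pp; pp = &(*pp)->next) {
        if (*pp == ses) {
            *pp = ses->next;
            break;
        }
    }
    pthread_mutex_unlock(&list_lock);
    free(ses);
    return 1;
}
```

Because the counter is only read or written while the lock is held, an atomic type would add nothing, which is the reasoning the commit message gives for moving to a non-atomic variable.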

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifs_debug.c | 53 ++++++------
fs/cifs/cifsfs.c | 17 +--
fs/cifs/cifsglob.h | 6 -
fs/cifs/cifsproto.h | 1
fs/cifs/cifssmb.c | 22 +---
fs/cifs/connect.c | 225 ++++++++++++++++++++++++++++-----------------------
fs/cifs/misc.c | 16 +--
7 files changed, 174 insertions(+), 166 deletions(-)

--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -107,9 +107,9 @@ void cifs_dump_mids(struct TCP_Server_In
#ifdef CONFIG_PROC_FS
static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
{
- struct list_head *tmp;
- struct list_head *tmp1;
+ struct list_head *tmp, *tmp2, *tmp3;
struct mid_q_entry *mid_entry;
+ struct TCP_Server_Info *server;
struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;
int i;
@@ -122,43 +122,45 @@ static int cifs_debug_data_proc_show(str
seq_printf(m, "Servers:");

i = 0;
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalSMBSessionList) {
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &cifs_tcp_ses_list) {
+ server = list_entry(tmp, struct TCP_Server_Info,
+ tcp_ses_list);
i++;
- ses = list_entry(tmp, struct cifsSesInfo, cifsSessionList);
- if ((ses->serverDomain == NULL) || (ses->serverOS == NULL) ||
- (ses->serverNOS == NULL)) {
- seq_printf(m, "\nentry for %s not fully "
- "displayed\n\t", ses->serverName);
- } else {
- seq_printf(m,
+ list_for_each(tmp2, &server->smb_ses_list) {
+ ses = list_entry(tmp2, struct cifsSesInfo,
+ smb_ses_list);
+ if ((ses->serverDomain == NULL) ||
+ (ses->serverOS == NULL) ||
+ (ses->serverNOS == NULL)) {
+ seq_printf(m, "\nentry for %s not fully "
+ "displayed\n\t", ses->serverName);
+ } else {
+ seq_printf(m,
"\n%d) Name: %s Domain: %s Mounts: %d OS:"
" %s \n\tNOS: %s\tCapability: 0x%x\n\tSMB"
" session status: %d\t",
i, ses->serverName, ses->serverDomain,
- atomic_read(&ses->inUse),
- ses->serverOS, ses->serverNOS,
+ ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
- }
- if (ses->server) {
+ }
seq_printf(m, "TCP status: %d\n\tLocal Users To "
- "Server: %d SecMode: 0x%x Req On Wire: %d",
- ses->server->tcpStatus,
- ses->server->srv_count,
- ses->server->secMode,
- atomic_read(&ses->server->inFlight));
+ "Server: %d SecMode: 0x%x Req On Wire: %d",
+ server->tcpStatus, server->srv_count,
+ server->secMode,
+ atomic_read(&server->inFlight));

#ifdef CONFIG_CIFS_STATS2
seq_printf(m, " In Send: %d In MaxReq Wait: %d",
- atomic_read(&ses->server->inSend),
- atomic_read(&ses->server->num_waiters));
+ atomic_read(&server->inSend),
+ atomic_read(&server->num_waiters));
#endif

seq_puts(m, "\nMIDs:\n");

spin_lock(&GlobalMid_Lock);
- list_for_each(tmp1, &ses->server->pending_mid_q) {
- mid_entry = list_entry(tmp1, struct
+ list_for_each(tmp3, &server->pending_mid_q) {
+ mid_entry = list_entry(tmp3, struct
mid_q_entry,
qhead);
seq_printf(m, "State: %d com: %d pid:"
@@ -171,9 +173,8 @@ static int cifs_debug_data_proc_show(str
}
spin_unlock(&GlobalMid_Lock);
}
-
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
seq_putc(m, '\n');

seq_puts(m, "Shares:");
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -985,24 +985,24 @@ static int cifs_oplock_thread(void *dumm
static int cifs_dnotify_thread(void *dummyarg)
{
struct list_head *tmp;
- struct cifsSesInfo *ses;
+ struct TCP_Server_Info *server;

do {
if (try_to_freeze())
continue;
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(15*HZ);
- read_lock(&GlobalSMBSeslock);
/* check if any stuck requests that need
to be woken up and wakeq so the
thread can wake up and error out */
- list_for_each(tmp, &GlobalSMBSessionList) {
- ses = list_entry(tmp, struct cifsSesInfo,
- cifsSessionList);
- if (ses->server && atomic_read(&ses->server->inFlight))
- wake_up_all(&ses->server->response_q);
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &cifs_tcp_ses_list) {
+ server = list_entry(tmp, struct TCP_Server_Info,
+ tcp_ses_list);
+ if (atomic_read(&server->inFlight))
+ wake_up_all(&server->response_q);
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
} while (!kthread_should_stop());

return 0;
@@ -1014,7 +1014,6 @@ init_cifs(void)
int rc = 0;
cifs_proc_init();
INIT_LIST_HEAD(&cifs_tcp_ses_list);
- INIT_LIST_HEAD(&GlobalSMBSessionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalTreeConnectionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalOplock_Q);
#ifdef CONFIG_CIFS_EXPERIMENTAL
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -195,14 +195,14 @@ struct cifsUidInfo {
* Session structure. One of these for each uid session with a particular host
*/
struct cifsSesInfo {
- struct list_head cifsSessionList;
+ struct list_head smb_ses_list;
struct list_head tcon_list;
struct semaphore sesSem;
#if 0
struct cifsUidInfo *uidInfo; /* pointer to user info */
#endif
struct TCP_Server_Info *server; /* pointer to server info */
- atomic_t inUse; /* # of mounts (tree connections) on this ses */
+ int ses_count; /* reference counter */
enum statusEnum status;
unsigned overrideSecFlg; /* if non-zero override global sec flags */
__u16 ipc_tid; /* special tid for connection to IPC share */
@@ -600,8 +600,6 @@ GLOBAL_EXTERN struct list_head cifs_tcp

/* protects cifs_tcp_ses_list and srv_count for each tcp session */
GLOBAL_EXTERN rwlock_t cifs_tcp_ses_lock;
-
-GLOBAL_EXTERN struct list_head GlobalSMBSessionList; /* BB to be removed by jl*/
GLOBAL_EXTERN struct list_head GlobalTreeConnectionList; /* BB to be removed */
GLOBAL_EXTERN rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */

--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -102,7 +102,6 @@ extern void acl_to_uid_mode(struct inode
const __u16 *pfid);
extern int mode_to_acl(struct inode *inode, const char *path, __u64);

-extern void cifs_put_tcp_session(struct TCP_Server_Info *server);
extern int cifs_mount(struct super_block *, struct cifs_sb_info *, char *,
const char *);
extern int cifs_umount(struct super_block *, struct cifs_sb_info *);
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -799,20 +799,16 @@ CIFSSMBLogoff(const int xid, struct cifs
int rc = 0;

cFYI(1, ("In SMBLogoff for session disconnect"));
- if (ses)
- down(&ses->sesSem);
- else
- return -EIO;
-
- atomic_dec(&ses->inUse);
- if (atomic_read(&ses->inUse) > 0) {
- up(&ses->sesSem);
- return -EBUSY;
- }

- if (ses->server == NULL)
+ /*
+ * BB: do we need to check validity of ses and server? They should
+ * always be valid since we have an active reference. If not, that
+ * should probably be a BUG()
+ */
+ if (!ses || !ses->server)
return -EIO;

+ down(&ses->sesSem);
if (ses->need_reconnect)
goto session_already_dead; /* no need to send SMBlogoff if uid
already closed due to reconnect */
@@ -833,10 +829,6 @@ CIFSSMBLogoff(const int xid, struct cifs
pSMB->AndXCommand = 0xFF;
rc = SendReceiveNoRsp(xid, ses, (struct smb_hdr *) pSMB, 0);
session_already_dead:
- if (ses->server) {
- cifs_put_tcp_session(ses->server);
- rc = 0;
- }
up(&ses->sesSem);

/* if session dead then we do not need to do ulogoff,
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -142,23 +142,18 @@ cifs_reconnect(struct TCP_Server_Info *s

/* before reconnecting the tcp session, mark the smb session (uid)
and the tid bad so they are not used until reconnected */
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalSMBSessionList) {
- ses = list_entry(tmp, struct cifsSesInfo, cifsSessionList);
- if (ses->server) {
- if (ses->server == server) {
- ses->need_reconnect = true;
- ses->ipc_tid = 0;
- }
- }
- /* else tcp and smb sessions need reconnection */
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &server->smb_ses_list) {
+ ses = list_entry(tmp, struct cifsSesInfo, smb_ses_list);
+ ses->need_reconnect = true;
+ ses->ipc_tid = 0;
}
+ read_unlock(&cifs_tcp_ses_lock);
list_for_each(tmp, &GlobalTreeConnectionList) {
tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
if ((tcon->ses) && (tcon->ses->server == server))
tcon->need_reconnect = true;
}
- read_unlock(&GlobalSMBSeslock);
/* do not want to be sending data on a socket we are freeing */
down(&server->tcpSem);
if (server->ssocket) {
@@ -702,29 +697,29 @@ multi_t2_fnd:
if (smallbuf) /* no sense logging a debug message if NULL */
cifs_small_buf_release(smallbuf);

- read_lock(&GlobalSMBSeslock);
+ /*
+ * BB: we shouldn't have to do any of this. It shouldn't be
+ * possible to exit from the thread with active SMB sessions
+ */
+ read_lock(&cifs_tcp_ses_lock);
if (list_empty(&server->pending_mid_q)) {
/* loop through server session structures attached to this and
mark them dead */
- list_for_each(tmp, &GlobalSMBSessionList) {
- ses =
- list_entry(tmp, struct cifsSesInfo,
- cifsSessionList);
- if (ses->server == server) {
- ses->status = CifsExiting;
- ses->server = NULL;
- }
+ list_for_each(tmp, &server->smb_ses_list) {
+ ses = list_entry(tmp, struct cifsSesInfo,
+ smb_ses_list);
+ ses->status = CifsExiting;
+ ses->server = NULL;
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
} else {
/* although we can not zero the server struct pointer yet,
since there are active requests which may depnd on them,
mark the corresponding SMB sessions as exiting too */
- list_for_each(tmp, &GlobalSMBSessionList) {
+ list_for_each(tmp, &server->smb_ses_list) {
ses = list_entry(tmp, struct cifsSesInfo,
- cifsSessionList);
- if (ses->server == server)
- ses->status = CifsExiting;
+ smb_ses_list);
+ ses->status = CifsExiting;
}

spin_lock(&GlobalMid_Lock);
@@ -739,7 +734,7 @@ multi_t2_fnd:
}
}
spin_unlock(&GlobalMid_Lock);
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
/* 1/8th of sec is more than enough time for them to exit */
msleep(125);
}
@@ -761,14 +756,13 @@ multi_t2_fnd:
if there are any pointing to this (e.g
if a crazy root user tried to kill cifsd
kernel thread explicitly this might happen) */
- write_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalSMBSessionList) {
- ses = list_entry(tmp, struct cifsSesInfo,
- cifsSessionList);
- if (ses->server == server)
- ses->server = NULL;
+ /* BB: This shouldn't be necessary, see above */
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &server->smb_ses_list) {
+ ses = list_entry(tmp, struct cifsSesInfo, smb_ses_list);
+ ses->server = NULL;
}
- write_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);

kfree(server->hostname);
kfree(server);
@@ -1390,7 +1384,7 @@ cifs_find_tcp_session(struct sockaddr *a
return NULL;
}

-void
+static void
cifs_put_tcp_session(struct TCP_Server_Info *server)
{
struct task_struct *task;
@@ -1413,6 +1407,50 @@ cifs_put_tcp_session(struct TCP_Server_I
force_sig(SIGKILL, task);
}

+static struct cifsSesInfo *
+cifs_find_smb_ses(struct TCP_Server_Info *server, char *username)
+{
+ struct list_head *tmp;
+ struct cifsSesInfo *ses;
+
+ write_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &server->smb_ses_list) {
+ ses = list_entry(tmp, struct cifsSesInfo, smb_ses_list);
+ if (strncmp(ses->userName, username, MAX_USERNAME_SIZE))
+ continue;
+
+ ++ses->ses_count;
+ write_unlock(&cifs_tcp_ses_lock);
+ return ses;
+ }
+ write_unlock(&cifs_tcp_ses_lock);
+ return NULL;
+}
+
+static void
+cifs_put_smb_ses(struct cifsSesInfo *ses)
+{
+ int xid;
+ struct TCP_Server_Info *server = ses->server;
+
+ write_lock(&cifs_tcp_ses_lock);
+ if (--ses->ses_count > 0) {
+ write_unlock(&cifs_tcp_ses_lock);
+ return;
+ }
+
+ list_del_init(&ses->smb_ses_list);
+ write_unlock(&cifs_tcp_ses_lock);
+
+ if (ses->status == CifsGood) {
+ xid = GetXid();
+ CIFSSMBLogoff(xid, ses);
+ _FreeXid(xid);
+ }
+ sesInfoFree(ses);
+ cifs_put_tcp_session(server);
+}
+
int
get_dfs_path(int xid, struct cifsSesInfo *pSesInfo, const char *old_path,
const struct nls_table *nls_codepage, unsigned int *pnum_referrals,
@@ -1945,7 +1983,6 @@ cifs_mount(struct super_block *sb, struc
struct sockaddr_in6 *sin_server6 = (struct sockaddr_in6 *) &addr;
struct smb_vol volume_info;
struct cifsSesInfo *pSesInfo = NULL;
- struct cifsSesInfo *existingCifsSes = NULL;
struct cifsTconInfo *tcon = NULL;
struct TCP_Server_Info *srvTcp = NULL;

@@ -2099,6 +2136,7 @@ cifs_mount(struct super_block *sb, struc
volume_info.target_rfc1001_name, 16);
srvTcp->sequence_number = 0;
INIT_LIST_HEAD(&srvTcp->tcp_ses_list);
+ INIT_LIST_HEAD(&srvTcp->smb_ses_list);
++srvTcp->srv_count;
write_lock(&cifs_tcp_ses_lock);
list_add(&srvTcp->tcp_ses_list,
@@ -2107,10 +2145,16 @@ cifs_mount(struct super_block *sb, struc
}
}

- if (existingCifsSes) {
- pSesInfo = existingCifsSes;
+ pSesInfo = cifs_find_smb_ses(srvTcp, volume_info.username);
+ if (pSesInfo) {
cFYI(1, ("Existing smb sess found (status=%d)",
pSesInfo->status));
+ /*
+ * The existing SMB session already has a reference to srvTcp,
+ * so we can put back the extra one we got before
+ */
+ cifs_put_tcp_session(srvTcp);
+
down(&pSesInfo->sesSem);
if (pSesInfo->need_reconnect) {
cFYI(1, ("Session needs reconnect"));
@@ -2121,41 +2165,44 @@ cifs_mount(struct super_block *sb, struc
} else if (!rc) {
cFYI(1, ("Existing smb sess not found"));
pSesInfo = sesInfoAlloc();
- if (pSesInfo == NULL)
+ if (pSesInfo == NULL) {
rc = -ENOMEM;
- else {
- pSesInfo->server = srvTcp;
- sprintf(pSesInfo->serverName, "%u.%u.%u.%u",
- NIPQUAD(sin_server->sin_addr.s_addr));
+ goto mount_fail_check;
}

- if (!rc) {
- /* volume_info.password freed at unmount */
- if (volume_info.password) {
- pSesInfo->password = volume_info.password;
- /* set to NULL to prevent freeing on exit */
- volume_info.password = NULL;
- }
- if (volume_info.username)
- strncpy(pSesInfo->userName,
- volume_info.username,
- MAX_USERNAME_SIZE);
- if (volume_info.domainname) {
- int len = strlen(volume_info.domainname);
- pSesInfo->domainName =
- kmalloc(len + 1, GFP_KERNEL);
- if (pSesInfo->domainName)
- strcpy(pSesInfo->domainName,
- volume_info.domainname);
- }
- pSesInfo->linux_uid = volume_info.linux_uid;
- pSesInfo->overrideSecFlg = volume_info.secFlg;
- down(&pSesInfo->sesSem);
- /* BB FIXME need to pass vol->secFlgs BB */
- rc = cifs_setup_session(xid, pSesInfo,
- cifs_sb->local_nls);
- up(&pSesInfo->sesSem);
+ /* new SMB session uses our srvTcp ref */
+ pSesInfo->server = srvTcp;
+ sprintf(pSesInfo->serverName, "%u.%u.%u.%u",
+ NIPQUAD(sin_server->sin_addr.s_addr));
+
+ write_lock(&cifs_tcp_ses_lock);
+ list_add(&pSesInfo->smb_ses_list, &srvTcp->smb_ses_list);
+ write_unlock(&cifs_tcp_ses_lock);
+
+ /* volume_info.password freed at unmount */
+ if (volume_info.password) {
+ pSesInfo->password = volume_info.password;
+ /* set to NULL to prevent freeing on exit */
+ volume_info.password = NULL;
+ }
+ if (volume_info.username)
+ strncpy(pSesInfo->userName, volume_info.username,
+ MAX_USERNAME_SIZE);
+ if (volume_info.domainname) {
+ int len = strlen(volume_info.domainname);
+ pSesInfo->domainName = kmalloc(len + 1, GFP_KERNEL);
+ if (pSesInfo->domainName)
+ strcpy(pSesInfo->domainName,
+ volume_info.domainname);
}
+ pSesInfo->linux_uid = volume_info.linux_uid;
+ pSesInfo->overrideSecFlg = volume_info.secFlg;
+ down(&pSesInfo->sesSem);
+
+ /* BB FIXME need to pass vol->secFlgs BB */
+ rc = cifs_setup_session(xid, pSesInfo,
+ cifs_sb->local_nls);
+ up(&pSesInfo->sesSem);
}

/* search for existing tcon to this server share */
@@ -2190,11 +2237,9 @@ cifs_mount(struct super_block *sb, struc
tcon, cifs_sb->local_nls);
cFYI(1, ("CIFS Tcon rc = %d", rc));
}
- if (!rc) {
- atomic_inc(&pSesInfo->inUse);
- tcon->seal = volume_info.seal;
- } else
+ if (rc)
goto mount_fail_check;
+ tcon->seal = volume_info.seal;
}

/* we can have only one retry value for a connection
@@ -2214,7 +2259,7 @@ cifs_mount(struct super_block *sb, struc
/* BB FIXME fix time_gran to be larger for LANMAN sessions */
sb->s_time_gran = 100;

-/* on error free sesinfo and tcon struct if needed */
+ /* on error free sesinfo and tcon struct if needed */
mount_fail_check:
if (rc) {
/* If find_unc succeeded then rc == 0 so we can not end */
@@ -2222,21 +2267,11 @@ mount_fail_check:
if (tcon)
tconInfoFree(tcon);

- if (existingCifsSes == NULL) {
- if (pSesInfo) {
- if ((pSesInfo->server) &&
- (pSesInfo->status == CifsGood))
- CIFSSMBLogoff(xid, pSesInfo);
- else {
- cFYI(1, ("No session or bad tcon"));
- }
- if (pSesInfo->server)
- cifs_put_tcp_session(
- pSesInfo->server);
- sesInfoFree(pSesInfo);
- /* pSesInfo = NULL; */
- }
- }
+ /* should also end up putting our tcp session ref if needed */
+ if (pSesInfo)
+ cifs_put_smb_ses(pSesInfo);
+ else
+ cifs_put_tcp_session(srvTcp);
} else {
atomic_inc(&tcon->useCount);
cifs_sb->tcon = tcon;
@@ -3532,17 +3567,7 @@ cifs_umount(struct super_block *sb, stru
}
DeleteTconOplockQEntries(cifs_sb->tcon);
tconInfoFree(cifs_sb->tcon);
- if ((ses) && (ses->server)) {
- /* save off task so we do not refer to ses later */
- cifsd_task = ses->server->tsk;
- cFYI(1, ("About to do SMBLogoff "));
- rc = CIFSSMBLogoff(xid, ses);
- if (rc == -EBUSY) {
- FreeXid(xid);
- return 0;
- }
- } else
- cFYI(1, ("No session or bad tcon"));
+ cifs_put_smb_ses(ses);
}

cifs_sb->tcon = NULL;
@@ -3550,8 +3575,6 @@ cifs_umount(struct super_block *sb, stru
cifs_sb->prepathlen = 0;
cifs_sb->prepath = NULL;
kfree(tmp);
- if (ses)
- sesInfoFree(ses);

FreeXid(xid);
return rc;
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -75,12 +75,11 @@ sesInfoAlloc(void)

ret_buf = kzalloc(sizeof(struct cifsSesInfo), GFP_KERNEL);
if (ret_buf) {
- write_lock(&GlobalSMBSeslock);
atomic_inc(&sesInfoAllocCount);
ret_buf->status = CifsNew;
- list_add(&ret_buf->cifsSessionList, &GlobalSMBSessionList);
+ ++ret_buf->ses_count;
+ INIT_LIST_HEAD(&ret_buf->smb_ses_list);
init_MUTEX(&ret_buf->sesSem);
- write_unlock(&GlobalSMBSeslock);
}
return ret_buf;
}
@@ -93,10 +92,7 @@ sesInfoFree(struct cifsSesInfo *buf_to_f
return;
}

- write_lock(&GlobalSMBSeslock);
atomic_dec(&sesInfoAllocCount);
- list_del(&buf_to_free->cifsSessionList);
- write_unlock(&GlobalSMBSeslock);
kfree(buf_to_free->serverOS);
kfree(buf_to_free->serverDomain);
kfree(buf_to_free->serverNOS);
@@ -354,9 +350,9 @@ header_assemble(struct smb_hdr *buffer,
if (current->fsuid != treeCon->ses->linux_uid) {
cFYI(1, ("Multiuser mode and UID "
"did not match tcon uid"));
- read_lock(&GlobalSMBSeslock);
- list_for_each(temp_item, &GlobalSMBSessionList) {
- ses = list_entry(temp_item, struct cifsSesInfo, cifsSessionList);
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(temp_item, &treeCon->ses->server->smb_ses_list) {
+ ses = list_entry(temp_item, struct cifsSesInfo, smb_ses_list);
if (ses->linux_uid == current->fsuid) {
if (ses->server == treeCon->ses->server) {
cFYI(1, ("found matching uid substitute right smb_uid"));
@@ -368,7 +364,7 @@ header_assemble(struct smb_hdr *buffer,
}
}
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
}
}
}

2008-12-03 20:17:56

by Greg KH

Subject: [patch 071/104] cifs: disable sharing session and tcon and add new TCP sharing code

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jeff Layton <[email protected]>

commit e7ddee9037e7dd43de1ad08b51727e552aedd836 upstream.

The code that allows these structs to be shared is extremely racy.
Disable the sharing of SMB and tcon structs for now until we can
come up with a way to do this that's race free.

We want to continue to share TCP sessions, however, since they are
required for multiuser mounts. For that, implement a new (hopefully
race-free) scheme. Add a new global list of TCP sessions, and take
care to get a reference to it whenever we're dealing with one.
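
A rough user-space model of a global connection list with reference counting is sketched below. It is an assumption-laden illustration, not the patch's implementation: it combines lookup and creation in one get-or-create helper, handles IPv4 keys only, uses a pthread mutex in place of the kernel rwlock, and the names tcp_ses_get and tcp_ses_put are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct tcp_ses {
    struct tcp_ses *next;
    uint32_t ipv4_addr;   /* lookup key; illustrative, IPv4 only */
    int ref_count;        /* guarded by tcp_lock */
};

static struct tcp_ses *tcp_list;   /* global list of connections */
static pthread_mutex_t tcp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the connection for addr with a reference held, creating it
 * if no entry exists. Lookup and insertion share one critical section,
 * so two callers cannot both create an entry for the same address. */
static struct tcp_ses *tcp_ses_get(uint32_t addr)
{
    struct tcp_ses *t;

    pthread_mutex_lock(&tcp_lock);
    for (t = tcp_list; t; t = t->next)
        if (t->ipv4_addr == addr)
            break;
    if (t) {
        t->ref_count++;
    } else if ((t = calloc(1, sizeof(*t))) != NULL) {
        t->ipv4_addr = addr;
        t->ref_count = 1;
        t->next = tcp_list;
        tcp_list = t;
    }
    pthread_mutex_unlock(&tcp_lock);
    return t;
}

/* Drop a reference; the last put unlinks and frees. Returns 1 if freed. */
static int tcp_ses_put(struct tcp_ses *t)
{
    struct tcp_ses **pp;

    pthread_mutex_lock(&tcp_lock);
    if (--t->ref_count > 0) {
        pthread_mutex_unlock(&tcp_lock);
        return 0;
    }
    for (pp = &tcp_list; *pp; pp = &(*pp)->next) {
        if (*pp == t) {
            *pp = t->next;
            break;
        }
    }
    pthread_mutex_unlock(&tcp_lock);
    free(t);
    return 1;
}
```

The point of the sketch is only that every user of a shared connection holds its own reference, so the structure cannot be torn down while another mount still depends on it.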

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifs_debug.c | 2
fs/cifs/cifsfs.c | 3
fs/cifs/cifsglob.h | 17 ++--
fs/cifs/cifsproto.h | 1
fs/cifs/cifssmb.c | 18 ++--
fs/cifs/connect.c | 206 +++++++++++++++++----------------------------------
6 files changed, 95 insertions(+), 152 deletions(-)

--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -144,7 +144,7 @@ static int cifs_debug_data_proc_show(str
seq_printf(m, "TCP status: %d\n\tLocal Users To "
"Server: %d SecMode: 0x%x Req On Wire: %d",
ses->server->tcpStatus,
- atomic_read(&ses->server->socketUseCount),
+ ses->server->srv_count,
ses->server->secMode,
atomic_read(&ses->server->inFlight));

--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1013,7 +1013,7 @@ init_cifs(void)
{
int rc = 0;
cifs_proc_init();
- INIT_LIST_HEAD(&global_cifs_sock_list);
+ INIT_LIST_HEAD(&cifs_tcp_ses_list);
INIT_LIST_HEAD(&GlobalSMBSessionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalTreeConnectionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalOplock_Q);
@@ -1043,6 +1043,7 @@ init_cifs(void)
GlobalMaxActiveXid = 0;
memset(Local_System_Name, 0, 15);
rwlock_init(&GlobalSMBSeslock);
+ rwlock_init(&cifs_tcp_ses_lock);
spin_lock_init(&GlobalMid_Lock);

if (cifs_max_pending < 2) {
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -123,6 +123,7 @@ struct cifs_cred {
struct TCP_Server_Info {
struct list_head tcp_ses_list;
struct list_head smb_ses_list;
+ int srv_count; /* reference counter */
/* 15 character server name + 0x20 16th byte indicating type = srv */
char server_RFC1001_name[SERVER_NAME_LEN_WITH_NULL];
char unicode_server_Name[SERVER_NAME_LEN_WITH_NULL * 2];
@@ -144,7 +145,6 @@ struct TCP_Server_Info {
bool svlocal:1; /* local server or remote */
bool noblocksnd; /* use blocking sendmsg */
bool noautotune; /* do not autotune send buf sizes */
- atomic_t socketUseCount; /* number of open cifs sessions on socket */
atomic_t inFlight; /* number of requests on the wire to server */
#ifdef CONFIG_CIFS_STATS2
atomic_t inSend; /* requests trying to send */
@@ -589,13 +589,18 @@ require use of the stronger protocol */
#define GLOBAL_EXTERN extern
#endif

-
-/* the list of TCP_Server_Info structures, ie each of the sockets
+/*
+ * the list of TCP_Server_Info structures, ie each of the sockets
* connecting our client to a distinct server (ip address), is
- * chained together by global_cifs_sock_list. The list of all our SMB
+ * chained together by cifs_tcp_ses_list. The list of all our SMB
* sessions (and from that the tree connections) can be found
- * by iterating over global_cifs_sock_list */
-GLOBAL_EXTERN struct list_head global_cifs_sock_list;
+ * by iterating over cifs_tcp_ses_list
+ */
+GLOBAL_EXTERN struct list_head cifs_tcp_ses_list;
+
+/* protects cifs_tcp_ses_list and srv_count for each tcp session */
+GLOBAL_EXTERN rwlock_t cifs_tcp_ses_lock;
+
GLOBAL_EXTERN struct list_head GlobalSMBSessionList; /* BB to be removed by jl*/
GLOBAL_EXTERN struct list_head GlobalTreeConnectionList; /* BB to be removed */
GLOBAL_EXTERN rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -102,6 +102,7 @@ extern void acl_to_uid_mode(struct inode
const __u16 *pfid);
extern int mode_to_acl(struct inode *inode, const char *path, __u64);

+extern void cifs_put_tcp_session(struct TCP_Server_Info *server);
extern int cifs_mount(struct super_block *, struct cifs_sb_info *, char *,
const char *);
extern int cifs_umount(struct super_block *, struct cifs_sb_info *);
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -664,8 +664,9 @@ CIFSSMBNegotiate(unsigned int xid, struc
rc = -EIO;
goto neg_err_exit;
}
-
- if (server->socketUseCount.counter > 1) {
+ read_lock(&cifs_tcp_ses_lock);
+ if (server->srv_count > 1) {
+ read_unlock(&cifs_tcp_ses_lock);
if (memcmp(server->server_GUID,
pSMBr->u.extended_response.
GUID, 16) != 0) {
@@ -674,9 +675,11 @@ CIFSSMBNegotiate(unsigned int xid, struc
pSMBr->u.extended_response.GUID,
16);
}
- } else
+ } else {
+ read_unlock(&cifs_tcp_ses_lock);
memcpy(server->server_GUID,
pSMBr->u.extended_response.GUID, 16);
+ }

if (count == 16) {
server->secType = RawNTLMSSP;
@@ -830,12 +833,9 @@ CIFSSMBLogoff(const int xid, struct cifs
pSMB->AndXCommand = 0xFF;
rc = SendReceiveNoRsp(xid, ses, (struct smb_hdr *) pSMB, 0);
session_already_dead:
- atomic_dec(&ses->server->socketUseCount);
- if (atomic_read(&ses->server->socketUseCount) == 0) {
- spin_lock(&GlobalMid_Lock);
- ses->server->tcpStatus = CifsExiting;
- spin_unlock(&GlobalMid_Lock);
- rc = -ESHUTDOWN;
+ if (ses->server) {
+ cifs_put_tcp_session(ses->server);
+ rc = 0;
}
up(&ses->sesSem);

--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -657,6 +657,11 @@ multi_t2_fnd:
}
} /* end while !EXITING */

+ /* take it off the list, if it's not already */
+ write_lock(&cifs_tcp_ses_lock);
+ list_del_init(&server->tcp_ses_list);
+ write_unlock(&cifs_tcp_ses_lock);
+
spin_lock(&GlobalMid_Lock);
server->tcpStatus = CifsExiting;
spin_unlock(&GlobalMid_Lock);
@@ -1346,92 +1351,66 @@ cifs_parse_mount_options(char *options,
return 0;
}

-static struct cifsSesInfo *
-cifs_find_tcp_session(struct in_addr *target_ip_addr,
- struct in6_addr *target_ip6_addr,
- char *userName, struct TCP_Server_Info **psrvTcp)
+static struct TCP_Server_Info *
+cifs_find_tcp_session(struct sockaddr *addr)
{
struct list_head *tmp;
- struct cifsSesInfo *ses;
-
- *psrvTcp = NULL;
-
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalSMBSessionList) {
- ses = list_entry(tmp, struct cifsSesInfo, cifsSessionList);
- if (!ses->server)
+ struct TCP_Server_Info *server;
+ struct sockaddr_in *addr4 = (struct sockaddr_in *) addr;
+ struct sockaddr_in6 *addr6 = (struct sockaddr_in6 *) addr;
+
+ write_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &cifs_tcp_ses_list) {
+ server = list_entry(tmp, struct TCP_Server_Info,
+ tcp_ses_list);
+
+ /*
+ * the demux thread can exit on its own while still in CifsNew
+ * so don't accept any sockets in that state. Since the
+ * tcpStatus never changes back to CifsNew it's safe to check
+ * for this without a lock.
+ */
+ if (server->tcpStatus == CifsNew)
continue;

- if (target_ip_addr &&
- ses->server->addr.sockAddr.sin_addr.s_addr != target_ip_addr->s_addr)
- continue;
- else if (target_ip6_addr &&
- memcmp(&ses->server->addr.sockAddr6.sin6_addr,
- target_ip6_addr, sizeof(*target_ip6_addr)))
- continue;
- /* BB lock server and tcp session; increment use count here?? */
-
- /* found a match on the TCP session */
- *psrvTcp = ses->server;
+ if (addr->sa_family == AF_INET &&
+ (addr4->sin_addr.s_addr !=
+ server->addr.sockAddr.sin_addr.s_addr))
+ continue;
+ else if (addr->sa_family == AF_INET6 &&
+ memcmp(&server->addr.sockAddr6.sin6_addr,
+ &addr6->sin6_addr, sizeof(addr6->sin6_addr)))
+ continue;

- /* BB check if reconnection needed */
- if (strncmp(ses->userName, userName, MAX_USERNAME_SIZE) == 0) {
- read_unlock(&GlobalSMBSeslock);
- /* Found exact match on both TCP and
- SMB sessions */
- return ses;
- }
- /* else tcp and smb sessions need reconnection */
+ ++server->srv_count;
+ write_unlock(&cifs_tcp_ses_lock);
+ return server;
}
- read_unlock(&GlobalSMBSeslock);
-
+ write_unlock(&cifs_tcp_ses_lock);
return NULL;
}

-static struct cifsTconInfo *
-find_unc(__be32 new_target_ip_addr, char *uncName, char *userName)
+void
+cifs_put_tcp_session(struct TCP_Server_Info *server)
{
- struct list_head *tmp;
- struct cifsTconInfo *tcon;
- __be32 old_ip;
-
- read_lock(&GlobalSMBSeslock);
-
- list_for_each(tmp, &GlobalTreeConnectionList) {
- cFYI(1, ("Next tcon"));
- tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
- if (!tcon->ses || !tcon->ses->server)
- continue;
-
- old_ip = tcon->ses->server->addr.sockAddr.sin_addr.s_addr;
- cFYI(1, ("old ip addr: %x == new ip %x ?",
- old_ip, new_target_ip_addr));
+ struct task_struct *task;

- if (old_ip != new_target_ip_addr)
- continue;
-
- /* BB lock tcon, server, tcp session and increment use count? */
- /* found a match on the TCP session */
- /* BB check if reconnection needed */
- cFYI(1, ("IP match, old UNC: %s new: %s",
- tcon->treeName, uncName));
-
- if (strncmp(tcon->treeName, uncName, MAX_TREE_SIZE))
- continue;
-
- cFYI(1, ("and old usr: %s new: %s",
- tcon->treeName, uncName));
+ write_lock(&cifs_tcp_ses_lock);
+ if (--server->srv_count > 0) {
+ write_unlock(&cifs_tcp_ses_lock);
+ return;
+ }

- if (strncmp(tcon->ses->userName, userName, MAX_USERNAME_SIZE))
- continue;
+ list_del_init(&server->tcp_ses_list);
+ write_unlock(&cifs_tcp_ses_lock);

- /* matched smb session (user name) */
- read_unlock(&GlobalSMBSeslock);
- return tcon;
- }
+ spin_lock(&GlobalMid_Lock);
+ server->tcpStatus = CifsExiting;
+ spin_unlock(&GlobalMid_Lock);

- read_unlock(&GlobalSMBSeslock);
- return NULL;
+ task = xchg(&server->tsk, NULL);
+ if (task)
+ force_sig(SIGKILL, task);
}

int
@@ -2046,21 +2025,10 @@ cifs_mount(struct super_block *sb, struc
}
}

- if (addr.sa_family == AF_INET)
- existingCifsSes = cifs_find_tcp_session(&sin_server->sin_addr,
- NULL /* no ipv6 addr */,
- volume_info.username, &srvTcp);
- else if (addr.sa_family == AF_INET6) {
- cFYI(1, ("looking for ipv6 address"));
- existingCifsSes = cifs_find_tcp_session(NULL /* no ipv4 addr */,
- &sin_server6->sin6_addr,
- volume_info.username, &srvTcp);
- } else {
- rc = -EINVAL;
- goto out;
- }
-
- if (!srvTcp) {
+ srvTcp = cifs_find_tcp_session(&addr);
+ if (srvTcp) {
+ cFYI(1, ("Existing tcp session with server found"));
+ } else { /* create socket */
if (addr.sa_family == AF_INET6) {
cFYI(1, ("attempting ipv6 connect"));
/* BB should we allow ipv6 on port 139? */
@@ -2130,6 +2098,12 @@ cifs_mount(struct super_block *sb, struc
memcpy(srvTcp->server_RFC1001_name,
volume_info.target_rfc1001_name, 16);
srvTcp->sequence_number = 0;
+ INIT_LIST_HEAD(&srvTcp->tcp_ses_list);
+ ++srvTcp->srv_count;
+ write_lock(&cifs_tcp_ses_lock);
+ list_add(&srvTcp->tcp_ses_list,
+ &cifs_tcp_ses_list);
+ write_unlock(&cifs_tcp_ses_lock);
}
}

@@ -2181,17 +2155,12 @@ cifs_mount(struct super_block *sb, struc
rc = cifs_setup_session(xid, pSesInfo,
cifs_sb->local_nls);
up(&pSesInfo->sesSem);
- if (!rc)
- atomic_inc(&srvTcp->socketUseCount);
}
}

/* search for existing tcon to this server share */
if (!rc) {
setup_cifs_sb(&volume_info, cifs_sb);
- tcon =
- find_unc(sin_server->sin_addr.s_addr, volume_info.UNC,
- volume_info.username);
if (tcon) {
cFYI(1, ("Found match on UNC path"));
if (tcon->seal != volume_info.seal)
@@ -2248,47 +2217,22 @@ cifs_mount(struct super_block *sb, struc
/* on error free sesinfo and tcon struct if needed */
mount_fail_check:
if (rc) {
- /* if session setup failed, use count is zero but
- we still need to free cifsd thread */
- if (atomic_read(&srvTcp->socketUseCount) == 0) {
- spin_lock(&GlobalMid_Lock);
- srvTcp->tcpStatus = CifsExiting;
- spin_unlock(&GlobalMid_Lock);
- if (srvTcp->tsk) {
- /* If we could verify that kthread_stop would
- always wake up processes blocked in
- tcp in recv_mesg then we could remove the
- send_sig call */
- force_sig(SIGKILL, srvTcp->tsk);
- kthread_stop(srvTcp->tsk);
- }
- }
/* If find_unc succeeded then rc == 0 so we can not end */
- if (tcon) /* up accidently freeing someone elses tcon struct */
+ /* up accidently freeing someone elses tcon struct */
+ if (tcon)
tconInfoFree(tcon);
+
if (existingCifsSes == NULL) {
if (pSesInfo) {
if ((pSesInfo->server) &&
- (pSesInfo->status == CifsGood)) {
- int temp_rc;
- temp_rc = CIFSSMBLogoff(xid, pSesInfo);
- /* if the socketUseCount is now zero */
- if ((temp_rc == -ESHUTDOWN) &&
- (pSesInfo->server) &&
- (pSesInfo->server->tsk)) {
- force_sig(SIGKILL,
- pSesInfo->server->tsk);
- kthread_stop(pSesInfo->server->tsk);
- }
- } else {
+ (pSesInfo->status == CifsGood))
+ CIFSSMBLogoff(xid, pSesInfo);
+ else {
cFYI(1, ("No session or bad tcon"));
- if ((pSesInfo->server) &&
- (pSesInfo->server->tsk)) {
- force_sig(SIGKILL,
- pSesInfo->server->tsk);
- kthread_stop(pSesInfo->server->tsk);
- }
}
+ if (pSesInfo->server)
+ cifs_put_tcp_session(
+ pSesInfo->server);
sesInfoFree(pSesInfo);
/* pSesInfo = NULL; */
}
@@ -3596,15 +3540,7 @@ cifs_umount(struct super_block *sb, stru
if (rc == -EBUSY) {
FreeXid(xid);
return 0;
- } else if (rc == -ESHUTDOWN) {
- cFYI(1, ("Waking up socket by sending signal"));
- if (cifsd_task) {
- force_sig(SIGKILL, cifsd_task);
- kthread_stop(cifsd_task);
- }
- rc = 0;
- } /* else - we have an smb session
- left on this socket do not kill cifsd */
+ }
} else
cFYI(1, ("No session or bad tcon"));
}

2008-12-03 20:18:49

by Greg KH

Subject: [patch 073/104] cifs: minor cleanup to cifs_mount

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit d82c2df54e2f7e447476350848d8eccc8d2fe46a upstream

Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/connect.c | 74 ++++++++++++++++++++++++------------------------------
1 file changed, 34 insertions(+), 40 deletions(-)

--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -1357,7 +1357,6 @@ cifs_find_tcp_session(struct sockaddr *a
list_for_each(tmp, &cifs_tcp_ses_list) {
server = list_entry(tmp, struct TCP_Server_Info,
tcp_ses_list);
-
/*
* the demux thread can exit on its own while still in CifsNew
* so don't accept any sockets in that state. Since the
@@ -1378,6 +1377,7 @@ cifs_find_tcp_session(struct sockaddr *a

++server->srv_count;
write_unlock(&cifs_tcp_ses_lock);
+ cFYI(1, ("Existing tcp session with server found"));
return server;
}
write_unlock(&cifs_tcp_ses_lock);
@@ -2063,9 +2063,7 @@ cifs_mount(struct super_block *sb, struc
}

srvTcp = cifs_find_tcp_session(&addr);
- if (srvTcp) {
- cFYI(1, ("Existing tcp session with server found"));
- } else { /* create socket */
+ if (!srvTcp) { /* create socket */
if (addr.sa_family == AF_INET6) {
cFYI(1, ("attempting ipv6 connect"));
/* BB should we allow ipv6 on port 139? */
@@ -2272,44 +2270,40 @@ mount_fail_check:
cifs_put_smb_ses(pSesInfo);
else
cifs_put_tcp_session(srvTcp);
- } else {
- atomic_inc(&tcon->useCount);
- cifs_sb->tcon = tcon;
- tcon->ses = pSesInfo;
-
- /* do not care if following two calls succeed - informational */
- if (!tcon->ipc) {
- CIFSSMBQFSDeviceInfo(xid, tcon);
- CIFSSMBQFSAttributeInfo(xid, tcon);
- }
+ goto out;
+ }
+ atomic_inc(&tcon->useCount);
+ cifs_sb->tcon = tcon;
+ tcon->ses = pSesInfo;
+
+ /* do not care if following two calls succeed - informational */
+ if (!tcon->ipc) {
+ CIFSSMBQFSDeviceInfo(xid, tcon);
+ CIFSSMBQFSAttributeInfo(xid, tcon);
+ }

- /* tell server which Unix caps we support */
- if (tcon->ses->capabilities & CAP_UNIX)
- /* reset of caps checks mount to see if unix extensions
- disabled for just this mount */
- reset_cifs_unix_caps(xid, tcon, sb, &volume_info);
- else
- tcon->unix_ext = 0; /* server does not support them */
+ /* tell server which Unix caps we support */
+ if (tcon->ses->capabilities & CAP_UNIX)
+ /* reset of caps checks mount to see if unix extensions
+ disabled for just this mount */
+ reset_cifs_unix_caps(xid, tcon, sb, &volume_info);
+ else
+ tcon->unix_ext = 0; /* server does not support them */

- /* convert forward to back slashes in prepath here if needed */
- if ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_POSIX_PATHS) == 0)
- convert_delimiter(cifs_sb->prepath,
- CIFS_DIR_SEP(cifs_sb));
-
- if ((tcon->unix_ext == 0) && (cifs_sb->rsize > (1024 * 127))) {
- cifs_sb->rsize = 1024 * 127;
- cFYI(DBG2,
- ("no very large read support, rsize now 127K"));
- }
- if (!(tcon->ses->capabilities & CAP_LARGE_WRITE_X))
- cifs_sb->wsize = min(cifs_sb->wsize,
- (tcon->ses->server->maxBuf -
- MAX_CIFS_HDR_SIZE));
- if (!(tcon->ses->capabilities & CAP_LARGE_READ_X))
- cifs_sb->rsize = min(cifs_sb->rsize,
- (tcon->ses->server->maxBuf -
- MAX_CIFS_HDR_SIZE));
- }
+ /* convert forward to back slashes in prepath here if needed */
+ if ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_POSIX_PATHS) == 0)
+ convert_delimiter(cifs_sb->prepath, CIFS_DIR_SEP(cifs_sb));
+
+ if ((tcon->unix_ext == 0) && (cifs_sb->rsize > (1024 * 127))) {
+ cifs_sb->rsize = 1024 * 127;
+ cFYI(DBG2, ("no very large read support, rsize now 127K"));
+ }
+ if (!(tcon->ses->capabilities & CAP_LARGE_WRITE_X))
+ cifs_sb->wsize = min(cifs_sb->wsize,
+ (tcon->ses->server->maxBuf - MAX_CIFS_HDR_SIZE));
+ if (!(tcon->ses->capabilities & CAP_LARGE_READ_X))
+ cifs_sb->rsize = min(cifs_sb->rsize,
+ (tcon->ses->server->maxBuf - MAX_CIFS_HDR_SIZE));

/* volume_info.password is freed above when existing session found
(in which case it is not needed anymore) but when new sesion is created

2008-12-03 20:19:37

by Greg KH

Subject: [patch 074/104] cifs: reinstate sharing of tree connections

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Jeff Layton <[email protected]>

commit f1987b44f642e96176adc88b7ce23a1d74806f89 upstream

Use an approach similar to the SMB session sharing. Add a list of tcons
attached to each SMB session. Make the refcount non-atomic. Protect
all of the above with the cifs_tcp_ses_lock. Add functions to
properly find and put references to the tcons.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifs_debug.c | 236 +++++++++++++++++++++++++++------------------------
fs/cifs/cifsfs.c | 8 -
fs/cifs/cifsglob.h | 13 +-
fs/cifs/cifssmb.c | 43 ++-------
fs/cifs/connect.c | 94 +++++++++++++-------
fs/cifs/misc.c | 74 +++++++--------
6 files changed, 249 insertions(+), 219 deletions(-)
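
The reference hand-off between a tcon and its SMB session, as described in the changelog above, can be modelled in userspace roughly like this. It is a sketch under assumptions: `struct session`, `struct tcon`, `get_tcon()`, `put_tcon()`, `put_session()` and `TREE_NAME_MAX` are hypothetical stand-ins, a pthread mutex replaces `cifs_tcp_ses_lock`, and no tree disconnect is sent on the final put.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define TREE_NAME_MAX 64	/* hypothetical stand-in for MAX_TREE_SIZE */

struct session;

struct tcon {
	struct tcon *next;	/* stand-in for the tcon_list linkage */
	struct session *ses;	/* session this tcon holds a reference on */
	int tc_count;		/* reference count, protected by ses_lock */
	char tree_name[TREE_NAME_MAX + 1];
};

struct session {
	struct tcon *tcon_list;
	int ses_count;		/* reference count, protected by ses_lock */
};

static pthread_mutex_t ses_lock = PTHREAD_MUTEX_INITIALIZER;

/* Drop one session reference and free the session on the last put. */
void put_session(struct session *ses)
{
	int last;

	pthread_mutex_lock(&ses_lock);
	assert(ses->ses_count > 0);
	last = (--ses->ses_count == 0);
	pthread_mutex_unlock(&ses_lock);
	if (last)
		free(ses);
}

/*
 * The caller holds one session reference on entry. On success that
 * reference is either handed to a newly created tcon or dropped because
 * the existing tcon already holds one. On failure the caller keeps it.
 */
struct tcon *get_tcon(struct session *ses, const char *unc)
{
	struct tcon *t;

	pthread_mutex_lock(&ses_lock);
	for (t = ses->tcon_list; t != NULL; t = t->next)
		if (strncmp(t->tree_name, unc, TREE_NAME_MAX) == 0)
			break;
	if (t != NULL) {
		++t->tc_count;
		pthread_mutex_unlock(&ses_lock);
		put_session(ses);	/* existing tcon already has a reference */
		return t;
	}
	t = calloc(1, sizeof(*t));
	if (t != NULL) {
		strncpy(t->tree_name, unc, TREE_NAME_MAX);
		t->ses = ses;	/* caller's reference moves to the tcon */
		t->tc_count = 1;
		t->next = ses->tcon_list;
		ses->tcon_list = t;
	}
	pthread_mutex_unlock(&ses_lock);
	return t;
}

/* Drop one tcon reference; the last put also releases the session. */
void put_tcon(struct tcon *t)
{
	struct session *ses = t->ses;
	struct tcon **pp;

	pthread_mutex_lock(&ses_lock);
	assert(t->tc_count > 0);
	if (--t->tc_count > 0) {
		pthread_mutex_unlock(&ses_lock);
		return;
	}
	for (pp = &ses->tcon_list; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			break;
		}
	}
	pthread_mutex_unlock(&ses_lock);
	free(t);
	put_session(ses);
}
```

In this model each tcon pins exactly one session reference regardless of how many mounts share the tcon, which is why a second mount of the same share gives back the session reference it arrived with.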

--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -107,12 +107,13 @@ void cifs_dump_mids(struct TCP_Server_In
#ifdef CONFIG_PROC_FS
static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
{
- struct list_head *tmp, *tmp2, *tmp3;
+ struct list_head *tmp1, *tmp2, *tmp3;
struct mid_q_entry *mid_entry;
struct TCP_Server_Info *server;
struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;
- int i;
+ int i, j;
+ __u32 dev_type;

seq_puts(m,
"Display Internal CIFS Data Structures for Debugging\n"
@@ -123,8 +124,8 @@ static int cifs_debug_data_proc_show(str

i = 0;
read_lock(&cifs_tcp_ses_lock);
- list_for_each(tmp, &cifs_tcp_ses_list) {
- server = list_entry(tmp, struct TCP_Server_Info,
+ list_for_each(tmp1, &cifs_tcp_ses_list) {
+ server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
i++;
list_for_each(tmp2, &server->smb_ses_list) {
@@ -133,12 +134,12 @@ static int cifs_debug_data_proc_show(str
if ((ses->serverDomain == NULL) ||
(ses->serverOS == NULL) ||
(ses->serverNOS == NULL)) {
- seq_printf(m, "\nentry for %s not fully "
- "displayed\n\t", ses->serverName);
+ seq_printf(m, "\n%d) entry for %s not fully "
+ "displayed\n\t", i, ses->serverName);
} else {
seq_printf(m,
- "\n%d) Name: %s Domain: %s Mounts: %d OS:"
- " %s \n\tNOS: %s\tCapability: 0x%x\n\tSMB"
+ "\n%d) Name: %s Domain: %s Uses: %d OS:"
+ " %s\n\tNOS: %s\tCapability: 0x%x\n\tSMB"
" session status: %d\t",
i, ses->serverName, ses->serverDomain,
ses->ses_count, ses->serverOS, ses->serverNOS,
@@ -156,14 +157,44 @@ static int cifs_debug_data_proc_show(str
atomic_read(&server->num_waiters));
#endif

- seq_puts(m, "\nMIDs:\n");
+ seq_puts(m, "\n\tShares:");
+ j = 0;
+ list_for_each(tmp3, &ses->tcon_list) {
+ tcon = list_entry(tmp3, struct cifsTconInfo,
+ tcon_list);
+ ++j;
+ dev_type = le32_to_cpu(tcon->fsDevInfo.DeviceType);
+ seq_printf(m, "\n\t%d) %s Mounts: %d ", j,
+ tcon->treeName, tcon->tc_count);
+ if (tcon->nativeFileSystem) {
+ seq_printf(m, "Type: %s ",
+ tcon->nativeFileSystem);
+ }
+ seq_printf(m, "DevInfo: 0x%x Attributes: 0x%x"
+ "\nPathComponentMax: %d Status: 0x%d",
+ le32_to_cpu(tcon->fsDevInfo.DeviceCharacteristics),
+ le32_to_cpu(tcon->fsAttrInfo.Attributes),
+ le32_to_cpu(tcon->fsAttrInfo.MaxPathNameComponentLength),
+ tcon->tidStatus);
+ if (dev_type == FILE_DEVICE_DISK)
+ seq_puts(m, " type: DISK ");
+ else if (dev_type == FILE_DEVICE_CD_ROM)
+ seq_puts(m, " type: CDROM ");
+ else
+ seq_printf(m, " type: %d ", dev_type);
+
+ if (tcon->need_reconnect)
+ seq_puts(m, "\tDISCONNECTED ");
+ seq_putc(m, '\n');
+ }
+
+ seq_puts(m, "\n\tMIDs:\n");

spin_lock(&GlobalMid_Lock);
list_for_each(tmp3, &server->pending_mid_q) {
- mid_entry = list_entry(tmp3, struct
- mid_q_entry,
+ mid_entry = list_entry(tmp3, struct mid_q_entry,
qhead);
- seq_printf(m, "State: %d com: %d pid:"
+ seq_printf(m, "\tState: %d com: %d pid:"
" %d tsk: %p mid %d\n",
mid_entry->midState,
(int)mid_entry->command,
@@ -177,41 +208,6 @@ static int cifs_debug_data_proc_show(str
read_unlock(&cifs_tcp_ses_lock);
seq_putc(m, '\n');

- seq_puts(m, "Shares:");
-
- i = 0;
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalTreeConnectionList) {
- __u32 dev_type;
- i++;
- tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
- dev_type = le32_to_cpu(tcon->fsDevInfo.DeviceType);
- seq_printf(m, "\n%d) %s Uses: %d ", i,
- tcon->treeName, atomic_read(&tcon->useCount));
- if (tcon->nativeFileSystem) {
- seq_printf(m, "Type: %s ",
- tcon->nativeFileSystem);
- }
- seq_printf(m, "DevInfo: 0x%x Attributes: 0x%x"
- "\nPathComponentMax: %d Status: %d",
- le32_to_cpu(tcon->fsDevInfo.DeviceCharacteristics),
- le32_to_cpu(tcon->fsAttrInfo.Attributes),
- le32_to_cpu(tcon->fsAttrInfo.MaxPathNameComponentLength),
- tcon->tidStatus);
- if (dev_type == FILE_DEVICE_DISK)
- seq_puts(m, " type: DISK ");
- else if (dev_type == FILE_DEVICE_CD_ROM)
- seq_puts(m, " type: CDROM ");
- else
- seq_printf(m, " type: %d ", dev_type);
-
- if (tcon->need_reconnect)
- seq_puts(m, "\tDISCONNECTED ");
- }
- read_unlock(&GlobalSMBSeslock);
-
- seq_putc(m, '\n');
-
/* BB add code to dump additional info such as TCP session info now */
return 0;
}
@@ -235,7 +231,9 @@ static ssize_t cifs_stats_proc_write(str
{
char c;
int rc;
- struct list_head *tmp;
+ struct list_head *tmp1, *tmp2, *tmp3;
+ struct TCP_Server_Info *server;
+ struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;

rc = get_user(c, buffer);
@@ -243,33 +241,42 @@ static ssize_t cifs_stats_proc_write(str
return rc;

if (c == '1' || c == 'y' || c == 'Y' || c == '0') {
- read_lock(&GlobalSMBSeslock);
#ifdef CONFIG_CIFS_STATS2
atomic_set(&totBufAllocCount, 0);
atomic_set(&totSmBufAllocCount, 0);
#endif /* CONFIG_CIFS_STATS2 */
- list_for_each(tmp, &GlobalTreeConnectionList) {
- tcon = list_entry(tmp, struct cifsTconInfo,
- cifsConnectionList);
- atomic_set(&tcon->num_smbs_sent, 0);
- atomic_set(&tcon->num_writes, 0);
- atomic_set(&tcon->num_reads, 0);
- atomic_set(&tcon->num_oplock_brks, 0);
- atomic_set(&tcon->num_opens, 0);
- atomic_set(&tcon->num_closes, 0);
- atomic_set(&tcon->num_deletes, 0);
- atomic_set(&tcon->num_mkdirs, 0);
- atomic_set(&tcon->num_rmdirs, 0);
- atomic_set(&tcon->num_renames, 0);
- atomic_set(&tcon->num_t2renames, 0);
- atomic_set(&tcon->num_ffirst, 0);
- atomic_set(&tcon->num_fnext, 0);
- atomic_set(&tcon->num_fclose, 0);
- atomic_set(&tcon->num_hardlinks, 0);
- atomic_set(&tcon->num_symlinks, 0);
- atomic_set(&tcon->num_locks, 0);
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp1, &cifs_tcp_ses_list) {
+ server = list_entry(tmp1, struct TCP_Server_Info,
+ tcp_ses_list);
+ list_for_each(tmp2, &server->smb_ses_list) {
+ ses = list_entry(tmp2, struct cifsSesInfo,
+ smb_ses_list);
+ list_for_each(tmp3, &ses->tcon_list) {
+ tcon = list_entry(tmp3,
+ struct cifsTconInfo,
+ tcon_list);
+ atomic_set(&tcon->num_smbs_sent, 0);
+ atomic_set(&tcon->num_writes, 0);
+ atomic_set(&tcon->num_reads, 0);
+ atomic_set(&tcon->num_oplock_brks, 0);
+ atomic_set(&tcon->num_opens, 0);
+ atomic_set(&tcon->num_closes, 0);
+ atomic_set(&tcon->num_deletes, 0);
+ atomic_set(&tcon->num_mkdirs, 0);
+ atomic_set(&tcon->num_rmdirs, 0);
+ atomic_set(&tcon->num_renames, 0);
+ atomic_set(&tcon->num_t2renames, 0);
+ atomic_set(&tcon->num_ffirst, 0);
+ atomic_set(&tcon->num_fnext, 0);
+ atomic_set(&tcon->num_fclose, 0);
+ atomic_set(&tcon->num_hardlinks, 0);
+ atomic_set(&tcon->num_symlinks, 0);
+ atomic_set(&tcon->num_locks, 0);
+ }
+ }
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
}

return count;
@@ -278,7 +285,9 @@ static ssize_t cifs_stats_proc_write(str
static int cifs_stats_proc_show(struct seq_file *m, void *v)
{
int i;
- struct list_head *tmp;
+ struct list_head *tmp1, *tmp2, *tmp3;
+ struct TCP_Server_Info *server;
+ struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;

seq_printf(m,
@@ -307,44 +316,55 @@ static int cifs_stats_proc_show(struct s
GlobalCurrentXid, GlobalMaxActiveXid);

i = 0;
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalTreeConnectionList) {
- i++;
- tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
- seq_printf(m, "\n%d) %s", i, tcon->treeName);
- if (tcon->need_reconnect)
- seq_puts(m, "\tDISCONNECTED ");
- seq_printf(m, "\nSMBs: %d Oplock Breaks: %d",
- atomic_read(&tcon->num_smbs_sent),
- atomic_read(&tcon->num_oplock_brks));
- seq_printf(m, "\nReads: %d Bytes: %lld",
- atomic_read(&tcon->num_reads),
- (long long)(tcon->bytes_read));
- seq_printf(m, "\nWrites: %d Bytes: %lld",
- atomic_read(&tcon->num_writes),
- (long long)(tcon->bytes_written));
- seq_printf(m,
- "\nLocks: %d HardLinks: %d Symlinks: %d",
- atomic_read(&tcon->num_locks),
- atomic_read(&tcon->num_hardlinks),
- atomic_read(&tcon->num_symlinks));
-
- seq_printf(m, "\nOpens: %d Closes: %d Deletes: %d",
- atomic_read(&tcon->num_opens),
- atomic_read(&tcon->num_closes),
- atomic_read(&tcon->num_deletes));
- seq_printf(m, "\nMkdirs: %d Rmdirs: %d",
- atomic_read(&tcon->num_mkdirs),
- atomic_read(&tcon->num_rmdirs));
- seq_printf(m, "\nRenames: %d T2 Renames %d",
- atomic_read(&tcon->num_renames),
- atomic_read(&tcon->num_t2renames));
- seq_printf(m, "\nFindFirst: %d FNext %d FClose %d",
- atomic_read(&tcon->num_ffirst),
- atomic_read(&tcon->num_fnext),
- atomic_read(&tcon->num_fclose));
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp1, &cifs_tcp_ses_list) {
+ server = list_entry(tmp1, struct TCP_Server_Info,
+ tcp_ses_list);
+ list_for_each(tmp2, &server->smb_ses_list) {
+ ses = list_entry(tmp2, struct cifsSesInfo,
+ smb_ses_list);
+ list_for_each(tmp3, &ses->tcon_list) {
+ tcon = list_entry(tmp3,
+ struct cifsTconInfo,
+ tcon_list);
+ i++;
+ seq_printf(m, "\n%d) %s", i, tcon->treeName);
+ if (tcon->need_reconnect)
+ seq_puts(m, "\tDISCONNECTED ");
+ seq_printf(m, "\nSMBs: %d Oplock Breaks: %d",
+ atomic_read(&tcon->num_smbs_sent),
+ atomic_read(&tcon->num_oplock_brks));
+ seq_printf(m, "\nReads: %d Bytes: %lld",
+ atomic_read(&tcon->num_reads),
+ (long long)(tcon->bytes_read));
+ seq_printf(m, "\nWrites: %d Bytes: %lld",
+ atomic_read(&tcon->num_writes),
+ (long long)(tcon->bytes_written));
+ seq_printf(m, "\nLocks: %d HardLinks: %d "
+ "Symlinks: %d",
+ atomic_read(&tcon->num_locks),
+ atomic_read(&tcon->num_hardlinks),
+ atomic_read(&tcon->num_symlinks));
+ seq_printf(m, "\nOpens: %d Closes: %d "
+ "Deletes: %d",
+ atomic_read(&tcon->num_opens),
+ atomic_read(&tcon->num_closes),
+ atomic_read(&tcon->num_deletes));
+ seq_printf(m, "\nMkdirs: %d Rmdirs: %d",
+ atomic_read(&tcon->num_mkdirs),
+ atomic_read(&tcon->num_rmdirs));
+ seq_printf(m, "\nRenames: %d T2 Renames %d",
+ atomic_read(&tcon->num_renames),
+ atomic_read(&tcon->num_t2renames));
+ seq_printf(m, "\nFindFirst: %d FNext %d "
+ "FClose %d",
+ atomic_read(&tcon->num_ffirst),
+ atomic_read(&tcon->num_fnext),
+ atomic_read(&tcon->num_fclose));
+ }
+ }
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);

seq_putc(m, '\n');
return 0;
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -510,10 +510,11 @@ static void cifs_umount_begin(struct sup
tcon = cifs_sb->tcon;
if (tcon == NULL)
return;
- down(&tcon->tconSem);
- if (atomic_read(&tcon->useCount) == 1)
+
+ read_lock(&cifs_tcp_ses_lock);
+ if (tcon->tc_count == 1)
tcon->tidStatus = CifsExiting;
- up(&tcon->tconSem);
+ read_unlock(&cifs_tcp_ses_lock);

/* cancel_brl_requests(tcon); */ /* BB mark all brl mids as exiting */
/* cancel_notify_requests(tcon); */
@@ -1014,7 +1015,6 @@ init_cifs(void)
int rc = 0;
cifs_proc_init();
INIT_LIST_HEAD(&cifs_tcp_ses_list);
- INIT_LIST_HEAD(&GlobalTreeConnectionList); /* BB to be removed by jl */
INIT_LIST_HEAD(&GlobalOplock_Q);
#ifdef CONFIG_CIFS_EXPERIMENTAL
INIT_LIST_HEAD(&GlobalDnotifyReqList);
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -233,16 +233,15 @@ struct cifsSesInfo {
* session
*/
struct cifsTconInfo {
- struct list_head cifsConnectionList;
+ struct list_head tcon_list;
+ int tc_count;
struct list_head openFileList;
- struct semaphore tconSem;
struct cifsSesInfo *ses; /* pointer to session associated with */
char treeName[MAX_TREE_SIZE + 1]; /* UNC name of resource in ASCII */
char *nativeFileSystem;
__u16 tid; /* The 2 byte tree id */
__u16 Flags; /* optional support bits */
enum statusEnum tidStatus;
- atomic_t useCount; /* how many explicit/implicit mounts to share */
#ifdef CONFIG_CIFS_STATS
atomic_t num_smbs_sent;
atomic_t num_writes;
@@ -598,9 +597,13 @@ require use of the stronger protocol */
*/
GLOBAL_EXTERN struct list_head cifs_tcp_ses_list;

-/* protects cifs_tcp_ses_list and srv_count for each tcp session */
+/*
+ * This lock protects the cifs_tcp_ses_list, the list of smb sessions per
+ * tcp session, and the list of tcon's per smb session. It also protects
+ * the reference counters for the server, smb session, and tcon. Finally,
+ * changes to the tcon->tidStatus should be done while holding this lock.
+ */
GLOBAL_EXTERN rwlock_t cifs_tcp_ses_lock;
-GLOBAL_EXTERN struct list_head GlobalTreeConnectionList; /* BB to be removed */
GLOBAL_EXTERN rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */

GLOBAL_EXTERN struct list_head GlobalOplock_Q;
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -742,50 +742,31 @@ CIFSSMBTDis(const int xid, struct cifsTc
int rc = 0;

cFYI(1, ("In tree disconnect"));
- /*
- * If last user of the connection and
- * connection alive - disconnect it
- * If this is the last connection on the server session disconnect it
- * (and inside session disconnect we should check if tcp socket needs
- * to be freed and kernel thread woken up).
- */
- if (tcon)
- down(&tcon->tconSem);
- else
- return -EIO;

- atomic_dec(&tcon->useCount);
- if (atomic_read(&tcon->useCount) > 0) {
- up(&tcon->tconSem);
- return -EBUSY;
- }
+ /* BB: do we need to check this? These should never be NULL. */
+ if ((tcon->ses == NULL) || (tcon->ses->server == NULL))
+ return -EIO;

- /* No need to return error on this operation if tid invalidated and
- closed on server already e.g. due to tcp session crashing */
- if (tcon->need_reconnect) {
- up(&tcon->tconSem);
+ /*
+ * No need to return error on this operation if tid invalidated and
+ * closed on server already e.g. due to tcp session crashing. Also,
+ * the tcon is no longer on the list, so no need to take lock before
+ * checking this.
+ */
+ if (tcon->need_reconnect)
return 0;
- }

- if ((tcon->ses == NULL) || (tcon->ses->server == NULL)) {
- up(&tcon->tconSem);
- return -EIO;
- }
rc = small_smb_init(SMB_COM_TREE_DISCONNECT, 0, tcon,
(void **)&smb_buffer);
- if (rc) {
- up(&tcon->tconSem);
+ if (rc)
return rc;
- }

rc = SendReceiveNoRsp(xid, tcon->ses, smb_buffer, 0);
if (rc)
cFYI(1, ("Tree disconnect failed %d", rc));

- up(&tcon->tconSem);
-
/* No need to return error on this operation if tid invalidated and
- closed on server already e.g. due to tcp session crashing */
+ closed on server already e.g. due to tcp session crashing */
if (rc == -EAGAIN)
rc = 0;

--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -122,7 +122,7 @@ static int
cifs_reconnect(struct TCP_Server_Info *server)
{
int rc = 0;
- struct list_head *tmp;
+ struct list_head *tmp, *tmp2;
struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;
struct mid_q_entry *mid_entry;
@@ -147,13 +147,12 @@ cifs_reconnect(struct TCP_Server_Info *s
ses = list_entry(tmp, struct cifsSesInfo, smb_ses_list);
ses->need_reconnect = true;
ses->ipc_tid = 0;
- }
- read_unlock(&cifs_tcp_ses_lock);
- list_for_each(tmp, &GlobalTreeConnectionList) {
- tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
- if ((tcon->ses) && (tcon->ses->server == server))
+ list_for_each(tmp2, &ses->tcon_list) {
+ tcon = list_entry(tmp2, struct cifsTconInfo, tcon_list);
tcon->need_reconnect = true;
+ }
}
+ read_unlock(&cifs_tcp_ses_lock);
/* do not want to be sending data on a socket we are freeing */
down(&server->tcpSem);
if (server->ssocket) {
@@ -1451,6 +1450,52 @@ cifs_put_smb_ses(struct cifsSesInfo *ses
cifs_put_tcp_session(server);
}

+static struct cifsTconInfo *
+cifs_find_tcon(struct cifsSesInfo *ses, const char *unc)
+{
+ struct list_head *tmp;
+ struct cifsTconInfo *tcon;
+
+ write_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &ses->tcon_list) {
+ tcon = list_entry(tmp, struct cifsTconInfo, tcon_list);
+ if (tcon->tidStatus == CifsExiting)
+ continue;
+ if (strncmp(tcon->treeName, unc, MAX_TREE_SIZE))
+ continue;
+
+ ++tcon->tc_count;
+ write_unlock(&cifs_tcp_ses_lock);
+ return tcon;
+ }
+ write_unlock(&cifs_tcp_ses_lock);
+ return NULL;
+}
+
+static void
+cifs_put_tcon(struct cifsTconInfo *tcon)
+{
+ int xid;
+ struct cifsSesInfo *ses = tcon->ses;
+
+ write_lock(&cifs_tcp_ses_lock);
+ if (--tcon->tc_count > 0) {
+ write_unlock(&cifs_tcp_ses_lock);
+ return;
+ }
+
+ list_del_init(&tcon->tcon_list);
+ write_unlock(&cifs_tcp_ses_lock);
+
+ xid = GetXid();
+ CIFSSMBTDis(xid, tcon);
+ _FreeXid(xid);
+
+ DeleteTconOplockQEntries(tcon);
+ tconInfoFree(tcon);
+ cifs_put_smb_ses(ses);
+}
+
int
get_dfs_path(int xid, struct cifsSesInfo *pSesInfo, const char *old_path,
const struct nls_table *nls_codepage, unsigned int *pnum_referrals,
@@ -2206,11 +2251,11 @@ cifs_mount(struct super_block *sb, struc
/* search for existing tcon to this server share */
if (!rc) {
setup_cifs_sb(&volume_info, cifs_sb);
+ tcon = cifs_find_tcon(pSesInfo, volume_info.UNC);
if (tcon) {
cFYI(1, ("Found match on UNC path"));
- if (tcon->seal != volume_info.seal)
- cERROR(1, ("transport encryption setting "
- "conflicts with existing tid"));
+ /* existing tcon already has a reference */
+ cifs_put_smb_ses(pSesInfo);
} else {
tcon = tconInfoAlloc();
if (tcon == NULL) {
@@ -2238,6 +2283,10 @@ cifs_mount(struct super_block *sb, struc
if (rc)
goto mount_fail_check;
tcon->seal = volume_info.seal;
+ tcon->ses = pSesInfo;
+ write_lock(&cifs_tcp_ses_lock);
+ list_add(&tcon->tcon_list, &pSesInfo->tcon_list);
+ write_unlock(&cifs_tcp_ses_lock);
}

/* we can have only one retry value for a connection
@@ -2263,18 +2312,14 @@ mount_fail_check:
/* If find_unc succeeded then rc == 0 so we can not end */
/* up accidently freeing someone elses tcon struct */
if (tcon)
- tconInfoFree(tcon);
-
- /* should also end up putting our tcp session ref if needed */
- if (pSesInfo)
+ cifs_put_tcon(tcon);
+ else if (pSesInfo)
cifs_put_smb_ses(pSesInfo);
else
cifs_put_tcp_session(srvTcp);
goto out;
}
- atomic_inc(&tcon->useCount);
cifs_sb->tcon = tcon;
- tcon->ses = pSesInfo;

/* do not care if following two calls succeed - informational */
if (!tcon->ipc) {
@@ -3545,24 +3590,10 @@ int
cifs_umount(struct super_block *sb, struct cifs_sb_info *cifs_sb)
{
int rc = 0;
- int xid;
- struct cifsSesInfo *ses = NULL;
- struct task_struct *cifsd_task;
char *tmp;

- xid = GetXid();
-
- if (cifs_sb->tcon) {
- ses = cifs_sb->tcon->ses; /* save ptr to ses before delete tcon!*/
- rc = CIFSSMBTDis(xid, cifs_sb->tcon);
- if (rc == -EBUSY) {
- FreeXid(xid);
- return 0;
- }
- DeleteTconOplockQEntries(cifs_sb->tcon);
- tconInfoFree(cifs_sb->tcon);
- cifs_put_smb_ses(ses);
- }
+ if (cifs_sb->tcon)
+ cifs_put_tcon(cifs_sb->tcon);

cifs_sb->tcon = NULL;
tmp = cifs_sb->prepath;
@@ -3570,7 +3601,6 @@ cifs_umount(struct super_block *sb, stru
cifs_sb->prepath = NULL;
kfree(tmp);

- FreeXid(xid);
return rc;
}

--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -79,6 +79,7 @@ sesInfoAlloc(void)
ret_buf->status = CifsNew;
++ret_buf->ses_count;
INIT_LIST_HEAD(&ret_buf->smb_ses_list);
+ INIT_LIST_HEAD(&ret_buf->tcon_list);
init_MUTEX(&ret_buf->sesSem);
}
return ret_buf;
@@ -107,17 +108,14 @@ tconInfoAlloc(void)
struct cifsTconInfo *ret_buf;
ret_buf = kzalloc(sizeof(struct cifsTconInfo), GFP_KERNEL);
if (ret_buf) {
- write_lock(&GlobalSMBSeslock);
atomic_inc(&tconInfoAllocCount);
- list_add(&ret_buf->cifsConnectionList,
- &GlobalTreeConnectionList);
ret_buf->tidStatus = CifsNew;
+ ++ret_buf->tc_count;
INIT_LIST_HEAD(&ret_buf->openFileList);
- init_MUTEX(&ret_buf->tconSem);
+ INIT_LIST_HEAD(&ret_buf->tcon_list);
#ifdef CONFIG_CIFS_STATS
spin_lock_init(&ret_buf->stat_lock);
#endif
- write_unlock(&GlobalSMBSeslock);
}
return ret_buf;
}
@@ -129,10 +127,7 @@ tconInfoFree(struct cifsTconInfo *buf_to
cFYI(1, ("Null buffer passed to tconInfoFree"));
return;
}
- write_lock(&GlobalSMBSeslock);
atomic_dec(&tconInfoAllocCount);
- list_del(&buf_to_free->cifsConnectionList);
- write_unlock(&GlobalSMBSeslock);
kfree(buf_to_free->nativeFileSystem);
kfree(buf_to_free);
}
@@ -497,9 +492,10 @@ bool
is_valid_oplock_break(struct smb_hdr *buf, struct TCP_Server_Info *srv)
{
struct smb_com_lock_req *pSMB = (struct smb_com_lock_req *)buf;
- struct list_head *tmp;
- struct list_head *tmp1;
+ struct list_head *tmp, *tmp1, *tmp2;
+ struct cifsSesInfo *ses;
struct cifsTconInfo *tcon;
+ struct cifsInodeInfo *pCifsInode;
struct cifsFileInfo *netfile;

cFYI(1, ("Checking for oplock break or dnotify response"));
@@ -554,42 +550,42 @@ is_valid_oplock_break(struct smb_hdr *bu
return false;

/* look up tcon based on tid & uid */
- read_lock(&GlobalSMBSeslock);
- list_for_each(tmp, &GlobalTreeConnectionList) {
- tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
- if ((tcon->tid == buf->Tid) && (srv == tcon->ses->server)) {
+ read_lock(&cifs_tcp_ses_lock);
+ list_for_each(tmp, &srv->smb_ses_list) {
+ ses = list_entry(tmp, struct cifsSesInfo, smb_ses_list);
+ list_for_each(tmp1, &ses->tcon_list) {
+ tcon = list_entry(tmp1, struct cifsTconInfo, tcon_list);
+ if (tcon->tid != buf->Tid)
+ continue;
+
cifs_stats_inc(&tcon->num_oplock_brks);
- list_for_each(tmp1, &tcon->openFileList) {
- netfile = list_entry(tmp1, struct cifsFileInfo,
+ list_for_each(tmp2, &tcon->openFileList) {
+ netfile = list_entry(tmp2, struct cifsFileInfo,
tlist);
- if (pSMB->Fid == netfile->netfid) {
- struct cifsInodeInfo *pCifsInode;
- read_unlock(&GlobalSMBSeslock);
- cFYI(1,
- ("file id match, oplock break"));
- pCifsInode =
- CIFS_I(netfile->pInode);
- pCifsInode->clientCanCacheAll = false;
- if (pSMB->OplockLevel == 0)
- pCifsInode->clientCanCacheRead
- = false;
- pCifsInode->oplockPending = true;
- AllocOplockQEntry(netfile->pInode,
- netfile->netfid,
- tcon);
- cFYI(1,
- ("about to wake up oplock thread"));
- if (oplockThread)
- wake_up_process(oplockThread);
- return true;
- }
+ if (pSMB->Fid != netfile->netfid)
+ continue;
+
+ read_unlock(&cifs_tcp_ses_lock);
+ cFYI(1, ("file id match, oplock break"));
+ pCifsInode = CIFS_I(netfile->pInode);
+ pCifsInode->clientCanCacheAll = false;
+ if (pSMB->OplockLevel == 0)
+ pCifsInode->clientCanCacheRead = false;
+ pCifsInode->oplockPending = true;
+ AllocOplockQEntry(netfile->pInode,
+ netfile->netfid, tcon);
+ cFYI(1, ("about to wake up oplock thread"));
+ if (oplockThread)
+ wake_up_process(oplockThread);
+
+ return true;
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
cFYI(1, ("No matching file for oplock break"));
return true;
}
}
- read_unlock(&GlobalSMBSeslock);
+ read_unlock(&cifs_tcp_ses_lock);
cFYI(1, ("Can not process oplock break for non-existent connection"));
return true;
}

2008-12-03 20:19:11

by Greg KH

Subject: [patch 075/104] cifs: Fix build break

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit c2b3382cd4d6c6adef1347e81f20e16c93a39feb upstream

Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/cifs/cifs_debug.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -249,9 +249,9 @@ static ssize_t cifs_stats_proc_write(str
list_for_each(tmp1, &cifs_tcp_ses_list) {
server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
- list_for_each(tmp2, &server->smb_session_list) {
+ list_for_each(tmp2, &server->smb_ses_list) {
ses = list_entry(tmp2, struct cifsSesInfo,
- smb_session_list);
+ smb_ses_list);
list_for_each(tmp3, &ses->tcon_list) {
tcon = list_entry(tmp3,
struct cifsTconInfo,

2008-12-03 20:21:05

by Greg KH

Subject: [patch 076/104] cifs: Fix check for tcon seal setting and fix oops on failed mount from earlier patch

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit ab3f992983062440b4f37c666dac66d987902d91 upstream

set tcon->ses earlier

If the initial tree connect fails, we'll end up calling cifs_put_smb_ses
with a NULL pointer. Fix it by setting the tcon->ses earlier.
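
As a rough userspace illustration of the ordering problem described above (not the CIFS code itself), the sketch below uses made-up toy_* types and functions. It assumes only what the description states: the failure path drops the session reference through tcon->ses, so that pointer has to be set before the tree connect can fail.

```c
#include <stddef.h>

/* Hypothetical stand-ins for cifsSesInfo / cifsTconInfo. */
struct toy_ses  { int refs; };
struct toy_tcon { struct toy_ses *ses; };

/* Failure-path cleanup: drops the session reference reached via the tcon.
 * Returns -1 when tcon->ses was never set (the case that oopsed in the
 * kernel), 0 otherwise. */
static int toy_put_tcon(struct toy_tcon *tcon)
{
    if (tcon->ses == NULL)
        return -1;
    tcon->ses->refs--;
    return 0;
}

/* set_ses_early selects the patched ordering (assign before tree connect). */
static int toy_mount(struct toy_ses *ses, int tree_connect_fails,
                     int set_ses_early)
{
    struct toy_tcon tcon = { NULL };

    if (set_ses_early)
        tcon.ses = ses;               /* patched placement */
    if (tree_connect_fails)
        return toy_put_tcon(&tcon);   /* stands in for mount_fail_check */
    if (!set_ses_early)
        tcon.ses = ses;               /* original placement */
    return 0;
}
```

With the original ordering a failed tree connect reaches the cleanup with a NULL session pointer; with the patched ordering the reference is dropped normally.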

Acked-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/connect.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2256,16 +2256,18 @@ cifs_mount(struct super_block *sb, struc
cFYI(1, ("Found match on UNC path"));
/* existing tcon already has a reference */
cifs_put_smb_ses(pSesInfo);
+ if (tcon->seal != volume_info.seal)
+ cERROR(1, ("transport encryption setting "
+ "conflicts with existing tid"));
} else {
tcon = tconInfoAlloc();
if (tcon == NULL) {
rc = -ENOMEM;
goto mount_fail_check;
}
+ tcon->ses = pSesInfo;

/* check for null share name ie connect to dfs root */
-
- /* BB check if works for exactly length 3 strings */
if ((strchr(volume_info.UNC + 3, '\\') == NULL)
&& (strchr(volume_info.UNC + 3, '/') == NULL)) {
/* rc = connect_to_dfs_path(...) */
@@ -2283,7 +2285,6 @@ cifs_mount(struct super_block *sb, struc
if (rc)
goto mount_fail_check;
tcon->seal = volume_info.seal;
- tcon->ses = pSesInfo;
write_lock(&cifs_tcp_ses_lock);
list_add(&tcon->tcon_list, &pSesInfo->tcon_list);
write_unlock(&cifs_tcp_ses_lock);

2008-12-03 20:21:32

by Greg KH

Subject: [patch 077/104] cifs: prevent cifs_writepages() from skipping unwritten pages

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Dave Kleikamp <[email protected]>

commit b066a48c9532243894f93a06ca5a0ee2cc21a8dc upstream

prevent cifs_writepages() from skipping unwritten pages

Fixes a data corruption under heavy stress in which pages could be left
dirty after all open instances of an inode have been closed.

In order to write contiguous pages whenever possible, cifs_writepages()
asks pagevec_lookup_tag() for more pages than it may write at one time.
Normally, it then resets index just past the last page written before calling
pagevec_lookup_tag() again.

If cifs_writepages() can't write the first page returned, it wasn't resetting
index, and the next call to pagevec_lookup_tag() resulted in skipping all of
the pages it previously returned, even though cifs_writepages() did nothing
with them. This can result in data loss when the file descriptor is about
to be closed.

This patch ensures that index gets set back to the next returned page so
that none get skipped.
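
The effect of the missing index reset can be reproduced with a small userspace model. This is only a sketch under simplifying assumptions: lookup_dirty() is a hypothetical stand-in for pagevec_lookup_tag() that, like the real helper, advances the index past the last page it returned; every page of a batch is written unless the first page of the batch is busy; and page 0 is busy exactly once.

```c
#include <stddef.h>

#define NPAGES 10   /* dirty pages with indices 0..NPAGES-1 */
#define BATCH  4    /* pages handed back per lookup */

/* Toy lookup: collects up to BATCH dirty page indices starting at *index
 * and advances *index past the last page examined. */
static size_t lookup_dirty(const int *dirty, size_t *index, size_t *out)
{
    size_t n = 0;
    size_t i = *index;

    for (; i < NPAGES && n < BATCH; i++)
        if (dirty[i])
            out[n++] = i;
    *index = i;
    return n;
}

/* Runs one writeback pass and returns how many pages are still dirty.
 * reset_index_on_skip selects the patched behaviour. */
static int writeback(int reset_index_on_skip)
{
    int dirty[NPAGES];
    size_t batch[BATCH];
    size_t index = 0, n, i;
    int busy_once = 1;   /* page 0 cannot be written the first time */
    int left = 0;

    for (i = 0; i < NPAGES; i++)
        dirty[i] = 1;

    while ((n = lookup_dirty(dirty, &index, batch)) > 0) {
        size_t written = 0;

        if (batch[0] == 0 && busy_once) {
            busy_once = 0;            /* first page unusable: write nothing */
        } else {
            for (i = 0; i < n; i++) {
                dirty[batch[i]] = 0;
                written++;
            }
        }
        /* The fix: step back to just after the first returned page. */
        if (written == 0 && reset_index_on_skip)
            index = batch[0] + 1;
    }

    for (i = 0; i < NPAGES; i++)
        left += dirty[i];
    return left;
}
```

In this model the unpatched pass leaves the whole first batch (pages 0 to 3) dirty, while the patched pass leaves only the busy page 0, which a later pass would still have to pick up.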

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Jeff Layton <[email protected]>
Cc: Shirish S Pargaonkar <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/file.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1396,7 +1396,10 @@ retry:
if ((wbc->nr_to_write -= n_iov) <= 0)
done = 1;
index = next;
- }
+ } else
+ /* Need to re-find the pages we skipped */
+ index = pvec.pages[0]->index + 1;
+
pagevec_release(&pvec);
}
if (!scanned && !done) {

2008-12-03 20:21:49

by Greg KH

Subject: [patch 078/104] cifs: fix check for dead tcon in smb_init

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Steve French <[email protected]>

commit bfb59820ee46616a7bdb4af6b8f7e109646de6ec upstream

This was recently changed to check for need_reconnect, but should
actually be a check for a tidStatus of CifsExiting.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Cc: Suresh Jayaraman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>


---
fs/cifs/cifssmb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -295,7 +295,7 @@ smb_init(int smb_command, int wct, struc
check for tcp and smb session status done differently
for those three - in the calling routine */
if (tcon) {
- if (tcon->need_reconnect) {
+ if (tcon->tidStatus == CifsExiting) {
/* only tree disconnect, open, and write,
(and ulogoff which does not have tcon)
are allowed as we start force umount */

2008-12-03 20:22:35

by Greg KH

Subject: [patch 080/104] ext4: fix #11321: create /proc/ext4/*/stats more carefully


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Alexey Dobriyan <[email protected]>

(cherry picked from commit 899fc1a4cf404747de2666534d508804597ee22f)

ext4 creates a per-superblock directory in /proc/ext4/ . The name used as
basis is taken from bdevname, which, surprise, can contain a slash.

However, proc while allowing to use proc_create("a/b", parent) form of
PDE creation, assumes that parent/a was already created.

bdevname in question is 'cciss/c0d0p9', directory is not created and all
this stuff goes directly into /proc (which is real bug).

Warning comes when _second_ partition is mounted.

http://bugzilla.kernel.org/show_bug.cgi?id=11321
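
The name mangling the patch applies can be tried in userspace. The helper name sanitize_devname() below is made up for this sketch; the loop body is the same strchr() walk that appears in the diff.

```c
#include <string.h>

/* Replace every '/' in a device name with '!' so the result can be used
 * as a single /proc directory component. Modifies the buffer in place. */
static void sanitize_devname(char *name)
{
    char *p = name;

    /* strchr() resumes from the character just rewritten, which is now
     * '!', so each slash is visited exactly once. */
    while ((p = strchr(p, '/')))
        *p = '!';
}
```

Applied to the name from the bug report, "cciss/c0d0p9" becomes "cciss!c0d0p9".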

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/mballoc.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2785,14 +2785,20 @@ static int ext4_mb_init_per_dev_proc(str
mode_t mode = S_IFREG | S_IRUGO | S_IWUSR;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct proc_dir_entry *proc;
- char devname[64];
+ char devname[BDEVNAME_SIZE], *p;

if (proc_root_ext4 == NULL) {
sbi->s_mb_proc = NULL;
return -EINVAL;
}
bdevname(sb->s_bdev, devname);
+ p = devname;
+ while ((p = strchr(p, '/')))
+ *p = '!';
+
sbi->s_mb_proc = proc_mkdir(devname, proc_root_ext4);
+ if (!sbi->s_mb_proc)
+ goto err_create_dir;

MB_PROC_HANDLER(EXT4_MB_STATS_NAME, stats);
MB_PROC_HANDLER(EXT4_MB_MAX_TO_SCAN_NAME, max_to_scan);
@@ -2804,7 +2810,6 @@ static int ext4_mb_init_per_dev_proc(str
return 0;

err_out:
- printk(KERN_ERR "EXT4-fs: Unable to create %s\n", devname);
remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);
@@ -2813,6 +2818,8 @@ err_out:
remove_proc_entry(EXT4_MB_STATS_NAME, sbi->s_mb_proc);
remove_proc_entry(devname, proc_root_ext4);
sbi->s_mb_proc = NULL;
+err_create_dir:
+ printk(KERN_ERR "EXT4-fs: Unable to create %s\n", devname);

return -ENOMEM;
}
@@ -2820,12 +2827,15 @@ err_out:
static int ext4_mb_destroy_per_dev_proc(struct super_block *sb)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- char devname[64];
+ char devname[BDEVNAME_SIZE], *p;

if (sbi->s_mb_proc == NULL)
return -EINVAL;

bdevname(sb->s_bdev, devname);
+ p = devname;
+ while ((p = strchr(p, '/')))
+ *p = '!';
remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);

2008-12-03 20:22:19

by Greg KH

Subject: [patch 079/104] ext4: Update flex_bg free blocks and free inodes counters when resizing.


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Frederic Bohe <[email protected]>

(cherry picked from commit c62a11fd9555007b1caab83b5bcbb443a43e32bb)

This fixes a bug which prevented the newly created inodes after a
resize from being used on filesystems with flex_bg.
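
To see why the s_flex_groups array has to be sized for groups that may be added later, the two sizing formulas from the fs/ext4/super.c hunk of this patch can be compared in isolation. The function names and the sample numbers are made up for illustration, and the sketch assumes that a group's flex group index is the group number shifted right by s_log_groups_per_flex.

```c
/* Old sizing: ceil(groups / groups_per_flex), existing groups only. */
static unsigned long flex_count_old(unsigned long groups, unsigned int log_gpf)
{
    unsigned long gpf = 1UL << log_gpf;

    return (groups + gpf - 1) / gpf;
}

/* New sizing: also covers groups that the reserved GDT blocks could
 * describe after an online resize. */
static unsigned long flex_count_new(unsigned long groups, unsigned int log_gpf,
                                    unsigned long reserved_gdt_blocks,
                                    unsigned int desc_per_block_bits)
{
    unsigned long gpf = 1UL << log_gpf;

    return ((groups + gpf - 1) +
            ((reserved_gdt_blocks + 1) << desc_per_block_bits)) / gpf;
}

/* Assumed mapping from a block group to its flex group. */
static unsigned long flex_group_of(unsigned long group, unsigned int log_gpf)
{
    return group >> log_gpf;
}
```

With 100 groups and 16 groups per flex group the old formula allocates 7 entries, so a group added at index 112 would map to entry 7, one past the end. With, say, 3 reserved GDT blocks and 128 descriptors per block the new formula allocates 39 entries.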

Signed-off-by: Frederic Bohe <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/resize.c | 9 +++++++++
fs/ext4/super.c | 7 +++++--
2 files changed, 14 insertions(+), 2 deletions(-)

--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -929,6 +929,15 @@ int ext4_group_add(struct super_block *s
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT4_INODES_PER_GROUP(sb));

+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+ ext4_group_t flex_group;
+ flex_group = ext4_flex_group(sbi, input->group);
+ sbi->s_flex_groups[flex_group].free_blocks +=
+ input->free_blocks_count;
+ sbi->s_flex_groups[flex_group].free_inodes +=
+ EXT4_INODES_PER_GROUP(sb);
+ }
+
ext4_journal_dirty_metadata(handle, sbi->s_sbh);
sb->s_dirt = 1;

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1504,8 +1504,11 @@ static int ext4_fill_flex_info(struct su
sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
groups_per_flex = 1 << sbi->s_log_groups_per_flex;

- flex_group_count = (sbi->s_groups_count + groups_per_flex - 1) /
- groups_per_flex;
+ /* We allocate both existing and potentially added groups */
+ flex_group_count = ((sbi->s_groups_count + groups_per_flex - 1) +
+ ((sbi->s_es->s_reserved_gdt_blocks +1 ) <<
+ EXT4_DESC_PER_BLOCK_BITS(sb))) /
+ groups_per_flex;
sbi->s_flex_groups = kzalloc(flex_group_count *
sizeof(struct flex_groups), GFP_KERNEL);
if (sbi->s_flex_groups == NULL) {

2008-12-03 20:22:54

by Greg KH

Subject: [patch 081/104] jbd2: fix /proc setup for devices that contain / in their names

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

trimmed down version of commit 05496769e5da83ce22ed97345afd9c7b71d6bd24 upstream.

Some devices such as "cciss/c0d0p9" will cause jbd2 setup and teardown
failures when /proc filenames are created with embedded slashes. This
is a slimmed down version of commit 05496769, with the stack reduction
aspects of the patch omitted to meet the -stable criteria.
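
A userspace sketch of how the stored journal name is built, assuming BDEVNAME_SIZE is 32 as in kernels of this era. The helper journal_devname() is hypothetical, uses snprintf() where the patch uses sprintf(), and treats an inode number of 0 as "external journal device" purely as a convention of this example.

```c
#include <stdio.h>
#include <string.h>

#define BDEVNAME_SIZE 32   /* assumed value, see lead-in */

/* Builds the name used for the /proc entry: slashes become '!', and for
 * a journal stored in an inode the inode number is appended as ":<ino>".
 * outlen must be at least 1. */
static void journal_devname(char *out, size_t outlen,
                            const char *bdev, unsigned long ino)
{
    char *p;
    size_t len;

    snprintf(out, outlen, "%s", bdev);
    for (p = out; (p = strchr(p, '/')); )
        *p = '!';
    if (ino) {
        len = strlen(out);
        snprintf(out + len, outlen - len, ":%lu", ino);
    }
}
```

A buffer of BDEVNAME_SIZE+24 bytes, as added to struct journal_s by the patch, leaves room for the colon and a decimal inode number after the device name.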

Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/jbd2/journal.c | 22 ++++++++++++++--------
include/linux/jbd2.h | 3 ++-
2 files changed, 16 insertions(+), 9 deletions(-)

--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -901,10 +901,7 @@ static struct proc_dir_entry *proc_jbd2_

static void jbd2_stats_proc_init(journal_t *journal)
{
- char name[BDEVNAME_SIZE];
-
- bdevname(journal->j_dev, name);
- journal->j_proc_entry = proc_mkdir(name, proc_jbd2_stats);
+ journal->j_proc_entry = proc_mkdir(journal->j_devname, proc_jbd2_stats);
if (journal->j_proc_entry) {
proc_create_data("history", S_IRUGO, journal->j_proc_entry,
&jbd2_seq_history_fops, journal);
@@ -915,12 +912,9 @@ static void jbd2_stats_proc_init(journal

static void jbd2_stats_proc_exit(journal_t *journal)
{
- char name[BDEVNAME_SIZE];
-
- bdevname(journal->j_dev, name);
remove_proc_entry("info", journal->j_proc_entry);
remove_proc_entry("history", journal->j_proc_entry);
- remove_proc_entry(name, proc_jbd2_stats);
+ remove_proc_entry(journal->j_devname, proc_jbd2_stats);
}

static void journal_init_stats(journal_t *journal)
@@ -1018,6 +1012,7 @@ journal_t * jbd2_journal_init_dev(struct
{
journal_t *journal = journal_init_common();
struct buffer_head *bh;
+ char *p;
int n;

if (!journal)
@@ -1039,6 +1034,10 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_fs_dev = fs_dev;
journal->j_blk_offset = start;
journal->j_maxlen = len;
+ bdevname(journal->j_dev, journal->j_devname);
+ p = journal->j_devname;
+ while ((p = strchr(p, '/')))
+ *p = '!';
jbd2_stats_proc_init(journal);

bh = __getblk(journal->j_dev, start, journal->j_blocksize);
@@ -1061,6 +1060,7 @@ journal_t * jbd2_journal_init_inode (str
{
struct buffer_head *bh;
journal_t *journal = journal_init_common();
+ char *p;
int err;
int n;
unsigned long long blocknr;
@@ -1070,6 +1070,12 @@ journal_t * jbd2_journal_init_inode (str

journal->j_dev = journal->j_fs_dev = inode->i_sb->s_bdev;
journal->j_inode = inode;
+ bdevname(journal->j_dev, journal->j_devname);
+ p = journal->j_devname;
+ while ((p = strchr(p, '/')))
+ *p = '!';
+ p = journal->j_devname + strlen(journal->j_devname);
+ sprintf(p, ":%lu", journal->j_inode->i_ino);
jbd_debug(1,
"journal %p: inode %s/%ld, size %Ld, bits %d, blksize %ld\n",
journal, inode->i_sb->s_id, inode->i_ino,
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -850,7 +850,8 @@ struct journal_s
*/
struct block_device *j_dev;
int j_blocksize;
- unsigned long long j_blk_offset;
+ unsigned long long j_blk_offset;
+ char j_devname[BDEVNAME_SIZE+24];

/*
* Device which holds the client fs. For internal journal this will be

2008-12-03 20:23:22

by Greg KH

Subject: [patch 082/104] ext4: add missing unlock in ext4_check_descriptors() on error path


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Li Zefan <[email protected]>

(cherry picked from commit 7ee1ec4ca30c6df8e989615cdaacb75f2af4fa6b)

If the group descriptors are corrupted, we need to unlock the block
group lock before returning from the function; else we will oops when
freeing a spinlock which is still being held.
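
The shape of the bug, an early return taken while a lock is held, can be modelled without any kernel code. Everything below is a toy: struct toy_lock merely counts acquisitions, and check_descriptors() is a simplified stand-in for the loop in ext4_check_descriptors().

```c
/* Toy lock that only records whether it is currently held. */
struct toy_lock { int held; };

static void lock_acquire(struct toy_lock *l) { l->held++; }
static void lock_release(struct toy_lock *l) { l->held--; }

/* Walks the groups under the lock. Returns 0 on a bad checksum when the
 * filesystem is writable, 1 otherwise. unlock_on_error selects the
 * patched behaviour. */
static int check_descriptors(const int *csum_ok, int ngroups, int rdonly,
                             int unlock_on_error, struct toy_lock *l)
{
    int i;

    for (i = 0; i < ngroups; i++) {
        lock_acquire(l);
        if (!csum_ok[i] && !rdonly) {
            if (unlock_on_error)
                lock_release(l);   /* the added spin_unlock() */
            return 0;
        }
        lock_release(l);
    }
    return 1;
}
```

Without the extra release the function returns with the lock still held, which is the state the description says leads to an oops later.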

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/super.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1626,8 +1626,10 @@ static int ext4_check_descriptors(struct
"Checksum for group %lu failed (%u!=%u)\n",
i, le16_to_cpu(ext4_group_desc_csum(sbi, i,
gdp)), le16_to_cpu(gdp->bg_checksum));
- if (!(sb->s_flags & MS_RDONLY))
+ if (!(sb->s_flags & MS_RDONLY)) {
+ spin_unlock(sb_bgl_lock(sbi, i));
return 0;
+ }
}
spin_unlock(sb_bgl_lock(sbi, i));
if (!flexbg_flag)

2008-12-03 20:23:41

by Greg KH

Subject: [patch 083/104] ext4: elevate write count for migrate ioctl

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Aneesh Kumar K.V <[email protected]>

(cherry picked from commit 2a43a878001cc5cb7c3c7be2e8dad0a1aeb939b0)

The migrate ioctl writes to the filesystem, so we need to elevate the
write count.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/ext4.h | 3 +--
fs/ext4/ioctl.c | 21 ++++++++++++++++++++-
fs/ext4/migrate.c | 10 +---------
3 files changed, 22 insertions(+), 12 deletions(-)

--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1083,8 +1083,7 @@ extern long ext4_ioctl(struct file *, un
extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long);

/* migrate.c */
-extern int ext4_ext_migrate(struct inode *, struct file *, unsigned int,
- unsigned long);
+extern int ext4_ext_migrate(struct inode *);
/* namei.c */
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -267,7 +267,26 @@ setversion_out:
}

case EXT4_IOC_MIGRATE:
- return ext4_ext_migrate(inode, filp, cmd, arg);
+ {
+ int err;
+ if (!is_owner_or_cap(inode))
+ return -EACCES;
+
+ err = mnt_want_write(filp->f_path.mnt);
+ if (err)
+ return err;
+ /*
+ * inode_mutex prevent write and truncate on the file.
+ * Read still goes through. We take i_data_sem in
+ * ext4_ext_swap_inode_data before we switch the
+ * inode format to prevent read.
+ */
+ mutex_lock(&(inode->i_mutex));
+ err = ext4_ext_migrate(inode);
+ mutex_unlock(&(inode->i_mutex));
+ mnt_drop_write(filp->f_path.mnt);
+ return err;
+ }

default:
return -ENOTTY;
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -447,8 +447,7 @@ static int free_ext_block(handle_t *hand

}

-int ext4_ext_migrate(struct inode *inode, struct file *filp,
- unsigned int cmd, unsigned long arg)
+int ext4_ext_migrate(struct inode *inode)
{
handle_t *handle;
int retval = 0, i;
@@ -516,12 +515,6 @@ int ext4_ext_migrate(struct inode *inode
* when we add extents we extent the journal
*/
/*
- * inode_mutex prevent write and truncate on the file. Read still goes
- * through. We take i_data_sem in ext4_ext_swap_inode_data before we
- * switch the inode format to prevent read.
- */
- mutex_lock(&(inode->i_mutex));
- /*
* Even though we take i_mutex we can still cause block allocation
* via mmap write to holes. If we have allocated new blocks we fail
* migrate. New block allocation will clear EXT4_EXT_MIGRATE flag.
@@ -623,7 +616,6 @@ err_out:
tmp_inode->i_nlink = 0;

ext4_journal_stop(handle);
- mutex_unlock(&(inode->i_mutex));

if (tmp_inode)
iput(tmp_inode);

2008-12-03 20:24:23

by Greg KH

Subject: [patch 084/104] ext4: Renumber EXT4_IOC_MIGRATE

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 8eea80d52b9d87cfd771055534bd2c24f73704d7)

Pick an ioctl number for EXT4_IOC_MIGRATE that won't conflict with
other ext4 ioctl's. Since there haven't been any major userspace
users of this ioctl, we can afford to change this now, to avoid
potential problems later.

Also, reorder the ioctl numbers in ext4.h to avoid this sort of
mistake in the future.
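
The overlap can be checked from userspace on a Linux system with the kernel ioctl headers installed. The macro names OLD_GROUP_EXTEND, OLD_MIGRATE and NEW_MIGRATE are local to this sketch and simply restate the definitions visible in the diff; same_slot() is a hypothetical helper.

```c
#include <linux/ioctl.h>

#define OLD_GROUP_EXTEND _IOW('f', 7, unsigned long)
#define OLD_MIGRATE      _IO('f', 7)
#define NEW_MIGRATE      _IO('f', 9)

/* Two commands occupy the same slot when their type character and
 * sequence number match, whatever their direction and size bits say. */
static int same_slot(unsigned int a, unsigned int b)
{
    return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}
```

The old EXT4_IOC_MIGRATE and EXT4_IOC_GROUP_EXTEND definitions share the pair ('f', 7) even though their full encoded values differ in the direction and size fields; after the renumbering to 9 the pair is no longer shared.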

Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/ext4.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -291,8 +291,6 @@ struct ext4_new_group_data {
#define EXT4_IOC_SETFLAGS FS_IOC_SETFLAGS
#define EXT4_IOC_GETVERSION _IOR('f', 3, long)
#define EXT4_IOC_SETVERSION _IOW('f', 4, long)
-#define EXT4_IOC_GROUP_EXTEND _IOW('f', 7, unsigned long)
-#define EXT4_IOC_GROUP_ADD _IOW('f', 8,struct ext4_new_group_input)
#define EXT4_IOC_GETVERSION_OLD FS_IOC_GETVERSION
#define EXT4_IOC_SETVERSION_OLD FS_IOC_SETVERSION
#ifdef CONFIG_JBD2_DEBUG
@@ -300,7 +298,10 @@ struct ext4_new_group_data {
#endif
#define EXT4_IOC_GETRSVSZ _IOR('f', 5, long)
#define EXT4_IOC_SETRSVSZ _IOW('f', 6, long)
-#define EXT4_IOC_MIGRATE _IO('f', 7)
+#define EXT4_IOC_GROUP_EXTEND _IOW('f', 7, unsigned long)
+#define EXT4_IOC_GROUP_ADD _IOW('f', 8, struct ext4_new_group_input)
+#define EXT4_IOC_MIGRATE _IO('f', 9)
+ /* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */

/*
* ioctl commands in 32 bit emulation

2008-12-03 20:24:41

by Greg KH

Subject: [patch 085/104] ext4/jbd2: Avoid WARN() messages when failing to write to the superblock

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 914258bf2cb22bf4336a1b1d90c551b4b11ca5aa)

This fixes some very common warnings reported by kerneloops.org

Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/super.c | 23 ++++++++++++++++++++++-
fs/jbd2/journal.c | 27 +++++++++++++++++++++++++--
2 files changed, 47 insertions(+), 3 deletions(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2804,13 +2804,34 @@ static void ext4_commit_super(struct sup

if (!sbh)
return;
+ if (buffer_write_io_error(sbh)) {
+ /*
+ * Oh, dear. A previous attempt to write the
+ * superblock failed. This could happen because the
+ * USB device was yanked out. Or it could happen to
+ * be a transient write error and maybe the block will
+ * be remapped. Nothing we can do but to retry the
+ * write and hope for the best.
+ */
+ printk(KERN_ERR "ext4: previous I/O error to "
+ "superblock detected for %s.\n", sb->s_id);
+ clear_buffer_write_io_error(sbh);
+ set_buffer_uptodate(sbh);
+ }
es->s_wtime = cpu_to_le32(get_seconds());
ext4_free_blocks_count_set(es, ext4_count_free_blocks(sb));
es->s_free_inodes_count = cpu_to_le32(ext4_count_free_inodes(sb));
BUFFER_TRACE(sbh, "marking dirty");
mark_buffer_dirty(sbh);
- if (sync)
+ if (sync) {
sync_dirty_buffer(sbh);
+ if (buffer_write_io_error(sbh)) {
+ printk(KERN_ERR "ext4: I/O error while writing "
+ "superblock for %s.\n", sb->s_id);
+ clear_buffer_write_io_error(sbh);
+ set_buffer_uptodate(sbh);
+ }
+ }
}


--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1259,6 +1259,22 @@ void jbd2_journal_update_superblock(jour
goto out;
}

+ if (buffer_write_io_error(bh)) {
+ /*
+ * Oh, dear. A previous attempt to write the journal
+ * superblock failed. This could happen because the
+ * USB device was yanked out. Or it could happen to
+ * be a transient write error and maybe the block will
+ * be remapped. Nothing we can do but to retry the
+ * write and hope for the best.
+ */
+ printk(KERN_ERR "JBD2: previous I/O error detected "
+ "for journal superblock update for %s.\n",
+ journal->j_devname);
+ clear_buffer_write_io_error(bh);
+ set_buffer_uptodate(bh);
+ }
+
spin_lock(&journal->j_state_lock);
jbd_debug(1,"JBD: updating superblock (start %ld, seq %d, errno %d)\n",
journal->j_tail, journal->j_tail_sequence, journal->j_errno);
@@ -1270,9 +1286,16 @@ void jbd2_journal_update_superblock(jour

BUFFER_TRACE(bh, "marking dirty");
mark_buffer_dirty(bh);
- if (wait)
+ if (wait) {
sync_dirty_buffer(bh);
- else
+ if (buffer_write_io_error(bh)) {
+ printk(KERN_ERR "JBD2: I/O error detected "
+ "when updating journal superblock for %s.\n",
+ journal->j_devname);
+ clear_buffer_write_io_error(bh);
+ set_buffer_uptodate(bh);
+ }
+ } else
ll_rw_block(SWRITE, 1, &bh);

out:

2008-12-03 20:25:01

by Greg KH

Subject: [patch 086/104] ext4: fix initialization of UNINIT bitmap blocks

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Frederic Bohe <[email protected]>

(cherry picked from commit c806e68f5647109350ec546fee5b526962970fd2)

This fixes a bug which caused on-line resizing of filesystems with a
1k blocksize to fail. The root cause of this bug was the fact that if
an uninitialized bitmap block gets read in by userspace (which
e2fsprogs does try to avoid, but can happen when the blocksize is less
than the pagesize and an adjacent block is read into memory)
ext4_read_block_bitmap() was erroneously depending on the buffer
uptodate flag to decide whether it needed to initialize the bitmap
block in memory --- i.e., to set the standard set of blocks in use by
a block group (superblock, bitmaps, inode table, etc.). Essentially,
ext4_read_block_bitmap() assumed it was the only routine that might
try to read a block containing a block bitmap, which is simply not
true.

To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap()
must always initialize uninitialized bitmap blocks. Once a block or
inode is allocated out of that bitmap, it will be marked as
initialized in the block group descriptor, so in general this won't
result in any extra unnecessary work.
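
The change in the early-return test can be illustrated with a toy model. The types, the flag value TOY_BG_UNINIT and the function names are all invented for this sketch; reading an already initialized bitmap from disk is left out.

```c
#define TOY_BG_UNINIT 0x1u   /* hypothetical flag value */

struct toy_desc { unsigned int flags; };
struct toy_bh   { int uptodate; int has_reserved_bits; };

/* Stands in for ext4_init_block_bitmap(): marks the blocks a group
 * always uses and declares the buffer contents valid. */
static void toy_init_bitmap(struct toy_bh *bh)
{
    bh->has_reserved_bits = 1;
    bh->uptodate = 1;
}

/* Old logic: trust the uptodate flag alone. */
static void read_bitmap_old(struct toy_bh *bh, const struct toy_desc *desc)
{
    if (bh->uptodate)
        return;
    if (desc->flags & TOY_BG_UNINIT)
        toy_init_bitmap(bh);
}

/* New logic: an uptodate buffer is only good enough when the group
 * descriptor says the bitmap has been initialized. */
static void read_bitmap_new(struct toy_bh *bh, const struct toy_desc *desc)
{
    if (bh->uptodate && !(desc->flags & TOY_BG_UNINIT))
        return;
    if (desc->flags & TOY_BG_UNINIT)
        toy_init_bitmap(bh);
}
```

If some other reader has already marked the buffer uptodate while the descriptor still carries the UNINIT flag, the old check returns the raw buffer untouched, whereas the new check still initializes it.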

Signed-off-by: Frederic Bohe <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/balloc.c | 4 +++-
fs/ext4/ialloc.c | 4 +++-
fs/ext4/mballoc.c | 4 +++-
3 files changed, 9 insertions(+), 3 deletions(-)

--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -318,9 +318,11 @@ ext4_read_block_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -115,9 +115,11 @@ ext4_read_inode_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
ext4_init_inode_bitmap(sb, bh, block_group, desc);
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -784,9 +784,11 @@ static int ext4_mb_init_cache(struct pag
if (bh[i] == NULL)
goto out;

- if (bh_uptodate_or_lock(bh[i]))
+ if (buffer_uptodate(bh[i]) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
continue;

+ lock_buffer(bh[i]);
spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],

2008-12-03 20:25:29

by Greg KH

Subject: [patch 088/104] jbd2: Fix buffer head leak when writing the commit block

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 45a90bfd90c1215bf824c0f705b409723f52361b)

Also make sure the buffer heads are marked clean before submitting bh
for writing. The previous code was marking the buffer head dirty,
which would have forced an unneeded write (and seek) to the journal
for no good reason.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/jbd2/commit.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -126,8 +126,7 @@ static int journal_submit_commit_record(

JBUFFER_TRACE(descriptor, "submit commit block");
lock_buffer(bh);
- get_bh(bh);
- set_buffer_dirty(bh);
+ clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;

@@ -160,7 +159,7 @@ static int journal_submit_commit_record(
/* And try again, without the barrier */
lock_buffer(bh);
set_buffer_uptodate(bh);
- set_buffer_dirty(bh);
+ clear_buffer_dirty(bh);
ret = submit_bh(WRITE, bh);
}
*cbh = bh;

2008-12-03 20:25:53

by Greg KH

Subject: [patch 087/104] jbd2: abort instead of waiting for nonexistent transaction


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Duane Griffin <[email protected]>

(cherry picked from commit 23f8b79eae8a74e42a006ffa7c456e295c7e1c0d)

The __jbd2_log_wait_for_space function sits in a loop checkpointing
transactions until there is sufficient space free in the journal.
However, if there are no transactions to be processed (e.g. because the
free space calculation is wrong due to a corrupted filesystem) it will
never progress.

Check for space being required when no transactions are outstanding and
abort the journal instead of endlessly looping.
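
A rough userspace model of the loop and its new exit condition follows.
The struct toy_journal type, its fields, and toy_wait_for_space() are
assumptions made for illustration; the real function also takes and
drops several locks that are omitted here.

```c
#include <assert.h>

/* Hypothetical model: checkpointing one transaction frees some blocks. */
struct toy_journal {
    int space_left;     /* free blocks currently in the log */
    int pending;        /* transactions still waiting to be checkpointed */
    int blocks_per_txn; /* blocks freed by checkpointing one transaction */
    int aborted;        /* set when the journal is aborted */
};

/* Returns 0 once enough space is free, -1 if the journal was aborted. */
int toy_wait_for_space(struct toy_journal *j, int needed)
{
    while (j->space_left < needed) {
        if (j->pending == 0) {
            /* Nothing left to checkpoint: another pass cannot help. */
            j->aborted = 1;
            return -1;
        }
        j->pending--;
        j->space_left += j->blocks_per_txn;
    }
    return 0;
}
```

Without the pending == 0 test, the loop in this model would spin forever
whenever the space calculation can never be satisfied.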

This patch fixes the bug reported by Sami Liedes at:
http://bugzilla.kernel.org/show_bug.cgi?id=10976

Signed-off-by: Duane Griffin <[email protected]>
Cc: Sami Liedes <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/jbd2/checkpoint.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -126,14 +126,29 @@ void __jbd2_log_wait_for_space(journal_t

/*
* Test again, another process may have checkpointed while we
- * were waiting for the checkpoint lock
+ * were waiting for the checkpoint lock. If there are no
+ * outstanding transactions there is nothing to checkpoint and
+ * we can't make progress. Abort the journal in this case.
*/
spin_lock(&journal->j_state_lock);
+ spin_lock(&journal->j_list_lock);
nblocks = jbd_space_needed(journal);
if (__jbd2_log_space_left(journal) < nblocks) {
+ int chkpt = journal->j_checkpoint_transactions != NULL;
+
+ spin_unlock(&journal->j_list_lock);
spin_unlock(&journal->j_state_lock);
- jbd2_log_do_checkpoint(journal);
+ if (chkpt) {
+ jbd2_log_do_checkpoint(journal);
+ } else {
+ printk(KERN_ERR "%s: no transactions\n",
+ __func__);
+ jbd2_journal_abort(journal, 0);
+ }
+
spin_lock(&journal->j_state_lock);
+ } else {
+ spin_unlock(&journal->j_list_lock);
}
mutex_unlock(&journal->j_checkpoint_mutex);
}

2008-12-03 20:26:19

by Greg KH

Subject: [patch 089/104] ext4: fix xattr deadlock


2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Kalpak Shah <[email protected]>

(cherry picked from commit 4d20c685fa365766a8f13584b4c8178a15ab7103)

ext4_xattr_set_handle() eventually ends up calling
ext4_mark_inode_dirty() which tries to expand the inode by shifting
the EAs. This leads to the xattr_sem being downed again, which results
in a deadlock.

This patch makes sure that if ext4_xattr_set_handle() is in the
call-chain, ext4_mark_inode_dirty() will not expand the inode.
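
The save, set, and restore pattern used for the flag can be sketched in
userspace as follows. TOY_STATE_NO_EXPAND, struct toy_inode, and both
functions are hypothetical stand-ins; the semaphore itself is left out.

```c
#include <assert.h>

#define TOY_STATE_NO_EXPAND 0x1UL /* stand-in for EXT4_STATE_NO_EXPAND */

struct toy_inode {
    unsigned long i_state;
    int expand_calls; /* how often the inode would have been expanded */
};

/* Models ext4_mark_inode_dirty(): expands only when NO_EXPAND is clear. */
void toy_mark_dirty(struct toy_inode *inode)
{
    if (!(inode->i_state & TOY_STATE_NO_EXPAND))
        inode->expand_calls++;
}

/* Models ext4_xattr_set_handle(): save the flag, set it, restore on exit. */
void toy_xattr_set(struct toy_inode *inode)
{
    unsigned long no_expand = inode->i_state & TOY_STATE_NO_EXPAND;

    inode->i_state |= TOY_STATE_NO_EXPAND;
    toy_mark_dirty(inode); /* must not try to expand the inode here */
    if (no_expand == 0)
        inode->i_state &= ~TOY_STATE_NO_EXPAND;
}
```

Saving the previous value first means a caller that had already set the
flag still finds it set on return.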

Signed-off-by: Kalpak Shah <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/xattr.c | 6 ++++++
1 file changed, 6 insertions(+)

--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -959,6 +959,7 @@ ext4_xattr_set_handle(handle_t *handle,
struct ext4_xattr_block_find bs = {
.s = { .not_found = -ENODATA, },
};
+ unsigned long no_expand;
int error;

if (!name)
@@ -966,6 +967,9 @@ ext4_xattr_set_handle(handle_t *handle,
if (strlen(name) > 255)
return -ERANGE;
down_write(&EXT4_I(inode)->xattr_sem);
+ no_expand = EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND;
+ EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND;
+
error = ext4_get_inode_loc(inode, &is.iloc);
if (error)
goto cleanup;
@@ -1042,6 +1046,8 @@ ext4_xattr_set_handle(handle_t *handle,
cleanup:
brelse(is.iloc.bh);
brelse(bs.bh);
+ if (no_expand == 0)
+ EXT4_I(inode)->i_state &= ~EXT4_STATE_NO_EXPAND;
up_write(&EXT4_I(inode)->xattr_sem);
return error;
}

2008-12-03 20:26:38

by Greg KH

Subject: [patch 090/104] ext4: Free ext4_prealloc_space using kmem_cache_free

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Aneesh Kumar K.V <[email protected]>

(cherry picked from commit 688f05a01983711a4e715b1d6e15a89a89c96a66)

Memory allocated via kmem_cache_alloc() must be freed with
kmem_cache_free(), not kfree().

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/mballoc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2577,7 +2577,7 @@ static void ext4_mb_cleanup_pa(struct ex
pa = list_entry(cur, struct ext4_prealloc_space, pa_group_list);
list_del(&pa->pa_group_list);
count++;
- kfree(pa);
+ kmem_cache_free(ext4_pspace_cachep, pa);
}
if (count)
mb_debug("mballoc: %u PAs left\n", count);

2008-12-03 20:26:53

by Greg KH

Subject: [patch 091/104] ext4: Do mballoc init before doing filesystem recovery

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Aneesh Kumar K.V <[email protected]>

(cherry picked from commit c2774d84fd6cab2bfa2a2fae0b1ca8d8ebde48a2)

During filesystem recovery we may be doing a truncate
which expects some of the mballoc data structures to
be initialized. So do ext4_mb_init before recovery.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/super.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2449,6 +2449,21 @@ static int ext4_fill_super(struct super_
"available.\n");
}

+ if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
+ printk(KERN_WARNING "EXT4-fs: Ignoring delalloc option - "
+ "requested data journaling mode\n");
+ clear_opt(sbi->s_mount_opt, DELALLOC);
+ } else if (test_opt(sb, DELALLOC))
+ printk(KERN_INFO "EXT4-fs: delayed allocation enabled\n");
+
+ ext4_ext_init(sb);
+ err = ext4_mb_init(sb, needs_recovery);
+ if (err) {
+ printk(KERN_ERR "EXT4-fs: failed to initalize mballoc (%d)\n",
+ err);
+ goto failed_mount4;
+ }
+
/*
* akpm: core read_super() calls in here with the superblock locked.
* That deadlocks, because orphan cleanup needs to lock the superblock
@@ -2468,16 +2483,6 @@ static int ext4_fill_super(struct super_
test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA ? "ordered":
"writeback");

- if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
- printk(KERN_WARNING "EXT4-fs: Ignoring delalloc option - "
- "requested data journaling mode\n");
- clear_opt(sbi->s_mount_opt, DELALLOC);
- } else if (test_opt(sb, DELALLOC))
- printk(KERN_INFO "EXT4-fs: delayed allocation enabled\n");
-
- ext4_ext_init(sb);
- ext4_mb_init(sb, needs_recovery);
-
lock_kernel();
return 0;

2008-12-03 20:27:26

by Greg KH

Subject: [patch 094/104] ext4: Convert to host order before using the values.

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Aneesh Kumar K.V <[email protected]>

(cherry picked from commit d94e99a64c3beece22dbfb2b335771a59184eb0a)

Use le16_to_cpu to read the s_reserved_gdt_blocks value
from the superblock.
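
Why the conversion matters can be shown with a userspace sketch. Both
helpers below are hypothetical; toy_le16_to_cpu() only imitates what
le16_to_cpu() does for an on-disk little-endian field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for le16_to_cpu(): decode two little-endian bytes. */
uint16_t toy_le16_to_cpu(const unsigned char raw[2])
{
    return (uint16_t)(raw[0] | (raw[1] << 8));
}

/* Reads the same bytes in host order, as the unconverted code did. */
uint16_t toy_host_read(const unsigned char raw[2])
{
    uint16_t v;

    memcpy(&v, raw, sizeof(v));
    return v;
}
```

On a little-endian host both helpers agree, which is why the missing
conversion goes unnoticed there; on a big-endian host the raw read
yields the byte-swapped value.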

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/super.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1506,9 +1506,8 @@ static int ext4_fill_flex_info(struct su

/* We allocate both existing and potentially added groups */
flex_group_count = ((sbi->s_groups_count + groups_per_flex - 1) +
- ((sbi->s_es->s_reserved_gdt_blocks +1 ) <<
- EXT4_DESC_PER_BLOCK_BITS(sb))) /
- groups_per_flex;
+ ((le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) + 1) <<
+ EXT4_DESC_PER_BLOCK_BITS(sb))) / groups_per_flex;
sbi->s_flex_groups = kzalloc(flex_group_count *
sizeof(struct flex_groups), GFP_KERNEL);
if (sbi->s_flex_groups == NULL) {

2008-12-03 20:27:46

by Greg KH

Subject: [patch 093/104] jbd2: dont give up looking for space so easily in __jbd2_log_wait_for_space

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 8c3f25d8950c3e9fe6c9849f88679b3f2a071550)

Commit 23f8b79e introduced a regression because it assumed that if
there were no transactions ready to be checkpointed, that no progress
could be made on making space available in the journal, and so the
journal should be aborted. This assumption is false; it could be the
case that simply calling jbd2_cleanup_journal_tail() will recover the
necessary space, or, for small journals, the currently committing
transaction could be responsible for chewing up the required space in
the log, so we need to wait for the currently committing transaction
to finish before trying to force a checkpoint operation.
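
The resulting order of fallbacks can be sketched as a pure decision
function. The enum and toy_pick_action() are hypothetical names; the
comments note which kernel call each outcome corresponds to in the diff
below.

```c
#include <assert.h>

enum toy_action {
    TOY_CHECKPOINT,   /* jbd2_log_do_checkpoint() */
    TOY_TAIL_CLEANED, /* jbd2_cleanup_journal_tail() returned 0 */
    TOY_WAIT_COMMIT,  /* jbd2_log_wait_commit() on the committing tid */
    TOY_ABORT         /* jbd2_journal_abort() */
};

/* Picks the first option that can still make progress. */
enum toy_action toy_pick_action(int have_checkpoint_txns,
                                int tail_cleanup_ok,
                                unsigned int committing_tid)
{
    if (have_checkpoint_txns)
        return TOY_CHECKPOINT;
    if (tail_cleanup_ok)
        return TOY_TAIL_CLEANED;
    if (committing_tid != 0)
        return TOY_WAIT_COMMIT;
    return TOY_ABORT;
}
```

In this model, aborting is reached only when all three earlier options
are unavailable.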

This patch fixes a bug reported by Mihai Harpau at:
https://bugzilla.redhat.com/show_bug.cgi?id=469582

This patch fixes a bug reported by François Valenduc at:
http://bugzilla.kernel.org/show_bug.cgi?id=11840

Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Duane Griffin <[email protected]>
Cc: Toshiyuki Okajima <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/jbd2/checkpoint.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -114,7 +114,7 @@ static int __try_to_free_cp_buf(struct j
*/
void __jbd2_log_wait_for_space(journal_t *journal)
{
- int nblocks;
+ int nblocks, space_left;
assert_spin_locked(&journal->j_state_lock);

nblocks = jbd_space_needed(journal);
@@ -127,25 +127,43 @@ void __jbd2_log_wait_for_space(journal_t
/*
* Test again, another process may have checkpointed while we
* were waiting for the checkpoint lock. If there are no
- * outstanding transactions there is nothing to checkpoint and
- * we can't make progress. Abort the journal in this case.
+ * transactions ready to be checkpointed, try to recover
+ * journal space by calling cleanup_journal_tail(), and if
+ * that doesn't work, by waiting for the currently committing
+ * transaction to complete. If there is absolutely no way
+ * to make progress, this is either a BUG or corrupted
+ * filesystem, so abort the journal and leave a stack
+ * trace for forensic evidence.
*/
spin_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
nblocks = jbd_space_needed(journal);
- if (__jbd2_log_space_left(journal) < nblocks) {
+ space_left = __jbd2_log_space_left(journal);
+ if (space_left < nblocks) {
int chkpt = journal->j_checkpoint_transactions != NULL;
+ tid_t tid = 0;

+ if (journal->j_committing_transaction)
+ tid = journal->j_committing_transaction->t_tid;
spin_unlock(&journal->j_list_lock);
spin_unlock(&journal->j_state_lock);
if (chkpt) {
jbd2_log_do_checkpoint(journal);
+ } else if (jbd2_cleanup_journal_tail(journal) == 0) {
+ /* We were able to recover space; yay! */
+ ;
+ } else if (tid) {
+ jbd2_log_wait_commit(journal, tid);
} else {
- printk(KERN_ERR "%s: no transactions\n",
- __func__);
+ printk(KERN_ERR "%s: needed %d blocks and "
+ "only had %d space available\n",
+ __func__, nblocks, space_left);
+ printk(KERN_ERR "%s: no way to get more "
+ "journal space in %s\n", __func__,
+ journal->j_devname);
+ WARN_ON(1);
jbd2_journal_abort(journal, 0);
}
-
spin_lock(&journal->j_state_lock);
} else {
spin_unlock(&journal->j_list_lock);

2008-12-03 20:28:09

by Greg KH

Subject: [patch 096/104] ext4: calculate journal credits correctly

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit ac51d83705c2a38c71f39cde99708b14e6212a60)

This fixes a 2.6.27 regression which was introduced in commit a02908f1.

We weren't passing the chunk parameter down to the two helper functions,
ext4_indirect_trans_blocks() and ext4_ext_index_trans_blocks(), with
the result that we massively overestimated the number of credits needed
by ext4_da_writepages, especially in the non-extents case. This causes
failures especially on /boot partitions, which tend to be small and do
not use extents, since GRUB doesn't handle extents.

This patch fixes the bug reported by Joseph Fannin at:
http://bugzilla.kernel.org/show_bug.cgi?id=11964

Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/inode.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4444,9 +4444,10 @@ static int ext4_indirect_trans_blocks(st
static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
{
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
- return ext4_indirect_trans_blocks(inode, nrblocks, 0);
- return ext4_ext_index_trans_blocks(inode, nrblocks, 0);
+ return ext4_indirect_trans_blocks(inode, nrblocks, chunk);
+ return ext4_ext_index_trans_blocks(inode, nrblocks, chunk);
}
+
/*
* Account for index blocks, block groups bitmaps and block group
* descriptor blocks if modify datablocks and index blocks

2008-12-03 20:28:34

by Greg KH

Subject: [patch 097/104] ext4: Mark the buffer_heads as dirty and uptodate after prepare_write

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Aneesh Kumar K.V <[email protected]>

(cherry picked from commit ed9b3e3379731e9f9d2f73f3d7fd9e7d2ce3df4a)

We need to make sure we mark the buffer_heads as dirty and uptodate
so that block_write_full_page writes them out correctly.

This fixes mmap corruptions that can occur in low memory situations.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/inode.c | 2 ++
1 file changed, 2 insertions(+)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2242,6 +2242,8 @@ static int ext4_da_writepage(struct page
unlock_page(page);
return 0;
}
+ /* now mark the buffer_heads as dirty and uptodate */
+ block_commit_write(page, 0, PAGE_CACHE_SIZE);
}

if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))

2008-12-03 20:28:53

by Greg KH

Subject: [patch 101/104] ext3: dont try to resize if there are no reserved gdt blocks left

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Josef Bacik <[email protected]>

commit 972fbf779832e5ad15effa7712789aeff9224c37 upstream.

When you try to resize an ext3 fs and run out of reserved gdt blocks,
you get an error that doesn't actually tell you what went wrong. It just
says that the gdb it picked is not correct, which is the case since you
don't have any reserved gdt blocks left. This patch adds a check to make
sure you have reserved gdt blocks to use, and if not, prints a more
relevant error.

Signed-off-by: Josef Bacik <[email protected]>
Cc: <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext3/resize.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -790,7 +790,8 @@ int ext3_group_add(struct super_block *s

if (reserved_gdb || gdb_off == 0) {
if (!EXT3_HAS_COMPAT_FEATURE(sb,
- EXT3_FEATURE_COMPAT_RESIZE_INODE)){
+ EXT3_FEATURE_COMPAT_RESIZE_INODE)
+ || !le16_to_cpu(es->s_reserved_gdt_blocks)) {
ext3_warning(sb, __func__,
"No reserved GDT blocks, can't resize");
return -EPERM;

2008-12-03 20:29:21

by Greg KH

Subject: [patch 103/104] ext3: fix ext3 block reservation early ENOSPC issue

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Mingming Cao <[email protected]>

commit 46d01a225e694f1a4343beea44f1e85105aedd7e upstream.

We could run into an ENOSPC error on ext3 even when there are free blocks
on the filesystem.

The problem is triggered when the goal block group has 0 free blocks and
the remaining block groups are skipped due to the check of
"free_blocks <= windowsz/2". The current code can fall back to
non-reservation allocation to prevent early ENOSPC after examining all
the block groups with reservation on, but this fallback is bypassed if
the reservation window has already been turned off, which is true in
this case.

This patch fixes two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left; we should continue searching the rest of the block
groups instead.

The intention of the current code is to turn off block reservation if the
goal block group has a few free blocks left (not enough to make the
desired reservation window), so that it can try to allocate in the goal
block group for better locality. But if the goal block group has 0 free
blocks, the code should leave block reservation on and continue searching
the next block groups, rather than turning off block reservation
completely.

2) We don't need to check the window size if block reservation is off.
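
The two conditions can be modelled in userspace as below. This is a
sketch under assumptions: both function names are hypothetical, the
`patched` parameter exists only to compare old and new behaviour, and
has_rsv stands for a non-NULL my_rsv.

```c
#include <assert.h>
#include <stdbool.h>

/* Issue 1: should reservation be dropped for the goal block group? */
bool toy_drop_reservation(bool has_rsv, bool rsv_empty,
                          unsigned free_blocks, unsigned windowsz,
                          bool patched)
{
    bool drop = has_rsv && free_blocks < windowsz && rsv_empty;

    if (patched)
        drop = drop && free_blocks > 0;
    return drop;
}

/* Issue 2: should a later group be skipped as too full for a window? */
bool toy_skip_group(bool has_rsv, unsigned free_blocks, unsigned windowsz,
                    bool patched)
{
    bool too_full = free_blocks <= windowsz / 2;

    return patched ? (has_rsv && too_full) : too_full;
}
```

With a window of 8 and an empty goal group, the old logic drops the
reservation and then still skips a group holding 3 free blocks; the new
logic keeps the reservation, and once reservation is off no group is
skipped because of the window size.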

The problem was originally found and fixed in ext4.

Signed-off-by: Mingming Cao <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext3/balloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -1547,6 +1547,7 @@ retry_alloc:
* turn off reservation for this allocation
*/
if (my_rsv && (free_blocks < windowsz)
+ && (free_blocks > 0)
&& (rsv_is_empty(&my_rsv->rsv_window)))
my_rsv = NULL;

@@ -1585,7 +1586,7 @@ retry_alloc:
* free blocks is less than half of the reservation
* window size.
*/
- if (free_blocks <= (windowsz/2))
+ if (my_rsv && (free_blocks <= (windowsz/2)))
continue;

brelse(bitmap_bh);

2008-12-03 20:29:49

by Greg KH

Subject: [patch 104/104] jbd: ordered data integrity fix

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Hidehiro Kawai <[email protected]>

commit 960a22ae60c8a723bd17da3b929fe0bcea6d007e upstream.

In ordered mode, if a file data buffer being dirtied exists in the
committing transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it. But we
must not remove the buffer from the committing transaction when the
buffer couldn't be written out; otherwise the error would be missed and
the committing transaction would not abort.

This patch adds an error check before removing the buffer from the
committing transaction.
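
The added check can be modelled in userspace as follows. struct
toy_buffer and toy_dirty_data() are hypothetical; the sketch only shows
that a failed buffer stays on the committing transaction and that the
error is returned to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_buffer {
    bool uptodate;      /* cleared when the write-out failed */
    bool in_committing; /* still filed on the committing transaction */
};

/* Returns 0 after refiling the buffer, or -EIO if the write had failed. */
int toy_dirty_data(struct toy_buffer *bh)
{
    if (!bh->uptodate)
        return -EIO; /* leave it filed so the commit sees the error */
    bh->in_committing = false; /* safe to move to the running transaction */
    return 0;
}
```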

Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/jbd/transaction.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
journal_t *journal = handle->h_transaction->t_journal;
int need_brelse = 0;
struct journal_head *jh;
+ int ret = 0;

if (is_handle_aborted(handle))
- return 0;
+ return ret;

jh = journal_add_journal_head(bh);
JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
time if it is redirtied */
}

- /* journal_clean_data_list() may have got there first */
+ /*
+ * We cannot remove the buffer with io error from the
+ * committing transaction, because otherwise it would
+ * miss the error and the commit would not abort.
+ */
+ if (unlikely(!buffer_uptodate(bh))) {
+ ret = -EIO;
+ goto no_journal;
+ }
+
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
}
JBUFFER_TRACE(jh, "exit");
journal_put_journal_head(jh);
- return 0;
+ return ret;
}

/**

2008-12-03 20:30:21

by Greg KH

Subject: [patch 102/104] ext2: fix ext2 block reservation early ENOSPC issue

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Mingming Cao <[email protected]>

commit d707d31c972b657dfc2efefd0b99cc4e14223dab upstream.

We could run into an ENOSPC error on ext2 even when there are free blocks
on the filesystem.

The problem is triggered when the goal block group has 0 free blocks and
the remaining block groups are skipped due to the check of
"free_blocks <= windowsz/2". The current code can fall back to
non-reservation allocation to prevent early ENOSPC after examining all
the block groups with reservation on, but this fallback is bypassed if
the reservation window has already been turned off, which is true in
this case.

This patch fixes two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left; we should continue searching the rest of the block
groups instead.

The intention of the current code is to turn off block reservation if the
goal block group has a few free blocks left (not enough to make the
desired reservation window), so that it can try to allocate in the goal
block group for better locality. But if the goal block group has 0 free
blocks, the code should leave block reservation on and continue searching
the next block groups, rather than turning off block reservation
completely.

2) We don't need to check the window size if block reservation is off.

The problem was originally found and fixed in ext4.

Signed-off-by: Mingming Cao <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext2/balloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -1295,6 +1295,7 @@ retry_alloc:
* turn off reservation for this allocation
*/
if (my_rsv && (free_blocks < windowsz)
+ && (free_blocks > 0)
&& (rsv_is_empty(&my_rsv->rsv_window)))
my_rsv = NULL;

@@ -1332,7 +1333,7 @@ retry_alloc:
* free blocks is less than half of the reservation
* window size.
*/
- if (free_blocks <= (windowsz/2))
+ if (my_rsv && (free_blocks <= (windowsz/2)))
continue;

brelse(bitmap_bh);

2008-12-03 20:30:45

by Greg KH

Subject: [patch 100/104] ext3: Fix duplicate entries returned from getdents() system call

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Theodore Ts'o <[email protected]>

commit 8c9fa93d51123c5540762b1a9e1919d6f9c4af7c upstream.

Fix a regression caused by commit 6a897cf4, "ext3: fix ext3_dx_readdir
hash collision handling", where deleting files in a large directory
(requiring more than one getdents system call) results in some
filenames being returned twice. This was caused by a failure to
update info->curr_hash and info->curr_minor_hash, so that if the
directory had been modified since the last getdents() system call
(as would be the case if the user is running "rm -r" or "git clean"),
a directory entry would be returned twice to userspace.

This patch fixes the bug reported by Markus Trippelsdorf at:
http://bugzilla.kernel.org/show_bug.cgi?id=11844

Signed-off-by: "Theodore Ts'o" <[email protected]>
Tested-by: Markus Trippelsdorf <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext3/dir.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -456,17 +456,8 @@ static int ext3_dx_readdir(struct file *
if (info->extra_fname) {
if (call_filldir(filp, dirent, filldir, info->extra_fname))
goto finished;
-
info->extra_fname = NULL;
- info->curr_node = rb_next(info->curr_node);
- if (!info->curr_node) {
- if (info->next_hash == ~0) {
- filp->f_pos = EXT3_HTREE_EOF;
- goto finished;
- }
- info->curr_hash = info->next_hash;
- info->curr_minor_hash = 0;
- }
+ goto next_node;
} else if (!info->curr_node)
info->curr_node = rb_first(&info->root);

@@ -498,9 +489,14 @@ static int ext3_dx_readdir(struct file *
info->curr_minor_hash = fname->minor_hash;
if (call_filldir(filp, dirent, filldir, fname))
break;
-
+ next_node:
info->curr_node = rb_next(info->curr_node);
- if (!info->curr_node) {
+ if (info->curr_node) {
+ fname = rb_entry(info->curr_node, struct fname,
+ rb_hash);
+ info->curr_hash = fname->hash;
+ info->curr_minor_hash = fname->minor_hash;
+ } else {
if (info->next_hash == ~0) {
filp->f_pos = EXT3_HTREE_EOF;
break;

2008-12-03 20:31:07

by Greg KH

Subject: [patch 099/104] ext3: fix ext3_dx_readdir hash collision handling

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Eugene Dashevsky <[email protected]>

commit 6a897cf447a83c9c3fd1b85a1e525c02d6eada7d upstream.

This fixes a bug where readdir() would return a directory entry twice
if there was a hash collision in a hash-tree-indexed directory.

[[email protected]: coding-style fixes]
Signed-off-by: Eugene Dashevsky <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext3/dir.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -414,7 +414,7 @@ static int call_filldir(struct file * fi
get_dtype(sb, fname->file_type));
if (error) {
filp->f_pos = curr_pos;
- info->extra_fname = fname->next;
+ info->extra_fname = fname;
return error;
}
fname = fname->next;
@@ -453,11 +453,21 @@ static int ext3_dx_readdir(struct file *
* If there are any leftover names on the hash collision
* chain, return them first.
*/
- if (info->extra_fname &&
- call_filldir(filp, dirent, filldir, info->extra_fname))
- goto finished;
+ if (info->extra_fname) {
+ if (call_filldir(filp, dirent, filldir, info->extra_fname))
+ goto finished;

- if (!info->curr_node)
+ info->extra_fname = NULL;
+ info->curr_node = rb_next(info->curr_node);
+ if (!info->curr_node) {
+ if (info->next_hash == ~0) {
+ filp->f_pos = EXT3_HTREE_EOF;
+ goto finished;
+ }
+ info->curr_hash = info->next_hash;
+ info->curr_minor_hash = 0;
+ }
+ } else if (!info->curr_node)
info->curr_node = rb_first(&info->root);

while (1) {

2008-12-03 20:31:31

by Greg KH

Subject: [patch 098/104] ext4: add checksum calculation when clearing UNINIT flag in ext4_new_inode

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: Frederic Bohe <[email protected]>

(cherry picked from commit 23712a9c28b9f80a8cf70c8490358d5f562d2465)

When initializing an uninitialized block group in ext4_new_inode(),
its block group checksum must be re-calculated. This fixes a race
when several threads try to allocate a new inode in an UNINIT'd group.

There is some question whether we need to be initializing the block
bitmap in ext4_new_inode() at all, but for now, if we are going to
init the block group, let's eliminate the race.
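
The rule that the checksum must be recomputed whenever the descriptor
changes can be sketched as below. Everything here is hypothetical: the
struct, the flag value, and in particular toy_desc_csum(), which is a
toy function and not the checksum that ext4_group_desc_csum() computes.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BG_BLOCK_UNINIT 0x2u /* stand-in for EXT4_BG_BLOCK_UNINIT */

struct toy_group_desc {
    uint16_t flags;
    uint16_t free_blocks;
    uint16_t checksum;
};

/* Toy checksum over the other fields; NOT what ext4 actually computes. */
uint16_t toy_desc_csum(const struct toy_group_desc *gd)
{
    return (uint16_t)((gd->flags * 31u) ^ gd->free_blocks);
}

/* Clears the UNINIT flag, sets the free count, refreshes the checksum. */
void toy_init_group(struct toy_group_desc *gd, uint16_t free_after_init)
{
    gd->flags &= (uint16_t)~TOY_BG_BLOCK_UNINIT;
    gd->free_blocks = free_after_init;
    gd->checksum = toy_desc_csum(gd); /* the step this patch adds */
}
```

Without the last assignment, the stored checksum in this model would
still describe the descriptor as it was before the flag was cleared.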

Signed-off-by: Frederic Bohe <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/ialloc.c | 2 ++
1 file changed, 2 insertions(+)

--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -717,6 +717,8 @@ got:
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
free = ext4_free_blocks_after_init(sb, group, gdp);
gdp->bg_free_blocks_count = cpu_to_le16(free);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
+ gdp);
}
spin_unlock(sb_bgl_lock(sbi, group));

2008-12-03 20:31:47

by Greg KH

Subject: [patch 095/104] ext4: wait on all pending commits in ext4_sync_fs()

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 14ce0cb411c88681ab8f3a4c9caa7f42e97a3184)

In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.

In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super. Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.

To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.

Signed-off-by: Arthur Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: Eric Sandeen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/super.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2920,12 +2920,9 @@ int ext4_force_commit(struct super_block
/*
* Ext4 always journals updates to the superblock itself, so we don't
* have to propagate any other updates to the superblock on disk at this
- * point. Just start an async writeback to get the buffers on their way
- * to the disk.
- *
- * This implicitly triggers the writebehind on sync().
+ * point. (We can probably nuke this function altogether, and remove
+ * any mention to sb->s_dirt in all of fs/ext4; eventual cleanup...)
*/
-
static void ext4_write_super(struct super_block *sb)
{
if (mutex_trylock(&sb->s_lock) != 0)
@@ -2935,14 +2932,14 @@ static void ext4_write_super(struct supe

static int ext4_sync_fs(struct super_block *sb, int wait)
{
- tid_t target;
+ int ret = 0;

sb->s_dirt = 0;
- if (jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, &target)) {
- if (wait)
- jbd2_log_wait_commit(EXT4_SB(sb)->s_journal, target);
- }
- return 0;
+ if (wait)
+ ret = ext4_force_commit(sb);
+ else
+ jbd2_journal_start_commit(EXT4_SB(sb)->s_journal, NULL);
+ return ret;
}

/*
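
The changelog's point (wait only if we started the commit versus wait for everything pending) can be pictured with a toy model. This is only a sketch: struct toy_journal and the toy_* helpers are hypothetical and do not reproduce the jbd2 interfaces or their locking.

```c
/* Toy journal: one transaction being filled, one commit in flight. */
struct toy_journal {
	int running;     /* a transaction is still collecting updates */
	int committing;  /* a commit is on its way to disk */
};

/* Hypothetical helper: start a commit if there is a running transaction. */
static int toy_start_commit(struct toy_journal *j)
{
	if (!j->running)
		return 0;        /* this caller started nothing */
	j->running = 0;
	j->committing = 1;
	return 1;
}

/* Hypothetical helper: block until no commit is in flight. */
static void toy_wait_all(struct toy_journal *j)
{
	j->committing = 0;
}

/* Old shape: wait only when this call started the commit. */
static void toy_sync_old(struct toy_journal *j, int wait)
{
	if (toy_start_commit(j) && wait)
		toy_wait_all(j);
}

/* New shape: with wait set, flush everything that is pending. */
static void toy_sync_new(struct toy_journal *j, int wait)
{
	toy_start_commit(j);
	if (wait)
		toy_wait_all(j);
}
```

With a commit already in flight and no running transaction, toy_sync_old() returns while the commit is still pending, which is the window the changelog describes; toy_sync_new() does not.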

2008-12-03 20:32:13

by Greg KH

Subject: [patch 092/104] ext4: Fix duplicate entries returned from getdents() system call

2.6.27-stable review patch. If anyone has any objections, please let us know.

------------------
From: "Theodore Ts'o" <[email protected]>

(cherry picked from commit 3c37fc86d20fe35be656f070997d62f75c2e4874)

Fix a regression caused by commit d0156417, "ext4: fix ext4_dx_readdir
hash collision handling", where deleting files in a large directory
(requiring more than one getdents system call), results in some
filenames being returned twice. This was caused by a failure to
update info->curr_hash and info->curr_minor_hash, so that if the
directory had gotten modified since the last getdents() system call
(as would be the case if the user is running "rm -r" or "git clean"),
a directory entry would get returned twice to the userspace.

Signed-off-by: "Theodore Ts'o" <[email protected]>

This patch fixes the bug reported by Markus Trippelsdorf at:
http://bugzilla.kernel.org/show_bug.cgi?id=11844

Signed-off-by: "Theodore Ts'o" <[email protected]>
Tested-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

---
fs/ext4/dir.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -458,17 +458,8 @@ static int ext4_dx_readdir(struct file *
if (info->extra_fname) {
if (call_filldir(filp, dirent, filldir, info->extra_fname))
goto finished;
-
info->extra_fname = NULL;
- info->curr_node = rb_next(info->curr_node);
- if (!info->curr_node) {
- if (info->next_hash == ~0) {
- filp->f_pos = EXT4_HTREE_EOF;
- goto finished;
- }
- info->curr_hash = info->next_hash;
- info->curr_minor_hash = 0;
- }
+ goto next_node;
} else if (!info->curr_node)
info->curr_node = rb_first(&info->root);

@@ -500,9 +491,14 @@ static int ext4_dx_readdir(struct file *
info->curr_minor_hash = fname->minor_hash;
if (call_filldir(filp, dirent, filldir, fname))
break;
-
+ next_node:
info->curr_node = rb_next(info->curr_node);
- if (!info->curr_node) {
+ if (info->curr_node) {
+ fname = rb_entry(info->curr_node, struct fname,
+ rb_hash);
+ info->curr_hash = fname->hash;
+ info->curr_minor_hash = fname->minor_hash;
+ } else {
if (info->next_hash == ~0) {
filp->f_pos = EXT4_HTREE_EOF;
break;
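
As a rough illustration of why a stale cursor produces duplicates, here is a userspace sketch. It is an assumption-laden toy, not the ext4 code: struct toy_dir and toy_readdir() are hypothetical, and the "tree rebuilt after the directory changed" step is modelled simply by searching again from the saved cursor on every call.

```c
#include <stddef.h>

/* Toy directory: entry hashes kept sorted, as in a hash-ordered walk. */
struct toy_dir {
	const unsigned *hash;
	size_t n;
};

/*
 * Emit up to 'max' hashes >= *cursor into out[] and return the count.
 * If advance_cursor is nonzero, the cursor is moved on to the entry that
 * will be returned next (what the fix restores). Otherwise it stays on
 * the last entry already handed out (the regression).
 */
static size_t toy_readdir(const struct toy_dir *d, unsigned *cursor,
			  unsigned *out, size_t max, int advance_cursor)
{
	size_t i = 0, got = 0;

	/* Restart the walk from the saved position. */
	while (i < d->n && d->hash[i] < *cursor)
		i++;

	while (i < d->n && got < max) {
		*cursor = d->hash[i];
		out[got++] = d->hash[i++];
	}

	if (advance_cursor)
		*cursor = (i < d->n) ? d->hash[i] : ~0u;

	return got;
}
```

With hashes 10, 20, 30, 40 and two entries per call, the stale-cursor variant hands out 10, 20 and then 20, 30; the advancing variant hands out 10, 20 and then 30, 40.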

2008-12-03 21:40:07

by Rafael J. Wysocki

Subject: Re: [patch 000/104] 2.6.27-stable review

Hi Greg,

On Wednesday, 3 of December 2008, Greg KH wrote:
> This is the start of the stable review cycle for the 2.6.27.8 release.
> There are 104 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let us know. If anyone is a maintainer of the proper subsystem, and
> wants to add a Signed-off-by: line to the patch, please respond with it.
>
> And yes, there are a lot of patches here, the big series are:
> - cifs data corruption patches
> - pci hotplug slot patches to fix the most common warning
> showing up on kerneloops.org
> - ext4 bugfixes
>
> These patches are sent out with a number of different people on the Cc:
> line. If you wish to be a reviewer, please email [email protected] to
> add your name to the list. If you want to be off the reviewer list,
> also email us.
>
> Responses should be made by Friday, December 5, 20:00:00 UTC. Anything
> received after that time might be too late.

The following ACPI commits are also -stable material IMO:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=65df78473ffbf3bff5e2034df1638acc4f3ddd50
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=558073dd56707864f09d563b64e7c37c021e89d2
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b4d469228a92a00e412675817cedd60133de38a
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=40599072dca3ec7d4c9ff8271978be169f974638

Thanks,
Rafael

2008-12-03 22:07:53

by Michael Tokarev

Subject: Re: [patch 000/104] 2.6.27-stable review

Greg KH wrote:
[]
> And yes, there are a lot of patches here, the big series are:
...
> - ext4 bugfixes

I wonder why updates for ext4, which is still marked
"experimental" in kconfig, should go to -stable in
the first place... Given the number of bugs^Wpatches
in this part of the kernel, and given the amount
of current development in this area. Just.. curious.

Thanks!

/mjt

2008-12-03 22:08:17

by Cord Walter

Subject: Re: [patch 027/104] axnet_cs / pcnet_cs: moving PCMCIA_DEVICE_PROD_ID for Netgear FA411

Apparently this patch from Komuro fixes the problem too.

--- linux-2.6.28-rc6/drivers/net/pcmcia/pcnet_cs.c.orig 2008-11-21 21:39:08.000000000 +0900
+++ linux-2.6.28-rc6/drivers/net/pcmcia/pcnet_cs.c 2008-11-21 21:39:24.000000000 +0900
@@ -587,7 +587,7 @@ static int pcnet_config(struct pcmcia_de
}

if ((link->conf.ConfigBase == 0x03c0)
- && (link->manf_id == 0x149) && (link->card_id = 0xc1ab)) {
+ && (link->manf_id == 0x149) && (link->card_id == 0xc1ab)) {
printk(KERN_INFO "pcnet_cs: this is an AX88190 card!\n");
printk(KERN_INFO "pcnet_cs: use axnet_cs instead.\n");
goto failed;

My Netgear FA411 card works with both solutions.
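
For anyone wondering why a single character matters here, a small userspace sketch of the two conditions follows. struct toy_link is a hypothetical stand-in for the two ID fields only, and the ConfigBase test is left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two ID fields used in the check. */
struct toy_link {
	unsigned manf_id;
	unsigned card_id;
};

/* Pre-patch test: '=' assigns 0xc1ab, so the last operand is always true. */
static bool looks_like_ax88190_buggy(struct toy_link *link)
{
	return (link->manf_id == 0x149) && (link->card_id = 0xc1ab);
}

/* Patched test: '==' compares and leaves card_id untouched. */
static bool looks_like_ax88190_fixed(const struct toy_link *link)
{
	return (link->manf_id == 0x149) && (link->card_id == 0xc1ab);
}
```

Fed the IDs from the lspcmcia output quoted below (manf_id 0x0149, card_id 0x0411), the buggy variant reports a match and overwrites card_id, while the fixed variant reports no match.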

-cord


> Cord Walter <[email protected]> wrote:
>
> OK. Thanks for your test.
>
> Best Regards
> Komuro
>
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> > Komuro wrote:
>>>> > >> Does this help?
>>> > >
>>> > > YES.
>>> > >
>>> > > Could you test the patch below at kernel-2.6.28-rc6
>>> > > without the patch "axnet_cs / pcnet_cs: moving PCMCIA_DEVICE_PROD_ID
>>> > > for Netgear FA411"?
>>> > >
>>> > >
>>> > > --- linux-2.6.28-rc6/drivers/net/pcmcia/pcnet_cs.c.orig 2008-11-21 21:39:08.000000000 +0900
>>> > > +++ linux-2.6.28-rc6/drivers/net/pcmcia/pcnet_cs.c 2008-11-21 21:39:24.000000000 +0900
>>> > > @@ -587,7 +587,7 @@ static int pcnet_config(struct pcmcia_de
>>> > > }
>>> > >
>>> > > if ((link->conf.ConfigBase == 0x03c0)
>>> > > - && (link->manf_id == 0x149) && (link->card_id = 0xc1ab)) {
>>> > > + && (link->manf_id == 0x149) && (link->card_id == 0xc1ab)) {
>>> > > printk(KERN_INFO "pcnet_cs: this is an AX88190 card!\n");
>>> > > printk(KERN_INFO "pcnet_cs: use axnet_cs instead.\n");
>>> > > goto failed;
>>> > >
>> >
>> > I tried it after some problems compiling a -rc6 Kernel that was bootable
>> > on my notebook (The Notebook with the FA411 card is rather slow and
>> > takes 5.5 hours for a kernel compile & the faster machine I used
>> > produced an unbootable kernel on the first 2 tries...).
>> >
>> > It works with the pcnet-cs driver and here is what lspcmcia -v and dmesg
>> > said:
>> >
>> > lspcmcia -v:
>> >
>> > Socket 0 Bridge: [yenta_cardbus] (bus ID: 0000:00:0f.0)
>> > Configuration: state: on ready: yes
>> > Voltage: 5.0V Vcc: 5.0V Vpp: 0.0V
>> > Socket 0 Device 0: [pcnet_cs] (bus ID: 0.0)
>> > Configuration: state: on
>> > Product Name: NETGEAR FA411 Fast Ethernet
>> > Identification: manf_id: 0x0149 card_id: 0x0411
>> > function: 6 (network)
>> > prod_id(1): "NETGEAR" (0x9aa79dc3)
>> > prod_id(2): "FA411" (0x40fad875)
>> > prod_id(3): "Fast Ethernet" (0xb4be14e3)
>> > prod_id(4): --- (---)
>> > Socket 1 Bridge: [yenta_cardbus] (bus ID: 0000:00:0f.1)
>> > Configuration: state: on ready: yes
>> >
>> >
>> > dmesg:
>> >
>> > pcmcia_socket pcmcia_socket0: pccard: PCMCIA card inserted into slot 0
>> > pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff:
>> > excluding 0xa0000000-0xa00fffff
>> > pcmcia 0.0: pcmcia: registering new device pcmcia0.0
>> > eth0: NE2000 Compatible: io 0x300, irq 3, hw_addr 00:09:5b:08:98:93
>> > udev: renamed network interface eth0 to eth1
>> >
>> > So, apparently the card can run with both drivers...
>> >
>> > - -cord
>> >


--
Cord Walter
email: [email protected]

Because it is nobody's business that I have nothing to hide:
http://www.gnupg.org/
http://www.truecrypt.org/

2008-12-03 22:10:58

by Michael Tokarev

Subject: Re: [patch 000/104] 2.6.27-stable review

Rafael J. Wysocki wrote:
[]
> The following ACPI commits are also -stable material IMO:

> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b4d469228a92a00e412675817cedd60133de38a
This is
[patch 062/104]
ACPI: EC: count interrupts only if called from interrupt handler.

/mjt

2008-12-03 22:20:20

by Scott Murray

Subject: Re: [patch 034/104] PCI: cpci_hotplug: stop managing hotplug_slot->name

On Wed, 3 Dec 2008, Greg KH wrote:

> 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
> ------------------
> From: Alex Chiang <[email protected]>
>
> commit d6c479e0b777afcd7a26ca62e122e3f878ccc830 upstream.
>
> We no longer need to manage our version of hotplug_slot->name
> since the PCI and hotplug core manage it on our behalf.
>
> Now, we simply advise the PCI core of the name that we would
> like, and let the core take care of the rest.
>
> Cc: [email protected]
> Cc: [email protected]
> Acked-by: Kenji Kaneshige <[email protected]>
> Signed-off-by: Alex Chiang <[email protected]>
> Signed-off-by: Jesse Barnes <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Scott Murray <[email protected]>

Sorry for the extremely late review, I had been swamped and am now in the
process of switching jobs. [email protected] will stop working
soon, I will submit a patch shortly to change it in MAINTAINERS.

Scott


--
Scott Murray
SOMA Networks, Inc.
Toronto, Ontario
e-mail: [email protected]

2008-12-03 23:19:20

by Rafael J. Wysocki

Subject: Re: [patch 000/104] 2.6.27-stable review

On Wednesday, 3 of December 2008, Michael Tokarev wrote:
> Rafael J. Wysocki wrote:
> []
> > The following ACPI commits are also -stable material IMO:
>
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b4d469228a92a00e412675817cedd60133de38a
> This is
> [patch 062/104]
> ACPI: EC: count interrupts only if called from interrupt handler.

Thanks a lot, I overlooked it.

Rafael

2008-12-03 23:28:23

by Greg KH

Subject: Re: [patch 000/104] 2.6.27-stable review

On Thu, Dec 04, 2008 at 01:06:04AM +0300, Michael Tokarev wrote:
> Greg KH wrote:
> []
> > And yes, there are a lot of patches here, the big series are:
> ...
> > - ext4 bugfixes
>
> I wonder why updates for ext4, which is still marked
> "experimental" in kconfig, should go to -stable in
> the first place... Given the number of bugs^Wpatches
> in this part of the kernel, and given the amount
> of current development in this area. Just.. curious.

Primarily because at least 4 distros are basing their releases off of
the 2.6.27 kernel tree. Those distros do have users who are using the
ext4 file system for their daily use, and so bugfixes for it are worth
keeping track of in a centralized place like the -stable tree.

thanks,

greg k-h

2008-12-04 04:12:46

by Theodore Ts'o

Subject: Re: [patch 086/104] ext4: fix initialization of UNINIT bitmap blocks

On Wed, Dec 03, 2008 at 11:56:18AM -0800, Greg KH wrote:
> 2.6.27-stable review patch. If anyone has any objections, please let us know.

Turns out this patch introduces a worse regression than it fixes. The
bug that the patch fixes is that on-line resizes of filesystems with
a 1k blocksize will usually fail. The regression is that when a
filesystem with 1k blocksize is stressed, the filesystem can get
corrupted. On balance, on-line resizing failing is less of a disaster
than corrupting the filesystem when it's stressed. Fortunately, it's
only an issue when the filesystem blocksize is less than the page
size, which isn't the common case at least for the x86.

There are patches queued up to address this, but they haven't hit
mainline yet. Probably best to pull this from the stable tree for
now.

- Ted

2008-12-04 04:22:50

by Tejun Heo

Subject: Re: [patch 067/104] libata: improve phantom device detection

> 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
> ------------------
> From: Tejun Heo <[email protected]>
>
> commit 6a6b97d360702b98c02c7fca4c4e088dcf3a2985 upstream.
>
> Currently libata uses four methods to detect device presence.
>
> 1. PHY status if available.
> 2. TF register R/W test (only promotes presence, never demotes)
> 3. device signature after reset
> 4. IDENTIFY failure detection in SFF state machine
>
> Combination of the above works well in most cases but recently there
> have been a few reports where a phantom device causes unnecessary
> delay during probe. In both cases, PHY status wasn't available. In
> one case, it passed #2 and #3 and failed IDENTIFY with ATA_ERR which
> didn't qualify as #4. The other failed #2 but as it passed #3 and #4,
> it still caused failure.
>
> In both cases, phantom device reported diagnostic failure, so these
> cases can be safely worked around by considering any !ATA_DRQ IDENTIFY
> failure as NODEV_HINT if diagnostic failure is set.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Signed-off-by: Jeff Garzik <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>

Alan thinks this patch could cause a regression. Given that we're
nearing the end of the 2.6.28-rc cycle, I don't think it's critical to
include this in 2.6.27-stable, or at least it can wait a bit more.

Thanks.

--
tejun

2008-12-04 22:11:03

by Valdis Klētnieks

Subject: Re: [patch 101/104] ext3: dont try to resize if there are no reserved gdt blocks left

On Wed, 03 Dec 2008 11:56:53 PST, Greg KH said:

> if (reserved_gdb || gdb_off == 0) {
> if (!EXT3_HAS_COMPAT_FEATURE(sb,
> - EXT3_FEATURE_COMPAT_RESIZE_INODE)) {
> + EXT3_FEATURE_COMPAT_RESIZE_INODE)
> + || !le16_to_cpu(es->s_reserved_gdt_blocks)) {
> ext3_warning(sb, __func__,
> "No reserved GDT blocks, can't resize");
> return -EPERM;

What's the codepath if the compat_feature part trips, but the le16_to_cpu
doesn't? Looks to me like it will then skip over the 'return -EPERM'?
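
One way to check the question is to reduce the patched condition to a two-input predicate. This is only a sketch of how the quoted hunk reads; resize_refused() and its parameter names are hypothetical, not ext3 symbols.

```c
/*
 * Nonzero when the patched test is true, i.e. when the quoted hunk goes
 * on to warn and return -EPERM. Mirrors:
 *     !HAS_COMPAT_FEATURE(RESIZE_INODE) || !s_reserved_gdt_blocks
 */
static int resize_refused(int has_resize_inode, unsigned reserved_gdt_blocks)
{
	return !has_resize_inode || !reserved_gdt_blocks;
}
```

Because || short-circuits, a missing RESIZE_INODE feature alone makes the predicate true whatever the reserved GDT block count is, so under this reading that case still reaches the -EPERM return.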



2008-12-04 22:11:31

by Alex Chiang

Subject: Re: [patch 031/104] PCI: prevent duplicate slot names

Hi Greg,

I found a memory leak that I introduced with the below patch.

I sent the patch to Jesse a few days ago, but he hasn't pushed it
upstream yet.

http://article.gmane.org/gmane.linux.kernel.pci/2187/match=stop+leaking

I did Cc: [email protected] on it, but I'm guessing that since
it's not upstream yet, you guys never saw it.

Anyhow, please pick it up for this round of .27-stable, else
we'll get a memory leak (while trying to work-around the
duplicate slot name issue).

Thanks.

/ac

* Greg KH <[email protected]>:
> 2.6.27-stable review patch. If anyone has any objections,
> please let us know.
>
> ------------------
> From: Alex Chiang <[email protected]>
>
> commit 5fe6cc60680d29740b85278e17a002fa27b7e642 upstream.
>
> Prevent callers of pci_create_slot() from registering slots with
> duplicate names. This condition occurs most often when PCI hotplug
> drivers are loaded on platforms with broken firmware that assigns
> identical names to multiple slots.
>
> We now rename these duplicate slots on behalf of the user.
>
> If firmware assigns the name N to multiple slots, then:
>
> The first registered slot is assigned N
> The second registered slot is assigned N-1
> The third registered slot is assigned N-2
> etc.
>
> This is the permanent fix mentioned in earlier commits d6a9e9b4 and
> 167e782e (shpchp/pciehp: Rename duplicate slot name...).
>
> We take advantage of the new 'hotplug' parameter in pci_create_slot()
> to prevent a slot create/rename race between hotplug drivers and
> detection drivers.
>
> Scenario A:
> hotplug driver detection driver
> -------------- ----------------
> pci_create_slot(hotplug=set)
> pci_create_slot(hotplug=NULL)
>
> The hotplug driver creates the slot with its desired name, and then
> releases the semaphore. Now, the detection driver tries to create
> the same slot, but it already exists. We don't care about renaming,
> so return the existing slot.
>
> Scenario B:
> hotplug driver detection driver
> -------------- ----------------
> pci_create_slot(hotplug=NULL)
> pci_create_slot(hotplug=set)
>
> The detection driver creates the slot with name "X". Then the hotplug
> driver tries to create the same slot, but wants the name "Y" instead.
> We detect that we're trying to create the same slot and that we also
> want a rename, so rename the slot to "Y" and return.
>
> Scenario C:
> hotplug driver hotplug driver
> -------------- ----------------
> pci_create_slot(hotplug=set)
> pci_create_slot(hotplug=set)
>
> Two separate hotplug drivers are attempting to claim the slot and
> are passing valid hotplug_slot args to pci_create_slot(). We detect
> that the slot already has a ->hotplug callback, prevent a rename,
> and return -EBUSY.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Acked-by: Kenji Kaneshige <[email protected]>
> Signed-off-by: Alex Chiang <[email protected]>
> Signed-off-by: Jesse Barnes <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> drivers/pci/hotplug/pci_hotplug_core.c | 26 ------
> drivers/pci/hotplug/pciehp_core.c | 14 ---
> drivers/pci/hotplug/shpchp_core.c | 15 ---
> drivers/pci/slot.c | 139 ++++++++++++++++++++++++++-------
> 4 files changed, 114 insertions(+), 80 deletions(-)
>
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -191,7 +191,6 @@ static int init_slots(struct controller
> struct slot *slot;
> struct hotplug_slot *hotplug_slot;
> struct hotplug_slot_info *info;
> - int len, dup = 1;
> int retval = -ENOMEM;
>
> list_for_each_entry(slot, &ctrl->slot_list, slot_list) {
> @@ -218,24 +217,11 @@ static int init_slots(struct controller
> dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
> "slot_device_offset=%x\n", slot->bus, slot->device,
> slot->hp_slot, slot->number, ctrl->slot_device_offset);
> -duplicate_name:
> retval = pci_hp_register(hotplug_slot,
> ctrl->pci_dev->subordinate,
> slot->device,
> slot->name);
> if (retval) {
> - /*
> - * If slot N already exists, we'll try to create
> - * slot N-1, N-2 ... N-M, until we overflow.
> - */
> - if (retval == -EEXIST) {
> - len = snprintf(slot->name, SLOT_NAME_SIZE,
> - "%d-%d", slot->number, dup++);
> - if (len < SLOT_NAME_SIZE)
> - goto duplicate_name;
> - else
> - err("duplicate slot name overflow\n");
> - }
> err("pci_hp_register failed with error %d\n", retval);
> goto error_info;
> }
> --- a/drivers/pci/hotplug/pci_hotplug_core.c
> +++ b/drivers/pci/hotplug/pci_hotplug_core.c
> @@ -569,12 +569,6 @@ int pci_hp_register(struct hotplug_slot
>
> mutex_lock(&pci_hp_mutex);
>
> - /* Check if we have already registered a slot with the same name. */
> - if (get_slot_from_name(name)) {
> - result = -EEXIST;
> - goto out;
> - }
> -
> /*
> * No problems if we call this interface from both ACPI_PCI_SLOT
> * driver and call it here again. If we've already created the
> @@ -583,27 +577,12 @@ int pci_hp_register(struct hotplug_slot
> pci_slot = pci_create_slot(bus, slot_nr, name, slot);
> if (IS_ERR(pci_slot)) {
> result = PTR_ERR(pci_slot);
> - goto cleanup;
> - }
> -
> - if (pci_slot->hotplug) {
> - dbg("%s: already claimed\n", __func__);
> - result = -EBUSY;
> - goto cleanup;
> + goto out;
> }
>
> slot->pci_slot = pci_slot;
> pci_slot->hotplug = slot;
>
> - /*
> - * Allow pcihp drivers to override the ACPI_PCI_SLOT name.
> - */
> - if (strcmp(kobject_name(&pci_slot->kobj), name)) {
> - result = kobject_rename(&pci_slot->kobj, name);
> - if (result)
> - goto cleanup;
> - }
> -
> list_add(&slot->slot_list, &pci_hotplug_slot_list);
>
> result = fs_add_slot(pci_slot);
> @@ -612,9 +591,6 @@ int pci_hp_register(struct hotplug_slot
> out:
> mutex_unlock(&pci_hp_mutex);
> return result;
> -cleanup:
> - pci_destroy_slot(pci_slot);
> - goto out;
> }
>
> /**
> --- a/drivers/pci/hotplug/shpchp_core.c
> +++ b/drivers/pci/hotplug/shpchp_core.c
> @@ -102,7 +102,7 @@ static int init_slots(struct controller
> struct hotplug_slot *hotplug_slot;
> struct hotplug_slot_info *info;
> int retval = -ENOMEM;
> - int i, len, dup = 1;
> + int i;
>
> for (i = 0; i < ctrl->num_slots; i++) {
> slot = kzalloc(sizeof(*slot), GFP_KERNEL);
> @@ -144,23 +144,10 @@ static int init_slots(struct controller
> dbg("Registering bus=%x dev=%x hp_slot=%x sun=%x "
> "slot_device_offset=%x\n", slot->bus, slot->device,
> slot->hp_slot, slot->number, ctrl->slot_device_offset);
> -duplicate_name:
> retval = pci_hp_register(slot->hotplug_slot,
> ctrl->pci_dev->subordinate, slot->device,
> hotplug_slot->name);
> if (retval) {
> - /*
> - * If slot N already exists, we'll try to create
> - * slot N-1, N-2 ... N-M, until we overflow.
> - */
> - if (retval == -EEXIST) {
> - len = snprintf(slot->name, SLOT_NAME_SIZE,
> - "%d-%d", slot->number, dup++);
> - if (len < SLOT_NAME_SIZE)
> - goto duplicate_name;
> - else
> - err("duplicate slot name overflow\n");
> - }
> err("pci_hp_register failed with error %d\n", retval);
> goto error_info;
> }
> --- a/drivers/pci/slot.c
> +++ b/drivers/pci/slot.c
> @@ -73,6 +73,77 @@ static struct kobj_type pci_slot_ktype =
> .default_attrs = pci_slot_default_attrs,
> };
>
> +static char *make_slot_name(const char *name)
> +{
> + char *new_name;
> + int len, max, dup;
> +
> + new_name = kstrdup(name, GFP_KERNEL);
> + if (!new_name)
> + return NULL;
> +
> + /*
> + * Make sure we hit the realloc case the first time through the
> + * loop. 'len' will be strlen(name) + 3 at that point which is
> + * enough space for "name-X" and the trailing NUL.
> + */
> + len = strlen(name) + 2;
> + max = 1;
> + dup = 1;
> +
> + for (;;) {
> + struct kobject *dup_slot;
> + dup_slot = kset_find_obj(pci_slots_kset, new_name);
> + if (!dup_slot)
> + break;
> + kobject_put(dup_slot);
> + if (dup == max) {
> + len++;
> + max *= 10;
> + kfree(new_name);
> + new_name = kmalloc(len, GFP_KERNEL);
> + if (!new_name)
> + break;
> + }
> + sprintf(new_name, "%s-%d", name, dup++);
> + }
> +
> + return new_name;
> +}
> +
> +static int rename_slot(struct pci_slot *slot, const char *name)
> +{
> + int result = 0;
> + char *slot_name;
> +
> + if (strcmp(kobject_name(&slot->kobj), name) == 0)
> + return result;
> +
> + slot_name = make_slot_name(name);
> + if (!slot_name)
> + return -ENOMEM;
> +
> + result = kobject_rename(&slot->kobj, slot_name);
> + kfree(slot_name);
> +
> + return result;
> +}
> +
> +static struct pci_slot *get_slot(struct pci_bus *parent, int slot_nr)
> +{
> + struct pci_slot *slot;
> + /*
> + * We already hold pci_bus_sem so don't worry
> + */
> + list_for_each_entry(slot, &parent->slots, list)
> + if (slot->number == slot_nr) {
> + kobject_get(&slot->kobj);
> + return slot;
> + }
> +
> + return NULL;
> +}
> +
> /**
> * pci_create_slot - create or increment refcount for physical PCI slot
> * @parent: struct pci_bus of parent bridge
> @@ -85,7 +156,17 @@ static struct kobj_type pci_slot_ktype =
> * either return a new &struct pci_slot to the caller, or if the pci_slot
> * already exists, its refcount will be incremented.
> *
> - * Slots are uniquely identified by a @pci_bus, @slot_nr, @name tuple.
> + * Slots are uniquely identified by a @pci_bus, @slot_nr tuple.
> + *
> + * There are known platforms with broken firmware that assign the same
> + * name to multiple slots. Workaround these broken platforms by renaming
> + * the slots on behalf of the caller. If firmware assigns name N to
> + * multiple slots:
> + *
> + * The first slot is assigned N
> + * The second slot is assigned N-1
> + * The third slot is assigned N-2
> + * etc.
> *
> * Placeholder slots:
> * In most cases, @pci_bus, @slot_nr will be sufficient to uniquely identify
> @@ -94,12 +175,8 @@ static struct kobj_type pci_slot_ktype =
> * the slot. In this scenario, the caller may pass -1 for @slot_nr.
> *
> * The following semantics are imposed when the caller passes @slot_nr ==
> - * -1. First, the check for existing %struct pci_slot is skipped, as the
> - * caller may know about several unpopulated slots on a given %struct
> - * pci_bus, and each slot would have a @slot_nr of -1. Uniqueness for
> - * these slots is then determined by the @name parameter. We expect
> - * kobject_init_and_add() to warn us if the caller attempts to create
> - * multiple slots with the same name. The other change in semantics is
> + * -1. First, we no longer check for an existing %struct pci_slot, as there
> + * may be many slots with @slot_nr of -1. The other change in semantics is
> * user-visible, which is the 'address' parameter presented in sysfs will
> * consist solely of a dddd:bb tuple, where dddd is the PCI domain of the
> * %struct pci_bus and bb is the bus number. In other words, the devfn of
> @@ -111,44 +188,53 @@ struct pci_slot *pci_create_slot(struct
> struct hotplug_slot *hotplug)
> {
> struct pci_slot *slot;
> - int err;
> + int err = 0;
> + char *slot_name = NULL;
>
> down_write(&pci_bus_sem);
>
> if (slot_nr == -1)
> goto placeholder;
>
> - /* If we've already created this slot, bump refcount and return. */
> - list_for_each_entry(slot, &parent->slots, list) {
> - if (slot->number == slot_nr) {
> - kobject_get(&slot->kobj);
> - pr_debug("%s: inc refcount to %d on %04x:%02x:%02x\n",
> - __func__,
> - atomic_read(&slot->kobj.kref.refcount),
> - pci_domain_nr(parent), parent->number,
> - slot_nr);
> - goto out;
> + /*
> + * Hotplug drivers are allowed to rename an existing slot,
> + * but only if not already claimed.
> + */
> + slot = get_slot(parent, slot_nr);
> + if (slot) {
> + if (hotplug) {
> + if ((err = slot->hotplug ? -EBUSY : 0)
> + || (err = rename_slot(slot, name))) {
> + kobject_put(&slot->kobj);
> + slot = NULL;
> + goto err;
> + }
> }
> + goto out;
> }
>
> placeholder:
> slot = kzalloc(sizeof(*slot), GFP_KERNEL);
> if (!slot) {
> - slot = ERR_PTR(-ENOMEM);
> - goto out;
> + err = -ENOMEM;
> + goto err;
> }
>
> slot->bus = parent;
> slot->number = slot_nr;
>
> slot->kobj.kset = pci_slots_kset;
> - err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL,
> - "%s", name);
> - if (err) {
> - printk(KERN_ERR "Unable to register kobject %s\n", name);
> + slot_name = make_slot_name(name);
> + if (!slot_name) {
> + err = -ENOMEM;
> goto err;
> }
>
> + err = kobject_init_and_add(&slot->kobj, &pci_slot_ktype, NULL,
> + "%s", slot_name);
> + if (err)
> + goto err;
> +
> INIT_LIST_HEAD(&slot->list);
> list_add(&slot->list, &parent->slots);
>
> @@ -156,10 +242,10 @@ placeholder:
> pr_debug("%s: created pci_slot on %04x:%02x:%02x\n",
> __func__, pci_domain_nr(parent), parent->number, slot_nr);
>
> - out:
> +out:
> up_write(&pci_bus_sem);
> return slot;
> - err:
> +err:
> kfree(slot);
> slot = ERR_PTR(err);
> goto out;
> @@ -205,7 +291,6 @@ EXPORT_SYMBOL_GPL(pci_update_slot_number
> * just call kobject_put on its kobj and let our release methods do the
> * rest.
> */
> -
> void pci_destroy_slot(struct pci_slot *slot)
> {
> pr_debug("%s: dec refcount to %d on %04x:%02x:%02x\n", __func__,
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
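
For readers following the naming scheme (N, then N-1, N-2, ...) and the ownership question behind the leak, here is a userspace sketch. It is not the kernel code: name_taken() is a hypothetical stand-in for the kset lookup, the buffer is sized with snprintf() on every retry rather than grown at powers of ten as in the patch, and the caller owns the returned string and must free it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kset lookup: linear search of a name table. */
static int name_taken(const char *const *taken, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(taken[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Return a malloc'd name that is not in 'taken': "N" if free, otherwise
 * "N-1", "N-2", ... Returns NULL on allocation failure. Caller frees.
 */
static char *make_unique_name(const char *const *taken, size_t n,
			      const char *name)
{
	size_t len = strlen(name) + 1;
	char *new_name = malloc(len);

	if (!new_name)
		return NULL;
	memcpy(new_name, name, len);

	for (int dup = 1; name_taken(taken, n, new_name); dup++) {
		/* Ask snprintf how much room "name-dup" needs, then grow. */
		int need = snprintf(NULL, 0, "%s-%d", name, dup);
		char *tmp = realloc(new_name, (size_t)need + 1);

		if (!tmp) {
			free(new_name);
			return NULL;
		}
		new_name = tmp;
		snprintf(new_name, (size_t)need + 1, "%s-%d", name, dup);
	}

	return new_name;
}
```

If "5" and "5-1" are already registered, asking for "5" yields "5-2". Every successful call hands back an allocation, so each path that obtains a name also has to release it.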

2008-12-05 00:34:38

by Rafael J. Wysocki

Subject: Re: [patch 000/104] 2.6.27-stable review

On Wednesday, 3 of December 2008, Rafael J. Wysocki wrote:
> Hi Greg,
>
> On Wednesday, 3 of December 2008, Greg KH wrote:
> > This is the start of the stable review cycle for the 2.6.27.8 release.
> > There are 104 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let us know. If anyone is a maintainer of the proper subsystem, and
> > wants to add a Signed-off-by: line to the patch, please respond with it.
> >
> > And yes, there are a lot of patches here, the big series are:
> > - cifs data corruption patches
> > - pci hotplug slot patches to fix the most common warning
> > showing up on kerneloops.org
> > - ext4 bugfixes
> >
> > These patches are sent out with a number of different people on the Cc:
> > line. If you wish to be a reviewer, please email [email protected] to
> > add your name to the list. If you want to be off the reviewer list,
> > also email us.
> >
> > Responses should be made by Friday, December 5, 20:00:00 UTC. Anything
> > received after that time might be too late.
>
> The following ACPI commits are also -stable material IMO:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=65df78473ffbf3bff5e2034df1638acc4f3ddd50

Argh, the following one is broken:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=558073dd56707864f09d563b64e7c37c021e89d2

so please scratch it and this one is already on your list:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b4d469228a92a00e412675817cedd60133de38a

The remaining two, i.e.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=65df78473ffbf3bff5e2034df1638acc4f3ddd50
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=40599072dca3ec7d4c9ff8271978be169f974638

should still be added into the -stable queue IMO.

Thanks,
Rafael

2008-12-05 12:59:19

by Pavel Machek

Subject: Re: [patch 011/104] fbdev: clean the penguins dirty feet


On Wed 2008-12-03 11:48:38, Greg KH wrote:
> 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
> ------------------
> From: Clemens Ladisch <[email protected]>
>
> commit cf7ee554f3a324e98181b0ea249d9d5be3a0acb8 upstream.
>
> When booting in a direct color mode, the penguin has dirty feet, i.e.,
> some pixels have the wrong color. This is caused by
> fb_set_logo_directpalette() which does not initialize the last 32 palette
> entries.

Heh, funny, but... is this really a bad enough bug to go to stable?

Pavel

> Signed-off-by: Clemens Ladisch <[email protected]>
> Acked-by: Geert Uytterhoeven <[email protected]>
> Cc: Krzysztof Helt <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> drivers/video/fbmem.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/drivers/video/fbmem.c
> +++ b/drivers/video/fbmem.c
> @@ -232,7 +232,7 @@ static void fb_set_logo_directpalette(st
> greenshift = info->var.green.offset;
> blueshift = info->var.blue.offset;
>
> - for (i = 32; i < logo->clutsize; i++)
> + for (i = 32; i < 32 + logo->clutsize; i++)
> palette[i] = i << redshift | i << greenshift | i << blueshift;
> }
>
>

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
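
(Editorial illustration, not part of the original mail.) The off-by-32 bound in the quoted hunk can be reproduced in user space. This is only a minimal sketch: fill_logo_palette(), LOGO_PALETTE_BASE and the `fixed` flag are hypothetical names made up for the example, and the 256-entry palette with a 224-color logo is an assumption, not something stated in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define LOGO_PALETTE_BASE 32	/* logo colors start at palette index 32 */

/*
 * Fill the logo palette entries the way the quoted loop does.
 * fixed == 0 uses the old upper bound (clutsize),
 * fixed != 0 uses the patched bound (32 + clutsize).
 * The caller must provide at least 32 + clutsize palette slots.
 * Returns the number of entries written.
 */
static size_t fill_logo_palette(uint32_t *palette, size_t clutsize,
				int redshift, int greenshift, int blueshift,
				int fixed)
{
	size_t end = fixed ? LOGO_PALETTE_BASE + clutsize : clutsize;
	size_t written = 0;
	size_t i;

	for (i = LOGO_PALETTE_BASE; i < end; i++) {
		palette[i] = (uint32_t)i << redshift |
			     (uint32_t)i << greenshift |
			     (uint32_t)i << blueshift;
		written++;
	}
	return written;
}
```

With a 224-color logo the old bound stops at index 223 and leaves indices 224..255 untouched, which matches the "last 32 palette entries" in the changelog.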

2008-12-05 13:07:12

by Pavel Machek

Subject: Re: [patch 062/104] ACPI: EC: count interrupts only if called from interrupt handler.

On Wed 2008-12-03 11:53:00, Greg KH wrote:
> 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
> ------------------
> From: Alexey Starikovskiy <[email protected]>
>
> commit 7b4d469228a92a00e412675817cedd60133de38a upstream.
>
> fix 2.6.28 EC interrupt storm regression
>

That changelog is pretty useless :-(.

> @@ -219,7 +219,8 @@ static void gpe_transaction(struct acpi_
> goto unlock;
> err:
> /* false interrupt, state didn't change */
> - ++ec->curr->irq_count;
> + if (in_interrupt())
> + ++ec->curr->irq_count;
> unlock:
> spin_unlock_irqrestore(&ec->curr_lock, flags);
> }

Is preempt_count() reliable with !CONFIG_PREEMPT, too?

Using in_interrupt() here is quite ugly... it is definitely worth a
comment, and perhaps gpe_transaction() should get an explicit
'am I in interrupt' parameter.

At least RT kernels plan on moving interrupt handlers to threads...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
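
(Editorial illustration, not part of the original mail.) The "explicit parameter" idea above could look roughly like the user-space sketch below. The struct and function are hypothetical stand-ins invented for the example; they are not the ACPI EC driver's types, and no locking is modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-transaction state in the driver. */
struct transaction {
	unsigned int irq_count;
};

/*
 * Count a false event only when the caller says it is running in the
 * interrupt handler, instead of asking in_interrupt() from the inside.
 */
static void note_false_event(struct transaction *t, bool from_irq)
{
	if (from_irq)
		++t->irq_count;
}
```

The interrupt handler would pass true and the polling path false, so the counter no longer depends on how the kernel happens to run its handlers.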

2008-12-05 18:28:25

by Greg KH

Subject: Re: [stable] [patch 031/104] PCI: prevent duplicate slot names

On Thu, Dec 04, 2008 at 03:10:39PM -0700, Alex Chiang wrote:
> Hi Greg,
>
> I found a memory leak that I introduced with the below patch.
>
> I sent the patch to Jesse a few days ago, but he hasn't pushed it
> upstream yet.
>
> http://article.gmane.org/gmane.linux.kernel.pci/2187/match=stop+leaking
>
> I did Cc: [email protected] on it, but I'm guessing that since
> it's not upstream yet, you guys never saw it.
>
> Anyhow, please pick it up for this round of .27-stable, else
> we'll get a memory leak (while trying to work-around the
> duplicate slot name issue).

As much as I hate it, I really need to see the patch in Linus's tree
before I can take it into -stable.

So Jesse, can you please push this to Linus as soon as possible?

Alex, thanks for letting me know about this.

greg k-h

2008-12-05 18:38:56

by Greg KH

Subject: Re: [stable] [patch 086/104] ext4: fix initialization of UNINIT bitmap blocks

On Wed, Dec 03, 2008 at 11:10:16PM -0500, Theodore Tso wrote:
> On Wed, Dec 03, 2008 at 11:56:18AM -0800, Greg KH wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
> Turns out this patch introduces a worse regression than it fixes. The
> bug that the patch fixes is that on-line resizes of filesystems with
> a 1k blocksize will usually fail. The regression is that when a
> filesystem with 1k blocksize is stressed, the filesystem can get
> corrupted. On balance, on-line resizing failing is less of a disaster
> than corrupting the filesystem when it's stressed. Fortunately, it's
> only an issue when the filesystem blocksize is less than the page
> size, which isn't the common case at least for the x86.
>
> There are patches queued up to address this, but they haven't hit
> mainline yet. Probably best to pull this from the stable tree for
> now.

Thanks for letting me know, I've now dropped it from this release.

greg k-h

2008-12-05 18:39:21

by Greg KH

Subject: Re: [stable] [patch 067/104] libata: improve phantom device detection

On Thu, Dec 04, 2008 at 01:20:15PM +0900, Tejun Heo wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >
> > ------------------
> > From: Tejun Heo <[email protected]>
> >
> > commit 6a6b97d360702b98c02c7fca4c4e088dcf3a2985 upstream.
> >
> > Currently libata uses four methods to detect device presence.
> >
> > 1. PHY status if available.
> > 2. TF register R/W test (only promotes presence, never demotes)
> > 3. device signature after reset
> > 4. IDENTIFY failure detection in SFF state machine
> >
> > Combination of the above works well in most cases but recently there
> > have been a few reports where a phantom device causes unnecessary
> > delay during probe. In both cases, PHY status wasn't available. In
> > one case, it passed #2 and #3 and failed IDENTIFY with ATA_ERR which
> > didn't qualify as #4. The other failed #2 but as it passed #3 and #4,
> > it still caused failure.
> >
> > In both cases, phantom device reported diagnostic failure, so these
> > cases can be safely worked around by considering any !ATA_DRQ IDENTIFY
> > failure as NODEV_HINT if diagnostic failure is set.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Signed-off-by: Jeff Garzik <[email protected]>
> > Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> Alan thinks this patch could cause a regression. Given that we're
> nearing the end of 2.6.28-rc cycles, I don't think it's critical to
> include this into 2.6.27-stable or at least it can wait a bit more.

Ok, I'll transfer it over to the next 2.6.27-stable release.

thanks,

greg k-h

Subject: Re: [patch 011/104] fbdev: clean the penguins dirty feet

(Cc: trimmed)

On Fri, 05 Dec 2008, Pavel Machek wrote:
> > commit cf7ee554f3a324e98181b0ea249d9d5be3a0acb8 upstream.
> >
> > When booting in a direct color mode, the penguin has dirty feet, i.e.,
> > some pixels have the wrong color. This is caused by
> > fb_set_logo_directpalette() which does not initialize the last 32 palette
> > entries.
>
> Heh, funny, but... is this really a bad enough bug to go to stable?

Well, it is a public health problem, since dirty feet in Penguins can spread
numerous diseases.

Imagine the *horror* if your box got the dreaded dancing-penguin disease
from those nasty dirty feet, and started trying to sing and dance... while
still booting! It would be quite out of whack, and be horribly off-tune!
And if it happens on a PeeCee, it would be trying to sing its lungs off
through the DREADED PeeCee Squeaker...

That just Cannot Be Allowed To Happen(TM).

> > - for (i = 32; i < logo->clutsize; i++)
> > + for (i = 32; i < 32 + logo->clutsize; i++)

It is a one-line change that can save us from untold horrors! I call that
a patch well deserving of being in a stable release ;-)

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2008-12-06 00:36:10

by Greg KH

Subject: Re: [patch 011/104] fbdev: clean the penguins dirty feet

On Fri, Dec 05, 2008 at 01:58:55PM +0100, Pavel Machek wrote:
>
> On Wed 2008-12-03 11:48:38, Greg KH wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >
> > ------------------
> > From: Clemens Ladisch <[email protected]>
> >
> > commit cf7ee554f3a324e98181b0ea249d9d5be3a0acb8 upstream.
> >
> > When booting in a direct color mode, the penguin has dirty feet, i.e.,
> > some pixels have the wrong color. This is caused by
> > fb_set_logo_directpalette() which does not initialize the last 32 palette
> > entries.
>
> Heh, funny, but... is this really a bad enough bug to go to stable?

It was asked to be included, so yes :)

thanks,

greg k-h

2008-12-06 00:36:28

by Greg KH

Subject: Re: [patch 062/104] ACPI: EC: count interrupts only if called from interrupt handler.

On Fri, Dec 05, 2008 at 02:06:49PM +0100, Pavel Machek wrote:
> On Wed 2008-12-03 11:53:00, Greg KH wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >
> > ------------------
> > From: Alexey Starikovskiy <[email protected]>
> >
> > commit 7b4d469228a92a00e412675817cedd60133de38a upstream.
> >
> > fix 2.6.28 EC interrupt storm regression
> >
>
> That changelog is pretty useless :-(.

That is the identical changelog from upstream.

thanks,

greg k-h

2008-12-06 02:49:18

by Andrew Morton

Subject: Re: [patch 011/104] fbdev: clean the penguins dirty feet

On Fri, 5 Dec 2008 13:58:55 +0100 Pavel Machek <[email protected]> wrote:

> On Wed 2008-12-03 11:48:38, Greg KH wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >
> > ------------------
> > From: Clemens Ladisch <[email protected]>
> >
> > commit cf7ee554f3a324e98181b0ea249d9d5be3a0acb8 upstream.
> >
> > When booting in a direct color mode, the penguin has dirty feet, i.e.,
> > some pixels have the wrong color. This is caused by
> > fb_set_logo_directpalette() which does not initialize the last 32 palette
> > entries.
>
> Heh, funny, but... is this really a bad enough bug to go to stable?

Borderline. But the patch was pretty simple.

Also, there's the question "is this the sort of bug which distro
customers are likely to report to the distros". I figure "yes", and I
figure that the distros would like us to fix it for them. Of course,
I might be wrong about one or both of those things ;)

2008-12-06 05:26:14

by Tejun Heo

Subject: Re: [stable] [patch 067/104] libata: improve phantom device detection

Greg KH wrote:
> Ok, I'll transfer it over to the next 2.6.27-stable release.

Thanks.

--
tejun

2008-12-09 18:16:55

by Greg KH

Subject: Re: [stable] [patch 000/104] 2.6.27-stable review

On Fri, Dec 05, 2008 at 01:33:10AM +0100, Rafael J. Wysocki wrote:
> On Wednesday, 3 of December 2008, Rafael J. Wysocki wrote:
> > Hi Greg,
> >
> > On Wednesday, 3 of December 2008, Greg KH wrote:
> > > This is the start of the stable review cycle for the 2.6.27.8 release.
> > > There are 104 patches in this series, all will be posted as a response
> > > to this one. If anyone has any issues with these being applied, please
> > > let us know. If anyone is a maintainer of the proper subsystem, and
> > > wants to add a Signed-off-by: line to the patch, please respond with it.
> > >
> > > And yes, there are a lot of patches here, the big series are:
> > > - cifs data corruption patches
> > > - pci hotplug slot patches to fix the most common warning
> > > showing up on kerneloops.org
> > > - ext4 bugfixes
> > >
> > > These patches are sent out with a number of different people on the Cc:
> > > line. If you wish to be a reviewer, please email [email protected] to
> > > add your name to the list. If you want to be off the reviewer list,
> > > also email us.
> > >
> > > Responses should be made by Friday, December 5, 20:00:00 UTC. Anything
> > > received after that time might be too late.
> >
> > The following ACPI commits are also -stable material IMO:
> >
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=65df78473ffbf3bff5e2034df1638acc4f3ddd50
>
> Argh, the following one is broken:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=558073dd56707864f09d563b64e7c37c021e89d2
>
> so please scratch it and this one is already on your list:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7b4d469228a92a00e412675817cedd60133de38a
>
> The remaining two, ie.
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=65df78473ffbf3bff5e2034df1638acc4f3ddd50
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=40599072dca3ec7d4c9ff8271978be169f974638

I've now queued these two up.

If there is anything else that you think I need to add for the next
-stable release for 2.6.27, please let me know.

thanks,

greg k-h

2008-12-09 21:41:22

by Dave Airlie

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
> 2.6.27-stable review patch. If anyone has any objections, please let us know.
>
Revert.

This caused problems in the F10 kernel with idr: the drm device alloc
went all weird. It might be a drm bug, but changing this code triggers
it, and so it isn't really "stable".

Dave.

> ------------------
> From: Manfred Spraul <[email protected]>
>
> commit 6ff2d39b91aec3dcae951afa982059e3dd9b49dc upstream.
>
> 2nd part of the fixes needed for
> http://bugzilla.kernel.org/show_bug.cgi?id=11796.
>
> When the idr tree is either grown or shrunk, then the update to the number
> of layers and the top pointer were not atomic. This race caused crashes.
>
> The attached patch fixes that by replicating the layers counter in each
> layer, thus idr_find doesn't need idp->layers anymore.
>
> Signed-off-by: Manfred Spraul <[email protected]>
> Cc: Clement Calmels <[email protected]>
> Cc: Nadia Derbey <[email protected]>
> Cc: Pierre Peiffer <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Greg Kroah-Hartman <[email protected]>
>
> ---
> include/linux/idr.h | 3 ++-
> lib/idr.c | 14 ++++++++++++--
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> --- a/include/linux/idr.h
> +++ b/include/linux/idr.h
> @@ -52,13 +52,14 @@ struct idr_layer {
> unsigned long bitmap; /* A zero bit means "space here" */
> struct idr_layer *ary[1<<IDR_BITS];
> int count; /* When zero, we can release it */
> + int layer; /* distance from leaf */
> struct rcu_head rcu_head;
> };
>
> struct idr {
> struct idr_layer *top;
> struct idr_layer *id_free;
> - int layers;
> + int layers; /* only valid without concurrent changes */
> int id_free_cnt;
> spinlock_t lock;
> };
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -185,6 +185,7 @@ static int sub_alloc(struct idr *idp, in
> new = get_from_free_list(idp);
> if (!new)
> return -1;
> + new->layer = l-1;
> rcu_assign_pointer(p->ary[m], new);
> p->count++;
> }
> @@ -210,6 +211,7 @@ build_up:
> if (unlikely(!p)) {
> if (!(p = get_from_free_list(idp)))
> return -1;
> + p->layer = 0;
> layers = 1;
> }
> /*
> @@ -237,6 +239,7 @@ build_up:
> }
> new->ary[0] = p;
> new->count = 1;
> + new->layer = layers-1;
> if (p->bitmap == IDR_FULL)
> __set_bit(0, &new->bitmap);
> p = new;
> @@ -493,17 +496,21 @@ void *idr_find(struct idr *idp, int id)
> int n;
> struct idr_layer *p;
>
> - n = idp->layers * IDR_BITS;
> p = rcu_dereference(idp->top);
> + if (!p)
> + return NULL;
> + n = (p->layer+1) * IDR_BITS;
>
> /* Mask off upper bits we don't use for the search. */
> id &= MAX_ID_MASK;
>
> if (id >= (1 << n))
> return NULL;
> + BUG_ON(n == 0);
>
> while (n > 0 && p) {
> n -= IDR_BITS;
> + BUG_ON(n != p->layer*IDR_BITS);
> p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> }
> return((void *)p);
> @@ -582,8 +589,11 @@ void *idr_replace(struct idr *idp, void
> int n;
> struct idr_layer *p, *old_p;
>
> - n = idp->layers * IDR_BITS;
> p = idp->top;
> + if (!p)
> + return ERR_PTR(-EINVAL);
> +
> + n = (p->layer+1) * IDR_BITS;
>
> id &= MAX_ID_MASK;
>
>

2008-12-09 22:49:04

by Jesse Barnes

Subject: Re: [stable] [patch 031/104] PCI: prevent duplicate slot names

On Friday, December 05, 2008 10:27 am Greg KH wrote:
> On Thu, Dec 04, 2008 at 03:10:39PM -0700, Alex Chiang wrote:
> > Hi Greg,
> >
> > I found a memory leak that I introduced with the below patch.
> >
> > I sent the patch to Jesse a few days ago, but he hasn't pushed it
> > upstream yet.
> >
> > http://article.gmane.org/gmane.linux.kernel.pci/2187/match=stop+leaking
> >
> > I did Cc: [email protected] on it, but I'm guessing that since
> > it's not upstream yet, you guys never saw it.
> >
> > Anyhow, please pick it up for this round of .27-stable, else
> > we'll get a memory leak (while trying to work-around the
> > duplicate slot name issue).
>
> As much as I hate it, I really need to see the patch in Linus's tree
> before I can take it into -stable.
>
> So Jesse, can you please push this to Linus as soon as possible?
>
> Alex, thanks for letting me know about this.

Just sent the pull request, sorry for the delay.

--
Jesse Barnes, Intel Open Source Technology Center

2008-12-09 22:52:19

by Linus Torvalds

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find



On Wed, 10 Dec 2008, Dave Airlie wrote:
>
> On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >
> Revert.
>
> This caused problems in the F10 kernel with idr, the drm device alloc
> went all weird,
> it might be a drm bug but changing this code triggers it and so it
> isn't really "stable"

Well, maybe it should be reverted in mainline too, then?

Linus

2008-12-10 00:43:20

by Dave Airlie

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

On Wed, Dec 10, 2008 at 8:47 AM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Wed, 10 Dec 2008, Dave Airlie wrote:
>>
>> On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
>> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
>> >
>> Revert.
>>
>> This caused problems in the F10 kernel with idr, the drm device alloc
>> went all weird,
>> it might be a drm bug but changing this code triggers it and so it
>> isn't really "stable"
>
> Well, maybe it should be reverted in mainline too, then?

It appears idr_replace is broken at least in stable with this patch.

I'm trying to track down where the problem is (idr_replace doesn't look like
idr_find in a lot of places and I wonder if this has ever been tested.)

Dave.

2008-12-10 01:46:25

by Dave Airlie

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

>>
>> On Wed, 10 Dec 2008, Dave Airlie wrote:
>>>
>>> On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
>>> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
>>> >
>>> Revert.
>>>
>>> This caused problems in the F10 kernel with idr, the drm device alloc
>>> went all weird,
>>> it might be a drm bug but changing this code triggers it and so it
>>> isn't really "stable"
>>
>> Well, maybe it should be reverted in mainline too, then?
>
> It appears idr_replace is broken at least in stable with this patch.
>
> I'm trying to track down where the problem is (idr_replace doesn't look like
> idr_find in a lot of places and I wonder if this has ever been tested.)
>
(cc-trimmed).

Okay, I'm not an idr expert and maybe what the drm is doing is illegal,
but it never caused a problem up to now.

The drm grabs an idr minor number using a NULL pointer to reserve the
number; it then uses idr_replace later to stick a pointer into the
reserved number. However, this seems to be what is broken. I'm not sure
if this is a legal use of idrs, but it has worked like that for a long
time now.

I can fix the drm to work around this and allocate my pointers before
I try to get a minor number, but I'd like to know if my usage is
illegal or just overlooked.

2008-12-10 02:03:31

by Andrew Morton

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

On Wed, 10 Dec 2008 11:46:13 +1000 "Dave Airlie" <[email protected]> wrote:

> >>
> >> On Wed, 10 Dec 2008, Dave Airlie wrote:
> >>>
> >>> On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
> >>> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
> >>> >
> >>> Revert.
> >>>
> >>> This caused problems in the F10 kernel with idr, the drm device alloc
> >>> went all weird,
> >>> it might be a drm bug but changing this code triggers it and so it
> >>> isn't really "stable"
> >>
> >> Well, maybe it should be reverted in mainline too, then?
> >
> > It appears idr_replace is broken at least in stable with this patch.
> >
> > I'm trying to track down where the problem is (idr_replace doesn't look like
> > idr_find in a lot of places and I wonder if this has ever been tested.)
> >
> (cc-trimmed).
>
> Okay, I'm not an idr expert and maybe what the drm is doing is illegal but
> it never caused a problem up to now.
>
> The drm grabs an idr minor number using a NULL pointer to reserve the
> number, it then uses idr_replace later
> to stick a pointer into the reserved number. However this seems to be
> what is broken, I'm not sure if this is a legal
> use of idrs but has worked like that for a long time now.
>
> I can fix the drm to workaround this, and allocate my pointers before
> I try to get a minor number, but I'd like to know
> if my usage is illegal or just overlooked.


<greps for a while>

I assume we're talking about drivers/gpu/drm/drm_stub.c:drm_minor_get_id()?

I don't immediately see anything in the idr code which special-cases a
NULL caller pointer?

2008-12-10 02:08:26

by Dave Airlie

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

On Wed, Dec 10, 2008 at 12:02 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 10 Dec 2008 11:46:13 +1000 "Dave Airlie" <[email protected]> wrote:
>
>> >>
>> >> On Wed, 10 Dec 2008, Dave Airlie wrote:
>> >>>
>> >>> On Thu, Dec 4, 2008 at 5:49 AM, Greg KH <[email protected]> wrote:
>> >>> > 2.6.27-stable review patch. If anyone has any objections, please let us know.
>> >>> >
>> >>> Revert.
>> >>>
>> >>> This caused problems in the F10 kernel with idr, the drm device alloc
>> >>> went all weird,
>> >>> it might be a drm bug but changing this code triggers it and so it
>> >>> isn't really "stable"
>> >>
>> >> Well, maybe it should be reverted in mainline too, then?
>> >
>> > It appears idr_replace is broken at least in stable with this patch.
>> >
>> > I'm trying to track down where the problem is (idr_replace doesn't look like
>> > idr_find in a lot of places and I wonder if this has ever been tested.)
>> >
>> (cc-trimmed).
>>
>> Okay, I'm not an idr expert and maybe what the drm is doing is illegal but
>> it never caused a problem up to now.
>>
>> The drm grabs an idr minor number using a NULL pointer to reserve the
>> number, it then uses idr_replace later
>> to stick a pointer into the reserved number. However this seems to be
>> what is broken, I'm not sure if this is a legal
>> use of idrs but has worked like that for a long time now.
>>
>> I can fix the drm to workaround this, and allocate my pointers before
>> I try to get a minor number, but I'd like to know
>> if my usage is illegal or just overlooked.
>
>
> <greps for a while>
>
> I assume we're talking about drivers/gpu/drm/drm_stub.c:drm_minor_get_id()?
>
> I don't immediately see anything in the idr code which special-cases a
> NULL caller pointer?
>

Actually, now that I'm starting to wrap my head around it, I think it
might be the fact that I call idr_get_new_above with 64, then later
with 0. I'm not sure the new code is dealing with that case so well.

We don't do that in the standard kernel tree yet, which explains why
nobody's noticed; however, the KMS changes introduce it, and we have
those in F10.

http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=blob;f=drivers/gpu/drm/drm_stub.c;h=5ca132afa4f2e128999e319e44e31ad156e6ab74;hb=drm-next

is the drm_stub.c from drm-next that will trigger the issue.

Again I'm not sure if this is a legal use of idrs.

Dave.

>

2008-12-10 02:34:43

by Andrew Morton

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

On Wed, 10 Dec 2008 12:08:13 +1000 "Dave Airlie" <[email protected]> wrote:

> >> if my usage is illegal or just overlooked.
> >
> >
> > <greps for a while>
> >
> > I assume we're talking about drivers/gpu/drm/drm_stub.c:drm_minor_get_id()?
> >
> > I don't immediately see anything in the idr code which special-cases a
> > NULL caller pointer?
> >
>
> Actually now that I'm starting to wrap my head around it I think it
> might be the fact that I call
> idr_get_new_above with 64, then later with 0. I'm not sure the new
> code is dealing with that case so
> well.
>
> We don't do that in the standard kernel tree yet, so it explains why
> nobody's noticed, however the KMS
> changes introduce it, and we have those in f10.
>
> http://git.kernel.org/?p=linux/kernel/git/airlied/drm-2.6.git;a=blob;f=drivers/gpu/drm/drm_stub.c;h=5ca132afa4f2e128999e319e44e31ad156e6ab74;hb=drm-next
>
> is the drm_stub.c from drm-next that will trigger the issue.
>
> Again I'm not sure if this is a legal use of idrs.
>

Well nobody really maintains or owns the idr code, so there's nobody we
can ask about design intent. Various people do hit-n-run attacks on it
when the need presents.

2008-12-10 17:39:43

by Manfred Spraul

Subject: Re: [patch 021/104] lib/idr.c: fix rcu related race with idr_find

Dave Airlie wrote:
> Actually now that I'm starting to wrap my head around it I think it
> might be the fact that I call
> idr_get_new_above with 64, then later with 0. I'm not sure the new
> code is dealing with that case so
> well.
>
Yes, that's it.
When idr_get_new_above(,,64,) is called, the idr code creates a tree
with 2 layers, without the entry 0 in layer 1.
This was special-cased [without comments], I missed it.

I've just sent you a patch, could you try it?
It passes self tests [including idr_get_new_above and idr_replace].

--
Manfred
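
(Editorial illustration, not part of the original mail.) The tree shape described above follows from the index arithmetic visible in the quoted idr_find() hunk. The sketch below is a user-space model only: IDR_BITS is assumed to be 5 (the value differs by word size in the real headers), and layers_needed() and slot_at() are hypothetical helpers, not functions from lib/idr.c.

```c
#define IDR_BITS 5			/* assumption: 32 slots per layer */
#define IDR_MASK ((1 << IDR_BITS) - 1)

/* Number of layers a tree needs before `id` (>= 0) is addressable. */
static int layers_needed(int id)
{
	int layers = 1;

	/* The first test keeps the shift below the width of int. */
	while (layers * IDR_BITS < 31 && id >= (1 << (layers * IDR_BITS)))
		layers++;
	return layers;
}

/* Slot used for `id` in a node at the given layer (0 == leaf). */
static int slot_at(int id, int layer)
{
	return (id >> (layer * IDR_BITS)) & IDR_MASK;
}
```

Under that assumption id 64 needs two layers and sits in slot 2 of the top (layer 1) node, while id 0 would go through slot 0 of the same node. A tree that was first built for id 64 therefore has no child in slot 0 yet, which is the case the later allocation of id 0 has to handle.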

2009-01-23 05:01:18

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 03 Dec 2008 11:48 -0800, "Greg KH" <[email protected]> wrote:
> The default value for "max_user_instances" is set to 128, that should be enough too.

Our fairly heavily loaded postfix backup mx (lots of spams rejected per day) hit this
limit running kernel 2.6.27.8. Any particular reason for it being as low as 128
by default?

This is a kvm virtual machine running on a reasonably beefy external box, but
with 2Gb RAM allocated to the mx instance because that's all kvm would let me
use last time I checked. We're using KVM so the local copy of the database is
a little further away from the "internet facing side" and so we can build each
machine with our standard FAI setup.

Regards,

Bron.
--
Bron Gondwana
[email protected]

2009-01-23 05:23:47

by Greg KH

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 03:51:01PM +1100, Bron Gondwana wrote:
> On Wed, 03 Dec 2008 11:48 -0800, "Greg KH" <[email protected]> wrote:
> > The default value for "max_user_instances" is set to 128, that should be enough too.
>
> Our fairly heavily loaded postfix backup mx (lots of spams rejected per day) hit this
> limit running kernel 2.6.27.8. Any particular reason for it being as low as 128
> by default?

Something had to be picked :)

> This is a kvm virtual machine running on a reasonably beefy external box, but
> with 2Gb RAM allocated to the mx instance because that's all kvm would let me
> use last time I checked. We're using KVM so the local copy of the database is
> a little further away from the "internet facing side" and so we can build each
> machine with our standard FAI setup.

I would suggest just changing this default value then, it's a simple
userspace configuration item, and for your boxes, it sounds like a
larger value would be more suitable.

thanks,

greg k-h
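
(Editorial illustration, not part of the original mail.) Raising the limit from user space amounts to writing a number into a sysctl file. This is a minimal sketch: write_limit() is a hypothetical helper, and the path /proc/sys/fs/epoll/max_user_instances is an assumption about where this series exposes the tunable; check Documentation/filesystems/proc.txt in the tree you run before relying on it.

```c
#include <stdio.h>

/*
 * Write a decimal value followed by a newline to a sysctl-style file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int write_limit(const char *path, unsigned long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A root-owned init script could call write_limit("/proc/sys/fs/epoll/max_user_instances", 4096) at boot; the same effect is usually achieved with an entry in the distribution's sysctl configuration.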

2009-01-23 09:47:57

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
> On Fri, Jan 23, 2009 at 03:51:01PM +1100, Bron Gondwana wrote:
> > On Wed, 03 Dec 2008 11:48 -0800, "Greg KH" <[email protected]> wrote:
> > > The default value for "max_user_instances" is set to 128, that should be enough too.
> >
> > Our fairly heavily loaded postfix backup mx (lots of spams rejected per day) hit this
> > limit running kernel 2.6.27.8. Any particular reason for it being as low as 128
> > by default?
>
> Something had to be picked :)

Fair enough :)

> > This is a kvm virtual machine running on a reasonably beefy external box, but
> > with 2Gb RAM allocated to the mx instance because that's all kvm would let me
> > use last time I checked. We're using KVM so the local copy of the database is
> > a little further away from the "internet facing side" and so we can build each
> > machine with our standard FAI setup.
>
> I would suggest just changing this default value then, it's a simple
> userspace configuration item, and for your boxes, it sounds like a
> larger value would be more suitable.

Yes - I've pushed it up to 4096 now. Should be plenty!

I guess Postfix is a bit of an odd case here. It runs lots of processes, yet
uses epoll within many of them as well - sort of a historical design in some ways,
but also to enforce maximum privilege separation with many of the daemons able to
be run under chroot with limited capabilities.

So I guess I have a few questions left:

1) is this value ever supposed to be hit in practice by non-malicious software?
If not, it appears 128 is too low.

2) if we're going to stick with 128, is there any way to query the kernel as to how
close to the limit it's getting? As an example, our system checks poll
/proc/sys/fs/file-max every 2 minutes, and warn us if it's getting "full".

I was paged a couple of nights ago because we had file-nr set at 300000, which
used to be plenty, but we had a drive failure in another machine, and moved all
our Cyrus masters off while the RAID rebuilt. Suddenly there were heaps more
processes. We had set the limit insanely high (page when only 5000 left), but
I managed to wake up and log in within about 4 minutes, and there were still 256
left when I shoved it up higher.

Obviously I've tuned it to be warned earlier now. But anyway - it's possible.
I can't see any easy way to be aware when, say, 110 epolls have been used by the
same user, so I can fix the limit before it starts throttling incoming connections!

3) do you want me to write up a patch to add an epoll-max or similar procfile that
can be queried for this value?

Bron ( the basic rule here is - if something has woken you up by failing, a test
goes into the automated systems so you get advance warning next time )
--
Bron Gondwana
[email protected]
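
(Editorial illustration, not part of the original mail.) The file-handle check described above can be sketched as a small parser for the contents of /proc/sys/fs/file-nr, which holds three numbers: allocated handles, unused handles, and the maximum. file_nr_headroom() is a hypothetical helper; treating "maximum minus allocated" as the headroom is an assumption about what the monitoring script compares. Nothing comparable for epoll instances is shown, because the thread says no such counter is exported yet.

```c
#include <stdio.h>

/*
 * Parse the text of /proc/sys/fs/file-nr ("allocated unused max") and
 * store how many more handles can be allocated before the maximum.
 * Returns 0 on success and -1 if the text does not have three numbers.
 */
static int file_nr_headroom(const char *text, unsigned long *remaining)
{
	unsigned long allocated, unused, max;

	if (sscanf(text, "%lu %lu %lu", &allocated, &unused, &max) != 3)
		return -1;
	*remaining = max > allocated ? max - allocated : 0;
	return 0;
}
```

A monitor would read the file every couple of minutes, feed the line to this function, and page someone once `remaining` drops below its threshold.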

2009-01-23 17:09:01

by Greg KH

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
> On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
> > > This is a kvm virtual machine running on a reasonably beefy external box, but
> > > with 2Gb RAM allocated to the mx instance because that's all kvm would let me
> > > use last time I checked. We're using KVM so the local copy of the database is
> > > a little further away from the "internet facing side" and so we can build each
> > > machine with our standard FAI setup.
> >
> > I would suggest just changing this default value then, it's a simple
> > userspace configuration item, and for your boxes, it sounds like a
> > larger value would be more suitable.
>
> Yes - I've pushed it up to 4096 now. Should be plenty!
>
> I guess Postfix is a bit of an odd case here. It runs lots of processes, yet
> uses epoll within many of them as well - sort of a historical design in some ways,
> but also to enforce maximum privilege separation with many of the daemons able to
> be run under chroot with limited capabilities.
>
> So I guess I have a few questions left:
>
> 1) is this value ever supposed to be hit in practice by non-malicious software?
> If not, it appears 128 is too low.

It does appear a bit low. What looks to you like a good value to use as
a default?

> 2) if we're going to stick with 128, is there any way to query the kernel as to how
> close to the limit it's getting? As an example, our system checks poll
> /proc/sys/fs/file-max every 2 minutes, and warn us if it's getting "full".

Good idea, we should report this somewhere for the very reasons you
suggest. Can you write up a patch to do this? If not, I'll see what I
can do.

thanks,

greg k-h

2009-01-23 17:24:36

by Bastien Roucariès

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 6:06 PM, Greg KH <[email protected]> wrote:
> On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
>> On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
>> > > This is a kvm virtual machine running on a reasonably beefy external box, but
>> > > with 2Gb RAM allocated to the mx instance because that's all kvm would let me
>> > > use last time I checked. We're using KVM so the local copy of the database is
>> > > a little further away from the "internet facing side" and so we can build each
>> > > machine with our standard FAI setup.
>> >
>> > I would suggest just changing this default value then, it's a simple
>> > userspace configuration item, and for your boxes, it sounds like a
>> > larger value would be more suitable.
>>
>> Yes - I've pushed it up to 4096 now. Should be plenty!
>>
>> I guess Postfix is a bit of an odd case here. It runs lots of processes, yet
>> uses epoll within many of them as well - sort of a historical design in some ways,
>> but also to enforce maximum privilege separation with many of the daemons able to
>> be run under chroot with limited capabilities.
>>
>> So I guess I have a few questions left:
>>
>> 1) is this value ever supposed to be hit in practice by non-malicious software?
>> If not, it appears 128 is too low.
>
> It does appear a bit low. What looks to you like a good value to use as
> a default?
>
>> 2) if we're going to stick with 128, is there any way to query the kernel as to how
>> close to the limit it's getting? As an example, our system checks poll
>> /proc/sys/fs/file-max every 2 minutes, and warn us if its getting "full".
>
> Good idea, we should report this somewhere for the very reasons you
> suggest. Can you write up a patch to do this? If not, I'll see what I
> can do.

Why not use a ulimit for this kind of stuff?

Regards

Bastien

2009-01-23 19:28:51

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, 23 Jan 2009, Bron Gondwana wrote:

> On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
> > On Fri, Jan 23, 2009 at 03:51:01PM +1100, Bron Gondwana wrote:
> > > On Wed, 03 Dec 2008 11:48 -0800, "Greg KH" <[email protected]> wrote:
> > > > The default value for "max_user_instances" is set to 128, that should be enough too.
> > >
> > > Our fairly heavily loaded postfix backup mx (lots of spams rejected per day) hit this
> > > limit running kernel 2.6.27.8. Any particular reason for it being as low as 128
> > > by default?
> >
> > Something had to be picked :)
>
> Fair enough :)
>
> > > This is a kvm virtual machine running on a reasonably beefy external box, but
> > > with 2Gb RAM allocated to the mx instance because that's all kvm would let me
> > > use last time I checked. We're using KVM so the local copy of the database is
> > > a little further away from the "internet facing side" and so we can build each
> > > machine with our standard FAI setup.
> >
> > I would suggest just changing this default value then, it's a simple
> > userspace configuration item, and for your boxes, it sounds like a
> > larger value would be more suitable.
>
> Yes - I've pushed it up to 4096 now. Should be plenty!
>
> I guess Postfix is a bit of an odd case here. It runs lots of processes, yet
> uses epoll within many of them as well - sort of a historical design in some ways,
> but also to enforce maximum privilege separation with many of the daemons able to
> be run under chroot with limited capabilities.
>
> So I guess I have a few questions left:
>
> 1) is this value ever supposed to be hit in practice by non-malicious software?
> If not, it appears 128 is too low.
>
> 2) if we're going to stick with 128, is there any way to query the kernel as to how
> close to the limit it's getting? As an example, our system checks poll
> /proc/sys/fs/file-max every 2 minutes, and warn us if its getting "full".

Why? If you know you have a loaded, non multi-user server, just bump the
value up and forget about it. A higher value is not going to cost you
anything in terms of resource allocation. Adding more /proc code to
monitor a silly value probably is.



- Davide

2009-01-23 19:36:42

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, 23 Jan 2009, Bastien ROUCARIES wrote:

> Why not use a ulimit for this kind of stuff?

ulimit would be great, but it requires userspace code changes for every
value we want to export. And looking at the amount of configuration we
have in /proc, it's clear ulimit exposure is not very practical.


- Davide

2009-01-24 03:50:33

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 09:06:31AM -0800, Greg KH wrote:
> On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
> > On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
> > >
> > > I would suggest just changing this default value then, it's a simple
> > > userspace configuration item, and for your boxes, it sounds like a
> > > larger value would be more suitable.

If everyone, or every distribution at least, has to change it then the
default is probably wrong. The error message in the postfix logs didn't
immediately point me at the issue, especially since I tried debugging on
one of our "production" mxes, only to discover that the epoll limit
didn't exist there. They're slightly behind in kernel versions.

> > I guess Postfix is a bit of an odd case here. It runs lots of
> > processes, yet uses epoll within many of them as well - sort of
> > a historical design in some ways, but also to enforce maximum
> > privilege separation with many of the daemons able to
> > be run under chroot with limited capabilities.
> >
> > So I guess I have a few questions left:
> >
> > 1) is this value ever supposed to be hit in practice by
> > non-malicious software? If not, it appears 128 is too low.
>
> It does appear a bit low. What looks to you like a good value to use as
> a default?

This thread suggests that it's not just postfix having the issue, and
offers 1024 as a saner default:

http://www.mail-archive.com/[email protected]/msg01618.html

There's also a Russian thread that pointed me at this patch in the first
place, and another place that suggested 1024 as well. Seems "the
cloud"[tm] is converging on 1024.

> > 2) if we're going to stick with 128, is there any way to query
> > the kernel as to how close to the limit it's getting? As an
> > example, our system checks poll /proc/sys/fs/file-max every
> > 2 minutes, and warn us if its getting "full".
>
> Good idea, we should report this somewhere for the very reasons you
> suggest. Can you write up a patch to do this? If not, I'll see what I
> can do.

I'll have a look at it. There are two main choices I think - either one
file with just the "max", or some data view that shows all the users'
counts. It looks like it will have to enumerate the user list anyway.
(I've been poking around in kernel/user.c. Looks like a
hlist_for_each_entry on uidhash_table will do the trick. I'm guessing
we only want to display the value for the current user_ns anyway. I
don't really understand the user namespacing stuff, since I've never
used it)

Most of all I'm interested in this because it's a good way to
actually get some viable statistics on what the default value
should be.

Bron ( still learning my way around the kernel - I've only written one
patch before, and it had a lot of babysitting from Linus! )

2009-01-24 08:36:35

by Vegard Nossum

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Sat, Jan 24, 2009 at 4:50 AM, Bron Gondwana <[email protected]> wrote:
> On Fri, Jan 23, 2009 at 09:06:31AM -0800, Greg KH wrote:
>> On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
>> > On Thu, 22 Jan 2009 21:16 -0800, "Greg KH" <[email protected]> wrote:
>> > >
>> > > I would suggest just changing this default value then, it's a simple
>> > > userspace configuration item, and for your boxes, it sounds like a
>> > > larger value would be more suitable.
>
> If everyone, or every distribution at least, has to change it then the
> default is probably wrong. The error message in the postfix logs didn't
> immediately point me at the issue, especially since I tried debugging on
> one of our "production" mxes, only to discover that the epoll limit
> didn't exist there. They're slightly behind in kernel versions.
>
>> > I guess Postfix is a bit of an odd case here. It runs lots of
>> > processes, yet uses epoll within many of them as well - sort of
>> > a historical design in some ways, but also to enforce maximum
>> > privilege separation with many of the daemons able to
>> > be run under chroot with limited capabilities.
>> >
>> > So I guess I have a few questions left:
>> >
>> > 1) is this value ever supposed to be hit in practice by
>> > non-malicious software? If not, it appears 128 is too low.
>>
>> It does appear a bit low. What looks to you like a good value to use as
>> a default?
>
> This thread suggests that it's not just postfix having the issue, and
> offers 1024 as a saner default:
>
> http://www.mail-archive.com/[email protected]/msg01618.html
>
> There's also a Russian thread that pointed me at this patch in the first
> place, and another place that suggested 1024 as well. Seems "the
> cloud"[tm] is converging on 1024.

With the default limit of 128 (max_user_instances) and 274274
(max_user_watches) on my machine, the maximum amount of memory
consumed by one user's epoll instances is barely noticeable (around
1.5M).

Raising the max_user_instances to 512 brings us up to a maximum memory
usage of 43M already. However, from here on, we are already getting
limited by the number of user watches.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2009-01-24 13:03:54

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 09:06:31AM -0800, Greg KH wrote:
> On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
> > 2) if we're going to stick with 128, is there any way to query the
> > kernel as to how close to the limit it's getting? [...]
>
> Good idea, we should report this somewhere for the very reasons you
> suggest. Can you write up a patch to do this? If not, I'll see what I
> can do.

The attached patches do this - the first bumps the default to 1024, and
the second adds /proc/sys/fs/epoll/limits which contains 4 values. The
first two are the maximum current value for each field, and the second
two are the values of max_user_instances and max_user_watches again,
similar to the file-max interface.

Any particular reason why the naming is so different? I would have used
"max" for the current maximum, but the name is already taken by the
limit keys!

By the way, I have approximately no experience with any of this, so
coding standards criticism or "stuff should go elsewhere" suggestions
would be very gratefully received. This is pretty much the first set
of code I managed that compiled, booted and gave me plausible values.

You can also find the attached in the brong-epoll branch on
http://github.com/brong/linux-2.6/ - I'm working against Linus'
latest.

Thanks,

Bron.


Attachments:
(No filename) (1.29 kB)
0001-epoll-increase-default-max_user_instances-to-1024.patch (889.00 B)
0002-epoll-add-proc-sys-fs-epoll-limits-interface.patch (5.20 kB)

2009-01-25 11:01:40

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Sun, Jan 25, 2009 at 12:03:34AM +1100, Bron Gondwana wrote:
> The attached patches do this - the first bumps the default to 1024, and
> the second adds /proc/sys/fs/epoll/limits which contains 4 values. The
> first two are the maximum current value for each field, and the second
> two are the values of max_user_instances and max_user_watches again,
> similar to the file-max interface.

And this third one (on top of the other two) adds the UIDs of the most
heavily using users to the "limits" file, to help you track them down.

Bron ( pretty sure there's tabdamage and crap in there, but I'd like
some feedback that I'm otherwise on the right track before I
polish these up and Signed-off-by: them )


Attachments:
(No filename) (723.00 B)
0003-epoll-also-show-owner-uids-in-epoll-limits-output.patch (3.56 kB)

2009-01-25 12:03:31

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Fri, Jan 23, 2009 at 09:06:31AM -0800, Greg KH wrote:
> On Fri, Jan 23, 2009 at 08:47:45PM +1100, Bron Gondwana wrote:
> > 1) is this value ever supposed to be hit in practice by non-malicious
> > software? If not, it appears 128 is too low.
>
> It does appear a bit low. What looks to you like a good value to use as
> a default?

I've upgraded one production mx to 2.6.28.2 plus my latest patch (the
rest are still running 2.6.27.6, which is prior to this limit)

Here's some figures with my latest patch after about 10 minutes running
to stabilise the startup figures:

kvm virtual mx:
0 39 0 230 4096 271872
production mx:
0 207 107 1811 4096 266555

The interesting figure in each case is the second one,
num_user_instances. Interesting that the UID is listed as 0 though,
that's root! 107 is postfix, which makes sense.

As you can see, the production mx would have started choking on epolls
almost immediately. These machines are about 5 years old now. 4GB of
memory, dual hyperthreading 32-bit Xeons. They're the least powerful
machines we still keep running!

Bron ( most of our costs are power and rack space, after all )

2009-01-25 12:20:51

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Sun, Jan 25, 2009 at 10:01:27PM +1100, Bron Gondwana wrote:
> On Sun, Jan 25, 2009 at 12:03:34AM +1100, Bron Gondwana wrote:
> > The attached patches do this - the first bumps the default to 1024, and
> > the second adds /proc/sys/fs/epoll/limits which contains 4 values. The
> > first two are the maximum current value for each field, and the second
> > two are the values of max_user_instances and max_user_watches again,
> > similar to the file-max interface.
>
> And this third one (on top of the other two) adds the UIDs of the most
> heavily using users to the "limits" file, to help you track them down.

Patch 4 - I'll stop now ;)

Allow '0' for unlimited for both limits.

I notice that root gets limited same as anyone else. Any opinion on
special-casing root and not limiting the number of epolls they can
create? There are plenty of other ways root can be nasty if it's so
inclined!

Bron.


Attachments:
(No filename) (909.00 B)
0004-epoll-allow-0-for-unlimited-on-epoll-limits.patch (1.31 kB)

2009-01-28 00:39:38

by Greg KH

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Sun, Jan 25, 2009 at 11:20:39PM +1100, Bron Gondwana wrote:
> On Sun, Jan 25, 2009 at 10:01:27PM +1100, Bron Gondwana wrote:
> > On Sun, Jan 25, 2009 at 12:03:34AM +1100, Bron Gondwana wrote:
> > > The attached patches do this - the first bumps the default to 1024, and
> > > the second adds /proc/sys/fs/epoll/limits which contains 4 values. The
> > > first two are the maximum current value for each field, and the second
> > > two are the values of max_user_instances and max_user_watches again,
> > > similar to the file-max interface.
> >
> > And this third one (on top of the other two) adds the UIDs of the most
> > heavily using users to the "limits" file, to help you track them down.
>
> Patch 4 - I'll stop now ;)

Heh.

Can you resubmit all 4 patches, and cc: the epoll author, Davide? He's
the one that needs to accept these changes.

thanks,

greg k-h

2009-01-28 03:38:35

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 04:35:19PM -0800, Greg KH wrote:
> Can you resubmit all 4 patches, and cc: the epoll author, Davide? He's
> the one that needs to accept these changes.

It's three now (two of them deserved to be merged) and re-ordered so the
first two are trivial and the complex bits are easily skipped if you
don't want them.

Just looking for Davide's email address. Found it :) I'll follow up
this message with the patches. I'm not going to CC everyone else again
- but I'll CC LKML so you can follow it there if you want.

Bron.

2009-01-28 03:46:30

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> On Tue, Jan 27, 2009 at 04:35:19PM -0800, Greg KH wrote:
> > Can you resubmit all 4 patches, and cc: the epoll author, Davide? He's
> > the one that needs to accept these changes.
>
> It's three now (two of them deserved to merged) and re-ordered so the
> first two are trivial and the complex bits are easily skipped if you
> don't want them.
>
> Just looking for Davide's email address. Found it :) I'll follow up
> this message with the patches. I'm not going to CC everyone else again
> - but I'll CC LKML so you can follow it there if you want.

I already gave you my opinion on such code. There is no need for it. If
your servers are loaded, in the same way you bump NFILES (and likely
even other default configs), you bump up max_user_instances:

$ echo NN > /proc/sys/fs/epoll/max_user_instances

It requires no extra crud in the kernel, and it works pretty darn good.



- Davide

2009-01-28 03:47:21

by Bron Gondwana

Subject: [PATCH 1/3] epoll: increase default max_user_instances to 1024

Both Postfix and Apache use an epoll instance per child, which
leads to significant scalability issues with max_user_instances
set so low. Bump the default to 1024 so medium sized sites are
not impacted.

Signed-off-by: Bron Gondwana <[email protected]>
---
Documentation/filesystems/proc.txt | 6 +++++-
fs/eventpoll.c | 2 +-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index bbebc3a..4677abf 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2237,9 +2237,13 @@ max_user_instances
------------------

This is the maximum number of epoll file descriptors that a single user can
-have open at a given time. The default value is 128, and should be enough
+have open at a given time. The default value is 1024, and should be enough
for normal users.

+If you are running a heavily loaded Postfix or Apache server, you may need
+to set this higher. Both these servers run an epoll instance per child
+process.
+
max_user_watches
----------------

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ba2f9ec..16eb817 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1366,7 +1366,7 @@ static int __init eventpoll_init(void)
struct sysinfo si;

si_meminfo(&si);
- max_user_instances = 128;
+ max_user_instances = 1024;
max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
EP_ITEM_COST;

--
1.5.6.3
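
For reference, the max_user_watches default that appears in the context
lines of the last hunk can be worked through in userspace. The sketch
below is a hypothetical helper, not kernel code: the function name and
its parameters are made up here, and the 160-byte item cost used in the
note that follows is an assumed round figure for a 64-bit build.

```c
#include <stdint.h>

/*
 * Userspace sketch of the default computed in eventpoll_init():
 *   (((totalram - totalhigh) / 32) << PAGE_SHIFT) / EP_ITEM_COST
 *
 * low_pages  - number of low-memory pages (totalram - totalhigh)
 * page_shift - log2 of the page size (12 for 4 KiB pages)
 * item_cost  - bytes charged per watch (an assumption supplied by the
 *              caller; the kernel derives it from its own structures)
 */
uint64_t default_max_user_watches(uint64_t low_pages,
                                  unsigned int page_shift,
                                  uint64_t item_cost)
{
    /* 1/32 of low memory, converted from pages to bytes, per watch */
    return ((low_pages / 32) << page_shift) / item_cost;
}
```

With 524288 low-memory pages (2 GiB at 4 KiB per page) and an assumed
cost of 160 bytes per watch, this gives 419430 watches.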

2009-01-28 03:47:35

by Bron Gondwana

Subject: [PATCH 2/3] epoll: allow 0 for "unlimited" on epoll limits

If you set 0 as the limit for max_user_watches or max_user_instances,
then treat them as unlimited.

Note - this doesn't disable the accounting, just the limit test.

Signed-off-by: Bron Gondwana <[email protected]>
---
Documentation/filesystems/proc.txt | 4 ++++
fs/eventpoll.c | 8 ++++----
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 4677abf..c4debd3 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2244,6 +2244,8 @@ If you are running a heavily loaded Postfix or Apache server, you may need
to set this higher. Both these servers run an epoll instance per child
process.

+Setting max_user_instances to '0' makes it unlimited.
+
max_user_watches
----------------

@@ -2256,6 +2258,8 @@ on a 64bit one.
The current default value for max_user_watches is the 1/32 of the available
low memory, divided for the "watch" cost in bytes.

+Setting max_user_watches to '0' makes it unlimited.
+

------------------------------------------------------------------------------

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 16eb817..c6d5c1d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -582,8 +582,8 @@ static int ep_alloc(struct eventpoll **pep)

user = get_current_user();
error = -EMFILE;
- if (unlikely(atomic_read(&user->epoll_devs) >=
- max_user_instances))
+ if (unlikely(max_user_instances &&
+ (max_user_instances < atomic_read(&user->epoll_devs))))
goto free_uid;
error = -ENOMEM;
ep = kzalloc(sizeof(*ep), GFP_KERNEL);
@@ -761,8 +761,8 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
struct epitem *epi;
struct ep_pqueue epq;

- if (unlikely(atomic_read(&ep->user->epoll_watches) >=
- max_user_watches))
+ if (unlikely(max_user_watches &&
+ (max_user_watches < atomic_read(&ep->user->epoll_watches))))
return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
return -ENOMEM;
--
1.5.6.3
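
One detail of the hunks above is worth checking at the boundary: the old
test refuses when the count is greater than or equal to the limit, while
the new test refuses only when the limit is strictly less than the
count. A minimal sketch of the two predicates, with hypothetical
function names and plain ints standing in for the atomic counters:

```c
#include <stdbool.h>

/* Test as it stood before this patch: refuse once count reaches limit. */
bool over_limit_old(int count, int limit)
{
    return count >= limit;
}

/*
 * Test as written in this patch: a limit of 0 means "unlimited",
 * otherwise refuse only once count already exceeds the limit.
 */
bool over_limit_new(int count, int limit)
{
    return limit && limit < count;
}
```

With a limit of 128 and 128 instances already open, the old predicate
refuses the next one and the new predicate allows it, so the rewritten
test appears to admit one more instance (or watch) than before. Using
<= in place of < would keep the old boundary while still treating 0 as
unlimited.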

2009-01-28 03:47:50

by Bron Gondwana

Subject: [PATCH 3/3] epoll: add /proc/sys/fs/epoll/limits interface

This is a 6 value vector containing the max_user_instances and
max_user_watches limits and, for each of them, the UID of the heaviest
user and that user's current count.

Signed-off-by: Bron Gondwana <[email protected]>
---
Documentation/filesystems/proc.txt | 22 +++++++++++++++++++
fs/eventpoll.c | 41 +++++++++++++++++++++++++----------
include/linux/eventpoll.h | 16 ++++++++++++++
kernel/user.c | 33 +++++++++++++++++++++++++++++
4 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index c4debd3..18a69b5 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2260,6 +2260,28 @@ low memory, divided for the "watch" cost in bytes.

Setting max_user_watches to '0' makes it unlimited.

+limits
+------
+
+The limits file contains information that can be used to judge the
+appropriateness of your max_user_instances and max_user_watches settings.
+
+This file is read-only, and contains 6 integer values:
+
+instances_uid - the UID of the user with the most instances
+num_user_instances - the number of epoll instances the above UID has
+max_user_instances - the configured maximum (for comparison)
+watches_uid - the UID of the user with the most watches
+num_user_watches - the number of epoll watches the above UID has
+max_user_watches - the configured maximum (for comparison)
+
+By comparing the "num" and "max" values you can see if you are getting
+close to the limit, and then use the UID field to see which user is
+responsible.
+
+(caveat: a daemon like Postfix might create the epoll watch before
+dropping privileges - in this case the watch will be charged to root)
+

------------------------------------------------------------------------------

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index c6d5c1d..dd4351b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -234,10 +234,7 @@ struct ep_pqueue {
/*
* Configuration options available inside /proc/sys/fs/epoll/
*/
-/* Maximum number of epoll devices, per user */
-static int max_user_instances __read_mostly;
-/* Maximum number of epoll watched descriptors, per user */
-static int max_user_watches __read_mostly;
+struct epoll_limits_struct epoll_limits;

/*
* This mutex is used to serialize ep_free() and eventpoll_release_file().
@@ -259,10 +256,20 @@ static struct kmem_cache *pwq_cache __read_mostly;

static int zero;

+static int epoll_counts(ctl_table *table, int write, struct file *filp,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	user_epoll_maximums(&epoll_limits.instances_uid,
+			    &epoll_limits.num_user_instances,
+			    &epoll_limits.watches_uid,
+			    &epoll_limits.num_user_watches);
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+
ctl_table epoll_table[] = {
{
.procname = "max_user_instances",
- .data = &max_user_instances,
+ .data = &epoll_limits.max_user_instances,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
@@ -270,12 +277,20 @@ ctl_table epoll_table[] = {
},
{
.procname = "max_user_watches",
- .data = &max_user_watches,
+ .data = &epoll_limits.max_user_watches,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
.extra1 = &zero,
},
+ {
+ .procname = "limits",
+ .data = &epoll_limits,
+ .maxlen = 6*sizeof(int),
+ .mode = 0444,
+ .proc_handler = &epoll_counts,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};
#endif /* CONFIG_SYSCTL */
@@ -582,8 +597,9 @@ static int ep_alloc(struct eventpoll **pep)

user = get_current_user();
error = -EMFILE;
- if (unlikely(max_user_instances &&
- (max_user_instances < atomic_read(&user->epoll_devs))))
+ if (unlikely(epoll_limits.max_user_instances &&
+ (epoll_limits.max_user_instances <
+ atomic_read(&user->epoll_devs))))
goto free_uid;
error = -ENOMEM;
ep = kzalloc(sizeof(*ep), GFP_KERNEL);
@@ -761,8 +777,9 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
struct epitem *epi;
struct ep_pqueue epq;

- if (unlikely(max_user_watches &&
- (max_user_watches < atomic_read(&ep->user->epoll_watches))))
+ if (unlikely(epoll_limits.max_user_watches &&
+ (epoll_limits.max_user_watches <
+ atomic_read(&ep->user->epoll_watches))))
return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
return -ENOMEM;
@@ -1366,8 +1383,8 @@ static int __init eventpoll_init(void)
struct sysinfo si;

si_meminfo(&si);
- max_user_instances = 1024;
- max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
+ epoll_limits.max_user_instances = 1024;
+ epoll_limits.max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
EP_ITEM_COST;

/* Initialize the structure used to perform safe poll wait head wake ups */
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f1e1d3c..65565ef 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -57,6 +57,18 @@ struct file;

#ifdef CONFIG_EPOLL

+/*
+ * Configuration options available inside /proc/sys/fs/epoll/
+ */
+struct epoll_limits_struct {
+	uid_t instances_uid;		/* read only */
+	int num_user_instances;		/* read only */
+	int max_user_instances;		/* tunable */
+	uid_t watches_uid;		/* read only */
+	int num_user_watches;		/* read only */
+	int max_user_watches;		/* tunable */
+};
+
/* Used to initialize the epoll bits inside the "struct file" */
static inline void eventpoll_init_file(struct file *file)
{
@@ -96,6 +108,10 @@ static inline void eventpoll_release(struct file *file)
eventpoll_release_file(file);
}

+extern struct epoll_limits_struct epoll_limits;
+extern void user_epoll_maximums(uid_t *user_devs, int *num_devs,
+ uid_t *user_watches, int *num_watches);
+
#else

static inline void eventpoll_init_file(struct file *file) {}
diff --git a/kernel/user.c b/kernel/user.c
index 477b666..08e2b4f 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -381,6 +381,39 @@ struct user_struct *find_user(uid_t uid)
return ret;
}

+#ifdef CONFIG_EPOLL
+void user_epoll_maximums(uid_t *user_devs, int *max_devs,
+			 uid_t *user_watches, int *max_watches)
+{
+	unsigned long flags;
+	struct user_struct *user;
+	struct hlist_node *h;
+	int n;
+
+	*max_devs = 0;
+	*user_devs = root_user.uid;
+	*max_watches = 0;
+	*user_watches = root_user.uid;
+
+	spin_lock_irqsave(&uidhash_lock, flags);
+
+	for(n = 0; n < UIDHASH_SZ; ++n) {
+		hlist_for_each_entry(user, h, init_user_ns.uidhash_table + n, uidhash_node) {
+			if (user->epoll_devs.counter > *max_devs) {
+				*max_devs = user->epoll_devs.counter;
+				*user_devs = user->uid;
+			}
+			if (user->epoll_watches.counter > *max_watches) {
+				*max_watches = user->epoll_watches.counter;
+				*user_watches = user->uid;
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&uidhash_lock, flags);
+}
+#endif /* CONFIG_EPOLL */
+
void free_uid(struct user_struct *up)
{
unsigned long flags;
--
1.5.6.3
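
A monitoring check of the kind discussed earlier in the thread could
consume the proposed file roughly as follows. This is a sketch only: it
assumes six whitespace-separated integers in the field order documented
by this patch, and the struct and function names are hypothetical. The
file itself exists only with this patch applied.

```c
#include <stdio.h>

/* Userspace view of the six values, in the order the patch documents. */
struct epoll_limits_view {
    unsigned int instances_uid;
    int num_user_instances;
    int max_user_instances;
    unsigned int watches_uid;
    int num_user_watches;
    int max_user_watches;
};

/* Parse one line of the limits file. Returns 0 on success, -1 otherwise. */
int parse_epoll_limits(const char *line, struct epoll_limits_view *v)
{
    int n = sscanf(line, "%u %d %d %u %d %d",
                   &v->instances_uid, &v->num_user_instances,
                   &v->max_user_instances, &v->watches_uid,
                   &v->num_user_watches, &v->max_user_watches);
    return n == 6 ? 0 : -1;
}

/*
 * Integer percentage of a limit currently in use. A limit of 0 is
 * treated as "unlimited" (as in patch 2/3) and reported as 0 percent.
 */
int percent_used(int num, int max)
{
    if (max <= 0)
        return 0;
    return (int)((long long)num * 100 / max);
}
```

For a sample line such as "0 207 4096 107 1811 266555" (illustrative
values), percent_used(207, 4096) is 5, so a check that warns at, say,
80 percent would stay quiet.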

2009-01-28 03:58:15

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 07:46:18PM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Bron Gondwana wrote:
>
> > On Tue, Jan 27, 2009 at 04:35:19PM -0800, Greg KH wrote:
> > > Can you resubmit all 4 patches, and cc: the epoll author, Davide? He's
> > > the one that needs to accept these changes.
> >
> > It's three now (two of them deserved to merged) and re-ordered so the
> > first two are trivial and the complex bits are easily skipped if you
> > don't want them.
> >
> > Just looking for Davide's email address. Found it :) I'll follow up
> > this message with the patches. I'm not going to CC everyone else again
> > - but I'll CC LKML so you can follow it there if you want.
>
> I already gave you my opinion on such code. There is no need for it. If
> your servers are loaded, in the same way you bump NFILES (and likely
> even other default configs), you bump up max_user_instances:

How can you tell if it's heavily loaded if you can't tell what the
current usage is? Just wait until you hit the limit?

> $ echo NN > /proc/sys/fs/epoll/max_user_instances
>
> It requires no extra crud in the kernel, and it works pretty darn good.

The current default of 128 is breaking pretty much every decent sized
postfix or apache server out there, where in the past there was no limit.
That's an awful lot of sysadmin time to track down why your server is
suddenly hitting limits that didn't use to exist across every
installed Linux machine out there.

Of course the distributions can put an override in their sysctl.conf,
but in that case why not have a higher default?

Bron ( besides, the first two patches certainly aren't cruft, they're
just different default behaviours. The third is cruft, but I
believe it's useful cruft in the same way file-nr is cruft )

2009-01-28 04:00:42

by Davide Libenzi

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> Both Postfix and Apache use an epoll instance per child, which
> leads to significant scalability issues with max_user_instances
> set so low. Bump the default to 1024 so medium sized sites are
> not impacted.

NACK. Epoll allocates globally about 100 to 160 bytes (32/64 bit) for each
file added to the interface:

  for i 1..1024
    for j 1..1024
      if i!=j
        add j -> i

That's (N^2 * {100, 160}) = 100MB to 160MB of pinned kernel memory,
exploitable by simple users with untouched NFILES.
This is the reason such limit was introduced in the first place. Again,
for the 10th time, if you have a loaded server with multiple processes
using epoll:

$ echo NN > /proc/sys/fs/epoll/max_user_instances

Note that NN does not consume any resource "per se", so if you feel
threatened by such limit, you can go wild with it.



- Davide
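
The arithmetic behind this estimate can be checked directly. The loop
adds every epoll fd to every other one, which is N*(N-1) watches rather
than exactly N^2. A small sketch, with a hypothetical function name and
the per-watch costs taken from the figures quoted in the message:

```c
#include <stdint.h>

/*
 * Bytes pinned when each of n epoll instances watches the other n - 1,
 * at bytes_per_watch bytes each (about 100 on 32-bit and 160 on 64-bit
 * according to the message above).
 */
uint64_t pinned_bytes(uint64_t n, uint64_t bytes_per_watch)
{
    if (n == 0)
        return 0;
    return n * (n - 1) * bytes_per_watch;
}
```

For N = 1024 this gives 104755200 bytes at 100 bytes per watch and
167608320 bytes at 160, i.e. roughly 100MB to 160MB as stated. For
N = 128 at 160 bytes it is 2600960 bytes.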

2009-01-28 04:07:43

by Ray Lee

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, Jan 27, 2009 at 8:00 PM, Davide Libenzi <[email protected]> wrote:
> On Wed, 28 Jan 2009, Bron Gondwana wrote:
>
>> Both Postfix and Apache use an epoll instance per child, which
>> leads to significant scalability issues with max_user_instances
>> set so low. Bump the default to 1024 so medium sized sites are
>> not impacted.
>
> NACK. Epoll allocates globally about 100 to 160 bytes (32/64 bit) for each
> file added to the interface:
>
> for i 1..1024
> for j 1..1024
> if i!=j
> add j -> i
>
> That's (N^2 * {100, 160}) = 100MB to 160MB of pinned kernel memory,
> exploitable by simple users with untouched NFILES.
> This is the reason such limit was introduced in the first place. Again,
> for the 10th time, if you have a loaded server with multiple processes
> using epoll:
>
> $ echo NN > /proc/sys/fs/epoll/max_user_instances
>
> Note that NN does not consume any resource "per se", so if you feel
> threatened by such limit, you can go wild with it.

It's really simple. A kernel upgrade in a -stable series point release
broke a rational user-space setup. If you don't want to adjust the
defaults, then the sane thing to do is to revert the commit that
caused the grief. Postfix is everywhere. Apache is everywhere.

Userspace is not broken here, and the whole idea of a -stable series
is that administrators can upgrade to them without having to worry
about things getting broken or making specific configuration changes
by point release.

2009-01-28 04:10:54

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> On Tue, Jan 27, 2009 at 07:46:18PM -0800, Davide Libenzi wrote:
> > On Wed, 28 Jan 2009, Bron Gondwana wrote:
> >
> > > On Tue, Jan 27, 2009 at 04:35:19PM -0800, Greg KH wrote:
> > > > Can you resubmit all 4 patches, and cc: the epoll author, Davide? He's
> > > > the one that needs to accept these changes.
> > >
> > > It's three now (two of them deserved to be merged) and re-ordered so the
> > > first two are trivial and the complex bits are easily skipped if you
> > > don't want them.
> > >
> > > Just looking for Davide's email address. Found it :) I'll follow up
> > > this message with the patches. I'm not going to CC everyone else again
> > > - but I'll CC LKML so you can follow it there if you want.
> >
> > I already gave you my opinion on such code. There is no need for it. If
> > your servers are loaded, in the same way you bump NFILES (and likely
> > even other default configs), you bump up max_user_instances:
>
> How can you tell if it's heavily loaded if you can't tell what the
> current usage is? Just wait until you hit the limit?

In my servers, I know if they are going to be loaded, and I bump NFILES
(and a few other things) to the correct place. Since many of those
limits do not actually pre-allocate any resource, I don't need to wait and
monitor the values, before taking proper action.
Sorry, the whole patch set is a big NACK for many reasons.
We'd have happily avoided those limits altogether, but 100-160MB of kernel
memory able to be pinned by unprivileged users is easily a DoS on multiuser
systems.



- Davide

2009-01-28 04:14:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, 27 Jan 2009, Ray Lee wrote:

> On Tue, Jan 27, 2009 at 8:00 PM, Davide Libenzi <[email protected]> wrote:
> > On Wed, 28 Jan 2009, Bron Gondwana wrote:
> >
> >> Both Postfix and Apache use an epoll instance per child, which
> >> leads to significant scalability issues with max_user_instances
> >> set so low. Bump the default to 1024 so medium sized sites are
> >> not impacted.
> >
> > NACK. Epoll allocates globally about 100 to 160 bytes (32/64 bit) for each
> > file added to the interface:
> >
> > for i 1..1024
> > for j 1..1024
> > if i!=j
> > add j -> i
> >
> > That's (N^2 * {100, 160}) = 100MB to 160MB of pinned kernel memory,
> > exploitable by simple users with untouched NFILES.
> > This is the reason such limit was introduced in the first place. Again,
> > for the 10th time, if you have a loaded server with multiple processes
> > using epoll:
> >
> > $ echo NN > /proc/sys/fs/epoll/max_user_instances
> >
> > Note that NN does not consume any resource "per se", so if you feel
> > threatened by such limit, you can go wild with it.
>
> It's really simple. A kernel upgrade in a -stable series point release
> broke a rational user-space setup. If you don't want to adjust the
> defaults, then the sane thing to do is to revert the commit that
> caused the grief. Postfix is everywhere. Apache is everywhere.
>
> Userspace is not broken here, and the whole idea of a -stable series
> is that administrators can upgrade to them without having to worry
> about things getting broken or making specific configuration changes
> by point release.

The reason Greg took it was that on multiuser systems, that's a DoS
EZ-PZ Lemon Squeezie.



- Davide

2009-01-28 04:39:28

by Bron Gondwana

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, 27 Jan 2009 20:00 -0800, "Davide Libenzi" <[email protected]> wrote:
> On Wed, 28 Jan 2009, Bron Gondwana wrote:
>
> > Both Postfix and Apache use an epoll instance per child, which
> > leads to significant scalability issues with max_user_instances
> > set so low. Bump the default to 1024 so medium sized sites are
> > not impacted.
>
> NACK. Epoll allocates globally about 100 to 160 bytes (32/64 bit) for each
> file added to the interface:
>
> for i 1..1024
> for j 1..1024
> if i!=j
> add j -> i
>
> That's (N^2 * {100, 160}) = 100MB to 160MB of pinned kernel memory,
> exploitable by simple users with untouched NFILES.

So if you are running a big multi user system and you don't trust
your users not to do shit like this, then you can tune the default
down. Easy-peasy.

> This is the reason such limit was introduced in the first place. Again,
> for the 10th time, if you have a loaded server with multiple processes
> using epoll:

Would you take a patch that doesn't apply the limit to root then? That
would avoid my postfix issue at least - not sure about Apache, it might
be forking from a user that's not root.

> $ echo NN > /proc/sys/fs/epoll/max_user_instances
>
> Note that NN does not consume any resource "per se", so if you feel
> threatened by such limit, you can go wild with it.

What about patch number 2, that allows you to set it to '0' if you feel
the need to go wild and not set an arbitrary limit that you might hit
later? Usages change over time.

What about patch number 3, that gives you a chance to actually see what
the usage is before your production service suddenly hits it.

That's EVERY linux Apache or Postfix machine out there, using the epoll
interface that's supposed to be scalable, via the API that has been
trumpeted as the way to make things scalable in Linuxland for a while
now. With a tuneable that got added later and set much lower than
actual production workloads that are in the wild.

Because sometimes systems grow over time more than you expect, and then
hit the limit, and things start going weird and you have no idea why.

Being able to query limits is important! Especially if they're lower
than real world daemons currently use. You don't appear to be allowing
either:

a) a default that's high enough not to cause _lots_ of sites problems
when they upgrade; or

b) a way to tell the system that you don't want these checks at all
(the == 0 patch); or

c) a way to know when you're getting close to the limit.

So you just expect sites to hit the limit, curse you roundly, and then
up their tuneables? With no way to know ahead of time that they're
approaching the limit?

You know, I didn't even know that Postfix created an epoll instance per
daemon until I found out about this the hard way by seeing epoll failures
in the log file. I certainly wasn't aware that this limit has snuck in
during a stable series.

Bron.
--
Bron Gondwana
[email protected]

2009-01-28 04:55:45

by Bron Gondwana

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, Jan 27, 2009 at 08:14:36PM -0800, Davide Libenzi wrote:
> On Tue, 27 Jan 2009, Ray Lee wrote:
> > Userspace is not broken here, and the whole idea of a -stable series
> > is that administrators can upgrade to them without having to worry
> > about things getting broken or making specific configuration changes
> > by point release.
>
> The reason Greg took it was that on multiuser systems, that's a DoS
> EZ-PZ Lemon Squeezie.

Ok - we're at an impasse here.

You know the code a whole lot better than me.

Is there anything you can think of that will allow us to block the DOS
without breaking every medium to heavily loaded postfix and apache site
out there?

Something that doesn't require the administrators of every single
machine in one or the other class to tune their configurations?

Brong ( we expect you to know how to tune epoll, we don't expect every
apache and postfix administrator to know to tune a brand new
setting that just appeared in the last point release - especially
since most of them probably have no idea how many epoll watches
their software creates as a single user, and have never needed
to think about it before)

2009-01-28 05:28:36

by Greg KH

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 08:10:41PM -0800, Davide Libenzi wrote:
> In my servers, I know if they are going to be loaded, and I bump NFILES
> (and a few other things) to the correct place. Since many of those
> limits do not actually pre-allocate any resource, I don't need to wait and
> monitor the values, before taking proper action.

But what about people who want to know what the current usages are, so
that they _can_ monitor things and adjust them on the fly if things are
about to go boom?

I see no reason why we can't leave the value where it is today, and add
the ability to both turn the limits off entirely, and also report our
current usage. That keeps the DOS from happening on "default" systems,
and lets admins have an idea if they need to bump up the values on their
systems as well.

I don't understand your objection to allowing the usage to be monitored.

confused,

greg k-h

2009-01-28 05:30:51

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> On Tue, Jan 27, 2009 at 08:14:36PM -0800, Davide Libenzi wrote:
> > On Tue, 27 Jan 2009, Ray Lee wrote:
> > > Userspace is not broken here, and the whole idea of a -stable series
> > > is that administrators can upgrade to them without having to worry
> > > about things getting broken or making specific configuration changes
> > > by point release.
> >
> > The reason Greg took it was that on multiuser systems, that's a DoS
> > EZ-PZ Lemon Squeezie.
>
> Ok - we're at an impasse here.
>
> You know the code a whole lot better than me.
>
> Is there anything you can think of that will allow us to block the DOS
> without breaking every medium to heavily loaded postfix and apache site
> out there?
>
> Something that doesn't require the administrators of every single
> machine in one or the other class to tune their configurations?

Making the initial value of max_instances dependent on the amount of
memory we can tolerate a user exploiting with the trick shown before.
Allowing up to 1% of low memory should roughly result in:

512MB -> ~225
1GB -> ~310
2GB -> ~440

We could assume that heavily loaded mail and web servers have enough
RAM to get a high-enough default max_instances.




- Davide


---
fs/eventpoll.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventpoll.c
===================================================================
--- linux-2.6.mod.orig/fs/eventpoll.c 2009-01-27 21:12:29.000000000 -0800
+++ linux-2.6.mod/fs/eventpoll.c 2009-01-27 21:19:06.000000000 -0800
@@ -1419,7 +1419,9 @@
 	struct sysinfo si;

 	si_meminfo(&si);
-	max_user_instances = 128;
+	max_user_instances =
+		int_sqrt((((si.totalram - si.totalhigh) / 100) << PAGE_SHIFT) /
+			 EP_ITEM_COST);
 	max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
 		EP_ITEM_COST;
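
To see what defaults the formula in the patch above would produce, here is
a userspace model in Python (a sketch, not the kernel code). It assumes
4 KiB pages (`PAGE_SHIFT = 12`), uses `math.isqrt` in place of the kernel's
`int_sqrt`, and takes `EP_ITEM_COST` as a parameter because its real value
depends on the architecture's structure sizes; the default of 100 bytes is
only the lower figure quoted earlier in the thread. `lowmem_pages` stands
in for `si.totalram - si.totalhigh`, and the function name is hypothetical.

```python
import math

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def default_max_user_instances(lowmem_pages: int, ep_item_cost: int = 100) -> int:
    """Model of int_sqrt(((lowmem_pages / 100) << PAGE_SHIFT) / EP_ITEM_COST)."""
    one_percent_bytes = (lowmem_pages // 100) << PAGE_SHIFT
    return math.isqrt(one_percent_bytes // ep_item_cost)


if __name__ == "__main__":
    for mib in (512, 1024, 2048):
        pages = (mib << 20) >> PAGE_SHIFT
        print(f"{mib:5d} MiB lowmem -> {default_max_user_instances(pages)}")
```

With the assumed 100-byte cost this yields 231, 327 and 463, the same
ballpark as the ~225, ~310 and ~440 quoted above; the exact figures depend
on the real EP_ITEM_COST and on how much of the RAM counts as low memory.
Because of the square root, quadrupling memory only about doubles the limit.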

2009-01-28 05:32:18

by Bron Gondwana

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, Jan 27, 2009 at 08:00:30PM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Bron Gondwana wrote:
>
> > Both Postfix and Apache use an epoll instance per child, which
> > leads to significant scalability issues with max_user_instances
> > set so low. Bump the default to 1024 so medium sized sites are
> > not impacted.
>
> NACK. Epoll allocates globally about 100 to 160 bytes (32/64 bit) for each
> file added to the interface:
>
> for i 1..1024
> for j 1..1024
> if i!=j
> add j -> i
>
> That's (N^2 * {100, 160}) = 100MB to 160MB of pinned kernel memory,

Woah - that's serious.

This:

instances_uid   0 (root)
num_instances   142
max_instances   4096
watches_uid     107 (postfix)
num_watches     1097
max_watches     266555

isn't serious. It's pretty sane. 142 processes with an epoll watcher,
and fewer than 10 fds per epoll. Unfortunately, it wouldn't work on an
unpatched and un-specially-configured stock kernel. That's steady-state
too, not a peak. I just grabbed it off a running MX:

[brong@mx1 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       4151652    3113128    1038524          0     130808    2014152
-/+ buffers/cache:     968168    3183484
Swap:      2047992      50364    1997628
[brong@mx1 ~]$ uptime
00:31:05 up 2 days, 18:03, 2 users, load average: 0.86, 1.23, 1.08

Hardly looking stressed right now.

If I'm reading it right, your concern is the massively recursive case,
where every single epoll gets added to every other epoll as a chained
file descriptor?

That's clearly not happening here - so it seems that maybe our "happy
medium" is actually in closer inspection of what's going on rather than
a blanket low N to keep N^2 down.

Bron.

2009-01-28 05:38:31

by Bron Gondwana

[permalink] [raw]
Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Tue, Jan 27, 2009 at 08:07:32PM -0800, Ray Lee wrote:
> On Tue, Jan 27, 2009 at 8:00 PM, Davide Libenzi <[email protected]> wrote:
> > $ echo NN > /proc/sys/fs/epoll/max_user_instances
> >
> > Note that NN does not consume any resource "per se", so if you feel
> > threatened by such limit, you can go wild with it.
>
> It's really simple. A kernel upgrade in a -stable series point release
> broke a rational user-space setup. If you don't want to adjust the
> defaults, then the sane thing to do is to revert the commit that
> caused the grief. Postfix is everywhere. Apache is everywhere.
>
> Userspace is not broken here, and the whole idea of a -stable series
> is that administrators can upgrade to them without having to worry
> about things getting broken or making specific configuration changes
> by point release.

Oh man - it's java too:

http://pero.blogs.aprilmayjune.org/

they also suggest 1024, independently of everyone else by the looks of
things.

"A day and several installation routines later we figured out that
the available epoll resources were not sufficient any more. Java JDK 1.6
uses epoll to implement non-blocking-IO. With kernel 2.6.27 resource
limits have been introduced and the default on openSuSE is 128 - way too
low."

How many more people's wasted days is it going to take to convince you
that it's broken as currently implemented?

Bron.

2009-01-28 05:38:46

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 09:26:30PM -0800, Greg KH wrote:
> On Tue, Jan 27, 2009 at 08:10:41PM -0800, Davide Libenzi wrote:
> > In my servers, I know if they are going to be loaded, and I bump NFILES
> > (and a few other things) to the correct place. Since many of those
> > limits do not actually pre-allocate any resource, I don't need to wait and
> > monitor the values, before taking proper action.
>
> But what about people who want to know what the current usages are, so
> that they _can_ monitor things and adjust them on the fly if things are
> about to go boom?
>
> I see no reason why we can't leave the value where it is today, and add
> the ability to both turn the limits off entirely, and also report our
> current usage. That keeps the DOS from happening on "default" systems,
> and lets admins have an idea if they need to bump up the values on their
> systems as well.
>
> I don't understand your objection to allowing the usage to be monitored.

Agreed. If sysadmins get trapped by the upgrade, the fix for a
hypothetical DoS is a 100%-certain DoS by itself. The general sense
that "if it's not broken, don't fix it" applies here as well. The
server's sysadmin should not be bothered by a security upgrade (anyway,
after a few minutes of havoc in prod, he will revert to previous version
without trying to understand any further). But the campus sysadmin having
trouble with local users already spends a lot of time tweaking limits.
Now we offer them a new limit they can tune, they'll happily use it.
Anyway, even at 128 they'll probably lower it down a lot. So basically
we're with a medium value which does not fit any usage.

Willy

2009-01-28 05:48:22

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Willy Tarreau wrote:

> On Tue, Jan 27, 2009 at 09:26:30PM -0800, Greg KH wrote:
> > On Tue, Jan 27, 2009 at 08:10:41PM -0800, Davide Libenzi wrote:
> > > In my servers, I know if they are going to be loaded, and I bump NFILES
> > > (and a few other things) to the correct place. Since many of those
> > > limits do not actually pre-allocate any resource, I don't need to wait and
> > > monitor the values, before taking proper action.
> >
> > But what about people who want to know what the current usages are, so
> > that they _can_ monitor things and adjust them on the fly if things are
> > about to go boom?
> >
> > I see no reason why we can't leave the value where it is today, and add
> > the ability to both turn the limits off entirely, and also report our
> > current usage. That keeps the DOS from happening on "default" systems,
> > and lets admins have an idea if they need to bump up the values on their
> > systems as well.
> >
> > I don't understand your objection to allowing the usage to be monitored.
>
> Agreed. If sysadmins get trapped by the upgrade, the fix for a
> hypothetical DoS is a 100%-certain DoS by itself. The general sense
> that "if it's not broken, don't fix it" applies here as well. The
> server's sysadmin should not be bothered by a security upgrade (anyway,
> after a few minutes of havoc in prod, he will revert to previous version
> without trying to understand any further). But the campus sysadmin having
> trouble with local users already spends a lot of time tweaking limits.
> Now we offer them a new limit they can tune, they'll happily use it.
> Anyway, even at 128 they'll probably lower it down a lot. So basically
> we're with a medium value which does not fit any usage.

You know, it's not me that decides what goes into certain trees or not ;)
I've been pinged about the problem, and a patch was sent with values that
seemed appropriate for typical epoll usages. Epoll is a multiplexing
interface, so the thought was that not too many instances were lingering
around. Probably the default max_instances should have been made lomem
dependent like max_user_watches in the first place, leading to higher
max_instances values, with respect of the potential DoS.



- Davide

2009-01-28 06:21:29

by Willy Tarreau

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 09:48:07PM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Willy Tarreau wrote:
>
> > On Tue, Jan 27, 2009 at 09:26:30PM -0800, Greg KH wrote:
> > > On Tue, Jan 27, 2009 at 08:10:41PM -0800, Davide Libenzi wrote:
> > > > In my servers, I know if they are going to be loaded, and I bump NFILES
> > > > (and a few other things) to the correct place. Since many of those
> > > > limits do not actually pre-allocate any resource, I don't need to wait and
> > > > monitor the values, before taking proper action.
> > >
> > > But what about people who want to know what the current usages are, so
> > > that they _can_ monitor things and adjust them on the fly if things are
> > > about to go boom?
> > >
> > > I see no reason why we can't leave the value where it is today, and add
> > > the ability to both turn the limits off entirely, and also report our
> > > current usage. That keeps the DOS from happening on "default" systems,
> > > and lets admins have an idea if they need to bump up the values on their
> > > systems as well.
> > >
> > > I don't understand your objection to allowing the usage to be monitored.
> >
> > Agreed. If sysadmins get trapped by the upgrade, the fix for a
> > hypothetical DoS is a 100%-certain DoS by itself. The general sense
> > that "if it's not broken, don't fix it" applies here as well. The
> > server's sysadmin should not be bothered by a security upgrade (anyway,
> > after a few minutes of havoc in prod, he will revert to previous version
> > without trying to understand any further). But the campus sysadmin having
> > trouble with local users already spends a lot of time tweaking limits.
> > Now we offer them a new limit they can tune, they'll happily use it.
> > Anyway, even at 128 they'll probably lower it down a lot. So basically
> > we're with a medium value which does not fit any usage.
>
> You know, it's not me that decides what goes into certain trees or not ;)
> I've been pinged about the problem, and a patch was sent with values that
> seemed appropriate for typical epoll usages. Epoll is a multiplexing
> interface, so the thought was that not too many instances were lingering
> around. Probably the default max_instances should have been made lomem
> dependent like max_user_watches in the first place, leading to higher
> max_instances values, with respect of the potential DoS.

Davide, I know it's not you who decides. I mean, one patch was proposed
with one arbitrary limit. I've seen it in advance too and I too thought
it would be more than enough. Now people are reporting breakage from
common applications which work in a funny way (I think that using epoll
to poll for one single FD in a multi-process architecture can be called
funny). But those people are not expected to understand the internals,
and most likely their application's behaviour might not be more precisely
described than "it broke since upgrade to 2.6.27.13".

I think we should accept the fact that the fix is causing problems
for people while it was not expected to do so. One of the solutions
would be to increase the arbitrary ratio from 1% to more than that,
but it will still break big setups. Another solution is to accept
that the patch provides a tunable that admins might act on to stop
local users' nasty activities if required, but leave the limit off
by default. And I think that's a saner approach, especially for a
stable series.

Regards,
Willy

2009-01-28 06:36:37

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Willy Tarreau wrote:

> Davide, I know it's not you who decides. I mean, one patch was proposed
> with one arbitrary limit. I've seen it in advance too and I too thought
> it would be more than enough. Now people are reporting breakage from
> common applications which work in a funny way (I think that using epoll
> to poll for one single FD in a multi-process architecture can be called
> funny). But those people are not expected to understand the internals,
> and most likely their application's behaviour might not be more precisely
> described than "it broke since upgrade to 2.6.27.13".
>
> I think we should accept the fact that the fix is causing problems
> for people while it was not expected to do so. One of the solutions
> would be to increase the arbitrary ratio from 1% to more than that,
> but it will still break big setups. Another solution is to accept
> that the patch provides a tunable that admins might act on to stop
> local users' nasty activities if required, but leave the limit off
> by default. And I think that's a saner approach, especially for a
> stable series.

Absolutely. There is no 100% fit solution here. Heck, if we want to remove
the tunable altogether I'm the happiest one, but the problem with the
pinnable memory is there.
We can decide to remove the caps in the default setup, and leave default
setups open to the DoS. I've no problem with that (and, as we know, I
don't decide policies).
Then sysadmins of multiuser systems will have to enforce the caps
themselves in order to limit the potential DoS. This is probably a good
strategy for -stable anyway.



- Davide

2009-01-28 06:38:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, 27 Jan 2009, Greg KH wrote:

> On Tue, Jan 27, 2009 at 08:10:41PM -0800, Davide Libenzi wrote:
> > In my servers, I know if they are going to be loaded, and I bump NFILES
> > (and a few other things) to the correct place. Since many of those
> > limits do not actually pre-allocate any resource, I don't need to wait and
> > monitor the values, before taking proper action.
>
> But what about people who want to know what the current usages are, so
> that they _can_ monitor things and adjust them on the fly if things are
> about to go boom?
>
> I see no reason why we can't leave the value where it is today, and add
> the ability to both turn the limits off entirely, and also report our
> current usage. That keeps the DOS from happening on "default" systems,
> and lets admins have an idea if they need to bump up the values on their
> systems as well.
>
> I don't understand your objection to allowing the usage to be monitored.

Do you really want to add that crud just to monitor a value that costs
absolutely zero (in terms of pre-allocated resources) to bump up?
It's not as if you want to keep the bound close to the current peak
because a higher value would waste pre-allocated resources. I could
understand monitoring it if raising the number to higher-than-needed
values wasted resources, so that you'd want to keep it as close as
possible to the peak. But this is not the case.
So today we have three groups of users:

- Users that have been hit by the limit
  * Those have probably bumped the value up to the wazzoo.

- Unaware users with machines that could hit the current limit
  * Monitoring or not, being unaware, they won't notice it until it hits.
    And since raising it costs zero, they'd likely prefer to bump it to the
    stars instead of monitoring and incrementing by small steps.
  * Applying a lomem-dependent max_instances default value, like the
    two-line patch I posted, is probably the best choice even for stable.

- Unaware users with low-load machines
  * Those won't even notice it.

The default value can be raised, bound to lomem sizes. I see no problems
there. Or, like Willy said, make (for -stable) the default unlimited, and
let sysadmins put the bounds in place if they feel the DoS can apply to them.




- Davide

2009-01-28 06:52:19

by Bron Gondwana

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits



On Tue, 27 Jan 2009 22:38 -0800, "Davide Libenzi" <[email protected]> wrote:
> So today we have three groups of users:
>
> - Users that have been hit by the limit
> * Those have probably bumped the value up to the wazzoo.

Yeah, pretty much. But we've bumped things up to the wazzoo before
only to discover that our usage crept up there (file-max of 300,000
being a case on one machine recently). It appears you can hit that
pretty easily when you change from smaller machines to 32GB of memory.

That's why the first time we hit file-max, we added a check into
our monitoring system so we get warned before we hit it. Any
fixed limit, I'd want one of these. Makes me sleep much better
(literally, the bloody things SMS me if checks start failing)
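
The kind of check described above can be sketched as follows for file-max.
This is only an illustration, not the monitoring code referred to in the
message: the function names and the 80% warning threshold are assumptions.
It relies on /proc/sys/fs/file-nr holding three numbers (allocated handles,
unused handles, and the maximum).

```python
from pathlib import Path


def parse_file_nr(text: str) -> tuple:
    """Split the three fields of file-nr: allocated, unused, maximum."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def usage_ratio(text: str) -> float:
    """Fraction of the file-max limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def check(path: str = "/proc/sys/fs/file-nr", warn_at: float = 0.8) -> bool:
    """Return True once usage reaches the (assumed) warning threshold."""
    return usage_ratio(Path(path).read_text()) >= warn_at


if __name__ == "__main__":
    print("WARNING" if check() else "OK")
```

A monitoring system would run `check()` periodically and alert on True. The
same pattern would apply to an epoll usage counter, if the kernel exported
one.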

> - Unaware users with machines having potential of hitting the current limit
>   * Those, monitor or not, being unaware, they won't notice it until it hits.
>     And since raising it costs zero, they'd likely prefer to bump it to the
>     stars instead of monitoring and incrementing by small steps.

True. After they spend a day and a half figuring out what's causing
them out-of-files errors. They swear a lot and do the wazzoo thing.

> * Applying a lomem-dependent max_instances default value like the two
> lines patch I posted, is probably the best choice even for stable.

Would probably still make me sad, since these are 32 bit machines. Given
that 150 or so seems to be the steady state on the mxes, I wouldn't want
to know what it gets up to under a spam run. Probably close to 1000, since
that's what we limit smtpds at.

[brong@mx1 ~]$ ps ax | grep smtpd | wc -l
122
[brong@mx1 ~]$ cat /proc/sys/fs/epoll/limits
0 159 107 1451 4096 266555

Yeah, near enough. 159 is the interesting value here (old version of limits
file, the fields are in a different order)

> - Unaware users with low-load machines
> * Those won't even notice it.

No, until their usage pattern changes and they become one of the middle bunch.

> The default value can be raised, bound to lomem sizes. I see no problems
> there. Or, like Willy said, make (for -stable) the default unlimited, and
> let sysadmins put the bounds in place if they feel the DoS can apply to them.

I'd be happy with a default of unlimited (my patch 2 plus a couple of zero
defaults would do it)

Bron ( and to think I was going to suggest a patch that would check the value
you wrote to max_user_* to ensure it wasn't less than the current
highest user so you didn't accidentally munt your box ;) )
--
Bron Gondwana
[email protected]

2009-01-28 06:57:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, 27 Jan 2009, Davide Libenzi wrote:

> Or, like Willy said, make (for -stable) the default unlimited, and
> let sysadmins to put the bounds if they feel the DoS can apply to them.

Whose patch follows ...


- Davide


---
fs/eventpoll.c | 17 ++++-------------
1 file changed, 4 insertions(+), 13 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===================================================================
--- linux-2.6.mod.orig/fs/eventpoll.c 2009-01-27 22:40:23.000000000 -0800
+++ linux-2.6.mod/fs/eventpoll.c 2009-01-27 22:52:41.000000000 -0800
@@ -220,9 +220,9 @@
  * Configuration options available inside /proc/sys/fs/epoll/
  */
 /* Maximum number of epoll devices, per user */
-static int max_user_instances __read_mostly;
+static int max_user_instances __read_mostly = INT_MAX;
 /* Maximum number of epoll watched descriptors, per user */
-static int max_user_watches __read_mostly;
+static int max_user_watches __read_mostly = INT_MAX;

 /*
  * This mutex is used to serialize ep_free() and eventpoll_release_file().
@@ -721,8 +721,7 @@

 	user = get_current_user();
 	error = -EMFILE;
-	if (unlikely(atomic_read(&user->epoll_devs) >=
-			max_user_instances))
+	if (atomic_read(&user->epoll_devs) >= max_user_instances)
 		goto free_uid;
 	error = -ENOMEM;
 	ep = kzalloc(sizeof(*ep), GFP_KERNEL);
@@ -897,8 +896,7 @@
 	struct epitem *epi;
 	struct ep_pqueue epq;

-	if (unlikely(atomic_read(&ep->user->epoll_watches) >=
-		     max_user_watches))
+	if (atomic_read(&ep->user->epoll_watches) >= max_user_watches)
 		return -ENOSPC;
 	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
 		return -ENOMEM;
@@ -1416,13 +1414,6 @@

 static int __init eventpoll_init(void)
 {
-	struct sysinfo si;
-
-	si_meminfo(&si);
-	max_user_instances = 128;
-	max_user_watches = (((si.totalram - si.totalhigh) / 32) << PAGE_SHIFT) /
-		EP_ITEM_COST;
-
 	/* Initialize the structure used to perform safe poll wait head wake ups */
 	ep_nested_calls_init(&poll_safewake_ncalls);

2009-01-28 07:01:22

by Willy Tarreau

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 10:36:25PM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Willy Tarreau wrote:
>
> > Davide, I know it's not you who decide. I mean, one patch was proposed
> > with one arbitrary limit. I've seen it in advance too and I too thought
> > it would be more than enough. Now people are reporting breakage from
> > common applications which work in a funny way (I think that using epoll
> > to poll for one single FD in a multi-process architecture can be called
> > funny). But those people are not expected to understand the internals,
> > and most likely their application's behaviour might not be more precisely
> > described than "it broke since upgrade to 2.6.27.13".
> >
> > I think we should accept the fact that the fix is causing problems
> > for people while it was not expected to do so. One of the solutions
> > would be to increase the arbitrary ratio from 1% to more than that,
> > but it will still break big setups. Another solution is to accept
> > that the patch provides a tunable that admins might act on to stop
> > local users' nasty activities if required, but leave the limit off
> > by default. And I think that's a saner approach, especially for a
> > stable series.
>
> Absolutely. There is no 100% fit solution here. Heck, if we want to remove
> the tunable altogether I'm the happiest one, but the problem with the
> pinneable memory is there.

we shouldn't remove the tunable IMHO.

> We can decide to remove the caps in the default setup, and leave default
> setups open to the DoS. I've no problem with that (and, as we know, I
> don't decide policies).
> Then sysadmins of multiuser systems will have to enforce the caps
> themselves in order to limit the potential DoS. This is probably a good
> strategy for -stable anyway.

Yes, this is what I'd like to see in -stable too. I'm currently contacting
a few people I suggested to upgrade to 2.6.27.13 to warn them about the
issue.

Regards,
Willy

2009-01-28 07:34:26

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> On Tue, 27 Jan 2009 22:38 -0800, "Davide Libenzi" <[email protected]> wrote:
> > So today we have three groups of users:
> >
> > - Users that have been hit by the limit
> > * Those have probably bumped the value up to the wazzoo.
>
> Yeah, pretty much. But we've bumped things up to the wazzoo before
> only to discover that our usage crept up there (file-max of 300,000
> being a case on one machine recently. Appears you can hit that
> pretty easily when you change from smaller machines to 32GB memory).
>
> That's why the first time we hit file-max, we added a check into
> our monitoring system so we get warned before we hit it. Any
> fixed limit, I'd want one of these. Makes me sleep much better
> (literally, the bloody things SMS me if checks start failing)

Why are you wasting your time in tail-chasing a value? If your load is so
unpredictable that you can't find a proper upper bound (and it almost
never is), make it unlimited (or ridiculously high enough).
Warned, by which assumption? That the value rises just as much to hit the
warn, but not to pass the current limit? How about *fail*, if the burst is
high enough to hit your inexplicably constrained value?
All this in order to keep as-close-as-the-peak a value that costs no
resources in pre-allocation terms.




> > - Unaware users with machines having potential of hitting the current
> > limit
> > * Those, monitor or not, being unaware, they won't notice it until
> > it hits.
> > And since raising it costs zero, they'd likely prefer to bump it to
> > the stars instead of monitoring it incrementing by small steps.
>
> True. After they spend a day and a half figuring out what's causing
> them out-of-files errors. They swear a lot and do the wazzoo thing.

And, since they didn't know about the new limit, an even less known
"monitor" would have helped in ...?



- Davide

2009-01-28 09:25:14

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits



On Tue, 27 Jan 2009 22:57 -0800, "Davide Libenzi" <[email protected]> wrote:
> On Tue, 27 Jan 2009, Davide Libenzi wrote:
>
> > Or, like Willy said, make (for -stable) the default unlimited, and
> > let sysadmins set the bounds if they feel the DoS can apply to them.
>
> Whose patch follows ...

ACK.

Solves my problem and my "advocate of the poor suffering sysadmins who
have to track down why their stuff suddenly broke with a stable update
hat" problem as well.

One wondering...

> - if (unlikely(atomic_read(&user->epoll_devs) >=
> - max_user_instances))
> + if (atomic_read(&user->epoll_devs) >= max_user_instances)
> goto free_uid;

Any reason this has become _less_ unlikely()?

Thanks,

Bron.


--
Bron Gondwana
[email protected]

2009-01-28 10:16:57

by Alan

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

> It's really simple. A kernel upgrade in a -stable series point release
> broke a rational user-space setup. If you don't want to adjust the

You can just as equally load the description the other way:

"A kernel upgrade in a -stable series point release fixed a security DoS"

Which is not to say that a smarter limit isn't needed.

Alan

2009-01-28 10:45:41

by Bron Gondwana

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Tue, Jan 27, 2009 at 11:34:14PM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Bron Gondwana wrote:
>
> > On Tue, 27 Jan 2009 22:38 -0800, "Davide Libenzi" <[email protected]> wrote:
> > > So today we have three groups of users:
> > >
> > > - Users that have been hit by the limit
> > > * Those have probably bumped the value up to the wazzoo.
> >
> > Yeah, pretty much. But we've bumped things up to the wazzoo before
> > only to discover that our usage crept up there (file-max of 300,000
> > being a case on one machine recently. Appears you can hit that
> > pretty easily when you change from smaller machines to 32GB memory).
> >
> > That's why the first time we hit file-max, we added a check into
> > our monitoring system so we get warned before we hit it. Any
> > fixed limit, I'd want one of these. Makes me sleep much better
> > (literally, the bloody things SMS me if checks start failing)
>
> Why are you wasting your time in tail-chasing a value? If your load is so
> unpredictable that you can't find a proper upper bound (and it almost
> never is), make it unlimited (or ridiculously high enough).

I've been here nearly 5 years. Over that time our ridiculously high
enough values have been too small a couple of times, once when we moved
to two external drive units per imap server, and the second time when we
had a stack of 1TB drives attached to a machine with 32GB of RAM, and it
managed to handle so much more than previous machines.

Which is why we set it crazy higher than our previous limits, but we
also monitor. We want it sane enough that it catches totally
out-of-bound behaviour, but monitorable so when our hardware gets
progressively upgraded the previously ludicrous value isn't suddenly
just a little too low.

(the case recently was because a drive in another unit had failed, so I
pre-emptively shifted about 10 more masters to that machine in one
managed failover. Replicas use significantly fewer file descriptors
since all access is single threaded)

> Warned, by which assumption? That the value rises just as much to hit the
> warn, but not to pass the current limit? How about *fail*, if the burst is
> high enough to hit your inexplicably constrained value?
> All this in order to keep as-close-as-the-peak a value that costs no
> resources in pre-allocation terms.

It tends to grow slowly enough that with well spaced warn values we can
get email warnings well in advance to double check things, then we get
paged with a supposed 20 minute maximum response time.

I haven't ever seen a crazy fast peak, but I'm assuming that would most
likely be caused by actual misbehaving software rather than a slow change
in usage patterns.

> > True. After they spend a day and a half figuring out what's causing
> > them out-of-files errors. They swear a lot and do the wazzoo thing.
>
> And, since they didn't know about the new limit, an even less known
> "monitor" would have helped in ...?

Yeah, sure. I added that more for the same reason we monitor file-nr.
If I have a tunable knob that I have to tune, then I want to be able to
check my actual usage so I can tell how well it's tuned. Otherwise it's
a "stab-in-the-dark" knob.
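
The kind of check described here, comparing current usage against the tunable and warning before the limit is reached, could look roughly like the sketch below. It reads the three counters in `/proc/sys/fs/file-nr` (allocated, unused, maximum); the function names and the warn-threshold policy are hypothetical, not anything from the monitoring setup discussed in this thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line in the /proc/sys/fs/file-nr layout
 * ("allocated unused max") and return allocated handles as a
 * percentage of the maximum, or -1 if the line cannot be parsed.
 */
int file_nr_usage_percent(const char *line)
{
	unsigned long long allocated, unused, max;

	assert(line != NULL);
	if (sscanf(line, "%llu %llu %llu", &allocated, &unused, &max) != 3)
		return -1;
	if (max == 0)
		return -1;
	return (int)(allocated * 100 / max);
}

/* Read the live counters; returns -1 if the file is unavailable. */
int current_file_nr_usage_percent(void)
{
	char line[128];
	FILE *f = fopen("/proc/sys/fs/file-nr", "r");
	int pct = -1;

	if (f == NULL)
		return -1;
	if (fgets(line, sizeof(line), f) != NULL)
		pct = file_nr_usage_percent(line);
	fclose(f);
	return pct;
}

/* Hypothetical policy: warn once usage reaches warn_percent. */
int should_warn(int usage_percent, int warn_percent)
{
	return usage_percent >= 0 && usage_percent >= warn_percent;
}
```

A monitoring job would call `should_warn(current_file_nr_usage_percent(), 80)` periodically; the 80 is an arbitrary example threshold.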

Bron ( but based on this discussion, I'm going to go make the file-max
values crazy-higher while keeping the same warnings - no real
downside, and I see your point. I kind of inherited this setup,
and have stuck with it out of inertia as much as anything )

2009-01-28 10:59:21

by Bron Gondwana

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, Jan 28, 2009 at 10:16:41AM +0000, Alan Cox wrote:
> > It's really simple. A kernel upgrade in a -stable series point release
> > broke a rational user-space setup. If you don't want to adjust the
>
> You can just as equally load the description the other way:

Only if you're ignoring reality.

> "A kernel upgrade in a -stable series point release fixed a security DoS"

Alan, that's a complete load of bollocks. It broke common configurations
of java, postfix and apache on real-world machines, causing significant
actual denials of service in previously reliable configurations.

How about "A kernel upgrade in a -stable series replaced one potential
DoS with another DoS and provided a tunable knob to select which DoS you
would prefer, defaulting to the opposite of the previous behaviour"

> Which is not to say that a smarter limit isn't needed.

Yeah, I have an idea about that, but I need to see if it's actually
viable within the code. The DoS works by creating epoll descriptors
watching other epoll descriptors, which strikes me as a much less
real-world actual use pattern than a bunch of separate daemons with an
epoll watcher each.

If it's possible to count watches only if they're added to another epoll
instance, then we'd have a metric that still catches the N^2 attack, but
doesn't interact with the common non-attacky use-case.

I'd be much happier if we could remove the dichotomy of "allow the DoS
or live with a highly crippled epoll implementation until some of the
biggest daemons out there change their usage patterns" (thinking
particularly of java 1.6 and apache here. Largish postfix installations
are much rarer)

Bron.

2009-01-28 11:08:31

by Vegard Nossum

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, Jan 28, 2009 at 6:32 AM, Bron Gondwana <[email protected]> wrote:
> That's clearly not happening here - so it seems that maybe our "happy
> medium" is actually in closer inspection of what's going on rather than
> a blanket low N to keep N^2 down.

Mh, could another solution to this all be to limit the number of times
you can add a single epoll descriptor to another descriptor's set?

So you would still get the "upwards cascading" behaviour (i.e. A can
monitor B and C), but the "downwards cascading" would be prohibited
(i.e. B and C can't both monitor A).

I think this is a reasonable alternative, which would again allow a
number of epoll instances limited only by the number of open file
descriptors.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2009-01-28 11:36:42

by Alan

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

> > "A kernel upgrade in a -stable series point release fixed a security DoS"
>
> Alan, that's a complete load of bollocks. It broke common configurations
> of java, postfix and apache on real-world machines, causing significant
> actual denials of service in previously reliable configurations.

It fixed a security DoS. I was merely pointing out that the description
provided before was bogus, incomplete and loaded.

> viable within the code. The DoS works by creating epoll descriptors
> watching other epoll descriptors, which strikes me as a much less
> real-world actual use pattern than a bunch of separate daemons with an
> epoll watcher each.

Deliberate attackers don't have to follow typical usage patterns.

> If it's possible to count watches only if they're added to another epoll
> instance, then we'd have a metric that still catches the N^2 attack, but
> doesn't interact with the common non-attacky use-case.

Agreed entirely.

2009-01-28 13:28:42

by Bron Gondwana

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, Jan 28, 2009 at 11:36:40AM +0000, Alan Cox wrote:
> > > "A kernel upgrade in a -stable series point release fixed a security DoS"
> >
> > Alan, that's a complete load of bollocks. It broke common configurations
> > of java, postfix and apache on real-world machines, causing significant
> > actual denials of service in previously reliable configurations.
>
> It fixed a security DoS. I was merely pointing out that the description
> provided before was bogus, incomplete and loaded.

Not allowing user logins would have fixed that particular security DoS
too. There's a range of pretty destructive things that can "fix" one
issue at the expense of creating others.

This particular choice of fix just happens to have caused reported issues
in at least three commonly used applications (not all reported to LKML,
but I'll post the URLs for the other two again in a sec). These
applications were using the published API in a way which used to work
perfectly well, without DoSing the system.

How would you define a regression otherwise? A public and commonly used
API had a new user-space visible error code added that it had never
returned before, and a low enough limit set that this error was seen in
practice by multiple sites.

http://marc.info/?l=fedora-devel-list&m=123134150926934&w=2

http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/

> > viable within the code. The DoS works by creating epoll descriptors
> > watching other epoll descriptors, which strikes me as a much less
> > real-world actual use pattern than a bunch of separate daemons with an
> > epoll watcher each.
>
> Deliberate attackers don't have to follow typical usage patterns.

Sure, but if typical usage patterns hit your sensor, then you have false
positives. Adding a DoS sensor that gets false positives is a regression,
and whitewashing it as "fixed a security DoS" is bogus. It did that, but
also more than that, and the more was/is sucky.

> > If it's possible to count watches only if they're added to another epoll
> > instance, then we'd have a metric that still catches the N^2 attack, but
> > doesn't interact with the common non-attacky use-case.
>
> Agreed entirely.

Yeah, enough arguing hey. Let's come up with a real fix that doesn't
give sites the ugly choice between remaining vulnerable to a known DoS
attack or hobbling common programs that aren't actually using that many
resources.

Bron.

2009-01-28 16:53:16

by Davide Libenzi

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, 28 Jan 2009, Vegard Nossum wrote:

> On Wed, Jan 28, 2009 at 6:32 AM, Bron Gondwana <[email protected]> wrote:
> > That's clearly not happening here - so it seems that maybe our "happy
> > medium" is actually in closer inspection of what's going on rather than
> > a blanket low N to keep N^2 down.
>
> Mh, could another solution to this all be to limit the number of times
> you can add a single epoll descriptor to another descriptor's set?

In the example that was posted, a single fd was added a single time inside
the other 1000+ fds. Epoll already has detection for too long chains and
closed loops, but you can't put those in the fast path. And epoll_ctl() is
one of those.


- Davide

2009-01-28 16:56:42

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Bron Gondwana wrote:

> > - if (unlikely(atomic_read(&user->epoll_devs) >=
> > - max_user_instances))
> > + if (atomic_read(&user->epoll_devs) >= max_user_instances)
> > goto free_uid;
>
> Any reason this has become _less_ unlikely()?

GCC seems to make that choice anyway, according to quite a few tests I ran
yesterday.


- Davide

2009-01-28 21:00:15

by Bron Gondwana

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Wed, Jan 28, 2009 at 08:52:51AM -0800, Davide Libenzi wrote:
> On Wed, 28 Jan 2009, Vegard Nossum wrote:
>
> > On Wed, Jan 28, 2009 at 6:32 AM, Bron Gondwana <[email protected]> wrote:
> > > That's clearly not happening here - so it seems that maybe our "happy
> > > medium" is actually in closer inspection of what's going on rather than
> > > a blanket low N to keep N^2 down.
> >
> > Mh, could another solution to this all be to limit the number of times
> > you can add a single epoll descriptor to another descriptor's set?
>
> In the example that was posted, a single fd was added a single time inside
> the other 1000+ fds. Epoll already has detection for too long chains and
> closed loops, but you can't put those in the fast path. And epoll_ctl() is
> one of those.

Not even if you're adding an epoll watcher inside another epoll watcher?

The problem I have here is that "a single fd was added a single time
inside the other 1000+ fds" is different behaviour to the daemons out
there. They're pretty much all using flat layouts:

process 1:
  epoll_watcher:
    leaf fd
    leaf fd 2
    leaf fd 3
    leaf fd 4
    ...

process 2:
  epoll_watcher:
    ...

While the attack happens inside a single process.

Indeed, if you had a _per_process_ watcher limit, you would stop the
attack working while not breaking at least postfix and apache. I'm not
sure what Java's doing under the hood, I have a feeling it's more
thready.

But most of all a way of distinguishing between a leaf fd and an epoll
watcher fd in epoll_ctl and doing deeper tests if it's an epoll watcher
that's being added would stop the attack.

Bron.

2009-01-28 21:47:16

by Chris Adams

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

Once upon a time, Davide Libenzi <[email protected]> said:
>I already gave you my opinion on such code. There is no need for it. If
>your servers are loaded, in the same way you bump NFILES (and likely
>even other default configs), you bump up max_user_instances:

The flip side of that is this could just be added to the list of limits
you set on a multi-user system if you don't want $LUSER to DoS your
server (such as max procs, cpu time, virtual memory, etc.). I don't
think this is a security issue on single-user systems or servers with
only privileged access.

Admins of multi-user systems are used to having to manage limits (see
pam_limits for example). Admins of single-user or privileged servers
(e.g. mail or non-shared web servers) are not for the most part (postfix
doesn't open 1025 files in a single process).

--
Chris Adams <[email protected]>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

2009-01-29 00:31:06

by Davide Libenzi

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024

On Thu, 29 Jan 2009, Bron Gondwana wrote:

> On Wed, Jan 28, 2009 at 08:52:51AM -0800, Davide Libenzi wrote:
> > On Wed, 28 Jan 2009, Vegard Nossum wrote:
> >
> > > On Wed, Jan 28, 2009 at 6:32 AM, Bron Gondwana <[email protected]> wrote:
> > > > That's clearly not happening here - so it seems that maybe our "happy
> > > > medium" is actually in closer inspection of what's going on rather than
> > > > a blanket low N to keep N^2 down.
> > >
> > > Mh, could another solution to this all be to limit the number of times
> > > you can add a single epoll descriptor to another descriptor's set?
> >
> > In the example that was posted, a single fd was added a single time inside
> > the other 1000+ fds. Epoll already has detection for too long chains and
> > closed loops, but you can't put those in the fast path. And epoll_ctl() is
> > one of those.
>
> Not even if you're adding an epoll watcher inside another epoll watcher?

Adding an epoll fd inside another epoll fd is perfectly legal. It would
kinda suck if epoll itself wouldn't expose a pollable interface too.



> The problem I have here is that "a single fd was added a single time
> inside the other 1000+ fds" is different behaviour to the daemons out
> there. They're pretty much all using flat layouts:

Yes, that is not what programs normally do. Most of the time you have a
nesting level equal to zero, although we've seen recently that the
epoll-being-pollable feature (hence nesting) is used too. Say you have two
(or more) libraries, each monitoring different things, and each with its
own wait+dispatch loop. If these libraries didn't have a chance to expose
a pollable fd, you'd have to run their wait+dispatch loops in separate
threads. Whereas epoll being itself pollable allows you to:

	epoll_wait(lib1_fd, lib2_fd)
	if (ready(lib1_fd))
		lib1_dispatch()
	if (ready(lib2_fd))
		lib2_dispatch()

This is pretty powerful, although it needs care for wakeups and nested
poll calls.
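
The pseudocode above can be fleshed out into a small user-space sketch of one level of nesting. Everything named here is hypothetical: `struct lib` stands in for a library that owns a private epoll fd, a pipe plays the role of whatever the library monitors, and `app_poll_once` is one pass of the application's wait+dispatch loop. Only `pipe`, `read`, `close`, `epoll_create1`, `epoll_ctl` and `epoll_wait` are real interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical library state: a private epoll fd watching one pipe. */
struct lib {
	int epfd;       /* pollable handle exposed to the application */
	int pipe_rd;    /* event source the library monitors */
	int pipe_wr;    /* write end, used here to simulate an event */
	int dispatched; /* number of events handled so far */
};

int lib_init(struct lib *l)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int p[2];

	assert(l != NULL);
	l->dispatched = 0;
	if (pipe(p) < 0)
		return -1;
	l->pipe_rd = p[0];
	l->pipe_wr = p[1];
	l->epfd = epoll_create1(0);
	if (l->epfd < 0)
		return -1;
	ev.data.fd = l->pipe_rd;
	return epoll_ctl(l->epfd, EPOLL_CTL_ADD, l->pipe_rd, &ev);
}

/* The library's own dispatch: drain what its inner epoll set reports. */
void lib_dispatch(struct lib *l)
{
	struct epoll_event ev;
	char buf[64];

	while (epoll_wait(l->epfd, &ev, 1, 0) > 0) {
		if (read(ev.data.fd, buf, sizeof(buf)) <= 0)
			break;
		l->dispatched++;
	}
}

/*
 * One pass of the application loop: an outer epoll set watches both
 * library epoll fds and dispatches the ready ones. Returns the number
 * of ready libraries, or -1 on error.
 */
int app_poll_once(struct lib *a, struct lib *b, int timeout_ms)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event ready[2];
	int outer, n, i;

	outer = epoll_create1(0);
	if (outer < 0)
		return -1;
	ev.data.ptr = a;
	if (epoll_ctl(outer, EPOLL_CTL_ADD, a->epfd, &ev) < 0)
		goto fail;
	ev.data.ptr = b;
	if (epoll_ctl(outer, EPOLL_CTL_ADD, b->epfd, &ev) < 0)
		goto fail;
	n = epoll_wait(outer, ready, 2, timeout_ms);
	for (i = 0; i < n; i++)
		lib_dispatch(ready[i].data.ptr);
	close(outer);
	return n;
fail:
	close(outer);
	return -1;
}
```

A real application would keep the outer epoll fd open across iterations instead of recreating it on every pass; it is created inside the function here only to keep the sketch short.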



- Davide

2009-01-29 00:31:28

by Davide Libenzi

Subject: Re: [patch 016/104] epoll: introduce resource usage limits

On Wed, 28 Jan 2009, Chris Adams wrote:

> Once upon a time, Davide Libenzi <[email protected]> said:
> >I already gave you my opinion on such code. There is no need for it. If
> >your servers are loaded, in the same way you bump NFILES (and likely
> >even other default configs), you bump up max_user_instances:
>
> The flip side of that is this could just be added to the list of limits
> you set on a multi-user system if you don't want $LUSER to DoS your
> server (such as max procs, cpu time, virtual memory, etc.). I don't
> think this is a security issue on single-user systems or servers with
> only privileged access.
>
> Admins of multi-user systems are used to having to manage limits (see
> pam_limits for example). Admins of single-user or privileged servers
> (e.g. mail or non-shared web servers) are not for the most part (postfix
> doesn't open 1025 files in a single process).

It seems this is the most agreeable solution based on the replies in this
thread. That is, leave it unbounded, and offer limiting capabilities to
multiuser sysadmins.



- Davide

2009-01-29 00:34:36

by Bron Gondwana

Subject: Re: [PATCH 1/3] epoll: increase default max_user_instances to 1024



On Wed, 28 Jan 2009 15:51 -0800, "Davide Libenzi" <[email protected]> wrote:
> On Thu, 29 Jan 2009, Bron Gondwana wrote:
>
> > On Wed, Jan 28, 2009 at 08:52:51AM -0800, Davide Libenzi wrote:
> > > On Wed, 28 Jan 2009, Vegard Nossum wrote:
> > >
> > > > On Wed, Jan 28, 2009 at 6:32 AM, Bron Gondwana <[email protected]> wrote:
> > > > > That's clearly not happening here - so it seems that maybe our "happy
> > > > > medium" is actually in closer inspection of what's going on rather than
> > > > > a blanket low N to keep N^2 down.
> > > >
> > > > Mh, could another solution to this all be to limit the number of times
> > > > you can add a single epoll descriptor to another descriptor's set?
> > >
> > > In the example that was posted, a single fd was added a single time inside
> > > the other 1000+ fds. Epoll already has detection for too long chains and
> > > closed loops, but you can't put those in the fast path. And epoll_ctl() is
> > > one of those.
> >
> > Not even if you're adding an epoll watcher inside another epoll watcher?
>
> Adding an epoll fd inside another epoll fd is perfectly legal. It would
> kinda suck if epoll itself wouldn't expose a pollable interface too.

Yeah, I'm not suggesting killing it completely, just putting the limit
at that level rather than limiting watches completely.

The other thing I've looked at is just limiting watches per process to
rlim[RLIMIT_NOFILE]. Any reason why that wouldn't be enough? It means
you get limited to less than RLIMIT_NOFILE if you add the same file
descriptor to multiple epolls... but that's not too scary.
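
As a user-space model of that idea (not kernel code, and not a proposed patch), the check might look like the sketch below. The open-file limit is exposed to user space as `RLIMIT_NOFILE` via `getrlimit()`; `nofile_soft_limit` and `watch_allowed` are hypothetical helpers, and how the per-process watch count would be tracked is left open.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Soft limit on open files for the calling process.
 * Returns -1 on error or if the limit is RLIM_INFINITY.
 */
long nofile_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;
	return (long)rl.rlim_cur;
}

/*
 * Hypothetical per-process check: allow a new watch only while the
 * process holds fewer watches than its open-file limit. A negative
 * limit is treated as "no cap".
 */
int watch_allowed(long current_watches, long limit)
{
	assert(current_watches >= 0);
	if (limit < 0)
		return 1;
	return current_watches < limit;
}
```

Because every watch counts against the same budget, a descriptor added to several epoll sets consumes several slots, which is the "limited to less than" effect mentioned above.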

> > The problem I have here is that "a single fd was added a single time
> > inside the other 1000+ fds" is different behaviour to the daemons out
> > there. They're pretty much all using flat layouts:
>
> Yes, that is not what programs normally do. Most of the time you have a
> nesting level equal to zero, although we've seen recently that the
> epoll-being-pollable feature (hence nesting) is used too. Say you have two
> (or more) libraries, each monitoring different things, and each with its
> own wait+dispatch loop. If these libraries didn't have a chance to expose
> a pollable fd, you'd have to run their wait+dispatch loops in separate
> threads. Whereas epoll being itself pollable allows you to:
>
> 	epoll_wait(lib1_fd, lib2_fd)
> 	if (ready(lib1_fd))
> 		lib1_dispatch()
> 	if (ready(lib2_fd))
> 		lib2_dispatch()
>
> This is pretty powerful, although it needs care for wakeups and nested
> poll calls.

That would only add a couple of "epoll watching epoll" instances though - you
could still limit that to a pretty low number and not impact normal use.

Bron.
--
Bron Gondwana
[email protected]