Cédric Le Goater [Thu, 24 Apr 2014 07:23:34 +0000 (09:23 +0200)]
powerpc/boot: Add 64bit and little endian support to addnote
It could certainly be improved using Elf macros and byteswapping
routines, but the initial version of the code is organised to be a
single file program with limited dependencies. yaboot is the same.
Please scream if you want a total rewrite.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:33 +0000 (09:23 +0200)]
powerpc/boot: Define byteswapping routines for little endian
These are not the most efficient versions of swab but the wrapper does
not do much byte swapping. On a big endian cpu, these routines are
a no-op.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:32 +0000 (09:23 +0200)]
powerpc/boot: Fix compile warning in 64bit
arch/powerpc/boot/oflib.c:211:9: warning: cast to pointer from integer of \
different size [-Wint-to-pointer-cast]
return (phandle) of_call_prom("finddevice", 1, 1, name);
This is a work around. The definite solution would be to define the
phandle typedef as a u32, as in the kernel, but this would break the
device tree ops API.
Let it be for the moment.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:31 +0000 (09:23 +0200)]
powerpc/boot: Define typedef ihandle as u32
This makes ihandle 64bit friendly.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:30 +0000 (09:23 +0200)]
powerpc/boot: Rework of_claim() to make it 64bit friendly
This patch fixes 64bit compile warnings and updates the wrapper code
to converge the kernel code in prom_init.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:29 +0000 (09:23 +0200)]
powerpc/boot: Add PROM_ERROR define in oflib
This is mostly useful to make to the boot wrapper code closer with
the kernel code in prom_init.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:28 +0000 (09:23 +0200)]
powerpc/boot: Add byteswapping routines in oflib
Values will need to be byte-swapped when calling prom (big endian) from
a little endian boot wrapper.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:27 +0000 (09:23 +0200)]
powerpc/boot: Use prom_arg_t in oflib
This patch updates the wrapper code to converge with the kernel code in
prom_init.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:26 +0000 (09:23 +0200)]
powerpc/boot: Use a common prom_args struct in oflib
This patch fixes warnings when the wrapper is compiled in 64bit and
updates the boot wrapper code related to prom to converge with the
kernel code in prom_init. This should make the review of changes easier.
The kernel has a different number of possible arguments (10) when
entering prom. There does not seem to be any good reason to have
12 in the wrapper, so the patch changes this value to args[10] in
the prom_args struct.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cédric Le Goater [Thu, 24 Apr 2014 07:23:25 +0000 (09:23 +0200)]
powerpc/boot: Fix do_div for 64bit wrapper
When the boot wrapper is compiled in 64bit, there is no need to
use __div64_32.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:31 +0000 (18:00 +1000)]
powerpc/prom: Stop scanning dev-tree for fdump early
Function early_init_dt_scan_fw_dump() is called to scan the device
tree for fdump properties under node "rtas". Any one of them is
invalid, we can stop scanning the device tree early by returning
"1". It would save a bit time during boot.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:29 +0000 (18:00 +1000)]
powerpc/powernv: Don't use pe->pbus to get the domain number
If the PE contains single PCI function, "pe->pbus" would be NULL.
It's not reliable to be used by pci_domain_nr(). We just grab the
PCI domain number from the PCI host controller (struct pci_controller)
instance.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:28 +0000 (18:00 +1000)]
powerpc/powernv: Missed IOMMU table type
In function pnv_pci_ioda2_setup_dma_pe(), the IOMMU table type is
set to (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE) unconditionally.
It was just set to TCE_PCI by pnv_pci_setup_iommu_table(). So the
primary IOMMU table type (TCE_PCI) is lost. The patch fixes it.
Also, pnv_pci_setup_iommu_table() already set "tbl->it_busno" to
zero and we needn't do it again. The patch removes the redundant
assignment.
The patch also fixes similar issues in pnv_pci_ioda_setup_dma_pe().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:27 +0000 (18:00 +1000)]
powerpc/powernv: Fundamental reset on PLX ports
The patch intends to support fundamental reset on PLX downstream
ports. If the PCI device matches any one of the internal table,
which includes PLX vendor ID, bridge device ID, register offset
for fundamental reset and bit, fundamental reset will be done
accordingly. Otherwise, it will fail back to hot reset.
Additional flag (EEH_DEV_FRESET) is introduced to record the last
reset type on the PCI bridge.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:26 +0000 (18:00 +1000)]
powerpc/eeh: Can't recover from non-PE-reset case
When PCI_ERS_RESULT_CAN_RECOVER returned from device drivers, the
EEH core should enable I/O and DMA for the affected PE. However,
it was missed to have DMA enabled in eeh_handle_normal_event().
Besides, the frozen state of the affected PE should be cleared
after successful recovery, but we didn't.
The patch fixes both of the issues as above.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:25 +0000 (18:00 +1000)]
powrpc/powernv: Reset PHB in kdump kernel
In the kdump scenario, the first kerenl doesn't shutdown PCI devices
and the kdump kerenl clean PHB IODA table at the early probe time.
That means the kdump kerenl can't support PCI transactions piled
by the first kerenl. Otherwise, lots of EEH errors and frozen PEs
will be detected.
In order to avoid the EEH errors, the PHB is resetted to drop all
PCI transaction from the first kerenl.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:24 +0000 (18:00 +1000)]
powerpc/pci: Mask linkDown on resetting PCI bus
The problem was initially reported by Wendy who tried pass through
IPR adapter, which was connected to PHB root port directly, to KVM
based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of root port to underly firmware. For that, one more weak function
pci_reset_secondary_bus() is introduced so that the individual platforms
can override that and do specific reset for bridge's secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:23 +0000 (18:00 +1000)]
powerpc/eeh: Make the delay for PE reset unified
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the later 2 cases, we need PCI bus reset hold
and settlement delay as specified by PCI spec. PowerNV and pSeries
platforms are running on top of different firmware and some of the
delays have been covered by underly firmware (PowerNV).
The patch makes the delays unified to be done in backend, instead of
EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:22 +0000 (18:00 +1000)]
powerpc/powernv: Reset root port in firmware
Resetting root port has more stuff to do than that for PCIe switch
ports and we should have resetting root port done in firmware instead
of the kernel itself. The problem was introduced by commit
5b2e198e
("powerpc/powernv: Rework EEH reset").
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:21 +0000 (18:00 +1000)]
powerpc/pseries: Fix overwritten PE state
In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
overwritten by EEH_STATE_NOT_SUPPORT because of the missed
"break" there. The patch fixes the issue.
Reported-by: Joe Perches <joe@perches.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:20 +0000 (18:00 +1000)]
powerpc/powernv: Fix endless reporting frozen PE
Once one specific PE has been marked as EEH_PE_ISOLATED, it's in
the middile of recovery or removed permenently. We needn't report
the frozen PE again. Otherwise, we will have endless reporting
same frozen PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:19 +0000 (18:00 +1000)]
powerpc/eeh: No hotplug on permanently removed dev
The issue was detected in a bit complicated test case where
we have multiple hierarchical PEs shown as following figure:
+-----------------+
| PE#3 p2p#0 |
| p2p#1 |
+-----------------+
|
+-----------------+
| PE#4 pdev#0 |
| pdev#1 |
+-----------------+
PE#4 (have 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally had less-known scenario: PE#4 was removed
permanently from the system because of permanent failure (e.g.
exceeding the max allowd failure times in last hour), then we detects
EEH errors on PE#3 and tried to recover it. However, eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because of the fact that we rely on count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still apply hotplug on PE#4 and pdev#0/1, which
are not valid any more. Eventually, we run into kernel crash.
The patch fixes above issue from two aspects. For unplug, we simply
skip those permanently removed PE, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and its frozen count should be greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we marked all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on read
its PCI config so that PCI core will omit them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:18 +0000 (18:00 +1000)]
powerpc/eeh: Allow to disable EEH
The patch introduces bootarg "eeh=off" to disable EEH functinality.
Also, it creates /sys/kerenl/debug/powerpc/eeh_enable to disable
or enable EEH functionality. By default, we have the functionality
enabled.
For PowerNV platform, we will restore to have the conventional
mechanism of clearing frozen PE during PCI config access if we're
going to disable EEH functionality. Conversely, we will rely on
EEH for error recovery.
The patch also fixes the issue that we missed to cover the case
of disabled EEH functionality in function ioda_eeh_event(). Those
events driven by interrupt should be cleared to avoid endless
reporting.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:17 +0000 (18:00 +1000)]
powerpc/eeh: Cleanup EEH subsystem variables
There're 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables and we can just
have one variable and introduce different flags. The patch also
introduces additional flag EEH_FORCE_DISABLE, which will be used
to disable EEH subsystem via boot parameter ("eeh=off") in future.
Besides, the patch also introduces flag EEH_ENABLED, which is
changed to disable or enable EEH functionality on the fly through
debugfs entry in future.
With the patch applied, the creteria to check the enabled EEH
functionality is changed to:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases : Disabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:16 +0000 (18:00 +1000)]
powerpc/eeh: Use cached capability for log dump
When calling into eeh_gather_pci_data() on pSeries platform, we
possiblly don't have pci_dev instance yet, but eeh_dev is always
ready. So we use cached capability from eeh_dev instead of pci_dev
for log dump there. In order to keep things unified, we also cache
PCI capability positions to eeh_dev for PowerNV as well.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:15 +0000 (18:00 +1000)]
powerpc/eeh: Cleanup eeh_gather_pci_data()
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
function eeh_gather_pci_data().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:14 +0000 (18:00 +1000)]
powerpc/eeh: Avoid I/O access during PE reset
We have suffered recrusive frozen PE a lot, which was caused
by IO accesses during the PE reset. Ben came up with the good
idea to keep frozen PE until recovery (BAR restore) gets done.
With that, IO accesses during PE reset are dropped by hardware
and wouldn't incur the recrusive frozen PE any more.
The patch implements the idea. We don't clear the frozen state
until PE reset is done completely. During the period, the EEH
core expects unfrozen state from backend to keep going. So we
have to reuse EEH_PE_RESET flag, which has been set during PE
reset, to return normal state from backend. The side effect is
we have to clear frozen state for towice (PE reset and clear it
explicitly), but that's harmless.
We have some limitations on pHyp. pHyp doesn't allow to enable
IO or DMA for unfrozen PE. So we don't enable them on unfrozen PE
in eeh_pci_enable(). We have to enable IO before grabbing logs on
pHyp. Otherwise, 0xFF's is always returned from PCI config space.
Also, we had wrong return value from eeh_pci_enable() for
EEH_OPT_THAW_DMA case. The patch fixes it too.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:13 +0000 (18:00 +1000)]
powerpc/powernv: Use EEH PCI config accessors
For EEH PowerNV backends, they need use their own PCI config
accesors as the normal one could be blocked during PE reset.
The patch also removes necessary parameter "hose" for the
function ioda_eeh_bridge_reset().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:12 +0000 (18:00 +1000)]
powerpc/eeh: Block PCI-CFG access during PE reset
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device to
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG access from user space (e.g. "lspci"). Since access to
frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recrusive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors reuse the flag to block
PCI-CFG accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:11 +0000 (18:00 +1000)]
powerpc/eeh: EEH_PE_ISOLATED not reflect HW state
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
However, We should remove that if the PE reset has cleared the
frozen state successfully. Otherwise, the flag should be kept.
The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:10 +0000 (18:00 +1000)]
powerpc/powernv: Remove fields in PHB diag-data dump
For some fields (e.g. LEM, MMIO, DMA) in PHB diag-data dump, it's
meaningless to print them if they have non-zero value in the
corresponding mask registers because we always have non-zero values
in the mask registers. The patch only prints those fieds if we
have non-zero values in the primary registers (e.g. LEM, MMIO, DMA
status) so that we can save couple of lines. The patch also removes
unnecessary spare line before "brdgCtl:" and two leading spaces as
prefix in each line as Ben suggested.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:09 +0000 (18:00 +1000)]
powerpc/powernv: Move PNV_EEH_STATE_ENABLED around
The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state,
which is protected by CONFIG_EEH. We needn't that. Instead, we
can have pnv_phb::flags and maintain all flags there, which is
the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED
to PNV_PHB_FLAG_EEH.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:08 +0000 (18:00 +1000)]
powerpc/powernv: Remove PNV_EEH_STATE_REMOVED
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't
so useful any more and it's duplicated to EEH_PE_ISOLATED. The
patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Gavin Shan [Thu, 24 Apr 2014 08:00:07 +0000 (18:00 +1000)]
powerpc/eeh: Remove EEH_PE_PHB_DEAD
The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD is duplicate to
EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clear EEH_PE_RECOVERING after fenced PHB has been handled,
either failure or success. It makes the PHB PE state consistent with:
PHB functions normally NONE
PHB has been removed EEH_PE_ISOLATED
PHB fenced, recovery in progress EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alistair Popple [Tue, 8 Apr 2014 04:20:19 +0000 (14:20 +1000)]
powerpc/4xx: Fix section mismatch in ppc4xx_pci.c
This patch fixes this section mismatch:
WARNING: vmlinux.o(.text+0x1efc4): Section mismatch in reference from
the function apm821xx_pciex_init_port_hw() to the function
.init.text:ppc4xx_pciex_wait_on_sdr.isra.9()
The function apm821xx_pciex_init_port_hw() references the function
__init ppc4xx_pciex_wait_on_sdr.isra.9(). This is often because
apm821xx_pciex_init_port_hw lacks a __init annotation or the
annotation of ppc4xx_pciex_wait_on_sdr.isra.9 is wrong.
apm821xx_pciex_init_port_hw is only referenced by a struct in
__initdata, so it should be safe to add __init to
apm821xx_pciex_init_port_hw.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Preeti U Murthy [Fri, 11 Apr 2014 10:32:08 +0000 (16:02 +0530)]
ppc/kvm: Clear the runlatch bit of a vcpu before napping
When the guest cedes the vcpu or the vcpu has no guest to
run it naps. Clear the runlatch bit of the vcpu before
napping to indicate an idle cpu.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Preeti U Murthy [Fri, 11 Apr 2014 10:31:58 +0000 (16:01 +0530)]
ppc/kvm: Set the runlatch bit of a CPU just before starting guest
The secondary threads in the core are kept offline before launching guests
in kvm on powerpc: "
371fefd6f2dc4666:KVM: PPC: Allow book3s_hv guests to use
SMT processor modes."
Hence their runlatch bits are cleared. When the secondary threads are called
in to start a guest, their runlatch bits need to be set to indicate that they
are busy. The primary thread has its runlatch bit set though, but there is no
harm in setting this bit once again. Hence set the runlatch bit for all
threads before they start guest.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Preeti U Murthy [Fri, 11 Apr 2014 10:31:48 +0000 (16:01 +0530)]
ppc/powernv: Set the runlatch bits correctly for offline cpus
Up until now we have been setting the runlatch bits for a busy CPU and
clearing it when a CPU enters idle state. The runlatch bit has thus
been consistent with the utilization of a CPU as long as the CPU is online.
However when a CPU is hotplugged out the runlatch bit is not cleared. It
needs to be cleared to indicate an unused CPU. Hence this patch has the
runlatch bit cleared for an offline CPU just before entering an idle state
and sets it immediately after it exits the idle state.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Li Zhong [Thu, 10 Apr 2014 08:25:31 +0000 (16:25 +0800)]
powerpc/pseries: Protect remove_memory() with device hotplug lock
While testing memory hot-remove, I found following dead lock:
Process #1141 is drmgr, trying to remove some memory, i.e. memory499.
It holds the memory_hotplug_mutex, and blocks when trying to remove file
"online" under dir memory499, in kernfs_drain(), at
wait_event(root->deactivate_waitq,
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
Process #1120 is trying to online memory499 by
echo 1 > memory499/online
In .kernfs_fop_write, it uses kernfs_get_active() to increase
&kn->active, thus blocking process #1141. While itself is blocked later
when trying to acquire memory_hotplug_mutex, which is held by process
The backtrace of both processes are shown below:
[<
c000000001b18600>] 0xc000000001b18600
[<
c000000000015044>] .__switch_to+0x144/0x200
[<
c000000000263ca4>] .online_pages+0x74/0x7b0
[<
c00000000055b40c>] .memory_subsys_online+0x9c/0x150
[<
c00000000053cbe8>] .device_online+0xb8/0x120
[<
c00000000053cd04>] .online_store+0xb4/0xc0
[<
c000000000538ce4>] .dev_attr_store+0x64/0xa0
[<
c00000000030f4ec>] .sysfs_kf_write+0x7c/0xb0
[<
c00000000030e574>] .kernfs_fop_write+0x154/0x1e0
[<
c000000000268450>] .vfs_write+0xe0/0x260
[<
c000000000269144>] .SyS_write+0x64/0x110
[<
c000000000009ffc>] syscall_exit+0x0/0x7c
[<
c000000001b18600>] 0xc000000001b18600
[<
c000000000015044>] .__switch_to+0x144/0x200
[<
c00000000030be14>] .__kernfs_remove+0x204/0x300
[<
c00000000030d428>] .kernfs_remove_by_name_ns+0x68/0xf0
[<
c00000000030fb38>] .sysfs_remove_file_ns+0x38/0x60
[<
c000000000539354>] .device_remove_attrs+0x54/0xc0
[<
c000000000539fd8>] .device_del+0x158/0x250
[<
c00000000053a104>] .device_unregister+0x34/0xa0
[<
c00000000055bc14>] .unregister_memory_section+0x164/0x170
[<
c00000000024ee18>] .__remove_pages+0x108/0x4c0
[<
c00000000004b590>] .arch_remove_memory+0x60/0xc0
[<
c00000000026446c>] .remove_memory+0x8c/0xe0
[<
c00000000007f9f4>] .pseries_remove_memblock+0xd4/0x160
[<
c00000000007fcfc>] .pseries_memory_notifier+0x27c/0x290
[<
c0000000008ae6cc>] .notifier_call_chain+0x8c/0x100
[<
c0000000000d858c>] .__blocking_notifier_call_chain+0x6c/0xe0
[<
c00000000071ddec>] .of_property_notify+0x7c/0xc0
[<
c00000000071ed3c>] .of_update_property+0x3c/0x1b0
[<
c0000000000756cc>] .ofdt_write+0x3dc/0x740
[<
c0000000002f60fc>] .proc_reg_write+0xac/0x110
[<
c000000000268450>] .vfs_write+0xe0/0x260
[<
c000000000269144>] .SyS_write+0x64/0x110
[<
c000000000009ffc>] syscall_exit+0x0/0x7c
This patch uses lock_device_hotplug() to protect remove_memory() called
in pseries_remove_memblock(), which is also stated before function
remove_memory():
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations before this call, as required by
* try_offline_node().
*/
void __ref remove_memory(int nid, u64 start, u64 size)
With this lock held, the other process(#1120 above) trying to online the
memory block will retry the system call when calling
lock_device_hotplug_sysfs(), and finally find No such device error.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Mon, 14 Apr 2014 11:23:32 +0000 (21:23 +1000)]
powerpc: Fix error return in rtas_flash module init
module_init should return 0 or a negative errno.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Mon, 14 Apr 2014 11:55:25 +0000 (21:55 +1000)]
powerpc: Bump BOOT_COMMAND_LINE_SIZE to 2048
Bump the boot wrapper BOOT_COMMAND_LINE_SIZE to match the
kernel.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Mon, 14 Apr 2014 11:54:52 +0000 (21:54 +1000)]
powerpc: Bump COMMAND_LINE_SIZE to 2048
I've had a report that the current limit is too small for
an automated network based installer. Bump it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Mon, 14 Apr 2014 11:54:05 +0000 (21:54 +1000)]
powerpc: Rename duplicate COMMAND_LINE_SIZE define
We have two definitions of COMMAND_LINE_SIZE, one for the kernel
and one for the boot wrapper. I assume this is so the boot
wrapper can be self sufficient and not rely on kernel headers.
Having two defines with the same name is confusing, I just
updated the wrong one when trying to bump it.
Make the boot wrapper define unique by calling it
BOOT_COMMAND_LINE_SIZE.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:55 +0000 (10:10 -0700)]
powerpc/perf/hv-24x7: Catalog version number is be64, not be32
The catalog version number was changed from a be32 (with proceeding
32bits of padding) to a be64, update the code to treat it as a be64
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:54 +0000 (10:10 -0700)]
powerpc/perf/hv-24x7: Remove [static 4096], sparse chokes on it
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:53 +0000 (10:10 -0700)]
powerpc/perf/hv-24x7: Use (unsigned long) not (u32) values when calling plpar_hcall_norets()
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:52 +0000 (10:10 -0700)]
powerpc/perf/hv-gpci: Make device attr static
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:51 +0000 (10:10 -0700)]
powerpc/perf/hv_gpci: Probe failures use pr_debug(), and padding reduced
fixup for "powerpc/perf: Add support for the hv gpci (get performance
counter info) interface".
Makes the "not enabled" message less awful (and hidden unless
debugging).
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cody P Schafer [Tue, 15 Apr 2014 17:10:50 +0000 (10:10 -0700)]
powerpc/perf/hv_24x7: Probe errors changed to pr_debug(), padding fixed
fixup for "powerpc/perf: Add support for the hv 24x7 interface"
Makes the "not enabled" message less awful (and hides it in most cases).
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Aneesh Kumar K.V [Mon, 21 Apr 2014 05:07:36 +0000 (10:37 +0530)]
powerpc/mm: Fix tlbie to add AVAL fields for 64K pages
The if condition check was based on a draft ISA doc. Remove the same.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:27 +0000 (15:01 +1000)]
powerpc/powernv: Fix little endian issues in OPAL dump code
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:26 +0000 (15:01 +1000)]
powerpc/powernv: Create OPAL sglist helper functions and fix endian issues
We have two copies of code that creates an OPAL sg list. Consolidate
these into a common set of helpers and fix the endian issues.
The flash interface embedded a version number in the num_entries
field, whereas the dump interface did did not. Since versioning
wasn't added to the flash interface and it is impossible to add
this in a backwards compatible way, just remove it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:25 +0000 (15:01 +1000)]
powerpc/powernv: Fix little endian issues in OPAL error log code
Fix little endian issues with the OPAL error log code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:24 +0000 (15:01 +1000)]
powerpc/powernv: Fix little endian issues with opal_do_notifier calls
The bitmap in opal_poll_events and opal_handle_interrupt is
big endian, so we need to byteswap it on little endian builds.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:23 +0000 (15:01 +1000)]
powerpc/powernv: Remove some OPAL function declaration duplication
We had some duplication of the internal OPAL functions.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Tue, 22 Apr 2014 05:01:22 +0000 (15:01 +1000)]
powerpc/powernv: Use uint64_t instead of size_t in OPAL APIs
Using size_t in our APIs is asking for trouble, especially
when some OPAL calls use size_t pointers.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Wei Yang [Wed, 23 Apr 2014 02:26:33 +0000 (10:26 +0800)]
powerpc/powernv: Release the refcount for pci_dev
On PowerNV platform, we are holding an unnecessary refcount on a pci_dev, which
leads to the pci_dev is not destroyed when hotplugging a pci device.
This patch release the unnecessary refcount.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Wei Yang [Wed, 23 Apr 2014 02:26:32 +0000 (10:26 +0800)]
powerpc/powernv: Reduce multi-hit of iommu_add_device()
During the EEH hotplug event, iommu_add_device() will be invoked three times
and two of them will trigger warning or error.
The three times to invoke the iommu_add_device() are:
pci_device_add
...
set_iommu_table_base_and_group <- 1st time, fail
device_add
...
tce_iommu_bus_notifier <- 2nd time, succees
pcibios_add_pci_devices
...
pcibios_setup_bus_devices <- 3rd time, re-attach
The first time fails, since the dev->kobj->sd is not initialized. The
dev->kobj->sd is initialized in device_add().
The third time's warning is triggered by the re-attach of the iommu_group.
After applying this patch, the error
iommu_tce: 0003:05:00.0 has not been added, ret=-14
and the warning
[ 204.123609] ------------[ cut here ]------------
[ 204.123645] WARNING: at arch/powerpc/kernel/iommu.c:1125
[ 204.123680] Modules linked in: xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep bluetooth 6lowpan_iphc rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc mlx4_ib ib_sa ib_mad ib_core ib_addr ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnx2x tg3 mlx4_core nfsd ptp mdio ses libcrc32c nfs_acl enclosure be2net pps_core shpchp lockd kvm uinput sunrpc binfmt_misc lpfc scsi_transport_fc ipr scsi_tgt
[ 204.124356] CPU: 18 PID: 650 Comm: eehd Not tainted 3.14.0-rc5yw+ #102
[ 204.124400] task:
c0000027ed485670 ti:
c0000027ed50c000 task.ti:
c0000027ed50c000
[ 204.124453] NIP:
c00000000003cf80 LR:
c00000000006c648 CTR:
c00000000006c5c0
[ 204.124506] REGS:
c0000027ed50f440 TRAP: 0700 Not tainted (3.14.0-rc5yw+)
[ 204.124558] MSR:
9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR:
88008084 XER:
20000000
[ 204.124682] CFAR:
c00000000006c644 SOFTE: 1
GPR00:
c00000000006c648 c0000027ed50f6c0 c000000001398380 c0000027ec260300
GPR04:
c0000027ea92c000 c00000000006ad00 c0000000016e41b0 0000000000000110
GPR08:
c0000000012cd4c0 0000000000000001 c0000027ec2602ff 0000000000000062
GPR12:
0000000028008084 c00000000fdca200 c0000000000d1d90 c0000027ec281a80
GPR16:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20:
0000000000000000 0000000000000000 0000000000000000 0000000000000001
GPR24:
000000005342697b 0000000000002906 c000001fe6ac9800 c000001fe6ac9800
GPR28:
0000000000000000 c0000000016e3a80 c0000027ea92c090 c0000027ea92c000
[ 204.125353] NIP [
c00000000003cf80] .iommu_add_device+0x30/0x1f0
[ 204.125399] LR [
c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125443] Call Trace:
[ 204.125464] [
c0000027ed50f6c0] [
c0000027ed50f750] 0xc0000027ed50f750 (unreliable)
[ 204.125526] [
c0000027ed50f750] [
c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125588] [
c0000027ed50f7d0] [
c000000000069cc8] .pnv_pci_dma_dev_setup+0x78/0x340
[ 204.125650] [
c0000027ed50f870] [
c000000000044408] .pcibios_setup_device+0x88/0x2f0
[ 204.125712] [
c0000027ed50f940] [
c000000000046040] .pcibios_setup_bus_devices+0x60/0xd0
[ 204.125774] [
c0000027ed50f9c0] [
c000000000043acc] .pcibios_add_pci_devices+0xdc/0x1c0
[ 204.125837] [
c0000027ed50fa50] [
c00000000086f970] .eeh_reset_device+0x36c/0x4f0
[ 204.125939] [
c0000027ed50fb20] [
c00000000003a2d8] .eeh_handle_normal_event+0x448/0x480
[ 204.126068] [
c0000027ed50fbc0] [
c00000000003a35c] .eeh_handle_event+0x4c/0x340
[ 204.126192] [
c0000027ed50fc80] [
c00000000003a74c] .eeh_event_handler+0xfc/0x1b0
[ 204.126319] [
c0000027ed50fd30] [
c0000000000d1ea0] .kthread+0x110/0x130
[ 204.126430] [
c0000027ed50fe30] [
c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
[ 204.126556] Instruction dump:
[ 204.126610]
7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71 7c7e1b78 60000000
[ 204.126787]
60000000 e87e0298 3143ffff 7d2a1910 <
0b090000>
2fa90000 40de00c8 ebfe0218
[ 204.126966] ---[ end trace
6e7aefd80add2973 ]---
are cleared.
This patch removes iommu_add_device() in pnv_pci_ioda_dma_dev_setup(), which
revert part of the change in commit
d905c5df(PPC: POWERNV: move
iommu_add_device earlier).
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 23 Apr 2014 21:25:34 +0000 (07:25 +1000)]
powerpc/powernv: Fix little endian issues in OPAL flash code
With this patch I was able to update firmware on an LE kernel.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Thu, 24 Apr 2014 06:14:25 +0000 (16:14 +1000)]
powerpc/powernv: Fix kexec races going back to OPAL
We have a subtle race when sending CPUs back to OPAL on kexec.
We mark them as "in real mode" right before we send them down. Once
we've booted the new kernel, it might try to call opal_reinit_cpus()
to change endianness, and that requires all CPUs to be spinning inside
OPAL.
However there is no synchronization here and we've observed cases
where the returning CPUs hadn't established their new state inside
OPAL before opal_reinit_cpus() is called, causing it to fail.
The proper fix is to actually wait for them to go down all the way
from the kexec'ing kernel.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joel Stanley [Thu, 24 Apr 2014 07:25:37 +0000 (16:55 +0930)]
powerpc/powernv: Check sysparam size before creation
The size of the sysparam sysfs files is determined from the device tree
at boot. However the buffer is hard coded to 64 bytes. If we encounter a
parameter that is larger than 64, or miss-parse the device tree, the
buffer will overflow when reading or writing to the parameter.
Check it at discovery time, and if the parameter is too large, do not
create a sysfs entry for it.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joel Stanley [Thu, 24 Apr 2014 07:25:36 +0000 (16:55 +0930)]
powerpc/powernv: Fix typos in sysparam code
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joel Stanley [Thu, 24 Apr 2014 07:25:35 +0000 (16:55 +0930)]
powerpc/powernv: Check sysfs size before copying
The sysparam code currently uses the userspace supplied number of
bytes when memcpy()ing in to a local 64-byte buffer.
Limit the maximum number of bytes by the size of the buffer.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joel Stanley [Thu, 24 Apr 2014 07:25:34 +0000 (16:55 +0930)]
powerpc/powernv: Use ssize_t for sysparam return values
The OPAL calls are returning int64_t values, which the sysparam code
stores in an int, and the sysfs callback returns ssize_t. Make code a
easier to read by consistently using ssize_t.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Joel Stanley [Thu, 24 Apr 2014 07:25:33 +0000 (16:55 +0930)]
powerpc/powernv: Fix sysparam sysfs error handling
When a sysparam query in OPAL returned a negative value (error code),
sysfs would spew out a decent chunk of memory; almost 64K more than
expected. This was traced to a sign/unsigned mix up in the OPAL sysparam
sysfs code at sys_param_show.
The return value of sys_param_show is a ssize_t, calculated using
return ret ? ret : attr->param_size;
Alan Modra explains:
"attr->param_size" is an unsigned int, "ret" an int, so the overall
expression has type unsigned int. Result is that ret is cast to
unsigned int before being cast to ssize_t.
Instead of using the ternary operator, set ret to the param_size if an
error is not detected. The same bug exists in the sysfs write callback;
this patch fixes it in the same way.
A note on debugging this next time: on my system gcc will warn about
this if compiled with -Wsign-compare, which is not enabled by -Wall,
only -Wextra.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Li Zhong [Mon, 28 Apr 2014 00:29:51 +0000 (08:29 +0800)]
powerpc: Fix Oops in rtas_stop_self()
commit
41dd03a9 may cause Oops in rtas_stop_self().
The reason is that the rtas_args was moved into stack space. For a box
with more that 4GB RAM, the stack could easily be outside 32bit range,
but RTAS is 32bit.
So the patch moves rtas_args away from stack by adding static before
it.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Jeff Mahoney [Sun, 27 Apr 2014 22:10:43 +0000 (18:10 -0400)]
powerpc: Export flush_icache_range
Commit
aac416fc38c (lkdtm: flush icache and report actions) calls
flush_icache_range from a module. It's exported on most architectures
that implement it, but not on powerpc. This patch exports it to fix
the module link failure.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Sun, 20 Apr 2014 18:08:50 +0000 (11:08 -0700)]
Linux 3.15-rc2
Linus Torvalds [Sun, 20 Apr 2014 17:35:31 +0000 (10:35 -0700)]
Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine fixes from Vinod Koul:
"Back from long weekend here in India and now the time to send fixes
for slave dmaengine.
- Dan's fix of sirf xlate code
- Jean's fix for timberland
- edma fixes by Sekhar for SG handling and Yuan for changing init
call"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dma: fix eDMA driver as a subsys_initcall
dmaengine: sirf: off by one in of_dma_sirfsoc_xlate()
platform: Fix timberdale dependencies
dma: edma: fix incorrect SG list handling
Linus Torvalds [Sun, 20 Apr 2014 17:33:49 +0000 (10:33 -0700)]
Merge tag 'iommu-fixes-v3.15-rc1' of git://git./linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"Fixes for regressions:
- fix wrong IOMMU enumeration causing some SCSI device drivers
initialization failures
- ARM-SMMU fixes for a panic condition and a wrong return value"
* tag 'iommu-fixes-v3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/arm-smmu: fix panic in arm_smmu_alloc_init_pte
iommu/arm-smmu: Return 0 on unmap failure
iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors
iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges
iommu/vt-d: fix memory leakage caused by commit
ea8ea46
Linus Torvalds [Sun, 20 Apr 2014 17:32:33 +0000 (10:32 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf tooling fixes from Ingo Molnar:
"Three small tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Improve error reporting
perf tools: Adjust symbols in VDSO
perf kvm: Fix 'Min time' counting in report command
Ingo Molnar [Sun, 20 Apr 2014 07:53:55 +0000 (09:53 +0200)]
Merge tag 'perf-urgent-for-mingo' of git://git./linux/kernel/git/jolsa/perf into perf/urgent
Pull perf/urgent fixes from Jiri Olsa:
User visible changes:
* Adjust symbols in VDSO to properly resolve its function names (Vladimir Nikulichev)
* Improve error reporting for record session failure (Adrien BAK)
* Fix 'Min time' counting in report command (Alexander Yarygin)
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adrien BAK [Fri, 18 Apr 2014 02:00:43 +0000 (11:00 +0900)]
perf tools: Improve error reporting
In the current version, when using perf record, if something goes
wrong in tools/perf/builtin-record.c:375
session = perf_session__new(file, false, NULL);
The error message:
"Not enough memory for reading per file header"
is issued. This error message seems to be outdated and is not very
helpful. This patch proposes to replace this error message by
"Perf session creation failed"
I believe this issue has been brought to lkml:
https://lkml.org/lkml/2014/2/24/458
although this patch only tackles a (small) part of the issue.
Additionnaly, this patch improves error reporting in
tools/perf/util/data.c open_file_write.
Currently, if the call to open fails, the user is unaware of it.
This patch logs the error, before returning the error code to
the caller.
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Adrien BAK <adrien.bak@metascale.org>
Link: http://lkml.kernel.org/r/1397786443.3093.4.camel@beast
[ Reorganize the changelog into paragraphs ]
[ Added empty line after fd declaration in open_file_write ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Vladimir Nikulichev [Thu, 17 Apr 2014 15:27:01 +0000 (08:27 -0700)]
perf tools: Adjust symbols in VDSO
pert-report doesn't resolve function names in VDSO:
$ perf report --stdio -g flat,0.0,15,callee --sort pid
...
8.76%
0x7fff6b1fe861
__gettimeofday
ACE_OS::gettimeofday()
...
In this case symbol values should be adjusted the same way as for executables,
relocatable objects and prelinked libraries.
After fix:
$ perf report --stdio -g flat,0.0,15,callee --sort pid
...
8.76%
__vdso_gettimeofday
__gettimeofday
ACE_OS::gettimeofday()
Signed-off-by: Vladimir Nikulichev <nvs@tbricks.com>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Link: http://lkml.kernel.org/r/969812.163009436-sendEmail@nvs
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Alexander Yarygin [Wed, 9 Apr 2014 14:21:59 +0000 (16:21 +0200)]
perf kvm: Fix 'Min time' counting in report command
Every event in the perf-kvm has a 'stats' structure, which contains
max/min/average/etc times of handling this event.
The problem is that the 'perf-kvm stat report' command always shows
that 'min time' is 0us for every event. Example:
# perf kvm stat report
Analyze events for all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
[..]
0xB2 MSCH 12 0.07% 0.00% 0us 8us 7.31us ( +- 2.11% )
0xB2 CHSC 12 0.07% 0.00% 0us 18us 9.39us ( +- 9.49% )
0xB2 STPX 8 0.05% 0.00% 0us 2us 1.88us ( +- 7.18% )
0xB2 STSI 7 0.04% 0.00% 0us 44us 16.49us ( +- 38.20% )
[..]
This happens because the 'stats' structure is not initialized and
stats->min equals to 0. Lets initialize the structure for every
event after its allocation using init_stats() function. This initializes
stats->min to -1 and makes 'Min time' statistics counting work:
# perf kvm stat report
Analyze events for all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
[..]
0xB2 MSCH 12 0.07% 0.00% 6us 8us 7.31us ( +- 2.11% )
0xB2 CHSC 12 0.07% 0.00% 7us 18us 9.39us ( +- 9.49% )
0xB2 STPX 8 0.05% 0.00% 1us 2us 1.88us ( +- 7.18% )
0xB2 STSI 7 0.04% 0.00% 1us 44us 16.49us ( +- 38.20% )
[..]
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Link: http://lkml.kernel.org/r/1397053319-2130-3-git-send-email-borntraeger@de.ibm.com
[ Fixing the perf examples changelog output ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Eric Dumazet [Sat, 19 Apr 2014 17:15:07 +0000 (10:15 -0700)]
coredump: fix va_list corruption
A va_list needs to be copied in case it needs to be used twice.
Thanks to Hugh for debugging this issue, leading to various panics.
Tested:
lpq84:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern
'produce_core' is simply : main() { *(int *)0 = 1;}
lpq84:~# ./produce_core
Segmentation fault (core dumped)
lpq84:~# dmesg | tail -1
[ 614.352947] Core dump to |/foobar12345 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 (null) pipe failed
Notice the last argument was replaced by a NULL (we were lucky enough to
not crash, but do not try this on your production machine !)
After fix :
lpq83:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern
lpq83:~# ./produce_core
Segmentation fault
lpq83:~# dmesg | tail -1
[ 740.800441] Core dump to |/foobar12345 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 pipe failed
Fixes: 5fe9d8ca21cc ("coredump: cn_vprintf() has no reason to call vsnprintf() twice")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Hugh Dickins <hughd@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 19 Apr 2014 17:41:43 +0000 (10:41 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"This fixes the preemption-count imbalance crash reported by Owen
Kibel"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix CMCI preemption bugs
Linus Torvalds [Sat, 19 Apr 2014 17:40:51 +0000 (10:40 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Two fixes:
- a SCHED_DEADLINE task selection fix
- a sched/numa related lockdep splat fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Check for stop task appearance when balancing happens
sched/numa: Fix task_numa_free() lockdep splat
Linus Torvalds [Sat, 19 Apr 2014 17:40:11 +0000 (10:40 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Two kernel side fixes:
- an Intel uncore PMU driver potential crash fix
- a kprobes/perf-call-graph interaction fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU
kprobes/x86: Fix page-fault handling logic
Linus Torvalds [Sat, 19 Apr 2014 17:35:30 +0000 (10:35 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Unfortunately this contains no easter eggs, its a bit larger than I'd
like, but I included a patch that just moves code from one file to
another and I'd like to avoid merge conflicts with that later, so it
makes it seem worse than it is,
Otherwise:
- radeon: fixes to use new microcode to stabilise some cards, use
some common displayport code, some runtime pm fixes, pll regression
fixes
- i915: fix for some context oopses, a warn in a used path, backlight
fixes
- nouveau: regression fix
- omap: a bunch of fixes"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (51 commits)
drm: bochs: drop unused struct fields
drm: bochs: add power management support
drm: cirrus: add power management support
drm: Split out drm_probe_helper.c from drm_crtc_helper.c
drm/plane-helper: Don't fake-implement primary plane disabling
drm/ast: fix value check in cbr_scan2
drm/nouveau/bios: fix a bit shift error introduced by
457e77b
drm/radeon/ci: make sure mc ucode is loaded before checking the size
drm/radeon/si: make sure mc ucode is loaded before checking the size
drm/radeon: improve PLL params if we don't match exactly v2
drm/radeon: memory leak on bo reservation failure. v2
drm/radeon: fix VCE fence command
drm/radeon: re-enable mclk dpm on R7 260X asics
drm/radeon: add support for newer mc ucode on CI (v2)
drm/radeon: add support for newer mc ucode on SI (v2)
drm/radeon: apply more strict limits for PLL params v2
drm/radeon: update CI DPM powertune settings
drm/radeon: fix runpm handling on APUs (v4)
drm/radeon: disable mclk dpm on R7 260X
drm/tegra: Remove gratuitous pad field
...
Dave Airlie [Sat, 19 Apr 2014 01:16:02 +0000 (11:16 +1000)]
Merge branch 'drm-next-3.15-wip' of git://people.freedesktop.org/~deathsimple/linux into drm-next
Some i2c fixes over DisplayPort.
* 'drm-next-3.15-wip' of git://people.freedesktop.org/~deathsimple/linux:
drm/radeon: Improve vramlimit module param documentation
drm/radeon: fix audio pin counts for DCE6+ (v2)
drm/radeon/dp: switch to the common i2c over aux code
drm/dp/i2c: Update comments about common i2c over dp assumptions (v3)
drm/dp/i2c: send bare addresses to properly reset i2c connections (v4)
drm/radeon/dp: handle zero sized i2c over aux transactions (v2)
drm/i915: support address only i2c-over-aux transactions
drm/tegra: dp: Support address-only I2C-over-AUX transactions
Linus Torvalds [Sat, 19 Apr 2014 00:53:46 +0000 (17:53 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull more networking fixes from David Miller:
1) Fix mlx4_en_netpoll implementation, it needs to schedule a NAPI
context, not synchronize it. From Chris Mason.
2) Ipv4 flow input interface should never be zero, it should be
LOOPBACK_IFINDEX instead. From Cong Wang and Julian Anastasov.
3) Properly configure MAC to PHY connection in mvneta devices, from
Thomas Petazzoni.
4) sys_recv should use SYSCALL_DEFINE. From Jan Glauber.
5) Tunnel driver ioctls do not use the correct namespace, fix from
Nicolas Dichtel.
6) Fix memory leak on seccomp filter attach, from Kees Cook.
7) Fix lockdep warning for nested vlans, from Ding Tianhong.
8) Crashes can happen in SCTP due to how the auth_enable value is
managed, fix from Vlad Yasevich.
9) Wireless fixes from John W Linville and co.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits)
net: sctp: cache auth_enable per endpoint
tg3: update rx_jumbo_pending ring param only when jumbo frames are enabled
vlan: Fix lockdep warning when vlan dev handle notification
seccomp: fix memory leak on filter attach
isdn: icn: buffer overflow in icn_command()
ip6_tunnel: use the right netns in ioctl handler
sit: use the right netns in ioctl handler
ip_tunnel: use the right netns in ioctl handler
net: use SYSCALL_DEFINEx for sys_recv
net: mdio-gpio: Add support for separate MDI and MDO gpio pins
net: mdio-gpio: Add support for active low gpio pins
net: mdio-gpio: Use devm_ functions where possible
ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source()
ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif
mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll
net: mvneta: properly configure the MAC <-> PHY connection in all situations
net: phy: add minimal support for QSGMII PHY
sfc:On MCDI timeout, issue an FLR (and mark MCDI to fail-fast)
mwifiex: fix hung task on command timeout
mwifiex: process event before command response
...
Linus Torvalds [Sat, 19 Apr 2014 00:52:39 +0000 (17:52 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"A set of 5 small cifs fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cif: fix dead code
cifs: fix error handling cifs_user_readv
fs: cifs: remove unused variable.
Return correct error on query of xattr on file with empty xattrs
cifs: Wait for writebacks to complete before attempting write.
Linus Torvalds [Sat, 19 Apr 2014 00:02:35 +0000 (17:02 -0700)]
Merge tag 'char-misc-3.15-rc2' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a few driver fixes for char/misc drivers that resolve
reported issues.
All have been in linux-next successfully for a few days"
* tag 'char-misc-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Drivers: hv: vmbus: Negotiate version 3.0 when running on ws2012r2 hosts
Tools: hv: Handle the case when the target file exists correctly
vme_tsi148: Utilize to_pci_dev() macro
vme_tsi148: Fix PCI address mapping assumption
vme_tsi148: Fix typo in tsi148_slave_get()
w1: avoid recursive device_add
w1: fix netlink refcnt leak on error path
misc: Grammar s/addition/additional/
drivers: mcb: fix memory leak in chameleon_parse_cells() error path
mei: ignore client writing state during cb completion
mei: me: do not load the driver if the FW doesn't support MEI interface
GenWQE: Increase driver version number
GenWQE: Fix multithreading problems
GenWQE: Ensure rc is not returning an uninitialized value
GenWQE: Add wmb before DDCB is started
GenWQE: Enable access to VPD flash area
Linus Torvalds [Fri, 18 Apr 2014 23:59:52 +0000 (16:59 -0700)]
Merge tag 'driver-core-3.15-rc2' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some driver core fixes for 3.15-rc2. Also in here are some
documentation updates, as well as an API removal that had to wait for
after -rc1 due to the cleanups coming into you from multiple developer
trees (this one and the PPC tree.)
All have been in linux next successfully"
* tag 'driver-core-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers/base/dd.c incorrect pr_debug() parameters
Documentation: Update stable address in Chinese and Japanese translations
topology: Fix compilation warning when not in SMP
Chinese: add translation of io_ordering.txt
stable_kernel_rules: spelling/word usage
sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()
kernfs: protect lazy kernfs_iattrs allocation with mutex
fs: Don't return 0 from get_anon_bdev
Linus Torvalds [Fri, 18 Apr 2014 23:58:47 +0000 (16:58 -0700)]
Merge tag 'staging-3.15-rc2' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are a few staging driver fixes for issues that have been reported
for 3.15-rc2.
Also dominating the diffstat for the pull request is the removal of
the rtl8187se driver. It's no longer needed in staging as a "real"
driver for this hardware is now merged in the tree in the "correct"
location in drivers/net/
All of these patches have been tested in linux-next"
* tag 'staging-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: r8188eu: Fix case where ethtype was never obtained and always be checked against 0
staging: r8712u: Fix case where ethtype was never obtained and always be checked against 0
staging: r8188eu: Calling rtw_get_stainfo() with a NULL sta_addr will return NULL
staging: comedi: fix circular locking dependency in comedi_mmap()
staging: r8723au: Add missing initialization of change_inx in sort algorithm
Staging: unisys: use after free in list_for_each()
staging: unisys: use after free in error messages
staging: speakup: fix misuse of kstrtol() in handle_goto()
staging: goldfish: Call free_irq in error path
staging: delete rtl8187se wireless driver
staging: rtl8723au: Fix buffer overflow in rtw_get_wfd_ie()
staging: gs_fpgaboot: remove __TIMESTAMP__ macro
staging: vme: fix memory leak in vme_user_probe()
staging: fpgaboot: clean up Makefile
staging/usbip: fix store_attach() sscanf return value check
staging/usbip: userspace - fix usbipd SIGSEGV from refresh_exported_devices()
staging: rtl8188eu: remove spaces, correct counts to unbreak P2P ioctls
staging/rtl8821ae: Fix OOM handling in _rtl_init_deferred_work()
Linus Torvalds [Fri, 18 Apr 2014 23:57:53 +0000 (16:57 -0700)]
Merge tag 'tty-3.15-rc2' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a number of small tty/serial driver fixes for 3.15-rc2. Also
in here are some Documentation file removals for drivers that we
removed a long time ago, no need to keep it around any longer.
All of these have been in linux-next for a bit"
* tag 'tty-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "serial: 8250, disable "too much work" messages"
serial: amba-pl011: fix regression, causing an Oops on rmmod
tty: Fix help text of SYNCLINK_CS
tty: fix memleak in alloc_pid
ttyprintk: Allow built as a module
ttyprintk: Fix wrong tty_unregister_driver() call in the error path
serial: 8250, disable "too much work" messages
Documentation/serial: Delete obsolete driver documentation
serial: omap: Fix missing pm_runtime_resume handling by simplifying code
serial_core: Fix pm imbalance on unbind
serial: pl011: change Rx burst size to half of trigger level
serial: timberdale: Depend on X86_32
serial: st-asc: Fix SysRq char handling
Revert "serial: clps711x: Give a chance to perform useful tasks during wait loop"
serial_core: Fix conditional start_tx on ring buffer not empty
serial: efm32: use $vendor,$device scheme for compatible string
serial: omap: free the wakeup settings in remove
Linus Torvalds [Fri, 18 Apr 2014 23:57:00 +0000 (16:57 -0700)]
Merge tag 'usb-3.15-rc2' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of tiny USB fixes and new device ids for 3.15-rc2.
Nothing major, just issues some people have reported.
All of these have been in linux-next"
* tag 'usb-3.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
uas: fix deadlocky memory allocations
uas: fix error handling during scsi_scan()
uas: fix GFP_NOIO under spinlock
uwb: adds missing error handling
USB: cdc-acm: Remove Motorola/Telit H24 serial interfaces from ACM driver
USB: ohci-jz4740: FEAT_POWER is a port feature, not a hub feature
USB: ohci-jz4740: Fix uninitialized variable warning
USB: EHCI: tegra: set txfill_tuning
usb: ehci-platform: Return immediately from suspend if ehci_suspend fails
usb: ehci-exynos: Return immediately from suspend if ehci_suspend fails
USB: fix crash during hotplug of PCI USB controller card
USB: cdc-acm: fix double usb_autopm_put_interface() in acm_port_activate()
usb: usb-common: fix typo for usb_state_string
USB: usb_wwan: fix handling of missing bulk endpoints
USB: pl2303: add ids for Hewlett-Packard HP POS pole displays
USB: cp210x: Add 8281 (Nanotec Plug & Drive)
usb: option driver, add support for Telit UE910v2
Revert "USB: serial: add usbid for dell wwan card to sierra.c"
USB: serial: ftdi_sio: add id for Brainboxes serial cards
Linus Torvalds [Fri, 18 Apr 2014 23:40:31 +0000 (16:40 -0700)]
Merge branch 'akpm' (incoming from Andrew)
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
thp: close race between split and zap huge pages
mm: fix new kernel-doc warning in filemap.c
mm: fix CONFIG_DEBUG_VM_RB description
mm: use paravirt friendly ops for NUMA hinting ptes
mips: export flush_icache_range
mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()
wait: explain the shadowing and type inconsistencies
Shiraz has moved
Documentation/vm/numa_memory_policy.txt: fix wrong document in numa_memory_policy.txt
powerpc/mm: fix ".__node_distance" undefined
kernel/watchdog.c:touch_softlockup_watchdog(): use raw_cpu_write()
init/Kconfig: move the trusted keyring config option to general setup
vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()
Kirill A. Shutemov [Fri, 18 Apr 2014 22:07:25 +0000 (15:07 -0700)]
thp: close race between split and zap huge pages
Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
have the same root cause. Let's look to them one by one.
The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
testing I see that page_mapcount() is higher than mapcount here.
I think it happens due to race between zap_huge_pmd() and
page_check_address_pmd(). page_check_address_pmd() misses PMD which is
under zap:
CPU0 CPU1
zap_huge_pmd()
pmdp_get_and_clear()
__split_huge_page()
anon_vma_interval_tree_foreach()
__split_huge_page_splitting()
page_check_address_pmd()
mm_find_pmd()
/*
* We check if PMD present without taking ptl: no
* serialization against zap_huge_pmd(). We miss this PMD,
* it's not accounted to 'mapcount' in __split_huge_page().
*/
pmd_present(pmd) == 0
BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
page_remove_rmap(page)
atomic_add_negative(-1, &page->_mapcount)
The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
This happens in similar way:
CPU0 CPU1
zap_huge_pmd()
pmdp_get_and_clear()
page_remove_rmap(page)
atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
anon_vma_interval_tree_foreach()
__split_huge_page_splitting()
page_check_address_pmd()
mm_find_pmd()
pmd_present(pmd) == 0 /* The same comment as above */
/*
* No crash this time since we already decremented page->_mapcount in
* zap_huge_pmd().
*/
BUG_ON(mapcount != page_mapcount(page))
/*
* We split the compound page here into small pages without
* serialization against zap_huge_pmd()
*/
__split_huge_page_refcount()
VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
So my understanding the problem is pmd_present() check in mm_find_pmd()
without taking page table lock.
The bug was introduced by me commit with commit
117b0791ac42. Sorry for
that. :(
Let's open code mm_find_pmd() in page_check_address_pmd() and do the
check under page table lock.
Note that __page_check_address() does the same for PTE entires
if sync != 0.
I've stress tested split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before it took <20 min to
trigger the first bug and few hours for second one (if we ignore
first).
[1] https://lkml.kernel.org/g/<
53440991.
9090001@oracle.com>
[2] https://lkml.kernel.org/g/<
5310C56C.60709@oracle.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Fri, 18 Apr 2014 22:07:23 +0000 (15:07 -0700)]
mm: fix new kernel-doc warning in filemap.c
Fix new kernel-doc warning in mm/filemap.c:
Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Fri, 18 Apr 2014 22:07:22 +0000 (15:07 -0700)]
mm: fix CONFIG_DEBUG_VM_RB description
This appears to be a copy/paste error. Update the description to
reflect extra rbtree debug and checks for the config option instead of
duplicating CONFIG_DEBUG_VM.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Fri, 18 Apr 2014 22:07:21 +0000 (15:07 -0700)]
mm: use paravirt friendly ops for NUMA hinting ptes
David Vrabel identified a regression when using automatic NUMA balancing
under Xen whereby page table entries were getting corrupted due to the
use of native PTE operations. Quoting him
Xen PV guest page tables require that their entries use machine
addresses if the preset bit (_PAGE_PRESENT) is set, and (for
successful migration) non-present PTEs must use pseudo-physical
addresses. This is because on migration MFNs in present PTEs are
translated to PFNs (canonicalised) so they may be translated back
to the new MFN in the destination domain (uncanonicalised).
pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma()
set and clear the _PAGE_PRESENT bit using pte_set_flags(),
pte_clear_flags(), etc.
In a Xen PV guest, these functions must translate MFNs to PFNs
when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting
_PAGE_PRESENT.
His suggested fix converted p[te|md]_[set|clear]_flags to using
paravirt-friendly ops but this is overkill. He suggested an alternative
of using p[te|md]_modify in the NUMA page table operations but this is
does more work than necessary and would require looking up a VMA for
protections.
This patch modifies the NUMA page table operations to use paravirt
friendly operations to set/clear the flags of interest. Unfortunately
this will take a performance hit when updating the PTEs on
CONFIG_PARAVIRT but I do not see a way around it that does not break
Xen.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 18 Apr 2014 22:07:19 +0000 (15:07 -0700)]
mips: export flush_icache_range
The lkdtm module performs tests against executable memory ranges, so it
needs to flush the icache for proper behaviors. Other architectures
already export this, so do the same for MIPS.
[akpm@linux-foundation.org: relocate export sites]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Sanjay Lal <sanjayl@kymasys.com>
Cc: John Crispin <blogic@openwrt.org>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mizuma, Masayoshi [Fri, 18 Apr 2014 22:07:18 +0000 (15:07 -0700)]
mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()
soft lockup in freeing gigantic hugepage fixed in commit
55f67141a892 "mm:
hugetlb: fix softlockup when a large number of hugepages are freed." can
happen in return_unused_surplus_pages(), so let's fix it.
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Fri, 18 Apr 2014 22:07:17 +0000 (15:07 -0700)]
wait: explain the shadowing and type inconsistencies
Stick in a comment before someone else tries to fix the sparse warning
this generates.
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-o2ro6f3vkxklni0bc8f7m68s@git.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Viresh Kumar [Fri, 18 Apr 2014 22:07:16 +0000 (15:07 -0700)]
Shiraz has moved
shiraz.hashim@st.com email-id doesn't exist anymore as he has left the
company. Replace ST's id with shiraz.linux.kernel@gmail.com.
It also updates .mailmap file to fix address for 'git shortlog'.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tang Chen [Fri, 18 Apr 2014 22:07:15 +0000 (15:07 -0700)]
Documentation/vm/numa_memory_policy.txt: fix wrong document in numa_memory_policy.txt
In document numa_memory_policy.txt, the following examples for flag
MPOL_F_RELATIVE_NODES are incorrect.
For example, consider a task that is attached to a cpuset with
mems 2-5 that sets an Interleave policy over the same set with
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
interleave now occurs over nodes 3,5-6. If the cpuset's mems
then change to 0,2-3,5, then the interleave occurs over nodes
0,3,5.
According to the comment of the patch adding flag MPOL_F_RELATIVE_NODES,
the nodemasks the user specifies should be considered relative to the
current task's mems_allowed.
(https://lkml.org/lkml/2008/2/29/428)
And according to numa_memory_policy.txt, if the user's nodemask includes
nodes that are outside the range of the new set of allowed nodes, then
the remap wraps around to the beginning of the nodemask and, if not
already set, sets the node in the mempolicy nodemask.
So in the example, if the user specifies 2-5, for a task whose
mems_allowed is 3-7, the nodemasks should be remapped the third, fourth,
fifth, sixth node in mems_allowed. like the following:
mems_allowed: 3 4 5 6 7
relative index: 0 1 2 3 4
5
So the nodemasks should be remapped to 3,5-7, but not 3,5-6.
And for a task whose mems_allowed is 0,2-3,5, the nodemasks should be
remapped to 0,2-3,5, but not 0,3,5.
mems_allowed: 0 2 3 5
relative index: 0 1 2 3
4 5
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Qiu [Fri, 18 Apr 2014 22:07:14 +0000 (15:07 -0700)]
powerpc/mm: fix ".__node_distance" undefined
CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
...
Building modules, stage 2.
WARNING: 1 bad relocations
c0000000013d6a30 R_PPC64_ADDR64 uprobes_fetch_type_table
WRAP arch/powerpc/boot/zImage.pseries
WRAP arch/powerpc/boot/zImage.epapr
MODPOST 1849 modules
ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
make: *** Waiting for unfinished jobs....
The reason is symbol "__node_distance" not been exported in powerpc.
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Fri, 18 Apr 2014 22:07:12 +0000 (15:07 -0700)]
kernel/watchdog.c:touch_softlockup_watchdog(): use raw_cpu_write()
Fix:
BUG: using __this_cpu_write() in preemptible [
00000000] code: systemd-udevd/497
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 497 Comm: systemd-udevd Tainted: G W 3.15.0-rc1 #9
Hardware name: Hewlett-Packard HP EliteBook 8470p/179B, BIOS 68ICF Ver. F.02 04/27/2012
Call Trace:
check_preemption_disabled+0xe1/0xf0
__this_cpu_preempt_check+0x13/0x20
touch_nmi_watchdog+0x28/0x40
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Tested-by: Luis Henriques <luis.henriques@canonical.com>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>