Sam bobroff [Tue, 26 May 2015 23:56:57 +0000 (09:56 +1000)]
KVM: PPC: Book3S: correct width in XER handling
In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
accessed as such.
This patch corrects places where it is accessed as a 32 bit field by a
64 bit kernel. In some cases this is via a 32 bit load or store
instruction which, depending on endianness, will cause either the
lower or upper 32 bits to be missed. In another case it is cast as a
u32, causing the upper 32 bits to be cleared.
This patch corrects those places by extending the access methods to
64 bits.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Thu, 16 Jul 2015 07:11:14 +0000 (17:11 +1000)]
KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation
Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
time for it. This currently isn't the case when we have a vcore that
no longer has any runnable threads in it but still has a runner task,
so we do an explicit call to kvmppc_core_start_stolen() in that case.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Thu, 16 Jul 2015 07:11:13 +0000 (17:11 +1000)]
KVM: PPC: Book3S HV: Fix preempted vcore list locking
When a vcore gets preempted, we put it on the preempted vcore list for
the current CPU. The runner task then calls schedule() and comes back
some time later and takes itself off the list. We need to be careful
to lock the list that it was put onto, which may not be the list for the
current CPU since the runner task may have moved to another CPU.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Wed, 24 Jun 2015 11:18:07 +0000 (21:18 +1000)]
KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD
This adds implementations for the H_CLEAR_REF (test and clear reference
bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.
When clearing the reference or change bit in the guest view of the HPTE,
we also have to clear it in the real HPTE so that we can detect future
references or changes. When we do so, we transfer the R or C bit value
to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
has been referenced and/or changed.
These hypercalls are not used by Linux guests. These implementations
have been tested using a FreeBSD guest.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Wed, 24 Jun 2015 11:18:06 +0000 (21:18 +1000)]
KVM: PPC: Book3S HV: Fix bug in dirty page tracking
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Wed, 24 Jun 2015 11:18:05 +0000 (21:18 +1000)]
KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE
The reference (R) and change (C) bits in a HPT entry can be set by
hardware at any time up until the HPTE is invalidated and the TLB
invalidation sequence has completed. This means that when removing
a HPTE, we need to read the HPTE after the invalidation sequence has
completed in order to obtain reliable values of R and C. The code
in kvmppc_do_h_remove() used to do this. However, commit
6f22bd3265fb
("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
read after invalidation as a side effect of other changes. This
restores the read of the HPTE after invalidation.
The user-visible effect of this bug would be that when migrating a
guest, there is a small probability that a page modified by the guest
and then unmapped by the guest might not get re-transmitted and thus
the destination might end up with a stale copy of the page.
Fixes: 6f22bd3265fb
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Thu, 2 Jul 2015 10:38:16 +0000 (20:38 +1000)]
KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Wed, 24 Jun 2015 11:18:03 +0000 (21:18 +1000)]
KVM: PPC: Book3S HV: Make use of unused threads when running guests
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Tudor Laurentiu [Mon, 18 May 2015 12:44:27 +0000 (15:44 +0300)]
KVM: PPC: add missing pt_regs initialization
On this switch branch the regs initialization
doesn't happen so add it.
This was found with the help of a static
code analysis tool.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Thomas Huth [Fri, 22 May 2015 07:25:02 +0000 (09:25 +0200)]
KVM: PPC: Fix warnings from sparse
When compiling the KVM code for POWER with "make C=1", sparse
complains about functions missing proper prototypes and a 64-bit
constant missing the ULL prefix. Let's fix this by making the
functions static or by including the proper header with the
prototypes, and by appending a ULL prefix to the constant
PPC_MPPE_ADDRESS_MASK.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Thomas Huth [Fri, 22 May 2015 09:41:01 +0000 (11:41 +0200)]
KVM: PPC: Remove PPC970 from KVM_BOOK3S_64_HV text in Kconfig
Since the PPC970 support has been removed from the kvm-hv kernel
module recently, we should also reflect this change in the help
text of the corresponding Kconfig option.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Tudor Laurentiu [Mon, 25 May 2015 08:48:39 +0000 (11:48 +0300)]
KVM: PPC: fix suspicious use of conditional operator
This was signaled by a static code analysis tool.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Andy Lutomirski [Thu, 13 Aug 2015 20:18:48 +0000 (13:18 -0700)]
x86/kvm: Rename VMX's segment access rights defines
VMX encodes access rights differently from LAR, and the latter is
most likely what x86 people think of when they think of "access
rights".
Rename them to avoid confusion.
Cc: kvm@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 13 Aug 2015 09:51:50 +0000 (11:51 +0200)]
Merge tag 'kvm-s390-next-
20150812' of git://git./linux/kernel/git/kvms390/linux into HEAD
KVM: s390: fix and feature for kvm/next (4.3)
1. error handling for irq routes
2. Gracefully handle STP time changes
s390 supports a protocol for syncing different systems via the stp
protocol that will steer the TOD clocks to keep all participating
clocks below the round trip time between the system. In case of
specific out of sync event Linux can opt-in to accept sync checks.
This will result in non-monotonic jumps of the TOD clock, which
Linux will correct via time offsets to keep the wall clock time
monotonic. Now: KVM guests also base their time on the host TOD,
so we need to fixup the offset for them as well.
Wei Huang [Fri, 7 Aug 2015 19:53:30 +0000 (15:53 -0400)]
KVM: x86/vPMU: Fix unnecessary signed extension for AMD PERFCTRn
According to AMD programmer's manual, AMD PERFCTRn is 64-bit MSR which,
unlike Intel perf counters, doesn't require signed extension. This
patch removes the unnecessary conversion in SVM vPMU code when PERFCTRn
is being updated.
Signed-off-by: Wei Huang <wei@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nicholas Krause [Wed, 5 Aug 2015 14:44:40 +0000 (10:44 -0400)]
kvm: x86: Fix error handling in the function kvm_lapic_sync_from_vapic
This fixes error handling in the function kvm_lapic_sync_from_vapic
by checking if the call to kvm_read_guest_cached has returned a
error code to signal to its caller the call to this function has
failed and due to this we must immediately return to the caller
of kvm_lapic_sync_from_vapic to avoid incorrectly call apic_set_tpc
if a error has occurred here.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nicholas Krause [Thu, 6 Aug 2015 17:05:54 +0000 (13:05 -0400)]
KVM: s390: Fix assumption that kvm_set_irq_routing is always run successfully
This fixes the assumption that kvm_set_irq_routing is always run
successfully by instead making it equal to the variable r which
we use for returning in the function kvm_arch_vm_ioctl instead
of making r equal to zero when calling this particular function
and incorrectly making the caller of kvm_arch_vm_ioctl think
the function has run successfully.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Message-Id: <
1438880754-27149-1-git-send-email-xerofoify@gmail.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:27 +0000 (12:04 +0800)]
KVM: VMX: drop ept misconfig check
The logic used to check ept misconfig is completely contained in common
reserved bits check for sptes, so it can be removed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:26 +0000 (12:04 +0800)]
KVM: MMU: fully check zero bits for sptes
The #PF with PFEC.RSV = 1 is designed to speed MMIO emulation, however,
it is possible that the RSV #PF is caused by real BUG by mis-configure
shadow page table entries
This patch enables full check for the zero bits on shadow page table
entries (which includes not only bits reserved by the hardware, but also
bits that will never be set in the SPTE), then dump the shadow page table
hierarchy.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:25 +0000 (12:04 +0800)]
KVM: MMU: introduce is_shadow_zero_bits_set()
We have the same data struct to check reserved bits on guest page tables
and shadow page tables, split is_rsvd_bits_set() so that the logic can be
shared between these two paths
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:24 +0000 (12:04 +0800)]
KVM: MMU: introduce the framework to check zero bits on sptes
We have abstracted the data struct and functions which are used to check
reserved bit on guest page tables, now we extend the logic to check
zero bits on shadow page tables
The zero bits on sptes include not only reserved bits on hardware but also
the bits that SPTEs willnever use. For example, shadow pages will never
use GB pages unless the guest uses them too.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:23 +0000 (12:04 +0800)]
KVM: MMU: split reset_rsvds_bits_mask_ept
Since shadow ept page tables and Intel nested guest page tables have the
same format, split reset_rsvds_bits_mask_ept so that the logic can be
reused by later patches which check zero bits on sptes
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:22 +0000 (12:04 +0800)]
KVM: MMU: split reset_rsvds_bits_mask
Since softmmu & AMD nested shadow page tables and guest page tables have
the same format, split reset_rsvds_bits_mask so that the logic can be
reused by later patches which check zero bits on sptes
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:21 +0000 (12:04 +0800)]
KVM: MMU: introduce rsvd_bits_validate
These two fields, rsvd_bits_mask and bad_mt_xwr, in "struct kvm_mmu" are
used to check if reserved bits set on guest ptes, move them to a data
struct so that the approach can be applied to check host shadow page
table entries as well
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:20 +0000 (12:04 +0800)]
KVM: MMU: move FNAME(is_rsvd_bits_set) to mmu.c
FNAME(is_rsvd_bits_set) does not depend on guest mmu mode, move it
to mmu.c to stop being compiled multiple times
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 5 Aug 2015 04:04:19 +0000 (12:04 +0800)]
KVM: MMU: fix validation of mmio page fault
We got the bug that qemu complained with "KVM: unknown exit, hardware
reason 31" and KVM shown these info:
[84245.284948] EPT: Misconfiguration.
[84245.285056] EPT: GPA: 0xfeda848
[84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4
[84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3
[84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2
[84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1
This is because we got a mmio #PF and the handler see the mmio spte becomes
normal (points to the ram page)
However, this is valid after introducing fast mmio spte invalidation which
increases the generation-number instead of zapping mmio sptes, a example
is as follows:
1. QEMU drops mmio region by adding a new memslot
2. invalidate all mmio sptes
3.
VCPU 0 VCPU 1
access the invalid mmio spte
access the region originally was MMIO before
set the spte to the normal ram map
mmio #PF
check the spte and see it becomes normal ram mapping !!!
This patch fixes the bug just by dropping the check in mmio handler, it's
good for backport. Full check will be introduced in later patches
Reported-by: Pavel Shirshov <ru.pchel@gmail.com>
Tested-by: Pavel Shirshov <ru.pchel@gmail.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alex Williamson [Tue, 4 Aug 2015 16:58:26 +0000 (10:58 -0600)]
KVM: MTRR: Use default type for non-MTRR-covered gfn before WARN_ON
The patch was munged on commit to re-order these tests resulting in
excessive warnings when trying to do device assignment. Return to
original ordering: https://lkml.org/lkml/2015/7/15/769
Fixes: 3e5d2fdceda1 ("KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fan Zhang [Wed, 13 May 2015 08:58:41 +0000 (10:58 +0200)]
KVM: s390: host STP toleration for VMs
If the host has STP enabled, the TOD of the host will be changed during
synchronization phases. These are performed during a stop_machine() call.
As the guest TOD is based on the host TOD, we have to make sure that:
- no VCPU is in the SIE (implicitly guaranteed via stop_machine())
- manual guest TOD calculations are not affected
"Epoch" is the guest TOD clock delta to the host TOD clock. We have to
adjust that value during the STP synchronization and make sure that code
that accesses the epoch won't get interrupted in between (via disabling
preemption).
Signed-off-by: Fan Zhang <zhangfan@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Paolo Bonzini [Wed, 29 Jul 2015 10:31:15 +0000 (12:31 +0200)]
KVM: x86: clean/fix memory barriers in irqchip_in_kernel
The memory barriers are trying to protect against concurrent RCU-based
interrupt injection, but the IRQ routing table is not valid at the time
kvm->arch.vpic is written. Fix this by writing kvm->arch.vpic last.
kvm_destroy_pic then need not set kvm->arch.vpic to NULL; modify it
to take a struct kvm_pic* and reuse it if the IOAPIC creation fails.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 09:32:20 +0000 (11:32 +0200)]
KVM: document memory barriers for kvm->vcpus/kvm->online_vcpus
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 09:06:34 +0000 (11:06 +0200)]
KVM: x86: remove unnecessary memory barriers for shared MSRs
There is no smp_rmb matching the smp_wmb. shared_msr_update is called from
hardware_enable, which in turn is called via on_each_cpu. on_each_cpu
and must imply a read memory barrier (on x86 the rmb is achieved simply
through asm volatile in native_apic_mem_write).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 09:56:48 +0000 (11:56 +0200)]
KVM: move code related to KVM_SET_BOOT_CPU_ID to x86
This is another remnant of ia64 support.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 10:53:58 +0000 (12:53 +0200)]
Merge tag 'kvm-s390-next-
20150728' of git://git./linux/kernel/git/kvms390/linux into kvm-next
KVM: s390: Fixes and features for kvm/next (4.3)
1. Rework logging infrastructure (s390dbf) to integrate feedback learned
when debugging performance and test issues
2. Some cleanups and simplifications for CMMA handling
3. Fix gdb debugging and single stepping on some instructions
4. Error handling for storage key setup
Christian Borntraeger [Wed, 22 Jul 2015 13:52:10 +0000 (15:52 +0200)]
KVM: s390: log capability enablement and vm attribute changes
Depending on user space, some capabilities and vm attributes are
enabled at runtime. Let's log those events and while we're at it,
log querying the vm attributes as well.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Wed, 22 Jul 2015 13:50:58 +0000 (15:50 +0200)]
KVM: s390: Provide global debug log
In addition to the per VM debug logs, let's provide a global
one for KVM-wide events, like new guests or fatal errors.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Christian Borntraeger [Tue, 21 Jul 2015 10:44:57 +0000 (12:44 +0200)]
KVM: s390: adapt debug entries for instruction handling
Use the default log level 3 for state changing and/or seldom events,
use 4 for others. Also change some numbers from %x to %d and vice versa
to match documentation. If hex, let's prepend the numbers with 0x.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Christian Borntraeger [Mon, 20 Jul 2015 13:04:48 +0000 (15:04 +0200)]
KVM: s390: improve debug feature usage
We do not use the exception logger, so the 2nd area is unused.
Just have one area that is bigger (32 pages).
At the same time we can limit the debug feature size to 7
longs, as the largest user has 3 parameters + string + boiler
plate (vCPU, PSW mask, PSW addr)
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Christian Borntraeger [Mon, 20 Jul 2015 20:05:55 +0000 (22:05 +0200)]
KVM: s390: remove outdated documentation
The old Documentation/s390/kvm.txt file is either
outdated or described in Documentation/virtual/kvm/api.txt.
Let's get rid of it.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
David Hildenbrand [Fri, 21 Nov 2014 12:45:08 +0000 (13:45 +0100)]
KVM: s390: more irq names for trace events
This patch adds names for missing irq types to the trace events.
In order to identify adapter irqs, the define is moved from
interrupt.c to the other basic irq defines in uapi/linux/kvm.h.
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Thu, 9 Jul 2015 12:08:18 +0000 (14:08 +0200)]
KVM: s390: Fixup interrupt vcpu event messages and levels
This reworks the debug logging for interrupt related logs.
Several changes:
- unify program int/irq
- improve decoding (e.g. use mcic instead of parm64 for machine
check injection)
- remove useless interrupt type number (the name is enough)
- rename "interrupt:" to "deliver:" as the other side is called "inject"
- use log level 3 for state changing and/or seldom events (like machine
checks, restart..)
- use log level 4 for frequent events
- use 0x prefix for hex numbers
- add pfault done logging
- move some tracing outside spinlock
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Christian Borntraeger [Thu, 16 Jul 2015 08:24:07 +0000 (10:24 +0200)]
KVM: s390: add more debug data for the pfault diagnoses
We're not only interested in the address of the control block, but
also in the requested subcommand and for the token subcommand, in the
specified token address and masks.
Suggested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
David Hildenbrand [Mon, 20 Jul 2015 08:33:03 +0000 (10:33 +0200)]
KVM: s390: remove "from (user|kernel)" from irq injection messages
The "from user"/"from kernel" part of the log/trace messages is not
always correct anymore and therefore not really helpful.
Let's remove that part from the log + trace messages. For program
interrupts, we can now move the logging/tracing part into the real
injection function, as already done for the other injection functions.
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Fri, 10 Jul 2015 13:27:20 +0000 (15:27 +0200)]
KVM: s390: VCPU_EVENT cleanup for prefix changes
SPX (SET PREFIX) and SIGP (Set prefix) can change the prefix
register of a CPU. As sigp set prefix may be handled in user
space (KVM_CAP_S390_USER_SIGP), we would not log the changes
triggered via SIGP in that case. Let's have just one VCPU_EVENT
at the central location that tracks prefix changes.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Christian Borntraeger [Thu, 9 Jul 2015 11:43:41 +0000 (13:43 +0200)]
KVM: s390: Improve vcpu event debugging for diagnoses
Let's add a vcpu event for the page reference handling and change
the default debugging level for the ipl diagnose. Both are not
frequent AND change the global state, so lets log them always.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Tue, 7 Jul 2015 13:19:32 +0000 (15:19 +0200)]
KVM: s390: add kvm stat counter for all diagnoses
Sometimes kvm stat counters are the only performance metric to check
after something went wrong. Let's add additional counters for some
diagnoses.
In addition do the count for diag 10 all the time, even if we inject
a program interrupt.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Dominik Dingel [Thu, 18 Jun 2015 11:17:11 +0000 (13:17 +0200)]
KVM: s390: only reset CMMA state if it was enabled before
There is no point in resetting the CMMA state if it was never enabled.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Dominik Dingel [Thu, 7 May 2015 13:41:57 +0000 (15:41 +0200)]
KVM: s390: clean up cmma_enable check
As we already only enable CMMA when userspace requests it, we can
safely move the additional checks to the request handler and avoid
doing them multiple times. This also tells userspace if CMMA is
available.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
David Hildenbrand [Tue, 23 Jun 2015 20:49:36 +0000 (22:49 +0200)]
KVM: s390: filter space-switch events when PER is enforced
When guest debugging is active, space-switch events might be enforced
by PER. While the PER events are correctly filtered out,
space-switch-events could be forwarded to the guest, although from a
guest point of view, they should not have been reported.
Therefore we have to filter out space-switch events being concurrently
reported with a PER event, if the PER event got filtered out. To do so,
we theoretically have to know which instruction was responsible for the
event. As the applicable instructions modify the PSW address, the
address space set in the PSW and even the address space in cr1, we
can't figure out the instruction that way.
For this reason, we have to rely on the information about the old and
new address space, in order to guess the responsible instruction type
and do appropriate checks for space-switch events.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Dominik Dingel [Thu, 7 May 2015 13:16:13 +0000 (15:16 +0200)]
KVM: s390: propagate error from enable storage key
As enabling storage keys might fail, we should forward the error.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Paolo Bonzini [Fri, 10 Jul 2015 11:32:13 +0000 (13:32 +0200)]
KVM: svm: handle KVM_X86_QUIRK_CD_NW_CLEARED in svm_get_mt_mask
We can disable CD unconditionally when there is no assigned device.
KVM now forces guest PAT to all-writeback in that case, so it makes
sense to also force CR0.CD=0.
When there are assigned devices, emulate cache-disabled operation
through the page tables. This behavior is consistent with VMX
microcode, where CD/NW are not touched by vmentry/vmexit. However,
keep this dependent on the quirk because OVMF enables the caches
too late.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mihai Donțu [Sun, 5 Jul 2015 17:08:57 +0000 (20:08 +0300)]
kvm/x86: add support for MONITOR_TRAP_FLAG
Allow a nested hypervisor to single step its guests.
Signed-off-by: Mihai Donțu <mihai.dontu@gmail.com>
[Fix overlong line. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Fri, 3 Jul 2015 12:01:41 +0000 (15:01 +0300)]
kvm/x86: add sending hyper-v crash notification to user space
Sending of notification is done by exiting vcpu to user space
if KVM_REQ_HV_CRASH is enabled for vcpu. At exit to user space
the kvm_run structure contains system_event with type
KVM_SYSTEM_EVENT_CRASH to notify about guest crash occurred.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Fri, 3 Jul 2015 12:01:37 +0000 (15:01 +0300)]
kvm/x86: added hyper-v crash msrs into kvm hyperv context
Added kvm Hyper-V context hv crash variables as storage
of Hyper-V crash msrs.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Fri, 3 Jul 2015 12:01:35 +0000 (15:01 +0300)]
kvm: introduce vcpu_debug = kvm_debug + vcpu context
vcpu_debug is useful macro like kvm_debug but additionally
includes vcpu context inside output.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Fri, 3 Jul 2015 12:01:34 +0000 (15:01 +0300)]
kvm/x86: move Hyper-V MSR's/hypercall code into hyperv.c file
This patch introduce Hyper-V related source code file - hyperv.c and
per vm and per vcpu hyperv context structures.
All Hyper-V MSR's and hypercall code moved into hyperv.c.
All Hyper-V kvm/vcpu fields moved into appropriate hyperv context
structures. Copyrights and authors information copied from x86.c
to hyperv.c.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eugene Korenevsky [Fri, 17 Apr 2015 02:22:21 +0000 (02:22 +0000)]
KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions
According to Intel SDM several checks must be applied for memory operands
of VMX instructions.
Long mode: #GP(0) or #SS(0) depending on the segment must be thrown
if the memory address is in a non-canonical form.
Protected mode, checks in chronological order:
- The segment type must be checked with access type (read or write) taken
into account.
For write access: #GP(0) must be generated if the destination operand
is located in a read-only data segment or any code segment.
For read access: #GP(0) must be generated if if the source operand is
located in an execute-only code segment.
- Usability of the segment must be checked. #GP(0) or #SS(0) depending on the
segment must be thrown if the segment is unusable.
- Limit check. #GP(0) or #SS(0) depending on the segment must be
thrown if the memory operand effective address is outside the segment
limit.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 23 Jul 2015 06:24:42 +0000 (08:24 +0200)]
KVM: x86: rename quirk constants to KVM_X86_QUIRK_*
Make them clearly architecture-dependent; the capability is valid for
all architectures, but the argument is not.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 15 Jul 2015 19:25:56 +0000 (03:25 +0800)]
KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
OVMF depends on WB to boot fast, because it only clears caches after
it has set up MTRRs---which is too late.
Let's do writeback if CR0.CD is set to make it happy, similar to what
SVM is already doing.
Signed-off-by: Xiao Guangrong <guangrong.xiao@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 23 Jul 2015 06:22:45 +0000 (08:22 +0200)]
KVM: x86: introduce kvm_check_has_quirk
The logic of the disabled_quirks field usually results in a double
negation. Wrap it in a simple function that checks the bit and
negates it.
Based on a patch from Xiao Guangrong.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 15 Jul 2015 19:25:55 +0000 (03:25 +0800)]
KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
kvm_mtrr_get_guest_memory_type never returns -1 which is implied
in the current code since if @type = -1 (means no MTRR contains the
range), iter.partial_map must be true
Simplify the code to indicate this fact
Signed-off-by: Xiao Guangrong <guangrong.xiao@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 15 Jul 2015 19:25:54 +0000 (03:25 +0800)]
KVM: MTRR: fix memory type handling if MTRR is completely disabled
Currently code uses default memory type if MTRR is fully disabled,
fix it by using UC instead.
Signed-off-by: Xiao Guangrong <guangrong.xiao@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Wed, 22 Jul 2015 21:45:25 +0000 (14:45 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Don't use shared bluetooth antenna in iwlwifi driver for management
frames, from Emmanuel Grumbach.
2) Fix device ID check in ath9k driver, from Felix Fietkau.
3) Off by one in xen-netback BUG checks, from Dan Carpenter.
4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.
5) Fix races in setting peeked bit flag in SKBs during datagram
receive. If it's shared we have to clone it otherwise the value can
easily be corrupted. Fix from Herbert Xu.
6) Revert fec clock handling change, causes regressions. From Fabio
Estevam.
7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
Cong.
8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
from WANG Cong and Konstantin Khlebnikov.
9) Memory leak in act_bpf packet action, from Alexei Starovoitov.
10) ARM bpf JIT bug fixes from Nicolas Schichan.
11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
Michael S Tsirkin.
12) Destruction of bond with different ARP header types not handled
correctly, fix from Nikolay Aleksandrov.
13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
regressions because the GRO packets created cannot be processed
properly on the GSO side if we forward the frame. From Herbert Xu.
14) TCCR update race and other fixes to ravb driver from Sergei
Shtylyov.
15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.
16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.
17) Make sure AF_PACKET sees properly IP headers in defragmented frames
(via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.
18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
ravb: fix ring memory allocation
net: phy: dp83867: Fix warning check for setting the internal delay
openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
netlink: don't hold mutex in rcu callback when releasing mmapd ring
ARM: net: fix vlan access instructions in ARM JIT.
ARM: net: handle negative offsets in BPF JIT.
ARM: net: fix condition for load_order > 0 when translating load instructions.
tcp: suppress a division by zero warning
drivers: net: cpsw: remove tx event processing in rx napi poll
inet: frags: fix defragmented packet's IP header for af_packet
net: mvneta: fix refilling for Rx DMA buffers
stmmac: fix setting of driver data in stmmac_dvr_probe
sched: cls_flow: fix panic on filter replace
sched: cls_flower: fix panic on filter replace
sched: cls_bpf: fix panic on filter replace
net/mdio: fix mdio_bus_match for c45 PHY
net: ratelimit warnings about dst entry refcount underflow or overflow
caif: fix leaks and race in caif_queue_rcv_skb()
qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
ravb: fix race updating TCCR
...
Linus Torvalds [Wed, 22 Jul 2015 15:52:42 +0000 (08:52 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull ARM64 fixes from Catalin Marinas:
- arm64 build fix following the move of the thread_struct to the end of
task_struct and the asm offsets becoming too large for the AArch64
ISA
- preparatory patch for moving irq_data struct members (applied now to
reduce dependency for the next merging window)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
ARM64/irq: Use access helper irq_data_get_affinity_mask()
arm64: switch_to: calculate cpu context pointer using separate register
Jiang Liu [Mon, 13 Jul 2015 20:30:04 +0000 (20:30 +0000)]
ARM64/irq: Use access helper irq_data_get_affinity_mask()
This is a preparatory patch for moving irq_data struct members.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Will Deacon [Mon, 20 Jul 2015 14:14:53 +0000 (15:14 +0100)]
arm64: switch_to: calculate cpu context pointer using separate register
Commit
0c8c0f03e3a2 ("x86/fpu, sched: Dynamically allocate 'struct fpu'")
moved the thread_struct to the bottom of task_struct. As a result, the
offset is now too large to be used in an immediate add on arm64 with
some kernel configs:
arch/arm64/kernel/entry.S: Assembler messages:
arch/arm64/kernel/entry.S:588: Error: immediate out of range
arch/arm64/kernel/entry.S:597: Error: immediate out of range
This patch calculates the offset using an additional register instead of
an immediate offset.
Fixes: 0c8c0f03e3a2 ("x86/fpu, sched: Dynamically allocate 'struct fpu'")
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Sergei Shtylyov [Tue, 21 Jul 2015 22:31:59 +0000 (01:31 +0300)]
ravb: fix ring memory allocation
The driver is written as if it can adapt to a low memory situation allocating
less RX skbs and TX aligned buffers than the respective RX/TX ring sizes. In
reality though the driver would malfunction in this case. Stop being overly
smart and just fail in such situation -- this is achieved by moving the memory
allocation from ravb_ring_format() to ravb_ring_init().
We leave dma_map_single() calls in place but make their failure non-fatal
by marking the corresponding RX descriptors with zero data size which should
prevent DMA to an invalid addresses.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Murphy [Tue, 21 Jul 2015 17:06:45 +0000 (12:06 -0500)]
net: phy: dp83867: Fix warning check for setting the internal delay
Fix warning: logical ‘or’ of collectively exhaustive tests is always true
Change the internal delay check from an 'or' condition to an 'and'
condition.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris J Arges [Tue, 21 Jul 2015 17:36:33 +0000 (12:36 -0500)]
openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
Some architectures like POWER can have a NUMA node_possible_map that
contains sparse entries. This causes memory corruption with openvswitch
since it allocates flow_cache with a multiple of num_possible_nodes() and
assumes the node variable returned by for_each_node will index into
flow->stats[node].
Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().
The crash was noticed after
3af229f2 was applied as it changed the
node_possible_map to match node_online_map on boot.
Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 21 Jul 2015 14:33:50 +0000 (16:33 +0200)]
netlink: don't hold mutex in rcu callback when releasing mmapd ring
Kirill A. Shutemov says:
This simple test-case trigers few locking asserts in kernel:
int main(int argc, char **argv)
{
unsigned int block_size = 16 * 4096;
struct nl_mmap_req req = {
.nm_block_size = block_size,
.nm_block_nr = 64,
.nm_frame_size = 16384,
.nm_frame_nr = 64 * block_size / 16384,
};
unsigned int ring_size;
int fd;
fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1);
ring_size = req.nm_block_nr * req.nm_block_size;
mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
return 0;
}
+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (reboot_mutex){+.+...}, at: [<
ffffffff81080959>] SyS_reboot+0xa9/0x220
#1: ((reboot_notifier_list).rwsem){.+.+..}, at: [<
ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70
#2: (rcu_callback){......}, at: [<
ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[<
ffffffff8145365f>] __delay+0xf/0x20
CPU: 1 PID: 1 Comm: init Not tainted
4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
<IRQ> [<
ffffffff81929ceb>] dump_stack+0x4f/0x7b
[<
ffffffff81085a9d>] ___might_sleep+0x16d/0x270
[<
ffffffff81085bed>] __might_sleep+0x4d/0x90
[<
ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430
[<
ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[<
ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20
[<
ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350
[<
ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70
[<
ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150
[<
ffffffff817e484d>] __sk_free+0x1d/0x160
[<
ffffffff817e49a9>] sk_free+0x19/0x20
[..]
Cong Wang says:
We can't hold mutex lock in a rcu callback, [..]
Thomas Graf says:
The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Diagnosed-by: Cong Wang <cwang@twopensource.com>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 22 Jul 2015 05:19:55 +0000 (22:19 -0700)]
Merge branch 'arm-bpf-fixes'
Nicolas Schichan says:
====================
BPF JIT fixes for ARM
These patches are fixing bugs in the ARM JIT and should probably find
their way to a stable kernel. All 60 test_bpf tests in Linux 4.1 release
are now passing OK (was 54 out of 60 before).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 21 Jul 2015 12:14:14 +0000 (14:14 +0200)]
ARM: net: fix vlan access instructions in ARM JIT.
This makes BPF_ANC | SKF_AD_VLAN_TAG and BPF_ANC | SKF_AD_VLAN_TAG_PRESENT
have the same behaviour as the in kernel VM and makes the test_bpf LD_VLAN_TAG
and LD_VLAN_TAG_PRESENT tests pass.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 21 Jul 2015 12:14:13 +0000 (14:14 +0200)]
ARM: net: handle negative offsets in BPF JIT.
Previously, the JIT would reject negative offsets known during code
generation and mishandle negative offsets provided at runtime.
Fix that by calling bpf_internal_load_pointer_neg_helper()
appropriately in the jit_get_skb_{b,h,w} slow path helpers and by forcing
the execution flow to the slow path helpers when the offset is
negative.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan [Tue, 21 Jul 2015 12:14:12 +0000 (14:14 +0200)]
ARM: net: fix condition for load_order > 0 when translating load instructions.
To check whether the load should take the fast path or not, the code
would check that (r_skb_hlen - load_order) is greater than the offset
of the access using an "Unsigned higher or same" condition. For
halfword accesses and an skb length of 1 at offset 0, that test is
valid, as we end up comparing 0xffffffff(-1) and 0, so the fast path
is taken and the filter allows the load to wrongly succeed. A similar
issue exists for word loads at offset 0 and an skb length of less than
4.
Fix that by using the condition "Signed greater than or equal"
condition for the fast path code for load orders greater than 0.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 22 Jul 2015 05:02:00 +0000 (07:02 +0200)]
tcp: suppress a division by zero warning
Andrew Morton reported following warning on one ARM build
with gcc-4.4 :
net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero
Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.
Remove the warning by using a temporary variable.
Fixes: 095dc8e0c368 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 21 Jul 2015 23:06:53 +0000 (16:06 -0700)]
Revert "fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()"
This reverts commit
a2673b6e040663bf16a552f8619e6bde9f4b9acf.
Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms:
"Thanks for report! You are right that my patch introduces a race
between fsnotify kthread and fsnotify_destroy_group() which can result
in leaking inotify event on group destruction.
I haven't yet decided whether the right fix is not to queue events for
dying notification group (as that is pointless anyway) or whether we
should just fix the original problem differently... Whenever I look
at fsnotify code mark handling I get lost in the maze of locks, lists,
and subtle differences between how different notification systems
handle notification marks :( I'll think about it over night"
and after thinking about it, Jan says:
"OK, I have looked into the code some more and I found another
relatively simple way of fixing the original oops. It will be IMHO
better than trying to fixup this issue which has more potential for
breakage. I'll ask Linus to revert the fsnotify fix he already merged
and send a new fix"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Requested-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David S. Miller [Tue, 21 Jul 2015 23:06:39 +0000 (16:06 -0700)]
Merge tag 'wireless-drivers-for-davem-2015-07-20' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
ath9k:
* fix device ID check for AR956x
iwlwifi:
* bug fixes specific for 8000 series
* fix a crash in time events
* fix a crash in PCIe transport
* fix BT Coex code that prevented association on certain
devices (3160).
* revert the new RBD allocation model because it introduced
a bug when running on weak VM setups.
* new device IDs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 21 Jul 2015 22:27:27 +0000 (15:27 -0700)]
Merge tag 'pinctrl-v4.2-2' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here are some overly ripe pin control fixes for the v4.2 series.
They got delayed because of various crap commits and having to clean
and rinse the patch stack a few times. Now they are however looking
good.
- some dead defines dropped from the Samsung driver, was targeted for
-rc2 but got delayed
- drop the strict mode from abx500, this was too strict
- fix the R-Car sparse IRQs code to work as intended
- fix the IRQ code for the pinctrl-single GPIO backend to not enforce
threaded IRQs
- clear the latched events/IRQs for the Broadcom BCM2835 driver
- fix up debugfs for the Freescale imx1 driver
- fix a typo bug in the Schmitt Trigger setup in the LPC18xx driver"
* tag 'pinctrl-v4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: lpc18xx: fix schmitt trigger setup
Subject: pinctrl: imx1-core: Fix debug output in .pin_config_set callback
pinctrl: bcm2835: Clear the event latch register when disabling interrupts
pinctrl: single: ensure pcs irq will not be forced threaded
sh-pfc: fix sparse GPIOs for R-Car SoCs
pinctrl: abx500: remove strict mode
pinctrl: samsung: Remove old unused defines
Linus Torvalds [Tue, 21 Jul 2015 22:18:06 +0000 (15:18 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
"A fix for UDF corruption when certain disk-format feature is enabled"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Don't corrupt unalloc spacetable when writing it
Linus Torvalds [Tue, 21 Jul 2015 21:42:40 +0000 (14:42 -0700)]
Merge tag 'trace-v4.2-rc2-fix2' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing sample code fix from Steven Rostedt:
"He Kuang noticed that the sample code using the trace_event helper
function __get_dynamic_array_len() is broken.
This only changes the sample code, and I'm pushing this now instead of
later because I don't want others using the broken code as an example
when using it for real"
* tag 'trace-v4.2-rc2-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix sample output of dynamic arrays
Mugunthan V N [Tue, 21 Jul 2015 10:30:42 +0000 (16:00 +0530)]
drivers: net: cpsw: remove tx event processing in rx napi poll
With commit
c03abd84634d ("net: ethernet: cpsw: don't requests IRQs
we don't use") common isr and napi are separated into separate tx isr
and rx isr/napi, but still in rx napi tx events are handled. So removing
the tx event handling in rx napi.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Hyunkoo Jee [Tue, 21 Jul 2015 07:43:59 +0000 (09:43 +0200)]
inet: frags: fix defragmented packet's IP header for af_packet
When ip_frag_queue() computes positions, it assumes that the passed
sk_buff does not contain L2 headers.
However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
functions can be called on outgoing packets that contain L2 headers.
Also, IPv4 checksum is not corrected after reassembly.
Fixes: 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts.")
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Guinot [Sun, 19 Jul 2015 11:00:53 +0000 (13:00 +0200)]
net: mvneta: fix refilling for Rx DMA buffers
With the actual code, if a memory allocation error happens while
refilling a Rx descriptor, then the original Rx buffer is both passed
to the networking stack (in a SKB) and let in the Rx ring. This leads
to various kernel oops and crashes.
As a fix, this patch moves Rx descriptor refilling ahead of building
SKB with the associated Rx buffer. In case of a memory allocation
failure, data is dropped and the original DMA buffer is put back into
the Rx ring.
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit")
Cc: <stable@vger.kernel.org> # v3.8+
Tested-by: Yoann Sculo <yoann@sculo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Fri, 17 Jul 2015 21:48:17 +0000 (23:48 +0200)]
stmmac: fix setting of driver data in stmmac_dvr_probe
Commit
803f8fc46274b ("stmmac: move driver data setting into
stmmac_dvr_probe") mistakenly set priv and not priv->dev as
driver data. This meant that the remove, resume and suspend
callbacks that fetched and tried to use this data would most
likely explode. Fix the issue by using the correct variable.
Fixes: 803f8fc46274b ("stmmac: move driver data setting into stmmac_dvr_probe")
Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 21 Jul 2015 07:25:03 +0000 (00:25 -0700)]
Merge branch 'sch_panic'
Daniel Borkmann says:
====================
Couple of classifier fixes
This fixes a couple of panics in the form of (analogous for
cls_flow{,er}):
[ 912.759276] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 912.759373] IP: [<
ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[ 912.759441] PGD
8783c067 PUD
5f684067 PMD 0
[ 912.759491] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[ 912.759543] Modules linked in: cls_bpf(E) act_gact [...]
[ 912.772734] CPU: 3 PID: 10489 Comm: tc Tainted: G W E 4.2.0-rc2+ #73
[ 912.775004] Hardware name: Apple Inc. MacBookAir5,1/Mac-
66F35F19FE2A0D05, BIOS MBA51.88Z.00EF.B02.
1211271028 11/27/2012
[ 912.777327] task:
ffff88025eaa8000 ti:
ffff88005f734000 task.ti:
ffff88005f734000
[ 912.779662] RIP: 0010:[<
ffffffffa09d4d6d>] [<
ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[ 912.781991] RSP: 0018:
ffff88005f7379c8 EFLAGS:
00010286
[ 912.784183] RAX:
ffff880201d64e48 RBX:
0000000000000000 RCX:
ffff880201d64e40
[ 912.786402] RDX:
0000000000000000 RSI:
ffffffffa09d51c0 RDI:
ffffffffa09d51a6
[ 912.788625] RBP:
ffff88005f737a68 R08:
0000000000000000 R09:
0000000000000000
[ 912.790854] R10:
0000000000000001 R11:
0000000000000001 R12:
ffff880078ab5a80
[ 912.793082] R13:
ffff880232b31570 R14:
ffff88005f737ae0 R15:
ffff8801e215d1d0
[ 912.795181] FS:
00007f3c0c80d740(0000) GS:
ffff880265400000(0000) knlGS:
0000000000000000
[ 912.797281] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 912.799402] CR2:
0000000000000000 CR3:
000000005460f000 CR4:
00000000001407e0
[ 912.799403] Stack:
[ 912.799407]
ffffffff00000000 ffff88023ea18000 000000005f737a08 0000000000000000
[ 912.799415]
ffffffff81f06140 ffff880201d64e40 0000000000000000 ffff88023ea1804c
[ 912.799418]
0000000000000000 ffff88023ea18044 ffff88023ea18030 ffff88023ea18038
[ 912.799418] Call Trace:
[ 912.799437] [<
ffffffff816d5685>] tc_ctl_tfilter+0x335/0x910
[ 912.799443] [<
ffffffff813622a8>] ? security_capable+0x48/0x60
[ 912.799448] [<
ffffffff816b90e5>] rtnetlink_rcv_msg+0x95/0x240
[ 912.799454] [<
ffffffff810f612d>] ? trace_hardirqs_on+0xd/0x10
[ 912.799456] [<
ffffffff816b902f>] ? rtnetlink_rcv+0x1f/0x40
[ 912.799459] [<
ffffffff816b902f>] ? rtnetlink_rcv+0x1f/0x40
[ 912.799461] [<
ffffffff816b9050>] ? rtnetlink_rcv+0x40/0x40
[ 912.799464] [<
ffffffff816df38f>] netlink_rcv_skb+0xaf/0xc0
[ 912.799467] [<
ffffffff816b903e>] rtnetlink_rcv+0x2e/0x40
[ 912.799469] [<
ffffffff816deaef>] netlink_unicast+0xef/0x1b0
[ 912.799471] [<
ffffffff816defa0>] netlink_sendmsg+0x3f0/0x620
[ 912.799476] [<
ffffffff81687028>] sock_sendmsg+0x38/0x50
[ 912.799479] [<
ffffffff81687938>] ___sys_sendmsg+0x288/0x290
[ 912.799482] [<
ffffffff810f7852>] ? __lock_acquire+0x572/0x2050
[ 912.799488] [<
ffffffff810265db>] ? native_sched_clock+0x2b/0x90
[ 912.799493] [<
ffffffff8116135f>] ? __audit_syscall_entry+0xaf/0x100
[ 912.799497] [<
ffffffff8116135f>] ? __audit_syscall_entry+0xaf/0x100
[ 912.799501] [<
ffffffff8112aa19>] ? current_kernel_time+0x69/0xd0
[ 912.799505] [<
ffffffff81266f16>] ? __fget_light+0x66/0x90
[ 912.799508] [<
ffffffff81688812>] __sys_sendmsg+0x42/0x80
[ 912.799510] [<
ffffffff81688862>] SyS_sendmsg+0x12/0x20
[ 912.799515] [<
ffffffff817f9a6e>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 912.799540] Code: 4d 88 49 8b 57 08 48 89 51 08 49 8b 57 10 48 89 c8 48 83 c0 08 48
89 51 10 48 8b 51 10 48 c7 c6 c0 51 9d a0 48 c7 c7 a6 51 9d a0 <48>
89 02 48 8b 51 08 48 89 42 08 48 b8 00 02 20 00 00 00 ad de
[ 912.799544] RIP [<
ffffffffa09d4d6d>] cls_bpf_change+0x23d/0x268 [cls_bpf]
[ 912.799544] RSP <
ffff88005f7379c8>
[ 912.799545] CR2:
0000000000000000
[ 912.807380] ---[ end trace
a6440067cfdc7c29 ]---
I've split them into 3 patches, so they can be backported easier
when needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 17 Jul 2015 20:38:45 +0000 (22:38 +0200)]
sched: cls_flow: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_flow:
tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok
tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
flow hash keys mark action drop
To be more precise, actually two different panics are fixed, the first
occurs because tcf_exts_init() is not called on the newly allocated
filter when we do a replace. And the second panic uncovered after that
happens since the arguments of list_replace_rcu() are swapped, the old
element needs to be the first argument and the new element the second.
Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 17 Jul 2015 20:38:44 +0000 (22:38 +0200)]
sched: cls_flower: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_flower:
tc filter add dev foo parent 1: flower eth_type ipv4 action ok flowid 1:1
tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
flower eth_type ipv6 action ok flowid 1:1
The problem is that commit
77b9900ef53a ("tc: introduce Flower classifier")
accidentally swapped the arguments of list_replace_rcu(), the old
element needs to be the first argument and the new element the second.
Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 17 Jul 2015 20:38:43 +0000 (22:38 +0200)]
sched: cls_bpf: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_bpf:
FOO="1,6 0 0
4294967295,"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
bpf bytecode "$FOO" flowid 1:1 action drop
The problem is that commit
1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
accidentally swapped the arguments of list_replace_rcu(), the old
element needs to be the first argument and the new element the second.
Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 21 Jul 2015 07:17:53 +0000 (00:17 -0700)]
Merge tag 'mac80211-for-davem-2015-07-17' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Some fixes for the current cycle:
1. Arik introduced an rtnl-locked regulatory API to be able
to differentiate between place do/don't have the RTNL;
this fixes missing locking in some of the code paths
2. Two small mesh bugfixes from Bob, one to avoid treating
a certain malformed over-the-air frame and one to avoid
sending a garbage field over the air.
3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
4. A fix for a powersave vs. aggregation teardown race, from Michal.
5. Thomas reduced the loglevel of CRDA messages to avoid spamming
the kernel log with mostly irrelevant information.
6. Tom fixed a dangling debugfs directory pointer that could cause
crashes if subsequent addition of the same interface to debugfs
failed for some reason.
7. A fix from myself for a list corruption issue in mac80211 during
combined interface shutdown/removal - shut down interfaces first
and only then remove them to avoid that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shaohui Xie [Fri, 17 Jul 2015 10:07:19 +0000 (18:07 +0800)]
net/mdio: fix mdio_bus_match for c45 PHY
We store c45 PHY's id information in c45_ids, so it should be used to
check the matching between PHY driver and PHY device for c45 PHY.
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Konstantin Khlebnikov [Fri, 17 Jul 2015 11:01:11 +0000 (14:01 +0300)]
net: ratelimit warnings about dst entry refcount underflow or overflow
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 17 Jul 2015 08:19:23 +0000 (10:19 +0200)]
caif: fix leaks and race in caif_queue_rcv_skb()
1) If sk_filter() is applied, skb was leaked (not freed)
2) Testing SOCK_DEAD twice is racy :
packet could be freed while already queued.
3) Remove obsolete comment about caching skb->len
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reinhard Speyerer [Thu, 16 Jul 2015 21:28:14 +0000 (23:28 +0200)]
qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
Sierra Wireless MC7305/MC7355 with USB ID 1199:9041 also provide a
second QMI/network interface like the MC73xx with USB ID 1199:68c0 on
USB interface #10 when used in the appropriate USB configuration.
Add the corresponding QMI_FIXED_INTF entry to the qmi_wwan driver.
Please note that the second QMI/network interface is not working for
early MC73xx firmware versions like 01.08.x as the device does not
respond to QMI messages on the second /dev/cdc-wdm port.
Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Thu, 16 Jul 2015 21:28:38 +0000 (00:28 +0300)]
ravb: fix race updating TCCR
The TCCR.TSRQn bit may get clearead after TCCR gets read, so that TCCR write
would get skipped. We don't need to check this bit before setting.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karicheri, Muralidharan [Thu, 16 Jul 2015 19:32:14 +0000 (15:32 -0400)]
net: netcp: fix improper initialization in netcp_ndo_open()
The keystone qmss will raise interrupt when packet arrive at the
receive queue. Only control available to avoid interrupt from happening
is to keep the free descriptor queue (FDQ) empty in the receive side.
So the filling of descriptors into the FDQ has to happen after
request_irq() call is made as part of knav_queue_enable_notify(). So
move the function netcp_rxpool_refill() after this call.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Thu, 16 Jul 2015 08:30:02 +0000 (16:30 +0800)]
bonding: correct the MAC address for "follow" fail_over_mac policy
The "follow" fail_over_mac policy is useful for multiport devices that
either become confused or incur a performance penalty when multiple
ports are programmed with the same MAC address, but the same MAC
address still may happened by this steps for this policy:
1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
bond0 has the same mac address with eth0, it is MAC1.
2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
eth1 is backup, eth1 has MAC2.
3) ifconfig eth0 down
eth1 became active slave, bond will swap MAC for eth0 and eth1,
so eth1 has MAC1, and eth0 has MAC2.
4) ifconfig eth1 down
there is no active slave, and eth1 still has MAC1, eth2 has MAC2.
5) ifconfig eth0 up
the eth0 became active slave again, the bond set eth0 to MAC1.
Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same
MAC address, it will break this policy for ACTIVE_BACKUP mode.
This patch will fix this problem by finding the old active slave and
swap them MAC address before change active slave.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 21 Jul 2015 03:25:59 +0000 (20:25 -0700)]
Merge tag 'linux-can-fixes-for-4.2-
20150716' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2015-07-16
this is a pull request of 2 patches by Stefan Agner. He fixes the resume
operation in the mcp251x driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Mon, 20 Jul 2015 09:55:38 +0000 (17:55 +0800)]
Revert "sit: Add gro callbacks to sit_offload"
This patch reverts
19424e052fb44da2f00d1a868cbb51f3e9f4bbb5 ("sit:
Add gro callbacks to sit_offload") because it generates packets
that cannot be handled even by our own GSO.
Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Wed, 15 Jul 2015 23:09:32 +0000 (16:09 -0700)]
net: dsa: bcm_sf2: do not use indirect reads and writes for 7445E0
7445E0 contains an ECO which disconnected the internal SF2 pseudo-PHY which was
known to conflict with the external pseudo-PHY of BCM53125 switches. This
motivated the need to utilize the internal SF2 MDIO controller via indirect
register reads/writes to control external Broadcom switches due to this address
conflict (both responded at address 30d).
For 7445E0, the internal pseudo-PHY of the SF2 switch got disconnected, and as
a consequence this prevents the internal SF2 MDIO bus controller from reading
data (reads back everything as 0) since the MDI line is tied low.
Fix this by making the indirect register reads and writes conditional to
7445D0, on 7445E0 we can utilize the SWITCH_MDIO controller (backed by
mdio-unimac and not the DSA created slave MII bus).
We utilize of_machine_is_compatible() here since this is the only way for use
to differentiate between these two chips in a way that does not violate layers
or becomes (too) vendor-specific.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Wed, 15 Jul 2015 20:57:01 +0000 (22:57 +0200)]
bonding: correctly handle bonding type change on enslave failure
If the bond is enslaving a device with different type it will be setup
by it, but if after being setup the enslave fails the bond doesn't
switch back its type and also keeps pointers to foreign structures that can
be long gone. Thus revert back any type changes if the enslave failed and
the bond had to change its type.
Example:
Before patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default
link/loopback 16:54:78:34:bd:41 brd 00:00:00:00:00:00
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
20: bond0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
(notice the MASTER flag is gone)
After patch:
$ echo lo > bond0/bonding/slaves
-bash: echo: write error: Cannot assign requested address
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
link/ether 6e:66:94:f6:07:fc brd ff:ff:ff:ff:ff:ff
$ echo +eth1 > bond0/bonding/slaves
$ ip l sh bond0
21: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: e36b9d16c6a6 ("bonding: clean muticast addresses when device changes type")
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Wed, 15 Jul 2015 19:52:51 +0000 (21:52 +0200)]
bonding: fix destruction of bond with devices different from arphrd_ether
When the bonding is being unloaded and the netdevice notifier is
unregistered it executes NETDEV_UNREGISTER for each device which should
remove the bond's proc entry but if the device enslaved is not of
ARPHRD_ETHER type and is in front of the bonding, it may execute
bond_release_and_destroy() first which would release the last slave and
destroy the bond device leaving the proc entry and thus we will get the
following error (with dynamic debug on for bond_netdev_event to see the
events order):
[ 908.963051] eql: event: 9
[ 908.963052] eql: IFF_SLAVE
[ 908.963054] eql: event: 2
[ 908.963056] eql: IFF_SLAVE
[ 908.963058] eql: event: 6
[ 908.963059] eql: IFF_SLAVE
[ 908.963110] bond0: Releasing active interface eql
[ 908.976168] bond0: Destroying bond bond0
[ 908.976266] bond0 (unregistering): Released all slaves
[ 908.984097] ------------[ cut here ]------------
[ 908.984107] WARNING: CPU: 0 PID: 1787 at fs/proc/generic.c:575
remove_proc_entry+0x112/0x160()
[ 908.984110] remove_proc_entry: removing non-empty directory
'net/bonding', leaking at least 'bond0'
[ 908.984111] Modules linked in: bonding(-) eql(O) 9p nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev qxl drm_kms_helper
snd_hda_codec_generic aesni_intel ttm aes_x86_64 glue_helper pcspkr lrw
gf128mul ablk_helper cryptd snd_hda_intel virtio_console snd_hda_codec
psmouse serio_raw snd_hwdep snd_hda_core 9pnet_virtio 9pnet evdev joydev
drm virtio_balloon snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core
pvpanic acpi_cpufreq parport_pc parport processor thermal_sys button
autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sg sr_mod cdrom
ata_generic virtio_blk virtio_net floppy ata_piix e1000 libata ehci_pci
virtio_pci scsi_mod uhci_hcd ehci_hcd virtio_ring virtio usbcore
usb_common [last unloaded: bonding]
[ 908.984168] CPU: 0 PID: 1787 Comm: rmmod Tainted: G W O
4.2.0-rc2+ #8
[ 908.984170] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 908.984172]
0000000000000000 ffffffff81732d41 ffffffff81525b34
ffff8800358dfda8
[ 908.984175]
ffffffff8106c521 ffff88003595af78 ffff88003595af40
ffff88003e3a4280
[ 908.984178]
ffffffffa058d040 0000000000000000 ffffffff8106c59a
ffffffff8172ebd0
[ 908.984181] Call Trace:
[ 908.984188] [<
ffffffff81525b34>] ? dump_stack+0x40/0x50
[ 908.984193] [<
ffffffff8106c521>] ? warn_slowpath_common+0x81/0xb0
[ 908.984196] [<
ffffffff8106c59a>] ? warn_slowpath_fmt+0x4a/0x50
[ 908.984199] [<
ffffffff81218352>] ? remove_proc_entry+0x112/0x160
[ 908.984205] [<
ffffffffa05850e6>] ? bond_destroy_proc_dir+0x26/0x30
[bonding]
[ 908.984208] [<
ffffffffa057540e>] ? bond_net_exit+0x8e/0xa0 [bonding]
[ 908.984217] [<
ffffffff8142f407>] ? ops_exit_list.isra.4+0x37/0x70
[ 908.984225] [<
ffffffff8142f52d>] ?
unregister_pernet_operations+0x8d/0xd0
[ 908.984228] [<
ffffffff8142f58d>] ?
unregister_pernet_subsys+0x1d/0x30
[ 908.984232] [<
ffffffffa0585269>] ? bonding_exit+0x23/0xdba [bonding]
[ 908.984236] [<
ffffffff810e28ba>] ? SyS_delete_module+0x18a/0x250
[ 908.984241] [<
ffffffff81086f99>] ? task_work_run+0x89/0xc0
[ 908.984244] [<
ffffffff8152b732>] ?
entry_SYSCALL_64_fastpath+0x16/0x75
[ 908.984247] ---[ end trace
7c006ed4abbef24b ]---
Thus remove the proc entry manually if bond_release_and_destroy() is
used. Because of the checks in bond_remove_proc_entry() it's not a
problem for a bond device to change namespaces (the bug fixed by the
Fixes commit) but since commit
f9399814927ad ("bonding: Don't allow bond devices to change network
namespaces.") that can't happen anyway.
Reported-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: a64d49c3dd50 ("bonding: Manage /proc/net/bonding/ entries from
the netdev events")
Tested-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>