firefly-linux-kernel-4.4.55.git
12 years agoKVM: make bad_pfn static to kvm_main.c
Xiao Guangrong [Tue, 17 Jul 2012 13:54:52 +0000 (21:54 +0800)]
KVM: make bad_pfn static to kvm_main.c

bad_pfn is not used outside kvm_main.c, so mark it static. Also move it near
hwpoison_pfn and fault_pfn.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: using get_fault_pfn to get the fault pfn
Xiao Guangrong [Tue, 17 Jul 2012 13:54:11 +0000 (21:54 +0800)]
KVM: using get_fault_pfn to get the fault pfn

Use get_fault_pfn to clean up the code.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: track the refcount when unmap the page
Xiao Guangrong [Tue, 17 Jul 2012 13:52:52 +0000 (21:52 +0800)]
KVM: MMU: track the refcount when unmap the page

Trigger a WARN_ON if the page has been freed but is still used in the
mmu. This helps us detect mm bugs early.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: x86: remove unnecessary mark_page_dirty
Xiao Guangrong [Tue, 17 Jul 2012 13:50:48 +0000 (21:50 +0800)]
KVM: x86: remove unnecessary mark_page_dirty

fix:
[  132.474633] 3.5.0-rc1+ #50 Not tainted
[  132.474634] -------------------------------
[  132.474635] include/linux/kvm_host.h:369 suspicious rcu_dereference_check() usage!
[  132.474636]
[  132.474636] other info that might help us debug this:
[  132.474636]
[  132.474638]
[  132.474638] rcu_scheduler_active = 1, debug_locks = 1
[  132.474640] 1 lock held by qemu-kvm/2832:
[  132.474657]  #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffa01e1636>] vcpu_load+0x1e/0x91 [kvm]
[  132.474658]
[  132.474658] stack backtrace:
[  132.474660] Pid: 2832, comm: qemu-kvm Not tainted 3.5.0-rc1+ #50
[  132.474661] Call Trace:
[  132.474665]  [<ffffffff81092f40>] lockdep_rcu_suspicious+0xfc/0x105
[  132.474675]  [<ffffffffa01e0c85>] kvm_memslots+0x6d/0x75 [kvm]
[  132.474683]  [<ffffffffa01e0ca1>] gfn_to_memslot+0x14/0x4c [kvm]
[  132.474693]  [<ffffffffa01e3575>] mark_page_dirty+0x17/0x2a [kvm]
[  132.474706]  [<ffffffffa01f21ea>] kvm_arch_vcpu_ioctl+0xbcf/0xc07 [kvm]

We do not actually write vcpu->arch.time at this point, so the
mark_page_dirty call should be removed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
Takuya Yoshikawa [Mon, 2 Jul 2012 08:59:33 +0000 (17:59 +0900)]
KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()

When we invalidate a THP page, we call the handler with the same
rmap_pde argument 512 times in the following loop:

  for each guest page in the range
    for each level
      unmap using rmap

This patch avoids these extra handler calls by changing the loop order
like this:

  for each level
    for each rmap in the range
      unmap using rmap

With the preceding patches in the patch series, this made THP page
invalidation more than 5 times faster on our x86 host: the host became
more responsive while swapping the guest's memory as a result.
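
The effect of the loop reordering on the number of handler calls can be
sketched as below. This is only a counting model, not the kernel code: the
function names are hypothetical, the range is assumed to be aligned, and
each level is assumed to cover 512 entries of the level below it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: each level maps 512 entries of the level below it. */
#define ENTRIES_PER_LEVEL 512

/* Old order: every page visits every level, so a large-page rmap
 * entry is handled once per small page it covers. */
size_t handler_calls_page_major(size_t nr_pages, unsigned int nr_levels)
{
	size_t calls = 0;

	for (size_t page = 0; page < nr_pages; page++)
		for (unsigned int level = 0; level < nr_levels; level++)
			calls++;
	return calls;
}

/* New order: every level visits each of its own rmap entries once. */
size_t handler_calls_level_major(size_t nr_pages, unsigned int nr_levels)
{
	size_t calls = 0;
	size_t pages_per_entry = 1;

	assert(nr_pages > 0);
	for (unsigned int level = 0; level < nr_levels; level++) {
		/* Distinct entries needed to cover an aligned range. */
		calls += (nr_pages + pages_per_entry - 1) / pages_per_entry;
		pages_per_entry *= ENTRIES_PER_LEVEL;
	}
	return calls;
}
```

For one 512-page huge page and three levels, the model gives 1536 calls in
the old order and 514 in the new one.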

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
Takuya Yoshikawa [Mon, 2 Jul 2012 08:58:48 +0000 (17:58 +0900)]
KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()

This restricts the tracing to page aging and makes it possible to
optimize kvm_handle_hva_range() further in the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Add memslot parameter to hva handlers
Takuya Yoshikawa [Mon, 2 Jul 2012 08:57:59 +0000 (17:57 +0900)]
KVM: MMU: Add memslot parameter to hva handlers

This is needed to push trace_kvm_age_page() into kvm_age_rmapp() in the
following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: Separate rmap_pde from kvm_lpage_info->write_count
Takuya Yoshikawa [Mon, 2 Jul 2012 08:57:17 +0000 (17:57 +0900)]
KVM: Separate rmap_pde from kvm_lpage_info->write_count

This makes it possible to loop over rmap_pde arrays in the same way as
we do over rmap so that we can optimize kvm_handle_hva_range() easily in
the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()
Takuya Yoshikawa [Mon, 2 Jul 2012 08:56:33 +0000 (17:56 +0900)]
KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()

When we tested KVM under memory pressure, with THP enabled on the host,
we noticed that MMU notifier took a long time to invalidate huge pages.

Since the invalidation was done with mmu_lock held, it not only wasted
CPU time but also made the host less responsive.

This patch mitigates this by using kvm_handle_hva_range().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Make kvm_handle_hva() handle range of addresses
Takuya Yoshikawa [Mon, 2 Jul 2012 08:55:48 +0000 (17:55 +0900)]
KVM: MMU: Make kvm_handle_hva() handle range of addresses

When guest's memory is backed by THP pages, MMU notifier needs to call
kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
invalidate a range of pages which constitute one huge page:

  for each page
    for each memslot
      if page is in memslot
        unmap using rmap

This means although every page in that range is expected to be found in
the same memslot, we are forced to check unrelated memslots many times.
If the guest has more memslots, the situation will become worse.

Furthermore, if the range does not include any pages in the guest's
memory, the loop over the pages will just consume extra time.

This patch, together with the following patches, solves this problem by
introducing kvm_handle_hva_range() which makes the loop look like this:

  for each memslot
    for each page in memslot
      unmap using rmap

In this new processing, the actual work is converted to a loop over rmap
which is much more cache friendly than before.
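
The two loop shapes above can be compared with a small model. This is a
sketch under stated assumptions, not the kernel implementation: struct slot
and both function names are hypothetical, addresses are in page units, and
"unmap using rmap" is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified memslot: a host-virtual range in page units. */
struct slot {
	unsigned long start;
	unsigned long npages;
};

/* Old shape: for each page, test every memslot. */
unsigned long unmap_page_major(const struct slot *slots, size_t nr_slots,
			       unsigned long start, unsigned long end,
			       unsigned long *checks)
{
	unsigned long unmapped = 0;

	assert(start <= end);
	for (unsigned long page = start; page < end; page++) {
		for (size_t i = 0; i < nr_slots; i++) {
			(*checks)++;
			if (page >= slots[i].start &&
			    page < slots[i].start + slots[i].npages)
				unmapped++;	/* stands in for "unmap using rmap" */
		}
	}
	return unmapped;
}

/* New shape: for each memslot, clamp the range once, then walk its pages. */
unsigned long unmap_slot_major(const struct slot *slots, size_t nr_slots,
			       unsigned long start, unsigned long end,
			       unsigned long *checks)
{
	unsigned long unmapped = 0;

	assert(start <= end);
	for (size_t i = 0; i < nr_slots; i++) {
		unsigned long slot_end = slots[i].start + slots[i].npages;
		unsigned long lo = start > slots[i].start ? start : slots[i].start;
		unsigned long hi = end < slot_end ? end : slot_end;

		(*checks)++;
		for (unsigned long page = lo; page < hi; page++)
			unmapped++;	/* stands in for "unmap using rmap" */
	}
	return unmapped;
}
```

With three memslots and a 512-page range inside one of them, the old shape
performs 1536 slot checks and the new one performs 3.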

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
Takuya Yoshikawa [Mon, 2 Jul 2012 08:54:30 +0000 (17:54 +0900)]
KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()

This restricts hva handling in mmu code and makes it easier to extend
kvm_handle_hva() so that it can treat a range of addresses later in this
patch series.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
Takuya Yoshikawa [Mon, 2 Jul 2012 08:53:25 +0000 (17:53 +0900)]
KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()

We can treat every level uniformly.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoRevert "apic: fix kvm build on UP without IOAPIC"
Michael S. Tsirkin [Sun, 15 Jul 2012 12:56:58 +0000 (15:56 +0300)]
Revert "apic: fix kvm build on UP without IOAPIC"

This reverts commit f9808b7fd422b965cea52e05ba470e0a473c53d3.
After commit 'kvm: switch to apic_set_eoi_write, apic_write'
the stubs are no longer needed as kvm does not look at apicdrivers anymore.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM guest: switch to apic_set_eoi_write, apic_write
Michael S. Tsirkin [Sun, 15 Jul 2012 12:56:52 +0000 (15:56 +0300)]
KVM guest: switch to apic_set_eoi_write, apic_write

Use apic_set_eoi_write, apic_write to avoid meddling in core apic
driver data structures directly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoapic: add apic_set_eoi_write for PV use
Michael S. Tsirkin [Sun, 15 Jul 2012 12:56:46 +0000 (15:56 +0300)]
apic: add apic_set_eoi_write for PV use

KVM PV EOI optimization overrides eoi_write apic op with its own
version. Add an API for this to avoid meddling with core x86 apic driver
data structures directly.

For KVM use, we don't need any guarantees about when the switch to the
new op will take place, so it could in theory use this API after SMP init,
but it currently doesn't, and restricting callers to early init makes it
clear that it's safe as it won't race with actual APIC driver use.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoMerge branch 'for-upstream' of git://github.com/agraf/linux-2.6 into next
Avi Kivity [Sun, 15 Jul 2012 09:41:47 +0000 (12:41 +0300)]
Merge branch 'for-upstream' of git://github.com/agraf/linux-2.6 into next

ppc queue from Alex Graf:

 * Prepare some of the booke code for 64 bit support
 * BookE: Fix ESR flag in DSI
 * BookE: Add rfci emulation

* 'for-upstream' of git://github.com/agraf/linux-2.6:
  KVM: PPC: Critical interrupt emulation support
  KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
  KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
  KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
  KVM: PPC: bookehv64: Add support for std/ld emulation.
  booke: Added crit/mc exception handler for e500v2
  booke/bookehv: Add host crit-watchdog exception support

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Implement PCID/INVPCID for guests with EPT
Mao, Junjie [Mon, 2 Jul 2012 01:18:48 +0000 (01:18 +0000)]
KVM: VMX: Implement PCID/INVPCID for guests with EPT

This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
Prarit Bhargava [Fri, 6 Jul 2012 17:47:39 +0000 (13:47 -0400)]
KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check

While debugging I noticed that, unlike all the other hypervisor code in the
kernel, kvm does not have an entry for x86_hyper. That entry is used in
detect_hypervisor_platform(), which results in a nice printk in the
syslog.  This is only really a stub function, but it
does make kvm more consistent with the other hypervisors.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: PPC: Critical interrupt emulation support
Bharat Bhushan [Wed, 27 Jun 2012 19:37:31 +0000 (19:37 +0000)]
KVM: PPC: Critical interrupt emulation support

rfci instruction and CSRR0/1 registers are emulated.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Stuart Yoder <stuart.yoder@freescale.com>
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agoKVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
Mihai Caraman [Mon, 25 Jun 2012 02:26:26 +0000 (02:26 +0000)]
KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests

tlbilxva emulation was using a u32 variable for guest effective address.
Replace it with gva_t type to handle 64-bit guests.
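
A minimal sketch of the truncation this fixes. The function names are
hypothetical, and gva_t is assumed here to be a 64-bit type; the real
emulation code is not reproduced.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gva_t;	/* assumption: 64-bit guest virtual address */

/* Pre-fix shape: the effective address passes through a 32-bit variable. */
gva_t ea_through_u32(gva_t ea)
{
	uint32_t truncated = (uint32_t)ea;	/* upper 32 bits are lost */

	return truncated;
}

/* Post-fix shape: the address keeps its full width. */
gva_t ea_through_gva(gva_t ea)
{
	gva_t full = ea;

	assert(sizeof(full) == 8);
	return full;
}
```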

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agoKVM: PPC64: booke: Set interrupt computation mode for 64-bit host
Mihai Caraman [Mon, 25 Jun 2012 02:26:19 +0000 (02:26 +0000)]
KVM: PPC64: booke: Set interrupt computation mode for 64-bit host

64-bit host needs to remain in 64-bit mode when an exception takes place.
Set interrupt computation mode in EPCR register.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agoKVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
Mihai Caraman [Fri, 22 Jun 2012 13:33:12 +0000 (13:33 +0000)]
KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt

ESR register is required by Data Storage Interrupt handling code.
Add the specific flag to the interrupt handler.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agoKVM: PPC: bookehv64: Add support for std/ld emulation.
Varun Sethi [Mon, 18 Jun 2012 12:14:55 +0000 (12:14 +0000)]
KVM: PPC: bookehv64: Add support for std/ld emulation.

Add support for std/ld emulation.

Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agobooke: Added crit/mc exception handler for e500v2
Bharat Bhushan [Wed, 20 Jun 2012 05:56:54 +0000 (05:56 +0000)]
booke: Added crit/mc exception handler for e500v2

Watchdog is taken at critical exception level. So this patch
is tested with host watchdog exception happening when guest
is running.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agobooke/bookehv: Add host crit-watchdog exception support
Bharat Bhushan [Wed, 20 Jun 2012 05:56:53 +0000 (05:56 +0000)]
booke/bookehv: Add host crit-watchdog exception support

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
12 years agoKVM: MMU: document mmu-lock and fast page fault
Xiao Guangrong [Wed, 20 Jun 2012 08:00:26 +0000 (16:00 +0800)]
KVM: MMU: document mmu-lock and fast page fault

Document fast page fault and mmu-lock in locking.txt

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
Xiao Guangrong [Wed, 20 Jun 2012 08:00:00 +0000 (16:00 +0800)]
KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint

The P bit of the page fault error code is missing from this tracepoint. Fix it
by passing the full error code.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: trace fast page fault
Xiao Guangrong [Wed, 20 Jun 2012 07:59:41 +0000 (15:59 +0800)]
KVM: MMU: trace fast page fault

Trace what happens on this path to help us optimize it.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: fast path of handling guest page fault
Xiao Guangrong [Wed, 20 Jun 2012 07:59:18 +0000 (15:59 +0800)]
KVM: MMU: fast path of handling guest page fault

If the present bit of the page fault error code is set, the shadow page
is populated on all levels. In that case all we need to do is modify
the access bit, which can be done outside mmu-lock.

Currently, in order to simplify the code, we only fix the page fault
caused by write-protect on the fast path
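
A lockless update of this kind is typically done with a compare-and-exchange
on the spte. The sketch below uses C11 atomics in place of the kernel's
primitives. The bit positions and the function name are hypothetical, and
SPTE_MMU_WRITEABLE is taken from the earlier patch in this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for illustration only. */
#define SPTE_PRESENT		(UINT64_C(1) << 0)
#define SPTE_WRITABLE		(UINT64_C(1) << 1)
#define SPTE_MMU_WRITEABLE	(UINT64_C(1) << 62)

/*
 * Try to fix a write-protect fault without taking a lock.
 * Returns true if the spte was made writable by this call.
 */
bool fast_fix_write_fault(_Atomic uint64_t *sptep)
{
	uint64_t old;

	assert(sptep != NULL);
	old = atomic_load(sptep);

	/* Only present sptes that the MMU allows to be writable qualify. */
	if (!(old & SPTE_PRESENT) || !(old & SPTE_MMU_WRITEABLE))
		return false;

	/* Fails, leaving the spte untouched, if it changed in the meantime. */
	return atomic_compare_exchange_strong(sptep, &old,
					      old | SPTE_WRITABLE);
}
```

If the exchange fails, a caller would fall back to the slow path under
mmu-lock or simply let the guest retry the access.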

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: introduce SPTE_MMU_WRITEABLE bit
Xiao Guangrong [Wed, 20 Jun 2012 07:58:58 +0000 (15:58 +0800)]
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit

This bit indicates whether the spte can be made writable by the MMU, which
means the corresponding gpte is writable and the corresponding gfn is not
protected by shadow page protection.

In a later patch, SPTE_MMU_WRITEABLE will indicate whether the spte
can be updated locklessly.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: fold tlb flush judgement into mmu_spte_update
Xiao Guangrong [Wed, 20 Jun 2012 07:58:33 +0000 (15:58 +0800)]
KVM: MMU: fold tlb flush judgement into mmu_spte_update

mmu_spte_update() is the common function, so folding the judgement into it
lets us easily audit the path.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: export PFEC.P bit on ept
Xiao Guangrong [Wed, 20 Jun 2012 07:58:04 +0000 (15:58 +0800)]
KVM: VMX: export PFEC.P bit on ept

Export the present bit of the page fault error code; a later patch
will use it.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: cleanup spte_write_protect
Xiao Guangrong [Wed, 20 Jun 2012 07:57:39 +0000 (15:57 +0800)]
KVM: MMU: cleanup spte_write_protect

Use __drop_large_spte to clean up this function, and comment spte_write_protect.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: abstract spte write-protect
Xiao Guangrong [Wed, 20 Jun 2012 07:57:15 +0000 (15:57 +0800)]
KVM: MMU: abstract spte write-protect

Introduce a common function to abstract spte write-protect and
clean up the code.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: return bool in __rmap_write_protect
Xiao Guangrong [Wed, 20 Jun 2012 07:56:53 +0000 (15:56 +0800)]
KVM: MMU: return bool in __rmap_write_protect

The return value of __rmap_write_protect is either 1 or 0, so use
true/false instead.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Emulate invalid guest state by default
Avi Kivity [Tue, 12 Jun 2012 17:30:18 +0000 (20:30 +0300)]
KVM: VMX: Emulate invalid guest state by default

Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: implement LTR
Avi Kivity [Wed, 13 Jun 2012 13:33:29 +0000 (16:33 +0300)]
KVM: x86 emulator: implement LTR

Opcode 0F 00 /3.  Encountered during Windows XP secondary processor bringup.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: make loading TR set the busy bit
Avi Kivity [Wed, 13 Jun 2012 13:30:53 +0000 (16:30 +0300)]
KVM: x86 emulator: make loading TR set the busy bit

Guest software doesn't actually depend on it, but vmx will refuse us
entry if we don't.  Set the bit in both the cached segment and memory,
just to be nice.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: make read_segment_descriptor() return the address
Avi Kivity [Wed, 13 Jun 2012 13:29:39 +0000 (16:29 +0300)]
KVM: x86 emulator: make read_segment_descriptor() return the address

Some operations want to modify the descriptor later on, so save the
address for future use.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate LLDT
Avi Kivity [Wed, 13 Jun 2012 09:28:33 +0000 (12:28 +0300)]
KVM: x86 emulator: emulate LLDT

Opcode 0F 00 /2. Used by isolinux during the protected mode transition.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate BSWAP
Avi Kivity [Wed, 13 Jun 2012 09:25:06 +0000 (12:25 +0300)]
KVM: x86 emulator: emulate BSWAP

Opcodes 0F C8 - 0F CF.

Used by the SeaBIOS cdrom code (though not in big real mode).
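
For reference, the operation itself can be sketched as follows. This covers
only the 32-bit operand form, and the helper names are hypothetical; it is
not the emulator's code.

```c
#include <assert.h>
#include <stdint.h>

/* The low three bits of the second opcode byte (C8..CF) select the register. */
unsigned int bswap_reg_index(uint8_t second_opcode_byte)
{
	assert(second_opcode_byte >= 0xC8 && second_opcode_byte <= 0xCF);
	return second_opcode_byte & 7;
}

/* Reverse the byte order of a 32-bit operand. */
uint32_t bswap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
	       ((v << 8) & 0x00FF0000u) | (v << 24);
}
```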

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Improve error reporting during invalid guest state emulation
Avi Kivity [Tue, 12 Jun 2012 17:22:28 +0000 (20:22 +0300)]
KVM: VMX: Improve error reporting during invalid guest state emulation

If instruction emulation fails, report it properly to userspace.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Stop invalid guest state emulation on pending event
Avi Kivity [Tue, 12 Jun 2012 17:21:38 +0000 (20:21 +0300)]
KVM: VMX: Stop invalid guest state emulation on pending event

Process the event, possibly injecting an interrupt, before continuing.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: implement ENTER
Avi Kivity [Tue, 12 Jun 2012 17:03:23 +0000 (20:03 +0300)]
KVM: x86 emulator: implement ENTER

Opcode C8.

Only ENTER with lexical nesting depth 0 is implemented, since others are
very rare.  We'll fail emulation if nonzero lexical depth is used so data
is not corrupted.
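
A sketch of the depth-0 case, assuming a 64-bit stack operand. The struct,
the flat stack buffer, and the function name are hypothetical stand-ins for
the emulator context.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file; rsp is a byte offset into a stack buffer. */
struct cpu {
	uint64_t rsp;
	uint64_t rbp;
};

/* Returns 0 on success, -1 if the form is not handled. */
int emulate_enter(struct cpu *c, unsigned char *stack,
		  uint16_t frame_size, uint8_t nesting)
{
	assert(c->rsp >= sizeof(c->rbp) + frame_size);

	if (nesting != 0)
		return -1;	/* fail before touching any state */

	c->rsp -= sizeof(c->rbp);			/* push rbp */
	memcpy(stack + c->rsp, &c->rbp, sizeof(c->rbp));
	c->rbp = c->rsp;				/* new frame pointer */
	c->rsp -= frame_size;				/* reserve locals */
	return 0;
}
```

Rejecting a nonzero depth before any register or memory write is what keeps
the guest's data intact when emulation fails.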

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: split push logic from push opcode emulation
Avi Kivity [Tue, 12 Jun 2012 17:19:40 +0000 (20:19 +0300)]
KVM: x86 emulator: split push logic from push opcode emulation

This allows us to reuse the code without populating ctxt->src and
overriding ctxt->op_bytes.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: fix byte-sized MOVZX/MOVSX
Avi Kivity [Mon, 11 Jun 2012 16:40:15 +0000 (19:40 +0300)]
KVM: x86 emulator: fix byte-sized MOVZX/MOVSX

Commit 2adb5ad9fe1 removed ByteOp from MOVZX/MOVSX, replacing them by
SrcMem8, but neglected to fix the dependency in the emulation code
on ByteOp.  This caused the instruction not to have any effect in
some circumstances.

Fix by replacing the check for ByteOp with the equivalent src.op_bytes == 1.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate LAHF
Avi Kivity [Mon, 11 Jun 2012 10:09:07 +0000 (13:09 +0300)]
KVM: x86 emulator: emulate LAHF

Opcode 9F.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Continue emulating after batch exhausted
Avi Kivity [Sun, 10 Jun 2012 15:09:27 +0000 (18:09 +0300)]
KVM: VMX: Continue emulating after batch exhausted

If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Fix interrupt exit condition during emulation
Avi Kivity [Sun, 10 Jun 2012 15:07:57 +0000 (18:07 +0300)]
KVM: VMX: Fix interrupt exit condition during emulation

Checking EFLAGS.IF is incorrect as we might be in interrupt shadow.  If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.

Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate SGDT/SIDT
Avi Kivity [Sun, 10 Jun 2012 14:21:18 +0000 (17:21 +0300)]
KVM: x86 emulator: emulate SGDT/SIDT

Opcodes 0F 01 /0 and 0F 01 /1

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Fix SS default ESP/EBP based addressing
Avi Kivity [Sun, 10 Jun 2012 14:15:39 +0000 (17:15 +0300)]
KVM: Fix SS default ESP/EBP based addressing

We correctly default to SS when BP is used as a base in 16-bit address mode,
but we don't do that for 32-bit mode.

Fix by adjusting the default to SS when either ESP or EBP is used as the base
register.
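
The adjusted rule can be sketched as a small lookup. The enum names are
hypothetical, and the decoder is assumed to have already resolved whether a
base register is present.

```c
#include <assert.h>

/* Hypothetical names; the register values follow the x86 encoding order. */
enum base_reg {
	BASE_NONE = -1,
	REG_EAX, REG_ECX, REG_EDX, REG_EBX,
	REG_ESP, REG_EBP, REG_ESI, REG_EDI
};

enum seg { SEG_DS, SEG_SS };

/* Default segment for a 32-bit memory operand with the given base. */
enum seg default_segment(enum base_reg base)
{
	assert(base >= BASE_NONE && base <= REG_EDI);
	return (base == REG_ESP || base == REG_EBP) ? SEG_SS : SEG_DS;
}
```

An explicit segment override prefix would still take precedence over this
default.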

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: initialize memop
Avi Kivity [Sun, 10 Jun 2012 14:11:00 +0000 (17:11 +0300)]
KVM: x86 emulator: initialize memop

memop is not initialized; this can lead to a two-byte operation
following a 4-byte operation to see garbage values.  Usually
truncation fixes things for us later on, but in at least one case
(call abs) it doesn't.

Fix by moving memop to the auto-initialized field area.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate LEAVE
Avi Kivity [Thu, 7 Jun 2012 14:49:24 +0000 (17:49 +0300)]
KVM: x86 emulator: emulate LEAVE

Opcode c9; used by some variants of Windows during boot, in big real mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Limit iterations with emulator_invalid_guest_state
Avi Kivity [Thu, 7 Jun 2012 14:08:48 +0000 (17:08 +0300)]
KVM: VMX: Limit iterations with emulator_invalid_guest_state

Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Relax check on unusable segment
Avi Kivity [Thu, 7 Jun 2012 14:06:10 +0000 (17:06 +0300)]
KVM: VMX: Relax check on unusable segment

Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1.  Relax the check by testing for
segment not present (a non-present segment cannot be usable).

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: fix LIDT/LGDT in long mode
Avi Kivity [Thu, 7 Jun 2012 14:04:36 +0000 (17:04 +0300)]
KVM: x86 emulator: fix LIDT/LGDT in long mode

The operand size for these instructions is 8 bytes in long mode, even without
a REX prefix.  Set it explicitly.

Triggered while booting Linux with emulate_invalid_guest_state=1.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: allow loading null SS in long mode
Avi Kivity [Thu, 7 Jun 2012 14:03:42 +0000 (17:03 +0300)]
KVM: x86 emulator: allow loading null SS in long mode

Null SS is valid in long mode; allow loading it.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: emulate cpuid
Avi Kivity [Thu, 7 Jun 2012 11:11:36 +0000 (14:11 +0300)]
KVM: x86 emulator: emulate cpuid

Opcode 0F A2.

Used by Linux during the mode change trampoline while in a state that is
not virtualizable on vmx without unrestricted_guest, so we need to emulate
it if emulate_invalid_guest_state=1.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics
Avi Kivity [Thu, 7 Jun 2012 11:10:16 +0000 (14:10 +0300)]
KVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics

Instead of getting an exact leaf, follow the spec and fall back to the last
main leaf instead.  This lets us easily emulate the cpuid instruction in the
emulator.
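
The fallback can be sketched roughly as below. The table layout and the
function names are hypothetical, and the sketch simplifies the rule: it only
compares against the highest basic leaf reported by leaf 0 and does not
model the extended range separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified CPUID table entry. */
struct cpuid_entry {
	uint32_t function;
	uint32_t eax;
};

static const struct cpuid_entry *find_exact(const struct cpuid_entry *tbl,
					    size_t n, uint32_t function)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].function == function)
			return &tbl[i];
	return NULL;
}

/* Exact match first; otherwise fall back to the last main leaf. */
const struct cpuid_entry *cpuid_lookup(const struct cpuid_entry *tbl,
				       size_t n, uint32_t function)
{
	const struct cpuid_entry *e, *leaf0;

	assert(tbl != NULL);
	e = find_exact(tbl, n, function);
	if (e)
		return e;

	/* Leaf 0 reports the highest basic leaf in EAX. */
	leaf0 = find_exact(tbl, n, 0);
	if (!leaf0 || function <= leaf0->eax)
		return NULL;
	return find_exact(tbl, n, leaf0->eax);
}
```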

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Split cpuid register access from computation
Avi Kivity [Thu, 7 Jun 2012 11:07:48 +0000 (14:07 +0300)]
KVM: Split cpuid register access from computation

Introduce kvm_cpuid() to perform the leaf limit check and calculate
register values, and let kvm_emulate_cpuid() just handle reading and
writing the registers from/to the vcpu.  This allows us to reuse
kvm_cpuid() in a context where directly reading and writing registers
is not desired.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Return correct CPL during transition to protected mode
Avi Kivity [Wed, 6 Jun 2012 15:36:48 +0000 (18:36 +0300)]
KVM: VMX: Return correct CPL during transition to protected mode

In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump.  But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).

Fix by reading CPL from the cache during the transition.  This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation
Avi Kivity [Sun, 8 Jul 2012 14:16:30 +0000 (17:16 +0300)]
KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation

Currently the MMU's ->new_cr3() callback does nothing when guest paging
is disabled or when two-dimensional paging (e.g. EPT on Intel) is active.
This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
write vcpu->arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
old value and this is what the guest sees.

This bug did not have any effect until now because:
- with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
- without unrestricted guest, and with paging enabled, we also never emulate a
  mov cr3 instruction
- without unrestricted guest, but with paging disabled, the guest's cr3 is
  ignored until the guest enables paging; at this point the value from arch.cr3
  is loaded correctly by the mov cr0 instruction which turns on paging

However, the patchset that enables big real mode causes us to emulate mov cr3
instructions in protected mode sometimes (when guest state is not virtualizable
by vmx); this mov cr3 is effectively ignored and will crash the guest.

The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
reload.  This is awkward because now all the new_cr3 callbacks do the same
thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
that is more complicated and will be done after this minimal fix.

Observed in the Windows XP 32-bit installer while bringing up secondary vcpus.

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: handle last_boosted_vcpu = 0 case
Rik van Riel [Tue, 19 Jun 2012 20:51:04 +0000 (16:51 -0400)]
KVM: handle last_boosted_vcpu = 0 case

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0.  With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.
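
The difference between the two comparisons can be seen in a simplified
model of the candidate scan. This is a sketch, not the kernel loop: the
function name is hypothetical and every vcpu is assumed to be an eligible
yield target.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the index of the first vcpu that would be considered as a
 * yield target. 'inclusive' selects "<=" (fixed) over "<" (old).
 */
int first_candidate(int nr_vcpus, int last_boosted, bool inclusive)
{
	assert(nr_vcpus > 0 && last_boosted >= 0 && last_boosted < nr_vcpus);

	for (int pass = 0; pass < 2; pass++) {
		for (int i = 0; i < nr_vcpus; i++) {
			bool skip = inclusive ? i <= last_boosted
					      : i < last_boosted;

			if (pass == 0 && skip) {
				i = last_boosted;	/* jump past the last boosted vcpu */
				continue;
			} else if (pass == 1 && i > last_boosted) {
				break;
			}
			return i;
		}
	}
	return -1;
}
```

With last_boosted == 0, the old comparison never skips vcpu 0, so every
spinning vcpu picks it first. The fixed comparison starts the scan at
vcpu 1.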

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: Fix sigp sense handling.
Cornelia Huck [Tue, 26 Jun 2012 14:06:41 +0000 (16:06 +0200)]
KVM: s390: Fix sigp sense handling.

If sigp sense doesn't have any status bits to report, it should set
cc 0 and leave the register as-is.

Since we know about the external call pending bit, we should report
it if it is set as well.

Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: use sigp condition code defines
Heiko Carstens [Tue, 26 Jun 2012 14:06:40 +0000 (16:06 +0200)]
KVM: s390: use sigp condition code defines

Just use the defines instead of using plain numbers and adding
a comment behind each line.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: fix sigp set prefix status stored cases
Heiko Carstens [Tue, 26 Jun 2012 14:06:39 +0000 (16:06 +0200)]
KVM: s390: fix sigp set prefix status stored cases

If an invalid parameter is passed or the addressed cpu is in an
incorrect state, sigp set prefix will store a status.
This status must only have bits set as defined by the architecture.
The current kvm implementation failed to clear bits and also did
not set the intended status bit ("and" instead of "or" operation).
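
The "and" versus "or" mistake can be illustrated as follows. This is a
hypothetical reconstruction of the pattern, not the original code: the
status bit value and the assumption that the status occupies the low 32
bits of the register are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration; see the architecture's sigp definitions. */
#define STATUS_INVALID_PARAMETER	UINT64_C(0x00000100)

/* Buggy shape: "and" can only clear bits, it never sets the intended one. */
void store_status_buggy(uint64_t *reg)
{
	assert(reg != NULL);
	*reg &= STATUS_INVALID_PARAMETER;
}

/* Fixed shape: clear the status half first, then "or" in the status bit. */
void store_status_fixed(uint64_t *reg)
{
	assert(reg != NULL);
	*reg &= UINT64_C(0xffffffff00000000);
	*reg |= STATUS_INVALID_PARAMETER;
}
```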

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: fix sigp sense running condition code handling
Heiko Carstens [Tue, 26 Jun 2012 14:06:38 +0000 (16:06 +0200)]
KVM: s390: fix sigp sense running condition code handling

A status is stored only if the sensed cpu is not running, which
is reflected by condition code 1. If the cpu is running, condition
code 0 should be returned.
The code was doing just the opposite.
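
The intended mapping can be sketched as follows. This is an
illustration under assumptions, not the kvm handler: the names, the
status bit value and the way the status register is updated are all
hypothetical. Only the condition codes 0 and 1 come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the sigp defines. */
enum { SKETCH_CC_ORDER_CODE_ACCEPTED = 0, SKETCH_CC_STATUS_STORED = 1 };
#define SKETCH_STATUS_NOT_RUNNING 0x00000400u  /* assumed bit value */

int sense_running_cc(bool cpu_running, uint64_t *status_reg)
{
    if (cpu_running)
        return SKETCH_CC_ORDER_CODE_ACCEPTED;  /* nothing is stored */

    /* Not running: store exactly one status bit and report cc 1. */
    *status_reg &= 0xffffffff00000000ULL;
    *status_reg |= SKETCH_STATUS_NOT_RUNNING;
    return SKETCH_CC_STATUS_STORED;
}
```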

Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agos390/smp/kvm: unify sigp definitions
Heiko Carstens [Tue, 26 Jun 2012 14:06:37 +0000 (16:06 +0200)]
s390/smp/kvm: unify sigp definitions

The smp and the kvm code have different defines for the sigp order codes.
Let's just have a single place where these are defined.
Also move the sigp condition code and sigp cpu status bits to the new
sigp.h header file.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agos390/smp: remove redundant check
Heiko Carstens [Tue, 26 Jun 2012 14:06:36 +0000 (16:06 +0200)]
s390/smp: remove redundant check

Condition code "status stored" for sigp sense running always implies
that only the "not running" status bit is set. Therefore there is no
need to check whether it is set.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER
Marc Zyngier [Fri, 15 Jun 2012 19:07:24 +0000 (15:07 -0400)]
KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER

In order to avoid compilation failure when KVM is not compiled in,
guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
the KVM code.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: VMX: code clean for vmx_init()
Guo Chao [Fri, 15 Jun 2012 03:31:56 +0000 (11:31 +0800)]
KVM: VMX: code clean for vmx_init()

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoapic: fix kvm build on UP without IOAPIC
Michael S. Tsirkin [Sun, 1 Jul 2012 15:05:06 +0000 (18:05 +0300)]
apic: fix kvm build on UP without IOAPIC

On UP i386, when APIC is disabled
# CONFIG_X86_UP_APIC is not set
# CONFIG_PCI_IOAPIC is not set

code looking at apicdrivers never has any effect but it
still gets compiled in. In particular, this causes
build failures with kvm, but it generally bloats the kernel
unnecessarily.

Fix by defining both __apicdrivers and __apicdrivers_end
to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified
that as the result any loop scanning __apicdrivers gets optimized out by
the compiler.

Warning: a .config with apic disabled doesn't seem to boot
for me (even without this patch). Still verifying why,
meanwhile this patch is compile-tested only.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: host side for eoi optimization
Michael S. Tsirkin [Sun, 24 Jun 2012 16:25:07 +0000 (19:25 +0300)]
KVM: host side for eoi optimization

Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes by as much as half.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit

For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: rearrange injection cancelling code
Michael S. Tsirkin [Sun, 24 Jun 2012 16:25:00 +0000 (19:25 +0300)]
KVM: rearrange injection cancelling code

Each time we need to cancel injection we invoke the same code
(cancel_injection callback).  Move it towards the end of the function
using the familiar goto-on-error pattern.

Will make it easier to do more cleanups for PV EOI.
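
A minimal sketch of the goto-on-error shape described above. The
struct, the function and its two failure flags are hypothetical and
only stand in for the real pre-entry checks.

```c
struct vcpu_sketch {
    int injected;     /* 1 while an event is queued for injection */
    int cancelled;    /* how many times injection was rolled back */
};

/* Each nonzero flag simulates one failing pre-entry check. */
int enter_guest_sketch(struct vcpu_sketch *v, int reload_failed,
                       int signal_pending)
{
    int r;

    v->injected = 1;

    r = reload_failed ? -1 : 0;
    if (r)
        goto cancel_injection;

    r = signal_pending ? -2 : 0;
    if (r)
        goto cancel_injection;

    return 0;                   /* success: the guest would run here */

cancel_injection:
    v->injected = 0;            /* single copy of the rollback code */
    v->cancelled++;
    return r;
}
```

Every failure path funnels through one label, so a later change to the
rollback only has to touch one place.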

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: only sync when attention bits set
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:54 +0000 (19:24 +0300)]
KVM: only sync when attention bits set

Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
attention bitmask but kvm still syncs lapic unconditionally.
As that commit suggested and in anticipation of adding more attention
bits, only sync lapic if(apic_attention).

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: eoi msi documentation
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:49 +0000 (19:24 +0300)]
KVM: eoi msi documentation

Document the new EOI MSR. Couldn't decide whether this change belongs
conceptually on guest or host side, so a separate patch.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agox86, bitops: note on __test_and_clear_bit atomicity
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:42 +0000 (19:24 +0300)]
x86, bitops: note on __test_and_clear_bit atomicity

__test_and_clear_bit is actually atomic with respect
to the local CPU. Add a note saying that KVM on x86
relies on this behaviour so people don't accidentally break it.
Also warn not to rely on this in portable code.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM guest: guest side for eoi avoidance
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:34 +0000 (19:24 +0300)]
KVM guest: guest side for eoi avoidance

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
Guest tests it using a single test and clear operation - this is
necessary so that host can detect interrupt nesting - and if set, it can
skip the EOI MSR.
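
A user-space sketch of the handshake, under assumptions: a C11 atomic
stands in for the shared guest memory word, a counter stands in for the
EOI MSR write, and all names are hypothetical. It is not the kernel
implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_uint pv_eoi_flag;   /* bit 0: "EOI not needed" */
int eoi_msr_writes;        /* counts simulated EOI MSR writes */

/* Host side: set the bit before a plain injection, clear it when nesting. */
void host_before_inject(bool nested)
{
    atomic_store(&pv_eoi_flag, nested ? 0u : 1u);
}

/* Guest side: one test-and-clear decides whether the MSR write is skipped. */
void guest_eoi(void)
{
    unsigned int old = atomic_fetch_and(&pv_eoi_flag, ~1u);

    if (old & 1u)
        return;             /* bit was set: skip the EOI MSR */
    eoi_msr_writes++;       /* bit was clear: fall back to the MSR */
}
```

Because the guest clears the bit in the same operation that tests it,
the host can later see that the bit is gone and conclude the EOI
already happened.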

I ran a simple microbenchmark to show exit reduction
(note: for testing, need to apply follow-up patch
'kvm: host side for eoi optimization' + a qemu patch
 I posted separately, on host):

Before:

Performance counter stats for 'sleep 1s':

            47,357 kvm:kvm_entry                                                [99.98%]
                 0 kvm:kvm_hypercall                                            [99.98%]
                 0 kvm:kvm_hv_hypercall                                         [99.98%]
             5,001 kvm:kvm_pio                                                  [99.98%]
                 0 kvm:kvm_cpuid                                                [99.98%]
            22,124 kvm:kvm_apic                                                 [99.98%]
            49,849 kvm:kvm_exit                                                 [99.98%]
            21,115 kvm:kvm_inj_virq                                             [99.98%]
                 0 kvm:kvm_inj_exception                                        [99.98%]
                 0 kvm:kvm_page_fault                                           [99.98%]
            22,937 kvm:kvm_msr                                                  [99.98%]
                 0 kvm:kvm_cr                                                   [99.98%]
                 0 kvm:kvm_pic_set_irq                                          [99.98%]
                 0 kvm:kvm_apic_ipi                                             [99.98%]
            22,207 kvm:kvm_apic_accept_irq                                      [99.98%]
            22,421 kvm:kvm_eoi                                                  [99.98%]
                 0 kvm:kvm_pv_eoi                                               [99.99%]
                 0 kvm:kvm_nested_vmrun                                         [99.99%]
                 0 kvm:kvm_nested_intercepts                                    [99.99%]
                 0 kvm:kvm_nested_vmexit                                        [99.99%]
                 0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                 0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                 0 kvm:kvm_invlpga                                              [99.99%]
                 0 kvm:kvm_skinit                                               [99.99%]
                57 kvm:kvm_emulate_insn                                         [99.99%]
                 0 kvm:vcpu_match_mmio                                          [99.99%]
                 0 kvm:kvm_userspace_exit                                       [99.99%]
                 2 kvm:kvm_set_irq                                              [99.99%]
                 2 kvm:kvm_ioapic_set_irq                                       [99.99%]
            23,609 kvm:kvm_msi_set_irq                                          [99.99%]
                 1 kvm:kvm_ack_irq                                              [99.99%]
               131 kvm:kvm_mmio                                                 [99.99%]
               226 kvm:kvm_fpu                                                  [100.00%]
                 0 kvm:kvm_age_page                                             [100.00%]
                 0 kvm:kvm_try_async_get_page                                    [100.00%]
                 0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                 0 kvm:kvm_async_pf_not_present                                    [100.00%]
                 0 kvm:kvm_async_pf_ready                                       [100.00%]
                 0 kvm:kvm_async_pf_completed

       1.002100578 seconds time elapsed

After:

 Performance counter stats for 'sleep 1s':

            28,354 kvm:kvm_entry                                                [99.98%]
                 0 kvm:kvm_hypercall                                            [99.98%]
                 0 kvm:kvm_hv_hypercall                                         [99.98%]
             1,347 kvm:kvm_pio                                                  [99.98%]
                 0 kvm:kvm_cpuid                                                [99.98%]
             1,931 kvm:kvm_apic                                                 [99.98%]
            29,595 kvm:kvm_exit                                                 [99.98%]
            24,884 kvm:kvm_inj_virq                                             [99.98%]
                 0 kvm:kvm_inj_exception                                        [99.98%]
                 0 kvm:kvm_page_fault                                           [99.98%]
             1,986 kvm:kvm_msr                                                  [99.98%]
                 0 kvm:kvm_cr                                                   [99.98%]
                 0 kvm:kvm_pic_set_irq                                          [99.98%]
                 0 kvm:kvm_apic_ipi                                             [99.99%]
            25,953 kvm:kvm_apic_accept_irq                                      [99.99%]
            26,132 kvm:kvm_eoi                                                  [99.99%]
            26,593 kvm:kvm_pv_eoi                                               [99.99%]
                 0 kvm:kvm_nested_vmrun                                         [99.99%]
                 0 kvm:kvm_nested_intercepts                                    [99.99%]
                 0 kvm:kvm_nested_vmexit                                        [99.99%]
                 0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                 0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                 0 kvm:kvm_invlpga                                              [99.99%]
                 0 kvm:kvm_skinit                                               [99.99%]
               284 kvm:kvm_emulate_insn                                         [99.99%]
                68 kvm:vcpu_match_mmio                                          [99.99%]
                68 kvm:kvm_userspace_exit                                       [99.99%]
                 2 kvm:kvm_set_irq                                              [99.99%]
                 2 kvm:kvm_ioapic_set_irq                                       [99.99%]
            28,288 kvm:kvm_msi_set_irq                                          [99.99%]
                 1 kvm:kvm_ack_irq                                              [99.99%]
               131 kvm:kvm_mmio                                                 [100.00%]
               588 kvm:kvm_fpu                                                  [100.00%]
                 0 kvm:kvm_age_page                                             [100.00%]
                 0 kvm:kvm_try_async_get_page                                    [100.00%]
                 0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                 0 kvm:kvm_async_pf_not_present                                    [100.00%]
                 0 kvm:kvm_async_pf_ready                                       [100.00%]
                 0 kvm:kvm_async_pf_completed

       1.002039622 seconds time elapsed

We see that the number of exits is almost halved.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: optimize ISR lookups
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:26 +0000 (19:24 +0300)]
KVM: optimize ISR lookups

We perform ISR lookups twice: during interrupt
injection and on EOI. Typical workloads only have
a single bit set there. So we can avoid ISR scans by
1. counting bits as we set/clear them in ISR
2. on set, caching the injected vector number
3. on clear, invalidating the cache

The real purpose of this is enabling PV EOI
which needs to quickly validate the vector.
But non PV guests also benefit: with this patch,
and without interrupt nesting, apic_find_highest_isr
will always return immediately without scanning ISR.
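
The three steps can be sketched in plain C as below. The struct and
function names are hypothetical, and the bitmap handling is simplified
compared to the real lapic register layout.

```c
#include <stdint.h>

#define SKETCH_NR_VECTORS 256

struct isr_sketch {
    uint32_t isr[SKETCH_NR_VECTORS / 32];  /* in-service bitmap */
    int isr_count;                         /* number of bits set */
    int highest_isr_cache;                 /* -1 when invalid */
};

void isr_init(struct isr_sketch *s)
{
    for (int i = 0; i < SKETCH_NR_VECTORS / 32; i++)
        s->isr[i] = 0;
    s->isr_count = 0;
    s->highest_isr_cache = -1;
}

/* Assumes, as on injection, that vec outranks everything in service. */
void isr_set(struct isr_sketch *s, int vec)
{
    uint32_t mask = 1u << (vec % 32);

    if (!(s->isr[vec / 32] & mask)) {
        s->isr[vec / 32] |= mask;
        s->isr_count++;
    }
    s->highest_isr_cache = vec;
}

void isr_clear(struct isr_sketch *s, int vec)
{
    uint32_t mask = 1u << (vec % 32);

    if (s->isr[vec / 32] & mask) {
        s->isr[vec / 32] &= ~mask;
        s->isr_count--;
    }
    s->highest_isr_cache = -1;
}

int isr_find_highest(const struct isr_sketch *s)
{
    if (s->isr_count == 0)
        return -1;                       /* nothing in service */
    if (s->highest_isr_cache != -1)
        return s->highest_isr_cache;     /* cached: no scan */

    for (int vec = SKETCH_NR_VECTORS - 1; vec >= 0; vec--)
        if (s->isr[vec / 32] & (1u << (vec % 32)))
            return vec;                  /* slow path: full scan */
    return -1;
}
```

Only the nested case (one vector cleared while another is still in
service) reaches the scan; the common single-vector case is answered
from the counter or the cache.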

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: document lapic regs field
Michael S. Tsirkin [Sun, 24 Jun 2012 16:24:19 +0000 (19:24 +0300)]
KVM: document lapic regs field

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc()
Takuya Yoshikawa [Tue, 19 Jun 2012 13:04:56 +0000 (22:04 +0900)]
KVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc()

The following commit did not care about the error handling path:

  commit c1a7b32a14138f908df52d7c53b5ce3415ec6b50
  KVM: Avoid wasting pages for small lpage_info arrays

If memory allocation fails, vfree() will be called with the address
returned by kzalloc().  This patch fixes this issue.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Introduce __KVM_HAVE_IRQ_LINE
Christoffer Dall [Fri, 15 Jun 2012 19:07:13 +0000 (15:07 -0400)]
KVM: Introduce __KVM_HAVE_IRQ_LINE

This is a preparatory patch for the KVM/ARM implementation. KVM/ARM will use
the KVM_IRQ_LINE ioctl, which is currently conditional on
__KVM_HAVE_IOAPIC, but ARM obviously doesn't have any IOAPIC support and we
need a separate define.

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: use KVM_CAP_IRQ_ROUTING to protect the routing related code
Marc Zyngier [Fri, 15 Jun 2012 19:07:02 +0000 (15:07 -0400)]
KVM: use KVM_CAP_IRQ_ROUTING to protect the routing related code

The KVM code sometimes uses CONFIG_HAVE_KVM_IRQCHIP to protect
code that is related to IRQ routing, which not all in-kernel
irqchips may support.

Use KVM_CAP_IRQ_ROUTING instead.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: trace events: update list of exit reasons
Cornelia Huck [Mon, 11 Jun 2012 16:39:50 +0000 (18:39 +0200)]
KVM: trace events: update list of exit reasons

The list of exit reasons for the kvm_userspace_exit event was
missing recent additions; bring it into sync again.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: Perform early event mask processing during boot
Heinz Graalfs [Mon, 11 Jun 2012 14:06:59 +0000 (16:06 +0200)]
KVM: s390: Perform early event mask processing during boot

For processing under KVM it is required to detect
the actual SCLP console type in order to set it as
preferred console.

Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: s390: Set CPU in stopped state on initial cpu reset
Christian Borntraeger [Mon, 11 Jun 2012 14:06:57 +0000 (16:06 +0200)]
KVM: s390: Set CPU in stopped state on initial cpu reset

The initial cpu reset sets the cpu in the stopped state.
Several places check for the cpu state (e.g. sigp set prefix) and
not setting the STOPPED state triggered errors with newer guest
kernels after reboot.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: x86: change PT_FIRST_AVAIL_BITS_SHIFT to avoid conflict with EPT Dirty bit
Xudong Hao [Thu, 7 Jun 2012 10:26:07 +0000 (18:26 +0800)]
KVM: x86: change PT_FIRST_AVAIL_BITS_SHIFT to avoid conflict with EPT Dirty bit

The EPT Dirty bit uses bit 9 per the Intel SDM definition. To avoid a
conflict, change PT_FIRST_AVAIL_BITS_SHIFT to 10.
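
A small sketch of the overlap. The helper names are hypothetical, and
the old shift value of 9 is inferred from the conflict described here
rather than quoted from the code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EPT_DIRTY_BIT (1ULL << 9)   /* bit 9, as stated above */

/* First software-available bit for a given shift value. */
uint64_t first_avail_bit(unsigned int shift)
{
    return 1ULL << shift;
}

/* True if the software bit lands on the hardware Dirty bit. */
bool clashes_with_ept_dirty(unsigned int shift)
{
    return (first_avail_bit(shift) & SKETCH_EPT_DIRTY_BIT) != 0;
}
```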

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoKVM: MMU: Remove unused parameter from mmu_memory_cache_alloc()
Takuya Yoshikawa [Tue, 29 May 2012 14:54:26 +0000 (23:54 +0900)]
KVM: MMU: Remove unused parameter from mmu_memory_cache_alloc()

The size is not needed to return one of the pre-allocated objects.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
12 years agoMerge branch 'for-upstream' of git://github.com/agraf/linux-2.6 into next
Avi Kivity [Wed, 6 Jun 2012 12:31:34 +0000 (15:31 +0300)]
Merge branch 'for-upstream' of git://github.com/agraf/linux-2.6 into next

Alex says:

"Changes this time include:

  - Generalize KVM_GUEST support to overall ePAPR code
  - Fix reset for Book3S HV
  - Fix machine check deferral when CONFIG_KVM_GUEST=y
  - Add support for BookE register DECAR"

* 'for-upstream' of git://github.com/agraf/linux-2.6:
  KVM: PPC: Not optimizing MSR_CE and MSR_ME with paravirt.
  KVM: PPC: booke: Added DECAR support
  KVM: PPC: Book3S HV: Make the guest hash table size configurable
  KVM: PPC: Factor out guest epapr initialization

Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: disable uninitialized var warning
Michael S. Tsirkin [Sun, 3 Jun 2012 08:34:08 +0000 (11:34 +0300)]
KVM: disable uninitialized var warning

I see this in 3.5-rc1:

arch/x86/kvm/mmu.c: In function ‘kvm_test_age_rmapp’:
arch/x86/kvm/mmu.c:1271: warning: ‘iter.desc’ may be used uninitialized in this function

The line in question was introduced by commit
1e3f42f03c38c29c1814199a6f0a2f01b919ea3f

 static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
                              unsigned long data)
 {
-       u64 *spte;
+       u64 *sptep;
+       struct rmap_iterator iter;   <- line 1271
        int young = 0;

        /*

The reason I think is that the compiler assumes that
the rmap value could be 0, so

static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator
*iter)
{
        if (!rmap)
                return NULL;

        if (!(rmap & 1)) {
                iter->desc = NULL;
                return (u64 *)rmap;
        }

        iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
        iter->pos = 0;
        return iter->desc->sptes[iter->pos];
}

will not initialize iter.desc, but the compiler isn't
smart enough to see that

        for (sptep = rmap_get_first(*rmapp, &iter); sptep;
             sptep = rmap_get_next(&iter)) {

will immediately exit in this case.
I checked by adding
        if (!*rmapp)
                goto out;
on top which is clearly equivalent but disables the warning.

This patch uses uninitialized_var to disable the warning without
increasing code size.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Christoffer Dall [Sun, 3 Jun 2012 18:17:48 +0000 (21:17 +0300)]
KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers

Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.

Functions introduced or modified are:
 - kvm_err(fmt, ...)
 - kvm_info(fmt, ...)
 - kvm_debug(fmt, ...)
 - kvm_pr_unimpl(fmt, ...)
 - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
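
A user-space sketch of what such prefixing wrappers can look like. The
prefix string, the helper and the macro names are assumptions for
illustration; they are not the macros added by this patch.

```c
#include <stdarg.h>
#include <stdio.h>

char sketch_last_line[128];   /* last formatted message, kept for inspection */

/* Format "kvm <level>: <message>" and return the message length. */
int sketch_log(const char *level, const char *fmt, ...)
{
    va_list ap;
    int n = snprintf(sketch_last_line, sizeof(sketch_last_line),
                     "kvm %s: ", level);

    va_start(ap, fmt);
    n += vsnprintf(sketch_last_line + n, sizeof(sketch_last_line) - n,
                   fmt, ap);
    va_end(ap);
    fprintf(stderr, "%s\n", sketch_last_line);
    return n;
}

/* Thin wrappers: callers only pick the level by choosing the macro. */
#define sketch_kvm_err(...)  sketch_log("error", __VA_ARGS__)
#define sketch_kvm_info(...) sketch_log("info", __VA_ARGS__)
```

A call such as sketch_kvm_err("bad msr 0x%x", 0x11) then carries the
common prefix without each call site spelling it out.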

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: s390: Change maintainer
Christian Borntraeger [Tue, 5 Jun 2012 11:05:02 +0000 (13:05 +0200)]
KVM: s390: Change maintainer

Since Carsten is now working on a different project, Cornelia will
work as the 2nd s390/kvm maintainer.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Carsten Otte <cotte@de.ibm.com>
CC: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Fix KVM_SET_SREGS with big real mode segments
Orit Wasserman [Thu, 31 May 2012 11:49:22 +0000 (14:49 +0300)]
KVM: VMX: Fix KVM_SET_SREGS with big real mode segments

For example, migration between Westmere and Nehalem hosts, caught in big real mode.

The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.

Signed-off-by: Orit Wasserman <owasserm@rehdat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: MMU: do not iterate over all VMs in mmu_shrink()
Gleb Natapov [Mon, 4 Jun 2012 11:53:23 +0000 (14:53 +0300)]
KVM: MMU: do not iterate over all VMs in mmu_shrink()

mmu_shrink() needlessly iterates over all VMs even though it will not
attempt to free mmu pages from more than one of them. Fix that and also
check the used mmu pages count outside of the VM lock to skip inactive
VMs faster.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: ia64: Mark ia64 KVM as BROKEN
Avi Kivity [Thu, 17 May 2012 10:14:08 +0000 (13:14 +0300)]
KVM: ia64: Mark ia64 KVM as BROKEN

Practically all patches to ia64 KVM are build fixes; numerous warnings remain;
the last patch from the maintainer was committed more than three years ago.  It
is clear that no one is using this thing.

Mark as BROKEN to ensure people don't get hit by pointless build problems.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Use EPT Access bit in response to memory notifiers
Xudong Hao [Tue, 22 May 2012 03:23:15 +0000 (11:23 +0800)]
KVM: VMX: Use EPT Access bit in response to memory notifiers

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP
Xudong Hao [Mon, 28 May 2012 11:33:36 +0000 (19:33 +0800)]
KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP

Enable the EPT A/D bits in the EPT page structure entries if the processor supports them.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Add parameter to control A/D bits support, default is on
Xudong Hao [Mon, 28 May 2012 11:33:35 +0000 (19:33 +0800)]
KVM: VMX: Add parameter to control A/D bits support, default is on

Add a kernel parameter to control A/D bits support; it is on by default.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: VMX: Add EPT A/D bits definitions
Xudong Hao [Mon, 28 May 2012 11:33:34 +0000 (19:33 +0800)]
KVM: VMX: Add EPT A/D bits definitions

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
12 years agoKVM: Avoid wasting pages for small lpage_info arrays
Takuya Yoshikawa [Sun, 20 May 2012 04:15:07 +0000 (13:15 +0900)]
KVM: Avoid wasting pages for small lpage_info arrays

lpage_info is created for each large level even when the memory slot is
not for RAM.  This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().

To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.

This patch mitigates this problem by using kvm_kvzalloc().
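
A user-space sketch of the idea, under assumptions: requests of at most
one page go to a slab-like allocator, larger ones to a page-granular
one, and the free side must match. Here calloc()/free() back both paths
and the backend is tracked explicitly; the names and the one-page
threshold are illustrative.

```c
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u

enum sketch_backend { SKETCH_SLAB, SKETCH_VMALLOC };

struct sketch_buf {
    void *ptr;
    enum sketch_backend backend;   /* which allocator produced ptr */
};

int slab_frees, vmalloc_frees;     /* counters to observe the pairing */

struct sketch_buf sketch_kvzalloc(size_t size)
{
    struct sketch_buf b;

    /* Only requests above one page take the page-granular path. */
    b.backend = size > SKETCH_PAGE_SIZE ? SKETCH_VMALLOC : SKETCH_SLAB;
    b.ptr = calloc(1, size);       /* zeroed allocation */
    return b;
}

void sketch_kvfree(struct sketch_buf *b)
{
    /* Release through the path that matches the allocation. */
    if (b->backend == SKETCH_VMALLOC)
        vmalloc_frees++;
    else
        slab_frees++;
    free(b->ptr);
    b->ptr = NULL;
}
```

A small lpage_info array for a device slot then costs a few bytes
instead of a whole page per level.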

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>