* Re: [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock
       [not found] ` <1340290158-11036-3-git-send-email-qemulist@gmail.com>
@ 2012-06-21 15:02   ` Jan Kiszka
  2012-06-22 20:06   ` Anthony Liguori
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Kiszka @ 2012-06-21 15:02 UTC
  To: Liu Ping Fan
  Cc: Liu Ping Fan, qemu-devel, Anthony Liguori <anthony@codemonkey.ws>

On 2012-06-21 16:49, Liu Ping Fan wrote:
> In order to break the big lock, use a per-cpu lock in kvm_cpu_exec()
> to protect against other CPUs' accesses to env->apic_state & related
> fields in env.
> Also, we need to protect against run_on_cpu().
> 
> Race condition can be like this:
> 1.  vcpu-1 IPI vcpu-2
>     vcpu-3 IPI vcpu-2
>     Open window exists for accessing to vcpu-2's apic_state & env
> 
> 2. run_on_cpu() write env->queued_work_last, while flush_queued_work()
>    read
> 

How much of this is still relevant with the (nowadays default-on)
in-kernel irqchips?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
@ 2012-06-21 15:06 Liu Ping Fan
  0 siblings, 0 replies; 15+ messages in thread
From: Liu Ping Fan @ 2012-06-21 15:06 UTC
  To: qemu-devel

Nowadays, we use qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
protect against races when vcpu threads and the iothread access the emulated
devices.

But this lock is too big, and we can break it down.
These patches separate the CPUArchState's protection from that of the other
devices, so we can have a per-cpu lock for each CPUArchState instead of the
big lock.
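
As a rough sketch of the intended pattern (not part of the patches themselves:
qemu_mutex_lock_cpu()/qemu_mutex_unlock_cpu() are the helpers added in patch
1/2 below, and the function here is made up), code that touches another vcpu's
per-cpu state would be bracketed like this, much as patch 2/2 does in
pic_irq_request():

    /* Sketch only: guard another vcpu's state with its per-cpu lock
     * instead of the global iothread lock. */
    static void touch_other_vcpu(CPUArchState *env)
    {
        if (!qemu_cpu_is_self(env)) {
            qemu_mutex_lock_cpu(env);            /* added by patch 1/2 */
        }
        /* ... access env->apic_state or other per-cpu fields ... */
        if (!qemu_cpu_is_self(env)) {
            qemu_mutex_unlock_cpu(env);
        }
    }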


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
       [not found] <1340290158-11036-1-git-send-email-qemulist@gmail.com>
@ 2012-06-21 15:23 ` Jan Kiszka
  2012-06-22 10:24   ` liu ping fan
       [not found] ` <1340290158-11036-2-git-send-email-qemulist@gmail.com>
       [not found] ` <1340290158-11036-3-git-send-email-qemulist@gmail.com>
  2 siblings, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2012-06-21 15:23 UTC
  To: Liu Ping Fan; +Cc: qemu-devel, Anthony Liguori

On 2012-06-21 16:49, Liu Ping Fan wrote:
> Nowadays, we use qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
> protect the race to access the emulated dev launched by vcpu threads & iothread.
> 
> But this lock is too big. We can break it down.
> These patches separate the CPUArchState's protection from the other devices, so we
> can have a per-cpu lock for each CPUArchState, not the big lock any longer.

Anything that reduces lock dependencies is generally welcome. But can
you specify in more details what you gain, and under which conditions?

I'm skeptical if this is the right area to start. With the in-kernel
irqchip enabled, no CPUArchState field is touched during normal
operations (unless I missed something subtle in the past). At the same
time, this locking is unfortunately fairly complex and invasive, so not
"cheap" to integrate.

IMO more interesting is breaking out some I/O path, e.g. from a NIC to a
network backend, and get this processed in a separate thread without
touching the BQL (Big QEMU Lock). I've experimental patches for this
here, but they need rebasing and polishing.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-21 15:23 ` [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread Jan Kiszka
@ 2012-06-22 10:24   ` liu ping fan
  2012-06-22 10:37     ` Jan Kiszka
  0 siblings, 1 reply; 15+ messages in thread
From: liu ping fan @ 2012-06-22 10:24 UTC
  To: Jan Kiszka; +Cc: qemu-devel, Anthony Liguori

On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-06-21 16:49, Liu Ping Fan wrote:
>> Nowadays, we use qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>> protect the race to access the emulated dev launched by vcpu threads & iothread.
>>
>> But this lock is too big. We can break it down.
>> These patches separate the CPUArchState's protection from the other devices, so we
>> can have a per-cpu lock for each CPUArchState, not the big lock any longer.
>
> Anything that reduces lock dependencies is generally welcome. But can
> you specify in more details what you gain, and under which conditions?
>
In fact, there are several steps to breaking down the QEMU big lock. This
step just aims to shrink the code area protected by
qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). I am working on the
following steps, which focus on breaking down the big lock around the
handle_{io,mmio} calls.

Thanks and regards,
pingfan

> I'm skeptical if this is the right area to start. With the in-kernel
> irqchip enabled, no CPUArchState field is touched during normal
> operations (unless I missed something subtle in the past). At the same
> time, this locking is unfortunately fairly complex and invasive, so not
> "cheap" to integrate.
>
> IMO more interesting is breaking out some I/O path, e.g. from a NIC to a
> network backend, and get this processed in a separate thread without
> touching the BQL (Big QEMU Lock). I've experimental patches for this
> here, but they need rebasing and polishing.
>
> Thanks,
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 10:24   ` liu ping fan
@ 2012-06-22 10:37     ` Jan Kiszka
  2012-06-22 20:11       ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2012-06-22 10:37 UTC
  To: liu ping fan; +Cc: qemu-devel@nongnu.org, Anthony Liguori

On 2012-06-22 12:24, liu ping fan wrote:
> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>> Nowadays, we use qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>>> protect the race to access the emulated dev launched by vcpu threads & iothread.
>>>
>>> But this lock is too big. We can break it down.
>>> These patches separate the CPUArchState's protection from the other devices, so we
>>> can have a per-cpu lock for each CPUArchState, not the big lock any longer.
>>
>> Anything that reduces lock dependencies is generally welcome. But can
>> you specify in more details what you gain, and under which conditions?
>>
> In fact, there are several steps to break down the Qemu big lock. This
> step just aims to shrink the code area protected by
> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
> working on the following steps, which focus on breaking down the big
> lock when calling handle_{io,mmio}

Then let us discuss the strategy. This is important as it is unrealistic
to break up the lock for all code paths. We really need to focus on
goals that provide benefits for relevant use cases.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: [Qemu-devel] [PATCH 1/2] CPUArchState: introduce per-cpu lock
       [not found] ` <1340290158-11036-2-git-send-email-qemulist@gmail.com>
@ 2012-06-22 20:01   ` Anthony Liguori
  0 siblings, 0 replies; 15+ messages in thread
From: Anthony Liguori @ 2012-06-22 20:01 UTC
  To: Liu Ping Fan
  Cc: Jan Kiszka, Liu Ping Fan, qemu-devel, Anthony Liguori

On 06/21/2012 09:49 AM, Liu Ping Fan wrote:
> introduce a per-cpu lock to protect against accesses from
> other vcpu threads.
>
> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> ---
>   cpu-defs.h  |    2 ++
>   cpus.c      |   17 +++++++++++++++++
>   main-loop.h |    3 +++
>   3 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/cpu-defs.h b/cpu-defs.h
> index f49e950..7305822 100644
> --- a/cpu-defs.h
> +++ b/cpu-defs.h
> @@ -30,6 +30,7 @@
>   #include "osdep.h"
>   #include "qemu-queue.h"
>   #include "targphys.h"
> +#include "qemu-thread-posix.h"

qemu-thread.h?

I would strongly suspect qemu-thread-posix would break the windows build...

>
>   #ifndef TARGET_LONG_BITS
>   #error TARGET_LONG_BITS must be defined before including this header
> @@ -220,6 +221,7 @@ typedef struct CPUWatchpoint {
>       CPU_COMMON_THREAD                                                   \
>       struct QemuCond *halt_cond;                                         \
>       int thread_kicked;                                                  \
> +    struct QemuMutex *cpu_lock;                                         \
>       struct qemu_work_item *queued_work_first, *queued_work_last;        \
>       const char *cpu_model_str;                                          \
>       struct KVMState *kvm_state;                                         \
> diff --git a/cpus.c b/cpus.c
> index b182b3d..554f7bc 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -735,6 +735,7 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
>       env->thread_id = qemu_get_thread_id();
>       cpu_single_env = env;
>
> +

Please be careful about introducing spurious whitespace.

>       r = kvm_init_vcpu(env);
>       if (r<  0) {
>           fprintf(stderr, "kvm_init_vcpu failed: %s\n", strerror(-r));
> @@ -891,6 +892,20 @@ int qemu_cpu_is_self(void *_env)
>       return qemu_thread_is_self(env->thread);
>   }
>
> +void qemu_mutex_lock_cpu(void *_env)
> +{
> +    CPUArchState *env = _env;
> +
> +    qemu_mutex_lock(env->cpu_lock);
> +}
> +
> +void qemu_mutex_unlock_cpu(void *_env)
> +{
> +    CPUArchState *env = _env;
> +
> +    qemu_mutex_unlock(env->cpu_lock);
> +}
> +

See CODING_STYLE.  _ as the start of a variable name is not allowed.

>   void qemu_mutex_lock_iothread(void)
>   {
>       if (!tcg_enabled()) {
> @@ -1027,6 +1042,8 @@ void qemu_init_vcpu(void *_env)
>       env->nr_cores = smp_cores;
>       env->nr_threads = smp_threads;
>       env->stopped = 1;
> +    env->cpu_lock = g_malloc0(sizeof(QemuMutex));
> +    qemu_mutex_init(env->cpu_lock);
>       if (kvm_enabled()) {
>           qemu_kvm_start_vcpu(env);
>       } else if (tcg_enabled()) {
> diff --git a/main-loop.h b/main-loop.h
> index dce1cd9..d8d44a4 100644
> --- a/main-loop.h
> +++ b/main-loop.h
> @@ -323,6 +323,9 @@ void qemu_bh_delete(QEMUBH *bh);
>   int qemu_add_child_watch(pid_t pid);
>   #endif
>
> +void qemu_mutex_lock_cpu(void *_env);
> +void qemu_mutex_unlock_cpu(void *_env);
> +
>   /**
>    * qemu_mutex_lock_iothread: Lock the main loop mutex.
>    *


* Re: [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock
       [not found] ` <1340290158-11036-3-git-send-email-qemulist@gmail.com>
  2012-06-21 15:02   ` [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock Jan Kiszka
@ 2012-06-22 20:06   ` Anthony Liguori
  2012-06-23  9:32     ` liu ping fan
  1 sibling, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2012-06-22 20:06 UTC
  To: Liu Ping Fan; +Cc: Jan Kiszka, Liu Ping Fan, qemu-devel

On 06/21/2012 09:49 AM, Liu Ping Fan wrote:
> In order to break the big lock, using per-cpu_lock in kvm_cpu_exec()
> to protect the race from other cpu's access to env->apic_state&  related
> field in env.
> Also, we need to protect agaist run_on_cpu().
>
> Race condition can be like this:
> 1.  vcpu-1 IPI vcpu-2
>      vcpu-3 IPI vcpu-2
>      Open window exists for accessing to vcpu-2's apic_state&  env
>
> 2. run_on_cpu() write env->queued_work_last, while flush_queued_work()
>     read
>
> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> ---
>   cpus.c    |    6 ++++--
>   hw/apic.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>   hw/pc.c   |    8 +++++++-
>   kvm-all.c |   13 +++++++++++--
>   4 files changed, 76 insertions(+), 9 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 554f7bc..ac99afe 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -649,6 +649,7 @@ void run_on_cpu(CPUArchState *env, void (*func)(void *data), void *data)
>
>       wi.func = func;
>       wi.data = data;
> +    qemu_mutex_lock(env->cpu_lock);
>       if (!env->queued_work_first) {
>           env->queued_work_first =&wi;
>       } else {
> @@ -657,6 +658,7 @@ void run_on_cpu(CPUArchState *env, void (*func)(void *data), void *data)
>       env->queued_work_last =&wi;
>       wi.next = NULL;
>       wi.done = false;
> +    qemu_mutex_unlock(env->cpu_lock);
>
>       qemu_cpu_kick(env);
>       while (!wi.done) {
> @@ -718,7 +720,7 @@ static void qemu_tcg_wait_io_event(void)
>   static void qemu_kvm_wait_io_event(CPUArchState *env)
>   {
>       while (cpu_thread_is_idle(env)) {
> -        qemu_cond_wait(env->halt_cond,&qemu_global_mutex);
> +        qemu_cond_wait(env->halt_cond, env->cpu_lock);
>       }
>
>       qemu_kvm_eat_signals(env);
> @@ -729,8 +731,8 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
>   {
>       CPUArchState *env = arg;
>       int r;
> +    qemu_mutex_lock_cpu(env);
>
> -    qemu_mutex_lock(&qemu_global_mutex);
>       qemu_thread_get_self(env->thread);
>       env->thread_id = qemu_get_thread_id();
>       cpu_single_env = env;
> diff --git a/hw/apic.c b/hw/apic.c
> index 4eeaf88..b999a40 100644
> --- a/hw/apic.c
> +++ b/hw/apic.c
> @@ -22,6 +22,7 @@
>   #include "host-utils.h"
>   #include "trace.h"
>   #include "pc.h"
> +#include "qemu-thread.h"
>
>   #define MAX_APIC_WORDS 8
>
> @@ -94,6 +95,7 @@ static int get_highest_priority_int(uint32_t *tab)
>       return -1;
>   }
>
> +/* Caller must hold the lock */
>   static void apic_sync_vapic(APICCommonState *s, int sync_type)
>   {
>       VAPICState vapic_state;
> @@ -141,11 +143,13 @@ static void apic_sync_vapic(APICCommonState *s, int sync_type)
>       }
>   }
>
> +/* Caller must hold lock */
>   static void apic_vapic_base_update(APICCommonState *s)
>   {
>       apic_sync_vapic(s, SYNC_TO_VAPIC);
>   }
>
> +/* Caller must hold the lock */
>   static void apic_local_deliver(APICCommonState *s, int vector)
>   {
>       uint32_t lvt = s->lvt[vector];
> @@ -175,9 +179,11 @@ static void apic_local_deliver(APICCommonState *s, int vector)
>               (lvt&  APIC_LVT_LEVEL_TRIGGER))
>               trigger_mode = APIC_TRIGGER_LEVEL;
>           apic_set_irq(s, lvt&  0xff, trigger_mode);
> +        break;
>       }
>   }
>
> +/* Caller must hold the lock */
>   void apic_deliver_pic_intr(DeviceState *d, int level)
>   {
>       APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
> @@ -200,9 +206,12 @@ void apic_deliver_pic_intr(DeviceState *d, int level)
>       }
>   }
>
> +/* Must hold lock */
>   static void apic_external_nmi(APICCommonState *s)
>   {
> +    qemu_mutex_lock_cpu(s->cpu_env);
>       apic_local_deliver(s, APIC_LVT_LINT1);
> +    qemu_mutex_unlock_cpu(s->cpu_env);
>   }
>
>   #define foreach_apic(apic, deliver_bitmask, code) \
> @@ -215,7 +224,9 @@ static void apic_external_nmi(APICCommonState *s)
>                   if (__mask&  (1<<  __j)) {\
>                       apic = local_apics[__i * 32 + __j];\
>                       if (apic) {\
> +                        qemu_mutex_lock_cpu(apic->cpu_env);\
>                           code;\
> +                        qemu_mutex_unlock_cpu(apic->cpu_env);\
>                       }\
>                   }\
>               }\
> @@ -244,7 +255,9 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
>                   if (d>= 0) {
>                       apic_iter = local_apics[d];
>                       if (apic_iter) {
> +                        qemu_mutex_lock_cpu(apic_iter->cpu_env);
>                           apic_set_irq(apic_iter, vector_num, trigger_mode);
> +                        qemu_mutex_unlock_cpu(apic_iter->cpu_env);
>                       }
>                   }
>               }
> @@ -293,6 +306,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t delivery_mode,
>       apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num, trigger_mode);
>   }
>
> +/* Caller must hold lock */
>   static void apic_set_base(APICCommonState *s, uint64_t val)
>   {
>       s->apicbase = (val&  0xfffff000) |
> @@ -305,6 +319,7 @@ static void apic_set_base(APICCommonState *s, uint64_t val)
>       }
>   }
>
> +/* caller must hold lock */
>   static void apic_set_tpr(APICCommonState *s, uint8_t val)
>   {
>       /* Updates from cr8 are ignored while the VAPIC is active */
> @@ -314,12 +329,14 @@ static void apic_set_tpr(APICCommonState *s, uint8_t val)
>       }
>   }
>
> +/* caller must hold lock */
>   static uint8_t apic_get_tpr(APICCommonState *s)
>   {
>       apic_sync_vapic(s, SYNC_FROM_VAPIC);
>       return s->tpr>>  4;
>   }
>
> +/* Caller must hold the lock */
>   static int apic_get_ppr(APICCommonState *s)
>   {
>       int tpr, isrv, ppr;
> @@ -348,6 +365,7 @@ static int apic_get_arb_pri(APICCommonState *s)
>    * 0  - no interrupt,
>    *>0 - interrupt number
>    */
> +/* Caller must hold cpu_lock */
>   static int apic_irq_pending(APICCommonState *s)
>   {
>       int irrv, ppr;
> @@ -363,6 +381,7 @@ static int apic_irq_pending(APICCommonState *s)
>       return irrv;
>   }
>
> +/* caller must ensure the lock has been taken */
>   /* signal the CPU if an irq is pending */
>   static void apic_update_irq(APICCommonState *s)
>   {
> @@ -380,11 +399,13 @@ static void apic_update_irq(APICCommonState *s)
>   void apic_poll_irq(DeviceState *d)
>   {
>       APICCommonState *s = APIC_COMMON(d);
> -
> +    qemu_mutex_lock_cpu(s->cpu_env);
>       apic_sync_vapic(s, SYNC_FROM_VAPIC);
>       apic_update_irq(s);
> +    qemu_mutex_unlock_cpu(s->cpu_env);
>   }
>
> +/* caller must hold the lock */
>   static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode)
>   {
>       apic_report_irq_delivered(!get_bit(s->irr, vector_num));
> @@ -407,6 +428,7 @@ static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode)
>       apic_update_irq(s);
>   }
>
> +/* caller must hold the lock */
>   static void apic_eoi(APICCommonState *s)
>   {
>       int isrv;
> @@ -415,6 +437,7 @@ static void apic_eoi(APICCommonState *s)
>           return;
>       reset_bit(s->isr, isrv);
>       if (!(s->spurious_vec&  APIC_SV_DIRECTED_IO)&&  get_bit(s->tmr, isrv)) {
> +        //fix me,need to take extra lock for the bus?
>           ioapic_eoi_broadcast(isrv);
>       }
>       apic_sync_vapic(s, SYNC_FROM_VAPIC | SYNC_TO_VAPIC);
> @@ -480,18 +503,23 @@ static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
>   static void apic_startup(APICCommonState *s, int vector_num)
>   {
>       s->sipi_vector = vector_num;
> +    qemu_mutex_lock_cpu(s->cpu_env);
>       cpu_interrupt(s->cpu_env, CPU_INTERRUPT_SIPI);
> +    qemu_mutex_unlock_cpu(s->cpu_env);
>   }
>
>   void apic_sipi(DeviceState *d)
>   {
>       APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
> -
> +    qemu_mutex_lock_cpu(s->cpu_env);
>       cpu_reset_interrupt(s->cpu_env, CPU_INTERRUPT_SIPI);
>
> -    if (!s->wait_for_sipi)
> +    if (!s->wait_for_sipi) {
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           return;
> +    }
>       cpu_x86_load_seg_cache_sipi(s->cpu_env, s->sipi_vector);
> +    qemu_mutex_unlock_cpu(s->cpu_env);
>       s->wait_for_sipi = 0;
>   }
>
> @@ -543,6 +571,7 @@ static void apic_deliver(DeviceState *d, uint8_t dest, uint8_t dest_mode,
>       apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num, trigger_mode);
>   }
>
> +/* Caller must take lock */
>   int apic_get_interrupt(DeviceState *d)
>   {
>       APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
> @@ -572,6 +601,7 @@ int apic_get_interrupt(DeviceState *d)
>       return intno;
>   }
>
> +/* Caller must hold the cpu_lock */
>   int apic_accept_pic_intr(DeviceState *d)
>   {
>       APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
> @@ -589,6 +619,7 @@ int apic_accept_pic_intr(DeviceState *d)
>       return 0;
>   }
>
> +/* Caller hold lock */
>   static uint32_t apic_get_current_count(APICCommonState *s)
>   {
>       int64_t d;
> @@ -619,9 +650,10 @@ static void apic_timer_update(APICCommonState *s, int64_t current_time)
>   static void apic_timer(void *opaque)
>   {
>       APICCommonState *s = opaque;
> -
> +    qemu_mutex_lock_cpu(s->cpu_env);
>       apic_local_deliver(s, APIC_LVT_TIMER);
>       apic_timer_update(s, s->next_time);
> +    qemu_mutex_unlock_cpu(s->cpu_env);
>   }
>
>   static uint32_t apic_mem_readb(void *opaque, target_phys_addr_t addr)
> @@ -664,18 +696,22 @@ static uint32_t apic_mem_readl(void *opaque, target_phys_addr_t addr)
>           val = 0x11 | ((APIC_LVT_NB - 1)<<  16); /* version 0x11 */
>           break;
>       case 0x08:
> +        qemu_mutex_lock_cpu(s->cpu_env);
>           apic_sync_vapic(s, SYNC_FROM_VAPIC);
>           if (apic_report_tpr_access) {
>               cpu_report_tpr_access(s->cpu_env, TPR_ACCESS_READ);
>           }
>           val = s->tpr;
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x09:
>           val = apic_get_arb_pri(s);
>           break;
>       case 0x0a:
>           /* ppr */
> +        qemu_mutex_lock_cpu(s->cpu_env);
>           val = apic_get_ppr(s);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x0b:
>           val = 0;
> @@ -712,7 +748,9 @@ static uint32_t apic_mem_readl(void *opaque, target_phys_addr_t addr)
>           val = s->initial_count;
>           break;
>       case 0x39:
> +        qemu_mutex_lock_cpu(s->cpu_env);
>           val = apic_get_current_count(s);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x3e:
>           val = s->divide_conf;
> @@ -767,18 +805,22 @@ static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
>       case 0x03:
>           break;
>       case 0x08:
> +       qemu_mutex_lock_cpu(s->cpu_env);
>           if (apic_report_tpr_access) {
>               cpu_report_tpr_access(s->cpu_env, TPR_ACCESS_WRITE);
>           }
>           s->tpr = val;
>           apic_sync_vapic(s, SYNC_TO_VAPIC);
>           apic_update_irq(s);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x09:
>       case 0x0a:
>           break;
>       case 0x0b: /* EOI */
> +        qemu_mutex_lock_cpu(s->cpu_env);
>           apic_eoi(s);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x0d:
>           s->log_dest = val>>  24;
> @@ -787,8 +829,10 @@ static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
>           s->dest_mode = val>>  28;
>           break;
>       case 0x0f:
> +        qemu_mutex_lock_cpu(s->cpu_env);
>           s->spurious_vec = val&  0x1ff;
>           apic_update_irq(s);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x10 ... 0x17:
>       case 0x18 ... 0x1f:
> @@ -807,24 +851,30 @@ static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
>       case 0x32 ... 0x37:
>           {
>               int n = index - 0x32;
> +            qemu_mutex_lock_cpu(s->cpu_env);
>               s->lvt[n] = val;
>               if (n == APIC_LVT_TIMER)
>                   apic_timer_update(s, qemu_get_clock_ns(vm_clock));
> +            qemu_mutex_unlock_cpu(s->cpu_env);
>           }
>           break;
>       case 0x38:
> +       qemu_mutex_lock_cpu(s->cpu_env);
>           s->initial_count = val;
>           s->initial_count_load_time = qemu_get_clock_ns(vm_clock);
>           apic_timer_update(s, s->initial_count_load_time);
> +        qemu_mutex_unlock_cpu(s->cpu_env);
>           break;
>       case 0x39:
>           break;
>       case 0x3e:
>           {
>               int v;
> +            qemu_mutex_lock_cpu(s->cpu_env);
>               s->divide_conf = val&  0xb;
>               v = (s->divide_conf&  3) | ((s->divide_conf>>  1)&  4);
>               s->count_shift = (v + 1)&  7;
> +            qemu_mutex_unlock_cpu(s->cpu_env);
>           }
>           break;
>       default:
> diff --git a/hw/pc.c b/hw/pc.c
> index 4d34a33..5e7350c 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -147,7 +147,7 @@ void cpu_smm_update(CPUX86State *env)
>           smm_set(!!(env->hflags&  HF_SMM_MASK), smm_arg);
>   }
>
> -
> +/* Called by kvm_cpu_exec(), already with lock protection */
>   /* IRQ handling */
>   int cpu_get_pic_interrupt(CPUX86State *env)
>   {
> @@ -173,9 +173,15 @@ static void pic_irq_request(void *opaque, int irq, int level)
>       DPRINTF("pic_irqs: %s irq %d\n", level? "raise" : "lower", irq);
>       if (env->apic_state) {
>           while (env) {
> +            if (!qemu_cpu_is_self(env)) {
> +                qemu_mutex_lock_cpu(env);
> +            }
>               if (apic_accept_pic_intr(env->apic_state)) {
>                   apic_deliver_pic_intr(env->apic_state, level);
>               }
> +            if (!qemu_cpu_is_self(env)) {
> +                qemu_mutex_unlock_cpu(env);
> +            }
>               env = env->next_cpu;
>           }
>       } else {
> diff --git a/kvm-all.c b/kvm-all.c
> index 9b73ccf..dc9aa54 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -29,6 +29,7 @@
>   #include "bswap.h"
>   #include "memory.h"
>   #include "exec-memory.h"
> +#include "qemu-thread.h"
>
>   /* This check must be after config-host.h is included */
>   #ifdef CONFIG_EVENTFD
> @@ -1252,12 +1253,15 @@ int kvm_cpu_exec(CPUArchState *env)
>                */
>               qemu_cpu_kick_self();
>           }
> -        qemu_mutex_unlock_iothread();
> +        qemu_mutex_unlock(env->cpu_lock);
>
>           run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
>
> -        qemu_mutex_lock_iothread();
> +        qemu_mutex_lock(env->cpu_lock);
>           kvm_arch_post_run(env, run);
> +        qemu_mutex_unlock_cpu(env);
> +
> +        qemu_mutex_lock_iothread();
>
>           kvm_flush_coalesced_mmio_buffer();
>
> @@ -1265,6 +1269,8 @@ int kvm_cpu_exec(CPUArchState *env)
>               if (run_ret == -EINTR || run_ret == -EAGAIN) {
>                   DPRINTF("io window exit\n");
>                   ret = EXCP_INTERRUPT;
> +                qemu_mutex_unlock_iothread();
> +                qemu_mutex_lock_cpu(env);
>                   break;
>               }
>               fprintf(stderr, "error: kvm run failed %s\n",
> @@ -1312,6 +1318,9 @@ int kvm_cpu_exec(CPUArchState *env)
>               ret = kvm_arch_handle_exit(env, run);
>               break;
>           }
> +
> +        qemu_mutex_unlock_iothread();
> +        qemu_mutex_lock_cpu(env);
>       } while (ret == 0);

This really confuses me.

Shouldn't we be moving qemu_mutex_unlock_iothread() to the very top of 
kvm_cpu_exec()?

There's only a fixed amount of state that gets manipulated in kvm_cpu_exec(),
so shouldn't we try to specifically describe what state is covered by cpu_lock?

What's your strategy for ensuring that all accesses to that state are protected
by cpu_lock?

Regards,

Anthony Liguori

>
>       if (ret<  0) {


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 10:37     ` Jan Kiszka
@ 2012-06-22 20:11       ` Anthony Liguori
  2012-06-22 21:14         ` Jan Kiszka
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2012-06-22 20:11 UTC
  To: Jan Kiszka; +Cc: liu ping fan, Stefan Hajnoczi, qemu-devel@nongnu.org

On 06/22/2012 05:37 AM, Jan Kiszka wrote:
> On 2012-06-22 12:24, liu ping fan wrote:
>> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>>> Nowadays, we use qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>>>> protect the race to access the emulated dev launched by vcpu threads&  iothread.
>>>>
>>>> But this lock is too big. We can break it down.
>>>> These patches separate the CPUArchState's protection from the other devices, so we
>>>> can have a per-cpu lock for each CPUArchState, not the big lock any longer.
>>>
>>> Anything that reduces lock dependencies is generally welcome. But can
>>> you specify in more details what you gain, and under which conditions?
>>>
>> In fact, there are several steps to break down the Qemu big lock. This
>> step just aims to shrink the code area protected by
>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
>> working on the following steps, which focus on breaking down the big
>> lock when calling handle_{io,mmio}
>
> Then let us discuss the strategy. This is important as it is unrealistic
> to break up the lock for all code paths. We really need to focus on
> goals that provide benefits for relevant use cases.

Stefan put together a proof of concept that implemented the data-plane portion 
of virtio-blk in a separate thread.  This is possible because of I/O eventfd (we 
were able to select() on that fd in a separate thread).
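
(For context, a minimal sketch of that mechanism, assuming Linux eventfd
semantics; this is not the actual virtio-blk data-plane code and the function
name is invented:)

    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <stdint.h>

    /* Sketch: a dedicated thread blocks on the ioeventfd that KVM signals
     * when the guest kicks the virtqueue, then does the work without
     * taking the big QEMU lock. */
    static void *data_plane_thread(void *opaque)
    {
        int fd = *(int *)opaque;     /* eventfd registered as an ioeventfd */
        uint64_t count;

        for (;;) {
            if (read(fd, &count, sizeof(count)) != sizeof(count)) {
                break;               /* fd closed or error: stop the thread */
            }
            /* process_virtqueue(...);  device-private locking only */
        }
        return NULL;
    }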

The performance difference between virtio-blk-pci and virtio-blk-data-plane is 
staggering when dealing with a very large storage system.

So we'd like to get the infrastructure in place where we can start
multithreading devices in QEMU so we can integrate this work.

The basic plan is to introduce granular locking starting at the KVM dispatch
level until we can get to MemoryRegion dispatch.  We'll then have some way to
indicate that a MemoryRegion's callbacks should be invoked without holding the
QEMU global mutex.

We can then convert devices one at a time.

While the threading in the KVM code is certainly complex, it's also relatively 
isolated from the rest of QEMU.  So we don't have to worry about auditing large 
subsystems for re-entrancy safety.

Once we have unlocked MemoryRegions, we can start writing some synthetic test 
cases to really stress the locking code too.

Regards,

Anthony Liguori

>
> Jan
>


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 20:11       ` Anthony Liguori
@ 2012-06-22 21:14         ` Jan Kiszka
  2012-06-22 21:44           ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2012-06-22 21:14 UTC
  To: Anthony Liguori; +Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 2012-06-22 22:11, Anthony Liguori wrote:
> On 06/22/2012 05:37 AM, Jan Kiszka wrote:
>> On 2012-06-22 12:24, liu ping fan wrote:
>>> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka<jan.kiszka@siemens.com> 
>>> wrote:
>>>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>>>> Nowadays, we use
>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>>>>> protect the race to access the emulated dev launched by vcpu
>>>>> threads&  iothread.
>>>>>
>>>>> But this lock is too big. We can break it down.
>>>>> These patches separate the CPUArchState's protection from the other
>>>>> devices, so we
>>>>> can have a per-cpu lock for each CPUArchState, not the big lock any
>>>>> longer.
>>>>
>>>> Anything that reduces lock dependencies is generally welcome. But can
>>>> you specify in more details what you gain, and under which conditions?
>>>>
>>> In fact, there are several steps to break down the Qemu big lock. This
>>> step just aims to shrink the code area protected by
>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
>>> working on the following steps, which focus on breaking down the big
>>> lock when calling handle_{io,mmio}
>>
>> Then let us discuss the strategy. This is important as it is unrealistic
>> to break up the lock for all code paths. We really need to focus on
>> goals that provide benefits for relevant use cases.
> 
> Stefan put together a proof of concept that implemented the data-plane
> portion of virtio-blk in a separate thread.  This is possible because of
> I/O eventfd (we were able to select() on that fd in a separate thread).
> 
> The performance difference between virtio-blk-pci and
> virtio-blk-data-plane is staggering when dealing with a very large
> storage system.
> 
> So we'd like to get the infrastructure in place where we can start
> multithreading devices in QEMU to we can integrate this work.

Can you name the primary bits? We really need to see the whole picture
before adding new locks. They alone are not the solution.

> 
> The basic plan is introduce granular locking starting at the KVM
> dispatch level until we can get to MemoryRegion dispatch.  We'll then
> have some way to indicate that a MemoryRegion's callbacks should be
> invoked without holding the qemu global mutex.

I don't disagree, but this really looks like starting at the wrong end.
The changes are not isolated and surely not yet correct (e.g. run_on_cpu
is broken for TCG).

Then, none of this locking should be needed for in-kernel irqchips. All
touched states are thread local or should be modifiable atomically - if
not let's fix *that*, it's more beneficial.

Actually, cpu_lock is counterproductive as it adds locking ops to a path
where we will not need them later on in the normal configuration. User
space irqchip is a slow path and perfectly fine to handle under BQL. So
is VCPU control (pause/resume/run-on). It's better to focus on the fast
path first.

Jan




* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 21:14         ` Jan Kiszka
@ 2012-06-22 21:44           ` Anthony Liguori
  2012-06-22 22:27             ` Jan Kiszka
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2012-06-22 21:44 UTC
  To: Jan Kiszka; +Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 06/22/2012 04:14 PM, Jan Kiszka wrote:
> On 2012-06-22 22:11, Anthony Liguori wrote:
>> On 06/22/2012 05:37 AM, Jan Kiszka wrote:
>>> On 2012-06-22 12:24, liu ping fan wrote:
>>>> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka<jan.kiszka@siemens.com>
>>>> wrote:
>>>>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>>>>> Nowadays, we use
>>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>>>>>> protect the race to access the emulated dev launched by vcpu
>>>>>> threads&   iothread.
>>>>>>
>>>>>> But this lock is too big. We can break it down.
>>>>>> These patches separate the CPUArchState's protection from the other
>>>>>> devices, so we
>>>>>> can have a per-cpu lock for each CPUArchState, not the big lock any
>>>>>> longer.
>>>>>
>>>>> Anything that reduces lock dependencies is generally welcome. But can
>>>>> you specify in more details what you gain, and under which conditions?
>>>>>
>>>> In fact, there are several steps to break down the Qemu big lock. This
>>>> step just aims to shrink the code area protected by
>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
>>>> working on the following steps, which focus on breaking down the big
>>>> lock when calling handle_{io,mmio}
>>>
>>> Then let us discuss the strategy. This is important as it is unrealistic
>>> to break up the lock for all code paths. We really need to focus on
>>> goals that provide benefits for relevant use cases.
>>
>> Stefan put together a proof of concept that implemented the data-plane
>> portion of virtio-blk in a separate thread.  This is possible because of
>> I/O eventfd (we were able to select() on that fd in a separate thread).
>>
>> The performance difference between virtio-blk-pci and
>> virtio-blk-data-plane is staggering when dealing with a very large
>> storage system.
>>
>> So we'd like to get the infrastructure in place where we can start
>> multithreading devices in QEMU to we can integrate this work.
>
> Can you name the primary bits? We really need to see the whole picture
> before adding new locks. They alone are not the solution.

Sorry, not sure what you mean by "the primary bits".

At a high level, the plan is to:

1) unlock iothread before entering the do {} loop in kvm_cpu_exec()
    a) reacquire the lock after the loop
    b) reacquire the lock in kvm_handle_io()
    c) introduce an unlocked memory accessor that, for now, just takes the
iothread lock and calls cpu_physical_memory_rw() (see the sketch after this
list)

2) focus initially on killing the lock in kvm_handle_io()
    a) the ioport table is pretty simplistic so adding fine grain locking won't 
be hard.
    b) reacquire lock right before ioport dispatch

3) allow for registering ioport handlers w/o the dispatch function carrying the
iothread lock
    a) this is mostly memory API plumbing

4) focus on going back and adding fine grain locking to the 
cpu_physical_memory_rw() accessor
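
For 1c, a minimal sketch of such an accessor (the name is made up; the point
is that it can later grow fine-grained locking without touching its callers):

    /* Sketch for step 1c: funnel through the BQL until fine-grained
     * locking exists underneath cpu_physical_memory_rw(). */
    static void cpu_physical_memory_rw_unlocked(target_phys_addr_t addr,
                                                uint8_t *buf, int len,
                                                int is_write)
    {
        qemu_mutex_lock_iothread();
        cpu_physical_memory_rw(addr, buf, len, is_write);
        qemu_mutex_unlock_iothread();
    }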

Note that whenever possible, we should be using rwlocks instead of a normal 
mutex.  In particular, for the ioport data structures, a rwlock seems pretty 
obvious.
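
(A sketch of what that could look like for the ioport write table - QEMU has
no rwlock wrapper today, so a plain pthread rwlock is assumed, and the table
layout follows ioport.c:)

    #include <pthread.h>

    /* Sketch: PIO dispatch on VCPU exit takes the read side, ioport
     * registration takes the write side. */
    static pthread_rwlock_t ioport_table_lock = PTHREAD_RWLOCK_INITIALIZER;

    static IOPortWriteFunc *ioport_lookup_write(uint32_t port, int bsize)
    {
        IOPortWriteFunc *func;

        pthread_rwlock_rdlock(&ioport_table_lock);
        func = ioport_write_table[bsize][port];
        pthread_rwlock_unlock(&ioport_table_lock);
        return func;
    }

    /* Registration (rare) would wrap the table update in
     * pthread_rwlock_wrlock()/pthread_rwlock_unlock(). */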

>
>>
>> The basic plan is introduce granular locking starting at the KVM
>> dispatch level until we can get to MemoryRegion dispatch.  We'll then
>> have some way to indicate that a MemoryRegion's callbacks should be
>> invoked without holding the qemu global mutex.
>
> I don't disagree, but this end really looks like starting at the wrong
> edge. The changes are not isolated and surely not yet correct
> (run_on_cpu is broken for tcg e.g.).
>
> Then, none of this locking should be needed for in-kernel irqchips. All
> touched states are thread local or should be modifiable atomically - if
> not let's fix *that*, it's more beneficial.
>
> Actually, cpu_lock is counterproductive as it adds locking ops to a path
> where we will not need them later on in the normal configuration. User
> space irqchip is a slow path and perfectly fine to handle under BQL. So
> is VCPU control (pause/resume/run-on). It's better to focus on the fast
> path first.

To be clear, I'm not advocating introducing cpu_lock.  We should do whatever 
makes the most sense to not have to hold iothread lock while processing an exit 
from KVM.

Note that this is an RFC, the purpose of this series is to have this discussion :-)

Regards,

Anthony Liguori

>
> Jan
>


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 21:44           ` Anthony Liguori
@ 2012-06-22 22:27             ` Jan Kiszka
  2012-06-22 22:56               ` Anthony Liguori
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2012-06-22 22:27 UTC
  To: Anthony Liguori; +Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 2012-06-22 23:44, Anthony Liguori wrote:
> On 06/22/2012 04:14 PM, Jan Kiszka wrote:
>> On 2012-06-22 22:11, Anthony Liguori wrote:
>>> On 06/22/2012 05:37 AM, Jan Kiszka wrote:
>>>> On 2012-06-22 12:24, liu ping fan wrote:
>>>>> On Thu, Jun 21, 2012 at 11:23 PM, Jan Kiszka<jan.kiszka@siemens.com>
>>>>> wrote:
>>>>>> On 2012-06-21 16:49, Liu Ping Fan wrote:
>>>>>>> Nowadays, we use
>>>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() to
>>>>>>> protect the race to access the emulated dev launched by vcpu
>>>>>>> threads&   iothread.
>>>>>>>
>>>>>>> But this lock is too big. We can break it down.
>>>>>>> These patches separate the CPUArchState's protection from the other
>>>>>>> devices, so we
>>>>>>> can have a per-cpu lock for each CPUArchState, not the big lock any
>>>>>>> longer.
>>>>>>
>>>>>> Anything that reduces lock dependencies is generally welcome. But can
>>>>>> you specify in more details what you gain, and under which
>>>>>> conditions?
>>>>>>
>>>>> In fact, there are several steps to break down the Qemu big lock. This
>>>>> step just aims to shrink the code area protected by
>>>>> qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread(). And I am
>>>>> working on the following steps, which focus on breaking down the big
>>>>> lock when calling handle_{io,mmio}
>>>>
>>>> Then let us discuss the strategy. This is important as it is
>>>> unrealistic
>>>> to break up the lock for all code paths. We really need to focus on
>>>> goals that provide benefits for relevant use cases.
>>>
>>> Stefan put together a proof of concept that implemented the data-plane
>>> portion of virtio-blk in a separate thread.  This is possible because of
>>> I/O eventfd (we were able to select() on that fd in a separate thread).
>>>
>>> The performance difference between virtio-blk-pci and
>>> virtio-blk-data-plane is staggering when dealing with a very large
>>> storage system.
>>>
>>> So we'd like to get the infrastructure in place where we can start
>>> multithreading devices in QEMU to we can integrate this work.
>>
>> Can you name the primary bits? We really need to see the whole picture
>> before adding new locks. They alone are not the solution.
> 
> Sorry, not sure what you mean by "the primary bits".
> 
> At a high level, the plan is to:
> 
> 1) unlock iothread before entering the do {} look in kvm_cpu_exec()
>    a) reacquire the lock after the loop
>    b) reacquire the lock in kvm_handle_io()
>    c) introduce an unlocked memory accessor that for now, just requires
> the iothread lock() and calls cpu_physical_memory_rw()

Right, that's what we have here as well. The latter is modeled as a so
called "I/O pathway", a thread-based execution context for
frontend/backend pairs with some logic to transfer certain I/O requests
asynchronously to the pathway thread.

The tricky part was to get nested requests right, i.e. when a request
triggers another one from within the device model. This is where things
get ugly. In theory, you can end up with a VM deadlock if you just apply
per-device locking. I'm currently trying to rebase our patches, review
and document the logic behind it.

> 
> 2) focus initially on killing the lock in kvm_handle_io()
>    a) the ioport table is pretty simplistic so adding fine grain locking
> won't be hard.
>    b) reacquire lock right before ioport dispatch
> 
> 3) allow for register ioport handlers w/o the dispatch function carrying
> a iothread
>    a) this is mostly memory API plumbing

We skipped this as our NICs didn't do PIO, but you clearly need it for
virtio.

> 
> 4) focus on going back and adding fine grain locking to the
> cpu_physical_memory_rw() accessor

In the end, PIO and MMIO should use the same patterns - and will face
the same challenges. Ideally, we model things very similarly right from
the start.

And then there is also

5) provide direct IRQ delivery from the device model to the IRQ chip.
That's much like what we need for VFIO and KVM device assignment. But
here we won't be able to cheat and ignore correct generation of vmstates
of the bypassed PCI host bridges etc... Which leads me to that other
thread about how to handle this for PCI device pass-through.
Contributions to that discussion are welcome as well.

> 
> Note that whenever possible, we should be using rwlocks instead of a
> normal mutex.  In particular, for the ioport data structures, a rwlock
> seems pretty obvious.

I think we should mostly be fine with a "big hammer" rwlock: unlocked
read access from VCPUs and iothreads, and vmstop/resume around
modifications of fast path data structures (like the memory region
hierarchy or the PIO table). Where that's not sufficient, RCU will be
needed. Sleeping rwlocks have horrible semantics (specifically when
thread priorities come into play) and are performance-wise inferior. We
should avoid them completely.
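
(A sketch of that write side, assuming pause_all_vcpus()/resume_all_vcpus()
from cpus.c are a sufficient quiesce point; the function and table names are
illustrative:)

    /* Sketch: readers stay lockless; writers stop all VCPUs (and would
     * have to quiesce iothread readers as well) before touching a
     * fast-path structure such as the PIO table. */
    static void ioport_table_update(uint32_t port, int bsize,
                                    IOPortWriteFunc *func)
    {
        pause_all_vcpus();               /* no VCPU dispatches PIO now */
        ioport_write_table[bsize][port] = func;
        resume_all_vcpus();
    }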

> 
>>
>>>
>>> The basic plan is introduce granular locking starting at the KVM
>>> dispatch level until we can get to MemoryRegion dispatch.  We'll then
>>> have some way to indicate that a MemoryRegion's callbacks should be
>>> invoked without holding the qemu global mutex.
>>
>> I don't disagree, but this end really looks like starting at the wrong
>> edge. The changes are not isolated and surely not yet correct
>> (run_on_cpu is broken for tcg e.g.).
>>
>> Then, none of this locking should be needed for in-kernel irqchips. All
>> touched states are thread local or should be modifiable atomically - if
>> not let's fix *that*, it's more beneficial.
>>
>> Actually, cpu_lock is counterproductive as it adds locking ops to a path
>> where we will not need them later on in the normal configuration. User
>> space irqchip is a slow path and perfectly fine to handle under BQL. So
>> is VCPU control (pause/resume/run-on). It's better to focus on the fast
>> path first.
> 
> To be clear, I'm not advocating introducing cpu_lock.  We should do
> whatever makes the most sense to not have to hold iothread lock while
> processing an exit from KVM.

Good that we agree. :)

> 
> Note that this is an RFC, the purpose of this series is to have this
> discussion :-)

Yep, I think we have it now ;). Hope I can contribute some code bits to
it soon, though I didn't schedule this task for the next week.

Jan




* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 22:27             ` Jan Kiszka
@ 2012-06-22 22:56               ` Anthony Liguori
  2012-06-23  9:10                 ` Jan Kiszka
  0 siblings, 1 reply; 15+ messages in thread
From: Anthony Liguori @ 2012-06-22 22:56 UTC
  To: Jan Kiszka; +Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 06/22/2012 05:27 PM, Jan Kiszka wrote:
> On 2012-06-22 23:44, Anthony Liguori wrote:
>> 1) unlock iothread before entering the do {} look in kvm_cpu_exec()
>>     a) reacquire the lock after the loop
>>     b) reacquire the lock in kvm_handle_io()
>>     c) introduce an unlocked memory accessor that for now, just requires
>> the iothread lock() and calls cpu_physical_memory_rw()
>
> Right, that's what we have here as well. The latter is modeled as a so
> called "I/O pathway", a thread-based execution context for
> frontend/backend pairs with some logic to transfer certain I/O requests
> asynchronously to the pathway thread.

Interesting, so the VCPU threads always hold the iothread mutex but some 
requests are routed to other threads?

I hadn't considered a design like that.  I've been thinking about a long term 
architecture that's a bit more invasive.

What we think of the I/O thread today wouldn't be special.  It would be one of N 
I/O threads all running separate copies of the main loop.  All of the functions 
that defer dispatch to a main loop would take a context as an argument and 
devices would essentially have a "vector" array of main loops as input.

So virtio-net probably would have two main loop "vectors" since it would like to 
schedule tx and rx independently.  There's nothing that says that you can't pass 
the same main loop context for each vector but that's a configuration choice.

Dispatch from VCPU context would behave the same as it does today, but
obviously per-device locking is needed.
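
(Purely hypothetical shape for that idea, just to make it concrete; none of
these names exist in QEMU:)

    /* Hypothetical sketch of the "vector of main loops" idea: a device is
     * handed one main-loop context per independently schedulable activity. */
    typedef struct MainLoop MainLoop;   /* one poll/dispatch loop per thread */

    typedef struct VirtIONetLoops {
        MainLoop *tx;   /* tx completion work is deferred to this loop */
        MainLoop *rx;   /* rx backend readiness is polled from this loop */
    } VirtIONetLoops;

    /* Passing the same MainLoop for both keeps today's single-threaded
     * behaviour; passing two different ones splits tx and rx across
     * threads. */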

> The tricky part was to get nested requests right, i.e. when a requests
> triggers another one from within the device model. This is where things
> get ugly. In theory, you can end up with a vm deadlock if you just apply
> per-device locking. I'm currently trying to rebase our patches, review
> and document the logic behind it.

I really think the only way to solve this is to separate map()'d DMA access 
(where the device really wants to deal with RAM only) and copy-based access 
(where devices map DMA to other devices).

For copy-based access, we really ought to move to a callback based API.  It adds 
quite a bit of complexity but it's really the only way to solve the problem 
robustly.

>> 2) focus initially on killing the lock in kvm_handle_io()
>>     a) the ioport table is pretty simplistic so adding fine grain locking
>> won't be hard.
>>     b) reacquire lock right before ioport dispatch
>>
>> 3) allow for register ioport handlers w/o the dispatch function carrying
>> a iothread
>>     a) this is mostly memory API plumbing
>
> We skipped this as our NICs didn't do PIO, but you clearly need it for
> virtio.

Right.

>> 4) focus on going back and adding fine grain locking to the
>> cpu_physical_memory_rw() accessor
>
> In the end, PIO and MMIO should use the same patterns - and will face
> the same challenges. Ideally, we model things very similar right from
> the start.

Yes.

> And then there is also
>
> 5) provide direct IRQ delivery from the device model to the IRQ chip.
> That's much like what we need for VFIO and KVM device assignment. But
> here we won't be able to cheat and ignore correct generation of vmstates
> of the bypassed PCI host bridges etc... Which leads me to that other
> thread about how to handle this for PCI device pass-through.
> Contributions to that discussion are welcome as well.

I think you mean to the in-kernel IRQ chip.  I'm thinking about this still so I 
don't have a plan yet that I'm ready to share.  I have some ideas though.

>
>>
>> Note that whenever possible, we should be using rwlocks instead of a
>> normal mutex.  In particular, for the ioport data structures, a rwlock
>> seems pretty obvious.
>
> I think we should mostly be fine with a "big hammer" rwlock: unlocked
> read access from VCPUs and iothreads, and vmstop/resume around
> modifications of fast path data structures (like the memory region
> hierarchy or the PIO table).

Ack.

> Where that's not sufficient, RCU will be
> needed. Sleeping rwlocks have horrible semantics (specifically when
> thread priorities come into play) and are performance-wise inferior. We
> should avoid them completely.

Yes, I think RCU is inevitable here but I think starting with rwlocks will help 
with the big refactoring.

>>
>> To be clear, I'm not advocating introducing cpu_lock.  We should do
>> whatever makes the most sense to not have to hold iothread lock while
>> processing an exit from KVM.
>
> Good that we agree. :)
>
>>
>> Note that this is an RFC, the purpose of this series is to have this
>> discussion :-)
>
> Yep, I think we have it now ;). Hope I can contribute some code bits to
> it soon, though I didn't schedule this task for the next week.

Great!  If you have something you can share, I'd be eager to look at it 
regardless of the condition of the code.

Regards,

Anthony Liguori

>
> Jan
>


* Re: [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread
  2012-06-22 22:56               ` Anthony Liguori
@ 2012-06-23  9:10                 ` Jan Kiszka
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kiszka @ 2012-06-23  9:10 UTC
  To: Anthony Liguori; +Cc: qemu-devel@nongnu.org, liu ping fan, Stefan Hajnoczi

On 2012-06-23 00:56, Anthony Liguori wrote:
> On 06/22/2012 05:27 PM, Jan Kiszka wrote:
>> On 2012-06-22 23:44, Anthony Liguori wrote:
>>> 1) unlock iothread before entering the do {} look in kvm_cpu_exec()
>>>     a) reacquire the lock after the loop
>>>     b) reacquire the lock in kvm_handle_io()
>>>     c) introduce an unlocked memory accessor that for now, just requires
>>> the iothread lock() and calls cpu_physical_memory_rw()
>>
>> Right, that's what we have here as well. The latter is modeled as a so
>> called "I/O pathway", a thread-based execution context for
>> frontend/backend pairs with some logic to transfer certain I/O requests
>> asynchronously to the pathway thread.
> 
> Interesting, so the VCPU threads always hold the iothread mutex but some
> requests are routed to other threads?

The VCPUs acquire the iothread lock only if the request can neither be
handled directly nor forwarded to a pathway thread. When a request is
forwarded, pathway-specific locks are taken instead. One big advantage of
this model is that you do not need to worry about locks in the device
models themselves. That helps when migrating existing models but should
also be sufficient for quite a few use cases.

> 
> I hadn't considered a design like that.  I've been thinking about a long
> term architecture that's a bit more invasive.
> 
> What we think of the I/O thread today wouldn't be special.  It would be
> one of N I/O threads all running separate copies of the main loop.  All
> of the functions that defer dispatch to a main loop would take a context
> as an argument and devices would essentially have a "vector" array of
> main loops as input.
> 
> So virtio-net probably would have two main loop "vectors" since it would
> like to schedule tx and rx independently.  There's nothing that says
> that you can't pass the same main loop context for each vector but
> that's a configuration choice.
> 
> Dispatch from VCPU context would behave the same it does today but
> obviously per-device locking is needed.

And every backend would run over its own thread - I guess this is
conceptually close to what we have. However, the devil is in the detail.
E.g., we will also need per-iothread timer services (we skipped this so
far). And the device-to-device request problem needs to be solved (see
below).

> 
>> The tricky part was to get nested requests right, i.e. when a requests
>> triggers another one from within the device model. This is where things
>> get ugly. In theory, you can end up with a vm deadlock if you just apply
>> per-device locking. I'm currently trying to rebase our patches, review
>> and document the logic behind it.
> 
> I really think the only way to solve this is to separate map()'d DMA
> access (where the device really wants to deal with RAM only) and
> copy-based access (where devices map DMA to other devices).
> 
> For copy-based access, we really ought to move to a callback based API. 
> It adds quite a bit of complexity but it's really the only way to solve
> the problem robustly.

Maybe we are talking about the same thing: What we need is a mechanism
to queue MMIO requests for execution over some iothread / pathway
context in case we are about to get trapped by lock recursion. Then we
also have to make sure that queued requests are not overtaken by
requests issued afterward. This is an important part of our approach.
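
(A sketch of such a deferral queue, with invented names; a single FIFO per
target context is what keeps later requests from overtaking queued ones:)

    /* Sketch: when dispatching an MMIO request directly would recurse on a
     * lock we already hold, queue it for the pathway/iothread context. */
    typedef struct DeferredMMIO {
        target_phys_addr_t addr;
        uint64_t value;
        unsigned size;
        bool is_write;
        QSIMPLEQ_ENTRY(DeferredMMIO) entry;
    } DeferredMMIO;

    static QSIMPLEQ_HEAD(, DeferredMMIO) pathway_queue =
        QSIMPLEQ_HEAD_INITIALIZER(pathway_queue);

    /* Caller holds the pathway's queue lock; FIFO order preserves issue
     * order. */
    static void pathway_defer_mmio(target_phys_addr_t addr, uint64_t value,
                                   unsigned size, bool is_write)
    {
        DeferredMMIO *req = g_malloc0(sizeof(*req));

        req->addr = addr;
        req->value = value;
        req->size = size;
        req->is_write = is_write;
        QSIMPLEQ_INSERT_TAIL(&pathway_queue, req, entry);
        /* ... then kick the pathway thread, e.g. via an eventfd ... */
    }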

> 
>>> 2) focus initially on killing the lock in kvm_handle_io()
>>>     a) the ioport table is pretty simplistic so adding fine grain
>>> locking
>>> won't be hard.
>>>     b) reacquire lock right before ioport dispatch
>>>
>>> 3) allow for register ioport handlers w/o the dispatch function carrying
>>> a iothread
>>>     a) this is mostly memory API plumbing
>>
>> We skipped this as our NICs didn't do PIO, but you clearly need it for
>> virtio.
> 
> Right.
> 
>>> 4) focus on going back and adding fine grain locking to the
>>> cpu_physical_memory_rw() accessor
>>
>> In the end, PIO and MMIO should use the same patterns - and will face
>> the same challenges. Ideally, we model things very similar right from
>> the start.
> 
> Yes.
> 
>> And then there is also
>>
>> 5) provide direct IRQ delivery from the device model to the IRQ chip.
>> That's much like what we need for VFIO and KVM device assignment. But
>> here we won't be able to cheat and ignore correct generation of vmstates
>> of the bypassed PCI host bridges etc... Which leads me to that other
>> thread about how to handle this for PCI device pass-through.
>> Contributions to that discussion are welcome as well.
> 
> I think you mean to the in-kernel IRQ chip.  I'm thinking about this
> still so I don't have a plan yet that I'm ready to share.  I have some
> ideas though.
> 
>>
>>>
>>> Note that whenever possible, we should be using rwlocks instead of a
>>> normal mutex.  In particular, for the ioport data structures, a rwlock
>>> seems pretty obvious.
>>
>> I think we should mostly be fine with a "big hammer" rwlock: unlocked
>> read access from VCPUs and iothreads, and vmstop/resume around
>> modifications of fast path data structures (like the memory region
>> hierarchy or the PIO table).
> 
> Ack.
> 
>> Where that's not sufficient, RCU will be
>> needed. Sleeping rwlocks have horrible semantics (specifically when
>> thread priorities come into play) and are performance-wise inferior. We
>> should avoid them completely.
> 
> Yes, I think RCU is inevitable here but I think starting with rwlocks
> will help with the big refactoring.

Let's wait for the first patches... :)
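
To illustrate the writer side of the "big hammer": readers (VCPUs,
iothreads) would look up e.g. the ioport table without any lock at all,
and an update simply stops the readers first. A rough sketch, with
ioport_table_update() being a made-up helper; a vm_stop() based variant
would additionally quiesce the iothreads:

/* Writer side only; readers do plain loads with no lock.  This has to
 * run in the iothread with the BQL held, like today. */
static void ioport_table_update(pio_addr_t start, int length,
                                IOPortReadFunc *read_cb,
                                IOPortWriteFunc *write_cb, void *opaque)
{
    bool running = runstate_is_running();

    if (running) {
        pause_all_vcpus();  /* no VCPU is inside the dispatch path now */
    }
    register_ioport_read(start, length, 1, read_cb, opaque);
    register_ioport_write(start, length, 1, write_cb, opaque);
    if (running) {
        resume_all_vcpus();
    }
}

Readers stay completely lock-free; the price is that every table update
becomes a heavyweight operation, which is fine as long as updates are
rare (hotplug, BAR remapping and the like).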

> 
>>>
>>> To be clear, I'm not advocating introducing cpu_lock.  We should do
>>> whatever makes the most sense to not have to hold iothread lock while
>>> processing an exit from KVM.
>>
>> Good that we agree. :)
>>
>>>
>>> Note that this is an RFC, the purpose of this series is to have this
>>> discussion :-)
>>
>> Yep, I think we have it now ;). Hope I can contribute some code bits to
>> it soon, though I didn't schedule this task for the next week.
> 
> Great!  If you have something you can share, I'd be eager to look at it
> regardless of the condition of the code.

Let me just finish the rebasing. The completed switch to memory region
abstractions makes the code cleaner in some important parts.

Jan



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock
  2012-06-22 20:06   ` Anthony Liguori
@ 2012-06-23  9:32     ` liu ping fan
  2012-06-24 14:09       ` liu ping fan
  0 siblings, 1 reply; 15+ messages in thread
From: liu ping fan @ 2012-06-23  9:32 UTC (permalink / raw
  To: Anthony Liguori; +Cc: Jan Kiszka, Liu Ping Fan, qemu-devel

On Sat, Jun 23, 2012 at 4:06 AM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 06/21/2012 09:49 AM, Liu Ping Fan wrote:
>>
>> In order to break the big lock, use a per-cpu lock in kvm_cpu_exec()
>> to protect against other cpus' racy access to env->apic_state & the
>> related fields in env.
>> Also, we need to protect against run_on_cpu().
>>
>> The race condition can look like this:
>> 1.  vcpu-1 IPIs vcpu-2
>>     vcpu-3 IPIs vcpu-2
>>     An open window exists for access to vcpu-2's apic_state & env
>>
>> 2. run_on_cpu() writes env->queued_work_last, while flush_queued_work()
>>    reads it
>>
>> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>> ---
>>  cpus.c    |    6 ++++--
>>  hw/apic.c |   58
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  hw/pc.c   |    8 +++++++-
>>  kvm-all.c |   13 +++++++++++--
>>  4 files changed, 76 insertions(+), 9 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 554f7bc..ac99afe 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -649,6 +649,7 @@ void run_on_cpu(CPUArchState *env, void (*func)(void
>> *data), void *data)
>>
>>      wi.func = func;
>>      wi.data = data;
>> +    qemu_mutex_lock(env->cpu_lock);
>>      if (!env->queued_work_first) {
>>          env->queued_work_first = &wi;
>>      } else {
>> @@ -657,6 +658,7 @@ void run_on_cpu(CPUArchState *env, void (*func)(void
>> *data), void *data)
>>      env->queued_work_last = &wi;
>>      wi.next = NULL;
>>      wi.done = false;
>> +    qemu_mutex_unlock(env->cpu_lock);
>>
>>      qemu_cpu_kick(env);
>>      while (!wi.done) {
>> @@ -718,7 +720,7 @@ static void qemu_tcg_wait_io_event(void)
>>  static void qemu_kvm_wait_io_event(CPUArchState *env)
>>  {
>>      while (cpu_thread_is_idle(env)) {
>> -        qemu_cond_wait(env->halt_cond, &qemu_global_mutex);
>> +        qemu_cond_wait(env->halt_cond, env->cpu_lock);
>>      }
>>
>>      qemu_kvm_eat_signals(env);
>> @@ -729,8 +731,8 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
>>  {
>>      CPUArchState *env = arg;
>>      int r;
>> +    qemu_mutex_lock_cpu(env);
>>
>> -    qemu_mutex_lock(&qemu_global_mutex);
>>      qemu_thread_get_self(env->thread);
>>      env->thread_id = qemu_get_thread_id();
>>      cpu_single_env = env;
>> diff --git a/hw/apic.c b/hw/apic.c
>> index 4eeaf88..b999a40 100644
>> --- a/hw/apic.c
>> +++ b/hw/apic.c
>> @@ -22,6 +22,7 @@
>>  #include "host-utils.h"
>>  #include "trace.h"
>>  #include "pc.h"
>> +#include "qemu-thread.h"
>>
>>  #define MAX_APIC_WORDS 8
>>
>> @@ -94,6 +95,7 @@ static int get_highest_priority_int(uint32_t *tab)
>>      return -1;
>>  }
>>
>> +/* Caller must hold the lock */
>>  static void apic_sync_vapic(APICCommonState *s, int sync_type)
>>  {
>>      VAPICState vapic_state;
>> @@ -141,11 +143,13 @@ static void apic_sync_vapic(APICCommonState *s, int
>> sync_type)
>>      }
>>  }
>>
>> +/* Caller must hold lock */
>>  static void apic_vapic_base_update(APICCommonState *s)
>>  {
>>      apic_sync_vapic(s, SYNC_TO_VAPIC);
>>  }
>>
>> +/* Caller must hold the lock */
>>  static void apic_local_deliver(APICCommonState *s, int vector)
>>  {
>>      uint32_t lvt = s->lvt[vector];
>> @@ -175,9 +179,11 @@ static void apic_local_deliver(APICCommonState *s,
>> int vector)
>>              (lvt & APIC_LVT_LEVEL_TRIGGER))
>>              trigger_mode = APIC_TRIGGER_LEVEL;
>>          apic_set_irq(s, lvt & 0xff, trigger_mode);
>>
>> +        break;
>>      }
>>  }
>>
>> +/* Caller must hold the lock */
>>  void apic_deliver_pic_intr(DeviceState *d, int level)
>>  {
>>      APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
>> @@ -200,9 +206,12 @@ void apic_deliver_pic_intr(DeviceState *d, int level)
>>      }
>>  }
>>
>> +/* Must hold lock */
>>  static void apic_external_nmi(APICCommonState *s)
>>  {
>> +    qemu_mutex_lock_cpu(s->cpu_env);
>>      apic_local_deliver(s, APIC_LVT_LINT1);
>> +    qemu_mutex_unlock_cpu(s->cpu_env);
>>  }
>>
>>  #define foreach_apic(apic, deliver_bitmask, code) \
>> @@ -215,7 +224,9 @@ static void apic_external_nmi(APICCommonState *s)
>>                  if (__mask & (1 << __j)) {\
>>
>>                      apic = local_apics[__i * 32 + __j];\
>>                      if (apic) {\
>> +                        qemu_mutex_lock_cpu(apic->cpu_env);\
>>                          code;\
>> +                        qemu_mutex_unlock_cpu(apic->cpu_env);\
>>                      }\
>>                  }\
>>              }\
>> @@ -244,7 +255,9 @@ static void apic_bus_deliver(const uint32_t
>> *deliver_bitmask,
>>                  if (d >= 0) {
>>                      apic_iter = local_apics[d];
>>                      if (apic_iter) {
>> +                        qemu_mutex_lock_cpu(apic_iter->cpu_env);
>>                          apic_set_irq(apic_iter, vector_num,
>> trigger_mode);
>> +                        qemu_mutex_unlock_cpu(apic_iter->cpu_env);
>>                      }
>>                  }
>>              }
>> @@ -293,6 +306,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
>> uint8_t delivery_mode,
>>      apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num,
>> trigger_mode);
>>  }
>>
>> +/* Caller must hold lock */
>>  static void apic_set_base(APICCommonState *s, uint64_t val)
>>  {
>>      s->apicbase = (val & 0xfffff000) |
>>
>> @@ -305,6 +319,7 @@ static void apic_set_base(APICCommonState *s, uint64_t
>> val)
>>      }
>>  }
>>
>> +/* caller must hold lock */
>>  static void apic_set_tpr(APICCommonState *s, uint8_t val)
>>  {
>>      /* Updates from cr8 are ignored while the VAPIC is active */
>> @@ -314,12 +329,14 @@ static void apic_set_tpr(APICCommonState *s, uint8_t
>> val)
>>      }
>>  }
>>
>> +/* caller must hold lock */
>>  static uint8_t apic_get_tpr(APICCommonState *s)
>>  {
>>      apic_sync_vapic(s, SYNC_FROM_VAPIC);
>>      return s->tpr >> 4;
>>  }
>>
>> +/* Caller must hold the lock */
>>  static int apic_get_ppr(APICCommonState *s)
>>  {
>>      int tpr, isrv, ppr;
>> @@ -348,6 +365,7 @@ static int apic_get_arb_pri(APICCommonState *s)
>>   * 0  - no interrupt,
>>   *>0 - interrupt number
>>   */
>> +/* Caller must hold cpu_lock */
>>  static int apic_irq_pending(APICCommonState *s)
>>  {
>>      int irrv, ppr;
>> @@ -363,6 +381,7 @@ static int apic_irq_pending(APICCommonState *s)
>>      return irrv;
>>  }
>>
>> +/* caller must ensure the lock has been taken */
>>  /* signal the CPU if an irq is pending */
>>  static void apic_update_irq(APICCommonState *s)
>>  {
>> @@ -380,11 +399,13 @@ static void apic_update_irq(APICCommonState *s)
>>  void apic_poll_irq(DeviceState *d)
>>  {
>>      APICCommonState *s = APIC_COMMON(d);
>> -
>> +    qemu_mutex_lock_cpu(s->cpu_env);
>>      apic_sync_vapic(s, SYNC_FROM_VAPIC);
>>      apic_update_irq(s);
>> +    qemu_mutex_unlock_cpu(s->cpu_env);
>>  }
>>
>> +/* caller must hold the lock */
>>  static void apic_set_irq(APICCommonState *s, int vector_num, int
>> trigger_mode)
>>  {
>>      apic_report_irq_delivered(!get_bit(s->irr, vector_num));
>> @@ -407,6 +428,7 @@ static void apic_set_irq(APICCommonState *s, int
>> vector_num, int trigger_mode)
>>      apic_update_irq(s);
>>  }
>>
>> +/* caller must hold the lock */
>>  static void apic_eoi(APICCommonState *s)
>>  {
>>      int isrv;
>> @@ -415,6 +437,7 @@ static void apic_eoi(APICCommonState *s)
>>          return;
>>      reset_bit(s->isr, isrv);
>>      if (!(s->spurious_vec & APIC_SV_DIRECTED_IO) && get_bit(s->tmr,
>> isrv)) {
>>
>> +        //fix me, need to take an extra lock for the bus?
>>          ioapic_eoi_broadcast(isrv);
>>      }
>>      apic_sync_vapic(s, SYNC_FROM_VAPIC | SYNC_TO_VAPIC);
>> @@ -480,18 +503,23 @@ static void apic_get_delivery_bitmask(uint32_t
>> *deliver_bitmask,
>>  static void apic_startup(APICCommonState *s, int vector_num)
>>  {
>>      s->sipi_vector = vector_num;
>> +    qemu_mutex_lock_cpu(s->cpu_env);
>>      cpu_interrupt(s->cpu_env, CPU_INTERRUPT_SIPI);
>> +    qemu_mutex_unlock_cpu(s->cpu_env);
>>  }
>>
>>  void apic_sipi(DeviceState *d)
>>  {
>>      APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
>> -
>> +    qemu_mutex_lock_cpu(s->cpu_env);
>>      cpu_reset_interrupt(s->cpu_env, CPU_INTERRUPT_SIPI);
>>
>> -    if (!s->wait_for_sipi)
>> +    if (!s->wait_for_sipi) {
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          return;
>> +    }
>>      cpu_x86_load_seg_cache_sipi(s->cpu_env, s->sipi_vector);
>> +    qemu_mutex_unlock_cpu(s->cpu_env);
>>      s->wait_for_sipi = 0;
>>  }
>>
>> @@ -543,6 +571,7 @@ static void apic_deliver(DeviceState *d, uint8_t dest,
>> uint8_t dest_mode,
>>      apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num,
>> trigger_mode);
>>  }
>>
>> +/* Caller must take lock */
>>  int apic_get_interrupt(DeviceState *d)
>>  {
>>      APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
>> @@ -572,6 +601,7 @@ int apic_get_interrupt(DeviceState *d)
>>      return intno;
>>  }
>>
>> +/* Caller must hold the cpu_lock */
>>  int apic_accept_pic_intr(DeviceState *d)
>>  {
>>      APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
>> @@ -589,6 +619,7 @@ int apic_accept_pic_intr(DeviceState *d)
>>      return 0;
>>  }
>>
>> +/* Caller hold lock */
>>  static uint32_t apic_get_current_count(APICCommonState *s)
>>  {
>>      int64_t d;
>> @@ -619,9 +650,10 @@ static void apic_timer_update(APICCommonState *s,
>> int64_t current_time)
>>  static void apic_timer(void *opaque)
>>  {
>>      APICCommonState *s = opaque;
>> -
>> +    qemu_mutex_lock_cpu(s->cpu_env);
>>      apic_local_deliver(s, APIC_LVT_TIMER);
>>      apic_timer_update(s, s->next_time);
>> +    qemu_mutex_unlock_cpu(s->cpu_env);
>>  }
>>
>>  static uint32_t apic_mem_readb(void *opaque, target_phys_addr_t addr)
>> @@ -664,18 +696,22 @@ static uint32_t apic_mem_readl(void *opaque,
>> target_phys_addr_t addr)
>>          val = 0x11 | ((APIC_LVT_NB - 1) << 16); /* version 0x11 */
>>          break;
>>      case 0x08:
>> +        qemu_mutex_lock_cpu(s->cpu_env);
>>          apic_sync_vapic(s, SYNC_FROM_VAPIC);
>>          if (apic_report_tpr_access) {
>>              cpu_report_tpr_access(s->cpu_env, TPR_ACCESS_READ);
>>          }
>>          val = s->tpr;
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x09:
>>          val = apic_get_arb_pri(s);
>>          break;
>>      case 0x0a:
>>          /* ppr */
>> +        qemu_mutex_lock_cpu(s->cpu_env);
>>          val = apic_get_ppr(s);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x0b:
>>          val = 0;
>> @@ -712,7 +748,9 @@ static uint32_t apic_mem_readl(void *opaque,
>> target_phys_addr_t addr)
>>          val = s->initial_count;
>>          break;
>>      case 0x39:
>> +        qemu_mutex_lock_cpu(s->cpu_env);
>>          val = apic_get_current_count(s);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x3e:
>>          val = s->divide_conf;
>> @@ -767,18 +805,22 @@ static void apic_mem_writel(void *opaque,
>> target_phys_addr_t addr, uint32_t val)
>>      case 0x03:
>>          break;
>>      case 0x08:
>> +       qemu_mutex_lock_cpu(s->cpu_env);
>>          if (apic_report_tpr_access) {
>>              cpu_report_tpr_access(s->cpu_env, TPR_ACCESS_WRITE);
>>          }
>>          s->tpr = val;
>>          apic_sync_vapic(s, SYNC_TO_VAPIC);
>>          apic_update_irq(s);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x09:
>>      case 0x0a:
>>          break;
>>      case 0x0b: /* EOI */
>> +        qemu_mutex_lock_cpu(s->cpu_env);
>>          apic_eoi(s);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x0d:
>>          s->log_dest = val >> 24;
>> @@ -787,8 +829,10 @@ static void apic_mem_writel(void *opaque,
>> target_phys_addr_t addr, uint32_t val)
>>          s->dest_mode = val >> 28;
>>          break;
>>      case 0x0f:
>> +        qemu_mutex_lock_cpu(s->cpu_env);
>>          s->spurious_vec = val & 0x1ff;
>>
>>          apic_update_irq(s);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x10 ... 0x17:
>>      case 0x18 ... 0x1f:
>> @@ -807,24 +851,30 @@ static void apic_mem_writel(void *opaque,
>> target_phys_addr_t addr, uint32_t val)
>>      case 0x32 ... 0x37:
>>          {
>>              int n = index - 0x32;
>> +            qemu_mutex_lock_cpu(s->cpu_env);
>>              s->lvt[n] = val;
>>              if (n == APIC_LVT_TIMER)
>>                  apic_timer_update(s, qemu_get_clock_ns(vm_clock));
>> +            qemu_mutex_unlock_cpu(s->cpu_env);
>>          }
>>          break;
>>      case 0x38:
>> +       qemu_mutex_lock_cpu(s->cpu_env);
>>          s->initial_count = val;
>>          s->initial_count_load_time = qemu_get_clock_ns(vm_clock);
>>          apic_timer_update(s, s->initial_count_load_time);
>> +        qemu_mutex_unlock_cpu(s->cpu_env);
>>          break;
>>      case 0x39:
>>          break;
>>      case 0x3e:
>>          {
>>              int v;
>> +            qemu_mutex_lock_cpu(s->cpu_env);
>>              s->divide_conf = val & 0xb;
>>              v = (s->divide_conf & 3) | ((s->divide_conf >> 1) & 4);
>>              s->count_shift = (v + 1) & 7;
>>
>> +            qemu_mutex_unlock_cpu(s->cpu_env);
>>          }
>>          break;
>>      default:
>> diff --git a/hw/pc.c b/hw/pc.c
>> index 4d34a33..5e7350c 100644
>> --- a/hw/pc.c
>> +++ b/hw/pc.c
>> @@ -147,7 +147,7 @@ void cpu_smm_update(CPUX86State *env)
>>          smm_set(!!(env->hflags & HF_SMM_MASK), smm_arg);
>>
>>  }
>>
>> -
>> +/* Called by kvm_cpu_exec(), already with lock protection */
>>  /* IRQ handling */
>>  int cpu_get_pic_interrupt(CPUX86State *env)
>>  {
>> @@ -173,9 +173,15 @@ static void pic_irq_request(void *opaque, int irq,
>> int level)
>>      DPRINTF("pic_irqs: %s irq %d\n", level? "raise" : "lower", irq);
>>      if (env->apic_state) {
>>          while (env) {
>> +            if (!qemu_cpu_is_self(env)) {
>> +                qemu_mutex_lock_cpu(env);
>> +            }
>>              if (apic_accept_pic_intr(env->apic_state)) {
>>                  apic_deliver_pic_intr(env->apic_state, level);
>>              }
>> +            if (!qemu_cpu_is_self(env)) {
>> +                qemu_mutex_unlock_cpu(env);
>> +            }
>>              env = env->next_cpu;
>>          }
>>      } else {
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 9b73ccf..dc9aa54 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -29,6 +29,7 @@
>>  #include "bswap.h"
>>  #include "memory.h"
>>  #include "exec-memory.h"
>> +#include "qemu-thread.h"
>>
>>  /* This check must be after config-host.h is included */
>>  #ifdef CONFIG_EVENTFD
>> @@ -1252,12 +1253,15 @@ int kvm_cpu_exec(CPUArchState *env)
>>               */
>>              qemu_cpu_kick_self();
>>          }
>> -        qemu_mutex_unlock_iothread();
>> +        qemu_mutex_unlock(env->cpu_lock);
>>
>>          run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
>>
>> -        qemu_mutex_lock_iothread();
>> +        qemu_mutex_lock(env->cpu_lock);
>>          kvm_arch_post_run(env, run);
>> +        qemu_mutex_unlock_cpu(env);
>> +
>> +        qemu_mutex_lock_iothread();
>>
>>          kvm_flush_coalesced_mmio_buffer();
>>
>> @@ -1265,6 +1269,8 @@ int kvm_cpu_exec(CPUArchState *env)
>>              if (run_ret == -EINTR || run_ret == -EAGAIN) {
>>                  DPRINTF("io window exit\n");
>>                  ret = EXCP_INTERRUPT;
>> +                qemu_mutex_unlock_iothread();
>> +                qemu_mutex_lock_cpu(env);
>>                  break;
>>              }
>>              fprintf(stderr, "error: kvm run failed %s\n",
>> @@ -1312,6 +1318,9 @@ int kvm_cpu_exec(CPUArchState *env)
>>              ret = kvm_arch_handle_exit(env, run);
>>              break;
>>          }
>> +
>> +        qemu_mutex_unlock_iothread();
>> +        qemu_mutex_lock_cpu(env);
>>      } while (ret == 0);
>
>
> This really confuses me.
>
> Shouldn't we be moving qemu_mutex_unlock_iothread() to the very top of
> kvm_cpu_exec()?
>
Sorry, the code is not arranged well enough yet, but I think the final
style will be like this:
 do {
        qemu_mutex_lock_iothread();
        kvm_flush_coalesced_mmio_buffer();
        qemu_mutex_unlock_iothread();

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            qemu_mutex_lock_iothread();
            kvm_handle_io();
            qemu_mutex_unlock_iothread();
            break;
        case KVM_EXIT_MMIO:
            qemu_mutex_lock_iothread();
            cpu_physical_memory_rw();
            qemu_mutex_unlock_iothread();
            break;
        ................
        }
 } while ();

Right? Can we push the qemu_mutex_lock_iothread out of the do{}?
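
Filled in a bit more, as a sketch only (heavily shortened, signal and
error handling omitted; qemu_mutex_lock_cpu()/unlock_cpu() are assumed
to be the helpers from patch 1/2, the dispatch calls use today's
signatures, and in the real series the vcpu thread may already hold the
per-cpu lock when it enters here):

int kvm_cpu_exec(CPUArchState *env)
{
    struct kvm_run *run = env->kvm_run;
    int ret = 0, run_ret;

    qemu_mutex_lock_cpu(env);                  /* held by default          */
    do {
        kvm_arch_pre_run(env, run);
        qemu_mutex_unlock_cpu(env);            /* dropped only for KVM_RUN */

        run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);

        qemu_mutex_lock_cpu(env);
        kvm_arch_post_run(env, run);

        if (run_ret < 0) {                     /* -EINTR/-EAGAIN etc.      */
            ret = EXCP_INTERRUPT;
            break;
        }

        qemu_mutex_lock_iothread();            /* BQL only around dispatch */
        kvm_flush_coalesced_mmio_buffer();

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            kvm_handle_io(run->io.port,
                          (uint8_t *)run + run->io.data_offset,
                          run->io.direction, run->io.size, run->io.count);
            ret = 0;
            break;
        case KVM_EXIT_MMIO:
            cpu_physical_memory_rw(run->mmio.phys_addr, run->mmio.data,
                                   run->mmio.len, run->mmio.is_write);
            ret = 0;
            break;
        default:
            ret = kvm_arch_handle_exit(env, run);
            break;
        }
        qemu_mutex_unlock_iothread();
    } while (ret == 0);
    qemu_mutex_unlock_cpu(env);

    return ret;
}

So the per-cpu lock is the default lock of the vcpu thread, and the BQL
shrinks to the device dispatch paths until those grow their own locks.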

> There's only a fixed amount of state that gets manipulated too in
> kvm_cpu_exec() so shouldn't we try to specifically describe what state is
> covered by cpu_lock?
>
In the current code, apart from run_on_cpu() writing env->queued_work_last
while flush_queued_work() reads it, the BQL protects the CPUArchState and
the device state from the other threads. As for the CPUArchState, the
emulated registers are local to the vcpu thread, so the only part that
can be accessed by other threads is the irq emulation state.
These fields of env
(interrupt_request, eflags, halted, exit_request, exception_injected)
are all driven by env->apic_state, so we treat them as one atomic
context, i.e. while these fields of env are being updated, we do not
allow apic_state to change.

> What's your strategy for ensuring that all accesses to that state is
> protected by cpu_lock?
>
The cpu_lock is a lock for env->apic_state and all of the related irq
emulation context. The apic_state is the only entrance through which
other threads can affect this env's irq emulation context, so when other
threads want to access this_env->apic_state, they must first acquire the
cpu_lock.
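
For reference, the helpers used above are assumed to be nothing more
than thin wrappers around the per-cpu mutex introduced by patch 1/2,
roughly like this (sketch; the real patch may differ, with env->cpu_lock
presumably allocated when the vcpu is created):

static inline void qemu_mutex_lock_cpu(CPUArchState *env)
{
    qemu_mutex_lock(env->cpu_lock);
}

static inline void qemu_mutex_unlock_cpu(CPUArchState *env)
{
    qemu_mutex_unlock(env->cpu_lock);
}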

Thanks and regards,
pingfan
> Regards,
>
> Anthony Liguori
>
>>
>>      if (ret < 0) {
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock
  2012-06-23  9:32     ` liu ping fan
@ 2012-06-24 14:09       ` liu ping fan
  0 siblings, 0 replies; 15+ messages in thread
From: liu ping fan @ 2012-06-24 14:09 UTC (permalink / raw
  To: Anthony Liguori; +Cc: Jan Kiszka, qemu-devel

With this patch, the iothread lock will only be held around
1. kvm_flush_coalesced_mmio_buffer();
2. kvm_handle_io();
3. cpu_physical_memory_rw()

And I think the cpu_lock is something we can treat as the apic_state
device's lock, i.e. the first device that we free from the iothread
lock.

Regards,
pingfan

On Sat, Jun 23, 2012 at 5:32 PM, liu ping fan <qemulist@gmail.com> wrote:
> [full quote of the previous message trimmed; see above]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-24 14:10 UTC | newest]

Thread overview: 15+ messages
     [not found] <1340290158-11036-1-git-send-email-qemulist@gmail.com>
2012-06-21 15:23 ` [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread Jan Kiszka
2012-06-22 10:24   ` liu ping fan
2012-06-22 10:37     ` Jan Kiszka
2012-06-22 20:11       ` Anthony Liguori
2012-06-22 21:14         ` Jan Kiszka
2012-06-22 21:44           ` Anthony Liguori
2012-06-22 22:27             ` Jan Kiszka
2012-06-22 22:56               ` Anthony Liguori
2012-06-23  9:10                 ` Jan Kiszka
     [not found] ` <1340290158-11036-2-git-send-email-qemulist@gmail.com>
2012-06-22 20:01   ` [Qemu-devel] [PATCH 1/2] CPUArchState: introduce per-cpu lock Anthony Liguori
     [not found] ` <1340290158-11036-3-git-send-email-qemulist@gmail.com>
2012-06-21 15:02   ` [Qemu-devel] [PATCH 2/2] kvm: use per-cpu lock to free vcpu thread out of the big lock Jan Kiszka
2012-06-22 20:06   ` Anthony Liguori
2012-06-23  9:32     ` liu ping fan
2012-06-24 14:09       ` liu ping fan
2012-06-21 15:06 [Qemu-devel] [RFC] use little granularity lock to substitue qemu_mutex_lock_iothread Liu Ping Fan
