From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jann Horn
Date: Fri, 21 May 2021 23:33:14 +0200
Subject: Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
To: Peter Oskolkov
Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, kernel list,
        Linux API, Paul Turner, Ben Segall, Peter Oskolkov,
        Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome
In-Reply-To: <20210520183614.1227046-5-posk@google.com>
References: <20210520183614.1227046-1-posk@google.com>
        <20210520183614.1227046-5-posk@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Thu, May 20, 2021 at 8:36 PM Peter Oskolkov wrote:
> Implement version 1 of core UMCG API (wait/wake/swap).
>
> As has been outlined in
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
> efficient and synchronous on-CPU context switching is key
> to enabling two broad use cases: in-process M:N userspace scheduling
> and fast X-process RPCs for security wrappers.
>
> High-level design considerations/approaches used:
> - wait & wake can race with each other;
> - offload as much work as possible to libumcg in tools/lib/umcg,
>   specifically:
>   - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
>     the userspace (libumcg);
>   - retries are offloaded to the userspace.
[...]
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
[...]
> +static int get_state(struct umcg_task __user *ut, u32 *state)
> +{
> +        return get_user(*state, (u32 __user *)ut);

Why the cast instead of get_user(*state, &ut->state)? And maybe do
this inline instead of adding a separate helper for it?

> +}
> +
> +static int put_state(struct umcg_task __user *ut, u32 state)
> +{
> +        return put_user(state, (u32 __user *)ut);
> +}
[...]
> +static int do_context_switch(struct task_struct *next)
> +{
> +        struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +        /*
> +         * It is important to set_current_state(TASK_INTERRUPTIBLE) before
> +         * waking @next, as @next may immediately try to wake current back
> +         * (e.g. current is a server, @next is a worker that immediately
> +         * blocks or waits), and this next wakeup must not be lost.
> +         */
> +        set_current_state(TASK_INTERRUPTIBLE);
> +
> +        WRITE_ONCE(utd->in_wait, true);
> +
> +        if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
> +                return -EAGAIN;
> +
> +        freezable_schedule();
> +
> +        WRITE_ONCE(utd->in_wait, false);
> +
> +        if (signal_pending(current))
> +                return -EINTR;

What is this -EINTR supposed to tell userspace? We can't tell whether
we were woken up by a signal or by do_context_switch() or the
umcg_wake syscall, right? If we're woken by another thread calling
do_context_switch() and then get a signal immediately afterwards,
can't that lead to a lost wakeup?

I don't know whether trying to track the origin of the wakeup is a
workable approach here; you might have to instead do cmpxchg() on the
->in_wait field and give it three states (default, waiting-for-wake
and successfully-woken)? (Rough sketch at the bottom of this mail.)

Or you give up on trying to figure out who woke you, just always
return zero, and let userspace deal with figuring out whether the
wakeup was real or not. I don't know whether that'd be acceptable.

> +        return 0;
> +}
> +
> +static int do_wait(void)
> +{
> +        struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +        if (!utd)
> +                return -EINVAL;
> +
> +        WRITE_ONCE(utd->in_wait, true);
> +
> +        set_current_state(TASK_INTERRUPTIBLE);
> +        freezable_schedule();
> +
> +        WRITE_ONCE(utd->in_wait, false);
> +
> +        if (signal_pending(current))
> +                return -EINTR;
> +
> +        return 0;
> }
>
> /**
> @@ -90,7 +228,23 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
> SYSCALL_DEFINE2(umcg_wait, u32, flags,
>                 const struct __kernel_timespec __user *, timeout)
> {
> -        return -ENOSYS;
> +        struct umcg_task_data *utd;
> +
> +        if (flags)
> +                return -EINVAL;
> +        if (timeout)
> +                return -EOPNOTSUPP;
> +
> +        rcu_read_lock();
> +        utd = rcu_dereference(current->umcg_task_data);
> +        if (!utd) {
> +                rcu_read_unlock();
> +                return -EINVAL;
> +        }
> +
> +        rcu_read_unlock();

rcu_access_pointer() instead of the locking and unlocking?
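I.e. something like this (untested) - do_wait() re-checks the same
pointer for NULL anyway, so the read-side critical section doesn't
buy you anything here:

        if (flags)
                return -EINVAL;
        if (timeout)
                return -EOPNOTSUPP;

        if (!rcu_access_pointer(current->umcg_task_data))
                return -EINVAL;

        return do_wait();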
> +        return do_wait();
> }
>
> /**
> @@ -110,7 +264,39 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
> */
> SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> {
> -        return -ENOSYS;
> +        struct umcg_task_data *next_utd;
> +        struct task_struct *next;
> +        int ret = -EINVAL;
> +
> +        if (!next_tid)
> +                return -EINVAL;
> +        if (flags)
> +                return -EINVAL;
> +
> +        next = find_get_task_by_vpid(next_tid);
> +        if (!next)
> +                return -ESRCH;
> +        rcu_read_lock();

Wouldn't it be more efficient to replace the last 4 lines with the
following?

        rcu_read_lock();
        next = find_task_by_vpid(next_tid);
        if (!next) {
                ret = -ESRCH;
                goto out;
        }

Then you don't need to use refcounting here...

> +        next_utd = rcu_dereference(next->umcg_task_data);
> +        if (!next_utd)
> +                goto out;
> +
> +        if (!READ_ONCE(next_utd->in_wait)) {
> +                ret = -EAGAIN;
> +                goto out;
> +        }
> +
> +        ret = wake_up_process(next);
> +        put_task_struct(next);

... and you'd be able to drop this put_task_struct(), too.

> +        if (ret)
> +                ret = 0;
> +        else
> +                ret = -EAGAIN;
> +
> +out:
> +        rcu_read_unlock();
> +        return ret;
> }
>
> /**
> @@ -139,5 +325,44 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
>                 const struct __kernel_timespec __user *, timeout)
> {
> -        return -ENOSYS;
> +        struct umcg_task_data *curr_utd;
> +        struct umcg_task_data *next_utd;
> +        struct task_struct *next;
> +        int ret = -EINVAL;
> +
> +        rcu_read_lock();
> +        curr_utd = rcu_dereference(current->umcg_task_data);
> +
> +        if (!next_tid || wake_flags || wait_flags || !curr_utd)
> +                goto out;
> +
> +        if (timeout) {
> +                ret = -EOPNOTSUPP;
> +                goto out;
> +        }
> +
> +        next = find_get_task_by_vpid(next_tid);
> +        if (!next) {
> +                ret = -ESRCH;
> +                goto out;
> +        }

There isn't any type of access check here, right? Any task can wake
up any other task? That feels a bit weird to me - and if you want to
keep it as-is, it should probably at least be documented that any
task on the system can send you spurious wakeups if you opt in to
umcg. In contrast, shared futexes can avoid this because they get
their access control implicitly from the VMA.

> +        next_utd = rcu_dereference(next->umcg_task_data);
> +        if (!next_utd) {
> +                ret = -EINVAL;
> +                goto out;
> +        }
> +
> +        if (!READ_ONCE(next_utd->in_wait)) {
> +                ret = -EAGAIN;
> +                goto out;
> +        }
> +
> +        rcu_read_unlock();
> +
> +        return do_context_switch(next);

It looks like the refcount of the target task is incremented but
never decremented, so this probably currently leaks references?

I'd maybe try to split do_context_switch() into two parts, one that
does the non-blocking waking of another task and one that does the
sleeping (sketch below). Then you can avoid taking a reference on the
task as above - this is supposed to be a really hot fastpath, so it's
a good idea to avoid atomic instructions if possible, right?

> +out:
> +        rcu_read_unlock();
> +        return ret;
> }
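For the do_context_switch() split I mentioned above, I'm thinking of
something roughly like this - completely untested, names made up,
error handling kept as in your version:

static int do_wake_next(struct task_struct *next)
{
        struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);

        /*
         * Same ordering requirement as in do_context_switch(): mark
         * current as waiting *before* waking @next, so that @next's
         * wakeup of current can't get lost.
         */
        set_current_state(TASK_INTERRUPTIBLE);
        WRITE_ONCE(utd->in_wait, true);

        if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU)) {
                WRITE_ONCE(utd->in_wait, false);
                __set_current_state(TASK_RUNNING);
                return -EAGAIN;
        }

        return 0;
}

static int do_sleep(void)
{
        struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);

        freezable_schedule();

        WRITE_ONCE(utd->in_wait, false);

        if (signal_pending(current))
                return -EINTR;

        return 0;
}

do_wake_next() never blocks, so the tail of umcg_swap() can stay
inside a single RCU read-side critical section with an unreferenced
task pointer:

        next = find_task_by_vpid(next_tid);
        [... next_utd / in_wait checks as before ...]

        ret = do_wake_next(next);
        rcu_read_unlock();
        if (ret)
                return ret;

        return do_sleep();

No get/put on the task_struct anywhere in the fastpath.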
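And for the three-state ->in_wait idea further up, I mean something
like this - again completely untested, memory ordering hand-waved,
and the field would become an atomic_t (say ->wait_state) instead of
a bool:

enum umcg_wait_state {
        UMCG_NOT_WAITING = 0,
        UMCG_WAITING,
        UMCG_WOKEN,
};

/* wait side */
static int do_wait(void)
{
        struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);

        if (!utd)
                return -EINVAL;

        atomic_set(&utd->wait_state, UMCG_WAITING);

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);

                if (atomic_read(&utd->wait_state) == UMCG_WOKEN)
                        break;
                if (signal_pending(current))
                        break;

                freezable_schedule();
        }
        __set_current_state(TASK_RUNNING);

        /*
         * If a waker already claimed the WAITING->WOKEN transition,
         * report success even if a signal arrived as well - that
         * wakeup must not be lost.
         */
        if (atomic_xchg(&utd->wait_state, UMCG_NOT_WAITING) == UMCG_WOKEN)
                return 0;

        return signal_pending(current) ? -EINTR : 0;
}

/* wake side: only a task actually in WAITING can be claimed */
        if (atomic_cmpxchg(&next_utd->wait_state, UMCG_WAITING,
                           UMCG_WOKEN) != UMCG_WAITING)
                return -EAGAIN;
        wake_up_process(next);

That way a waker that loses the race gets -EAGAIN instead of its
wakeup silently disappearing into a thread that's about to return
-EINTR.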